Re: JSH: The will to lie
- From: David Bernier <david250@xxxxxxxxxxxx>
- Date: Mon, 04 Sep 2006 10:50:50 -0400
Jesse F. Hughes wrote:
David Bernier <david250@xxxxxxxxxxxx> writes:[snip]
David Bernier wrote:
William Hughes wrote:
I think it would be interesting to see by how much the sequence:
5-3*floor(5/3), ... 1000000007-3*floor(1000000007/3)
can be compressed by Winzip, Gzip, bzip2, etc.
In the primes mod 3 sequence, for primes>3 and < 10^7, we get a finite
subsequence of 664,577 numbers in {1,2}.
I used a simplistic encoding 1 -> a byte '1', 2 -> a byte '2'.
A 664,577-byte file didn't even compress by a factor of 8 using the
Winzip default. I think I got 84% reduction with the PPmD option;
but that leaves 16%, or 0.16 > 1/8.
The obvious thing is to code in binary, and then compress. But I've
put that on hold. Another possibility is to develop a custom
compressor based on all of Tim Peters pretty considerable
computations and comments. It's not clear to me how much
pseudo-entropy is lost for subsequences of length ~= pi(10^9) using
the correlations he found. I think a realistic probabilistic model
for these correlations could be based on Markov chains, but I don't
know how to compute entropies of Markov chains or processes.
Isn't the obvious compression algorithm for this sequence the one you
give above?
The sequence is given by a simple algorithm. All you need to know is
how many times you want to repeat that algorithm to generate the
initial segment. So it seems to me that you can compress a sequence
of length n to one of log_2(n) + C or so (where C accounts for the
code that calculates the next prime and then mods by 3).
Right. What I would like to know is how well
general purpose compressors (that don't try to find
correlations with sequences of primes) can do.
In the best of worlds, we would like to pack blocks of 8 consecutive
values of James's sequence into one byte, which is 8 bits. After
that, we can compress this packed file using any number of
compression programs. What I did amounted to getting a byte
which prints to a '1' or a '2'. In ASCII, a character '1' is
49 in decimal, 31 in hexadecimal and 00110001 as a block of 8 bits.
Similarly, a character of '2' is 50 decimal, 32 hex. or 00110010 as a
block of 8 bits. So any compressor looking at my 664,577-byte file
will see bunches of '00110001's and '00110010's , where each
bunch represents one term of James's sequence. I'd expect a file
with 100,000,000 terms of James's sequence to be compressed
to a greater extent. But I am making some progress towards
packing 8 terms in one byte. Maybe PPM or whatever can do
better if it's all binary:
http://en.wikipedia.org/wiki/Prediction_by_partial_matching
Or am I just mistaken on what you're doing. I assume you're looking
at the Chaitin/Kolmogorov/Solovay complexity of the sequence to see
how random it is in their sense. But since the sequence is defined by
a simple algorithm, the answer is "not at all".
(see above)
David Bernier
.
- Follow-Ups:
- Re: JSH: The will to lie / compression
- From: James Waldby
- Re: JSH: The will to lie / compression
- References:
- Re: JSH: The will to lie
- From: Jesse F. Hughes
- Re: JSH: The will to lie
- Prev by Date: Re: Am I a crank?
- Next by Date: Euqlidian Geometry
- Previous by thread: Re: JSH: The will to lie
- Next by thread: Re: JSH: The will to lie / compression
- Index(es):
Relevant Pages
|