Re: Anyone wanna help with a compression routine (new type)



On 7 Dec, 12:03, WM <mueck...@xxxxxxxxxxxxxxxxx> wrote:
But I did not find a final answer to my question whether random
strings of every finite length n exist.

Pi is an infinite pseudorandom string. It will pass the standard tests
of randomness, but we know it is only pseudorandom. In fact the
Kolmogorov content of Pi is low: a short program, in effect just "Pi",
generates it. This provides a lead-in to my comments on language.
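
To illustrate how little it takes to describe Pi, here is a minimal sketch of a digit generator. It uses Gibbons' unbounded spigot algorithm, which is one possible choice for illustration and not something taken from the discussion above; the names pi_digits and first_digits are hypothetical.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of Pi one at a time (Gibbons' spigot).

    Only integer arithmetic is used, so the digits are exact.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, t, k, n, l = (
                10 * q,
                10 * (r - n * t),
                t,
                k,
                (10 * (3 * q + r)) // t - 10 * n,
                l,
            )
        else:
            # Not enough information yet: consume one more series term.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


def first_digits(count):
    """Return the first `count` digits of Pi as a list of ints."""
    return list(islice(pi_digits(), count))
```

The program text has a fixed size however many digits are requested, which is the sense in which the Kolmogorov content of Pi is low.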

Perhaps that is my fault. Please correct me if I am in error. But as
far as I understood, we know: For every language, a fraction of at least
1 - 2^-d of all bit sequences of length n has a Kolmogorov complexity of
at least n - d. This yields the result (for d = 1) that at least half
of the strings have a Kolmogorov complexity of at least n - 1, and
(for d = 0) that at least the empty set of strings has a complexity
of at least n - 0 = n. But it does not answer the question whether
there is at least one string for every n which has complexity n. This
is a problem because no string has a complexity of more than n (which
can be proven by simply programming its bit sequence in a Turing
machine).
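
The counting argument behind that bound can be made concrete. This is a minimal sketch, assuming descriptions are plain binary strings and that each description produces at most one string; the function names are hypothetical. There are only 2^m - 1 binary descriptions shorter than m bits, so at least 2^n - (2^(n-d) - 1) of the 2^n strings of length n have complexity at least n - d. For d = 0 that count comes out as 1, not 0.

```python
def descriptions_shorter_than(m):
    """Count binary strings of length 0 .. m-1, which is 2**m - 1."""
    return 2 ** max(m, 0) - 1


def min_incompressible(n, d):
    """Lower bound on how many n-bit strings have complexity >= n - d.

    Each description shorter than n - d bits can account for at most
    one n-bit string, so the remaining strings cannot be compressed
    below n - d bits.
    """
    return 2 ** n - descriptions_shorter_than(n - d)


if __name__ == "__main__":
    n = 8
    for d in (0, 1, 2, 3):
        count = min_incompressible(n, d)
        print(f"d={d}: at least {count} of {2 ** n} strings")
```

Under these assumptions the d = 0 case says that for every n at least one string of length n has complexity at least n.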

We can treat language in a very naive way. We can assign a number to
every word and write the language down as a series of numbers. If we
do this we find that the minimum representation is a string of

Log(2) (N!/(a!b!....)) bits

where N is the total number of words and a, b, ... are the numbers of
occurrences of each distinct word.

However, language makes sense in the AI sense. We know that the
probability of the next word being c is not simply c/N but is some
value depending on the surrounding words. The actual Kolmogorov value
of Natural Language is unknown. Anecdotal evidence suggests it is
something like 1 bit per character, or about 6 bits per word, as this
is the value that humans get down to with "guessing".
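
The naive count-only figure Log(2) (N!/(a!b!....)) can be evaluated directly for any word list. A minimal sketch, with the hypothetical name multinomial_bits and with log-gamma used in place of explicit factorials so that large N does not overflow:

```python
from collections import Counter
from math import lgamma, log


def multinomial_bits(tokens):
    """Return log2(N! / (a! b! ...)) for the token counts a, b, ...

    lgamma(x + 1) equals ln(x!), so the result is a floating-point
    approximation of the exact value.
    """
    counts = Counter(tokens)
    total = len(tokens)
    ln_value = lgamma(total + 1) - sum(lgamma(c + 1) for c in counts.values())
    return ln_value / log(2)


if __name__ == "__main__":
    text = "the cat sat on the mat the cat sat on the mat".split()
    print(f"{multinomial_bits(text):.2f} bits for {len(text)} words")
```

This figure depends only on how often each word occurs, not on word order, which is why it ignores the dependence on surrounding words.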

How would we compress? Well, a "gedanken" algorithm is the following.
At each point in the text we order the words by probability and
thereby relabel each word by its rank. We then have:-

Log(2) (N!/(p1!p2!....)) < Log(2) (N!/(a!b!....)) bits
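
The relabelling step can be sketched on a toy text. The assumptions here are loud ones: a static bigram table built from the text itself stands in for "the probabilities at each point", the cost of transmitting that table is ignored, p1, p2, ... are read as the counts of each rank label, and the inequality is not guaranteed for every text. The names rank_relabel and multinomial_bits are hypothetical.

```python
from collections import Counter, defaultdict
from math import lgamma, log


def multinomial_bits(symbols):
    """Return log2(N! / (a! b! ...)) for the symbol counts a, b, ..."""
    counts = Counter(symbols)
    ln_value = lgamma(len(symbols) + 1) - sum(
        lgamma(c + 1) for c in counts.values()
    )
    return ln_value / log(2)


def rank_relabel(words):
    """Replace each word by its rank among the words seen after the
    previous word (rank 0 = most frequent follower)."""
    followers = defaultdict(Counter)
    prev = None
    for word in words:
        followers[prev][word] += 1
        prev = word

    ranks = []
    prev = None
    for word in words:
        # Most frequent follower first; ties broken alphabetically.
        ordered = sorted(
            followers[prev], key=lambda w: (-followers[prev][w], w)
        )
        ranks.append(ordered.index(word))
        prev = word
    return ranks


if __name__ == "__main__":
    text = "the cat sat on the mat the cat sat on the mat".split()
    ranks = rank_relabel(text)
    print("ranks:", ranks)
    print(f"words: {multinomial_bits(text):.2f} bits")
    print(f"ranks: {multinomial_bits(ranks):.2f} bits")
```

On this toy text almost every word becomes rank 0, so the count-only figure for the rank sequence is much smaller than for the original words.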

The problem is an important one, as the compression achieved by an
algorithm tells us how well AI is "understanding" a language. Google
effects translation by having a language pair. We could also attempt to
translate using a compression algorithm. If every language gave us 6
bits per word, translation would be a trivial exercise, as would
textual analysis. The algorithm, of course, gives us the probabilities
of each word.

In fact translation would be a relaxation exercise. You put in the
translation that maximizes probability, and you converge iteratively.
Google in fact is after speech, and the only approach to speech that I
know of involves relaxation. "They are working day and knight" -
"knight" sounds like "night". In fact the difference between human and
machine appreciation of speech lies in "relaxation and meaning". The
neural networks that recognize phonemes are as good as, if not better
than, human recognition. Of course translating "dia y caballero"
represents the monolingual (unpaired) model.

Of course you are relying all the time on the fact that a lossless
compression exists that needs fewer than Log(2) (N!/(a!b!....)) bits.
If it did not, then it would not be possible to understand speech.

I feel I should end with one amusing fact. In translating "AI" into
Arabic (the circumstances clearly meant "Artificial Intelligence")
Google gave "Amnesty International". "Amnesty International" is simply
"c/N" as above, the value of "c" reflecting the less than perfect
human rights record of the Arab world. As I said, Google takes pairs of
human translated texts. An alternative would be compressed monolingual
texts. Artificial Intelligence simply does not occur in the language
pair texts.


- Ian Parker
