Re: Word count of minimum vocabulary





Richard Wordingham wrote:

Mok-Kong Shen wrote:


Compression schemes on ASCII texts could indeed achieve such
efficiencies. However, how many ASCII characters does an
average word in a common text have?


I was drawing an analogy. The ASCII character set uses 7 bits per
character. A Huffman encoding of English text uses an average of about
5 bits per character. Likewise, your 1024-word set needs 10 bits per
word with a fixed-length code; a Huffman encoding of the words would
need fewer on average. Incidentally, have you worked out how to handle
punctuation?
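To make the analogy concrete, here is a rough sketch of word-level Huffman coding over a 1024-word vocabulary. The Zipf-like frequencies (weight 1/rank) are purely an assumption for illustration, not measured data, and the function names are hypothetical.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """Return the Huffman code length in bits for each symbol weight."""
    if len(weights) == 1:
        return [1]
    tie = count()  # tie-breaker so the heap never has to compare lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them
        # moves one level deeper, i.e. gains one bit.
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), s1 + s2))
    return lengths


def average_bits(weights):
    """Expected code length in bits per symbol under the given weights."""
    lengths = huffman_code_lengths(weights)
    return sum(w * l for w, l in zip(weights, lengths)) / sum(weights)


VOCAB = 1024
uniform = [1.0] * VOCAB                              # every word equally likely
zipf = [1.0 / rank for rank in range(1, VOCAB + 1)]  # assumed skewed frequencies

print("uniform:", average_bits(uniform), "bits per word")
print("zipf:   ", average_bits(zipf), "bits per word")
```

With equally likely words the Huffman code cannot beat the fixed 10 bits; any gain comes entirely from the skew in word frequencies.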

Maybe I gravely misunderstood you. But according to a post
by Lee, an "average" word has 5 characters, so it would
follow that character-level Huffman coding generates 5*5=25
bits per word on average, which is far more than 10 bits, right?
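The comparison can be written out as a small calculation. The constants are the figures quoted in this thread (5 characters per word, 5 Huffman bits per character, a 1024-word vocabulary); the variable names are only illustrative.

```python
import math

AVG_CHARS_PER_WORD = 5     # figure attributed to Lee's post
HUFFMAN_BITS_PER_CHAR = 5  # figure from the quoted analogy
VOCAB_SIZE = 1024          # size of the proposed minimum vocabulary

# Coding each character separately costs chars * bits-per-char.
char_level_bits = AVG_CHARS_PER_WORD * HUFFMAN_BITS_PER_CHAR

# Coding each word as one index into the vocabulary costs log2(size).
word_level_bits = math.ceil(math.log2(VOCAB_SIZE))

print(char_level_bits, "bits per word with character-level coding")
print(word_level_bits, "bits per word with a fixed-length word index")
```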

I suppose punctuation signs must be treated as independent
entities in a system that treats words as units. (Are there
better ways?) On the other hand, most spaces in a text
need not be encoded in that system: one can adopt the
convention that a word is always followed by a space
unless the next unit is a punctuation sign.
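A minimal sketch of that convention, assuming a small fixed set of punctuation signs and simple alphabetic words. As a further assumption, a punctuation sign is also followed by a space unless another punctuation sign comes next. The function names are hypothetical.

```python
import re

PUNCTUATION = set(".,;:!?")  # assumed set of punctuation units


def to_units(text):
    """Split text into word and punctuation units, dropping all spaces."""
    return re.findall(r"[A-Za-z']+|[.,;:!?]", text)


def from_units(units):
    """Rebuild the text, re-inserting the spaces implied by the convention."""
    out = []
    for i, unit in enumerate(units):
        out.append(unit)
        is_last = i == len(units) - 1
        # A unit is followed by a space unless the next unit is punctuation.
        if not is_last and units[i + 1] not in PUNCTUATION:
            out.append(" ")
    return "".join(out)


sample = "Hello, world. How are you?"
units = to_units(sample)
print(units)
print(from_units(units) == sample)
```

Text with irregular spacing (double spaces, a space before a comma) would not survive the round trip, so such cases would need an explicit space unit.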

There's an interesting discussion at
http://www.stanford.edu/class/cs276a/projects/reports/dalmassi-sammysy.html

Many thanks for the valuable link. I'll look at it sometime later.

M. K. Shen
