Re: Word Frequency Distributions

From: Aleks Jakulin (a_jakulin_at_@hotmail.com)
Date: 11/16/04


Date: Tue, 16 Nov 2004 13:01:35 +0100

Ross Clement wrote:
> Looking at histograms of the data for individual words, most of them
> are skewed. Most of them show a longer high tail, but a few show a
> longer short tail. I only looked at the 10 most frequent words, but
> the pattern of non-normality seemed clear.

There has been considerable work on statistical word-frequency models
in text mining. Popular choices include the Poisson, negative binomial,
and gamma distributions, as well as mixtures of Poissons. No single
distribution fits every word; different words are better described by
different models:
http://www1.cs.columbia.edu/~jaxin/nlpmeetings/2004-02-12-jansche.html
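
As a rough illustration of how one might compare two of these models
for a single word, here is a minimal Python sketch. It is not taken
from the talk linked above. It assumes `counts` holds the number of
occurrences of one word in each document (the sample values are made
up), fits the Poisson by its mean and the negative binomial by the
method of moments rather than full maximum likelihood, and compares
log-likelihoods. The function names are hypothetical.

```python
import math


def poisson_loglik(counts):
    """Log-likelihood of the counts under a Poisson fitted by its mean.

    Assumes at least one count is non-zero, so the rate is positive.
    """
    lam = sum(counts) / len(counts)
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts):
    """Log-likelihood under a negative binomial fitted by method of moments.

    Returns None when the sample variance does not exceed the mean,
    because the moment estimates are undefined without overdispersion.
    """
    n = len(counts)
    mean = sum(counts) / n
    var = sum((k - mean) ** 2 for k in counts) / n
    if var <= mean:
        return None
    r = mean * mean / (var - mean)  # dispersion ("size") parameter
    p = mean / var                  # success probability
    return sum(
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1 - p)
        for k in counts
    )


# Illustrative, made-up per-document counts for one "bursty" word.
counts = [0, 0, 1, 0, 7, 0, 2, 0, 0, 12, 1, 0, 0, 3, 0, 0]
print("Poisson log-likelihood:          ", poisson_loglik(counts))
print("Negative binomial log-likelihood:", negbin_loglik(counts))
```

A higher log-likelihood suggests a better fit, although the negative
binomial has one more parameter, so a fair comparison would penalise
it (for example with AIC) or use held-out documents.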

Another approach is to use the non-parametric Good-Turing estimates:
http://kodiak.ucsd.edu/alon/pub.html
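
For readers unfamiliar with the idea, the basic (unsmoothed)
Good-Turing estimate can be sketched as follows. This is a minimal
illustration and not the estimator from the papers linked above: it
replaces a raw count r by r* = (r + 1) * N_{r+1} / N_r, where N_r is
the number of distinct words seen exactly r times, and assigns total
probability N_1 / N to unseen words. The function name and the
fallback to the raw count when N_{r+1} = 0 are assumptions made here;
practical variants smooth the N_r values instead.

```python
from collections import Counter


def good_turing(tokens):
    """Return (probability mass for unseen words, adjusted count per word)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # freq_of_freq[r] = number of distinct words that occur exactly r times
    freq_of_freq = Counter(counts.values())
    p_unseen = freq_of_freq[1] / total

    adjusted = {}
    for word, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq[r + 1]
        # Assumption: keep the raw count when no word occurs r + 1 times.
        adjusted[word] = (r + 1) * n_r1 / n_r if n_r1 > 0 else float(r)
    return p_unseen, adjusted


p_unseen, adjusted = good_turing("a a a b b c d e".split())
print(p_unseen)
print(adjusted)
```

Dividing an adjusted count by the total number of tokens gives the
corresponding probability estimate for that word.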

-- 
mag. Aleks Jakulin
http://www.ailab.si/aleks/
Artificial Intelligence Laboratory,
Faculty of Computer and Information Science,
University of Ljubljana, Slovenia.