Re: Word Frequency Distributions
From: Aleks Jakulin (a_jakulin_at_@hotmail.com)
Date: 11/16/04
- Next message: Tommi: "variable selection in logistic regression"
- Previous message: Aleks Jakulin: "Re: Independent random variables versus non correlated variables"
- In reply to: Ross Clement: "Word Frequency Distributions"
Date: Tue, 16 Nov 2004 13:01:35 +0100
Ross Clement wrote:
> Looking at histograms of the data for individual words, most of them
> are skewed. Most of them show a longer high tail, but a few show a
> longer short tail. I only looked at the 10 most frequent words, but
> the pattern of non-normality seemed clear.
There has been considerable work on statistical word-frequency models
in text mining. Popular choices include the Poisson, negative binomial,
and gamma distributions, as well as mixtures of Poissons. No single
distribution fits every word; some words are better described by one,
others by another:
http://www1.cs.columbia.edu/~jaxin/nlpmeetings/2004-02-12-jansche.html
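As a rough illustration of why one distribution does not fit every word, the
sketch below compares a Poisson fit with a negative binomial fit for a single
word's per-document counts. It is only a sketch under stated assumptions: the
function names are made up for this example, the negative binomial parameters
come from the method of moments rather than maximum likelihood, and the counts
must have a positive mean.

```python
import math
from statistics import mean, pvariance


def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under Poisson(lam), lam > 0."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts, r, p):
    """Log-likelihood under a negative binomial with
    pmf(k) = Gamma(k + r) / (k! * Gamma(r)) * p**r * (1 - p)**k."""
    return sum(
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1 - p)
        for k in counts
    )


def fit_by_moments(counts):
    """Hypothetical helper: fit both models to one word's counts.

    The Poisson rate is the sample mean. The negative binomial is only
    fitted when the data are overdispersed (variance > mean), using
    mean = r(1-p)/p and variance = r(1-p)/p**2.
    """
    m = mean(counts)
    v = pvariance(counts)
    result = {
        "mean": m,
        "variance": v,
        "poisson_loglik": poisson_loglik(counts, m),
    }
    if v > m:
        p = m / v
        r = m * m / (v - m)
        result["negbin_r"] = r
        result["negbin_p"] = p
        result["negbin_loglik"] = negbin_loglik(counts, r, p)
    return result


if __name__ == "__main__":
    # Made-up counts of one word across 20 documents (illustrative only).
    counts = [0, 0, 0, 0, 1, 0, 0, 7, 0, 0, 2, 0, 12, 0, 0, 1, 0, 0, 0, 5]
    for name, value in fit_by_moments(counts).items():
        print(f"{name}: {value:.4f}")
```

A variance well above the mean, as in skewed histograms with a long upper
tail, is a sign that a plain Poisson is too narrow and that a negative
binomial or a Poisson mixture may describe the word better.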
Another approach is to use the non-parametric Good-Turing estimates:
http://kodiak.ucsd.edu/alon/pub.html
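For reference, the basic Good-Turing idea can be sketched in a few lines. If
N_r is the number of distinct words seen exactly r times and N is the total
number of tokens, the adjusted count is r* = (r + 1) * N_{r+1} / N_r, and the
probability mass reserved for unseen words is N_1 / N. The code below is a
minimal sketch of that unsmoothed estimator only, not the method from the
papers linked above; the function name and the fallback used when N_{r+1} is
zero are assumptions made for this example.

```python
from collections import Counter


def good_turing(tokens):
    """Unsmoothed Good-Turing estimates for a list of tokens.

    Returns (p_unseen, adjusted), where p_unseen = N_1 / N and
    adjusted maps each observed count r to r* = (r + 1) * N_{r+1} / N_r.
    """
    word_counts = Counter(tokens)
    n_total = sum(word_counts.values())
    if n_total == 0:
        raise ValueError("need at least one token")

    # freq_of_freq[r] = number of distinct words that occur exactly r times
    freq_of_freq = Counter(word_counts.values())

    p_unseen = freq_of_freq.get(1, 0) / n_total

    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        if n_next > 0:
            adjusted[r] = (r + 1) * n_next / n_r
        else:
            # Assumption for this sketch: with no words seen r + 1 times,
            # keep the raw count. Practical versions smooth N_r instead.
            adjusted[r] = float(r)
    return p_unseen, adjusted


if __name__ == "__main__":
    p_unseen, adjusted = good_turing("a a a b b c c d e f".split())
    print("mass for unseen words:", p_unseen)
    for r in sorted(adjusted):
        print(f"r = {r}: r* = {adjusted[r]:.4f}")
```

The raw N_r values become sparse and noisy for large r, which is why
practical variants smooth them before applying the formula.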
--
mag. Aleks Jakulin
http://www.ailab.si/aleks/
Artificial Intelligence Laboratory,
Faculty of Computer and Information Science,
University of Ljubljana, Slovenia.