Re: clustering data by probability distribution
- From: illywhacker <illywacker@xxxxxxxxx>
- Date: Thu, 5 Feb 2009 00:26:10 -0800 (PST)
On Feb 4, 10:56 pm, RichUlrich <rich.ulr...@xxxxxxxxxxx> wrote:
On Wed, 4 Feb 2009 02:20:35 -0800 (PST), illywhacker
<illywac...@xxxxxxxxx> wrote:
On Feb 3, 9:41 pm, RichUlrich <rich.ulr...@xxxxxxxxxxx> wrote:
On Mon, 2 Feb 2009 03:16:31 -0800 (PST), Morfys <morfyss...@xxxxxxxxx>
wrote:
I have a data set that consists of several thousand sub-data sets.
Each sub-data set likely follows some probability distribution (which
is not known a priori).
Is there a way to cluster these sub-data sets by their probability
distribution?
For example, I would like to find which sub-data sets follow the same
normal distribution, which ones follow the same Pareto distribution,
etc.
Perhaps with k-means where the distance metric is the result of the
Kruskal-Wallis Test?
It seems like you have several possible starting points.
The Kruskal-Wallis test compares the *average* rank,
which puts emphasis on comparing the location. Is that your
major concern? For a generic comparison of differences, you
could consider the Kolmogorov-Smirnov maximum distance; or
the sum of squared distances, etc. All of those would be
working with the ranks.
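
A minimal sketch of the Kolmogorov-Smirnov idea above, in plain Python.
It computes the maximum gap between the empirical CDFs of two sub-data
sets, builds the pairwise distance table, and then joins sets whose
distance falls below a cutoff. The function names, the toy samples and
the cutoff of 0.5 are assumptions made for illustration only; nothing
here was posted in the thread, and the single-linkage grouping merely
stands in for whatever clustering method one actually prefers.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Maximum absolute difference between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # Both ECDFs are step functions that jump only at observed values,
    # so checking every observed value is enough to find the maximum gap.
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / n
        fb = bisect_right(b_sorted, x) / m
        d = max(d, abs(fa - fb))
    return d


def pairwise_ks(samples):
    """Symmetric table of KS distances between all pairs of samples."""
    k = len(samples)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = ks_distance(samples[i], samples[j])
    return dist


def threshold_groups(dist, threshold):
    """Single-linkage grouping: link any two sets closer than threshold."""
    k = len(dist)
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            if dist[i][j] < threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(k)]


# Toy data (hypothetical): two overlapping sets and one far away.
samples = [[1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5], [100, 101, 102, 103]]
dist = pairwise_ks(samples)
groups = threshold_groups(dist, threshold=0.5)  # cutoff is an assumption
```

With several thousand sub-data sets the pairwise table costs on the
order of k^2 comparisons, so this is only a starting point.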
If you wanted to consider that the sources are generated in
families, you might start, instead, with assorting them by
width or shape, that is, emphasizing the SD or skewness and
kurtosis. Using the raw values, which should be important
for saying which are actually the *same*, you could do a
clustering based on the several central moments --
mean, SD, skewness, kurtosis.
Keep in mind that standard clustering reflects the
scale of the variables -- if you multiply everything by 10,
you will change the range and importance of the mean
and SD, relative to the usual skewness and kurtosis.
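
The scale point can be checked with a small sketch, again in plain
Python. It computes the mean, SD, skewness and kurtosis of one set,
repeats the calculation after multiplying the data by 10, and then
z-scores each moment across sets so that no single column dominates a
Euclidean distance. The helper names and the toy numbers are
assumptions for illustration; the moments use the population (divide
by n) convention, the kurtosis is the non-excess version, and the data
are assumed not to be constant.

```python
from math import sqrt


def moments(xs):
    """Return (mean, SD, skewness, kurtosis) using divide-by-n moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    sd = sqrt(m2)
    skew = m3 / m2 ** 1.5   # assumes m2 > 0, i.e. non-constant data
    kurt = m4 / m2 ** 2     # non-excess kurtosis
    return (mean, sd, skew, kurt)


def standardize_columns(rows):
    """z-score every column of a table of per-set features."""
    out_cols = []
    for col in zip(*rows):
        mu = sum(col) / len(col)
        sd = sqrt(sum((c - mu) ** 2 for c in col) / len(col))
        out_cols.append([(c - mu) / sd if sd > 0 else 0.0 for c in col])
    return [list(r) for r in zip(*out_cols)]


# Hypothetical sub-data set and the same set multiplied by 10.
xs = [1, 2, 3, 4, 10]
m = moments(xs)
m10 = moments([10 * x for x in xs])
# Mean and SD grow by a factor of 10; skewness and kurtosis do not move.

# Features for several hypothetical sets, put on a common scale before
# handing them to any distance-based clustering.
features = [moments(s) for s in ([1, 2, 3, 4, 10], [2, 2, 3, 9, 9], [0, 1, 5, 6, 7])]
scaled = standardize_columns(features)
```

Whether standardizing is the right choice depends on whether location
and width are meant to matter more than shape, which is the question
raised above.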
I would probably do a factor analysis on those central
moments, and assort the sets according to those components.
If you know what you are doing with clustering, you might
find that your multiple solutions with k-means show you
pretty much the same final conclusions, with lines drawn
for you by the program.
And in all of these possibilities (and there are an infinity of
others), the assumptions that are being made are implicit and
therefore the failure or not of the method cannot be understood, and
models cannot be improved. There is an infinite number of probability
distributions that can be associated with a given finite set of data.
Using these, it is possible to devise methods that cluster the sub-
data sets *arbitrarily*!
That's safe to say.
Jolly good.
So, how does one choose? By making models relevant to *this* context,
based on the *real process* that generated the data, using our
existing knowledge of that process.
A useful part of statistical consulting is the process of evoking
models from the client - even, convincing the knowledgeable but
naive client that he does *have* some prior model in mind -- at
least to the extent that there are classes of models that can be
excluded.
Yes. So I wonder why you did not ask what the data was before making
suggestions. Is it star spectra or bacteria DNA? Holiday snapshots or
returns on investments? Occurrences of ideograms in 19thC Chinese
prose or the heights of children? Numbers of arguments in Shakespeare
plays or numbers of proton collisions in an accelerator?
illywhacker;