Re: PCA for cluster detection?




Liviu Ene wrote:
Hi,
I want to apologize for my ignorance, since I'm just a student trying to get acquainted with statistics. I have a question regarding principal component analysis.
I've been asked a quite dangerous question: "Explain how PCA, cluster analysis and the correlation matrix TOGETHER can be used to improve knowledge about the relationships between different variables of a data set."
I think that PCA, because it reduces the dimensionality of the data set, should NOT be recommended BEFORE cluster analysis. Am I wrong? One can superimpose the few factors obtained after running PCA on the initial data set, hoping that this will reveal some clustering, but the space spanned by the retained PCs contains less variance than the original data set. Is this superimposing method a reliable way to recognize a cluster pattern in the original data set? Am I missing something?
I know that PCA can be very tricky and I don't want to get it wrong from the beginning, so I will really appreciate your opinion!
Thank you,
Liviu

Via eigenanalysis of the mixture covariance matrix, PCA finds the
orthogonal directions in sample space along which the data have,
in decreasing order, the largest variance (spread/scatter).
Given a cluster criterion (e.g., k-means), there are several
concepts to consider when trying to determine the cluster
patterns within a mixture of data containing groups.
1. Scaling. The scales used for the different variables can
change perceptions of importance and relationships. It
is necessary to choose these wisely in the context of the
ultimate problem to be solved. Often it is wise to use
standardized variables to concentrate on correlation
structure. The covariance matrix of the standardized
variables is just the correlation matrix.
2. Between-cluster scatter: The scatter of the cluster
means about the mixture mean, each weighted by cluster size.
3. Within cluster scatter: The sum over clusters of the
scatter, within a cluster, about the mean of that cluster.
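
The remark in item 1, that the covariance matrix of the standardized variables is the correlation matrix, can be checked numerically. The following is only a minimal sketch using NumPy; the data, the scales and the variable names are made up for illustration and are not from the original question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples of 3 variables on very different scales.
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
X[:, 1] += 20.0 * X[:, 0]  # make variables 0 and 1 correlated

# Standardize: subtract each column mean, divide by the sample
# standard deviation (ddof=1 matches the default used by np.cov).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_standardized = np.cov(Z, rowvar=False)
corr_of_original = np.corrcoef(X, rowvar=False)

# The two matrices agree up to floating-point rounding.
print(np.allclose(cov_of_standardized, corr_of_original))
```

Running PCA on Z instead of X therefore amounts to an eigenanalysis of the correlation matrix, so no variable dominates merely because of its units.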

The total scatter is the sum of the between-cluster scatter
and the within-cluster scatter. The total variance is
proportional to the total scatter. The between- and within-
cluster variances are proportional to the corresponding
scatters. The proportionality constants are reciprocals
of the degrees of freedom used to estimate the mixture
and cluster means. (I don't want to mislead you by
quoting equations from my aging memory.)
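
Since no equations are quoted above, here is a small numerical check of the scatter decomposition rather than a formula from memory. It is a sketch under stated assumptions: the three clusters, their sizes and the labels are hypothetical, and the between-cluster term weights each cluster mean by its cluster size, which is what makes the sum come out exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mixture: three 2-D clusters with known labels.
centers = np.array([[0.0, 0.0], [5.0, 1.0], [-2.0, 4.0]])
sizes = [40, 60, 50]
X = np.vstack([c + rng.normal(size=(n, 2)) for c, n in zip(centers, sizes)])
labels = np.repeat(np.arange(3), sizes)

def scatter(A, center):
    """Scatter matrix of the rows of A about the given center."""
    D = A - center
    return D.T @ D

m = X.mean(axis=0)                      # mixture mean
S_total = scatter(X, m)
S_within = np.zeros((2, 2))
S_between = np.zeros((2, 2))
for k in range(3):
    Xk = X[labels == k]
    mk = Xk.mean(axis=0)                # cluster mean
    S_within += scatter(Xk, mk)
    d = (mk - m)[:, None]
    S_between += len(Xk) * (d @ d.T)    # weighted by cluster size

print(np.allclose(S_total, S_within + S_between))

# Dividing the total scatter by N - 1 gives the usual sample covariance.
N = len(X)
print(np.allclose(S_total / (N - 1), np.cov(X, rowvar=False)))
```

With the usual conventions the total scatter is divided by N - 1; the corresponding divisors for the within and between parts depend on the number of clusters.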

If a reduced PCA (only the leading components) is used to
represent the cluster mixture, some of the clustering may be missed.
For example, consider two closely spaced parallel
cigar-shaped distributions in 2-D. The dominant
PC is in the direction of the cigar lengths and
projections on that direction will obscure the parallel
separation.
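
The cigar example can be simulated in a few lines. This is an illustrative sketch only: the cigar length, thickness, offset and sample size are arbitrary assumptions, and the separation score (difference of group means over pooled standard deviation) is just one convenient way to quantify how well a projection splits the two groups.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500  # points per cigar (arbitrary)

def cigar(y_offset):
    # Long along x (std 5), thin along y (std 0.2), shifted in y.
    return np.column_stack([rng.normal(0.0, 5.0, n),
                            rng.normal(y_offset, 0.2, n)])

# Two parallel cigars, offset by +/-1 in the thin direction.
X = np.vstack([cigar(-1.0), cigar(1.0)])
labels = np.repeat([0, 1], n)

# PCA via eigenanalysis of the mixture covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending order
pc1 = eigvecs[:, -1]   # dominant component (along the cigars)
pc2 = eigvecs[:, 0]    # minor component (across the cigars)

def separation(direction):
    """Gap between group means along a direction, in pooled std units."""
    p = Xc @ direction
    a, b = p[labels == 0], p[labels == 1]
    pooled_sd = np.sqrt(0.5 * (a.var(ddof=1) + b.var(ddof=1)))
    return abs(a.mean() - b.mean()) / pooled_sd

print("separation along PC1:", separation(pc1))
print("separation along PC2:", separation(pc2))
```

Keeping only PC1 retains most of the total variance but almost none of the group separation, which lives in the discarded low-variance component.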

Other repliers will add to what I have written.

Hope this helps.

Greg
