Re: PCA C code contradictions




shay wrote:
I have three separate codes that run PCA. When I run my data through
them (or even a small synthetic data set of orthogonal variable
vectors) I get three different results.

Unfortunately there are many degrees of freedom in how PCA is set up,
and they combine into a large set of possibilities, so the chances of
any two implementations giving the same answer are slim.

Those that this non-statistician can think of:

- diagonalise the covariance matrix C = E[ (x - mu)(x - mu)' ], where
' denotes transpose;

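- use the 'correlation' matrix R = E[ x x' ] instead, i.e. the
second-moment matrix with no mean subtraction (the signal-processing
sense of the term);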

- looking at Murtagh's PCAcorr.java (which I assume is a direct port of
his C code), I see that in addition to subtracting the means he
'standardises', i.e. scales each component so that it has unit
standard deviation; in other words, he diagonalises the covariance
matrix of the /scaled/ data, which is what statisticians usually call
the correlation matrix.
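
To make the three options above concrete, here is a minimal Python
sketch. It is an illustration only and is not taken from any of the
three packages. The toy data set `rows`, the function names, and the
choice of dividing by n rather than n - 1 are all assumptions made for
the example. It builds the three matrices for two-variable data and
uses the closed-form principal-axis angle of a 2x2 symmetric matrix to
show that the leading component changes with the matrix chosen.

```python
import math

def columns(rows):
    """Transpose a list of observations into a list of variables."""
    return [list(c) for c in zip(*rows)]

def second_moment(rows):
    """Estimate E[x x'] with no mean subtraction.
    Divisor n (not n - 1) is an assumption; implementations differ here too."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]

def centre(rows):
    """Subtract each variable's mean."""
    mus = [sum(c) / len(c) for c in columns(rows)]
    return [[v - m for v, m in zip(r, mus)] for r in rows]

def standardise(rows):
    """Centre, then scale each variable to unit standard deviation."""
    centred = centre(rows)
    sds = [math.sqrt(sum(v * v for v in c) / len(c)) for c in columns(centred)]
    return [[v / s for v, s in zip(r, sds)] for r in centred]

def covariance(rows):
    """C = E[(x - mu)(x - mu)']."""
    return second_moment(centre(rows))

def correlation(rows):
    """Covariance matrix of the standardised data."""
    return second_moment(standardise(rows))

def leading_direction(m):
    """Unit eigenvector for the larger eigenvalue of a 2x2 symmetric matrix."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return (math.cos(theta), math.sin(theta))

# Hypothetical toy data: the second variable has a much larger scale.
rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]

for name, matrix in [("second moment", second_moment(rows)),
                     ("covariance", covariance(rows)),
                     ("correlation", correlation(rows))]:
    print(name, matrix, leading_direction(matrix))
```

With data like this, the covariance-based direction is dominated by the
large-scale variable, while the standardised version weights both
variables equally, so two correct programs can disagree simply because
they start from different matrices.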

No doubt there are variations. And in the example of PCA in Venables
and Ripley, Modern Applied Statistics with S, 4th ed., Springer, I see
that they use the iris data set, but take logs. I attempted to
replicate Murtagh's results in R (free clone of S), but I haven't
worked out how to 'standardise' the data (I'm an infrequent user of R).

The different codes are:
1. Accelrys Cerius2 version 4.11
2. http://astro.u-strasbg.fr/~fmurtagh/mda-sw/
3. http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf

Can anyone explain why differences exist and more importantly which is
the "correct" one?

Best regards,

Jon C.
