Re: k-means or not?



It would be more usual to analyze the data with hierarchical cluster
analysis (HCA), e.g., average-linkage or single-linkage. Here's
how:

1. From raw data, construct a matrix of co-occurrence frequencies, F,
between each pair of methods, where f(i,j) is the number of times
method i occurs with method j. This matrix is symmetrical.
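
As a rough sketch of step 1 (in Python, my choice; the records and
method names are made up, and I am assuming each raw record is simply
the set of methods that occurred together):

```python
from itertools import combinations

# Hypothetical raw data: each record lists the methods that occurred
# together once. The names "A".."E" are placeholders, not real methods.
records = [
    {"A", "B"},
    {"A", "B", "C"},
    {"C", "D"},
    {"A", "B"},
    {"C", "D", "E"},
    {"D", "E"},
]

methods = sorted(set().union(*records))
index = {name: k for k, name in enumerate(methods)}
n = len(methods)

# F[i][j] = number of records in which methods i and j both occur.
# The diagonal is left at zero; it is not needed for clustering.
F = [[0] * n for _ in range(n)]
for rec in records:
    for a, b in combinations(sorted(rec), 2):
        i, j = index[a], index[b]
        F[i][j] += 1
        F[j][i] += 1  # mirror the count so F stays symmetric

# f[i] = number of records in which method i occurs at all; these are
# the marginal frequencies used for the adjustment in step 2.
f = [sum(name in rec for rec in records) for name in methods]
```

If your raw data are coded differently (say, a cases-by-methods 0/1
table), the counting loop would need to change accordingly.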

2. From F, produce a proximity matrix, P, by adjusting each element
for the marginal row and column frequencies. That is, adjust f(i,j) by
the numbers of times method i and method j occur overall--f(i) and
f(j).

Note: This step is where the 'art' comes in. There are several ways
to make the adjustment and you need to select one suitable for your
goals. Some examples are:

p(i,j) = f(i,j) / sqrt[f(i) * f(j)]
p(i,j) = f(i,j) / [f(i) + f(j)]
p(i,j) = f(i,j) / min[f(i), f(j)]

You might get ideas by checking the literature on cluster analysis
and/or multidimensional scaling of co-occurrence matrices.
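
For illustration, the three example adjustments could be coded roughly
as follows. The function names and the small F and f are hypothetical,
and the sketch assumes every method occurs at least once, so no
denominator is zero:

```python
import math

def p_geometric(fij, fi, fj):
    # f(i,j) / sqrt[f(i) * f(j)]
    return fij / math.sqrt(fi * fj)

def p_sum(fij, fi, fj):
    # f(i,j) / [f(i) + f(j)]
    return fij / (fi + fj)

def p_min(fij, fi, fj):
    # f(i,j) / min[f(i), f(j)]
    return fij / min(fi, fj)

def proximity_matrix(F, f, adjust):
    """Apply one adjustment to every off-diagonal cell of F.

    The diagonal is set to 0.0 as a placeholder; it is not used when
    clustering.
    """
    n = len(f)
    return [
        [adjust(F[i][j], f[i], f[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]

# Made-up counts for three methods.
F = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]
f = [4, 9, 4]

P_geo = proximity_matrix(F, f, p_geometric)
P_sum = proximity_matrix(F, f, p_sum)
P_min = proximity_matrix(F, f, p_min)
```

With these counts the three adjustments give different values for the
same pair, and the differences between pairs also change from one
adjustment to the next, which is why the choice matters.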

3. Use HCA to analyze the matrix P. You need software that lets you
supply a proximity matrix rather than raw data. SAS will let you do
this. If you don't have too many methods (< 50) I also have a program
at StatLib for this.
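
If you have Python handy, SciPy is another package that accepts a
precomputed matrix (this is not the SAS or StatLib route mentioned
above, just an alternative sketch). The matrix P below is made up, and
converting proximities to dissimilarities as 1 - p is my assumption;
it only makes sense if your proximities run from 0 to 1:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical proximity matrix for five methods (larger = more similar).
P = np.array([
    [1.0, 0.9, 0.2, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1, 0.0],
    [0.2, 0.2, 1.0, 0.7, 0.6],
    [0.1, 0.1, 0.7, 1.0, 0.8],
    [0.0, 0.0, 0.6, 0.8, 1.0],
])

# linkage() works on dissimilarities, so convert the proximities.
# 1 - p is one simple choice (an assumption, not the only option).
D = 1.0 - P
np.fill_diagonal(D, 0.0)

# squareform() turns the symmetric matrix into the condensed vector
# that linkage() expects; method="single" would give single linkage.
Z = linkage(squareform(D), method="average")

# Cut the tree into two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

scipy.cluster.hierarchy.dendrogram(Z) would draw the tree if you want
to pick the number of clusters by eye.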

If for some reason you strongly prefer k-means, there is a trick you
could use: First submit the P matrix to multidimensional scaling. The
scaling converts the proximities into a set of coordinates for each
method. These coordinates can then be used in k-means clustering.
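
A minimal sketch of that trick, assuming classical (Torgerson) metric
scaling and a bare-bones k-means written out in NumPy. Both are my
choices for illustration; a nonmetric scaling program or a packaged
k-means routine would serve equally well. The matrix P is made up and
1 - p is again an assumed conversion to dissimilarities:

```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical MDS: coordinates from a dissimilarity matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squares
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]    # largest eigenvalues first
    vals = np.clip(vals[order], 0.0, None)   # drop negative eigenvalues
    return vecs[:, order] * np.sqrt(vals)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm with farthest-point starting centers."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Hypothetical proximity matrix for five methods.
P = np.array([
    [1.0, 0.9, 0.2, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1, 0.0],
    [0.2, 0.2, 1.0, 0.7, 0.6],
    [0.1, 0.1, 0.7, 1.0, 0.8],
    [0.0, 0.0, 0.6, 0.8, 1.0],
])

X = classical_mds(1.0 - P, dims=2)   # one row of coordinates per method
labels, centers = kmeans(X, k=2)
print(labels)
```

How many dimensions to keep is a judgment call; too few can distort
the proximities before k-means ever sees them.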

--
John Uebersax PhD
