Calculating centroids in k-means clustering (for different distance metrics)



Hi,

I'm working on a software project that uses clustering in N-
dimensional space and I've been investigating the use of k-means
clustering in place of a custom clustering method that is currently
used. I have successfully implemented k-means using Euclidean distance
which has lead me to wonder if other distance metrics would be better
suited to
my problem, specifically Manhattan distance. I see that others have
come to the same conclusion e.g. this link discusses using k-means
along with Manhattan distance to cluster data from mass spectrometers:

http://www.cs.brown.edu/%7Earitz/files/SDMpresentation.pdf (Generating
Normalized Cluster Centers with KMedians)

The motivation for Manhattan over Euclidean distance is that the
dimensions are largely independent of each other - to get from point A
to point B you have to travel along the axes; there is no direct
'Euclidean' path from A to B.

A central part of k-means clustering is that a centroid is calculated
from the members of each cluster. The centroid is then used to
determine which cluster a given point is closest to. In standard k-
means the centroid is the mean of the cluster points, e.g. for two
points in 2D space centroid_x = (x1+x2)/2, centroid_y = (y1+y2)/2

This results in a point that minimizes both the mean distance from the
centroid and the mean squared distance (MSD). So we minimize intra-
cluster variance.

What I'm trying to figure out is what is the correct centroid
calculation to use when using Manhattan distance instead of Euclidean
distance. To minimize /mean/ distance to the centroid we use the /
median/ value of each component, but this does not minimize MSD, as
such I suspect the mean is still the correct centroid calculation.

However, teh only two instances I've found of k-means with Manhattan
distance (including the above link) use the /median/ centroid. One
motivation being that using the median handles outlier points better -
but I'm not sure if that's a primary concern or just a secondary
benefit of having to use median centroids because they are in fact the
correct way of minimizing squared error for the k-means method as a
whole.

Thanks.

Colin Green

BTW my project is sharpneat.sourceforge.net. It's to do with evolving
neural networks and the clustering is how genomes in a population are
speciated to allow evolution to progress on multiple fronts.
.



Relevant Pages

  • Re: Chebyshev distance for K-means
    ... I want to try k-means with the chebyshev distance. ... There are lots of distance measures, but few have a simple computation for the centroid. ...
    (comp.soft-sys.matlab)
  • Re: Does Hubble have to die?
    ... distances to the pleiades open star cluster (tight star cluster visible ... Calibrating Stellar Models with the Pleiades: Resolving the Distance ... ZAMS stars in our sample without the additional uncertainties ... roll constraints at times of maximum parallax factor. ...
    (sci.space.shuttle)
  • Maximum distance for VMS clusters
    ... I contacted Digital Networks, http://www.digitalnetworks.net. ... to them about long distance clustering and the concept of continetal ... The official party line from HP on maximum distance between VMS cluster ...
    (comp.os.vms)
  • Off topic - Spot the error
    ... Here the spreadsheet takes the place of the calculator, and allows the student to see the details of what is happening when a cluster analysis is undertaken. ... Euclidean measures of distance are based on the objects’ values on each of k variables under study. ... These researchers were trying to understand the descent of the modern domestic dog from an examination of the jaws of various specimens of dog-related species found in Thailand. ... The measurement numbers from the above table are used in all the tables ...
    (uk.philosophy.humanism)
  • Re: How PROC CLUSTER deals with Distance data sets ?
    ... My question is about Proc CLUSTER. ... centroid is strongly dependant on the distance used. ... click (left pane) ...
    (sci.stat.consult)

Loading