Calculating centroids in k-means clustering (for different distance metrics)
- From: Colin Green <wellhavenospamhere@xxxxxxxxxxxxxx>
- Date: Tue, 8 Sep 2009 10:25:15 -0700 (PDT)
Hi,
I'm working on a software project that uses clustering in N-
dimensional space and I've been investigating the use of k-means
clustering in place of a custom clustering method that is currently
used. I have successfully implemented k-means using Euclidean distance
which has lead me to wonder if other distance metrics would be better
suited to
my problem, specifically Manhattan distance. I see that others have
come to the same conclusion e.g. this link discusses using k-means
along with Manhattan distance to cluster data from mass spectrometers:
http://www.cs.brown.edu/%7Earitz/files/SDMpresentation.pdf (Generating
Normalized Cluster Centers with KMedians)
The motivation for Manhattan over Euclidean distance is that the
dimensions are largely independent of each other - to get from point A
to point B you have to travel along the axes; there is no direct
'Euclidean' path from A to B.
A central part of k-means clustering is that a centroid is calculated
from the members of each cluster. The centroid is then used to
determine which cluster a given point is closest to. In standard k-
means the centroid is the mean of the cluster points, e.g. for two
points in 2D space centroid_x = (x1+x2)/2, centroid_y = (y1+y2)/2
This results in a point that minimizes both the mean distance from the
centroid and the mean squared distance (MSD). So we minimize intra-
cluster variance.
What I'm trying to figure out is what is the correct centroid
calculation to use when using Manhattan distance instead of Euclidean
distance. To minimize /mean/ distance to the centroid we use the /
median/ value of each component, but this does not minimize MSD, as
such I suspect the mean is still the correct centroid calculation.
However, teh only two instances I've found of k-means with Manhattan
distance (including the above link) use the /median/ centroid. One
motivation being that using the median handles outlier points better -
but I'm not sure if that's a primary concern or just a secondary
benefit of having to use median centroids because they are in fact the
correct way of minimizing squared error for the k-means method as a
whole.
Thanks.
Colin Green
BTW my project is sharpneat.sourceforge.net. It's to do with evolving
neural networks and the clustering is how genomes in a population are
speciated to allow evolution to progress on multiple fronts.
.
- Follow-Ups:
- Prev by Date: &★☆&Paypal paymen Wholesale Cheap Boots brand shoes ( www.salewto.com )
- Next by Date: ✄✄ Cheap price wholesale Chanel handbag and purse, AAA True Leather ect at WEBSITE --- www.fjrjtrade.cn (paypal payment)
- Previous by thread: &★☆&Paypal paymen Wholesale Cheap Boots brand shoes ( www.salewto.com )
- Next by thread: Re: Calculating centroids in k-means clustering (for different distance metrics)
- Index(es):
Relevant Pages
|
Loading