Re: datasets to test clustering algorithm

From: Ross Clement (clemenr_at_wmin.ac.uk)
Date: 08/13/04


Date: 13 Aug 2004 00:01:14 -0700

Rajarshi Guha <rajarshi@presidency.com> wrote in message news:<pan.2004.08.12.19.58.15.292994@presidency.com>...
> Hello, does anybody know of any freely available datasets which I could
> use in the comparison of a few clustering algorithms? I know of the UCI
> machine learning repository but I'm not sure which dataset would be useful
> for such a comparison. I'm looking for a large dataset ( > 1000
> observations) with 2 to 4 clusters.
>
> Any pointers would be appreciated.

My extreme personal bias suggests the following:

Look up David Hoover's paper:

Statistical Stylistics and Authorship Attribution: an Empirical
Investigation
Hoover D.L.
Literary and Linguistic Computing, December 2001, vol. 16, iss. 4, pp.
421-444
Oxford University Press

This paper attempts to discover the author of a text by clustering
texts, with each text represented by a fixed-size set of word counts.

You'd have to do some programming. But, if you can, you can download a
whole lot of books from Project Gutenberg (http://www.promo.net/pg/),
process them to turn them into these fixed-length numerical vectors
(as described in more detail in his paper), and then cluster away to
see whether or not you end up clustering books by the same author
together.
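
To give a rough idea of the programming involved, here is a minimal
sketch in Python. It is not Hoover's exact procedure (see the paper for
that). It assumes the vectors are relative frequencies of the N most
frequent words across the whole corpus, and the function names
(tokenize, build_vocabulary, to_vector, cluster_purity) are made up for
illustration. The last function is one simple way to score how well a
clustering groups books by the same author.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts, n_words):
    """Pick the n_words most frequent words across the whole corpus.

    Assumption: the fixed-size word set is chosen by overall frequency.
    """
    totals = Counter()
    for text in texts:
        totals.update(tokenize(text))
    return [word for word, _ in totals.most_common(n_words)]


def to_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty text
    return [counts[word] / total for word in vocabulary]


def cluster_purity(cluster_labels, authors):
    """Fraction of texts that share their cluster's majority author.

    cluster_labels come from whatever clustering algorithm is under
    test; authors are the known true authors, in the same order.
    """
    by_cluster = {}
    for label, author in zip(cluster_labels, authors):
        by_cluster.setdefault(label, Counter())[author] += 1
    matched = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return matched / len(authors)
```

You would build the vocabulary once from all the books, turn each book
into a vector with to_vector, hand the vectors to each clustering
algorithm you want to compare, and then compare the purity scores.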

And, if your clustering algorithms get better results than Hoover got,
you could try submitting a paper on your results to that journal!

Cheers,

Ross-c


