Re: Computerised authorship attribution




On 25-Jan-2006, Matt B <mattb333@xxxxxxxxxxxxx> wrote:

> I'm far from 100% on it, but the way I envisage it working is this:
>
> Each training text for an author would be a vector, to be plotted in a
> multidimensional vector space. Each different word in the document
> would be a dimension of the vector. So if the word "person" occurred 5
> times, the value for the dimension represented by the word "person"
> would be 5.
>
> With the training texts "plotted" in the multidimensional space, I
> could then calculate the nearest neighbours to the document to be
> tested, based on an analysis of the differences between the vectors,
> using their dimension values. The class (i.e. author) of those which are
> nearest to the document in question determines which class it should be
> attributed to.
>
> However, the problem is, this would produce massive vectors, of
> potentially hundreds of dimensions - basically however many different
> words were in the document. Is there any way of reducing the number of
> words to be incorporated before creating the vector for the document?
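
For reference, the scheme quoted above can be sketched in a few lines of Python. This is only an illustration under assumptions: the tokenizer, the choice of Euclidean distance, the default k, and the function names are all hypothetical, not anything taken from the original poster's data or code.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Map a text to a sparse count vector: one dimension per distinct word."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def euclidean(a, b):
    """Euclidean distance between two sparse count vectors."""
    # A Counter returns 0 for missing keys, so absent words count as zero.
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in set(a) | set(b)))


def nearest_author(training, test_text, k=3):
    """Majority vote among the k training texts closest to test_text.

    training is a list of (author, text) pairs.
    """
    target = word_counts(test_text)
    ranked = sorted(
        training,
        key=lambda pair: euclidean(word_counts(pair[1]), target),
    )
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the vectors are stored sparsely (only words that actually occur), the number of dimensions costs little here; the distance is computed over the union of words in the two texts being compared.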

A vector with hundreds of dimensions is not a problem for many modeling
methods.

Take a look at this picture of an SVM model being used to distinguish the
authorship of the Federalist Papers:

http://www.dtreg.com/SvmCube2.jpg

Then take a look at the description of SVM at http://www.dtreg.com/svm.htm

Have you collected any data yet? If you have, I'll be happy to build SVM,
TreeBoost and Decision Tree Forest models for you, and you can see how well
they perform at the classification task. All of these methods can accept
input data with hundreds of dimensions.
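
If you would still like to cut the word list down before building the vectors, one simple option is to keep only the k most frequent words across the training texts and project every document onto that fixed vocabulary. The sketch below assumes the documents are already available as word-count mappings; the function names and the choice of k are hypothetical, and whether this helps on your data is something you would have to test.

```python
from collections import Counter


def top_k_vocabulary(count_vectors, k):
    """Return the k words with the highest total count across all vectors."""
    totals = Counter()
    for vec in count_vectors:
        totals.update(vec)  # adds the counts from each document
    return [word for word, _ in totals.most_common(k)]


def project(vec, vocabulary):
    """Dense fixed-length vector: one count per vocabulary word, in order."""
    return [vec.get(word, 0) for word in vocabulary]
```

Every document then becomes a list of exactly k numbers, which can be fed to whichever modeling method you choose.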

--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree and SVM predictive modeling)
http://www.nlreg.com (nonlinear regression)