Re: Computerised authorship attribution
- From: "Phil Sherrod" <phil.sherrod@xxxxxxxxxxxxxxxxxxx>
- Date: Wed, 25 Jan 2006 22:35:26 GMT
On 25-Jan-2006, Matt B <mattb333@xxxxxxxxxxxxx> wrote:
> I'm far from 100% on it, but the way I envisage it working is this:
>
> Each training text for an author would be a vector, to be plotted in a
> multidimensional vector space. Each different word in the document
> would be a dimension of the vector. So if the word "person" occurred 5
> times, the value for the dimension represented by the word "person"
> would be 5.
>
> With the training texts "plotted" in the multidimensional space, I
> could then calculate the nearest neighbours to the document to be
> tested, based on an analysis of the differences between the vectors,
> using their dimension values. The class (i.e. author) of those which are
> nearest to the document in question determines which class it should be
> attributed to.
>
> However, the problem is, this would produce massive vectors, of
> potentially hundreds of dimensions - basically however many different
> words were in the document. Is there any way of reducing the number of
> words to be incorporated before creating the vector for the document?
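The scheme described above can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the function names, the whitespace tokenisation, the Euclidean distance and the vocab_size cut-off (keeping only the most frequent words across the training texts, which is one simple way to cap the number of dimensions) are illustrative choices, not taken from any particular package.

```python
import math
from collections import Counter


def word_counts(text):
    """Map each lower-cased, whitespace-separated word to its count."""
    return Counter(text.lower().split())


def top_words(texts, n):
    """Return the n most frequent words across all training texts."""
    total = Counter()
    for text in texts:
        total.update(word_counts(text))
    return [word for word, _ in total.most_common(n)]


def to_vector(text, vocabulary):
    """One dimension per vocabulary word; the value is that word's count."""
    counts = word_counts(text)
    return [counts[word] for word in vocabulary]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_author(test_text, training, k=3, vocab_size=100):
    """training is a list of (author, text) pairs. Returns the majority
    author among the k training texts closest to test_text."""
    # Assumption: the vocabulary is limited to the vocab_size most
    # frequent training words, which bounds the vector length.
    vocabulary = top_words([text for _, text in training], vocab_size)
    target = to_vector(test_text, vocabulary)
    ranked = sorted(
        training,
        key=lambda pair: distance(to_vector(pair[1], vocabulary), target),
    )
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

With the vocabulary ["person", "dog"], a text in which "person" occurs 5 times and "dog" never occurs becomes the vector [5, 0]. Raw counts grow with document length, so dividing each count by the total number of words in the text may be worth trying when the texts differ a lot in size.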
A vector with hundreds of dimensions is not a problem for many modeling
methods.

Take a look at this picture of an SVM model being used to distinguish the
authorship of the Federalist Papers:
http://www.dtreg.com/SvmCube2.jpg

Then take a look at the description of SVM at http://www.dtreg.com/svm.htm

Have you collected any data yet? If you have, I'll be happy to build SVM,
TreeBoost and Decision Tree Forest models for you, and you can see how well
they perform at the classification task. All of these methods can accept
input data with hundreds of dimensions.
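
To give a feel for why hundreds of dimensions need not be a problem, here is a minimal sketch of a linear SVM trained by plain sub-gradient descent on the regularised hinge loss. It is not how DTREG builds its models; the function names, learning rate, regularisation constant and epoch count are assumptions chosen for illustration, and the labels +1 and -1 stand for two candidate authors.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(vectors, labels, epochs=100, rate=0.1, lam=0.01):
    """Fit a linear SVM by cycling through the training vectors and
    taking a sub-gradient step on the hinge loss plus an L2 penalty.
    labels must be +1 or -1. Returns (weights, bias)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (dot(weights, x) + bias)
            # Regularisation step: shrink every weight slightly.
            weights = [w * (1.0 - rate * lam) for w in weights]
            if margin < 1.0:
                # The example is inside the margin or misclassified,
                # so move the separating plane towards classifying it.
                weights = [w + rate * y * v for w, v in zip(weights, x)]
                bias += rate * y
    return weights, bias


def predict(weights, bias, vector):
    """Return +1 or -1 depending on the side of the separating plane."""
    return 1 if dot(weights, vector) + bias >= 0 else -1
```

Each update touches every dimension once, so the cost per training text grows only linearly with the number of words used as dimensions. Each input vector would be one text's word counts, in the same word order for every text.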
--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree and SVM predictive modeling)
http://www.nlreg.com (nonlinear regression)