Re: Computerised authorship attribution
- From: Matt B <mattb333@xxxxxxxxxxxxx>
- Date: Wed, 25 Jan 2006 22:09:27 GMT
On 24 Jan 2006 11:46:47 -0800, "Ross Clement (Email address invalid -
do not use)" <clemenr@xxxxxxxxxx> wrote:
>John Burrows' "Delta" technique is very simple and easy to understand,
>and works very well. Same for Naive Bayes. Support Vector Machines also
>work well. If your assignment allows it you could adapt the libSVM
>code available for download from National Taiwan University. In my
>experience, simple nearest neighbours approaches do not work well.
>However "Delta" is a nearest neighbour approach. CUSUM is "contentious"
>to say the least. In any case it's a technique more suited to literary
>experts as it requires you to edit the language of the original
>documents ... the cause of much of the contention as it's possible to
>edit the original text until it gives the result you expect. See papers
>by David Holmes criticising CUSUM, or the book by Farringdon if you
>want to see the case for it and a lot of details about it. See papers
>in the journals "Literary and Linguistic Computing", "Computers and the
>Humanities", and similar journals.
>
>Cheers,
>
>Ross-c
Ross -
Thanks very much. You certainly seem to know a lot about the topic.
I've already had a look at Burrows's Delta - in particular Hoover's
testing of it, but it didn't strike me as being that simple (though
that's probably my limited maths background showing itself). Also I'd
read that it was more suited to cases where the number of potential
authors was greater than 2, whereas I am looking to distinguish
between 2 authors only.
I've been thinking of using the K-Nearest Neighbour technique. Do you
know anything about this technique?
I'm far from 100% on it, but the way I envisage it working is this:
Each training text for an author would be a vector, to be plotted in a
multidimensional vector space. Each different word in the document
would be a dimension of the vector. So if the word "person" occurred 5
times, the value for the dimension represented by the word "person"
would be 5.
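
A minimal sketch of that word-count representation, in Python. The helper names (`tokenize`, `count_vector`), the crude regex tokeniser and the three-word vocabulary are illustrative assumptions, not part of any particular library or published method:

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def count_vector(text, vocabulary):
    """Return the raw count of each vocabulary word in `text`.

    Each position in the returned list is one dimension of the vector,
    in the same order as `vocabulary`.
    """
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and sample text, for illustration only.
vocabulary = ["person", "the", "of"]
sample = "The person spoke of the person next to the other person."
vector = count_vector(sample, vocabulary)
print(vector)
```

Every document has to be counted against the same vocabulary, in the same order, so that a given dimension means the same word in every vector.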
With the training texts "plotted" in the multidimensional space, I
could then calculate the nearest neighbours to the document to be
tested, based on an analysis of the differences between the vectors,
using their dimension values. The class (i.e. author) of those which are
nearest to the document in question determines which class it should be
attributed to.
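
Roughly, the nearest-neighbour step could look like the sketch below. It assumes Euclidean distance and a simple majority vote among the k closest training vectors; the function names and the toy vectors are made up for illustration, and other distance measures are equally possible:

```python
from collections import Counter
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training, query, k=3):
    """Attribute `query` to the majority author among its k nearest neighbours.

    `training` is a list of (vector, author) pairs.
    """
    nearest = sorted(training, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(author for _, author in nearest)
    return votes.most_common(1)[0][0]

# Toy word-count vectors for two hypothetical authors "A" and "B".
training = [
    ([5, 1, 0], "A"),
    ([4, 2, 0], "A"),
    ([6, 0, 1], "A"),
    ([0, 4, 5], "B"),
    ([1, 5, 4], "B"),
    ([0, 6, 6], "B"),
]
predicted = knn_predict(training, [5, 1, 1], k=3)
print(predicted)
```

With only two candidate authors, an odd k avoids tied votes. Raw counts also grow with document length, so it is probably safer to divide each count by the total number of words in the document before measuring distances.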
However, the problem is, this would produce massive vectors, of
potentially hundreds of dimensions - basically however many different
words were in the document. Is there any way of reducing the number of
words to be incorporated before creating the vector for the document?
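
One common way to cut the vectors down, sketched here under my own assumptions, is to keep only the N most frequent words across all the training texts and ignore everything else. Burrows's Delta also works from a list of the most frequent words. The helper names and the two tiny texts are hypothetical:

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(texts, n):
    """Return the n most frequent words over all the given texts."""
    totals = Counter()
    for text in texts:
        totals.update(tokenize(text))
    return [word for word, _ in totals.most_common(n)]

def relative_frequencies(text, vocabulary):
    """Count each vocabulary word in `text`, divided by the text's length in words."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty text
    return [counts[word] / total for word in vocabulary]

# Two tiny made-up texts, for illustration only.
texts = ["the cat and the dog and the bird", "the end of the road"]
vocabulary = top_words(texts, 2)
print(vocabulary)
print(relative_frequencies(texts[1], vocabulary))
```

The value of N is a free choice here. The vocabulary should be fixed from the training texts alone and then reused unchanged for the document being tested.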
Does what I've said make sense? Or have I got this arse-about-face?
What do you think to it?
Matt (confused)