Re: Computerised authorship attribution



On 26 Jan 2006 11:16:52 -0800, "Ross Clement (Email address invalid -
do not use)" <clemenr@xxxxxxxxxx> wrote:

<snip>

>I tried using a heuristic based on information theory to select words
>for use as dimensions. It didn't improve attribution accuracy. That
>doesn't mean that your experiment won't give better results than
>choosing the most frequent words.

Well, I'll give it a try and see what happens.

As John commented, the choice of variables may be the most important
factor here. My hunch is that reducing the word list in this way will
distinguish authors more reliably than the standard K-Nearest
Neighbour technique on its own.

I think this will aid correct classification - what do you think?
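
To make the idea concrete, here is a rough sketch of how I picture it: each text becomes a vector of relative frequencies over a chosen word list, and an unknown text is given to whichever author wins the vote among its k nearest training texts. The names profile, knn_author and vocab, the Euclidean distance, and k=3 are all my own assumptions for illustration, not anything settled in this thread.

```python
from collections import Counter
import math

def profile(text, vocab):
    """Relative frequency of each vocab word in the text (one value per dimension)."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on an empty text
    return [counts[w] / total for w in vocab]

def knn_author(sample, training, vocab, k=3):
    """Majority vote among the k training texts closest to the sample.

    training is a list of (author, text) pairs; distance is Euclidean
    (an assumption; other distance measures could be swapped in).
    """
    target = profile(sample, vocab)
    ranked = sorted(
        training,
        key=lambda item: math.dist(target, profile(item[1], vocab)),
    )
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

Swapping the most frequent words for a reduced list only changes what goes into vocab; the rest stays the same, which should make the two approaches easy to compare.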

>
>If you're doing experiments like this please make sure that you
>understand what the key words "overfitting" and "cross-validation"
>mean.

Sure will. Like I say, I'm not a statistician in any way; I'm just a
Software Engineering student, so all of these techniques are unfamiliar
territory for me. But I will endeavour to grasp as much as I can.
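
As far as I understand it so far, the simplest form of cross-validation is leave-one-out: hold back one text, train on the rest, check the prediction, and repeat for every text. Below is a minimal sketch under that assumption; leave_one_out_accuracy and the classify callback are hypothetical names of my own, and any classifier taking (text, training) could be plugged in.

```python
def leave_one_out_accuracy(samples, classify):
    """Fraction of samples attributed correctly when each is held out in turn.

    samples is a list of (author, text) pairs.
    classify(text, training) must return a predicted author, where
    training is the same kind of list with the held-out sample removed.
    """
    correct = 0
    for i, (author, text) in enumerate(samples):
        # Everything except sample i is available for training.
        training = samples[:i] + samples[i + 1:]
        if classify(text, training) == author:
            correct += 1
    return correct / len(samples)
```

One thing to watch for: if the word list is chosen by looking at all the texts, including the held-out one, the accuracy will look better than it really is. To avoid that kind of overfitting, the word selection would have to happen inside classify, using only the training texts it is given.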

The calculations I included in my previous post - did they look like a
correct use of K-Nearest Neighbour? I just need to check before I
start to code the thing.

Matt

