Re: Computerised authorship attribution



Delta is a very simple technique. It is basically a 1-nearest neighbour
technique. Try and get some friend of yours with a maths background to
explain it, as once you understand it, you'll see how simple it is.

Basically, measure the frequency of a word such as "the" in a number of
documents. The more the better. Calculate the sample average and sample
standard deviation of those frequencies. The z-score for a particular
document is the frequency of "the" in that text minus the sample average
for "the", divided by the sample standard deviation. If you have an
unknown document and a number of documents by known authors, then you
attribute the unknown document to the closest author. If you're just
considering "the", then you calculate the z-score for "the" in all
documents, including the unknown one. Then you attribute the text to the
author of the known-authorship document with the most similar z-score
(smallest absolute difference). To use more words, calculate z-scores
and absolute differences for each of a number of words. Then, to know
how "different" two documents are, sum the absolute differences in the
z-scores over all of those words. Then attribute the unknown document to
the author of the known-authorship document with the smallest such sum.
That's delta.
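
The steps above can be sketched in a few lines of Python. This is only a
minimal illustration, not a reference implementation: the function names
and the sample frequencies are made up, and it assumes the average and
standard deviation are taken over all documents, including the unknown
one.

```python
from statistics import mean, stdev

def relative_frequencies(tokens, vocabulary):
    """Frequency of each vocabulary word as a fraction of all tokens."""
    total = len(tokens)
    return {w: tokens.count(w) / total for w in vocabulary}

def z_scores(freqs_by_doc, vocabulary):
    """Turn per-document frequencies into z-scores, one word at a time."""
    z = {doc: {} for doc in freqs_by_doc}
    for w in vocabulary:
        values = [f[w] for f in freqs_by_doc.values()]
        mu, sigma = mean(values), stdev(values)  # sample mean and sample SD
        for doc, f in freqs_by_doc.items():
            # Assumption: a word with no variation contributes nothing.
            z[doc][w] = (f[w] - mu) / sigma if sigma > 0 else 0.0
    return z

def delta(z_a, z_b, vocabulary):
    """Sum of absolute differences in z-scores between two documents."""
    return sum(abs(z_a[w] - z_b[w]) for w in vocabulary)

def attribute(unknown, known_authors, z, vocabulary):
    """Author of the known document with the smallest delta (1-nearest neighbour)."""
    nearest = min(known_authors,
                  key=lambda doc: delta(z[unknown], z[doc], vocabulary))
    return known_authors[nearest]

# Hypothetical frequencies (as fractions) for two known texts and one unknown.
vocabulary = ["the", "a"]
freqs = {
    "known_1": {"the": 0.050, "a": 0.010},
    "known_2": {"the": 0.030, "a": 0.030},
    "unknown": {"the": 0.048, "a": 0.012},
}
z = z_scores(freqs, vocabulary)
guess = attribute("unknown", {"known_1": "Author X", "known_2": "Author Y"},
                  z, vocabulary)
print(guess)
```

In practice the vocabulary would be the most common words across the
whole collection rather than a hand-picked pair.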

Generally word frequencies are given as percentages, so if a document
has 5000 words, and "the" appears 215 times in that document, then the
raw frequency for "the" is 0.043 or 4.3%. If the average occurrence of
"the" in all documents is 3.7% with a sample standard deviation of
0.8%, then the estimated z-score for "the" in our document is
(4.3-3.7)/0.8 or 0.75. If another document has a z-score for "the" of
0.5, then the absolute difference in z-scores is 0.25. If we have
document A with z-scores for "the" and "a" of 0.75 and 0.3, and a
document B with z-scores for "the" and "a" of 0.5 and 1.2, then the
delta score for the difference between the documents is:

delta(A,B) = abs(0.75-0.5) + abs(0.3-1.2) = 0.25 + 0.9 = 1.15
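
The arithmetic in this example is easy to check by machine. The snippet
below just replays the numbers given above; the helper names are
hypothetical and the values are rounded for printing because of
floating-point error.

```python
def z_score(freq_pct, mean_pct, sd_pct):
    """z-score of one word frequency, all three arguments in percent."""
    return (freq_pct - mean_pct) / sd_pct

def delta_score(z_a, z_b):
    """Sum of absolute differences between two lists of z-scores."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b))

freq_the = 215 / 5000 * 100            # occurrences / total words, in percent
z_the = z_score(4.3, 3.7, 0.8)         # z-score for "the" in our document
d_ab = delta_score([0.75, 0.3], [0.5, 1.2])  # documents A and B

print(f"{freq_the:.1f}%")  # 4.3%
print(f"{z_the:.2f}")      # 0.75
print(f"{d_ab:.2f}")       # 1.15
```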

Delta, SVMs, and Naive Bayes are fairly similar in performance on large
documents. More generic k-nearest neighbour techniques (e.g. using
Euclidean distance rather than Burrows' sum of absolute differences of
z-scores for word frequencies) don't work as well. I've tried every
"genericisation" of Burrows' technique that I could think of to turn it
into a more generic k-nearest neighbours method, and all such modified
versions performed worse.

For "plotting" texts on a 2-dimensional graph, it's typical to use PCA
on the vectors, and take the first two components as the x and y
coordinates of texts. Attribution is then made by visual inspection of
the graph.
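
As a rough sketch of that plotting step, the code below projects
z-score vectors onto their first two principal components using only the
standard library. It assumes power iteration with deflation as the way
to find the components (a numerical library would normally do this for
you); the function names and the sample vectors are invented for
illustration.

```python
import math

def center(rows):
    """Subtract each column's mean so every dimension averages to zero."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def leading_eigenvector(matrix, iterations=500):
    """Power iteration: returns (eigenvalue, unit eigenvector)."""
    dims = len(matrix)
    start = [float(i + 1) for i in range(dims)]  # arbitrary non-zero start
    start_norm = math.sqrt(sum(x * x for x in start))
    v = [x / start_norm for x in start]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(dims))
                for i in range(dims))
    return value, v

def pca_2d(rows):
    """Return an (x, y) pair per row: its first two principal-component scores."""
    centered = center(rows)
    cov = covariance(centered)
    dims = len(cov)
    val1, pc1 = leading_eigenvector(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - val1 * pc1[i] * pc1[j] for j in range(dims)]
                for i in range(dims)]
    _, pc2 = leading_eigenvector(deflated)
    return [(sum(a * b for a, b in zip(r, pc1)),
             sum(a * b for a, b in zip(r, pc2))) for r in centered]

# Hypothetical z-score vectors: one row per text, one column per word.
z_vectors = [
    [0.75, 0.3, -0.2],
    [0.5, 1.2, 0.1],
    [-1.1, -0.4, 0.9],
    [-0.15, -1.1, -0.8],
]
for x, y in pca_2d(z_vectors):
    print(f"{x:.3f} {y:.3f}")
```

The printed pairs are the x and y coordinates to hand to whatever
plotting tool you prefer.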

Hundreds of dimensions are not a problem. In experiments on large
texts, I found that Burrows' Delta worked best with about the 6000 most
common words.

All of the techniques mentioned will work for two potential authors.
Adding more authors to choose from always makes the problem harder.

Cheers,

Ross-c
