Re: Can any one help me calculate a statistical probability



On Mar 18, 1:01 pm, flame.d...@xxxxxxxxx wrote:
> Here is the question. This concerns a claim of plagiarism. There are
> two indexes of a similar text numbering about 750,000 words. The first
> index has 27,740 terms in it, while the second index has 3,500 terms
> in it. The authors of the first index claim that the authors of the
> second plagiarized their index, but it turns out the indexes are mostly
> different, and only a few terms are similar. Can anyone calculate what
> the random similarity would be, i.e., if we assume that there was no
> plagiarism and that index 1 (27,740 terms) and index 2 (3,500 terms) were
> independently derived, what would be the probability that some of the
> terms would still be identical if the text to which the indexes refer
> is 80%-90% similar?

Er... is this a legal claim, which is going to be
ruled on in court?

I don't think "somebody said this on Usenet" is
going to weigh very much as "expert testimony"
if that's what you're looking for.

It seems to me impossible to answer without some
description of how indexing is done. But it also
seems to me that if the same person were going to
use a similar algorithm (with some sort of threshold)
to make a 3500-term index and a 27000-term index
from the same text, then every term in the 3500-term
index would appear in the 27000-term index. That's
just my naive impression of how indexing would be done,
that the 3500 most significant terms are a proper
subset of the 27000 most significant terms.
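
To make that naive picture concrete, here is a minimal sketch. It
assumes "indexing" means nothing more than keeping the k most frequent
words, which is almost certainly not how a real index is built. The
function top_terms, the tie-breaking rule, and the sample text are all
made up for illustration.

```python
from collections import Counter

def top_terms(text, k):
    """Return the k most frequent words as a set (hypothetical indexing rule)."""
    counts = Counter(text.lower().split())
    # Rank by descending count, then alphabetically, so ties are broken
    # the same way no matter which k is requested.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return set(ranked[:k])

text = "the cat sat on the mat and the cat ate the rat on the mat"
small = top_terms(text, 3)   # stands in for the 3500-term index
large = top_terms(text, 6)   # stands in for the 27000-term index

# With one fixed ranking, the shorter index is always contained in the longer one.
print(small <= large)
```

Under that rule the overlap is total, so a small overlap would mean
the two indexes were not produced by the same ranking.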

But a really informed answer would require,
as I said, some description of "indexing".
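
Just to show the kind of number such a description would let you
compute, here is a sketch under a very crude model: both indexes are
independent, uniformly random samples (without replacement) from one
shared pool of candidate terms. Then the number of shared terms is
hypergeometric, with mean n1 * n2 / pool. The pool sizes below are
pure guesses, not anything taken from the actual texts, and real
indexers do not pick terms uniformly at random.

```python
def expected_overlap(n1, n2, pool):
    """Mean number of shared terms between two independent uniform samples
    of sizes n1 and n2 drawn without replacement from `pool` candidates
    (the mean of a hypergeometric distribution)."""
    return n1 * n2 / pool

# Index sizes come from the question; the pool sizes are hypothetical.
for pool in (50_000, 100_000, 200_000):
    print(pool, expected_overlap(27_740, 3_500, pool))
```

The answer swings by a factor of four across these guesses, which is
why the pool of candidate terms and the selection rule matter so much.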

- Randy