Re: Can any one help me calculate a statistical probability



We'll need some additional information if the answer is to be
satisfactory. How do we go about choosing a word from the text for
an index? Even if we
choose randomly, we aren't going to choose "and" as often as we choose
"integral equations," so we can't assume that the indices are
uniformly distributed over the texts. I mean, we could, but this
wouldn't be useful.

Here's a naive stab at the problem. Let's assume, wildly, that you
know the probabilities of any author choosing a phrase from a text to
be in the index; e.g. P('and') = 0.00 and P('integral equations') =
0.99. Is this valid? Only somewhat, but it's probably the best
approximation we'll be able to get, and booksellers like Amazon have
found some use for it.

Let's then assume that author A has chosen all of the phrases for
index A. For each phrase in text A, create an indicator variable which
tells us whether the phrase was chosen for inclusion; e.g. i_A_and = 0
tells us that 'and' wasn't included by author A, while
i_A_integralequations = 1 tells us that 'integral equations' was
included by author A.

Then consider the set of all phrases in text A intersected with the
set of all phrases in text B. Here we're only considering what they
have in common because B isn't so stupid that he would plagiarize
about the wrong thing. This gives us a set of indicators like...

i_A_aardvark = 0, i_B_aardvark = 1
i_A_and = 0, i_B_and = 0

....

i_A_zebra = 0, i_B_zebra = 0
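
As a rough sketch of how that table could be built: the phrase sets and
indices below are made up for illustration, and the function name
indicators is my own, not anything standard.

```python
# Hypothetical inputs: the phrases occurring in each text, and the
# phrases each author put in his index. All of this data is made up.
text_a = {"aardvark", "and", "integral equations", "zebra"}
text_b = {"aardvark", "and", "integral equations", "kernel", "zebra"}
index_a = {"integral equations"}
index_b = {"aardvark", "integral equations"}


def indicators(text_a, text_b, index_a, index_b):
    """Map each phrase common to both texts to (i_A_phrase, i_B_phrase)."""
    common = sorted(text_a & text_b)  # only phrases both texts contain
    return {ph: (int(ph in index_a), int(ph in index_b)) for ph in common}


table = indicators(text_a, text_b, index_a, index_b)
for phrase, (i_a, i_b) in table.items():
    print(f"i_A_{phrase} = {i_a}, i_B_{phrase} = {i_b}")
```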

If we take the correlations of these variables, we'll probably find a
very high correlation. This makes sense if their underlying texts have
basically the same topic. However, what happens when we calculate the
partial correlation given the probabilities that the words would have
appeared in the first place? That is, if both texts include the phrase
'integral equations', this means very little to us because the
probability of that happening was high in the first place. However,
what if both texts include 'zebra' when P('zebra') = 0.01? This is a
highly unlikely event and will lend weight to the correlation. If this
partial correlation is high, then when one author deviates from the
underlying probability distribution, the other tends to deviate the
same way. The strength of this correlation is informative, and we can
test its significance; i.e. is it meaningfully different from
independence?
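
One crude way to put a number on this, offered as a sketch rather than
the one true method: subtract p(phrase) from each indicator, scale by
the Bernoulli standard deviation, and average the products. This is a
stand-in for a proper partial correlation, not the textbook formula,
and the function name is my own. The independence test assumes each
author would otherwise include a phrase with probability p(phrase),
independently of the other author and of other phrases.

```python
import math


def residual_correlation(rows):
    """rows: iterable of (p, i_a, i_b), with p = assumed P(phrase).

    Returns (r, z). r is the mean product of standardized residuals;
    it is not bounded by [-1, 1]. Under the independence assumption
    each product has mean 0 and variance 1, so z = r * sqrt(n) is
    roughly standard normal when n is large.
    """
    products = []
    for p, i_a, i_b in rows:
        if not 0.0 < p < 1.0:
            continue  # p of exactly 0 or 1 cannot be standardized
        sd = math.sqrt(p * (1.0 - p))
        products.append(((i_a - p) / sd) * ((i_b - p) / sd))
    n = len(products)
    if n == 0:
        raise ValueError("no phrases with 0 < p < 1")
    r = sum(products) / n
    return r, r * math.sqrt(n)


# Made-up rows: (p, i_A, i_B)
rows = [(0.99, 1, 1), (0.01, 1, 1), (0.01, 0, 0), (0.50, 1, 0)]
r, z = residual_correlation(rows)
print(r, z)
```

A shared rare phrase (p = 0.01, both indicators 1) contributes a
product of about 99, while a shared common phrase (p = 0.99)
contributes about 0.01, which is the weighting described above.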

We can also measure the partial correlation when we consider only
those phrases for which i_A_phrase = 1. The former case allowed that B
might choose not to include a phrase when A didn't. This case is far
simpler - B only decided to include phrases when A did.

--- Alternatively ---

To answer your question strictly, though, we'll need to make a model
of the form

H_0(i_B_phrase) = p_0_phrase * i_A_phrase + (1 - p_0_phrase) *
p(phrase)

which says that, with probability p_0, author B directly copies what
author A did. With probability 1-p_0, author B randomly chooses
whether or not to include the phrase, given that prior distribution on
phrases we discussed. Then the probability you're seeking is product(1
- p_0_phrase) where the product is taken over all phrases, since this
gives you the probability that B never borrowed from A.
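
Written out as code, just to make the two formulas concrete (the
function names are mine, and the p_0 values in the example are
invented, not estimated from anything):

```python
import math


def prob_b_includes(p0, i_a, p):
    """The model above: p_0_phrase * i_A_phrase + (1 - p_0_phrase) * p(phrase)."""
    return p0 * i_a + (1.0 - p0) * p


def prob_never_borrowed(p0_by_phrase):
    """product(1 - p_0_phrase), taken over all phrases."""
    return math.prod(1.0 - p0 for p0 in p0_by_phrase.values())


# Invented per-phrase copying probabilities, for illustration only.
p0_by_phrase = {"aardvark": 0.05, "integral equations": 0.10, "zebra": 0.20}
print(prob_b_includes(0.20, 1, 0.01))  # A indexed the phrase, P(phrase) = 0.01
print(prob_never_borrowed(p0_by_phrase))
```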

This isn't satisfactory, since we need to estimate so many of the
p_0_phrases and we'll end up with a nearly 0 probability that B never
borrowed. We can fix this by considering only the cases where
p(phrase) is low and i_A_phrase = 1. We can further simplify by
assuming that p_0_phrase = p_0 for all phrases, so that B has a
universal tendency to plagiarize. We can estimate this fairly easily
and it will stand in as a proxy for evidence of B's actually having
plagiarized.
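
Here is one way the single p_0 could be estimated, as a sketch: a
maximum-likelihood grid search over the phrases where i_A_phrase = 1
and p(phrase) is low. The grid search, the function names, and the
sample rows are all my own choices; any one-dimensional optimizer
would do the same job.

```python
import math


def log_likelihood(p0, rows):
    """rows: (p, i_b) pairs for phrases with i_A_phrase = 1 and low p."""
    total = 0.0
    for p, i_b in rows:
        q = p0 + (1.0 - p0) * p  # P(i_B_phrase = 1) when i_A_phrase = 1
        q = min(max(q, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total += math.log(q) if i_b else math.log(1.0 - q)
    return total


def estimate_p0(rows, steps=1000):
    """Return the grid value of p_0 in [0, 1] with the highest likelihood."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda p0: log_likelihood(p0, rows))


# Made-up data: ten rare phrases A indexed, of which B indexed three.
rows = [(0.01, 1)] * 3 + [(0.01, 0)] * 7
print(estimate_p0(rows))
```

An estimate near 0 says B's choices look like the prior distribution
alone; an estimate well above 0 says B follows A on rare phrases more
often than chance would explain.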

--- Finally ---

There are really sophisticated ways out there of doing this. These are
just two very simple models which should be taken with a grain of salt
- any actual application will need access to large databases of
similar texts and will have enough resources that it won't be relying
on something as duct-tape'd as these.