Re: Estimating the mean of the cumulative hypergeometric?



On Jan 13, 7:43 am, petertwocakes <petertwoca...@xxxxxxxxxxxxxx>
wrote:
This is my routine for calculating the cumulative _Binomial_
distribution, with relacement, given n and p:

double q = 1-p, c = pow(q,n); // i.e. when k=0
double cp = c; // cumulative probability
for(double k=1; k<=n; k++)
{
c = ((n-(k-1))/k)*c*(p/q);
cp += c;

}

It's fast, and (for me) clear to understand, and as it's iterative,
the cost of calculating the cumulative probability is little more than
for calculating a specific probability.
So... I REALLY thought I would be able to modify this to work with no-
replacement, after all the only difference is that p&q are changing as
a result of N and m being reduced.
(m = number successes in N; k = number successes in sample(n))
Surely, in the loop above, if I could recalculate p&q by how N and m
are changing, then this would give the same result as the
hypergeometric??
But after much trying, I can't get it to work. I've trawled the
internet too, and there seems to be no shortcut other than using
h(N, m,n,k) = C(m,k)*C(N-m, n-k)/C(N,n)
and summing over 0->k.
I have found some sieving optimisations, but it's basically the same
method.
(sorry, that was all more of a rant than a question ;-) )

If only there was a reliable way to start anywhere near the middle of
the distribution; working backward or forward from this point would
mean I could skip the lower/upper ends where the value soon becomes
almost zero.

For the hypergeometric distribution with N = whole population, m =
"marked" objects (like the blondes before), n = sample size (like M
before) and k = number of marked in sample, the probabilities p(k)
obey the recursion: p(k+1) = r(k)*p(k), where r(k) = [(m-k)/(k+1)]*
[(n-k)/(N-m-n+k+1)]. So, you can iterate upward using p(k+1) = r(k)*p
(k) for k >= k0 and downward using p(k-1) = p(k)/r(k-1) for k <= k0,
starting as though p(k0) = 1. Stop when the resulting p(k) are
negligible, let S = sum(p), then re-scale as p(k) <-- p(k)/S. If n is
only a few hundred to a few thousand, that should be quick and
accurate.


One last real world example:

N = total number of people that have rented DVDs = 400,000
m = num people have rented a specific movie 'A': 0 < m 200,000
n = num people have rented a specific movie 'B': 0 < n < m
k = num people that have rented both
(As there is symmetry, it's always quicker to make m the larger of A
or B, and n the smaller of A and B)

What is the probability that k people have rented both?

This question cannot be answered unless you specify a probability
model. For example, there might be some weird reason that the only
people who rent B are left-handed female redheads older than 40 years
who rented A, or right-handed males under 30 who did not rent A. Then,
to answer the question we would need to know the composition of the
people who rented A and the composition of those who did not rent A.
OK, this is an artificial illustration, but it makes a valid point:
you need to specify a probability model.

Then: h(N, m,n,k) = C(NumRentedA,k)*C(NumHaveNotRentedA, NumRentedA-k)/
C(N, NumRentedB)

Shouldn't this be C(NumRentedA,k)*C(NumHaveNotRentedA, NumRentedB-k)/
C(N, NumRentedB)? Your use of the hypergeometric ASSUMES that the
renters of B are a random sample from the whole population. Why should
that be the case?


What is the probability that B and A are highly correlated in some
unknown way?

Surely that must depend on the probability model used. For example, if
I specify that P{rent A|rent B} = 0.5, then on average one half of the
renters of B will see both. However, I guess you are assuming a
hypergeometric model, and in that case the number renting both is the
number of A's in a random sample of NumRentedB people drawn from a
population of size N and in which NumRentedA have rented A.

e.g. if A = 20,000, B = 5,000, the probability that >=294 have seen
both is so small as to make it likely that there is an underlying
reason: cumulativeP = 1 - SUM(k=0, k=310)[ C(20000,k)*C(380000, 20000-
k)/C(N, 5000) ] = .0005

Where on Earth does the number '310' come from?


The median & mean are approx

Here you say "approx" the same, then below you say they are exactly
equal. You can't have both!

the same in this case: when k = median =
mean = 250, P = 0.5
When k = 208: P = .0005.

Intuitively, doesn't it feel astonishing that for a range of k=0 to
k=5000, it's 99.9% likely that k will be between 208 to 294!!

Unless, of course, this is all screwed!

Given that N is fixed, and N > m*2, and m > n can anyone spot any
optimisations?

Do you mean to ask whether the code can be improved? Improved compared
to what? What method have you used so far to do the computations? For
example, there are may ways to compute binomial coefficients; which
one (if any) did you use? Anyway, I have already suggested a method of
starting in the middle and iterating up and down, then re-scaling, and
I would bet that in any moderate-to-large problem it beats most direct
method hands down.

R.G. Vickson



I've been testing this with lots of values to find when the median is
almost equal to the mean. Can anyone suggest, from the limitations on
the numbers given, when the WORST case scenario might be?

Cheers

Steve

.



Relevant Pages

  • Re: Pigeons, People, and Priors
    ... the variance of the probability generator go to zero you have a continuum ... a random-interval 60 s schedule is not. ... The Exponential Distribution ... I probably should have used the phrase "statistical learning theory" rather ...
    (comp.ai.philosophy)
  • Re: So called "stimulus/response" models
    ... Instead of answering to each misunderstood, ironic and out of context ... Sorry, you exhibit a simplistic view of probability theory, and an even more ... of acquiring the consequences of responses. ... distribution over consequences of a given act. ...
    (comp.ai.philosophy)
  • Re: behavior as mapping
    ... estimating a probability distribution, the distribution ... sequence with equal probability - since you have microsecond temporal ... reduction of the entropy Pto the entropy P ... If there were 4 genes we would need 2 bits of binding site info. ...
    (comp.ai.philosophy)
  • Re: Bill Reid, Kelly Criterion
    ... about logs; if a person is talking about a percentage change in the ... probability of going broke the more they trade. ... adjustment (which is the one which allows any distribution which is ...
    (misc.invest.stocks)
  • Re: Hardy-weinberg Equilibrium
    ... Mating is random. ... while panmixis means equal probability of any ... But suppose we assumed a normal distribution? ... Are you claiming that statistical randomness requires a uniform ...
    (talk.origins)

Quantcast