Re: Estimating the mean of the cumulative hypergeometric?



This is my routine for calculating the cumulative _Binomial_
distribution, with relacement, given n and p:

double q = 1-p, c = pow(q,n); // i.e. when k=0
double cp = c; // cumulative probability
for(double k=1; k<=n; k++)
{
c = ((n-(k-1))/k)*c*(p/q);
cp += c;
}

It's fast, and (for me) clear to understand, and as it's iterative,
the cost of calculating the cumulative probability is little more than
for calculating a specific probability.
So... I REALLY thought I would be able to modify this to work with no-
replacement, after all the only difference is that p&q are changing as
a result of N and m being reduced.
(m = number successes in N; k = number successes in sample(n))
Surely, in the loop above, if I could recalculate p&q by how N and m
are changing, then this would give the same result as the
hypergeometric??
But after much trying, I can't get it to work. I've trawled the
internet too, and there seems to be no shortcut other than using
h(N, m,n,k) = C(m,k)*C(N-m, n-k)/C(N,n)
and summing over 0->k.
I have found some sieving optimisations, but it's basically the same
method.
(sorry, that was all more of a rant than a question ;-) )

If only there was a reliable way to start anywhere near the middle of
the distribution; working backward or forward from this point would
mean I could skip the lower/upper ends where the value soon becomes
almost zero.

One last real world example:

N = total number of people that have rented DVDs = 400,000
m = num people have rented a specific movie 'A': 0 < m 200,000
n = num people have rented a specific movie 'B': 0 < n < m
k = num people that have rented both
(As there is symmetry, it's always quicker to make m the larger of A
or B, and n the smaller of A and B)

What is the probability that k people have rented both?
Then: h(N, m,n,k) = C(NumRentedA,k)*C(NumHaveNotRentedA, NumRentedA-k)/
C(N, NumRentedB)

What is the probability that B and A are highly correlated in some
unknown way?
e.g. if A = 20,000, B = 5,000, the probability that >=294 have seen
both is so small as to make it likely that there is an underlying
reason: cumulativeP = 1 - SUM(k=0, k=310)[ C(20000,k)*C(380000, 20000-
k)/C(N, 5000) ] = .0005

The median & mean are approx the same in this case: when k = median =
mean = 250, P = 0.5
When k = 208: P = .0005.

Intuitively, doesn't it feel astonishing that for a range of k=0 to
k=5000, it's 99.9% likely that k will be between 208 to 294!!

Unless, of course, this is all screwed!


Given that N is fixed, and N > m*2, and m > n can anyone spot any
optimisations?

I've been testing this with lots of values to find when the median is
almost equal to the mean. Can anyone suggest, from the limitations on
the numbers given, when the WORST case scenario might be?

Cheers

Steve




.



Relevant Pages

  • Re: Estimating the mean of the cumulative hypergeometric?
    ... for calculating a specific probability. ... zero when that happens. ... Hypergeometric distribution. ...
    (sci.math)
  • Re: distribution of the minimum value of a random walk
    ... time t' is dependent on the distribution at time t'-1, ... Given the description of the problem, the distribution N ... And this probability accounts for the ... + 2, then we know that xwas greater than c, as we're calculating ...
    (sci.stat.math)
  • Re: distribution of the minimum value of a random walk
    ... time t' is dependent on the distribution at time t'-1, ... Given the description of the problem, the distribution N ... And this probability accounts for the ... + 2, then we know that xwas greater than c, as we're calculating ...
    (sci.stat.math)
  • Re: Shannons information theory
    ... > I have a question about calculating the entropy of an integer ... > Therefore my mind tells me: no entropy. ... What actually matters is the probability for each possible ... distribution of values will not enable you to make an accurate ...
    (comp.theory)
  • Re: Pigeons, People, and Priors
    ... the variance of the probability generator go to zero you have a continuum ... a random-interval 60 s schedule is not. ... The Exponential Distribution ... I probably should have used the phrase "statistical learning theory" rather ...
    (comp.ai.philosophy)