Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes <petertwocakes@xxxxxxxxxxxxxx>
- Date: Tue, 13 Jan 2009 07:43:05 -0800 (PST)
This is my routine for calculating the cumulative _Binomial_
distribution, with relacement, given n and p:
double q = 1-p, c = pow(q,n); // i.e. when k=0
double cp = c; // cumulative probability
for(double k=1; k<=n; k++)
{
c = ((n-(k-1))/k)*c*(p/q);
cp += c;
}
It's fast, and (for me) clear to understand, and as it's iterative,
the cost of calculating the cumulative probability is little more than
for calculating a specific probability.
So... I REALLY thought I would be able to modify this to work with no-
replacement, after all the only difference is that p&q are changing as
a result of N and m being reduced.
(m = number successes in N; k = number successes in sample(n))
Surely, in the loop above, if I could recalculate p&q by how N and m
are changing, then this would give the same result as the
hypergeometric??
But after much trying, I can't get it to work. I've trawled the
internet too, and there seems to be no shortcut other than using
h(N, m,n,k) = C(m,k)*C(N-m, n-k)/C(N,n)
and summing over 0->k.
I have found some sieving optimisations, but it's basically the same
method.
(sorry, that was all more of a rant than a question ;-) )
If only there was a reliable way to start anywhere near the middle of
the distribution; working backward or forward from this point would
mean I could skip the lower/upper ends where the value soon becomes
almost zero.
One last real world example:
N = total number of people that have rented DVDs = 400,000
m = num people have rented a specific movie 'A': 0 < m 200,000
n = num people have rented a specific movie 'B': 0 < n < m
k = num people that have rented both
(As there is symmetry, it's always quicker to make m the larger of A
or B, and n the smaller of A and B)
What is the probability that k people have rented both?
Then: h(N, m,n,k) = C(NumRentedA,k)*C(NumHaveNotRentedA, NumRentedA-k)/
C(N, NumRentedB)
What is the probability that B and A are highly correlated in some
unknown way?
e.g. if A = 20,000, B = 5,000, the probability that >=294 have seen
both is so small as to make it likely that there is an underlying
reason: cumulativeP = 1 - SUM(k=0, k=310)[ C(20000,k)*C(380000, 20000-
k)/C(N, 5000) ] = .0005
The median & mean are approx the same in this case: when k = median =
mean = 250, P = 0.5
When k = 208: P = .0005.
Intuitively, doesn't it feel astonishing that for a range of k=0 to
k=5000, it's 99.9% likely that k will be between 208 to 294!!
Unless, of course, this is all screwed!
Given that N is fixed, and N > m*2, and m > n can anyone spot any
optimisations?
I've been testing this with lots of values to find when the median is
almost equal to the mean. Can anyone suggest, from the limitations on
the numbers given, when the WORST case scenario might be?
Cheers
Steve
.
- Follow-Ups:
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Matt
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Ray Vickson
- Re: Estimating the mean of the cumulative hypergeometric?
- References:
- Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Matt
- Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Estimating the mean of the cumulative hypergeometric?
- Prev by Date: A statement about Rsa2048.
- Next by Date: Help needed on combinatorial interpretation
- Previous by thread: Re: Estimating the mean of the cumulative hypergeometric?
- Next by thread: Re: Estimating the mean of the cumulative hypergeometric?
- Index(es):
Relevant Pages
|