Re: Estimating the mean of the cumulative hypergeometric?
- From: Ray Vickson <RGVickson@xxxxxxx>
- Date: Tue, 13 Jan 2009 11:15:46 -0800 (PST)
On Jan 13, 7:43 am, petertwocakes <petertwoca...@xxxxxxxxxxxxxx>
wrote:
This is my routine for calculating the cumulative _Binomial_
distribution, with relacement, given n and p:
double q = 1-p, c = pow(q,n); // i.e. when k=0
double cp = c; // cumulative probability
for(double k=1; k<=n; k++)
{
c = ((n-(k-1))/k)*c*(p/q);
cp += c;
}
It's fast, and (for me) clear to understand, and as it's iterative,
the cost of calculating the cumulative probability is little more than
for calculating a specific probability.
So... I REALLY thought I would be able to modify this to work with no-
replacement, after all the only difference is that p&q are changing as
a result of N and m being reduced.
(m = number successes in N; k = number successes in sample(n))
Surely, in the loop above, if I could recalculate p&q by how N and m
are changing, then this would give the same result as the
hypergeometric??
But after much trying, I can't get it to work. I've trawled the
internet too, and there seems to be no shortcut other than using
h(N, m,n,k) = C(m,k)*C(N-m, n-k)/C(N,n)
and summing over 0->k.
I have found some sieving optimisations, but it's basically the same
method.
(sorry, that was all more of a rant than a question ;-) )
If only there was a reliable way to start anywhere near the middle of
the distribution; working backward or forward from this point would
mean I could skip the lower/upper ends where the value soon becomes
almost zero.
For the hypergeometric distribution with N = whole population, m =
"marked" objects (like the blondes before), n = sample size (like M
before) and k = number of marked in sample, the probabilities p(k)
obey the recursion: p(k+1) = r(k)*p(k), where r(k) = [(m-k)/(k+1)]*
[(n-k)/(N-m-n+k+1)]. So, you can iterate upward using p(k+1) = r(k)*p
(k) for k >= k0 and downward using p(k-1) = p(k)/r(k-1) for k <= k0,
starting as though p(k0) = 1. Stop when the resulting p(k) are
negligible, let S = sum(p), then re-scale as p(k) <-- p(k)/S. If n is
only a few hundred to a few thousand, that should be quick and
accurate.
One last real world example:
N = total number of people that have rented DVDs = 400,000
m = num people have rented a specific movie 'A': 0 < m 200,000
n = num people have rented a specific movie 'B': 0 < n < m
k = num people that have rented both
(As there is symmetry, it's always quicker to make m the larger of A
or B, and n the smaller of A and B)
What is the probability that k people have rented both?
This question cannot be answered unless you specify a probability
model. For example, there might be some weird reason that the only
people who rent B are left-handed female redheads older than 40 years
who rented A, or right-handed males under 30 who did not rent A. Then,
to answer the question we would need to know the composition of the
people who rented A and the composition of those who did not rent A.
OK, this is an artificial illustration, but it makes a valid point:
you need to specify a probability model.
Then: h(N, m,n,k) = C(NumRentedA,k)*C(NumHaveNotRentedA, NumRentedA-k)/
C(N, NumRentedB)
Shouldn't this be C(NumRentedA,k)*C(NumHaveNotRentedA, NumRentedB-k)/
C(N, NumRentedB)? Your use of the hypergeometric ASSUMES that the
renters of B are a random sample from the whole population. Why should
that be the case?
What is the probability that B and A are highly correlated in some
unknown way?
Surely that must depend on the probability model used. For example, if
I specify that P{rent A|rent B} = 0.5, then on average one half of the
renters of B will see both. However, I guess you are assuming a
hypergeometric model, and in that case the number renting both is the
number of A's in a random sample of NumRentedB people drawn from a
population of size N and in which NumRentedA have rented A.
e.g. if A = 20,000, B = 5,000, the probability that >=294 have seen
both is so small as to make it likely that there is an underlying
reason: cumulativeP = 1 - SUM(k=0, k=310)[ C(20000,k)*C(380000, 20000-
k)/C(N, 5000) ] = .0005
Where on Earth does the number '310' come from?
The median & mean are approx
Here you say "approx" the same, then below you say they are exactly
equal. You can't have both!
the same in this case: when k = median =
mean = 250, P = 0.5
When k = 208: P = .0005.
Intuitively, doesn't it feel astonishing that for a range of k=0 to
k=5000, it's 99.9% likely that k will be between 208 to 294!!
Unless, of course, this is all screwed!
Given that N is fixed, and N > m*2, and m > n can anyone spot any
optimisations?
Do you mean to ask whether the code can be improved? Improved compared
to what? What method have you used so far to do the computations? For
example, there are may ways to compute binomial coefficients; which
one (if any) did you use? Anyway, I have already suggested a method of
starting in the middle and iterating up and down, then re-scaling, and
I would bet that in any moderate-to-large problem it beats most direct
method hands down.
R.G. Vickson
I've been testing this with lots of values to find when the median is
almost equal to the mean. Can anyone suggest, from the limitations on
the numbers given, when the WORST case scenario might be?
Cheers
Steve
.
- Follow-Ups:
- Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Re: Estimating the mean of the cumulative hypergeometric?
- References:
- Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Matt
- Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Estimating the mean of the cumulative hypergeometric?
- Prev by Date: Re: JSH "problem"
- Next by Date: Radius of largest n+1 balls in n-dimensional unit cube?
- Previous by thread: Re: Estimating the mean of the cumulative hypergeometric?
- Next by thread: Re: Estimating the mean of the cumulative hypergeometric?
- Index(es):
Relevant Pages
|