Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes <petertwocakes@xxxxxxxxxxxxxx>
- Date: Wed, 14 Jan 2009 23:56:09 -0800 (PST)
On 15 Jan, 03:47, Matt <matt271829-n...@xxxxxxxxxxx> wrote:
On Jan 14, 6:00 pm, petertwocakes <petertwoca...@xxxxxxxxxxxxxx>
wrote:
On 13 Jan, 20:19, Matt <matt271829-n...@xxxxxxxxxxx> wrote:
On Jan 13, 3:43 pm, petertwocakes <petertwoca...@xxxxxxxxxxxxxx>
wrote:
This is my routine for calculating the cumulative _Binomial_
distribution, with relacement, given n and p:
double q = 1-p, c = pow(q,n); // i.e. when k=0
double cp = c; // cumulative probability
for(double k=1; k<=n; k++)
{
c = ((n-(k-1))/k)*c*(p/q);
cp += c;
}
It's fast, and (for me) clear to understand, and as it's iterative,
the cost of calculating the cumulative probability is little more than
for calculating a specific probability.
So... I REALLY thought I would be able to modify this to work with no-
replacement, after all the only difference is that p&q are changing as
a result of N and m being reduced.
(m = number successes in N; k = number successes in sample(n))
Surely, in the loop above, if I could recalculate p&q by how N and m
are changing, then this would give the same result as the
hypergeometric??
I don't see that this is an efficient way to approach it. p and q are
changing, sure, but they're changing in a way that depends on the
history of the previous selections, and you don't want to be tracking
all those possibilities individually.
Your loop code for the hypergeometric should not be that different
from the binomial. You keep a running count of the product of the
binomial coefficients involving k: this is what I was alluding to
earlier. Each iteration involves multiplying by two ratios, something
like c = c*(m - k)/(k + 1)*(n - k)/(N - m - n + k + 1), depending on
exactly how you implement it. Of course, you can pre-calculate N - m -
n + 1. You do also have the overhead of initialising this multiplier
to (N - m)C(n) and calculating the denominator Ncn.
Thanks Matt, I've got your's and Ray's iterative method working now.
It's so fast compared to the brute force method of summing across all
ks that the overhead you mention isn't a problem. The only catch is
that you have to ensure that none of the early values of k from k0
onward are zero, as naturally the rest of the series will be locked at
zero when that happens. I solved this by doing a binary search for the
first non-zero value of k.
Not that it matters if you've got it working to your satisfaction, but
I don't follow what you mean here. I assume you don't actually mean
zero values "of" k (since there's only one). If you mean out-of-range
values of k then there's no need to do a binary search; you can
calculate the valid range easily using max and min. If you mean
vanishingly small probabilities for certain values of k then I don't
see how a binary search can profitably be used when you are
calculating the probabilities recursively. Or maybe in the binary
search phase you aren't calculating them recursively? Is that really
worth doing? Doesn't seem so to me... but I could be be missing
something.
"If you mean
vanishingly small probabilities for certain values of k"
Yes.
The essence of the forward loop (stripping away all ckecks pre-calcs
and initialisation for clarity) is:
pk = MyHypergeometricFunction(N,m,n,0);
(...directly calculate the initial hypergeaometric value for when k=0)
cumulative_pk = pk;
rk = m * (n/(N-m-n+0+1); // initial rk;
start = 0;
for(k=start; k<=maxK; k++)
{
pk = rk*pk;
rk = ((m-K)/(K+1)) * ((n-K)/(N-m-n+K+1));
cumulative_pk += pk;
}
What I meant is that if the first value of pk I calculate (when k=0)
is zero, then because in the loop pk = rk*pk, pk will remain at zero.
With the size of numbers I'm working with where n and k are often in
the thousands; using double precision, if I directly calculate pk
(using the hypergeometric formula) it will often return zero for
k=500+.
So the problem becomes: How do I find the first value of k ("start")
that would return a non-zero result?
Well, knowing that it will be somewhere between zero and the mean,
using binary search it typically only takes 4 or 5 direct
hypergeometric calculations to find the smallest value of k that gives
a non-zero result.
(I then have to calculate the first multiplier and set the
cumulative_pk)
Yes, those 4 or 5 are a nuisance (grateful for any suggestions!), but
it's nothing compared to doing it thouasands of times (when n is
large) using the brute-force method.
It's definitely working because I'm cross checking the results against
the brute-force method, and the http://stattrek.com/Tables/Hypergeometric.aspx
calculator. But maybe it could be more efficient still.
You do also have the overhead of initialising this multiplier
to (N - m)C(n) and calculating the denominator Ncn
As you can see, I'm explicitly calculating the hypergeometric, but it
comes to the same thing when k=0 doesn't it? ( MyHypergeometricFunction
() is also very fast because it's calculating the binomials
recursively, but not fast enough that I want to do thousands of times)
you can calculate the valid range easily using max and min.
I don't understand this at all, could you elaborate?
(I'm worried this is key to the initialization bottleneck)
Thanks
Steve
.
- Follow-Ups:
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Matt
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Ray Vickson
- Re: Estimating the mean of the cumulative hypergeometric?
- References:
- Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Matt
- Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Matt
- Re: Estimating the mean of the cumulative hypergeometric?
- From: petertwocakes
- Re: Estimating the mean of the cumulative hypergeometric?
- From: Matt
- Estimating the mean of the cumulative hypergeometric?
- Prev by Date: Re: prime numbers formula please.
- Next by Date: Re: -- f(k) = k^(n-1) (mod n)
- Previous by thread: Re: Estimating the mean of the cumulative hypergeometric?
- Next by thread: Re: Estimating the mean of the cumulative hypergeometric?
- Index(es):
Relevant Pages
|
Loading