Re: Need Help Determining the "True" Mean of a Sample
From: Richard Ulrich (Rich.Ulrich_at_comcast.net)
Date: 07/24/04
- Next message: Osher Doctorow: "Bayesian Predictive Median Model Selection Explained by PI"
- Previous message: Ian Jermyn: "Re: Conditional dist' of a Gaussian dist' with exponentially dist' variance"
- In reply to: Mafro: "Need Help Determining the "True" Mean of a Sample"
- Next in thread: Mafro: "Re: Need Help Determining the "True" Mean of a Sample"
- Reply: Mafro: "Re: Need Help Determining the "True" Mean of a Sample"
Date: Sat, 24 Jul 2004 11:27:38 -0400
On 23 Jul 2004 20:26:07 -0700, mafro@excite.com (Mafro) wrote:
> All,
>
> I'm a software engineer, not a statistician, so please forgive my
> ignorance.
>
> I would like to be able to determine the range of the true mean for
> several data samples with a 95% level of certainty. I know the size of
> each sample (usually between 10 and 1000 data points), the mean, and
> the standard deviation.
>
> The distribution for these samples is such that about 75% of the
> values are 0, about 20% are between 0 and 1, and the rest, about 5%,
> are much higher (usually between 8 and 12). Thus, there are two peaks
> - the highest one at 0 that slopes quickly down until it builds again
> to a much smaller one that usually peaks around 10. I'm not sure what
> kind of distribution this represents, nor how this distribution
> effects the formula needed to determine the true mean.
>
> I'm hoping someone more knowledgeable than I can suggest a formula
> with the sample size, the mean, and the standard deviation as
> variables, or at least point me in the direction of a reference source
It is hardly worth considering anything more unless you can put
a stricter definition on that "5% ... (usually between 8 and 12)."
What is the generating mechanism? Does some mechanism
exist which keeps the 5% near to 5%? Or, could there be
autocorrelation, or something such as 'cycles' which implies
that the fraction could be much higher than 5% for a time?
Consider 100 data points, and the *total*, which is easier
to talk about than the mean but essentially the same problem.
(It is also very often the statistic of interest, explicitly, when
considering sets of numbers whose averages, separately,
are very heterogeneous.)
If those "about 5%" occur as Poisson, then the count, in 100,
is between 1 and 10 for the 95% CI: In other words, they
contribute somewhere between 8 and 120 points to the total.
That implies the CI for the 5% is, by itself, as broad as (8,120).
Is "about 5%" closer to exactly-5 than that, or is the tail even
longer than Poisson, which I just described? What is the shape
of distribution between 8 and 12, to narrow down whether the
maximum of *this* 95% CI deserves a contribution of 80 points,
or 120 points?
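
As a rough check of the numbers above, here is a minimal Python sketch. It assumes the count of high values in 100 points is Poisson with mean 5, and that each high value lies between 8 and 12. The function names are made up for illustration.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), summed term by term."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i)
               for i in range(k + 1))

def poisson_central_interval(mean, level=0.95):
    """Smallest counts at which the CDF reaches each tail cutoff."""
    tail = (1.0 - level) / 2.0
    lo = 0
    while poisson_cdf(lo, mean) < tail:
        lo += 1
    hi = lo
    while poisson_cdf(hi, mean) < 1.0 - tail:
        hi += 1
    return lo, hi

# Assumption: 5% of 100 points, so the expected count is 5.
lo, hi = poisson_central_interval(5.0)

# Assumption: each high value is worth at least 8 and at most 12 points.
total_low = lo * 8
total_high = hi * 12

print(lo, hi)                 # count limits
print(total_low, total_high)  # contribution to the total
```

This gives counts of 1 and 10, and so a contribution of 8 to 120 points, the same range as stated above.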
Similarly, to much less effect, the "about 20%" contributes
between 0 and 20 points to the total. If the "about 20%" is
fairly constant, you might be satisfied with modeling this as
a fixed "10 points" or whatever the observed average is --
since the "about 5%" potentially contributes so much more
heavily to the eventual outcome. That is seen directly if
you consider that it is the squares of the SDs - the variances -
that add up to give the variance, and hence the SD, of the total.
The SD for "about 5%" is about 25 points if that is Poisson, and it
is (at most) one fifth of that, 5, for "about 20%", so the latter
contributes only 4% as much to the total variance.
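
The variance arithmetic can be sketched in a few lines of Python. The two SDs are the rough figures quoted above, and the sketch assumes the two components are independent, so that their variances simply add. The variable names are illustrative only.

```python
import math

# Rough SDs from the discussion above (assumed, not measured).
sd_high = 25.0  # the "about 5%" component, if Poisson
sd_mid = 5.0    # the "about 20%" component, at most

# Variances add for independent components; SDs do not.
var_high = sd_high ** 2
var_mid = sd_mid ** 2

# Variance of the smaller component relative to the larger one.
ratio = var_mid / var_high

# SD of the total, and how little the smaller component moves it.
sd_total = math.sqrt(var_high + var_mid)

print(ratio)               # 0.04
print(round(sd_total, 2))  # 25.5
```

Including the "about 20%" component raises the SD of the total only from 25 to about 25.5, which is why treating it as a fixed amount loses little.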
> that I can use to determine the formula myself. I intend to use such a
> formula as a basis for writing a function in T-SQL that will take
> those variables and return the high and low ends of the confidence
> interval.
>
> Thanks in advance for any insights you can offer!
What I gave is not the complete, formal analysis. If there
is any time-series effect, or other dependence, a model would
have to account for that. But if you have to account for the
5% as random and Poisson (or multinomial), that will be the
dominant factor for the variance, to get the CI which will
be asymmetrical for small samples, since the minimum mean
is zero. Oh, the "about 20%" could play a role in stating more
precision for CIs where the lower limit is less than 8 -- instead
of dropping immediately to zero when the N is under 58.
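
One way to arrive at a figure near 58 is sketched below. It assumes each point is independently a high value with probability 0.05, and uses a one-sided 5% cutoff. That reading is an assumption, and the function name is made up.

```python
def largest_n_with_likely_zero(p=0.05, alpha=0.05):
    """Largest N for which P(no high values in N points) >= alpha.

    Assumes independent points, each high with probability p, so the
    chance of seeing none in N points is (1 - p) ** N.
    """
    n = 0
    while (1.0 - p) ** (n + 1) >= alpha:
        n += 1
    return n

n_max = largest_n_with_likely_zero()
print(n_max)                # 58
print(0.95 ** n_max)        # just above 0.05
print(0.95 ** (n_max + 1))  # just below 0.05
```

Up to that sample size, a sample with no high values at all is still within the 95% range, so a CI built only from the "about 5%" component has a lower limit of zero.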
Hope this helps.
-- Rich Ulrich, wpilib@pitt.edu http://www.pitt.edu/~wpilib/index.html