Re: Need Help Determining the "True" Mean of a Sample
From: Mafro (mafro_at_excite.com)
Date: 07/26/04
- Next message: Pii: "Comparing fits"
- Previous message: Richard Ulrich: "Re: Derivation of AIC"
- In reply to: Richard Ulrich: "Re: Need Help Determining the "True" Mean of a Sample"
- Next in thread: Richard Ulrich: "Re: Need Help Determining the "True" Mean of a Sample"
- Reply: Richard Ulrich: "Re: Need Help Determining the "True" Mean of a Sample"
Date: 26 Jul 2004 10:54:06 -0700
Richard,
Thanks very much for your detailed and insightful response!
To answer your questions, the model represents a user visiting a given
web page that sells products and the values represent the gross
revenue that each user session generates. The model attempts to
determine the "true" value of user sessions on a given web page. As to
the mechanisms, when a user visits a page one of three specific events
takes place:
1) The user simply leaves (about 75% of the time, which generates
$0.00).
2) The user leaves by clicking an advertisement (about 20% of the
time, which usually generates between $0.05 and $0.50, with a mean of
$0.25).
3) The user purchases a product (about 5% of the time, which usually
generates between $8.00 and $12.00, with a mean of $10.00, depending
on the value of the product).
All three of these mechanisms are variable. If a page has a
particularly good deal, the percentage of people that might purchase
the product might be higher than 5%. At the same time, if there are
few or irrelevant advertisements on a given page, the percentage of
people leaving that page by clicking an ad might be less than
20%. So, the exact size and location of the peaks that result from
these various mechanisms are different for each data set, but their
existence is pretty ubiquitous.
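To make the three mechanisms concrete, here is a minimal simulation sketch in Python. The function name simulate_sessions and the choice of uniform draws within each range are assumptions made for illustration only; the real shapes within the ad and purchase ranges are not known (a uniform draw on $0.05 to $0.50 averages about $0.275, not the $0.25 quoted above).

```python
import random

def simulate_sessions(n, p_ad=0.20, p_buy=0.05, seed=None):
    """Return n simulated per-session revenues in dollars.

    Assumption: revenue is uniform within each range; the true
    within-range shape is not known.
    """
    rng = random.Random(seed)
    revenues = []
    for _ in range(n):
        u = rng.random()
        if u < p_buy:
            revenues.append(rng.uniform(8.00, 12.00))  # purchase
        elif u < p_buy + p_ad:
            revenues.append(rng.uniform(0.05, 0.50))   # ad click
        else:
            revenues.append(0.0)                       # user simply leaves
    return revenues

sessions = simulate_sessions(10_000, seed=1)
print(sum(sessions) / len(sessions))  # mean revenue per simulated session
```

Changing p_ad and p_buy per page mimics the page-to-page variation described above.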
In short, it is easy to determine the mean revenue per user session
that each web page has generated in the past. However, I'd like to be
able to determine the confidence interval for each page, based on the
number of data points, so I can accurately give an upper and lower
estimate of the "true" value of user sessions on that page with a 95%
level of confidence. This range will then be used as a statistical
basis for estimating the value of future user sessions to that page.
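As a starting point, the textbook normal-approximation interval needs only the sample size, mean, and standard deviation. The sketch below is an illustration under stated assumptions, not a recommendation: the name mean_ci_normal is hypothetical, z = 1.96 is the usual 95% normal quantile, and clipping the lower bound at zero is my own choice since revenue cannot be negative. With data this skewed the approximation is likely to be poor for small samples.

```python
import math

def mean_ci_normal(n, mean, sd, z=1.96):
    """Approximate 95% CI for the true mean from summary statistics.

    Uses mean +/- z * sd / sqrt(n). Assumption: the sampling
    distribution of the mean is close to normal, which is doubtful
    for small n with heavily skewed data.
    """
    if n < 2:
        raise ValueError("need at least two sessions")
    half_width = z * sd / math.sqrt(n)
    # Revenue cannot be negative, so clip the lower bound at zero.
    return max(0.0, mean - half_width), mean + half_width

low, high = mean_ci_normal(100, 0.55, 2.2)
print(low, high)
```

With 100 sessions, a mean of 0.55 and an SD of 2.2 (illustrative values), this gives roughly 0.12 to 0.98. The same arithmetic should translate directly into a T-SQL function.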
Now, some of these web pages have had very few users visit them -
between 10 and 100 user sessions. Intuitively I know that with such a
small data set the confidence interval is going to be extremely broad
and any estimates of future activity very inaccurate.
Other pages have had thousands - perhaps even tens of thousands - of
user sessions. Again, intuitively I know that with these larger data
sets the confidence interval is going to be smaller and therefore I
can offer a base prediction of the value of future user sessions that
will be much more accurate.
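For the small pages, one alternative that does not lean on normality is a percentile bootstrap over the raw session values. This is a sketch of a technique I am swapping in for illustration, not something established in this thread: the function name bootstrap_mean_ci and the default of 2000 resamples are assumptions, and it needs the individual session revenues rather than just the summary statistics. Note that with very few sessions a sample may contain no purchases at all, in which case no resampling scheme can see them.

```python
import random

def bootstrap_mean_ci(values, n_resamples=2000, level=0.95, seed=None):
    """Percentile-bootstrap confidence interval for the mean.

    Resamples the observed sessions with replacement and returns the
    central `level` range of the resampled means.
    """
    if not values:
        raise ValueError("need at least one session")
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Illustrative sample of 100 sessions: 75 zeros, 20 ad clicks, 5 purchases.
sample = [0.0] * 75 + [0.25] * 20 + [10.0] * 5
print(bootstrap_mean_ci(sample, seed=1))
```

The interval comes out asymmetric around the sample mean, which is what one would expect from a distribution with a floor at zero and a long right tail.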
My hope is to find a formula (or formulae) that can be applied to
thousands of web pages and their varying data sets that I have sitting
in a database. Thanks again for any further insights you can offer....
Matthew
Richard Ulrich <Rich.Ulrich@comcast.net> wrote in message news:<ldt4g0d60ngej5iabejvhkn20htl5dc6ik@4ax.com>...
> On 23 Jul 2004 20:26:07 -0700, mafro@excite.com (Mafro) wrote:
>
> > All,
> >
> > I'm a software engineer, not a statistician, so please forgive my
> > ignorance.
> >
> > I would like to be able to determine the range of the true mean for
> > several data samples with a 95% level of certainty. I know the size of
> > each sample (usually between 10 and 1000 data points), the mean, and
> > the standard deviation.
> >
> > The distribution for these samples is such that about 75% of the
> > values are 0, about 20% are between 0 and 1, and the rest, about 5%,
> > are much higher (usually between 8 and 12). Thus, there are two peaks
> > - the highest one at 0 that slopes quickly down until it builds again
> > to a much smaller one that usually peaks around 10. I'm not sure what
> > kind of distribution this represents, nor how this distribution
> > affects the formula needed to determine the true mean.
> >
> > I'm hoping someone more knowledgeable than I can suggest a formula
> > with the sample size, the mean, and the standard deviation as
> > variables, or at least point me in the direction of a reference source
>
> It is hardly worth considering anything more unless you can put
> a stricter definition on that "5% ... (usually between 8 and 12)."
> What is the generating mechanism? Does some mechanism
> exist which keeps the 5% near to 5%? Or, could there be
> autocorrelation, or something such as 'cycles' which implies
> that the fraction could be much higher than 5% for a time?
>
> Consider 100 data points, and the *total*, which is easier
> to talk about than the mean but essentially the same problem.
> (It is also very often the statistic of interest, explicitly, when
> considering sets of numbers whose averages, separately,
> are very heterogeneous.)
>
> If those "about 5%" occur as Poisson, then the count, in 100,
> is between 1 and 10 for the 95% CI: In other words, they
> contribute somewhere between 8 and 120 points to the total.
> That implies the CI for the 5% is, by itself, as broad as (8,120).
>
> Is "about 5%" closer to exactly-5 than that, or is the tail even
> longer than Poisson, which I just described? What is the shape
> of distribution between 8 and 12, to narrow down whether the
> maximum of *this* 95% CI deserves a contribution of 80 points,
> or 120 points?
>
> Similarly, to much less effect, the "about 20%" contributes
> between 0 and 20 points to the total. If the "about 20%" is
> fairly constant, you might be satisfied with modeling this as
> a fixed "10 points" or whatever the observed average is --
> since the "about 5%" potentially contributes so much more
> heavily to the eventual outcome. That is seen directly if
> you considered that it is the square of the SDs - the variance -
> that is additive in creating the SD for the total. The SD for
> "about 5%" is about 25 points if that was Poisson, and it is (at
> most) one fifth of that, 5, for "about 20%", so the latter contributes
> only 4% as much of the total variance.
>
> > that I can use to determine the formula myself. I intend to use such a
> > formula as a basis for writing a function in T-SQL that will take
> > those variables and return the high and low ends of the confidence
> > interval.
> >
> > Thanks in advance for any insights you can offer!
>
> What I gave is not the complete, formal analysis. If there
> is any time-series effect, or other dependence, a model would
> have to account for that. But if you have to account for the
> 5% as random and Poisson (or multinomial), that will be the
> dominant factor for the variance, to get the CI which will
> be asymmetrical for small samples, since the minimum mean
> is zero. Oh, the "about 20%" could play a role in stating more
> precision for CIs where the lower limit is less than 8 -- instead
> of dropping immediately to zero when the N is under 58.
>
> Hope this helps.
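The arithmetic in the quoted reply can be checked with a few lines of Python. This is only a sketch: the helper names poisson_cdf and poisson_quantile are hypothetical, and the SD figures of 25 and 5 are taken as quoted rather than rederived. The last step is my guess at where the "N under 58" remark comes from and may not be the exact calculation intended.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu)."""
    return sum(math.exp(-mu) * mu ** i / math.factorial(i) for i in range(k + 1))

def poisson_quantile(p, mu):
    """Smallest k with P(X <= k) >= p."""
    k = 0
    while poisson_cdf(k, mu) < p:
        k += 1
    return k

mu = 5  # expected purchases in 100 sessions at "about 5%"
low = poisson_quantile(0.025, mu)
high = poisson_quantile(0.975, mu)
print(low, high)           # 1 10
print(low * 8, high * 12)  # 8 120

# Variances add, so compare squared SDs (25 and 5 as quoted in the reply).
print(5 ** 2 / 25 ** 2)    # 0.04

# Smallest n for which seeing no purchase at all has probability below 5%.
n = 1
while 0.95 ** n >= 0.05:
    n += 1
print(n)                   # 59
```

The count range of 1 to 10 and the total range of 8 to 120 match the reply, as does the 4% variance share. The threshold of 59 sessions is close to the 58 mentioned above.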