Re: QUERY: Sample proportion and prediction

From: Bob Ehrlich (bobehrlich_at_comcast.net)
Date: 08/17/04


Date: Tue, 17 Aug 2004 10:19:41 -0600

Sangdon Lee wrote:

> Dear All,
>
> I'm trying to develop a model in order to predict the number of
> necessary test vehicles (Y) based on the complexity (X) of a vehicle
> program (# of powertrain combinations, # of body styles, etc). One
> of the input attributes is newness of a vehicle program (annual,
> mid-cycle enhancement, major or completely new program).
>
> As expected, the collected data show that majority of programs are
> either annual or mid cycle enhancement programs. The major and
> completely new programs are about 10 % of data collected. I'm just
> wondering what kind of things I have to worry about because of the
> very small portion of major or new programs.
>
> By the way, I had applied PLS (partial least square). I think that
> SEM (structural equation modeling) is another good method but don't
> have in-depth knowledge though.
>
> My situation is analogous to medical data where, for example, the
> proportion of people who have cancer is very small compared to that of
> healthy people from collected data.
>
> Any suggestion would be appreciated.
>
> Sangdon Lee, Ph.D.,
> GM Tech. Center.

Dr. Lee:

A lot depends on the total number of observations. If the number is
very small (say <100), then you have to make a number of assumptions on
which to base your data analysis, and those assumptions determine the
statistical tool to be used. If your number of observations is very
large (say >5000), then a different set of analytical tools may be
appropriate. That is, a category may make up a small percentage of the
data but still be numerically large if the total number of observations
is large.
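
To put rough numbers on that last point, here is a small Python sketch. It assumes the roughly 10% share mentioned in the original question and uses the two illustrative totals above (100 and 5000). The function names are made up for this illustration. The 1/sqrt(n) figure is only the textbook standard error of a group mean, expressed in units of that group's standard deviation; it is not a claim about any particular model.

```python
import math

def minority_count(total_obs, minority_share=0.10):
    """Approximate number of observations in the small category.

    minority_share=0.10 is an assumption taken from the "about 10%"
    figure in the original question.
    """
    return round(total_obs * minority_share)

def relative_se(n):
    """Standard error of a group mean, in units of the group's
    standard deviation: 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)

# Compare a small and a large data set with the same 10% share.
for total in (100, 5000):
    n_minor = minority_count(total)
    print(f"N={total:5d}  minority n={n_minor:4d}  "
          f"SE of mean ~ {relative_se(n_minor):.3f} sd")
```

With the same 10% share, 100 observations leave only about 10 major or new programs, while 5000 observations leave about 500, so any estimate for that group is far less noisy in the second case.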