Re: is the design correct ?
- From: "Phil Sherrod" <phil.sherrod@xxxxxxxxxxxxxxxxxxx>
- Date: Fri, 16 Sep 2005 17:36:15 GMT
On 16-Sep-2005, Eric <nospam@xxxxxxxxx> wrote:
> The problem is to find significant predictor variables which
> have an effect in the apparition of disease.
>
> The problem is that this disease is very rare: it is an affection of the
> eye which affects only 20/10000 people.
>
> There are currently around 10 variables that we expect to have an effect.
>
> We have collected in a hospital all the patient files corresponding to
> the disease (around 50 cases).
>
> In order to perform a test we need also to have normal people.
> We have randomly collected the same number of patient files
> which don't have the disease (around 50).
>
> Is it a good design in order to perform a discriminant analysis between
> the two groups, or is it better to respect the proportion of the disease
> and thus to collect 500*50 normal patients?
Analyzing data with a highly disproportionate distribution of cases is
difficult. If you do a Google search for "imbalanced data" or "unbalanced
data" you will find many research papers and more than a few dissertations.
Most modeling methods that attempt to minimize overall error will simply
classify all cases into the majority category; correctly identifying the
minority cases is a challenge.
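To make that concrete, here is a small arithmetic sketch in Python. The population of 10000 is hypothetical and simply mirrors the 20/10000 prevalence quoted above; the "model" is the degenerate one that labels everybody as healthy.

```python
# Hypothetical population matching the quoted prevalence of 20 per 10000.
n_total = 10000
n_disease = 20
n_healthy = n_total - n_disease

# A "model" that labels every case as healthy (the majority category).
true_pos = 0            # diseased cases correctly flagged
false_neg = n_disease   # diseased cases missed
true_neg = n_healthy    # healthy cases correctly passed

accuracy = (true_pos + true_neg) / n_total
sensitivity = true_pos / (true_pos + false_neg)

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
```

The overall accuracy looks excellent even though not a single diseased case is found, which is why overall error alone is a poor criterion here.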
There are several common approaches to analyzing highly unbalanced data:
1. Undersample (subset) the majority cases to force their frequency to
approximate that of the minority. This method discards a significant amount
of data.
2. Oversample (replicate) the minority cases to force balance. This works
sometimes, but since each replicated case has the same predictor values as
its original, it doesn't do a great job.
3. Give greater weight to the minority cases so that their weighted count
matches that of the majority cases. This is similar in effect to method 2
but better, because the majority cases can have their weights adjusted down.
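The three approaches above can be sketched in a few lines of Python using only the standard library. The case counts (50 minority, 5000 majority), the labels and the variable names are made up for illustration; the weighting scheme shown (each class gets half of the total weight) is one common choice, not the only one.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: each case is (case_id, label); 'A' is the minority.
minority = [(i, "A") for i in range(50)]
majority = [(i, "B") for i in range(50, 5050)]

# 1. Undersample: keep a random subset of the majority, same size as minority.
under = minority + rng.sample(majority, len(minority))

# 2. Oversample: draw minority cases with replacement up to the majority size.
over = majority + rng.choices(minority, k=len(majority))

# 3. Weight: leave the data alone and give each class the same total weight.
n = len(minority) + len(majority)
weights = {"A": n / (2 * len(minority)), "B": n / (2 * len(majority))}

print(len(under), len(over), weights)
```

Note that in method 3 the majority weight comes out below 1 while the minority weight is well above 1, so the weighted counts of the two classes match without copying or discarding any case.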
Another approach, which I prefer, is to adjust the predictions after the
model has been created. For this to work the predictive model must be able
to generate a probability score for the categories for each case rather than
a simple A/B category assignment. You then shift the cutoff probability
threshold so that it is no longer 0.5 but rather a value that picks out the
minority cases.
For example, let's assume that we run 100 cases through the model and look
at the probability that each case is category 'A' which is the minority
category. The probability scores might range from 0.0001 to 0.45. If we
simply assigned cases to the most probable category, all cases would be
given category 'B' since it is the most likely category for every case. But
if we shift the cutoff threshold to A=0.25 then those cases whose
probability of being A is between 0.25 and 0.45 will be classified as A. Of
course, as you lower the threshold, more and more cases will be classified
as A, and more cases that are actually B will be misclassified as A. You
have to examine how the error rate and the classification rate vary with
the threshold, and weigh the cost of each kind of misclassification, to
decide on the optimal cutoff threshold.
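Here is a minimal Python sketch of that threshold sweep. The ten probability scores, their true labels and the two cost values are all invented for illustration; in practice the scores would come from whatever model you fitted, and the costs from your own judgment of how bad each kind of error is.

```python
# Hypothetical (probability of 'A', true label) pairs; not real data.
scored = [
    (0.45, "A"), (0.38, "A"), (0.31, "B"), (0.27, "A"), (0.22, "B"),
    (0.15, "A"), (0.09, "B"), (0.04, "B"), (0.01, "B"), (0.0001, "B"),
]

# Assumed costs: missing a true 'A' is taken to be 5x worse than a false alarm.
COST_MISSED_A = 5.0
COST_FALSE_A = 1.0

def evaluate(threshold):
    """Classify as 'A' when P(A) >= threshold; return (missed A, false A, cost)."""
    missed = sum(1 for p, y in scored if y == "A" and p < threshold)
    false_a = sum(1 for p, y in scored if y == "B" and p >= threshold)
    return missed, false_a, missed * COST_MISSED_A + false_a * COST_FALSE_A

for t in (0.5, 0.25, 0.10):
    missed, false_a, cost = evaluate(t)
    print(f"threshold={t:.2f}  missed A={missed}  false A={false_a}  cost={cost}")
```

At the default 0.5 cutoff no case is labelled A, so every true A is missed. Lowering the cutoff trades missed A cases for B cases wrongly labelled A, and the total cost column shows which trade is worth making under the assumed costs.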
I am the author of a predictive modeling program called DTREG
(http://www.dtreg.com). DTREG builds models using Decision Trees, Support
Vector Machines (SVM) and Logistic Regression. It has the ability to adjust
the class assignment probability threshold, and it provides statistics and
graphs showing how a threshold would affect the classification and error
rate. If you can send me your data (with patient identifications removed)
via e-mail along with a description of the variables, I'll be happy to run
it through DTREG for free, and we can see what sort of accuracy we can get.
--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree and SVM predictive modeling)
http://www.nlreg.com (nonlinear regression)