Re: Choosing the right method
- From: "Phil Sherrod" <phil.sherrod@xxxxxxxxxxxxxxxxxxx>
- Date: Thu, 4 Aug 2005 14:54:38 GMT
On 4-Aug-2005, Eric <nospam@xxxxxxxxx> wrote:
> A friend of mine, who is an ophthalmic surgeon, needs to make
> a study regarding factors affecting a disease.
> He has got a dataset of approximately 300 samples (patients),
> with around 6 or 7 explanatory variables (quantitative or qualitative),
> and one response variable which is the occurrence of the disease.
>
> The problem is that in the dataset the disease is quite rare so that
> there are few cases of diseases.
>
> What would be the best method in order to discriminate the populations?
>
> We have been suggested : Artificial Neural Net, K nearest neighbors or
> Decision tree.

An analysis involving highly unbalanced data can be tricky because many
classification programs tend to classify all cases as the most common
category. If there are 1000 Alpha cases and 1 Beta case, a classifier can
be 99.9% accurate by simply saying everything is an Alpha.
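
To make the arithmetic concrete, here is a minimal Python sketch of the
1000 Alpha / 1 Beta example. The "classifier" is hypothetical and simply
answers Alpha for every case; the variable names are made up for
illustration.

```python
# Counts taken from the example above: 1000 Alpha cases and 1 Beta case.
labels = ["Alpha"] * 1000 + ["Beta"]

# A "classifier" that ignores its input and always answers Alpha.
predictions = ["Alpha"] * len(labels)

# Overall accuracy: fraction of cases whose prediction matches the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the rare class: fraction of true Beta cases that were found.
beta_total = sum(y == "Beta" for y in labels)
beta_found = sum(p == "Beta" and y == "Beta"
                 for p, y in zip(predictions, labels))
beta_recall = beta_found / beta_total

print(f"accuracy    = {accuracy:.4f}")
print(f"Beta recall = {beta_recall:.4f}")
```

The accuracy comes out at 1000/1001, about 99.9%, while the recall on Beta
is zero, which is why overall accuracy alone is a poor guide here.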

There are a couple of ways of dealing with this. First, the cases can be
weighted so that the sum of the weights for each category is equal. This
effectively tells the classification program that it is _really_ important
to correctly classify the few infrequent cases. This should help, but it
may not be sufficient.
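
One way to compute such weights can be sketched in Python as follows. This
is an illustration only, not necessarily how DTREG or any other program
does it; the function name balanced_case_weights and the 285/15 split are
assumptions made up for the example.

```python
from collections import Counter


def balanced_case_weights(labels):
    """Return one weight per case so that every category's weights
    sum to the same total."""
    counts = Counter(labels)
    n_cases = len(labels)
    n_categories = len(counts)
    # Each category gets an equal share (n_cases / n_categories) of the
    # total weight, split evenly among that category's cases.
    return [n_cases / (n_categories * counts[y]) for y in labels]


# Made-up data: 285 healthy patients and 15 with the disease.
labels = ["healthy"] * 285 + ["disease"] * 15
weights = balanced_case_weights(labels)

# Check: sum the weights per category.
totals = Counter()
for y, w in zip(labels, weights):
    totals[y] += w
print(dict(totals))
```

With these made-up counts each diseased case gets a weight of 10 and each
healthy case a weight of roughly 0.53, so both categories sum to about 150
and the total weight stays equal to the number of cases.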

The second thing you can do is to compute the probability of each case
belonging to each category and adjust the cutoff threshold. If there are
two categories, normally one would assign a case to the category that has
the larger probability for that case. But you can bias the classifications
by shifting the probability threshold. This technique is very effective at
balancing classifications.
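
The effect of shifting the threshold can be sketched in Python. The
probabilities and labels below are invented for illustration and do not
come from any real model or dataset; classify and recall are hypothetical
helper names.

```python
def classify(prob_disease, threshold=0.5):
    """Assign 'disease' when its estimated probability reaches the
    threshold, otherwise 'healthy'."""
    return ["disease" if p >= threshold else "healthy"
            for p in prob_disease]


def recall(predictions, labels, category):
    """Fraction of the true `category` cases predicted as `category`."""
    hits = sum(p == category and y == category
               for p, y in zip(predictions, labels))
    return hits / sum(y == category for y in labels)


# Made-up probabilities of disease from some hypothetical model,
# paired with the true labels.
prob_disease = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60]
labels = ["healthy", "healthy", "healthy", "disease",
          "healthy", "disease", "disease", "disease"]

for threshold in (0.5, 0.25):
    predictions = classify(prob_disease, threshold)
    print(threshold,
          "disease recall:", recall(predictions, labels, "disease"),
          "healthy recall:", recall(predictions, labels, "healthy"))
```

On this toy data, lowering the threshold from 0.5 to 0.25 raises the recall
on the disease category from 0.25 to 0.75 while the recall on the healthy
category drops from 1.0 to 0.75. The price of finding more of the rare
cases is more false alarms.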

I am the author of a decision tree based modeling program called DTREG
(http://www.dtreg.com). DTREG does automatic weight adjustment to
compensate for unbalanced data, and you can shift the probability threshold
to balance classifications. DTREG can generate single-tree models,
TreeBoost models (series of trees) and Decision Tree Forest models (many
trees that vote on the outcome). I have successfully used DTREG to analyze
highly unbalanced data.

Your friend can download a demonstration copy of DTREG from
http://www.dtreg.com. Or, if your friend can send his data to me via e-mail
at phil.sherrod 'at' sandh.com, I will be happy to run it through DTREG for
him.
--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree modeling)
http://www.nlreg.com (nonlinear regression)