Re: Choosing the right method



On 4-Aug-2005, Eric <nospam@xxxxxxxxx> wrote:

> A friend of mine, who is an ophthalmic surgeon, needs to make
> a study regarding factors affecting a disease.
> He has got a dataset of approximately 300 samples (patients),
> with around 6 or 7 explanatory variables (quantitative or qualitative),
> and one response variable, which is the occurrence of the disease.
>
> The problem is that in the dataset the disease is quite rare, so
> there are few cases of the disease.
>
> What would be the best method to discriminate between the populations?
>
> The methods suggested to us are: artificial neural net, k-nearest
> neighbors, or decision tree.

An analysis involving highly unbalanced data can be tricky because many
classification programs tend to assign every case to the most common
category. If there are 1000 Alpha cases and 1 Beta case, a classifier can be
99.9% accurate simply by saying everything is an Alpha.
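
To make the arithmetic concrete, here is a small Python sketch (my own
illustration, not tied to any particular package) of the "always say Alpha"
classifier on the 1000-to-1 example. The function name is made up for this
post.

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

# The example from the text: 1000 Alpha cases and 1 Beta case.
labels = ["Alpha"] * 1000 + ["Beta"] * 1
accuracy = majority_baseline_accuracy(labels)

# The baseline never predicts Beta, so it finds 0 of the 1 Beta cases.
beta_cases_found = 0
print(f"accuracy = {accuracy:.4f}, Beta cases found = {beta_cases_found} of 1")
```

The overall accuracy looks excellent even though the one case of interest is
missed, which is why accuracy alone is a poor yardstick for rare outcomes.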

There are a couple of ways of dealing with this. First, the cases can be
weighted so that the sum of the weights is the same for each category. This
effectively tells the classification program that it is _really_ important to
correctly classify the few infrequent cases. This should help, but it may not
be sufficient.
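
As a rough sketch of the weighting idea (one common way to do it, assumed
here for illustration; a given program may scale its weights differently),
each case gets a weight inversely proportional to the size of its category,
so that every category ends up with the same total weight:

```python
from collections import Counter

def balanced_case_weights(labels):
    """Per-case weights such that every category has the same total weight."""
    counts = Counter(labels)
    # Share the overall weight (one unit per case) equally among categories.
    per_category_total = len(labels) / len(counts)
    return [per_category_total / counts[label] for label in labels]

labels = ["Alpha"] * 1000 + ["Beta"] * 1
weights = balanced_case_weights(labels)

# Check: add up the weights within each category.
totals = Counter()
for label, weight in zip(labels, weights):
    totals[label] += weight
print(dict(totals))
```

With 1000 Alpha cases and 1 Beta case, each Alpha case is weighted 0.5005 and
the single Beta case 500.5, so misclassifying the Beta case costs as much as
misclassifying all the Alpha cases together.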

The second thing you can do is compute, for each case, the probability of
each category and adjust the cutoff threshold. If there are two categories,
one would normally assign a case to the category that has the larger
probability for that case, which amounts to a threshold of 0.5. But you can
bias the classifications toward the rare category by shifting the probability
threshold. This technique is very effective at balancing classifications.
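
Here is a minimal Python sketch of the threshold shift. The probabilities
are made-up numbers standing in for the output of whatever model you fit, and
the function names are my own; the point is only to show how moving the
cutoff trades a few false alarms for catching the rare cases.

```python
def classify(prob_rare, threshold=0.5):
    """Label a case 'rare' when its rare-category probability reaches the threshold."""
    return ["rare" if p >= threshold else "common" for p in prob_rare]

def recall(actual, predicted, category):
    """Fraction of the cases truly in `category` that were predicted as such."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == category and p == category)
    return hits / sum(1 for a in actual if a == category)

# Made-up data: 8 common cases followed by 2 rare ones, with the
# model's estimated probability that each case is rare.
actual = ["common"] * 8 + ["rare"] * 2
prob_rare = [0.02, 0.05, 0.10, 0.08, 0.15, 0.30, 0.04, 0.12, 0.35, 0.45]

for threshold in (0.5, 0.25):
    predicted = classify(prob_rare, threshold)
    print(threshold,
          recall(actual, predicted, "rare"),
          recall(actual, predicted, "common"))
```

At the default cutoff of 0.5 neither rare case is found; at 0.25 both are
found, at the price of one common case being flagged as rare. In practice the
cutoff would be chosen on held-out data according to how costly each kind of
error is.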

I am the author of a decision tree based modeling program called DTREG
(http://www.dtreg.com). DTREG does automatic weight adjustment to compensate
for unbalanced data, and you can shift the probability threshold to balance
classifications. DTREG can generate single-tree models, TreeBoost models
(series of trees) and Decision Tree Forest models (many trees that vote on
the outcome). I have successfully used DTREG to analyze highly unbalanced
data.

Your friend can download a demonstration copy of DTREG from
http://www.dtreg.com. Or, if your friend can send his data to me via e-mail
at phil.sherrod 'at' sandh.com, I will be happy to run it through DTREG for
him.

--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree modeling)
http://www.nlreg.com (nonlinear regression)