Re: Finding Statistically Significant Rules



On May 11, 1:22 pm, Ray Koopman <koop...@xxxxxx> wrote:
On May 11, 1:03 am, hgwelec <hgwe...@xxxxxxxxx> wrote:



Hi Ray and thanks for your reply.

If I understood correctly, you are basically suggesting some sort of
cross-validation, keeping track of how the 85% accuracy changes in each
fold. Of course, I am not a statistician, but this seems to me something
like an "empirical" rule.

With a chi-square test you are able to quantify statistical
significance and present your findings (say, in a scientific paper),
but how can I quantify the significance in the way you described?

Again, sorry if I am totally mistaken about this.

Thanks,

Hgwelec

Have I misunderstood something? You developed a classifier that
turned out to be 85% accurate in its development sample. If you
reanalyze your data R = 5000 or so times, each time following the
same procedure you used initially, but each time randomly permuting
the N = 700 values of the client variable (i.e., randomly reassigning
the N observed values to different clients), you will get R different
accuracies, one for each redeveloped classifier of the randomized
data. Each time you use all N cases. There is no hold-out sample.
OK so far?

The "significance" of your classifier's accuracy -- the quantity that
corresponds to the p-value from a statistic such as chi-square --
is the proportion of the R accuracies that equal or exceed the
accuracy of the classifier that was developed on the unpermuted data.
This is known as a permutation test and is perfectly respectable
scientifically.
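
A minimal Python sketch of the procedure described above, assuming the
permuted "client variable" is the class label being predicted. The function
fit_and_score is a hypothetical stand-in for whatever procedure produced the
original classifier (here, a simple majority-label rule per feature value);
it is not the actual method used in this thread, and all names are
illustrative.

```python
import random
from collections import Counter, defaultdict

def fit_and_score(features, labels):
    """Hypothetical stand-in for the full development procedure:
    learn a majority-label rule per feature value and return the
    in-sample accuracy (all N cases are used, no hold-out)."""
    counts = defaultdict(Counter)
    for x, y in zip(features, labels):
        counts[x][y] += 1
    correct = sum(max(c.values()) for c in counts.values())
    return correct / len(labels)

def permutation_p_value(features, labels, n_perm=5000, seed=0):
    """Return (observed accuracy, proportion of permuted-data
    accuracies that equal or exceed it)."""
    rng = random.Random(seed)
    observed = fit_and_score(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign the observed labels at random
        if fit_and_score(features, shuffled) >= observed:
            hits += 1
    return observed, hits / n_perm
```

The key point is that the whole development procedure is rerun on each
permuted data set, so any optimism from fitting and scoring on the same
cases is present in the permuted accuracies as well.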




Hi again Ray,

As already discussed, I am not a statistician. What you said made it
crystal clear, but unfortunately 5000 repeats of this procedure cannot
be performed.

I am thinking of discretizing AGE and NUM_OF_CHILDREN and then
performing a chi-square test.
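
A rough sketch of what that could look like in Python, using only the
standard library. The age cut points (30, 45, 60) and the function names are
assumptions made for illustration, not values taken from the data. The code
computes the Pearson chi-square statistic and its degrees of freedom; the
p-value would then come from a chi-square table or a statistics library.

```python
from collections import Counter

def discretize_age(age, edges=(30, 45, 60)):
    """Map an age to a bin index; the cut points are illustrative assumptions."""
    return sum(age >= e for e in edges)

def chi_square_statistic(rows, cols):
    """Pearson chi-square statistic and degrees of freedom for the
    contingency table formed by two equal-length categorical sequences."""
    n = len(rows)
    row_tot = Counter(rows)
    col_tot = Counter(cols)
    observed = Counter(zip(rows, cols))
    stat = 0.0
    for r in row_tot:
        for c in col_tot:
            expected = row_tot[r] * col_tot[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df
```

For example, chi_square_statistic([discretize_age(a) for a in ages], classes)
would test the binned AGE against the class variable, where ages and classes
are hypothetical lists with one entry per client.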



Thanks,


Hgwelec
