Re: Software suggestions



BernardZ wrote:
In article <1125140487.375978.136460@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>, clemenr@xxxxxxxxxx says...

From what you write, I think that the following is the case:

You have a table of data.

You want a formula that relates some variable "Results" to some other
variables "var1", "var2", "var3" ... "varn"

If you were interested in linear regression, then you'd want to find
weights b0, b1, b2, b3, ..., bn such that:

Results = b0 + b1 * var1 + b2 * var2 + b3 * var3 + ... + bn * varn

optimises some goodness-of-fit measure, e.g. minimises the sum of
squared residuals (least squares).
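
To make the linear regression case concrete, here is a minimal sketch in plain Python that finds the weights by solving the normal equations. The table is made up, and the helper names solve and least_squares are hypothetical, not anything from R or from the original poster's data. In practice you would use a tested routine such as lm in R.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, results):
    """Return weights [b0, b1, ..., bn] minimising the sum of squared residuals."""
    x = [[1.0] + list(r) for r in rows]  # the leading 1.0 carries the intercept b0
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(x, results)) for i in range(k)]
    return solve(xtx, xty)


# Made-up table in which Results = 1 + 2 * var1 - 3 * var2 holds exactly.
table = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
results = [1 + 2 * v1 - 3 * v2 for v1, v2 in table]
weights = least_squares(table, results)
print([round(b, 6) for b in weights])
```

Note that the structure (an intercept plus one weight per variable) is given in advance here. Only the weights are estimated.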

However, you don't know the form the expression should take. Hence you
want a program that finds the formula for you. You're disappointed by
what you find in R because the various procedures in R can typically
only be used if you give the structure of the formula, and R then finds
a "good" set of weights.

Is this the case?

If this is the case, then you are in trouble. Finding such formulae in
data is far from a solved problem, at least in the relatively
unrestricted form (unrestricted in the structures the formula may
contain, over and above the weights) in which it is usually attempted
in machine learning. Unless the relationship you're searching for in
your data is quite simple, such a program is unlikely to be much use to
you even if you get hold of one. And if the relationship is simple, you
can probably find it by hand.

I think you need to be clearer about what it is that you want.


You are correct about what I want. I want the computer to find it in an unrestricted structure.

I am also surprised that no one has written such a program.

Say I put 20 variables in a table with, say, 1000 entries. A computer program tries, say, 2,000,000 different equations and then returns the equation that gives the best fit.

Would it be useful? To me yes.
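
A minimal sketch of that kind of search, under loud assumptions: the candidates are restricted to structures of the form b0 + b1 * (f(var_i) op g(var_j)), the table is made up and noise-free, and the names TRANSFORMS, COMBINE, fit_line and search are hypothetical. It tries 432 structures rather than 2,000,000 and ranks them by residual sum of squares.

```python
import itertools
import math
import random

# Hypothetical building blocks for candidate formulas.
TRANSFORMS = [
    ("{}", lambda a: a),
    ("{}^2", lambda a: a * a),
    ("sqrt({})", math.sqrt),
    ("log({})", math.log),
]
COMBINE = [
    ("+", lambda a, b: a + b),
    ("*", lambda a, b: a * b),
    ("/", lambda a, b: a / b),
]


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, rss)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # constant feature: only the intercept can be fitted
        return my, 0.0, sum((y - my) ** 2 for y in ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, rss


def search(columns, results):
    """Try every structure b0 + b1 * (f(var_i) op g(var_j)); return the best by RSS."""
    terms = [
        (template.format(name), [f(v) for v in values])
        for name, values in columns.items()
        for template, f in TRANSFORMS
    ]
    best = None
    for (name_a, a), (name_b, b) in itertools.product(terms, repeat=2):
        for symbol, op in COMBINE:
            feature = [op(u, v) for u, v in zip(a, b)]
            b0, b1, rss = fit_line(feature, results)
            if best is None or rss < best[0]:
                best = (rss, f"{b0:.3f} + {b1:.3f} * ({name_a} {symbol} {name_b})")
    return best


# Made-up, noise-free table: Results = 3 + 2 * var1 * sqrt(var2).
rng = random.Random(0)
columns = {f"var{i}": [rng.uniform(2, 10) for _ in range(50)] for i in (1, 2, 3)}
results = [3 + 2 * v1 * math.sqrt(v2)
           for v1, v2 in zip(columns["var1"], columns["var2"])]
best_rss, best_formula = search(columns, results)
print(best_formula, best_rss)
```

The search recovers the formula here only because the true structure happens to be in the candidate set and the data contain no noise. Neither can be taken for granted with a real table.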

Yes, we'd all like something like that: it means we wouldn't have to think.

The problem is that you will be able to find a model that fits the data well, but you will have no idea how well it describes the mechanisms that create the data (as opposed to this particular data set). If you do the analysis on two replicate data sets, will you get the same model? If not, then what do you do? The best model for one data set might be awful for the other.
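
The replicate question can be tried out with a small simulation. This is only a sketch under stated assumptions: both data sets come from the same made-up mechanism in which Results is pure noise, unrelated to any of the 20 variables, the "search" is just picking the single best variable by R^2, and the names r_squared, replicate and best_variable are hypothetical. The table has 30 rows rather than 1000 to keep the effect visible.

```python
import random


def r_squared(xs, ys):
    """R^2 of the least-squares line y = b0 + b1 * x (the squared correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def replicate(rng, n_rows=30, n_vars=20):
    """One data set from the same mechanism: Results is unrelated to every variable."""
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]
    results = [rng.gauss(0, 1) for _ in range(n_rows)]
    return columns, results


def best_variable(columns, results):
    """Index of the variable whose straight-line fit has the highest R^2."""
    return max(range(len(columns)), key=lambda i: r_squared(columns[i], results))


rng = random.Random(1)
cols_a, res_a = replicate(rng)
cols_b, res_b = replicate(rng)
pick_a = best_variable(cols_a, res_a)
pick_b = best_variable(cols_b, res_b)

print("best on A: var%d" % (pick_a + 1))
print("  its R^2 on A: %.3f" % r_squared(cols_a[pick_a], res_a))
print("  its R^2 on B: %.3f" % r_squared(cols_b[pick_a], res_b))
print("best on B: var%d" % (pick_b + 1))
```

Since nothing real is there to be found, whichever variable wins on replicate A is a winner by chance. Comparing its R^2 on the two replicates, and the two picks, shows how far a "best" model carries over.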

And what criterion do you use to determine the best model? Different criteria will give you different answers.
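
A tiny made-up example of criteria disagreeing: two fixed candidate formulas are scored on five rows, once by sum of squared errors and once by sum of absolute errors. The data, the two candidates and the function names are all hypothetical, chosen so that one outlying row splits the two criteria.

```python
var1 = [1, 2, 3, 4, 5]
results = [1, 2, 3, 4, 20]  # made-up data; the last row is an outlier

# Two hypothetical candidate formulas with no free weights.
candidates = {
    "Results = var1": lambda v: v,
    "Results = var1 + 3": lambda v: v + 3,
}


def sum_squared_error(model):
    """Least-squares criterion: large misses are penalised heavily."""
    return sum((y - model(v)) ** 2 for v, y in zip(var1, results))


def sum_absolute_error(model):
    """Least-absolute-deviation criterion: misses are penalised in proportion."""
    return sum(abs(y - model(v)) for v, y in zip(var1, results))


best_by_squares = min(candidates, key=lambda name: sum_squared_error(candidates[name]))
best_by_absolute = min(candidates, key=lambda name: sum_absolute_error(candidates[name]))

print(best_by_squares)   # Results = var1 + 3  (180 against 225)
print(best_by_absolute)  # Results = var1      (15 against 24)
```

Same data, same candidates, different "best" equation. Penalised criteria such as AIC or BIC add yet another ranking once the candidates differ in the number of weights.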

This is the sort of idea that sounds good but sends us professionals into spasms: there are so many problems with it.

Bob

--
Bob O'Hara
Department of Mathematics and Statistics
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FIN-00014 University of Helsinki
Finland

Telephone: +358-9-191 51479
Mobile: +358 50 599 0540
Fax:  +358-9-191 51400
WWW:  http://www.RNI.Helsinki.FI/~boh/
Journal of Negative Results - EEB: www.jnr-eeb.org