Re: Comparing predictors

From: Richard Ulrich (Rich.Ulrich_at_comcast.net)
Date: 06/25/04


Date: Fri, 25 Jun 2004 10:48:37 -0400

On Fri, 25 Jun 2004 09:42:42 +0100, Lionel Barnett <mail@lionelb.com>
wrote:

> Richard Ulrich wrote:
>
> > On Thu, 24 Jun 2004 17:08:35 +0100, Lionel B <me@privacy.net> wrote:
> >
> >>Greetings,
> >>
> >>Suppose I have (jointly distributed, real-valued) random variables Y,
> >>X_1, ..., X_n and a real function f(x_1, ..., x_n) which is used to
> >>define the "predictor" Y' = f(X_1, ..., X_n) for Y. It is then standard
> >>to measure the quality of Y' by the mean square error E((Y'-Y)^2).
> >>
> >>Now I have the following situation: I have two such predictors; the
> >>first, derived from f(x_1, ..., x_n), say, is actually quadratic in the
> >>x_i (it is, in fact a simple least squares fit to a 2nd order
> >>polynomial). The other, derived from g(x_1, ..., x_n), say, is the
> >>output of a (trained) neural network.
> >>
> >>I now have the suspicion that the neural network predictor Y'' = g(X_1,
> >>..., X_n) is actually doing "more-or-less the same thing" as the
> >>quadratic fit predictor Y' = f(X_1, ..., X_n). But how do I test this
> >>suspicion?
> >>
> >>So far the only sensible way I can think of for comparing predictors
> >>is to measure their correlation. Indeed, in my case the correlation
> >>corr(Y',Y'') comes out at a significantly high (approx.) 0.8. But
> >>correlation alone doesn't somehow quite seem to confirm that the neural
> >>network really is doing more-or-less the same as the quadratic fit - it
> >>misses out on the joint distribution of the respective predictors with
> >>the independents X_1, ..., X_n.
> >>
> >>If this sounds confused, it is ... my real question is probably: what
> >>question should I be asking here?
> >
> > Does the problem reduce to this:
> >
> > You have m examples of a criterion, Y, each
> > with an associated data vector Xn;
> > you have a regression prediction, Y', based on 2n degrees
> > of freedom (with possible overfitting?); and
> > a NN prediction, Y", using at least n degrees of freedom.
>
> Pretty much, yes. Perhaps I should be more explicit: in fact n is small
> - n = 2, while m is very large - m = 24800. [this is very noisy
> financial time-series data, as it happens]. The regression for Y' is
> based on a "known" model of the form:
>
> Y' = f(X) = b_1*X_1 + b_2*X_2 + b_3*X_1*X_2 (+ uncorrelated noise)
>
> so 3 degrees of freedom, while the neural network might be viewed as a
> "black box":
>
> Y'' = g(X) (+ uncorrelated noise)
>
> with 68 degrees of freedom (the number of weights to adapt).
>
> > Now, Y' correlates 0.80 with Y"; and you want to
> > know whether the two predict substantially the same
> > variance of Y.
>
> No, I know that they do...
>
> corr(Y',Y'') = 0.8851
> corr(Y'-Y,Y''-Y) = 0.9991 (!)
>
> (see my reply to Ross Clement in this thread). What I want to know might
> be best phrased as: Do Y' and Y'' predict Y "in a similar way".

I liked Ross's suggestion about comparing residuals.
Being *uncorrelated* says that the predictors are different;
being correlated is less definitive, but 0.999 says there is
not *much* that is different going on, quantitatively. On the other
hand, this is a time series, which raises a couple of problems --
there is serial correlation so (a) the values are not independent,
and the ordinary tests are not valid, and (b) (related to that)
correlations of 0.99 are not necessarily surprising or useful.
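
A rough sketch of the residual comparison, in plain Python. The
function names are made up for illustration, and the effective-N
formula is only the usual first-order (AR(1)-style) approximation,
N * (1 - r_a*r_b) / (1 + r_a*r_b), so treat its output as a rough
guide to how much the serial correlation shrinks the sample.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def lag1_autocorr(series):
    """Correlation of a series with itself shifted by one step."""
    return pearson(series[:-1], series[1:])

def effective_n(a, b):
    """Approximate number of independent pairs behind corr(a, b),
    assuming both series behave roughly like AR(1) processes."""
    rho = lag1_autocorr(a) * lag1_autocorr(b)
    return len(a) * (1 - rho) / (1 + rho)

def residual_correlation(y, pred_a, pred_b):
    """Correlate the residuals of two predictors of the same Y and
    report the approximate effective sample size alongside."""
    res_a = [p - t for p, t in zip(pred_a, y)]
    res_b = [p - t for p, t in zip(pred_b, y)]
    return pearson(res_a, res_b), effective_n(res_a, res_b)
```

If the effective N comes out far below the nominal 24800, the usual
significance test for the correlation is overstating its case.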

>
> I realise that this is not a precise question - I would like to make it
> precise!

Given these details -- it seems that if the 'black box' does not
predict a tad better, with 68 d.f., it is *inferior*, whether it is
similar or not. That is what you would get using AIC or BIC
to compare models. That would imply, I guess, that it fails
to represent the X_1*X_2 term. That could be tested directly,
in part, by seeing if that term predicts the residuals of the
black box: not predicting would say that the effect was taken
care of.
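
A minimal sketch of both checks, assuming Gaussian errors so that
AIC and BIC can be written in terms of the residual sum of squares
(up to an additive constant that cancels when comparing models on the
same data). The function names are hypothetical; sse, m and k stand
for the residual sum of squares, the number of cases, and the number
of fitted parameters.

```python
import math

def aic(sse, m, k):
    """Gaussian AIC up to a constant: m*ln(SSE/m) + 2k."""
    return m * math.log(sse / m) + 2 * k

def bic(sse, m, k):
    """Gaussian BIC up to a constant: m*ln(SSE/m) + k*ln(m)."""
    return m * math.log(sse / m) + k * math.log(m)

def r2_of_residuals_on_product(residuals, x1, x2):
    """R^2 from regressing the black-box residuals on x1*x2
    (simple regression with an intercept). A value near zero
    suggests the interaction is already accounted for.
    Assumes neither the residuals nor x1*x2 is constant."""
    z = [a * b for a, b in zip(x1, x2)]
    n = len(z)
    mean_z, mean_r = sum(z) / n, sum(residuals) / n
    szr = sum((u - mean_z) * (v - mean_r) for u, v in zip(z, residuals))
    szz = sum((u - mean_z) ** 2 for u in z)
    srr = sum((v - mean_r) ** 2 for v in residuals)
    return (szr * szr) / (szz * srr)
```

With equal SSE, the 68-parameter model carries an AIC penalty that is
2*(68-3) = 130 larger than the 3-parameter model's, so it has to fit
noticeably better just to break even.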

I think that I might make the easy assumption that the polynomial
was nested within the black box, and do an ANOVA (an F test on
the gain in R^2) with 68-3 = 65 d.f. for the added terms.
 - This is an exploratory analysis, since it misuses a time series
(which, effectively, acts like a smaller set of data because of
the high serial correlation).
 - However, if the black box does *not* add to the prediction,
that answer is fairly robust, since the bias of the time series is
in the opposite direction.
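
The nested comparison can be sketched as the standard F ratio for an
increase in R^2. This is only an illustration: the function name is
made up, the residual d.f. is taken as m - k_big on the assumption
that there is no separate intercept (as in the model quoted above),
and any R^2 values you plug in are your own.

```python
def nested_f(r2_small, r2_big, k_small, k_big, m):
    """F ratio for the gain in R^2 when going from the smaller
    model (k_small parameters) to the larger one (k_big), which is
    assumed to contain the smaller model as a special case.
    Returns (F, numerator d.f., denominator d.f.)."""
    df_num = k_big - k_small          # 68 - 3 = 65 in this thread
    df_den = m - k_big                # residual d.f. of the larger model
    f = ((r2_big - r2_small) / df_num) / ((1.0 - r2_big) / df_den)
    return f, df_num, df_den
```

Because of the serial correlation, m overstates the amount of
independent information, so the F computed this way is too generous
to the larger model; a non-significant result is the trustworthy one.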

Now that you have identified this as time series, maybe there
will be comments by people with pertinent experience with those.

But I have a couple of notions about those models.
If you are not using software designed for time series, models
of differences are usually the place to start; and aggregation
gets rid of nuisance variance, when scores are represented in
too much detail.  An N of 24800 suggests to me that these
numbers represent prices (say) sampled by day, if not more
often than that; whereas I expect the 'effects' of the predictors,
if not the measurement of the predictors themselves, would
be properly assessed by longer intervals.
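
Both notions are easy to try. The sketch below uses made-up function
names; the block length is an assumption you would choose yourself
(say, the number of samples per week), and averaging within blocks is
just one way of aggregating.

```python
def difference(series):
    """First differences: x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

def aggregate(series, block):
    """Mean of each consecutive block of `block` values.
    An incomplete block at the end is dropped."""
    n_blocks = len(series) // block
    return [sum(series[i * block:(i + 1) * block]) / block
            for i in range(n_blocks)]
```

For example, difference([1, 3, 6, 10]) gives [2, 3, 4], and
aggregate([1, 2, 3, 4, 5, 6, 7], 3) gives [2.0, 5.0]. Refitting both
predictors on the differenced or aggregated series would show whether
their agreement survives once the nuisance variance is removed.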

[ snip, rest]

-- 
Rich Ulrich, wpilib@pitt.edu
http://www.pitt.edu/~wpilib/index.html