Re: r-Squared Question



Predictor wrote:
Let's assume some observed data, which I hope makes my question
clearer:

X   Y
1   101
2   102
3   103
4   104
5   105
6   106
7   107
8   108
9   109
10  110

The relationship here is obvious, but bear with me.  Assume that some
regression procedure (obviously not least squares) produces a linear
model, YHat:

X   Y     YHat
1   101    97
2   102    99
3   103   101
4   104   103
5   105   105
6   106   107
7   107   109
8   108   111
9   109   113
10  110   115

YHat has a correlation ( r ) with Y of 1.0.  r-squared is hence 1.0.
What I'm getting at is: the r-squared is at its best possible value,
yet the model is obviously suboptimal.  Have I gone wrong somewhere, or
is this a fundamental weakness of r-squared?


It depends on how you define R^2. If you define it as the square of the correlation between observed and predicted values, then yes, it's a weakness: correlation is unchanged by a linear rescaling of the predictions, so it cannot detect a wrong slope or intercept. However, if you define it as 1 - ResSS/TSS, then, for an arbitrary model-fitting procedure, R^2 isn't even constrained to the interval [0,1], since ResSS might exceed TSS.


Your example, with the residuals added:
> X   Y     YHat Y-Yhat
> 1   101    97    4
> 2   102    99    3
> 3   103   101    2
> 4   104   103    1
> 5   105   105    0
> 6   106   107   -1
> 7   107   109   -2
> 8   108   111   -3
> 9   109   113   -4
> 10  110   115   -5

Here, TSS = 82.5 and ResSS = 85, so R^2 = 1 - 85/82.5 = -0.03 (approximately), and the fitted line predicts worse than always using the sample mean of Y.
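
If you want to check both definitions numerically, here is a rough Python sketch using only the standard library. The function names r2_corr and r2_ss are just my own labels for the two definitions, not anything from a package.

```python
# Data from the example above.
y = [101, 102, 103, 104, 105, 106, 107, 108, 109, 110]
yhat = [97, 99, 101, 103, 105, 107, 109, 111, 113, 115]


def mean(v):
    return sum(v) / len(v)


def r2_corr(y, yhat):
    """Definition 1: squared Pearson correlation of observed vs predicted."""
    my, mh = mean(y), mean(yhat)
    sxy = sum((a - my) * (b - mh) for a, b in zip(y, yhat))
    syy = sum((a - my) ** 2 for a in y)
    shh = sum((b - mh) ** 2 for b in yhat)
    return sxy ** 2 / (syy * shh)


def r2_ss(y, yhat):
    """Definition 2: 1 - ResSS/TSS. Can go negative when ResSS > TSS."""
    my = mean(y)
    tss = sum((a - my) ** 2 for a in y)                   # total sum of squares
    res_ss = sum((a - b) ** 2 for a, b in zip(y, yhat))   # residual sum of squares
    return 1 - res_ss / tss


print(r2_corr(y, yhat))          # correlation-based definition
print(round(r2_ss(y, yhat), 4))  # sum-of-squares definition
```

The first definition comes out at 1, the second slightly below zero, which is the disagreement described above.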