Re: ridge regression, good for prediction or representation?
From: Herman Rubin (hrubin_at_odds.stat.purdue.edu)
Date: 02/01/05
Date: 1 Feb 2005 12:33:23 -0500
In article <1106980540.021327.32740@f14g2000cwb.googlegroups.com>,
b83503104@yahoo.com <b83503104@yahoo.com> wrote:
>I saw a discussion which uses principal component analysis to explain
>why ridge regression works. In Trevor Hastie's 2001 book (p.63),
>"Ridge regression ... shrinks the coefficients of the low-variance
>components more than the high-variance components."
>This made me confused.
>Isn't regression used for prediction/discrimination, and PCA used for
>representation? And isn't it a common sense that methods like PCA that
>find good representation are NOT necessarily good for prediction
>(LDA/logistic regression are good for discrimination)? But the book
>then implies that ridge regression serves as BOTH a prediction (since
>it is regression) and representation (since has the effect of PCA)
>method?
This is a mathematical result, which does add to the
intuition if used correctly.
The standard means of computing a regression is to
compute the estimate of b as (X'X)^{-1}X'Y. If there is
a prior distribution of the parameters with mean 0 and
covariance matrix T, the best linear estimator with
quadratic form loss is, instead, (X'X + T^{-1})^{-1}X'Y.
Ridge regression chooses the matrix T to be I/h, so that
T^{-1} = hI and the estimator is (X'X + hI)^{-1}X'Y, with
h possibly to be determined.
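
The two estimators can be compared numerically. What follows is only a minimal sketch: the simulated data, the seed, and the value h = 5 are made up for illustration, and the error variance is taken to be 1, which the formula above implicitly assumes.

```python
import numpy as np

# Hypothetical simulated data; none of these numbers come from the post.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
b_true = np.array([1.0, -2.0, 0.5])
Y = X @ b_true + rng.normal(size=n)   # error variance 1 (assumption)

h = 5.0  # ridge constant, chosen arbitrarily here

# Ordinary least squares: b = (X'X)^{-1} X'Y
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Ridge: prior covariance T = I/h, so T^{-1} = h I
b_ridge = np.linalg.solve(X.T @ X + h * np.eye(p), X.T @ Y)

print("OLS:  ", b_ols)
print("ridge:", b_ridge)
```

For any h > 0 the ridge vector is shorter than the least-squares vector, which is the shrinkage under discussion.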
Now what does PCA do? It uses an orthogonal transformation
Q to reduce X'X to diagonal form. This in effect replaces
X by XQ and b by c = Q'b. The identity matrix is preserved,
and so the regression equation, and the ridge equation, are
both preserved. But now X'X becomes the diagonal matrix Q'X'XQ
with i-th diagonal element d_i. Writing m_i for the i-th element
of Q'X'Y, the i-th element of c from the original equation is
m_i/d_i, and the i-th element of c from the ridge equation is
m_i/(d_i + h).
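
The diagonalization can be checked directly. This is a sketch under assumptions: the data are simulated, the column scales and h = 2 are arbitrary, and m denotes Q'X'Y, the transformed right-hand side of the normal equations.

```python
import numpy as np

# Hypothetical data with columns of very different scale.
rng = np.random.default_rng(1)
p = 4
X = rng.normal(size=(40, p)) * np.array([3.0, 1.0, 0.3, 0.1])
Y = rng.normal(size=40)
h = 2.0  # arbitrary ridge constant

# Orthogonal Q with X'X = Q diag(d) Q'; eigh returns d in ascending order.
d, Q = np.linalg.eigh(X.T @ X)
m = Q.T @ (X.T @ Y)          # m = (XQ)'Y

c_ols = m / d                # i-th element m_i / d_i
c_ridge = m / (d + h)        # i-th element m_i / (d_i + h)

# Rotating back with b = Qc recovers the estimates in the original basis.
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
b_ridge = np.linalg.solve(X.T @ X + h * np.eye(p), X.T @ Y)
print(np.allclose(Q @ c_ols, b_ols), np.allclose(Q @ c_ridge, b_ridge))

# Per-component shrinkage factor d_i / (d_i + h): near 0 for small d_i.
shrink = d / (d + h)
print(shrink)
```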
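Again looking at the PC representation, d_i is (up to a factor
of the sample size) the variance of the i-th principal component,
and h is constant. The ridge estimate multiplies the i-th
component by d_i/(d_i + h), so if d_i is small in comparison to h,
the shrinkage is large. In fact, changing h to 0 if d_i >= h and
to infinity if d_i < h, that is, leaving the high-variance
components unshrunk and dropping the others, at most doubles
the Bayes risk for squared error loss.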
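
The factor of two can be seen one component at a time. The sketch below assumes error variance 1 and prior variance 1/h for each component of c (assumptions consistent with T = I/h, not stated explicitly in the post). Under them the ridge estimate has Bayes risk 1/(d_i + h), the unshrunk estimate has risk 1/d_i, and the estimate 0 has risk 1/h.

```python
import numpy as np

h = 1.0                        # arbitrary ridge constant
d = np.logspace(-3, 3, 13)     # a range of hypothetical d_i values

# Bayes risk of the ridge estimate m_i / (d_i + h): the posterior variance.
ridge_risk = 1.0 / (d + h)

# Keep-or-drop rule: h -> 0 (risk 1/d_i) when d_i >= h,
# h -> infinity (estimate 0, risk 1/h) when d_i < h.
hard_risk = np.where(d >= h, 1.0 / d, 1.0 / h)

# The ratio equals (d_i + h) / max(d_i, h), which never exceeds 2.
ratio = hard_risk / ridge_risk
print(ratio)
```

The ratio reaches 2 only where d_i = h and approaches 1 as d_i moves far from h in either direction.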
This is also intuitive. In the directions with d_i large,
the likelihood function changes rapidly, so the prior is
not of great importance. In the directions with d_i small,
the likelihood function changes slowly, so the data are not of
much importance, while the prior is.
-- This address is for information only. I do not claim that these views are those of the Statistics Department or of Purdue University. Herman Rubin, Department of Statistics, Purdue University hrubin@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558