Re: ridge regression, good for prediction or representation?

From: Herman Rubin (hrubin_at_odds.stat.purdue.edu)
Date: 02/01/05


Date: 1 Feb 2005 12:33:23 -0500

In article <1106980540.021327.32740@f14g2000cwb.googlegroups.com>,
b83503104@yahoo.com <b83503104@yahoo.com> wrote:
>I saw a discussion which uses principal component analysis to explain
>why ridge regression works. In Trevor Hastie's 2001 book (p.63),
>"Ridge regression ... shrinks the coefficients of the low-variance
>components more than the high-variance components."
>This made me confused.
>Isn't regression used for prediction/discrimination, and PCA used for
>representation? And isn't it a common sense that methods like PCA that
>find good representation are NOT necessarily good for prediction
>(LDA/logistic regression are good for discrimination)? But the book
>then implies that ridge regression serves as BOTH a prediction (since
>it is regression) and representation (since has the effect of PCA)
>method?

This is a mathematical result, which does add to the
intuition if used correctly.

The standard means of computing a regression is to
compute the estimate of b as (X'X)^{-1}X'Y. If there is
a prior distribution of the parameters with mean 0 and
covariance matrix T, the best linear estimator with
quadratic form loss is, instead, (X'X + T^{-1})^{-1}X'Y
(taking the error variance to be 1).
Ridge regression chooses the matrix T to be I/h, so that
T^{-1} = hI, with h possibly to be determined.
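
As a rough illustration of the two estimators above, here is a
minimal NumPy sketch. The function names, the simulated data, and
the value h = 5 are assumptions made up for the example; the error
variance is taken to be 1, so that T = I/h gives T^{-1} = hI.

```python
import numpy as np

def ols_estimate(X, Y):
    # Solve (X'X) b = X'Y instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ Y)

def ridge_estimate(X, Y, h):
    # Prior covariance T = I/h, so T^{-1} = h I is added to X'X.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + h * np.eye(p), X.T @ Y)

# Simulated data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

b_ols = ols_estimate(X, Y)
b_ridge = ridge_estimate(X, Y, h=5.0)
```

With h = 0 the ridge estimate reduces to the ordinary one, and for
any h > 0 the ridge coefficient vector is shorter than b_ols.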

Now what does PCA do? It uses an orthogonal transformation
Q to reduce X'X to diagonal form. This in effect replaces
X by XQ and b by c = Q'b. The identity matrix is preserved,
and so the regression equation, and the ridge equation, are
both preserved. But now X'X is a diagonal matrix with the
i-th element being d_i, so, writing m_i for the i-th element
of (XQ)'Y, the i-th element of c from the original equation
is m_i/d_i, and the i-th element of c from the ridge equation
is m_i/(d_i + h).
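
The diagonalization can be checked numerically. The sketch below
assumes m = (XQ)'Y, obtains Q and the d_i from an eigendecomposition
of X'X, and uses arbitrary simulated data with h = 2; none of these
specifics come from the discussion itself.

```python
import numpy as np

# Arbitrary simulated data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
Y = rng.normal(size=40)
h = 2.0

# X'X = Q diag(d) Q' with Q orthogonal (eigh handles symmetric matrices).
d, Q = np.linalg.eigh(X.T @ X)

# Right-hand side in the rotated coordinates: m = (XQ)'Y.
m = Q.T @ X.T @ Y

c_ols = m / d          # i-th element: m_i / d_i
c_ridge = m / (d + h)  # i-th element: m_i / (d_i + h)

# The same estimates computed directly in the original coordinates.
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
b_ridge = np.linalg.solve(X.T @ X + h * np.eye(4), X.T @ Y)
```

Rotating back, Q @ c_ols agrees with b_ols and Q @ c_ridge with
b_ridge, and each ridge component is the ordinary one multiplied
by d_i/(d_i + h).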

Again looking at the PC representation, d_i is (up to a
factor of the sample size, for centered X) the variance of
the i-th principal component, and h is constant. So if
d_i is small in comparison to h, the shrinkage is large.
In fact, replacing h component by component by 0 if d_i >= h
(no shrinkage) and by infinity if d_i < h (estimate set to 0)
at most doubles the Bayes risk for squared error loss.
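
The factor of two can be seen component by component. The sketch
below assumes unit error variance and independent priors with
variance 1/h on each c_i, so that the Bayes risks are 1/(d_i + h)
for ridge, 1/d_i for the unshrunk estimate, and 1/h for the estimate
set to 0. The function name and the sample d_i are made up.

```python
import numpy as np

def per_component_risks(d, h):
    # Ridge: posterior variance under prior variance 1/h, error variance 1.
    ridge = 1.0 / (d + h)
    # h -> 0: unshrunk estimate m_i/d_i, risk equals its sampling variance.
    keep = 1.0 / d
    # h -> infinity: estimate is 0, risk equals the prior variance.
    drop = np.full_like(d, 1.0 / h)
    # Keep the component when d_i >= h, drop it otherwise.
    threshold = np.where(d >= h, keep, drop)
    return ridge, threshold

d = np.array([0.01, 0.5, 1.0, 2.0, 100.0])
ridge, threshold = per_component_risks(d, h=1.0)
ratio = threshold / ridge
```

Under these assumptions the ratio works out to
1 + min(d_i, h)/max(d_i, h), which lies between 1 and 2 and
reaches 2 only when d_i = h.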

This is also intuitive. In the directions with d_i large,
the likelihood function changes rapidly, so the prior is
not of great importance. In the directions with d_i small,
the likelihood function changes slowly, so the data are not
of much importance there, while the prior is.

-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
hrubin@stat.purdue.edu         Phone: (765)494-6054   FAX: (765)494-0558

