Re: theta = (X'X)^-1*X'y




Jack Tomsky wrote:

Once, my company had a program where they were
fitting a nonlinear least-squares curve through a
bunch of points in which the number of points was
less than the number of coefficients. I wondered
how the algorithm was able to converge to a unique
solution. There should be an infinite number of sets
of coefficients that would fit exactly through the
data points.

In examining the code, I saw that at each step in
the algorithm, the diagonals of XX' were multiplied by
a factor, 1+delta. This added condition imposed a
unique solution.

I've not seen the kind of constraint you are talking about in what's
called Ridge Regression.

First of all, in Ridge Regression, the X'X matrix is NOT singular, and
often far from singular.
Instead of inverting X'X, there are TWO kinds of Ridge regression:

The ordinary Ridge adds a fixed lambda to the DIAGONAL of the X'X
matrix before inverting it, i.e., it's no longer the inverse of X'X,
but the inverse of (X'X + lambda*I), where I is the identity matrix,
and (X'X + lambda*I)^(-1) * X'Y is then called the Ridge regression
estimated betas.

The second kind is the so-called generalized ridge regression, where
the diagonal matrix diag(lambda1, lambda2, ..., lambda p) is added to
the pxp matrix X'X before that matrix is inverted.

In these settings, there is a similarity between Ridge and Stein's
shrinkage estimator, for neither imposes any unique solution to the
problem -- in fact there are always infinitely many Ridge Regression
solutions!


In an attempt to gain some insight into how to
interpret the solution, I stripped down the problem
to the simple case where we have a single point (x1,
y1) and are fitting a two-parameter regression line:
y = a+bx.

In this case, the X matrix would be (1, x), and X'X would be the
matrix with 1 x in the first row and x x^2 in the second.

The X'X + kI matrix would be

    ( 1+k    x     )
    ( x      x^2+k )

the determinant of which is (1+k)(x^2+k) - x^2 = k(1 + x^2 + k), and
the inverse would be

    ( x^2+k   -x  )
    ( -x      1+k )

divided by the determinant, and finally the inverted matrix is
multiplied by X'Y, which is (y xy)'

none of which is at all like what you said in the lines below:
In particular your XX' would be a 1 x 1 matrix (1 + x^2) and
multiplying it by (1 + delta) will only give a scalar
(1 + x^2)(1 + delta), ...
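
For anyone who wants to check the single-point ridge calculation numerically, here is a minimal NumPy sketch. It assumes the ridge form (X'X + kI)^(-1) X'Y with the intercept penalized along with the slope. The values x1 = 2, y1 = 3 and the function name ridge_single_point are made up for illustration and do not come from the thread.

```python
import numpy as np

def ridge_single_point(x1, y1, k):
    """Ridge estimate (X'X + kI)^(-1) X'Y for one point and the line y = a + b*x."""
    X = np.array([[1.0, x1]])            # 1 x 2 design matrix
    Y = np.array([y1])
    XtX = X.T @ X                        # 2 x 2 and singular: [[1, x1], [x1, x1^2]]
    return np.linalg.solve(XtX + k * np.eye(2), X.T @ Y)

x1, y1 = 2.0, 3.0                        # illustrative values only
for k in (1.0, 0.1, 1e-6):
    a, b = ridge_single_point(x1, y1, k)
    closed_form = np.array([y1, x1 * y1]) / (1.0 + x1**2 + k)
    print(k, a, b, np.allclose([a, b], closed_form))

# Minimum-norm exact fit through the single point, for comparison as k -> 0.
min_norm = np.linalg.pinv(np.array([[1.0, x1]])) @ np.array([y1])
print(min_norm)
```

In this one-point case the estimate works out to a = y1/(1 + x1^2 + k) and b = x1*y1/(1 + x1^2 + k), so as k goes to zero it tends to the minimum-norm line through the point.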

Multiplying the diagonals of XX' by 1+delta, we end
up with the unique regression line,

y = [y1 + (y1/x1)*x]/(2+delta)

This line varies with different delta. As delta
goes to zero, this line connects the data point
(x1,y1) with (0,y1/2).
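
The line above can be reproduced with a short NumPy sketch, reading "the diagonals of XX'" as the diagonal of the 2 x 2 normal-equations matrix for (a, b). That reading is an assumption, as are the function name scaled_diagonal_fit and the sample values. Multiplying the diagonal by 1+delta looks like the diagonal scaling used in Marquardt's version of the Levenberg-Marquardt method, though the thread does not confirm that this is what the program did.

```python
import numpy as np

def scaled_diagonal_fit(x1, y1, delta):
    """Solve the normal equations after multiplying the diagonal by (1 + delta).

    Assumes x1 != 0 so that both diagonal entries are positive.
    """
    X = np.array([[1.0, x1]])
    Y = np.array([y1])
    XtX = X.T @ X                                   # singular 2 x 2 matrix
    damped = XtX + delta * np.diag(np.diag(XtX))    # diagonal times (1 + delta)
    return np.linalg.solve(damped, X.T @ Y)

x1, y1 = 2.0, 3.0                                   # illustrative values only
for delta in (1.0, 0.1, 1e-6):
    a, b = scaled_diagonal_fit(x1, y1, delta)
    expected = np.array([y1, y1 / x1]) / (2.0 + delta)
    print(delta, a, b, np.allclose([a, b], expected))
```

As delta shrinks, the intercept tends to y1/2 and the slope to y1/(2*x1), so the fitted line passes through (0, y1/2) and (x1, y1).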

I don't think that's what a Ridge Regression does.

Obviously when you start off setting delta to zero,
XX' is a singular matrix.

In a Ridge Regression, setting the ridge parameter k (or lambda) to
zero is doing an ordinary least squares regression, which for most
regressions, multicollinearity notwithstanding, is seldom if ever
SINGULAR.

So, I am quite sure we are talking about rather different regressions
that somehow got the same name "Ridge".

-- Reef Fish Bob.

This means that there is no unique solution.
However, when you multiply the diagonals by 1+delta
and set delta equal to zero at the end, you end up
with this unique line.

I never really understood why you should end up
with this particular line.

Jack



In ridge regression, a constant is added to the diagonals.

before (X'X + kI) is inverted, and multiplied by X'Y. That's why when
k=0, it reduces to the OLS solution. I am 100% sure this is called
Ridge Regression, because I had directed a doctoral student on the
subject and had rejected papers submitted to journals AND submitted to
NSF for funding, and my only reason for rejection was HOW can the
proposer justify what SIGN they can expect of a particular coefficient
when there are usually quite a few in their models.

In all those cases, they CAN'T. They were committing the same
"expected sign" fallacy in multiple regression as most of the social
scientists do (thinking of only simple correlation). These same folks,
after committing their expected-sign fallacy, THEN thought Ridge
Regression had come to their rescue by forcing any sign in the
direction they want, not knowing why they shouldn't!
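
The point about k=0 reducing to OLS is easy to check numerically. Below is a minimal sketch on simulated data. The data, the seed, and the helper name ridge are all made up for illustration, and the intercept column is penalized too, since the formula (X'X + kI)^(-1) X'Y is applied literally.

```python
import numpy as np

def ridge(X, y, k):
    """Ridge estimate (X'X + kI)^(-1) X'y; k = 0 gives the normal-equations OLS solution."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])   # full column rank
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=20)

ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(ridge(X, y, 0.0), ols))                        # same as OLS at k = 0
print(np.linalg.norm(ridge(X, y, 10.0)), np.linalg.norm(ols))    # shrinkage for k > 0
```

With a full-rank X the two estimates agree at k=0, and a positive k pulls the coefficient vector toward zero.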

This type of regression has a different name, but I've forgotten it and am too lazy to look it up. The diagonals are multiplied by the same factor.

The original nonlinear model had five coefficients and was intended to be applied to fit a curve through six absorbance rates from six calibrators. They then applied it to a different product which had only four calibrator readings. The fascinating thing to me was finding out why the regression didn't blow up.
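
One plausible reading of why the fit did not blow up, sketched in NumPy: with four readings and five coefficients the Gauss-Newton matrix J'J is singular, but multiplying its diagonal by 1+delta makes it invertible. The Jacobian below is random and purely hypothetical. It stands in for the real model, which is not given in the thread.

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.normal(size=(4, 5))        # hypothetical Jacobian: 4 readings, 5 coefficients
JtJ = J.T @ J                      # 5 x 5 but rank at most 4, hence singular

delta = 1e-3
damped = JtJ + delta * np.diag(np.diag(JtJ))    # diagonal multiplied by (1 + delta)

rank_plain = np.linalg.matrix_rank(JtJ)
rank_damped = np.linalg.matrix_rank(damped)

residual = rng.normal(size=4)                   # hypothetical residual vector
step = np.linalg.solve(damped, J.T @ residual)  # the damped step is well defined

print(rank_plain, rank_damped)
print(step)
```

The damped matrix is the sum of a positive semidefinite matrix and a positive diagonal one, so it is positive definite for any delta > 0 as long as no column of J is entirely zero.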

Jack

That may be the difference also. Ridge Regression (the only kind
I know) only deals with multiple LINEAR regression problems, hence
the inversion of the (X'X) matrix to get the regression estimates.
There is no such analogue in NONLINEAR regression problems.

-- Reef Fish Bob.
