Re: Using Ridge Regression to disentangle highly correlated explanatory variables
- From: Old Mac User <chendrixstats@xxxxxxxxx>
- Date: Fri, 14 Mar 2008 09:15:32 -0700 (PDT)
On Mar 14, 12:12 pm, Old Mac User <chendrixst...@xxxxxxxxx> wrote:
On Mar 13, 9:41 pm, Richard Ulrich <Rich.Ulr...@xxxxxxxxxxx> wrote:
On Thu, 13 Mar 2008 13:37:38 -0700 (PDT), Old Mac User
<chendrixst...@xxxxxxxxx> wrote:
On Mar 13, 2:25 pm, JohnF <jf...@xxxxxxxxxxx> wrote:
Folks,
Need your advice and any practical solutions.
We recently conducted a retrospective regression analysis where 3
variables were highly correlated (high VIFs). Decided to use a
principal components approach to create a factor score for input into
the regression model, which did its job of greatly reducing the VIF.
However, the three highly correlated variables were each of great
interest. A colleague suggested using Ridge Regression to disentangle
the relative impact of each of the three explanatory variables. This
did show that one of the three variables was much more impactful.
Now I'm left wondering if this makes sense, given they were so highly
correlated to begin with. Wouldn't we conclude that they are all
contributing equally, i.e., that the factor loading can be divided
equally among the three variables in terms of relative impact?
What's your opinion on this type of issue? I need some practical
advice, point of view, and/or alternate approach to consider.
Remember that the three variables are each of particular interest, so
I need to somehow separate out their relative impact.
Very much appreciate any and all help. Thanks!
John
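For anyone who wants to reproduce the kind of setup described in this post, here is a minimal Python sketch. It uses simulated data, not John's data: the sample size, the noise level, and the helper name `vif` are all assumptions made for illustration. It computes the VIFs as the diagonal of the inverse correlation matrix and builds a factor score from the first principal component.

```python
import numpy as np

def vif(X):
    """VIF for each column: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

# Simulated stand-in data (assumption): three predictors sharing one component.
rng = np.random.default_rng(0)
n = 500
shared = rng.normal(size=n)
X = np.column_stack([shared + 0.1 * rng.normal(size=n) for _ in range(3)])

vifs = vif(X)

# First principal component of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues come back in ascending order
weights = np.abs(eigvecs[:, -1])       # the sign of an eigenvector is arbitrary

# Factor score: standardized predictors projected onto the first component.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
score = Z @ eigvecs[:, -1]

print("VIFs:", np.round(vifs, 1))
print("First-component weights:", np.round(weights, 3))
```

The component weights come out nearly equal here, but that only says the component treats the three predictors symmetrically. The component is built without looking at the response, so the weights are not evidence about each variable's effect on it.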
The fact that they are highly correlated means (among other things)
that you may as well pick one variable to use as a predictor and
ignore the others. However, recognize that in doing so you are
assuming that those high correlations are "permanent and enduring" and
will not disappear in future data. In other words, those high
correlations are structural, and not just due to chance.
When all is said and done, there's really nothing you can do to get
valid estimates of the effects of those three variables separately
from one another. My suggestion is to forget about ridge regression.
It's an endless swamp. OMU
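The point about not being able to separate the three effects can be illustrated with a small simulation. This is only a sketch under assumed conditions: the data are simulated, the true coefficients are set equal, and the sample size and noise levels are arbitrary choices. It refits ordinary least squares on many fresh samples and compares how much the individual coefficients vary with how much their sum varies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 300
true_beta = np.array([1 / 3, 1 / 3, 1 / 3])   # assumed truth: equal effects

coefs = np.empty((reps, 3))
for i in range(reps):
    # Three predictors that are almost copies of one shared component.
    shared = rng.normal(size=n)
    X = np.column_stack([shared + 0.1 * rng.normal(size=n) for _ in range(3)])
    y = X @ true_beta + rng.normal(size=n)

    # OLS with an intercept; keep only the three slopes.
    A = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    coefs[i] = b[1:]

sd_individual = coefs.std(axis=0)       # spread of each slope across samples
sd_sum = coefs.sum(axis=1).std()        # spread of the sum of the slopes

print("SD of each slope:", np.round(sd_individual, 3))
print("SD of the sum of slopes:", round(float(sd_sum), 3))
```

In this setup the combined effect is estimated far more precisely than any single slope, which is the sense in which the data can speak to the group of variables but not to its members.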
"Endless swamp" suggests a big waste of time and energy,
so that may be a pretty good warning.
However, I will offer a couple of additional comments.
Ridge Regression has been characterized, mathematically,
as being a particular, weighted average of all the subsets
of possible regressions, taking one, two, ..., n, variables at a
time. The final regression differs from the individual
regressions most especially when some variables act as
suppressors, that is, enter with the 'wrong sign'.
RR is also characterized as practically including a set of
outcomes that are exactly at the middle, in order to
nullify some of the effects that are present. This makes
it a biased procedure, which throws away information.
Its erstwhile popularity relies on sometimes producing
equations that have better replication than OLS regression.
Ridge Regression will not make much useful difference
when all the variables enter in their natural
direction -- that is, when the b's have the same signs
as the zero-order correlations. In this simple case, *if*
you want an arbitrary index of "importance", you might
try using the decomposition of R^2 into the sum of
the quantities, r_0 times beta_n (using the initial
correlation and the final, standardized beta). This *ought*
not to tell you much that you do not already see by looking at
the standardized betas; if it does, be careful of it. If there
are suppressor variables, then there are apt to be (depending
on the careful definition of "suppressor") terms
that are negative or terms that are greater than 1.0.
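The decomposition described above, R^2 as the sum of (zero-order correlation) times (standardized beta), can be checked numerically. The following is a minimal sketch on simulated data; the coefficients, noise level, and variable names are assumptions for illustration only.

```python
import numpy as np

# Simulated data (assumption): three moderately correlated predictors.
rng = np.random.default_rng(1)
n = 400
shared = rng.normal(size=n)
X = np.column_stack([shared + 0.5 * rng.normal(size=n) for _ in range(3)])
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(size=n)

# Standardize predictors and response (population SD, ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
zy = (y - y.mean()) / y.std()

r0 = Z.T @ zy / n                               # zero-order correlations with y
beta, *_ = np.linalg.lstsq(Z, zy, rcond=None)   # standardized betas
terms = r0 * beta                               # one term per predictor

# R^2 computed the usual way, from the residuals.
resid = zy - Z @ beta
r_squared = 1.0 - (resid @ resid) / (zy @ zy)

print("terms:", np.round(terms, 3))
print("sum of terms:", round(float(terms.sum()), 4))
print("R^2:", round(float(r_squared), 4))
```

The terms always add up to R^2, but nothing forces each one to lie between 0 and R^2, which is why a negative or oversized term is a warning sign rather than an "importance".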
If there *is* a suppressor variable, the way to achieve
robust prediction is to get rid of the variable by creating
the logical combination that it (implicitly) suggests,
assuming that it suggests one. If a *difference* is what
is predictive, then it makes sense to state that
difference, to measure *that* importance, and to track that
score ... in order to check (for one thing) whether the
difference itself stays inside the usual bounds, or
looks like an outlier that ought to be eyeballed closely.
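A small sketch of the "state the difference" idea, on simulated data. Everything here is an assumption for illustration: two predictors `x1` and `x2`, a response that depends on their shared level and on `x1 - x2`, and the helper name `ols`. With the raw variables, `x2` correlates positively with `y` yet enters with a negative coefficient; rewriting the model in terms of a level and a difference gives the same fit with both coefficients in their natural direction.

```python
import numpy as np

def ols(A, y):
    """Least-squares coefficients; A must already contain an intercept column."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

rng = np.random.default_rng(2)
n = 500
shared = rng.normal(size=n)
x1 = shared + 0.3 * rng.normal(size=n)
x2 = shared + 0.3 * rng.normal(size=n)
# Assumed truth: y depends on the shared level and on the difference x1 - x2.
y = shared + 3.0 * (x1 - x2) + rng.normal(size=n)

ones = np.ones(n)
A_raw = np.column_stack([ones, x1, x2])
b_raw = ols(A_raw, y)                    # intercept, slope for x1, slope for x2
r_x2_y = np.corrcoef(x2, y)[0, 1]        # zero-order correlation of x2 with y

# Same information, restated as a level and a difference.
level = (x1 + x2) / 2.0
diff = x1 - x2
A_new = np.column_stack([ones, level, diff])
b_new = ols(A_new, y)

fit_raw = A_raw @ b_raw
fit_new = A_new @ b_new

print("raw slopes (x1, x2):", np.round(b_raw[1:], 2), " r(x2, y):", round(float(r_x2_y), 2))
print("restated slopes (level, diff):", np.round(b_new[1:], 2))
```

The two models give identical fitted values, since one is a linear re-expression of the other. What changes is that `diff` is now an explicit score that can be tracked and checked for outliers.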
--
Rich Ulrich
http://www.pitt.edu/~wpilib/index.html
Rich...
It's always good to read your comments.
Re: Ridge Regression...
My contention is that... when all is said and done... the "answer" you
get via RR ultimately depends upon how you treated the data. Hence
"the answer" is arbitrary. Out of the multiplicity of possible
"answers", you (the analyst) have picked the one you like best.
My further contention is this. I see entirely too many instances
(posts here, for example) in which someone has gained access to
upscale statistical software. There's an awful tendency to try this
and try that (PCA, Regression on PCA, RR, etc. come to mind) with
minimal understanding of the possible consequences of their choices.
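One concrete consequence of such choices can be sketched in a few lines. The example below is an illustration on simulated data, not a reconstruction of any analysis in this thread: the true coefficients, the penalty grid, and the helper name `ridge_coefs` are assumptions. It computes the closed-form ridge solution for several penalty values and reports each coefficient's share of the total.

```python
import numpy as np

def ridge_coefs(Z, y, lam):
    """Closed-form ridge solution (Z'Z + lam*I)^-1 Z'y.

    Assumes Z is standardized and y is centered, so no intercept is needed.
    """
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

# Simulated data (assumption): three nearly collinear predictors.
rng = np.random.default_rng(3)
n = 200
shared = rng.normal(size=n)
X = np.column_stack([shared + 0.1 * rng.normal(size=n) for _ in range(3)])
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(size=n)

Z = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]   # arbitrary grid; 0.0 is plain OLS
path = np.array([ridge_coefs(Z, yc, lam) for lam in lambdas])
shares = path / path.sum(axis=1, keepdims=True)

for lam, b, s in zip(lambdas, path, shares):
    print(f"lambda={lam:7.1f}  coefs={np.round(b, 3)}  shares={np.round(s, 3)}")
```

The apparent "relative impact" of each variable moves with the penalty, and with a heavy enough penalty the shares drift toward equal thirds for predictors this strongly correlated. Whichever row gets reported is a choice made by the analyst.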
Moreover, if I am engaged as a statistician/analyst at whatever stage
(planning, design, ultimately analysis) then the burden is on me to
explain how we arrived at "the answer".
If I can't explain it to my peers, customers, "the people who are
paying the bill", then it's for naught. Somewhere along the way I
came to this: "Statistics is about communicating formation from one
person to another person". If I can distort "the answer" by using
weights, a particular choice of methodology, or by trying several
"methods" until I evolve to the one I like the best... then a full
disclosure of what I did will be somewhere between confusing a lie.
Harsh? It sure is. Perhaps I'm jaded from seeing too many instances
in which well-intentioned people (armed with the latest and greatest
software) severely wounded themselves up to and including degrading
their careers. One of my "better" examples: An engineer who gave a
presentation in which he proclaimed that "variables that have high p-
values are significant and those with low p-values are not
significant." I have a file... a collection... of real-life examples
like this. Some appeared in R/D Project reports including "final
reports" that were widely distributed and were put into permanent
library files. (In short, no way to go back and correct them.)
As always, your comments are welcome and respected.
OMU
Small corrections...
This...
"Statistics is about communicating formation from one
person to another person". If I can distort "the answer" by using
weights, a particular choice of methodology, or by trying several
"methods" until I evolve to the one I like the best... then a full
disclosure of what I did will be somewhere between confusing a lie.
Should have said...
"Statistics is about communicating information from one
person to another person". If I can distort "the answer" by using
weights, a particular choice of methodology, or by trying several
"methods" until I evolve to the one I like the best... then a full
disclosure of what I did will be somewhere between confusing and a
lie.
OMU