Re: Fitting Data to a Multiple Regression Model: A Challenge
- From: "Reef Fish" <Large_Nassau_Grouper@xxxxxxxxx>
- Date: 28 Jun 2005 20:39:41 -0700
Reef Fish wrote:
> G Robin Edwards wrote:
> > In article <1119851864.775843.222450@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
> > Reef Fish <Large_Nassau_Grouper@xxxxxxxxx> wrote:
> > > On June 19, Robin Edwards wrote, regarding a data set in SPSS
> > > I discussed, involving "model building",
> >
> > Many thanks for providing this data set, Bob. Seems like I started
> > quite a hare with my simple request!
> >
> > I have snipped all the comments, the data and everything. It has been
> > much repeated in the other postings.
> >
> > As I wrote a few days ago I have had a first look at the data, and I
> > kept a log of my operations. I should point out that I deliberately
> > avoided reading any replies to Bob's post before doing anything with
> > the data.
>
> Excellent!!
>
> I was somewhat worried about the unfair influence (good or bad) by
> others. I tried to refrain from saying much, but what others had
> said about what they found is hard to ignore unless you don't read it.
>
>
> > My journal, below, thus knows nothing of all the words
> > posted after RF's data arrived. I've now read the ones I downloaded
> > yesterday evening (27 June). As I write it is Tuesday evening, 28
> > June, and I have not downloaded anything today.
>
> Nothing happened today (in non time-series) except your post now.
> I'll withhold comments until I get something more from Jerry Dallal
> (I hope he'll find time to add more to what he had already done)
> and anyone else.
>
> >
> > Here's my journal:-
> >
> > *********************************
> >
> > Data provided by Bob on 27 June 05
> >
> > I shall look at this before reading others' posts.
> >
> > 1. Scan (eyeball) the data. No missing values. Good! Clearly time
> > series, so could mean trouble.
>
> Both good observations.
> >
> > 2 Notice that it is reminiscent of the famous Longley data set.
>
> Not in the multicollinearity sense. BTW, here's a "side lesson":
>
> Highly correlated independent variables (such as r > .9) do
> not NECESSARILY imply collinearity problems. On the other
> hand you MAY have a singular correlation matrix even if ALL
> of the pairwise correlations are < .2 say.
>
> >
> > 3 Import into 1st (my stats software).
> >
> > 4 Run naive multiple regression. Note that the software produces a
> > warning message. "Very possibly correlated independent variables.
> > Check regression diagnostics." Thus viewed the initial run as of
> > doubtful value.
>
> See my "side lesson" above. Your PROGRAM may be issuing FALSE to
> MISLEADING warnings. ANOTHER abuse of the "correlation coeff". :-)
>
> The software must examine the EIGENVALUES of X'X to correctly
> detect multicollinearity conditions/problems!!
>
> I'll stop my comments here. Will resume with the rest of your
> analysis/results at the conclusion of the Million $ Challenge. :-)
>
> Thanks for your effort and interest. I think almost everyone
> will learn SOMETHING from it. The misunderstanding of the
> detection and effects of multicollinearity ranks among the
> highest "regression abuses" I know (next to the "expected
> sign" fallacy).
>
> Stay tuned under the LESSON 2 thread.
It doesn't appear there'll be any more entries. I suspect Jerry
is either too busy or his knot and spline software has a feature
for him to get prediction intervals or PRACTICAL significance
assessments.
So, I'll go ahead and finish commenting on your analysis here,
then continue my "lessons" without Jerry. He can always do more
after I show what I did (30 years ago), which was a better fit than
his.
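
The "side lesson" quoted above can be checked with a small constructed example. The sketch below is an illustration only, not taken from the data set under discussion: it assumes seven variables that share a common pairwise correlation of -1/6, and then a separate pair of predictors with r = 0.9. The helper name `equicorrelation` is hypothetical.

```python
from fractions import Fraction

def equicorrelation(p, r):
    """p x p matrix with 1 on the diagonal and r everywhere else."""
    return [[Fraction(1) if i == j else r for j in range(p)] for i in range(p)]

# Case 1: every pairwise correlation is small, yet the matrix is singular.
p = 7
r = Fraction(-1, p - 1)            # -1/6, about -0.167
R = equicorrelation(p, r)
largest_abs_r = max(abs(R[i][j]) for i in range(p) for j in range(p) if i != j)

# Multiplying R by a vector of ones gives the row sums. If they are all
# zero, the ones vector is an eigenvector with eigenvalue 0, so R is singular.
R_times_ones = [sum(row) for row in R]

# Case 2: two predictors with r = 0.9. The 2 x 2 correlation matrix has
# eigenvalues 1 + r and 1 - r, neither of which is close to zero.
r2 = 0.9
eigenvalues = (1 + r2, 1 - r2)
vif = 1 / (1 - r2 ** 2)                                 # same for both predictors
condition_number = (eigenvalues[0] / eigenvalues[1]) ** 0.5

print("largest |r|:", float(largest_abs_r))
print("R times ones:", [int(v) for v in R_times_ones])
print("eigenvalues at r = 0.9:", eigenvalues)
print("VIF:", round(vif, 2), "condition number:", round(condition_number, 2))
```

In the first case no pairwise correlation reaches .2 in absolute value, but the smallest eigenvalue is exactly zero. In the second case r = 0.9 gives a VIF a little above 5, which is far milder than the singular case. A screen based on pairwise correlations alone would rank the two cases the wrong way round.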
>
> > 5 So, computed regression diagnostics. Warnings about very high
> > multiple correlation coeffs and the equivalent variance inflation
> > factors. GNP and C.PROF have VIFs of 14.3 and 12.2, so one of them is
> > effectively redundant as a potential predictor. C.DIVD has VIF of 3.64
> > in this company. Note that Row 26 is a highly influential point ("HAT"
> > value 2.14, with next highest 1.199). Looks like an "outlier".
> >
> > 6 Look at Row 26. Ha ha! There it is. 15849. Clearly a typo.
> > Should be 25849.
> >
> > 7 Repair data set.
You were a bit late here. I had corrected that value at about 10 am
the same morning I posted the original data at 1:59 am. So, you must
have stopped looking as soon as you saw my original data.
In any event data examination and graphical displays should have
been done first, before doing any computation such as correlations.
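
For readers less familiar with the VIFs quoted in step 5: the usual definition is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. A minimal sketch, assuming Robin's software uses that standard definition, backs out the implied R_j^2 from the three reported values. The function name is hypothetical.

```python
def r_squared_from_vif(vif):
    """Invert VIF = 1 / (1 - R^2) to recover the R^2 of a predictor on the others."""
    return 1.0 - 1.0 / vif

# VIFs as reported in step 5 of the journal.
reported_vif = {"GNP": 14.3, "C.PROF": 12.2, "C.DIVD": 3.64}
implied_r2 = {name: r_squared_from_vif(v) for name, v in reported_vif.items()}

for name, r2 in implied_r2.items():
    print(f"{name}: VIF = {reported_vif[name]}, implied R^2 = {r2:.3f}")
```

Under that assumption, GNP and C.PROF are each about 92 to 93 percent explained by the remaining predictors, and C.DIVD about 73 percent.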
> >
> > 8 Repeat operations with new data. Similar (but of course different)
> > diagnostics on VIFs. Row 26 is no longer influential. The most
> > influential points are 31 and 32.
If you had done some scatter plots of INVDEX vs the other variables,
you would have noticed the obvious "elbow" Jerry found, and the same
elbow AUTOBOX found using time-series methods. See LESSON 2 for
details.
> >
> > 9 Look at INVDEX as a time series by Cusum plot. Noted possible steps
> > at 1950, 1954, 1960 and possibly 1963.
> >
> > 10 Generate multi-plot of all four variables plus Year (10 plots on
> > one diagram). This shows clearly that the predictor most likely to be
> > useful as a model for INVDEX is C.DIVD. The other two (closely similar
> > as noted in the regression diagnostics) have a "hook", or, dare I say
> > it, a hockey stick, shape when INVDEX is plotted against them. Year
> > gives a similar but less angular plot.
Ah, you DID notice the "hockey stick" which I called the "elbow", but
you let the golden goose walk by and grabbed the quacking duck
instead! :-) Again, see my continuation of LESSON 2.
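
One common way to exploit an "elbow" of this kind is a linear spline with a single knot: y = a + b*x + c*max(0, x - knot), with the knot chosen by a grid search on the residual sum of squares. The sketch below is an assumed technique run on SYNTHETIC data with a known break at x = 5. It is not the actual data, and not necessarily what Jerry or AUTOBOX did. The helper names `solve` and `fit_hinge` are hypothetical.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(col + 1, n):
            f = m[k][col] / m[col][col]
            for c in range(col, n + 1):
                m[k][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_hinge(xs, ys, knot):
    """Least-squares fit of y = a + b*x + c*max(0, x - knot); returns (coef, SSE)."""
    rows = [[1.0, x, max(0.0, x - knot)] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    coef = solve(xtx, xty)
    sse = sum((y - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, y in zip(rows, ys))
    return coef, sse

# Synthetic data: slope 2 up to x = 5, slope 0.5 afterwards.
xs = [float(i) for i in range(11)]
ys = [1.0 + 2.0 * x - 1.5 * max(0.0, x - 5.0) for x in xs]

# Grid search over interior candidate knots, keeping the smallest SSE.
candidates = xs[2:-2]
best_knot = min(candidates, key=lambda k: fit_hinge(xs, ys, k)[1])
coef, sse = fit_hinge(xs, ys, best_knot)
print("best knot:", best_knot, "coefficients:", [round(c, 3) for c in coef])
```

With real data the candidate knots would be the observed predictor values, and the comparison against a straight-line fit would be made on MSE.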
> >
> > 11 Try a regression of INVDEX on C.DIVD. Produces adj R-Sq 0.8736, t
> > value for C.DIVD 14.65.
> > Forecast for 337.75 (mean of C.DIVD) of 176.9; L and U 95% interval for
> > a further single point is 94.46 to 259.4. For C.DIVD = 700 (a
> > reasonable extrapolation) values are 400.8, 494.3 and 587.9. The
> > regression plot looks fine.
> >
> > 12 Try a regression with C.DIVD and GNP. Adj R-Sq 0.93585 - looks
> > good! But is it? Forecast value for mean of GNP and C.DIVD gives
> > 118.2, 176.9 and 235.7, noticeably better than the simple regrn.
> > Now try C.DIVD 700 with GNP at its mean of 19314. Values are 262.86
> > (lower 95%), forecast 348.7 and upper 95% 434.6. These are nonsense!
> > No doubt the reason is the very high multiple correlations, of which
> > I've had warnings. Haven't tried C.PROF, but the result will be
> > almost exactly like the model with GNP in it. Good results very
> > close to the mean values and meaningless forecasts elsewhere.
These are nice exploratory steps. Unfortunately, you did not take
advantage of the "golden goose" and ended with this model:
> >
> > 13 My current choice for the best model is just
> >
> > INVDEX = -119 + 0.87621*C.DIVD.
This was based on all 32 observations, and would have yielded an
MSE of 1582, which is almost 5 times the MSE (or RMS) of 335 Jerry
got with GNP alone as the predictor, which was also about HALF the
RMS of 622 of the SPSS-like multiple regression model with the
kitchen sink thrown in.
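
The figures above hang together, as a quick arithmetic check shows. The sketch below uses only numbers quoted in this thread, plus two assumptions: that the 95% interval in step 11 is the standard prediction interval for a single new observation, whose half-width at the mean of the predictor is t * s * sqrt(1 + 1/n), and that the t critical value for 30 degrees of freedom is 2.0423. The fitted intercept is quoted only to the nearest unit, so agreement is approximate.

```python
import math

def predict_invdex(c_divd, intercept=-119.0, slope=0.87621):
    """Point forecast from the model quoted in step 13."""
    return intercept + slope * c_divd

at_mean = predict_invdex(337.75)   # journal reports 176.9
at_700 = predict_invdex(700.0)     # journal reports 494.3

# Back out the residual variance from the interval reported at the mean
# of C.DIVD (94.46 to 259.4), assuming a standard prediction interval.
n = 32
t_crit = 2.0423                    # 97.5% point of Student's t, 30 df
half_width = (259.4 - 94.46) / 2.0
s = half_width / (t_crit * math.sqrt(1.0 + 1.0 / n))
implied_mse = s ** 2               # compare with the MSE of 1582 quoted above

# Ratios between the three MSE figures quoted above.
ratio_vs_gnp_alone = 1582 / 335    # "almost 5 times"
ratio_gnp_vs_full = 335 / 622      # "about HALF"

print(round(at_mean, 1), round(at_700, 1))
print(round(implied_mse), round(ratio_vs_gnp_alone, 2), round(ratio_gnp_vs_full, 2))
```

The implied MSE comes out close to 1582, so the interval Robin reported and the MSE quoted here describe the same fit.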
> >
> > I'll do a bit more thinking about this, but can't hold out much hope of
> > an improvement. Maybe inspiration or advice will come from someone.
> >
> > I'll post this and then have a look at all the other contributions, to
> > see where I've gone wrong.
> >
> > **************************
> >
> > That's what I wrote as I went along with the analyses earlier this
> > evening.
Very nicely documented. It helps others see your thought process, as well
as where and how you missed the boat, so to speak, when you read my
LESSON 2.
> >
> > So, you can start shooting.
Sorry, the golden goose already walked away. :-)
> >
> > I should point out that I'm not a statistician - a mere long retired
> > industrial chemist, who came across stats via the experiment design
You certainly showed much better insight and thoughtfulness in your
exploratory steps than MOST "applied statisticians" would have shown;
they tend to work like the SPSS Manual example -- "Garbage IN, Garbage Out".
They might even get busy discussing whether the SIGN of one of the
coefficients is right or not. :-)
> > route, in 1956, from a book by Brownlee "Industrial Experimentation"
> > which was written to help industry during WW2. Looked dry as dust -
> > especially to someone who is no natural mathematician, but I liked the
> > notions of ANOVA and fractional factorials. Thought they might save me
> > some work!
> >
> > I've looked at the original postings - very interesting!
> >
> > Now to send this and download all the newer postings.
> >
> > Cheers, Robin
Data analysis and model-building are things that always have UNIQUE
features in every data set, and only those trained to look out for
them and take advantage of any "golden goose" they see during the
iterative process can consistently do well.
Thanks to your voluntary participation (at the risk of being shot),
I believe you've contributed more than you realized, to help OTHERS
think more about what THEY might do, the next time they get hold
of ANY multiple regression data set, and to see that life is much more
interesting and fruitful than just throwing all the variables into
a large scale model, and looking only at correlations and coefficient
signs.
Now you, or any reader who is following this EXERCISE, may continue
reading my continuation of the LESSON 2 thread, on this same data set.
-- Bob.