Reef Fish Statistics for Dummies: Applied Simple Regression
- From: "Reef Fish" <large_nassua_grouper@xxxxxxxxx>
- Date: 4 Oct 2006 10:58:54 -0700
For those who have completed the first portion of any First Course in
Statistics to arrive at the "Simple Regression" topic, this lecture is
self-contained. It is elementary, but it contains some material used in
my advanced Graduate Courses in Data Analysis, in Math. Sciences
and Statistics Departments, because it is easy for the mathematically
inclined to neglect the APPLIED aspects of Statistics in general, and
the necessary methodology in Simple Regression in particular.
1. What is a "Simple Regression"?
Most regression problems are simple, in the English sense of the
word simple. In Statistics, the term "simple" in "simple regression" is
synonymous with "regression with ONE independent variable X"
in the model:
Y(i) = bo + b1 X(i) + e(i),          (1)
where e(i) is the random error of Y(i), independent and identically
distributed as a Normal random variable with mean 0 and variance
sigma^2, denoted as ~ i.i.d. N(0, sigma^2).
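To make model (1) concrete, here is a minimal Python sketch that generates data from it. The function name and the particular values of bo, b1, sigma and X are hypothetical, chosen only for illustration.

```python
import random

def simulate_simple_regression(x_values, b0, b1, sigma, seed=0):
    """Generate Y(i) = b0 + b1*X(i) + e(i), with e(i) ~ i.i.d. N(0, sigma^2)."""
    rng = random.Random(seed)
    return [b0 + b1 * x + rng.gauss(0.0, sigma) for x in x_values]

# Hypothetical inputs: the X values are simply given constants here.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = simulate_simple_regression(x, b0=2.0, b1=0.5, sigma=1.0)

for xi, yi in zip(x, y):
    print(f"X = {xi:.1f}   Y = {yi:.3f}")
```

Note that the randomness enters only through e(i); nothing is drawn for X.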
2. Model Assumptions in (1)
bo and b1 are unknown parameters to be estimated from DATA
(X(i), Y(i)), i = 1, 2, ..., n, ASSUMED to have the error structure
of e(i).
Because of the model assumption, before one does ANY analysis
or statistical inference, one must first "validate" the model
assumptions, because if the assumptions are wrong, then the
estimation and inference theory and methodology would not be
applicable.
Q. What STATISTICAL assumptions can we examine or
validate before we proceed with our application?
A. NOTHING. Nothing at all!
This is already the place where some of the graduate students
stumble in applying simple regression. They realize that each
Y(i) in the model comes from a Normal distribution with mean
bo + b1 X(i), and variance sigma^2, and they have both graphic
and analytic tests (SPSS, SAS, S, R, Maple, etc) for normality,
so they test the data values of Y for normality.
That is BLUNDER Number 1. While each Y comes from a
normal distribution, the dependent variable Y is a mixture of n
different normal distributions, and there is no reason why the
mixture Y should resemble data from a normal distribution at all.
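A small simulation illustrates the point. This is only a sketch under assumed values: X sits in two clusters (0 and 10), the true line is Y = 1 + 2X, and sigma = 1. Excess kurtosis is used here as one convenient numerical summary of non-normality (it is near 0 for normal data); any normality check would tell the same story.

```python
import random
from statistics import mean

def ols_fit(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def excess_kurtosis(values):
    """Fourth standardized moment minus 3; about 0 for normal data."""
    n, m = len(values), mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(1)
x = [0.0] * 500 + [10.0] * 500            # assumed design: two clusters of X
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

b0, b1 = ols_fit(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("excess kurtosis of raw Y:    ", round(excess_kurtosis(y), 2))
print("excess kurtosis of residuals:", round(excess_kurtosis(residuals), 2))
```

Every error in this simulation is exactly normal, yet the raw Y values are bimodal and fail any normality check, while the residuals behave as they should.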
Not only graduate students, but authors of statistics textbooks
sometimes make the same error. I caught the authors of the
textbook "Statistics for Business and Economics", Boston:
Allyn and Bacon, (1980), making that error in Figure 11-2,
on page 278, "Normal Distribution of the Population of Y
Regressed on X", suggesting by a sketch of a SINGLE normal
distribution that the observed Y in the aggregate should follow
a normal distribution.
The authors of that book were: Heitman, W.R. and Mueller, F.W.
Q. What about the distribution of X?
A. There is NO assumption about the distribution of X. They
can come from any distribution, and they can be fixed
constants or any given values.
It is BLUNDER Number 2 for anyone to examine the probability
distribution of X, or any "outlier" of X because they think X
should behave like a normal distribution, or any distribution.
The ONLY statistical ASSUMPTION behind a simple regression
model is that the ERRORS are ~ i.i.d. N(0, sigma^2), and there
is NOTHING you can do until you have tried some fit and have
observed the errors, in the form of "residuals" (left over from an
exact fit to a straight line).
The only thing one CAN, and SHOULD do, is to examine the
DATA for typographical and other non-statistical errors, verify
that they are indeed errors, and correct them before doing any
regression fit.
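One way to screen a column of numbers for gross typing errors is sketched below. It is not a statistical test of any distributional assumption; it only produces a list of entries to check against the source record. The median/MAD rule, the threshold of 10, and the column of values are all assumptions made for illustration (the column is hypothetical, with 15849 standing in for a misplaced decimal point).

```python
from statistics import median

def flag_gross_anomalies(values, threshold=10.0):
    """Return (index, value) pairs lying more than `threshold` median
    absolute deviations from the median. These are CANDIDATES to verify
    against the original record, not values to delete automatically."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - med) / mad > threshold]

# Hypothetical column of data containing one mistyped entry.
column = [1510.2, 1532.7, 1549.9, 1563.4, 15849, 1601.3, 1622.8]
print(flag_gross_anomalies(column))
```

A flagged value is corrected only after it has been verified to be an error.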
The Reef Fish archives come in handy:
http://groups.google.com/group/sci.stat.math/msg/47b76385a33b802b?hl=en&
The DATA was taken from the 1975 SPSS Manual in which
the data in that post was used to illustrate the output of a
Multiple Regression. That was where "Lesson #1" was
mentioned by me:
RF> LESSON #1. ALWAYS examine the data for gross
RF> (and not so gross) anomaly.
RF> Jerry and Russell are making a good start toward
RF> their Fish University "A".
The 15849 is of course an obvious typo, not by ME (it took me
about 10 minutes to type the data, about 30 minutes to write
a multiple regression program in SPEAKEASY because I have
NO access to any statistical package; and at least an hour to
find and correct the half a dozen or so typos of MINE <G> by
checking against the results I had done 30 years ago). The
typos were in the 1975 SPSS Manual!
======= end excerpt
The TYPO was what contributed to all THREE variables being
statistically significant in the SPSS Manual -- without it ...
that's the next chapter/Lesson. :-)
For Simple Regression, the dataset was used in the series of
lectures on Model Building to show that a Simple Regression
model was "better" than the SPSS's Multiple Regression Model
(with 3 independent variables) when the ERRORS in the DATA
were removed.
3. What is Step 2 of a Simple Regression Application?
THAT is where the Statistical Assumptions of the model are
examined after each attempted fit. The i.i.d. N(0, sigma^2)
assumption can be broken into three INDEPENDENT components:
1. Normality of the errors (residuals)
2. Independence of the errors (residuals)
3. Homoscedasticity (equal variances) of the errors (residuals)
All three MUST be satisfied before any statistical result of
a simple regression can be validly used. These are
independent assumptions in the sense that none implies
the other and no two of them imply the third.
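The three checks can be sketched numerically as follows. The particular diagnostics are common choices and an assumption here, not necessarily the ones used in the Model Building thread: skewness and excess kurtosis for normality, the Durbin-Watson statistic for independence, and a ratio of residual variances between the upper and lower halves of X for homoscedasticity. Residual plots remain the first thing to look at; the simulated data are hypothetical.

```python
import random
from statistics import mean

def ols_residuals(x, y):
    """Fit y = b0 + b1*x by least squares and return the residuals."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def skew_and_excess_kurtosis(res):
    """Normality: both values should be near 0 for normal errors."""
    n, m = len(res), mean(res)
    m2 = sum((r - m) ** 2 for r in res) / n
    m3 = sum((r - m) ** 3 for r in res) / n
    m4 = sum((r - m) ** 4 for r in res) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def durbin_watson(res):
    """Independence: near 2 when successive residuals are uncorrelated.
    Only meaningful if the data are listed in collection order."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    return num / sum(r * r for r in res)

def variance_ratio_by_x(x, res):
    """Homoscedasticity: mean squared residual in the upper half of X
    divided by that in the lower half; near 1 under equal variances."""
    ordered = [r for _, r in sorted(zip(x, res))]
    half = len(ordered) // 2
    lower = [r * r for r in ordered[:half]]
    upper = [r * r for r in ordered[half:]]
    return mean(upper) / mean(lower)

# Hypothetical data that satisfy all three assumptions.
rng = random.Random(2)
x = [i / 10 for i in range(200)]
y = [3.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

res = ols_residuals(x, y)
skew, kurt = skew_and_excess_kurtosis(res)
dw = durbin_watson(res)
ratio = variance_ratio_by_x(x, res)
print("skewness, excess kurtosis:", round(skew, 2), round(kurt, 2))
print("Durbin-Watson:", round(dw, 2))
print("variance ratio (upper X / lower X):", round(ratio, 2))
```

Because the three assumptions are independent of one another, passing one of these checks says nothing about the other two.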
Step 2 is what I had called LESSON 2 in the Model Building
thread, using the SPSS data, in the post
http://groups.google.com/group/sci.stat.math/msg/1fc722fa8dc6abf2?hl=en&
"LESSON 2 in Model Building. Iterative Loop of Sponsorship vs Critic"
The Applied Statistician (Data Analyst) first sponsors a model (our
initial simple regression model (1)). Once the model is FITTED,
the analyst must then act as his own CRITIC, to see if any of the
assumptions are violated. If so, he makes changes in the model,
does a new fit, and acts as a critic of his own model AGAIN. This is
the "iterative Loop of Sponsorship vs Critic" described in detail in
George Box's JASA (1976) article, "Science and Statistics".
The Data Analysis / SPSS / Model Building thread started on
March 17, 2005. By Jun 29 2005 2:06 am, we began the post
on LESSON 2. :-)
At this time, I'll skim over the steps of the validation of the THREE
assumptions and how each violation might be accommodated.
It was in the cited post above that I made this observation,
RF> It's interesting in a way that all FOUR of us found the same
RF> "hockey stick" in the scatterplot of the INVDEX variable vs the
RF> GNP variable. All four of us took DIFFERENT actions!!
RF> In that respect, if we were working as a TEAM, we would
RF> put our heads together on the four TENTATIVE models
RF> and decide what model to sponsor next (if any).
That is the most interesting and rewarding part of being a Data
Analyst or an Applied Statistician! There are no formulas that
tell you what to do -- there are GUIDELINES on what you should
avoid and what are valid continuations.
That's where the SCIENCE of Statistics blends with the ART
of Statistics, in which the "mathematical statistician" is most deficient.
You can find numerous posts of Reef Fish which cited particular
passages from George Box's JASA article -- which everyone
really should read, re-read, and re-re-read, very carefully every
time he analyses a real set of DATA. In my post below:
http://groups.google.com/group/sci.math/msg/548d518140187c36?hl=en&
I cited Box's indictment of "mathematical statistician's
mathematistry":
"Mathematistry is characterized by development of theory for
theory's sake, which since it seldom touches down with practice,
has a tendency to redefine the problem rather than solve it."
(p. 797, 1976 JASA paper on "Science and Statistics").
The accommodation of a model with misbehaving residuals is
commonly accomplished via transformation of either the
dependent OR the independent (or both) variable in the
model.
Mosteller and Tukey's book "Data Analysis and Regression"
has a nice Exhibit 1 in Chapter 4 "Straightening Curves and
Plots" showing a "continuum ladder of power transformation"
and where negative powers, root transformations and log
transformation fall in the diagrammed exhibit.
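As a rough sketch of how a ladder of powers can be tried on a curved scatter, the code below transforms Y by each rung and keeps the one whose plot against X is straightest. The specific rungs, the convention that power 0 means the log, and the use of the absolute correlation as a "straightness" score are all assumptions for illustration; the data are hypothetical and Y must be positive. The choice should always be confirmed by looking at the residuals of the refitted model.

```python
import math
from statistics import mean

# Assumed rungs of the ladder; 0 stands for the log transformation.
LADDER = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

def power_transform(v, p):
    """Apply power p to a positive value v, with p == 0 meaning log."""
    return math.log(v) if p == 0 else v ** p

def pearson_r(a, b):
    """Ordinary product-moment correlation of two equal-length lists."""
    abar, bbar = mean(a), mean(b)
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def straightest_power(x, y, ladder=LADDER):
    """Return the rung whose transform of y is most nearly linear in x."""
    return max(ladder, key=lambda p: abs(
        pearson_r(x, [power_transform(v, p) for v in y])))

# Hypothetical curved data: Y grows exponentially in X.
x = [float(i) for i in range(1, 21)]
y = [math.exp(0.3 * xi) for xi in x]

best = straightest_power(x, y)
print("rung giving the straightest plot:", best)
```

The same search can be run on X instead of Y, or on both, since either variable may be the one that needs re-expression.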
Going back to the SPSS example... Of those who dared to
show what they tried, NONE used any power transformation
to straighten out the "hockey stick" seen in the simple
regression scatter. :-)
There ain't such a thing as "cook book" or "recipe" in an
enlightened "data analysis", or "exploratory data analysis"
guided by both MODEL and DATA.
That's what Applied Statistics and Applied Simple Regression
are all about.
In the Preface to "Reef Fish Statistics for Dummies", I
said there will be very few formulas or equations. So far,
I have given only ONE, equation (1), which is the usual model
for a simple regression.
4. What comes next?
This is where, after the model and statistical assumptions
have been validated so that the theoretical results derived
by statisticians apply, almost everyone will have no trouble finding
the FORMULAS and EQUATIONS used in constructing
Confidence Intervals for the parameters, testing Statistical
Hypotheses about the intercept bo or the slope b1 in (1), and
obtaining prediction intervals for future observations.
These are formulas that are easy to derive and even easier
to apply, and they are the ones that all of my students have
access to in their OPEN BOOK and OPEN NOTES exams,
so I won't even bother to go over them here, once we have
carefully carried out the APPLIED steps 1 and 2 to ensure
that it is valid to apply the formulas and results.
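For reference, the standard textbook formulas can be collected in a few lines of Python. This is a sketch, valid only after steps 1 and 2 have been carried out. The function name, the small data set, and the point x0 are hypothetical; the critical value t_crit must be looked up in a t table with n - 2 degrees of freedom, because the Python standard library has no t quantile function.

```python
import math
from statistics import mean

def simple_regression_inference(x, y, t_crit, x0):
    """Estimates, t statistics, confidence intervals for b0 and b1, and a
    prediction interval for a future observation at x0, under model (1)."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)                       # estimate of sigma^2
    se_b1 = math.sqrt(mse / sxx)
    se_b0 = math.sqrt(mse * (1.0 / n + xbar ** 2 / sxx))
    y0 = b0 + b1 * x0
    se_pred = math.sqrt(mse * (1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx))
    return {
        "b0": b0, "b1": b1, "mse": mse,
        "t_b0": b0 / se_b0, "t_b1": b1 / se_b1,
        "ci_b0": (b0 - t_crit * se_b0, b0 + t_crit * se_b0),
        "ci_b1": (b1 - t_crit * se_b1, b1 + t_crit * se_b1),
        "prediction_interval": (y0 - t_crit * se_pred, y0 + t_crit * se_pred),
    }

# Hypothetical data; 3.182 is the 97.5% point of t with n - 2 = 3 d.f.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.0, 9.8]
out = simple_regression_inference(x, y, t_crit=3.182, x0=3.0)
for name, value in out.items():
    print(name, value)
```

The prediction interval is always wider than the confidence interval for the mean response at the same x0, because it includes the variance of the future error term itself.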
The only REMAINING STEP is outside of most Statistics
textbooks on regression, which is labeled as STEP 3 in
my Model Building lessons:
http://groups.google.com/group/sci.stat.math/msg/761da58a2262b6fc?hl=en&
"LESSON 3 in Model Building: Practical Significance"
There is an ongoing discussion in two of the three sci.stat groups
now, under a not-so-obvious topic of "confidence intervals". My
opening post
http://groups.google.com/group/sci.stat.consult/msg/7cd25dc2db74f282
blossomed into a mini-thread of 8 posts, based on my
statement in the initial post, in which I wrote:
RF> Any statisticians worth his salt would know that a highly
RF> "statistically significant" result can be completely
RF> worthless from a practical point of view of the usefulness
RF> of the result.
RF> Conversely, a statistical result that is not statistically
RF> significant at some .05 or .10 level can be very useful.
RF> The two concepts are TOTALLY different in terms of
RF> knowing how to apply statistics sensibly and usefully.
That is a VERY important principle for an APPLIED statistician,
which no "mathematical statistician" ever even thinks about.
The 1975 SPSS example was a good illustration of how a highly
statistically significant result can be completely USELESS in
practice, as demonstrated in the Model Building threads.
RF> In my re-analysis of the DATA in the SPSS Manual, I came to
RF> INVDEX = -197.51 + 0.018234 * GNP
RF>          (15.031)  (.000667)
RF>          T=-13.14  T=27.33
RF> p-value 10^(-12)
RF> Multiple R-sq = 0.9726, MSE = 317.96.
RF> All very impressive and highly statistically significant.
And I punctured the euphoria of any social scientist or
non-thinking statistician, in LESSON 3, showing that the
simple regression model was completely USELESS in
practice. :-)
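The quoted output can be checked by simple arithmetic, and the same arithmetic produces the one number that speaks to practical rather than statistical significance. The sketch below uses only the figures quoted above; it does not reproduce the argument of LESSON 3, and treating the root MSE as the quantity of practical interest is an assumption made here for illustration.

```python
import math

# Figures quoted from the re-analysis above.
b0, se_b0 = -197.51, 15.031
b1, se_b1 = 0.018234, 0.000667
mse = 317.96

# A t statistic is just the estimate divided by its standard error.
t_b0 = b0 / se_b0
t_b1 = b1 / se_b1

# Root MSE: the typical size of a residual, in the units of INVDEX.
resid_sd = math.sqrt(mse)

print("t for intercept:", round(t_b0, 2))
print("t for slope:    ", round(t_b1, 2))
print("residual standard deviation:", round(resid_sd, 2))
```

The t statistics and p-values say how well the slope is determined; the residual standard deviation says how far off an individual prediction of INVDEX is likely to be, and only the latter can be judged against what the prediction is needed for.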
I conclude this First Lesson of "Reef Fish Statistics for
Dummies" with the sig of one Dr. Flash Gordon, M.D.,
FG> in theory, there is no difference between
FG> theory and practice. but in practice, there is.
FG> flash gordon, m.d. f...@xxxxxxxxxxxxx
Flash is a PRACTICAL Man (and M.D.) of many interests:
-- Reef Fish Bob.