Re: interaction term in linear regression with a dummy coded predictor
- From: Old Mac User <chendrixstats@xxxxxxxxx>
- Date: Sun, 15 Mar 2009 13:29:18 -0700 (PDT)
Here is a modest example to chew on.
This is a 2-level factorial design
in two variables X1 and X2. It doesn't
get much easier than this.
These are "synthetic" data.
Here's the data file set up for analysis.
In this format the variables X1 and X2
will be properly centered. The model coefficients
for the primary variables will be "the average
slopes."
Interaction example
//
X1, X2, Y
//
X1 = X1 - 1500 Variables centered
X2 = X2 - 60 in this image
X12 = X1*X2
//
X1, Pressure
X2, Temperature
X12, PT interaction
Y, Response
//
X1 X2 Y
1000 40 22.5
1000 40 18.7
1000 40 19.2
2000 40 30.2
2000 40 33.9
2000 40 31.4
1000 80 37.2
1000 80 39.3
1000 80 40.1
2000 80 54.9
2000 80 56.8
2000 80 54.3
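For anyone without WML, the same setup can be sketched in Python
with NumPy. This is only an illustration under that assumption; the
array names (x1c, x2c, x12c) are mine and are not part of the
original data file.

```python
import numpy as np

# The 12 runs from the data file above, in the same row order.
X1 = np.repeat([1000.0, 2000.0, 1000.0, 2000.0], 3)   # Pressure
X2 = np.repeat([40.0, 40.0, 80.0, 80.0], 3)           # Temperature
Y = np.array([22.5, 18.7, 19.2, 30.2, 33.9, 31.4,
              37.2, 39.3, 40.1, 54.9, 56.8, 54.3])    # Response

# Center each variable at its design midpoint, THEN form the product.
x1c = X1 - 1500.0
x2c = X2 - 60.0
x12c = x1c * x2c

# One row per run: centered X1, centered X2, interaction, response.
print(np.column_stack([x1c, x2c, x12c, Y]))
```

The order matters: the product is formed from the centered columns,
not centered after the fact.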
&&&&&&&&&&&&&&&&&&&&&&&&&&&777
Analysis with X1 & X2 centered
WML98 v.1.3
Ex1111OUT
Interaction example 15:57:42 03-15-2009
//
X1 X2 Y
//
X1 = X1 - 1500
X2 = X2 - 60
X12 = X1*X2
//
1 X1 Pressure
2 X2 Temperature
3 X12 PT interaction
4 Y Response
//
Here is what the software "sees"
          X1       X2         X12       Y
 1  -500.000  -20.000   10000.000  22.500
 2  -500.000  -20.000   10000.000  18.700
 3  -500.000  -20.000   10000.000  19.200
 4   500.000  -20.000  -10000.000  30.200
 5   500.000  -20.000  -10000.000  33.900
 6   500.000  -20.000  -10000.000  31.400
 7  -500.000   20.000  -10000.000  37.200
 8  -500.000   20.000  -10000.000  39.300
 9  -500.000   20.000  -10000.000  40.100
10   500.000   20.000   10000.000  54.900
11   500.000   20.000   10000.000  56.800
12   500.000   20.000   10000.000  54.300
12 Rows of Data Were Read From Your Data File.
12 Rows of Data Were Actually Moved Into WML.
AVERAGES AND STANDARD DEVIATIONS
1 0.000000 522.233030 Pressure
2 0.000000 20.889320 Temperature
3 0.000000 10444.660000 PT interaction
4 36.541664 13.393860 Response
CORRELATION MATRIX Coefficients X1000
1 2 3 4
X1 X2 X12 Y
1 X1 1000 0 0 549
2 X2 0 1000 0 823 <---
3 X12 0 0 1000 93
4 Y 549 823 93 1000
549 = 0.549 93 = 0.093 etc.
Note that X1, X2, and X12 are not
correlated with each other.
Plainly, X2 (Temperature) has
the most significant impact
on Y. Then X1. The role of
the interaction will be modest
and may not even be "significant".
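The centered correlation matrix can be checked independently. The
sketch below assumes NumPy and uses np.corrcoef; it is not how WML
computes the matrix, only a way to get comparable numbers.

```python
import numpy as np

X1 = np.repeat([1000.0, 2000.0, 1000.0, 2000.0], 3)   # Pressure
X2 = np.repeat([40.0, 40.0, 80.0, 80.0], 3)           # Temperature
Y = np.array([22.5, 18.7, 19.2, 30.2, 33.9, 31.4,
              37.2, 39.3, 40.1, 54.9, 56.8, 54.3])    # Response

x1c = X1 - 1500.0
x2c = X2 - 60.0

# np.corrcoef treats each ROW as a variable: X1, X2, X12, Y.
R = np.corrcoef(np.vstack([x1c, x2c, x1c * x2c, Y]))

# Show the coefficients multiplied by 1000, as in the listing above.
print(np.round(1000 * R).astype(int))
```

In a balanced two-level design the centered X1, X2 and X12 columns
are mutually orthogonal, which is why the off-diagonal predictor
entries come out as zero.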
A "Forward" Stepwise regression.
Let's not get into a food fight over
"which kind of regression to use".
STEP NO. 1
VARIABLES COEFFICIENTS SE OF COEFF T-RATIO
0 Intercept 36.54166
RESSUMSQ STDDEV OF RES DF R-SQ
1973.34820 13.39385 11 0.0000
STEP NO. 2 Determinant = 1.0000
VAR. 2 GOING IN T-CRIT 0.05 & 0.01 = 2.23 & 3.19
VARIABLES COEFFICIENTS SE OF COEFF T-RATIO
0 Intercept 36.54166
2 Temperatur 0.52792 0.11507 4.59
RESSUMSQ STDDEV OF RES DF R-SQ
635.60720 7.97250 10 0.6779
STEP NO. 3 Determinant = 1.0000
VAR. 1 GOING IN T-CRIT 0.05 & 0.01 = 2.27 & 3.28
VARIABLES COEFFICIENTS SE OF COEFF T-RATIO
0 Intercept 36.54166
1 Pressure 0.01408 0.00123 11.49
2 Temperatur 0.52792 0.03065 17.22
RESSUMSQ STDDEV OF RES DF R-SQ
40.58604 2.12357 9 0.9794
STEP NO. 4 Determinant = 1.0000
VAR. 3 GOING IN T-CRIT 0.05 & 0.01 = 2.31 & 3.39
VARIABLES COEFFICIENTS SE OF COEFF T-RATIO
0 Intercept 36.54166
1 Pressure 0.01408 0.00099 14.22
2 Temperatur 0.52792 0.02476 21.32
3 PT interac 1.191668E-04 4.952398E-05 2.41
RESSUMSQ STDDEV OF RES DF R-SQ
23.54517 1.71556 8 0.9881
Model: Y = 36.54 + 0.014*(X1 - 1500) + 0.528*(X2 - 60)
+ 0.000119*(X1 - 1500)*(X2 - 60)
The t-ratios for the "main effects" X1 and X2 are quite large.
The role of the PT interaction is questionable. tcrit at the
0.05 level is 2.31
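A minimal sketch of the final centered fit, assuming NumPy and
ordinary least squares on all three terms at once (no stepwise
selection). The variable names are mine; the t-ratio is the
coefficient divided by its standard error, as in the listing.

```python
import numpy as np

X1 = np.repeat([1000.0, 2000.0, 1000.0, 2000.0], 3)   # Pressure
X2 = np.repeat([40.0, 40.0, 80.0, 80.0], 3)           # Temperature
Y = np.array([22.5, 18.7, 19.2, 30.2, 33.9, 31.4,
              37.2, 39.3, 40.1, 54.9, 56.8, 54.3])    # Response

x1c = X1 - 1500.0
x2c = X2 - 60.0

# Design matrix: intercept, centered X1, centered X2, interaction.
A = np.column_stack([np.ones_like(Y), x1c, x2c, x1c * x2c])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)

resid = Y - A @ beta
df = len(Y) - A.shape[1]              # 12 rows - 4 coefficients = 8
s2 = resid @ resid / df               # residual variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))
t = beta / se

for name, b, e, tr in zip(["Intercept", "X1", "X2", "X12"], beta, se, t):
    print(f"{name:10s} {b:14.6e} {e:14.6e} {tr:8.2f}")
```

The X12 t-ratio should land near the 2.41 shown in Step 4, just
above the 0.05 critical value of 2.31.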
OBS PRED RESIDS STD RES
1 22.50 20.13 2.37 1.38
2 18.70 20.13 -1.43 -0.84
3 19.20 20.13 -0.93 -0.54
4 30.20 31.83 -1.63 -0.95
5 33.90 31.83 2.07 1.20
6 31.40 31.83 -0.43 -0.25
7 37.20 38.87 -1.67 -0.97
8 39.30 38.87 0.43 0.25
9 40.10 38.87 1.23 0.72
10 54.90 55.33 -0.43 -0.25
11 56.80 55.33 1.47 0.85
12 54.30 55.33 -1.03 -0.60
RES SUMSQ FROM REGRESSION = 23.54517
RES SUMSQ DIRECT = 23.54666
7 Negative Residuals 5 Positive Residuals
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
Now for an analysis of the same data with the variables
X1 & X2 not centered before creating X1*X2
Analysis with X1 & X2 not centered.
WML98 v.1.3
Ex2222OUT
Interaction example 15:56:07 03-15-2009
//
X1 X2 Y
//
X12=X1*X2 <--- NOTE: X1 and X2 not centered
//
1 X1 Pressure
2 X2 Temperature
3 X12 PT interaction
4 Y Response
//
          X1       X2         X12       Y
 1  1000.000   40.000   40000.000  22.500
 2  1000.000   40.000   40000.000  18.700
 3  1000.000   40.000   40000.000  19.200
 4  2000.000   40.000   80000.000  30.200
 5  2000.000   40.000   80000.000  33.900
 6  2000.000   40.000   80000.000  31.400
 7  1000.000   80.000   80000.000  37.200
 8  1000.000   80.000   80000.000  39.300
 9  1000.000   80.000   80000.000  40.100
10  2000.000   80.000  160000.000  54.900
11  2000.000   80.000  160000.000  56.800
12  2000.000   80.000  160000.000  54.300
12 Rows of Data Were Read From Your Data File.
12 Rows of Data Were Actually Moved Into WML.
AVERAGES AND STANDARD DEVIATIONS
1 1500.000000 522.233030 Pressure
2 60.000000 20.889320 Temperature
3 90000.000000 45527.220000 PT interaction
4 36.541664 13.393860 Response
CORRELATION MATRIX Coefficients X1000
1 2 3 4
X1 X2 X12 Y
1 X1 1000 0 688 549
2 X2 0 1000 688 823
3 X12 688 688 1000 966
4 Y 549 823 966 1000
Note the unnecessary correlations
among the variables X1, X2, X12
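One way to put a number on these correlations is the determinant of
the predictor correlation matrix, which the stepwise output reports
as each variable enters. The sketch below assumes NumPy. The
variance inflation factors (VIF, the diagonal of the inverse
correlation matrix) are my addition and are not part of the WML
output.

```python
import numpy as np

X1 = np.repeat([1000.0, 2000.0, 1000.0, 2000.0], 3)   # Pressure
X2 = np.repeat([40.0, 40.0, 80.0, 80.0], 3)           # Temperature

def predictor_corr(x1, x2):
    """Correlation matrix of x1, x2 and their product x1*x2."""
    return np.corrcoef(np.vstack([x1, x2, x1 * x2]))

R_raw = predictor_corr(X1, X2)                    # not centered
R_cen = predictor_corr(X1 - 1500.0, X2 - 60.0)    # centered

det_raw = np.linalg.det(R_raw)
det_cen = np.linalg.det(R_cen)

# VIF for each predictor = diagonal of the inverse correlation matrix.
vif_raw = np.diag(np.linalg.inv(R_raw))
vif_cen = np.diag(np.linalg.inv(R_cen))

print("determinant raw / centered:", det_raw, det_cen)
print("VIF raw     :", vif_raw)
print("VIF centered:", vif_cen)
```

For this design the uncentered determinant works out to 1/19, about
0.0526, and the VIFs to 10, 10 and 19. Centered, the determinant is
1 and every VIF is 1.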
STEP NO. 1
VARIABLES COEFFICIENTS SE OF COEFF T-RATIO
0 Intercept 36.54166
RESSUMSQ STDDEV OF RES DF R-SQ
1973.34820 13.39385 11 0.0000
STEP NO. 2 Determinant = 1.0000
VAR. 3 GOING IN T-CRIT 0.05 & 0.01 = 2.23 & 3.19
VARIABLES COEFFICIENTS SE OF COEFF T-RATIO
0 Intercept 10.96664
3 PT interac 2.841670E-04 2.408237E-05 11.80
RESSUMSQ STDDEV OF RES DF R-SQ
132.23090 3.63636 10 0.9330
STEP NO. 3 Determinant = 0.5263
VAR. 2 GOING IN T-CRIT 0.05 & 0.01 = 2.27 & 3.28
VARIABLES COEFFICIENTS SE OF COEFF T-RATIO
0 Intercept 4.86666
2 Temperatur 0.19317 0.04086 4.73
3 PT interac 2.231670E-04 1.874939E-05 11.90
RESSUMSQ STDDEV OF RES DF R-SQ
37.96618 2.05389 9 0.9808 <---
One might be tempted to stop here and omit the X1
(Pressure) term.
Determinant of the correlation
matrix is very low
STEP NO. 4 Determinant = 0.0526 <--- NOTE THIS
Determinant of Correlation Matrix is Low
VAR. 1 GOING IN T-CRIT 0.05 & 0.01 = 2.31 & 3.39
VARIABLES COEFFICIENTS SE OF COEFF T-RATIO
0 Intercept -5.53331
1 Pressure 0.00693 0.00313 2.21
2 Temperatur 0.34917 0.07831 4.46
3 PT interac 1.191671E-04 4.952478E-05 2.41
RESSUMSQ STDDEV OF RES DF R-SQ
23.54530 1.71556 8 0.9881
A quick interpretation of this by a person who is using
regression without understanding the role of centering
vs. not centering might be "we can leave pressure out of
the model; temperature and the PT interaction are
sufficient."
Compare the model coefficients for X1 & X2 to those
with the centered variables. Also note that, due to variance
inflation, the t-ratio for X1 (2.21) is now below the
p = 0.05 critical value of 2.31, and the t-ratio for X2 has
dropped from 21.32 to 4.46. Compare these t-ratios to the
earlier model.
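The comparison can be reproduced side by side. This is a sketch
assuming NumPy; fit() is a hypothetical helper of my own, not a
library routine, and it fits all three terms at once rather than
stepwise. It uses the pseudo-inverse because the uncentered columns
are badly scaled.

```python
import numpy as np

X1 = np.repeat([1000.0, 2000.0, 1000.0, 2000.0], 3)   # Pressure
X2 = np.repeat([40.0, 40.0, 80.0, 80.0], 3)           # Temperature
Y = np.array([22.5, 18.7, 19.2, 30.2, 33.9, 31.4,
              37.2, 39.3, 40.1, 54.9, 56.8, 54.3])    # Response

def fit(columns, y):
    """Ordinary least squares with an intercept.

    Returns (coefficients, t-ratios, R-squared).
    """
    A = np.column_stack([np.ones_like(y)] + list(columns))
    A_pinv = np.linalg.pinv(A)            # equals (A'A)^-1 A' at full rank
    beta = A_pinv @ y
    resid = y - A @ beta
    df = len(y) - A.shape[1]
    cov = (resid @ resid / df) * (A_pinv @ A_pinv.T)
    t = beta / np.sqrt(np.diag(cov))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, t, r2

x1c, x2c = X1 - 1500.0, X2 - 60.0
b_cen, t_cen, r2_cen = fit([x1c, x2c, x1c * x2c], Y)
b_raw, t_raw, r2_raw = fit([X1, X2, X1 * X2], Y)

print("term        coef(cen)     t(cen)    coef(raw)     t(raw)")
for i, name in enumerate(["Intercept", "X1", "X2", "X12"]):
    print(f"{name:10s} {b_cen[i]:12.5e} {t_cen[i]:8.2f} "
          f"{b_raw[i]:12.5e} {t_raw[i]:8.2f}")
print("R-sq centered / raw:", r2_cen, r2_raw)
```

Only the main-effect rows and the intercept should differ between
the two fits; the X12 row and R-sq should agree.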
In some instances the signs of the model coefficients for the primary
variables may be reversed from reality. A loose
interpretation of that by a casual "analyst" may give
the wrong impression about the role of those variables.
I've seen such false claims in reports, presentations,
and published literature many times.
The coefficient for the interaction is the same with or
without centering. So is its t-ratio, and all of the
summary statistics such as R-sq.
All of this is predictable with a little bit of algebra.
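For what it's worth, here is that algebra written out. The symbols
m_1 and m_2 (the centering constants, 1500 and 60 here) and the
primed coefficients are my notation, not the original output's.
Expanding the centered model and collecting terms gives the
uncentered one:

```latex
\begin{align*}
Y &= b_0 + b_1 (X_1 - m_1) + b_2 (X_2 - m_2)
       + b_{12} (X_1 - m_1)(X_2 - m_2) \\
  &= \underbrace{(b_0 - b_1 m_1 - b_2 m_2 + b_{12} m_1 m_2)}_{b_0'}
   + \underbrace{(b_1 - b_{12} m_2)}_{b_1'} X_1
   + \underbrace{(b_2 - b_{12} m_1)}_{b_2'} X_2
   + b_{12} X_1 X_2
\end{align*}

% With m_1 = 1500, m_2 = 60 and the centered coefficients:
\begin{align*}
b_1' &= 0.01408 - (1.191668 \times 10^{-4})(60)   \approx 0.00693 \\
b_2' &= 0.52792 - (1.191668 \times 10^{-4})(1500) \approx 0.34917 \\
b_0' &= 36.54166 - 21.125 - 31.675 + 10.725       \approx -5.5333
\end{align*}
```

The interaction coefficient b_12 carries over untouched, while each
uncentered main-effect coefficient is the slope at the point where
the other variable equals zero, far outside the data.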
As a rule I say "if you are entertaining any kind of
second-order effects (interactions or quadratics)
then center the variables. This is not negotiable."
OMU
OBS PRED RESIDS STD RES
1 22.50 20.13 2.37 1.38
2 18.70 20.13 -1.43 -0.84
3 19.20 20.13 -0.93 -0.54
4 30.20 31.83 -1.63 -0.95
5 33.90 31.83 2.07 1.20
6 31.40 31.83 -0.43 -0.25
7 37.20 38.87 -1.67 -0.97
8 39.30 38.87 0.43 0.25
9 40.10 38.87 1.23 0.72
10 54.90 55.33 -0.43 -0.25
11 56.80 55.33 1.47 0.85
12 54.30 55.33 -1.03 -0.60
RES SUMSQ FROM REGRESSION = 23.5453
RES SUMSQ DIRECT = 23.54666
7 Negative Residuals 5 Positive Residuals
BTW, I've seen some horrible examples presented in
software "guides" and Help files for commercial
software. By this I mean examples in which the signs
on one or more model coefficients are "backward".
The one I've always cherished is one in which a
loose interpretation of the model implies that
the value of homes decreases with increasing square
footage. This "failure to center" issue is pervasive.
For this reason I don't trust any published model
that contains an interaction with the variables not
centered.
My first insight into this (including the need for
the correlation matrix) came from Dr. Harry Smith
when he worked for P&G in the early 1960s. I will
not attempt an analysis of multifactor data (by this
I mean analysis for significance of factors and building
models) without easy access to the correlation matrix.
Now, before the rocks start flying, yes I know that if
we code the variables as -1, +1 then the variables X1
and X2 are automatically centered. But then we end up
with models having coded variables on the right side.
In order to use those models in a practical way we'll
have to "decode" the variables... an unpleasant and
error-prone mess if there are several variables.
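To give a feel for the decoding step, here is a sketch for this
two-variable case, assuming NumPy. The decode() helper and the names
m1, h1, m2, h2 (midpoint and half-range of each variable) are
hypothetical, introduced only for illustration.

```python
import numpy as np

X1 = np.repeat([1000.0, 2000.0, 1000.0, 2000.0], 3)   # Pressure
X2 = np.repeat([40.0, 40.0, 80.0, 80.0], 3)           # Temperature
Y = np.array([22.5, 18.7, 19.2, 30.2, 33.9, 31.4,
              37.2, 39.3, 40.1, 54.9, 56.8, 54.3])    # Response

m1, h1 = 1500.0, 500.0    # pressure midpoint and half-range
m2, h2 = 60.0, 20.0       # temperature midpoint and half-range

# Code the variables as -1 / +1 and fit in coded units.
z1 = (X1 - m1) / h1
z2 = (X2 - m2) / h2
A = np.column_stack([np.ones_like(Y), z1, z2, z1 * z2])
c0, c1, c2, c12 = np.linalg.lstsq(A, Y, rcond=None)[0]

def decode(c0, c1, c2, c12):
    """Convert coded-unit coefficients to raw-unit coefficients."""
    b12 = c12 / (h1 * h2)
    b1 = c1 / h1 - b12 * m2
    b2 = c2 / h2 - b12 * m1
    b0 = c0 - c1 * m1 / h1 - c2 * m2 / h2 + b12 * m1 * m2
    return b0, b1, b2, b12

b0, b1, b2, b12 = decode(c0, c1, c2, c12)
print("coded:", c0, c1, c2, c12)
print("raw  :", b0, b1, b2, b12)
```

Even with two variables there are four substitutions to get right,
and the bookkeeping grows quickly as variables and interactions are
added.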
And yes, I know that models with the variables centered
are messy... even downright ugly. When I am ready to
teach students about centering I tell them beforehand
that we are about to get into the most frustrating
thing I have to teach... but teach it I must. There will
usually be someone in the room who is working on an MBA
and she/he will take my message to a prof at a local
university that evening and the prof will tell the student
"nonsense!! I've never seen such a thing." As sure as
little kittens have tails I'll hear about this the next
morning and will have to teach centering all over again.
<sigh> OMU