Re: interaction terms in regression model



Richard...

Thanks for your post. I agree that when the form of the model is "set"
or comes close to being that way, then attention is focused on the
estimated values of the parameters and how those values may be changing
over time.
In my work, as cited, it was much more about finding a suitable model
for each of many specific circumstances and dealing with those as more
data became available. (Model verification, documentation, patents,
etc.) A lot of this was driven by running designed experiments, albeit
often not the kind we see in textbooks. Also, in some instances,
fitting data to first-order differential (rate) equations... which is
another matter altogether. In doing this we taught hundreds (probably
several thousand) of people how to do this sort of work for themselves
because there was no way we could get involved in every project all
over the corporation (at one time with 115,000 people in multiple
locations). This meant we had to lay down some rules of engagement,
else many would get creative and do foolish things.

Just fyi, my son works in a form of economics (investment management --
modeling -- a "quant") and as you suggested, the form of the models
there is generally "set" but model coefficients must be updated and
properly interpreted. Besides, I'm not even sure they use interactive
forms as I know and use interactions.

Thanks again. Take care... OMU





richardstartz@xxxxxxxxxxx wrote:
On 14 Jan 2007 12:28:32 -0800, "Old Mac User"
<chendrixstats@xxxxxxxxx> wrote:

OMU:
I don't disagree with anything you say. I think the different points
of view illustrate that economists (me) have a slightly different
approach to regression than that in some other disciplines. We're
usually estimating specific model parameters, or at least partial
derivatives. If one is doing this (at least if one's paying
attention), then centering is irrelevant. I won't argue with you that
in many disciplines centering is very relevant.
-*** Startz

Richard...

As mentioned in another post, I suggest you go to

http://groups.google.com/group/sci.stat.consult/browse_frm/thread/6b1...


for further discussion on this. The bottom line is simple. While the
algebra (i.e., predictions) will be the same regardless of which
centering constants you use (I'm assuming a two-variable model here...
two linear terms and one interaction between those) the "significance"
of the linear factors (values of their regression coefficients and the
t-ratios associated with those coefficients) is controlled by the
choice of the values of the centering constants.
If we fail to center, that is exactly the same as using "zero" as the
centering constant.

The fastest and most convincing way to see this is to set up a small
example with two independent variables and one dependent variable. Do
this with a simple two-level factorial arrangement, and create
synthetic data such that there is an interaction. Fit these data to
the obvious model. Change the centering constants and watch what
happens to coefficients for the linear terms and to their t-ratios (or
F-ratios if that's the way you are traveling). The "failure to center"
and the failure to understand and appreciate this frequently lead to
nonsense results. With the widespread use of statistical software...
notably multiple regression... among amateurs the frequency of this
misunderstanding has grown exponentially. True, if you code the
variables as -1 and +1 (in a factorial arrangement) then this problem
is not present. But if the independent variables are in their original
metrics or if the data come from something other than a "perfect"
factorial design then there is a potential for major misunderstandings
about the effects of the variables (effects here means the linear
effects).
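
The small experiment described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the factor levels (10/20 and 2/4), the "true" coefficients, and the fixed noise values are all invented, and the helper names (`transpose`, `matmul`, `inverse`, `fit`) are hypothetical. It uses only the Python standard library and fits the two-factor model with an interaction term twice, once with zero as the centering constant and once with constants at the design center.

```python
# Sketch only: synthetic 2x2 factorial (two replicates) with a built-in
# interaction. Levels, coefficients, and noise are invented for illustration.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting; adequate for a 4x4 matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit(x1, x2, y, c1, c2):
    """Least-squares fit of y = b0 + b1(x1-c1) + b2(x2-c2) + b12(x1-c1)(x2-c2).

    Returns (coefficients, t-ratios, fitted values).
    """
    X = [[1.0, a - c1, b - c2, (a - c1) * (b - c2)] for a, b in zip(x1, x2)]
    Xt = transpose(X)
    xtx_inv = inverse(matmul(Xt, X))
    beta = [row[0] for row in matmul(xtx_inv, matmul(Xt, [[v] for v in y]))]
    pred = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    dof = len(y) - len(beta)
    s2 = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / dof  # residual variance
    se = [(s2 * xtx_inv[j][j]) ** 0.5 for j in range(len(beta))]
    t = [bj / sj for bj, sj in zip(beta, se)]
    return beta, t, pred

# Two replicates of a two-level factorial in "original metrics" (made-up units).
x1 = [10, 20, 10, 20] * 2
x2 = [2, 2, 4, 4] * 2
noise = [0.3, -0.2, -0.4, 0.1, -0.3, 0.2, 0.4, -0.1]  # fixed, not random
y = [50 + 3.2 * (a - 15) + 16 * (b - 3) + 1.0 * (a - 15) * (b - 3) + e
     for a, b, e in zip(x1, x2, noise)]

beta_raw, t_raw, pred_raw = fit(x1, x2, y, 0, 0)    # "failure to center"
beta_ctr, t_ctr, pred_ctr = fit(x1, x2, y, 15, 3)   # centered at the averages

names = ["b0", "b1", "b2", "b12"]
for label, beta, t in [("c1=0, c2=0", beta_raw, t_raw),
                       ("c1=15, c2=3", beta_ctr, t_ctr)]:
    print(label)
    for name, bj, tj in zip(names, beta, t):
        print(f"  {name:>3} = {bj:10.4f}   t = {tj:8.2f}")

max_diff = max(abs(p - q) for p, q in zip(pred_raw, pred_ctr))
print("max prediction difference:", max_diff)
```

With this made-up data the predictions and the interaction coefficient agree for both choices (up to rounding), while the linear coefficients and their t-ratios come out far smaller when zero is used as the centering constant. Trying other values for `c1` and `c2` shows how the linear terms move with the choice.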

Use centering constants conveniently near the averages of each of the
variables.

Y = b0 + b1(X1 - c1) + b2(X2 - c2) + b12(X1 - c1)(X2 - c2)

which means you must center before cross-multiplying to get the
interaction term.
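
Multiplying out the centered model shows where the dependence on the centering constants comes from. The symbols a0, a1, a2 for the coefficients of the uncentered (c1 = c2 = 0) form are introduced here only for illustration; they are not notation from the original post.

```latex
\begin{align*}
Y &= b_0 + b_1(X_1 - c_1) + b_2(X_2 - c_2) + b_{12}(X_1 - c_1)(X_2 - c_2) \\
  &= \underbrace{(b_0 - b_1 c_1 - b_2 c_2 + b_{12} c_1 c_2)}_{a_0}
   + \underbrace{(b_1 - b_{12} c_2)}_{a_1} X_1
   + \underbrace{(b_2 - b_{12} c_1)}_{a_2} X_2
   + b_{12} X_1 X_2
\end{align*}
```

So the interaction coefficient b12 and the predictions do not depend on c1 and c2, but each linear coefficient is shifted by b12 times the other variable's centering constant. Whenever b12 is not zero, the linear coefficients, and with them their t-ratios, change with the choice of constants.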

OK, so most people know this. Right? Well, I hate to break the bad
news. When I was in charge of an applied statistics group at a major
corporation (inside an R/D & Engineering Dept.) over a period of about
37 years I hired 12 PhDs and 4 MS degreed statisticians. Of these only
one really knew about the need for centering... and that one had worked
in industry for more than 15 years when I hired him. I had to
patiently teach all the others.
Why? I'm not sure, but I suspect it's this. Designed Experiments, as
taught in universities, are typically analyzed in terms of "Analysis of
Variance" (which is actually comparisons of averages) but not as a way
to get data suitable for building models. In my operation it was "all
about modeling"... "ANOVA" was rarely appropriate because one of our
major products was models for designing equipment and for controlling
processes. These models had to be expressed in terms of the original
metrics... no exceptions. So... the newcomers to this found themselves
in a somewhat different game than the "ANOVA" routine at their
universities.

To carry this one step further, consider this. If you are using
software that has "model selection" capability and if you fail to
properly center the variables, then with experience you learn that
you'll end up with a multiplicity of "interactions" where such
interactions are not actually active. I've seen this happen
again-and-again and explaining it to a person who has done it is like
trying to nail Jello to a wall. I could write a book full of examples
of silly models I've seen published... silly because of a failure to
properly center the variables.

It's messy. It's no fun. But I'll stick to my prior claim... there is
no negotiation here. It's gotta be done. OMU





richardstartz@xxxxxxxxxxx wrote:
On 11 Jan 2007 08:19:13 -0800, "Old Mac User"
<chendrixstats@xxxxxxxxx> wrote:

Values of the t-ratios (or F-ratios, whichever) associated with the
linear terms... hence the "significance" of those terms... are
functionally related to the constants used to center those variables.
If you did not properly center the variables (for the linear terms)
then significance is most likely to be depressed. The failure to
center is the most common problem I see in published examples of the
analysis of multivariable data. You did center those variables...
didn't you?

snip

OMU:

Could you elaborate a bit on this part of the answer? I'm not sure
what you mean by "center." If you regress
y = a + bx

and then regress

(y-k1) = a + b(x-k2)

You get identical estimates of b and identical estimates of the
t-statistic on b, for arbitrary constants k1 and k2.

Perhaps I misunderstand your suggestion.
-*** Startz
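
The claim in the question above, that shifting x and y by constants in a simple regression with no interaction term leaves the slope and its t-statistic unchanged, can be checked with a minimal sketch like this one. The data values, the constants `k1` and `k2`, and the function name `simple_ols` are all made up for illustration.

```python
# Sketch only: simple regression y = a + b*x before and after shifting
# x and y by arbitrary constants. Data are invented for illustration.

def simple_ols(x, y):
    """Return (intercept, slope, t-ratio of the slope) for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = (sse / (n - 2) / sxx) ** 0.5  # standard error of the slope
    return a, b, b / se_b

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
k1, k2 = 5.0, 3.5  # arbitrary shifts for y and x

a_raw, b_raw, t_raw = simple_ols(x, y)
a_shift, b_shift, t_shift = simple_ols([xi - k2 for xi in x],
                                       [yi - k1 for yi in y])

print(f"original: a = {a_raw:.4f}, b = {b_raw:.4f}, t(b) = {t_raw:.2f}")
print(f"shifted:  a = {a_shift:.4f}, b = {b_shift:.4f}, t(b) = {t_shift:.2f}")
```

Only the intercept moves here. The slope and its t-ratio stay put because the model has no cross-product term; the dependence on centering constants discussed elsewhere in the thread arises only once an interaction term is in the model.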
