Re: regression assumptions are violated, what next?



On Jul 10, 6:27 pm, Paul Rubin <ru...@xxxxxxx> wrote:

[skipped]

I played with your data a little. First, let me dispense with the
normality question. I generally advise my students not to interpret
non-Gaussian residuals as an indication of a defective model _unless_
there is some standing theory that says that their specific type of data
should contain Gaussian noise. The main general concern, to me, of
non-Gaussian residuals is that most of the tests you're likely to want
to apply to the model are parametric and assume Gaussian noise, so you
can't trust the results as much (if at all). In some cases, you can
find nonparametric alternatives. In other cases, you have to run in
circles at high speed, shouting "Central Limit Theorem" or something
like that.

Hmm, in my case I'm going to use the model to simulate the system's
operation. So I'm *assuming* I need to know the distribution of the
residuals, since the delay introduced by the component (in
simulation) will be something like this (for the linear model case):

d = intercept + slope * illen + err

where 'err' should be drawn from the distribution of the residuals. It
doesn't have to be normal, of course (though that would simplify things).

So my primary concern here is to get the regression coefficients
right, and if I understood you correctly, that may be possible even if
the residuals are not normally distributed. Is that right? If so,
how do I make sure that the coefficients are reasonable (apart from
looking at the plot and checking that the confidence intervals are
narrow enough for my situation)?


Regarding the residual plot, which is indeed a bit scary, I suggest you
scatter plot your data separately for three ranges, say illen <= 600,
600 < illen <= 1000, and illen > 1000. As best I can see, the overall


Just curious: when selecting subsets of data like this, am I expected
to provide some kind of justification for the choice of splitting
points? Or is saying that "visually, the system's behaviour changes
around this and that point" sufficient?


regression function is concave, not linear. In that third range, it's
plausibly linear. The second range looks like it's _maybe_ linear. The
first range looks concave, but it also has an outlier issue. In the
first range, you get occasional points where total is considerably
larger than you would expect from the overall plot, but none where it is
considerably smaller. I don't see that one-sidedness in the other two
ranges. It probably has something to do with your noise source having
one tail, or at least one tail a lot heavier than the other; but the
fact that it only seems to crop up at the low end is still a bit
bothersome. You can see it in your raw data plot (the one that makes
you think there's a linear relationship).


I suspect this heavy-tailedness might stem from the following. The
machine that the system runs on occasionally performs some background
processing which might introduce delays on its own. Thus the
processing time, of course, can only get greater, and never smaller.

Now, the next question is why we see more outliers in the first
range. Looking at the data, there were 21470 requests shorter than 600
and about 12 outliers in that range; for request lengths 600-1000: 340
requests and 1 outlier; for 1000+: 120 requests and 0 outliers. So
assuming that those extra load spikes on the processing machine are
exponentially distributed, it might be that we observe more outliers
for shorter request lengths just because there are so many more of
them.
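
A quick check of that arithmetic, assuming every request has the same chance of hitting a load spike regardless of its length (the pooled rate below is that assumption, and "about 12" is taken as exactly 12):

```python
# Counts quoted above: (requests, outliers) per request-length range.
counts = {
    "illen <= 600": (21470, 12),
    "600 < illen <= 1000": (340, 1),
    "illen > 1000": (120, 0),
}

total_requests = sum(n for n, _ in counts.values())
total_outliers = sum(k for _, k in counts.values())
pooled_rate = total_outliers / total_requests  # assumed identical in every range

expected = {}
p_no_outliers = {}
for name, (n, k) in counts.items():
    expected[name] = n * pooled_rate               # expected outlier count
    p_no_outliers[name] = (1 - pooled_rate) ** n   # chance of seeing none at all
    print(f"{name:>20}: observed {k:2d}, expected {expected[name]:.2f}, "
          f"P(no outliers) = {p_no_outliers[name]:.3f}")
```

With a single pooled rate, the two upper ranges are expected to contain well under one outlier each (on the order of 0.2 and 0.07), so seeing 1 and 0 there is consistent with the "there are simply more short requests" explanation.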

Does this sound reasonable, and if so, could I use it as grounds to
simply drop the outliers from the analysis?


You could try fitting a quadratic or cubic model (cubic seems to work
better than linear -- I tried log-log and that didn't work well at all),
or you might ask yourself whether there's justification to consider
separate models for different ranges of illen.


I'm perfectly fine with using a piecewise model. My ultimate goal is to
create a model of this component to study how it would perform under
different load patterns.

I will take a look at cubic models; maybe you have a reference to a
good resource at hand? Statistics wasn't my major, so I'm working my
way through this on my own. Book references will work too. :-)

Ah, and last but not least -- thanks a lot for looking into this! :-)