Re: distribution of an outlier?




dave@xxxxxxxxxxx wrote:
> Mat,
> >
> Furthermore note that as usual the errors from the regression must be
> Gaussian.

And the usual WRONG method of tagging possible "outliers" is to put
some number of asterisks behind those residuals that are two, three,
or more standard deviations from zero (SAS does that).

The idea behind such tagging is that if you generate ONE observation
from a Normal(0, sigma) population, then there is a rather small
probability that the random deviate is more than 3 sigma away from 0.

But in the analysis of residuals in a regression, the "outliers" are
always the LARGEST observed residuals, or the maximum order statistics!

Thus, if you do a regression analysis with 10,000 observations, say,
you will find pages full of "***" in SAS because there's nothing
unusual about MANY observed residuals more than 3 std dev. away from
zero. It would take a MUCH larger observed residual to be considered
a candidate for an "outlier".
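
To put rough numbers on this, here is a minimal sketch. It ASSUMES the
residuals behave like n independent Normal(0, 1) draws, which is an
idealization (real regression residuals are correlated and sigma is
estimated). The function names are mine, for illustration only.

```python
import math
from statistics import NormalDist

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))

def expected_flags(n, k=3.0):
    """Expected number of n independent N(0,1) residuals beyond k sigma."""
    return n * two_sided_tail(k)

def prob_max_exceeds(n, k=3.0):
    """P(largest |residual| among n independent N(0,1) draws exceeds k)."""
    return 1.0 - (1.0 - two_sided_tail(k)) ** n

def max_cutoff(n, alpha=0.05):
    """Cutoff c with P(max |Z_i| > c) = alpha for n independent draws."""
    per_obs = 1.0 - (1.0 - alpha) ** (1.0 / n)
    return NormalDist().inv_cdf(1.0 - per_obs / 2.0)

# Columns: n, expected count beyond 3 sigma, P(max beyond 3 sigma), 5% cutoff
for n in (1, 100, 10_000):
    print(n, round(expected_flags(n), 2),
          round(prob_max_exceeds(n), 4), round(max_cutoff(n), 2))
```

Under that assumption, with n = 10,000 you expect about 27 residuals
beyond 3 sigma, and the cutoff for the LARGEST residual at the 5% level
is well above 4 sigma, so the cutoff has to grow with n.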
>
> For more http://www.autobox.com/outlier.html

I found MANY questionable points in the exposition given in that link.


It did identify the problem I stated above:

*> Outlier points ( points above or below 3 standard deviations )
*> are immediately identified and thus may be deleted from the next
*> stage of the analysis. The flaw in the above logic is obvious.

For different "obvious reasons". But the use of a fixed "3-sigma"
as the detection rule (based on alternative estimates of sigma)
remains throughout the link, with complete disregard of the sample
size n and the maximum-order-statistics.


Furthermore, the DETECTION of outliers is an entirely different
matter from the DELETION of outliers.

Any DELETION of outliers is a CRIME unless you can fully justify the
deletion. One can always run the analysis both with and without
the rogue observation, or use some procedure(s) that are
robust to the presence of a small number of outliers (RARE observations
DO naturally occur, rarely of course).
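
A minimal sketch of the "with or without" comparison for a simple
straight-line fit. The data are MADE UP purely for illustration, and
the helper names are hypothetical. The point is only that both fits get
reported, not that the observation gets thrown away.

```python
def ols_line(xs, ys):
    """Least-squares (intercept, slope) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fit_with_and_without(xs, ys, idx):
    """Fit once on all points and once with observation idx set aside."""
    full = ols_line(xs, ys)
    xs_wo = xs[:idx] + xs[idx + 1:]
    ys_wo = ys[:idx] + ys[idx + 1:]
    return full, ols_line(xs_wo, ys_wo)

# Made-up illustrative data: y = 1 + 2x exactly, except the last point.
xs = [float(x) for x in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[9] = 40.0  # hypothetical rogue observation

full, reduced = fit_with_and_without(xs, ys, 9)
print("all points:    intercept=%.3f slope=%.3f" % full)
print("without idx 9: intercept=%.3f slope=%.3f" % reduced)
```

If the two fits lead to the same conclusions, the rogue observation
hardly matters. If they do not, that is something to report, not to
hide.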


*> Some would argue that the outliers can be identified via an
*> "influential observation approach" or "cook's distance approach".
*> Essentially this detection scheme focuses on the effect of its
*> deletion on the residual sum of squares. But this approach usually
*> fails because the outlier is an "unusual value" to its prediction
*> and that prediction requires a model.

This is a very poor characterization (so much so that it could be
considered "wrong") of the role of Cook's distance and the notion of
"influential observation" vs "outliers".


> Dave Reilly
> AUTOMATIC FORECASTING SYSTEMS
> http://www.autobox.com
> 215-675-0652

-- Bob.
