Re: Regression Inference and Data Splitting



Bruce Weaver <bweaver@xxxxxxxxxxxx> wrote:
NPDave@xxxxxxxxx wrote:
I've just read something in a statistics text, and the authors have
conveniently left out an explanation. The statement is: "Regression
Inference should never be done after the model has been chosen from the
same data".

In this case, the original data set was split in half, with one half
being used for obtaining the model and the other for inference.

Any help would be appreciated. Thank you.



Try searching on "cross-validation".
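
If it helps to see the problem rather than read about it, here is a small
simulation sketch (an illustration only, not taken from the quoted text). It
assumes pure-noise data with 10 candidate predictors, picks the predictor most
correlated with y on one half of the data, and then tests that predictor either
on the same half or on the held-out half. The function names, the sample sizes
and the Fisher z approximation used for the p-value are all assumptions made
for the sketch.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Two-sided p-value for H0: rho = 0 (Fisher z approximation, an assumption)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_sims=1000, n=100, n_predictors=10, alpha=0.05, seed=1):
    """Return (same-data, split-sample) false-positive rates under pure noise."""
    rng = random.Random(seed)
    half = n // 2
    same_hits = split_hits = 0
    for _ in range(n_sims):
        # y and all predictors are independent noise, so every null is true.
        y = [rng.gauss(0, 1) for _ in range(n)]
        xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_predictors)]

        # Choose the "model" on the first half: the predictor most correlated with y.
        best = max(xs, key=lambda x: abs(correlation(x[:half], y[:half])))

        # (a) Inference on the same half that was used for selection.
        if p_value(correlation(best[:half], y[:half]), half) < alpha:
            same_hits += 1
        # (b) Inference on the held-out half.
        if p_value(correlation(best[half:], y[half:]), n - half) < alpha:
            split_hits += 1
    return same_hits / n_sims, split_hits / n_sims


if __name__ == "__main__":
    same_rate, split_rate = simulate()
    print(f"same-data false-positive rate:    {same_rate:.3f}")
    print(f"split-sample false-positive rate: {split_rate:.3f}")
```

Because every null hypothesis is true here, a valid test should reject about 5%
of the time. With 10 independent candidates, picking the best one and testing
it on the same data should reject roughly 1 - 0.95^10, or about 40%, of the
time, while the held-out half should stay near the nominal 5%.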

And while you're at it, have a look at the following. Although the focus
is logistic regression, it is relevant to all regression-type models.

Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y,
Habbema JD. Internal validation of predictive models: efficiency of some
procedures for logistic regression analysis. J Clin Epidemiol. 2001
Aug;54(8):774-81.

Abstract: The performance of a predictive model is overestimated when
simply determined on the sample of subjects that was used to construct the
model. Several internal validation methods are available that aim to
provide a more accurate estimate of model performance in new subjects. We
evaluated several variants of split-sample, cross-validation and
bootstrapping methods with a logistic regression model that included eight
predictors for 30-day mortality after an acute myocardial infarction.
Random samples with a size between n = 572 and n = 9165 were drawn from a
large data set (GUSTO-I; n = 40,830; 2851 deaths) to reflect modeling in
data sets with between 5 and 80 events per variable. Independent
performance was determined on the remaining subjects. Performance measures
included discriminative ability, calibration and overall accuracy. We
found that split-sample analyses gave overly pessimistic estimates of
performance, with large variability. Cross-validation on 10% of the sample
had low bias and low variability, but was not suitable for all performance
measures. Internal validity could best be estimated with bootstrapping,
which provided stable estimates with low bias. We conclude that
split-sample validation is inefficient, and recommend bootstrapping for
estimation of internal validity of a predictive logistic regression model.
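
As a rough sketch of what bootstrap internal validation can look like, the
following uses one common variant, the optimism-corrected bootstrap. The
abstract above does not say which bootstrap variants were evaluated, so this
choice is an assumption, as are the function names. For brevity the sketch uses
a one-predictor least-squares line and R^2 rather than the logistic model and
performance measures of the paper.

```python
import random


def fit_line(x, y):
    """Least-squares (intercept, slope) for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(model, x, y):
    """R^2 of a fitted (intercept, slope) model evaluated on data (x, y)."""
    intercept, slope = model
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


def optimism_corrected_r2(x, y, n_boot=200, seed=1):
    """Return (apparent R^2, optimism-corrected R^2)."""
    rng = random.Random(seed)
    n = len(x)
    # Apparent performance: model fitted and evaluated on the same sample.
    apparent = r_squared(fit_line(x, y), x, y)
    optimism = 0.0
    for _ in range(n_boot):
        # Resample cases with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        model = fit_line(xb, yb)
        # Optimism: performance on the bootstrap sample minus performance
        # of the same fitted model on the original sample.
        optimism += r_squared(model, xb, yb) - r_squared(model, x, y)
    optimism /= n_boot
    return apparent, apparent - optimism


if __name__ == "__main__":
    # Hypothetical data, generated only to exercise the functions.
    rng = random.Random(0)
    x = [rng.gauss(0, 1) for _ in range(40)]
    y = [0.5 * a + rng.gauss(0, 1) for a in x]
    apparent, corrected = optimism_corrected_r2(x, y)
    print(f"apparent R^2:  {apparent:.3f}")
    print(f"corrected R^2: {corrected:.3f}")
```

Unlike a split-sample analysis, every case is used both to fit the model and to
estimate how much the apparent performance is inflated.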


