Re: Regression Inference and Data Splitting
- From: naught@xxxxxxx
- Date: Wed, 6 Dec 2006 16:21:11 +0000 (UTC)
Bruce Weaver <bweaver@xxxxxxxxxxxx> wrote:
NPDave@xxxxxxxxx wrote:
I've just read something in a Statistics text and the authors have
conveniently left out an explanation. The statement is: "Regression
inference should never be done after the model has been chosen from the
same data".
In this case, the original data set was split in half, with one half
being used for obtaining the model and the other for inference.
Any help would be appreciated. Thank you.
Try searching on "cross-validation".
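To see why the textbook warns against this, here is a minimal simulation sketch (not from the text in question). It assumes pure-noise data: 100 observations, 20 candidate predictors, none related to the response. The "model selection" step simply picks the predictor most correlated with y, and the "inference" step is the usual t test on that predictor's slope. The function names, the sample sizes and the hard-coded critical values are all illustrative assumptions.

```python
import numpy as np

N = 100  # total sample size (assumed for this illustration)
# Two-sided 5% critical values of Student's t, hard-coded for
# df = 98 (all 100 rows) and df = 48 (a 50-row half).
T_CRIT = {98: 1.984, 48: 2.011}


def correlations(X, y):
    """Pearson correlation of every column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))


def is_significant(x, y):
    """t test of the slope in a simple regression of y on x, 5% level."""
    df = len(y) - 2
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt(df / (1.0 - r**2))
    return abs(t) > T_CRIT[df]


def false_positive_rates(p=20, reps=2000, seed=0):
    """Return (rate when selecting and testing on the same data,
    rate when selecting on one half and testing on the other)."""
    rng = np.random.default_rng(seed)
    half = N // 2
    same_hits = 0
    split_hits = 0
    for _ in range(reps):
        X = rng.normal(size=(N, p))
        y = rng.normal(size=N)  # y is unrelated to every column of X

        # Select and test on the same 100 rows.
        j = np.argmax(np.abs(correlations(X, y)))
        same_hits += is_significant(X[:, j], y)

        # Select on the first half, test on the second half.
        k = np.argmax(np.abs(correlations(X[:half], y[:half])))
        split_hits += is_significant(X[half:, k], y[half:])
    return same_hits / reps, split_hits / reps


if __name__ == "__main__":
    same, split = false_positive_rates()
    print(f"same data:  {same:.3f}")
    print(f"split data: {split:.3f}")
```

Under these assumptions the same-data rate should land far above the nominal 5% (with 20 nearly independent candidates, somewhere near 1 - 0.95^20), while the split-data rate should stay close to 5%. The price of splitting is that the test runs on only half the observations.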
And while you're at it have a look at the following. Although the focus
is logistic regression, it has relevance to all regression-type models.
Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y,
Habbema JD. Internal validation of predictive models: efficiency of some
procedures for logistic regression analysis. J Clin Epidemiol. 2001
Aug;54(8):774-81.
Abstract: The performance of a predictive model is overestimated when
simply determined on the sample of subjects that was used to construct the
model. Several internal validation methods are available that aim to
provide a more accurate estimate of model performance in new subjects. We
evaluated several variants of split-sample, cross-validation and
bootstrapping methods with a logistic regression model that included eight
predictors for 30-day mortality after an acute myocardial infarction.
Random samples with a size between n = 572 and n = 9165 were drawn from a
large data set (GUSTO-I; n = 40,830; 2851 deaths) to reflect modeling in
data sets with between 5 and 80 events per variable. Independent
performance was determined on the remaining subjects. Performance measures
included discriminative ability, calibration and overall accuracy. We
found that split-sample analyses gave overly pessimistic estimates of
performance, with large variability. Cross-validation on 10% of the sample
had low bias and low variability, but was not suitable for all performance
measures. Internal validity could best be estimated with bootstrapping,
which provided stable estimates with low bias. We conclude that
split-sample validation is inefficient, and recommend bootstrapping for
estimation of internal validity of a predictive logistic regression model.
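
For anyone who wants to try the bootstrap idea, here is a rough sketch of the commonly described optimism-correction bootstrap. To stay NumPy-only it swaps in ordinary least squares and R^2 for the paper's logistic regression and its performance measures, so it is not the exact procedure the authors evaluated. The function names and the simulated data are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def r_squared(beta, X, y):
    """R^2 of a fitted coefficient vector evaluated on (X, y)."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def optimism_corrected_r2(X, y, n_boot=200, seed=0):
    """Return (apparent R^2, optimism-corrected R^2)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = r_squared(fit_ols(X, y), X, y)
    optimism = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        beta_b = fit_ols(X[idx], y[idx])
        # Performance on the bootstrap sample minus performance of the
        # same fitted model on the original sample.
        optimism[b] = r_squared(beta_b, X[idx], y[idx]) - r_squared(beta_b, X, y)
    return apparent, apparent - optimism.mean()


if __name__ == "__main__":
    # Simulated example: 60 rows, 10 predictors, only the first one matters.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(60, 10))
    y = 0.5 * X[:, 0] + rng.normal(size=60)
    apparent, corrected = optimism_corrected_r2(X, y)
    print(f"apparent R^2:  {apparent:.3f}")
    print(f"corrected R^2: {corrected:.3f}")
```

Unlike a split-sample analysis, every observation is used both to fit the model and to estimate how much the apparent performance overstates performance in new subjects. If the model was chosen by some selection procedure, that selection step would need to be repeated inside each bootstrap replicate as well; the sketch above fits a fixed model only.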