Re: Selection of explanatory variables in a model
- From: naught@xxxxxxx
- Date: Tue, 28 Nov 2006 14:52:14 +0000 (UTC)
Cat <job.alerte@xxxxxxxxx> wrote:
Dear co-listers,
I have to build a logistic regression model and I've got a couple of
questions:
- I heard / read (on the Internet) that the number of explanatory
variables in the model must not be greater than n/10 or n/20 (n is
the sample size). Can someone provide me with a serious
bibliographic reference for this?
Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR.
A simulation study of the number of events per variable in logistic
regression analysis. J Clin Epidemiol. 1996 Dec;49(12):1373-9.
Abstract. We performed a Monte Carlo study to evaluate the effect of the
number of events per variable (EPV) analyzed in logistic regression
analysis. The simulations were based on data from a cardiac trial of 673
patients in which 252 deaths occurred and seven variables were cogent
predictors of mortality; the number of events per predictive variable was
(252/7 =) 36 for the full sample. For the simulations, at values of EPV =
2, 5, 10, 15, 20, and 25, we randomly generated 500 samples of the 673
patients, chosen with replacement, according to a logistic model derived
from the full sample. Simulation results for the regression coefficients
for each variable in each group of 500 samples were compared for bias,
precision, and significance testing against the results of the model
fitted to the original sample. For EPV values of 10 or greater, no major
problems occurred. For EPV values less than 10, however, the regression
coefficients were biased in both positive and negative directions; the
large sample variance estimates from the logistic model both overestimated
and underestimated the sample variance of the regression coefficients; the
90% confidence limits about the estimated values did not have proper
coverage; the Wald statistic was conservative under the null hypothesis;
and paradoxical associations (significance in the wrong direction) were
increased. Although other factors (such as the total number of events, or
sample size) may influence the validity of the logistic model, our
findings indicate that low EPV can lead to major problems.
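The arithmetic behind the EPV figure is simple enough to sketch. The helper names below are made up for illustration, the threshold of 10 is the one suggested by the abstract, and the count of 45 events is purely hypothetical (the original question gives the sample size but not the number of events).

```python
def events_per_variable(n_events: int, n_predictors: int) -> float:
    """Events per variable (EPV): outcome events divided by candidate predictors."""
    return n_events / n_predictors


def max_predictors(n_events: int, min_epv: int = 10) -> int:
    """Largest number of predictors that keeps EPV at or above min_epv."""
    return n_events // min_epv


# Figures from the abstract: 252 deaths, 7 predictors.
print(events_per_variable(252, 7))

# Hypothetical: a study with 45 events (not a figure from this thread).
print(max_predictors(45))
```

Note that the rule is stated in terms of events, not total sample size, so n/10 overstates what the data can support unless the outcome is very common.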
- Second, I've got a first set of about 15 variables (for around 150
patients), interactions excluded. I categorized some into binary
variables to yield odds ratios. What is the risk? Lack of power?
More than just lack of power. Among a number of problems, you discard the
opportunity to model nonlinearity, and also potentially generate a number
of artifactual non-zero slopes (see, for example, Maxwell, S. E., &
Delaney, H. D. (1993). Bivariate median splits and spurious statistical
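significance. Psychological Bulletin, 113, 181-190). There's no reason
to create groups to get odds ratios. You can simply rescale the
continuous predictor to some meaningful interval (say, per 10 years of
age rather than per year), so that the odds ratio refers to a change of
that size.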
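As a rough sketch of the rescaling idea: in a logistic model the odds ratio for a change of k units in a continuous predictor is exp(k * beta), where beta is the coefficient per one unit. The coefficient value and the function name below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def odds_ratio(beta_per_unit: float, interval: float = 1.0) -> float:
    """Odds ratio for a change of `interval` units in the predictor."""
    return math.exp(beta_per_unit * interval)


# Hypothetical coefficient: log-odds increase per year of age.
beta_age = 0.04

print(odds_ratio(beta_age))        # odds ratio per 1 year
print(odds_ratio(beta_age, 10.0))  # odds ratio per 10 years
```

Equivalently, dividing the predictor by 10 before fitting gives a coefficient ten times larger and the per-decade odds ratio directly, with no information thrown away.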
- Third, when performing a second selection via the stepwise algorithm,
is there a consensus about the significance cut-off (alpha) to use? I
read 20% instead of 5%. Is this usual?
If you must use stepwise, yes, it's better to use a higher inclusion
threshold and backward selection, but if you want the model to replicate
out of sample, you'll probably be disappointed.
- Regarding SAS programming (I run version 8.2 under Windows),
which procedure, Proc Logistic or Proc GLM, should I choose,
and according to which criteria?
PROC LOGISTIC. PROC GLM fits ordinary least-squares linear models and is
generally not appropriate for binary outcomes.
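One way to see the problem, sketched in Python rather than SAS and using a small made-up data set: a least-squares line fitted to a 0/1 outcome can return "probabilities" below 0 or above 1. The function name and the data are illustrative assumptions, not anything from the original question.

```python
def ols_fit(x, y):
    """Simple least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: binary outcome that switches from 0 to 1 as x increases.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

intercept, slope = ols_fit(x, y)
fitted = [intercept + slope * xi for xi in x]

# The smallest fitted value is negative and the largest exceeds 1.
print(min(fitted), max(fitted))
```

A logistic model avoids this by modelling the log-odds, so fitted probabilities always stay between 0 and 1.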
Thanks a lot.
Catherine.