Re: Selection of explanatory variables in a model



Cat <job.alerte@xxxxxxxxx> wrote:
Dear colisters,

I have to build a logistic regression model and I've got a couple of
questions:


- I heard / read (on the Internet) that the number of explanatory
variables in the model must not be greater than n/10 or n/20 (n is
the sample size). Can someone provide me with a serious
bibliographic reference about this?

The usual reference states the rule in terms of events per variable
(EPV), not total sample size:

Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR.
A simulation study of the number of events per variable in logistic
regression analysis. J Clin Epidemiol. 1996 Dec;49(12):1373-9.

Abstract. We performed a Monte Carlo study to evaluate the effect of the
number of events per variable (EPV) analyzed in logistic regression
analysis. The simulations were based on data from a cardiac trial of 673
patients in which 252 deaths occurred and seven variables were cogent
predictors of mortality; the number of events per predictive variable was
(252/7 =) 36 for the full sample. For the simulations, at values of EPV =
2, 5, 10, 15, 20, and 25, we randomly generated 500 samples of the 673
patients, chosen with replacement, according to a logistic model derived
from the full sample. Simulation results for the regression coefficients
for each variable in each group of 500 samples were compared for bias,
precision, and significance testing against the results of the model
fitted to the original sample. For EPV values of 10 or greater, no major
problems occurred. For EPV values less than 10, however, the regression
coefficients were biased in both positive and negative directions; the
large sample variance estimates from the logistic model both overestimated
and underestimated the sample variance of the regression coefficients; the
90% confidence limits about the estimated values did not have proper
coverage; the Wald statistic was conservative under the null hypothesis;
and paradoxical associations (significance in the wrong direction) were
increased. Although other factors (such as the total number of events, or
sample size) may influence the validity of the logistic model, our
findings indicate that low EPV can lead to major problems.
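
To make the arithmetic concrete, here is a rough sketch in Python. The function names are mine, not from the paper, and taking the smaller of the two outcome groups as the limiting count is a common convention that I am assuming here. The 40/110 split for a 150-patient sample is purely hypothetical, since the actual number of events was not given.

```python
def events_per_variable(n_events, n_nonevents, n_predictors):
    """EPV, using the smaller outcome group as the limiting count (assumed convention)."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return min(n_events, n_nonevents) / n_predictors


def max_predictors(n_events, n_nonevents, min_epv=10):
    """Largest number of candidate predictors that keeps EPV >= min_epv."""
    return min(n_events, n_nonevents) // min_epv


# Full sample from the abstract: 673 patients, 252 deaths, 7 predictors.
print(events_per_variable(252, 673 - 252, 7))  # 36.0

# Hypothetical split of 150 patients: 40 with the event, 110 without.
print(max_predictors(40, 110))  # 4
```

The point is that the budget depends on the number of events, so n/10 applied to the total sample size can be far too generous when the outcome is rare.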

- Second, I've got a first set of about 15 variables (for around 150
patients), interactions excluded. I categorized some into binary
variables to yield odds ratios. What is the risk? A lack of power?

More than just lack of power. Among a number of problems, you discard the
opportunity to model nonlinearity, and also potentially generate a number
of artifactual non-zero slopes (see, for example, Maxwell, S. E., &
Delaney, H. D. (1993). Bivariate median splits and spurious statistical
significance. Psychological Bulletin, 113, 181-190.) There's no reason
to create groups to get odds ratios. You can simply rescale the
continuous predictor to some meaningful interval.
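
A minimal sketch of the rescaling, in Python: if the fitted slope is the change in log-odds per one unit of the predictor, the odds ratio for a k-unit change is exp(k * slope). The slope value below is hypothetical, chosen only for illustration, and the function name is mine.

```python
import math


def odds_ratio_per_interval(beta_per_unit, interval):
    """Odds ratio for an `interval`-unit increase, given the log-odds slope per unit."""
    return math.exp(beta_per_unit * interval)


beta_age = 0.04  # hypothetical log-odds slope per year of age

print(round(odds_ratio_per_interval(beta_age, 1), 3))   # 1.041 per year
print(round(odds_ratio_per_interval(beta_age, 10), 3))  # 1.492 per decade
```

Dividing the predictor by 10 before fitting gives the same per-decade odds ratio directly from the software output, with no need to cut the variable into groups.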

- Third, when performing a second selection via the stepwise algorithm,
is there a consensus about the significance cut-off (alpha) to use? I
read 20% instead of 5%. Is this usual?

If you must use stepwise, yes, it's better to use a higher alpha (such
as 0.20) together with backward selection, but if you want the model to
replicate out of sample, you'll probably be disappointed.
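
One way to see part of the replication problem is a back-of-the-envelope calculation, sketched here in Python. It is deliberately simplified: it assumes every candidate is pure noise and that each is tested once, independently, which is not what a real stepwise procedure does. The function names are mine.

```python
def expected_false_inclusions(n_noise, alpha):
    """Expected number of pure-noise predictors passing a test at level alpha."""
    return n_noise * alpha


def prob_at_least_one(n_noise, alpha):
    """Chance that at least one noise predictor passes, assuming independent tests."""
    return 1 - (1 - alpha) ** n_noise


# 15 candidate predictors, as in the question, at the two cut-offs mentioned.
for alpha in (0.05, 0.20):
    print(alpha,
          expected_false_inclusions(15, alpha),
          round(prob_at_least_one(15, alpha), 3))
```

Under these assumptions, some noise variables are expected to survive at either cut-off, and a variable selected by chance in one sample has no reason to hold up in the next.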

- Regarding SAS programming (I run version 8.2 under Windows),
which procedure, Proc Logistic or Proc GLM, should I choose,
and according to which criteria?

PROC GLM is generally not appropriate for binary outcomes; it fits an
ordinary least-squares linear model. Use PROC LOGISTIC.
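
One reason, sketched in Python rather than SAS: a straight-line model for a 0/1 outcome can predict values outside [0, 1], while the logistic form always stays strictly between 0 and 1. The coefficients below are hypothetical and the function names are mine; this only illustrates the two functional forms, not what either procedure would estimate.

```python
import math


def linear_prediction(intercept, slope, x):
    """Prediction from a straight-line (least-squares style) model."""
    return intercept + slope * x


def logistic_prediction(intercept, slope, x):
    """Predicted probability from a logistic model (inverse logit of the linear predictor)."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))


# Hypothetical coefficients, for illustration only.
for x in (0, 10, 30):
    print(x,
          round(linear_prediction(0.1, 0.05, x), 3),
          round(logistic_prediction(-2.0, 0.1, x), 3))
```

A least-squares fit also assumes constant error variance, which a binary outcome cannot satisfy.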





Thanks a lot.


Catherine.
