A methodology problem in Cox regressions
- From: "Alfa Beta" <alfa@xxxxxxx>
- Date: Tue, 11 Apr 2006 19:03:33 GMT
Question: Is it acceptable in a Cox regression to include observations
that you have incomplete followup data on, i.e. observations where followup
stops some time before the terminal event?
I haven't been able to find anyone who could give a straight answer back on
my home turf, so I turn to this list now.
I can illustrate the problem. You study the survival of X over 5 years with
regard to how various variables influence hazard. Let's say X is a rare
bird or something. You can't get a large enough sample in one year alone to
follow for 5 years, so you have to add newly hatched observations/birds each
spring, and so you follow these new birds for 5 years too. Under the
proportional hazards assumption it would be ok to clump all the observations
together in the end and make one big Cox regression on it all - let us
assume we can make that assumption for the sake of argument.
However, here comes the thing. In year 5 we decide to include a final
generation of birds and we will have enough data. We just have to get 5
years of followup data on those birds too and we're ready to do the
analysis. But after year 7 it turns out we have to abandon the project.
Funds running out or something. Let's see where we stand:
generation 1 - 7 years of followup data
generation 2 - 6 years
generation 3 - 5 years
generation 4 - 4 years
generation 5 - 3 years
Assume that there are some specific scientific reasons for setting the
terminal event at year 5 in the followup timeline. We can then safely use
generations 1-3 together for the regression, since they all have at least 5
years of followup data. But my question is: is it wrong to also include
generations 4-5? And if so, why exactly? And how serious a "statistical
crime" would it be to use incomplete data like this if it isn't right?
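To make the setup concrete, here is a minimal sketch of how the design described above could be written down as (time, event) pairs, which is the form most survival software expects. It assumes each generation hatches at the start of its own year and that the project stops after year 7. All names in it (`available_followup`, `to_record`, and so on) are hypothetical and only for illustration; it shows how the data could be coded, not whether including generations 4-5 is justified.

```python
# Hypothetical illustration: these names and values are assumptions,
# chosen only to mirror the example in the post.
STUDY_END_YEAR = 7   # project abandoned after year 7
HORIZON = 5          # followup window of scientific interest, in years


def available_followup(generation):
    """Years of followup for a generation hatched at the start of year `generation`."""
    return STUDY_END_YEAR - (generation - 1)


def to_record(generation, death_time=None):
    """Return (time, event) for one bird.

    death_time: years from hatching to death, or None if no death occurred.
    event is 1 if the death falls inside the observed window, 0 if the
    bird is censored (still alive when observation stops).
    """
    cutoff = min(HORIZON, available_followup(generation))
    if death_time is not None and death_time <= cutoff:
        return (death_time, 1)
    # Either no death, or a death after observation stopped (never seen).
    return (cutoff, 0)


followup = {g: available_followup(g) for g in range(1, 6)}
print(followup)                      # followup years per generation
print(to_record(1, death_time=2.5))  # death observed in generation 1
print(to_record(3))                  # survived the full 5-year window
print(to_record(4))                  # observation stops after 4 years
print(to_record(5, death_time=4.0))  # death after observation stopped
```

In this coding a bird from generation 4 or 5 that is still alive when the project ends gets event 0 at year 4 or 3, rather than being treated as a survivor at year 5.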
One thought I have had is that maybe it is wrong because it could let you
get significant regressions where you otherwise wouldn't be able to. The
improved likelihood that you have found the "true" regression is misleading
since this improvement is an improved accuracy only at the beginning of the
timeline, whereas the end of the survival curve is still as uncertain as
before. Is this correct?
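The intuition about the two ends of the timeline can be checked with simple counting. The sketch below assumes equal-sized generations (`BIRDS_PER_GENERATION` is a made-up number) and ignores deaths, so the counts are only an upper bound on how many birds could still be under observation in each followup year.

```python
# Hypothetical illustration: generation size and names are assumptions.
STUDY_END_YEAR = 7
HORIZON = 5
BIRDS_PER_GENERATION = 20  # made-up value, not from the post


def generations_observed(followup_year):
    """Generations whose followup reaches the given followup year (1-based)."""
    return [g for g in range(1, 6)
            if STUDY_END_YEAR - (g - 1) >= followup_year]


# Upper bound on birds under observation in each followup year (deaths ignored).
max_observed = {}
for year in range(1, HORIZON + 1):
    gens = generations_observed(year)
    max_observed[year] = len(gens) * BIRDS_PER_GENERATION
    print(year, gens, max_observed[year])
```

Under these assumptions all five generations contribute to followup years 1-3, four to year 4, and only generations 1-3 to year 5, so the extra generations add information only to the early part of the timeline.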
Then again, I have noticed that while SPSS, which is what I use, does not
protest when you use incomplete data in this way, it might shorten the
survival graphs somewhat. The tables look the same as before at first glance, but
the graphs come out looking as if you had set the terminal event
earlier. So I also feel unsure of how SPSS actually deals with this issue.
Any input would be much appreciated.
/andreas