Kaplan-Meier analysis
We are often interested in analyzing the time to the occurrence of a disease or event, where time is measured in hours, days, weeks, months, or years. The time to disease can be informative for a number of reasons: it can be used to ascertain the severity of a disease, to determine what the prognosis of a disease may be, or to compare the effect of new treatments against existing treatments. In these cases (and others), we need appropriate methods for quantifying the natural history of disease.
Given some data, we want to know the probability that an individual will have a disease event after a certain amount of time. This probability is not easy to calculate, because individuals may be observed for different periods of time, or some will get the disease sooner than others. The Kaplan-Meier method can deal with these situations. In this approach, the exact point in time of the disease event is identified, so that each disease event terminates the previous interval and a new interval is started. The number of person who died at this point is used as the denominator and the number of people who survived up until that point is used as the denominator. Below, we use Stata to demonstrate the Kaplan-Meier method.
Let’s look at some data that is already in Stata. You can read such data into Stata using the input
command (type help input
).
. list time, sep(20) +------+ | time | |------| 1. | 1 | 2. | 1 | 3. | 2 | 4. | 3 | 5. | 4 | 6. | 4 | 7. | 5 | 8. | 5 | 9. | 6 | 10. | 6 | 11. | 7 | 12. | 8 | 13. | 8 | 14. | 8 | 15. | 10 | +------+ . codebook time, compact Variable Obs Unique Mean Min Max Label -------------------------------------------------- time 15 9 5.2 1 10 --------------------------------------------------
This data tells us that two individuals died at the end of week 1, one individual died at the end of week 2, and so on. Using the codebook
command, we see there are 15 disease events across 9 distinct time periods. We would like to count the number of patients that got the disease and the number that did not get the disease for each time interval. Stata provides the sts list
command to do this. Before we can use this command we must first stset
the data. For now, we are just telling Stata that the time at which the disease events occurred (otherwise it is just a set of meaningless numbers to Stata).
. stset time failure event: (assumed to fail at time=time) obs. time interval: (0, time] exit on or before: failure ------------------------------------------------------------------------------ 15 total observations 0 exclusions ------------------------------------------------------------------------------ 15 observations remaining, representing 15 failures in single-record/single-failure data 78 total analysis time at risk and under observation at risk from t = 0 earliest observed entry t = 0 last observed exit t = 10 . tempfile Data . qui save "`Data'"
After running the sts list
command, we see the number of individuals at the start of each time interval (column 1) and the number of disease events (which Stata calls failures) recorded at the end of each time interval (column 2). No individuals were lost to follow-up (column 3).
The output tells us that there were 15 individuals at the start of the first week and 2 got the disease by the end of the first week. At the fourth time interval, 11 individuals entered and 2 got the disease event at the end of the interval, and so on. We will come back to the survivor function (column 4) later.
. tempfile Data1 . sts list, noshow saving("`Data1'") Beg. Net Survivor Std. Time Total Fail Lost Function Error [95% Conf. Int.] ------------------------------------------------------------------------------- 1 15 2 0 0.8667 0.0878 0.5639 0.9649 2 13 1 0 0.8000 0.1033 0.4998 0.9307 3 12 1 0 0.7333 0.1142 0.4362 0.8905 4 11 2 0 0.6000 0.1265 0.3176 0.7965 5 9 2 0 0.4667 0.1288 0.2123 0.6875 6 7 2 0 0.3333 0.1217 0.1215 0.5640 7 5 1 0 0.2667 0.1142 0.0826 0.4963 8 4 3 0 0.0667 0.0644 0.0043 0.2603 10 1 1 0 0.0000 . . . -------------------------------------------------------------------------------
We now calculate the proportion of individuals that got the disease event and those that did not in each time interval. To do this, we save the sts list
output to a temporary data file, which we then read into Stata, as shown below.
. use "`Data1'", clear . gen prop_fail = fail/begin . gen prop_surv = 1 - prop_fail . list time begin fail prop_fail prop_surv, abb(14) sep(10) +---------------------------------------------+ | time begin fail prop_fail prop_surv | |---------------------------------------------| 1. | 1 15 2 .1333333 .8666667 | 2. | 2 13 1 .0769231 .9230769 | 3. | 3 12 1 .0833333 .9166667 | 4. | 4 11 2 .1818182 .8181818 | 5. | 5 9 2 .2222222 .7777778 | 6. | 6 7 2 .2857143 .7142857 | 7. | 7 5 1 .2 .8 | 8. | 8 4 3 .75 .25 | 9. | 10 1 1 1 0 | +---------------------------------------------+
The proportion of individuals that got the disease in the first week is 0.133. Having survived the first week, the proportion of individuals that got the disease in the second week is 0.077. The prop_surv column gives the proption of individuals that did not get the disease, which is $1 - $ prop_fail. This proportion is 0.867 in the first week, which can also be expressed as a probability, i.e., the probability of surviving up until the end of the first week is 0.867. We now ask: having survived the first week, what is the probability of surviving up until the second week? The output shows that this probability is 0.923.
Importantly, we are interested in knowing the probability of survival up until the end of each time-point $t$. For example, what is the probability of surviving up until the second week? In the Kaplan-Meier approach, we treat each time interval as an indepedent event, which means that we can multiply the probability of survival in time $t_1$ by the probability of survival in time $t_2$. Thus, the probability of surviving up until the second week is $S(t=2)$ = 0.867 $\times$ 0.923 = 0.800. This is because there is some probability of surviving the first week and another probability of surviving the second week. Both of these probabilities have to be accounted for, similarly for survival up until later weeks. The survivor function $S(t)$ gives the probability of survival at each time-point $t$, which Stata computed for us in the sts list
command.
. list time begin fail prop_fail prop_surv survivor , abb(14) sep(10) +---------------------------------------------------------+ | time begin fail prop_fail prop_surv survivor | |---------------------------------------------------------| 1. | 1 15 2 .1333333 .8666667 .86666667 | 2. | 2 13 1 .0769231 .9230769 .8 | 3. | 3 12 1 .0833333 .9166667 .73333333 | 4. | 4 11 2 .1818182 .8181818 .6 | 5. | 5 9 2 .2222222 .7777778 .46666667 | 6. | 6 7 2 .2857143 .7142857 .33333333 | 7. | 7 5 1 .2 .8 .26666667 | 8. | 8 4 3 .75 .25 .06666667 | 9. | 10 1 1 1 0 0 | +---------------------------------------------------------+
We can visualize the survivor function using the sts graph
in Stata. (The numbers in the graph indicate the number of individuals at risk in each time interval.)
. use "`Data'", clear . sts graph, noshow atrisk
Stata can easily handle cases when some participants are lost to follow-up. The data below shows the id, time of disease, and a fail variable indicating whether the individual got the disease (fail = 1) or was lost-to-follow-up (fail =0) by the end of the time interval.
. list, sep(10) +------------------+ | id time fail | |------------------| 1. | 1 2 1 | 2. | 2 4 1 | 3. | 3 4 1 | 4. | 4 5 0 | 5. | 5 7 1 | 6. | 6 8 0 | +------------------+
We use the stset
command again, this time with the failure()
argument, and then run the sts list
command to obtain the survival probabilities.
. stset time, failure(fail) failure event: fail != 0 & fail < . obs. time interval: (0, time] exit on or before: failure ------------------------------------------------------------------------------ 6 total observations 0 exclusions ------------------------------------------------------------------------------ 6 observations remaining, representing 4 failures in single-record/single-failure data 30 total analysis time at risk and under observation at risk from t = 0 earliest observed entry t = 0 last observed exit t = 8 . sts list failure _d: fail analysis time _t: time Beg. Net Survivor Std. Time Total Fail Lost Function Error [95% Conf. Int.] ------------------------------------------------------------------------------- 2 6 1 0 0.8333 0.1521 0.2731 0.9747 4 5 2 0 0.5000 0.2041 0.1109 0.8037 5 3 0 1 0.5000 0.2041 0.1109 0.8037 7 2 1 0 0.2500 0.2041 0.0123 0.6459 8 1 0 1 0.2500 0.2041 0.0123 0.6459 ------------------------------------------------------------------------------- . sts graph, atrisk failure _d: fail analysis time _t: time