Kaplan-Meier analysis

Last updated on Nov 11, 2022 8 min read

We are often interested in analyzing the time to the occurrence of a disease or event, where time is measured in hours, days, weeks, months, or years. The time to disease can be informative for a number of reasons: it can be used to ascertain the severity of a disease, to determine what the prognosis of a disease may be, or to compare the effect of new treatments against existing treatments. In these cases (and others), we need appropriate methods for quantifying the natural history of disease.

Given some data, we want to know the probability that an individual will have a disease event after a certain amount of time. This probability is not easy to calculate, because individuals may be observed for different periods of time, or some will get the disease sooner than others. The Kaplan-Meier method can deal with these situations. In this approach, the exact point in time of the disease event is identified, so that each disease event terminates the previous interval and a new interval is started. The number of person who died at this point is used as the denominator and the number of people who survived up until that point is used as the denominator. Below, we use Stata to demonstrate the Kaplan-Meier method.

Let’s look at some data that is already in Stata. You can read such data into Stata using the input command (type help input).

. list time, sep(20)

     +------+
     | time |
     |------|
  1. |    1 |
  2. |    1 |
  3. |    2 |
  4. |    3 |
  5. |    4 |
  6. |    4 |
  7. |    5 |
  8. |    5 |
  9. |    6 |
 10. |    6 |
 11. |    7 |
 12. |    8 |
 13. |    8 |
 14. |    8 |
 15. |   10 |
     +------+

. codebook time, compact

Variable   Obs Unique  Mean  Min  Max  Label
--------------------------------------------------
time        15      9   5.2    1   10  
--------------------------------------------------

This data tells us that two individuals died at the end of week 1, one individual died at the end of week 2, and so on. Using the codebook command, we see there are 15 disease events across 9 distinct time periods. We would like to count the number of patients that got the disease and the number that did not get the disease for each time interval. Stata provides the sts list command to do this. Before we can use this command we must first stset the data. For now, we are just telling Stata that the time at which the disease events occurred (otherwise it is just a set of meaningless numbers to Stata).

. stset time

     failure event:  (assumed to fail at time=time)
obs. time interval:  (0, time]
 exit on or before:  failure

------------------------------------------------------------------------------
         15  total observations
          0  exclusions
------------------------------------------------------------------------------
         15  observations remaining, representing
         15  failures in single-record/single-failure data
         78  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =        10

. tempfile Data

. qui save "`Data'"

After running the sts list command, we see the number of individuals at the start of each time interval (column 1) and the number of disease events (which Stata calls failures) recorded at the end of each time interval (column 2). No individuals were lost to follow-up (column 3). The output tells us that there were 15 individuals at the start of the first week and 2 got the disease by the end of the first week. At the fourth time interval, 11 individuals entered and 2 got the disease event at the end of the interval, and so on. We will come back to the survivor function (column 4) later.

. tempfile Data1

. sts list, noshow saving("`Data1'")

           Beg.          Net            Survivor      Std.
  Time    Total   Fail   Lost           Function     Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     1       15      2      0             0.8667    0.0878     0.5639    0.9649
     2       13      1      0             0.8000    0.1033     0.4998    0.9307
     3       12      1      0             0.7333    0.1142     0.4362    0.8905
     4       11      2      0             0.6000    0.1265     0.3176    0.7965
     5        9      2      0             0.4667    0.1288     0.2123    0.6875
     6        7      2      0             0.3333    0.1217     0.1215    0.5640
     7        5      1      0             0.2667    0.1142     0.0826    0.4963
     8        4      3      0             0.0667    0.0644     0.0043    0.2603
    10        1      1      0             0.0000         .          .         .
-------------------------------------------------------------------------------

We now calculate the proportion of individuals that got the disease event and those that did not in each time interval. To do this, we save the sts list output to a temporary data file, which we then read into Stata, as shown below.

. use "`Data1'", clear

. gen prop_fail = fail/begin

. gen prop_surv = 1 - prop_fail

. list time begin fail prop_fail prop_surv, abb(14) sep(10)

     +---------------------------------------------+
     | time   begin   fail   prop_fail   prop_surv |
     |---------------------------------------------|
  1. |    1      15      2    .1333333    .8666667 |
  2. |    2      13      1    .0769231    .9230769 |
  3. |    3      12      1    .0833333    .9166667 |
  4. |    4      11      2    .1818182    .8181818 |
  5. |    5       9      2    .2222222    .7777778 |
  6. |    6       7      2    .2857143    .7142857 |
  7. |    7       5      1          .2          .8 |
  8. |    8       4      3         .75         .25 |
  9. |   10       1      1           1           0 |
     +---------------------------------------------+

The proportion of individuals that got the disease in the first week is 0.133. Having survived the first week, the proportion of individuals that got the disease in the second week is 0.077. The prop_surv column gives the proption of individuals that did not get the disease, which is $1 -$ prop_fail. This proportion is 0.867 in the first week, which can also be expressed as a probability, i.e., the probability of surviving up until the end of the first week is 0.867. We now ask: having survived the first week, what is the probability of surviving up until the second week? The output shows that this probability is 0.923.

Importantly, we are interested in knowing the probability of survival up until the end of each time-point $t$ . For example, what is the probability of surviving up until the second week? In the Kaplan-Meier approach, we treat each time interval as an indepedent event, which means that we can multiply the probability of survival in time $t_{1}$ by the probability of survival in time $t_{2}$ . Thus, the probability of surviving up until the second week is $S (t = 2)$ = 0.867 $\times$ 0.923 = 0.800. This is because there is some probability of surviving the first week and another probability of surviving the second week. Both of these probabilities have to be accounted for, similarly for survival up until later weeks. The survivor function $S (t)$ gives the probability of survival at each time-point $t$ , which Stata computed for us in the sts list command.

. list time begin fail prop_fail prop_surv survivor , abb(14) sep(10)

     +---------------------------------------------------------+
     | time   begin   fail   prop_fail   prop_surv    survivor |
     |---------------------------------------------------------|
  1. |    1      15      2    .1333333    .8666667   .86666667 |
  2. |    2      13      1    .0769231    .9230769          .8 |
  3. |    3      12      1    .0833333    .9166667   .73333333 |
  4. |    4      11      2    .1818182    .8181818          .6 |
  5. |    5       9      2    .2222222    .7777778   .46666667 |
  6. |    6       7      2    .2857143    .7142857   .33333333 |
  7. |    7       5      1          .2          .8   .26666667 |
  8. |    8       4      3         .75         .25   .06666667 |
  9. |   10       1      1           1           0           0 |
     +---------------------------------------------------------+

We can visualize the survivor function using the sts graph in Stata. (The numbers in the graph indicate the number of individuals at risk in each time interval.)

. use "`Data'", clear

. sts graph, noshow atrisk

Stata can easily handle cases when some participants are lost to follow-up. The data below shows the id, time of disease, and a fail variable indicating whether the individual got the disease (fail = 1) or was lost-to-follow-up (fail =0) by the end of the time interval.

. list, sep(10)

     +------------------+
     | id   time   fail |
     |------------------|
  1. |  1      2      1 |
  2. |  2      4      1 |
  3. |  3      4      1 |
  4. |  4      5      0 |
  5. |  5      7      1 |
  6. |  6      8      0 |
     +------------------+

We use the stset command again, this time with the failure() argument, and then run the sts list command to obtain the survival probabilities.

. stset time, failure(fail)

     failure event:  fail != 0 & fail < .
obs. time interval:  (0, time]
 exit on or before:  failure

------------------------------------------------------------------------------
          6  total observations
          0  exclusions
------------------------------------------------------------------------------
          6  observations remaining, representing
          4  failures in single-record/single-failure data
         30  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =         8

. sts list

         failure _d:  fail
   analysis time _t:  time

           Beg.          Net            Survivor      Std.
  Time    Total   Fail   Lost           Function     Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     2        6      1      0             0.8333    0.1521     0.2731    0.9747
     4        5      2      0             0.5000    0.2041     0.1109    0.8037
     5        3      0      1             0.5000    0.2041     0.1109    0.8037
     7        2      1      0             0.2500    0.2041     0.0123    0.6459
     8        1      0      1             0.2500    0.2041     0.0123    0.6459
-------------------------------------------------------------------------------

. sts graph, atrisk

         failure _d:  fail
   analysis time _t:  time

Kaplan-Meier analysis

Alain Vandormael

Senior Data Scientist, PhD, MSc