Survival methods for right censored data (Survival Series 2)

Last updated on Nov 11, 2022 8 min read

Introduction

In Post 1 of this series, I introduced the survivor and hazard functions for survival data. One motivation for survival analysis is the topic of censoring. Censoring is defined as when the event occurs but the participant is no longer under observation. There are generally three reasons for right-censored data: 1) the participant did not experience the event by the end of the study period, 2) the participant prematurely withdrew from the study for some known reason, 3) the participant was lost to follow-up for unknown reasons before the study ended. In the case of censoring, some information is provided about the individual’s survival time, but we do not know the survival time exactly. The study of survival data requires the careful treatment of these censored events.

Let $T$ be a nonnegative random variable that denotes the the event of interest and let $δ$ denote the censoring event. It is convention to treat right censoring as a random event that is unrelated to the event of interest, so that

$δ = 1 if event$ $δ = 0 if censored .$

Let $T_{c}$ denote the time of censoring and let $C = min (T, T_{c})$ . The data then takes the form $(C, δ)$ .

Kaplan-Meier survivor function

For right-censored data, we can the use the Kaplan-Meier (KM) estimator to compute $S (t)$ , which is given by:

$\hat{S} (t) = \prod_{j | t_{j} \leq t} (\frac{n_{j} - d_{j}}{n_{j}})$

where $n_{j}$ is the number of participants at risk at time $t_{j}$ and $d_{j}$ is the number of events at $t_{j}$ . The product is over all the observed event times less than or equal to $t$ .

Imagine we are interested in the time to HIV infection. The expression for $S (t)$ says that at time $t$ , the probability of staying uninfected (if uninfected before $t$ ) is $1 - d_{j} / n_{j}$ , conditional on the probability of being uninfected in all the previous time intervals (hence the product function $\prod$ ).

To demonstrate how $\hat{S} (t)$ is computed for right censored data, consider the following data from Kleinbaum and Klein. I enter the data into R.

# Data from Klein & Kleinbaum Chap 2, pg 51
# censored times group 1 only
time1 <- c(6, 6, 6, 7, 10, 13, 16, 22, 23) 
# uncensored times
time0 <- c(6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35)
time <- c(time1, time0)
cens <- c(rep(1, length(time1)), rep(0, length(time0)))

Next, I use the survival package to set up the data and obtain the relevant estimates. For now, I show the data.frame of the number at risk, the number of events, and the number censored at time $t$ .

set_surv <- survival::Surv(time, cens)
res <- survival::survfit(set_surv ~ 1)
sdat <- with(res, data.frame(time, n.risk, n.event, n.censor))
head(sdat)
  time n.risk n.event n.censor
1    6     21       3        1
2    7     17       1        0
3    9     16       0        1
4   10     15       1        1
5   11     13       0        1
6   13     12       1        0

The first 3 events occurs at time $t =$ 6, where there are 21 participants at risk. Therefore, the probability of survival is $\hat{S (6)} = (21 - 3) / 21$ $= 0.857$ . At time $t =$ 7, there is 1 event and 17 participants at risk (because one participant was right censored at $t = 6$ ). I use R to calculate the inner function, $(n_{j} - d_{j}) / n_{j}$ , and then take the product of these probabilities.

# calculate surv probabilities st
st <- (sdat$n.risk - sdat$n.event) / sdat$n.risk 
st
 [1] 0.8571429 0.9411765 1.0000000 0.9333333 1.0000000 0.9166667 0.9090909
 [8] 1.0000000 1.0000000 1.0000000 0.8571429 0.8333333 1.0000000 1.0000000
[15] 1.0000000 1.0000000
# multiply probabilities at t
St <- cumprod(st)
St
 [1] 0.8571429 0.8067227 0.8067227 0.7529412 0.7529412 0.6901961 0.6274510
 [8] 0.6274510 0.6274510 0.6274510 0.5378151 0.4481793 0.4481793 0.4481793
[15] 0.4481793 0.4481793

Fortunately, the surv library calculates the survival probabilities for you, which can be obtained using the summary function.

sres <- summary(res)
sres
Call: survfit(formula = set_surv ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    6     21       3    0.857  0.0764        0.720        1.000
    7     17       1    0.807  0.0869        0.653        0.996
   10     15       1    0.753  0.0963        0.586        0.968
   13     12       1    0.690  0.1068        0.510        0.935
   16     11       1    0.627  0.1141        0.439        0.896
   22      7       1    0.538  0.1282        0.337        0.858
   23      6       1    0.448  0.1346        0.249        0.807

Recall that $f_{j} (t) = d_{j} / n_{j}$ is the number of uninfected participants that get infected at time $t$ . For example, at the first time-point, $t =$ 6, $\hat{f} (6) = 3 / 21 =$ 0.143. At $t = 7$ , we have one infection among the 17 participants, but we must account for the fact that they remained uninfected to this time. This survival probability, as we have already calculated, is $\hat{S} (6) =$ 0.857. Then, $\hat{f} (7) = \hat{S} (6) \times 1 / 17 =$ 0.05. Thus, we obtain the probability density function, $f (t)$ , from the KM estimator as follows:

$f_{j} (t) = \frac{d_{j}}{n_{j}} \times \prod_{j | t_{j} < t} (\frac{n_{j} - d_{j}}{n_{j}}) .$

This expression says that the probability of being infected at time $t$ is conditional on the probability of being uninfected in the previous time intervals. To demonstate, I multiply the survival estimates from the previous intervals by the probability of getting the infection in the current time interval.

# get the previous interval estimate of survival
Sji <- c(1, St[-length(St)])
ft <- (sdat$n.event / sdat$n.risk) * Sji
ft
 [1] 0.14285714 0.05042017 0.00000000 0.05378151 0.00000000 0.06274510
 [7] 0.06274510 0.00000000 0.00000000 0.00000000 0.08963585 0.08963585
[13] 0.00000000 0.00000000 0.00000000 0.00000000

We can double check this result with a second method for computing $f (t)$ described in this post.

Ft <- 1 - c(1, St)
ft2 <- diff(Ft)
ft2
 [1] 0.14285714 0.05042017 0.00000000 0.05378151 0.00000000 0.06274510
 [7] 0.06274510 0.00000000 0.00000000 0.00000000 0.08963585 0.08963585
[13] 0.00000000 0.00000000 0.00000000 0.00000000

We can also calculate the cumulative hazard function for right-censored data using the Nelson-Aalen estimator, given as:

$\hat{H} (t) = \sum_{j | t_{j} \leq t} \frac{d_{j}}{n_{j}}$

where $n_{j}$ is the number at risk and $d_{j}$ is the number of events at time $t_{j}$ . The sum is over all distinct event times less than or equal to $t$ . This can be easily extracted from the survival results.

res$cumhaz
 [1] 0.1428571 0.2016807 0.2016807 0.2683473 0.2683473 0.3516807 0.4425898
 [8] 0.4425898 0.4425898 0.4425898 0.5854469 0.7521136 0.7521136 0.7521136
[15] 0.7521136 0.7521136

Random censoring

Random censoring, also known as Type III censoring, occurs when a study period is fixed, but participants enter the study at different times. The definition of right censoring under random censoring is the same. Consider the data used from the previous example where $t_{0}$ is the start time of the study. We make it that some of participants enter into the cohort after the study start.

# everyone starts at t0
t0 <- rep(0, length(cens))
# except these participants
t0[c(7, 8)] <- 9
t0[c(16, 19)] <- 13
t0[18] <- 15

Kaplan-Meier

In computing the survival function, R must now account for these delayed entries. We can easily do this using the Surv function by specifying the time (the time the participant started) and time2 (the time the participant was censored) arguments.

sobj <- survival::Surv(time = t0, time2 = time, event = cens)
res1 <- survival::survfit(sobj ~ 1)
sdat1 <- with(res1, data.frame(time.t = time, n.risk,
    n.enter = as.integer(n.enter), n.censor, n.event, 
    surv))
sdat1
   time.t n.risk n.enter n.censor n.event      surv
1       6     16      16        1       3 0.8125000
2       7     12       0        0       1 0.7447917
3       9     11       0        1       0 0.7447917
4      10     12       2        1       1 0.6827257
5      11     10       0        1       0 0.6827257
6      13      9       0        0       1 0.6068673
7      16     11       3        0       1 0.5516975
8      17     10       0        1       0 0.5516975
9      19      9       0        1       0 0.5516975
10     20      8       0        1       0 0.5516975
11     22      7       0        0       1 0.4728836
12     23      6       0        0       1 0.3940697
13     25      5       0        1       0 0.3940697
14     32      4       0        2       0 0.3940697
15     34      2       0        1       0 0.3940697
16     35      1       0        1       0 0.3940697

We now have a new n.enter column in the output. At $t = 0$ , there are 16 participants that enter the study. At the start of $t = 9$ , there are 11 participants at risk and 1 is lost (censored). Therefore at $t = 10$ , there are 12 participants at risk since 2 participants enter the study at this time. However, 1 is lost and 1 is infected, which leaves 10 participants at risk at $t = 11$ . We can confirm the survival results by showing that the KM estimates account for the delayed entry of the 5 participants.

# calculate inner function probabilities
st1 <- (sdat1$n.risk - sdat1$n.event) / sdat1$n.risk 
# multiply probabilities to get KM estimates
St1 <- cumprod(st1)
St1
 [1] 0.8125000 0.7447917 0.7447917 0.6827257 0.6827257 0.6068673 0.5516975
 [8] 0.5516975 0.5516975 0.5516975 0.4728836 0.3940697 0.3940697 0.3940697
[15] 0.3940697 0.3940697

Probability density function

We can also confirm that two methods for obtaining the probability density function $f (t)$ are consistent for random censoring.

# first method
Sji <- c(1, St1[-length(St1)])
ft3 <- (sdat1$n.event / sdat1$n.risk) * Sji
ft3
 [1] 0.18750000 0.06770833 0.00000000 0.06206597 0.00000000 0.07585841
 [7] 0.05516975 0.00000000 0.00000000 0.00000000 0.07881393 0.07881393
[13] 0.00000000 0.00000000 0.00000000 0.00000000
# second method
Ft1 <- 1 - c(1, St1)
ft4 <- diff(Ft1)
ft4
 [1] 0.18750000 0.06770833 0.00000000 0.06206597 0.00000000 0.07585841
 [7] 0.05516975 0.00000000 0.00000000 0.00000000 0.07881393 0.07881393
[13] 0.00000000 0.00000000 0.00000000 0.00000000

R Statistics Data analytics