Survival methods for right censored data (Survival Series 2)
Introduction
In Post 1 of this series, I introduced the survivor and hazard functions for survival data. One motivation for survival analysis is the topic of censoring. Censoring is defined as when the event occurs but the participant is no longer under observation. There are generally three reasons for right-censored data: 1) the participant did not experience the event by the end of the study period, 2) the participant prematurely withdrew from the study for some known reason, 3) the participant was lost to follow-up for unknown reasons before the study ended. In the case of censoring, some information is provided about the individual’s survival time, but we do not know the survival time exactly. The study of survival data requires the careful treatment of these censored events.
Let $T$ be a nonnegative random variable that denotes the the event of interest and let $\delta$ denote the censoring event. It is convention to treat right censoring as a random event that is unrelated to the event of interest, so that
$$ \delta = 1 \text{ if event } $$ $$ \delta = 0 \text{ if censored}. $$
Let $T_c$ denote the time of censoring and let $C = \min(T, T_c)$. The data then takes the form $(C, \delta)$.
Kaplan-Meier survivor function
For right-censored data, we can the use the Kaplan-Meier (KM) estimator to compute $S(t)$, which is given by:
\begin{equation} \hat{S}(t) = \prod_{j|t_j\leq t} \left(\frac{n_j - d_j}{n_j}\right) \end{equation}
where $n_j$ is the number of participants at risk at time $t_j$ and $d_j$ is the number of events at $t_j$. The product is over all the observed event times less than or equal to $t$.
Imagine we are interested in the time to HIV infection. The expression for $S(t)$ says that at time $t$, the probability of staying uninfected (if uninfected before $t$) is $1 - d_j/n_j$, conditional on the probability of being uninfected in all the previous time intervals (hence the product function $\prod$).
To demonstrate how $\hat{S}(t)$ is computed for right censored data, consider the
following data from Kleinbaum and
Klein. I enter the data into R
.
# Data from Klein & Kleinbaum Chap 2, pg 51
# censored times group 1 only
time1 <- c(6, 6, 6, 7, 10, 13, 16, 22, 23)
# uncensored times
time0 <- c(6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35)
time <- c(time1, time0)
cens <- c(rep(1, length(time1)), rep(0, length(time0)))
Next, I use the survival
package to set up the data and obtain the relevant estimates. For now, I show the data.frame of the number at risk, the number of events, and the number censored at time $t$.
set_surv <- survival::Surv(time, cens)
res <- survival::survfit(set_surv ~ 1)
sdat <- with(res, data.frame(time, n.risk, n.event, n.censor))
head(sdat)
time n.risk n.event n.censor
1 6 21 3 1
2 7 17 1 0
3 9 16 0 1
4 10 15 1 1
5 11 13 0 1
6 13 12 1 0
The first 3 events occurs at time $t = $ 6, where there are 21 participants at risk. Therefore, the probability of survival is $\hat{S(6)} = (21 - 3) / 21$ $= 0.857$. At time $t = $ 7, there is 1 event and 17 participants at risk (because one participant was right censored at $t = 6$). I use R
to calculate the inner function, $(n_j - d_j) / n_j$, and then take the product of these probabilities.
# calculate surv probabilities st
st <- (sdat$n.risk - sdat$n.event) / sdat$n.risk
st
[1] 0.8571429 0.9411765 1.0000000 0.9333333 1.0000000 0.9166667 0.9090909
[8] 1.0000000 1.0000000 1.0000000 0.8571429 0.8333333 1.0000000 1.0000000
[15] 1.0000000 1.0000000
# multiply probabilities at t
St <- cumprod(st)
St
[1] 0.8571429 0.8067227 0.8067227 0.7529412 0.7529412 0.6901961 0.6274510
[8] 0.6274510 0.6274510 0.6274510 0.5378151 0.4481793 0.4481793 0.4481793
[15] 0.4481793 0.4481793
Fortunately, the surv
library calculates the survival probabilities for you, which can
be obtained using the summary
function.
sres <- summary(res)
sres
Call: survfit(formula = set_surv ~ 1)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
6 21 3 0.857 0.0764 0.720 1.000
7 17 1 0.807 0.0869 0.653 0.996
10 15 1 0.753 0.0963 0.586 0.968
13 12 1 0.690 0.1068 0.510 0.935
16 11 1 0.627 0.1141 0.439 0.896
22 7 1 0.538 0.1282 0.337 0.858
23 6 1 0.448 0.1346 0.249 0.807
Recall that $f_j(t) = d_j/n_j$ is the number of uninfected participants that get infected at time $t$. For example, at the first time-point, $t = $ 6, $\hat{f}(6) = 3/21 = $ 0.143. At $t = 7$, we have one infection among the 17 participants, but we must account for the fact that they remained uninfected to this time. This survival probability, as we have already calculated, is $\hat{S}(6) = $ 0.857. Then, $\hat{f}(7) = \hat{S}(6) \times 1/17 =$ 0.05. Thus, we obtain the probability density function, $f(t)$, from the KM estimator as follows:
\begin{equation} f_j(t) = \frac{d_j}{n_j} \times \prod_{j|t_j < t} \left(\frac{n_j - d_j}{n_j}\right). \end{equation}
This expression says that the probability of being infected at time $t$ is conditional on the probability of being uninfected in the previous time intervals. To demonstate, I multiply the survival estimates from the previous intervals by the probability of getting the infection in the current time interval.
# get the previous interval estimate of survival
Sji <- c(1, St[-length(St)])
ft <- (sdat$n.event / sdat$n.risk) * Sji
ft
[1] 0.14285714 0.05042017 0.00000000 0.05378151 0.00000000 0.06274510
[7] 0.06274510 0.00000000 0.00000000 0.00000000 0.08963585 0.08963585
[13] 0.00000000 0.00000000 0.00000000 0.00000000
We can double check this result with a second method for computing $f(t)$ described in this post.
Ft <- 1 - c(1, St)
ft2 <- diff(Ft)
ft2
[1] 0.14285714 0.05042017 0.00000000 0.05378151 0.00000000 0.06274510
[7] 0.06274510 0.00000000 0.00000000 0.00000000 0.08963585 0.08963585
[13] 0.00000000 0.00000000 0.00000000 0.00000000
We can also calculate the cumulative hazard function for right-censored data using the Nelson-Aalen estimator, given as:
\begin{equation}\label{eqname:Ht_na} \hat{H}(t) = \sum_{j|t_j \leq t} \frac{d_j}{n_j} \end{equation}
where $n_j$ is the number at risk and $d_j$ is the number of events at
time $t_j$. The sum is over all distinct event times less than or equal to $t$.
This can be easily extracted from the survival
results.
res$cumhaz
[1] 0.1428571 0.2016807 0.2016807 0.2683473 0.2683473 0.3516807 0.4425898
[8] 0.4425898 0.4425898 0.4425898 0.5854469 0.7521136 0.7521136 0.7521136
[15] 0.7521136 0.7521136
Random censoring
Random censoring, also known as Type III censoring, occurs when a study period is fixed, but participants enter the study at different times. The definition of right censoring under random censoring is the same. Consider the data used from the previous example where $t_0$ is the start time of the study. We make it that some of participants enter into the cohort after the study start.
# everyone starts at t0
t0 <- rep(0, length(cens))
# except these participants
t0[c(7, 8)] <- 9
t0[c(16, 19)] <- 13
t0[18] <- 15
Kaplan-Meier
In computing the survival function, R
must now account for these delayed entries. We can easily do this using the Surv
function by specifying the time
(the time the participant started) and time2
(the time the participant was censored) arguments.
sobj <- survival::Surv(time = t0, time2 = time, event = cens)
res1 <- survival::survfit(sobj ~ 1)
sdat1 <- with(res1, data.frame(time.t = time, n.risk,
n.enter = as.integer(n.enter), n.censor, n.event,
surv))
sdat1
time.t n.risk n.enter n.censor n.event surv
1 6 16 16 1 3 0.8125000
2 7 12 0 0 1 0.7447917
3 9 11 0 1 0 0.7447917
4 10 12 2 1 1 0.6827257
5 11 10 0 1 0 0.6827257
6 13 9 0 0 1 0.6068673
7 16 11 3 0 1 0.5516975
8 17 10 0 1 0 0.5516975
9 19 9 0 1 0 0.5516975
10 20 8 0 1 0 0.5516975
11 22 7 0 0 1 0.4728836
12 23 6 0 0 1 0.3940697
13 25 5 0 1 0 0.3940697
14 32 4 0 2 0 0.3940697
15 34 2 0 1 0 0.3940697
16 35 1 0 1 0 0.3940697
We now have a new n.enter
column in the output. At $t = 0$, there are 16 participants that enter
the study. At the start of $t = 9$, there are 11 participants at risk and 1 is lost
(censored). Therefore at $t = 10$, there are 12 participants at risk since 2
participants enter the study at this time. However, 1 is lost and 1 is infected,
which leaves 10 participants at risk at $t = 11$. We can confirm the survival results
by showing that the KM estimates account for the delayed entry of the 5 participants.
# calculate inner function probabilities
st1 <- (sdat1$n.risk - sdat1$n.event) / sdat1$n.risk
# multiply probabilities to get KM estimates
St1 <- cumprod(st1)
St1
[1] 0.8125000 0.7447917 0.7447917 0.6827257 0.6827257 0.6068673 0.5516975
[8] 0.5516975 0.5516975 0.5516975 0.4728836 0.3940697 0.3940697 0.3940697
[15] 0.3940697 0.3940697
Probability density function
We can also confirm that two methods for obtaining the probability density function $f(t)$ are consistent for random censoring.
# first method
Sji <- c(1, St1[-length(St1)])
ft3 <- (sdat1$n.event / sdat1$n.risk) * Sji
ft3
[1] 0.18750000 0.06770833 0.00000000 0.06206597 0.00000000 0.07585841
[7] 0.05516975 0.00000000 0.00000000 0.00000000 0.07881393 0.07881393
[13] 0.00000000 0.00000000 0.00000000 0.00000000
# second method
Ft1 <- 1 - c(1, St1)
ft4 <- diff(Ft1)
ft4
[1] 0.18750000 0.06770833 0.00000000 0.06206597 0.00000000 0.07585841
[7] 0.05516975 0.00000000 0.00000000 0.00000000 0.07881393 0.07881393
[13] 0.00000000 0.00000000 0.00000000 0.00000000