Survivor and hazard functions (Survival Series 1)
Introduction
Survival analysis is a set of statistical procedures for studying the time to an event, such as a marriage, infection, or death. In this post I describe two key concepts in survival analysis called the survivor and hazard functions and show how they can be derived from the familiar probability density and cumulative density functions.
Let $T$ denote the time (in days, weeks, years, etc) to an event, which is a random variable. Because $T$ is a random variable, it will have a probability density function (pdf): $f(t)=Pr(T=t)$ and a cumulative distribution function (cdf): $F(t)=Pr(T\leq t)$.
Consider the procedure of flipping a coin $N$ times and counting the number of times the coin lands tails-up, denoted by $T$. Because it is a random variable, there will be some probability of observing exactly 0 or 1 or 2 or 3 or $\dots N$ tails. The probabilities can be obtained by modelling $T$ as a binomial distribution with parameters $n \in N$ and $p \in [0, 1]$. Then, the probability of getting exactly $t$ tails in $n$ coin flips is given by the pdf:
\begin{equation} f(t) = Pr(T=t) = \binom{n}{t} p^t (1-p)^{n-t} \text{ for } t=0,\dots,n \end{equation}
For $n=10$ coin flips and a fair coin $p=0.5$, the pdf of $T$ is given in the table below.
set.seed(100300)
n <- 10; p <- 0.5; t <- c(0:n)
dens_t <- choose(n, t) * (p^t) * (1-p)^(n-t)
We can now calculate the probability of observing, for example, exactly 4 tails in 10 flips. (It is $f(4) = Pr(T=4)=$ 0.205).
tails: t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
prob: f(t) | 0.001 | 0.01 | 0.044 | 0.117 | 0.205 | 0.246 | 0.205 | 0.117 | 0.044 | 0.01 | 0.001 |
The cdf represents the probability that the random variable $T$ takes on a value less than or equal to $t$. For a discrete random variable, the cdf can be obtained as the sum of the probabilities calculated from the pdf $f(t)$. The cdf is given as:
\begin{equation} F_T(t) = P(T \leq t) = \sum_{t_i \leq t} p(t_i) \text{ for } i=1,\dots,n \end{equation}
We can easily calculate the cdf using the cumsum
function.
cdf_t <- cumsum(dens_t)
tails: t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
prob: F(t) | 0.001 | 0.011 | 0.055 | 0.172 | 0.377 | 0.623 | 0.828 | 0.945 | 0.989 | 0.999 | 1 |
plot(cdf_t, xlab = "Tails", ylab = "F(t)", type="s", bty = "l",
main = "Cumulative density function", cex.main = 0.9)
What is the probability of getting either 0 or 1 or 2 or 3 tails in 10 flips? It
is $F(3) = Pr(T\leq 3)=$ 0.172. We can recover the pdf
from the cdf for each outcome by taking the difference at each step of the
cdf (which is the reverse of the cumsum
process we used to obtain the cdf).
dens_tx <- diff(c(0, cdf_t))
all(dens_tx == dens_t)
[1] TRUE
Survival data
Thus far, I have used a coin flipping example to demonstrate the relationship between the cdf and the survivor function. In epidemiological studies, we are typically interested in the times to an event, which do not follow a binomial distribution. To measure the event times, we enroll participants into a study who are at risk of experiencing the event and collect their data over time. Data collection stops when the failure is recorded or if the participant is lost to follow-up. We say that the participant is under observation from the time of enrollment until the last follow-up time. The event occurs only once after which the participant is no longer observed.
Survivor function
In survival analysis, we refer to the survivor function $S(t)$, rather than $F(t)$. The survivor function is simply the reverse of the cdf of $T$:
$$ S(t)= 1 - F(t) = Pr(T>t). $$
The survivor function gives the probability of surviving beyond $t$. It is related to $f(t)$ in the following way:
\begin{equation} S(t) = Pr(T>t) = \int_{t}^{\infty} f(t) dt \end{equation}
with the properties:
$$ S(t) = {1 \text{ for } t = 0 \text{ and } S(t) = 0 \text{ for } t = \infty}. $$
In words,
$$ S(t) = P(\text{a person survives longer than } t). $$
If the event times are known exactly, the survivor function can be estimated from the data using:
\begin{equation}\label{eqname:st} \hat{S}(t) = \frac{\text{number of persons surviving longer than } t}{\text{ number of persons at } t_0}. \end{equation}
where $t_0$ represent the start time of the study. The survivor function is related to the density function, $f(t)$, in the following way:
$$ S(t) = Pr(T>t) = \int_{t}^{\infty} f(t) dt. $$
For this reason, the survivor function is also called the cumulative survival rate. We can get $f(t)$ from the survivor function as follows:
\begin{equation}\label{eqname:ft} f(t) = \frac{dF(t)}{dt} = \frac{d}{dt} {1-S(t)} = - \frac{dS(t)}{dt} \end{equation}
so that $f(t)$ may be thought of as an approximate probability that the event will occur at time $t$. We say that this result gives the instantaneous rate of surviving at time $t$. Therefore, the survivor function has a pdf defined as the limit of the probability that a person fails in the short interval $t$ to $t + \Delta t$ per unit width $\Delta t$:
\begin{equation} f(t) = \lim_{\Delta{t} \to 0} \frac{P[\text{a person fails in } (t, \Delta t)]}{\Delta t} \end{equation}
In words, $f(t)$ can be estimated from the data using:
\begin{equation*} \hat{f}(t) = \frac{\text{number of persons failing in interval } t}{(\text{number of persons at } t_0) \times (\text{interval width})}. \end{equation*}
Hazard function
The hazard function is the known as the conditional failure rate. It is the rapidity with which new failures occur during the observation period. Specifically, the hazard function gives the instantaneous potential per unit time for the failure to occur, given that the person has survived up until time $t$.
$$ h(t) = \lim_{\Delta{t} \to 0} \frac{P(t + \Delta{t} > T > t | T > t)}{\Delta{t}} = \frac{f(t)}{S(t)}. $$
In words,
\begin{equation} h(t) = \lim_{\Delta{t} \to 0} \frac{P[\text{a person fails in } (t, \Delta t) \text{ given survival up to } t]}{\Delta t}. \end{equation}
This definition shows that the hazard is a function of the density function $f(t)$ and the survival function $S(t)$, and it is a rate since it is in units $1/t$.
Cumulative hazard function
Another important measure related to the survivor function is the cumulative hazard function, $H(t)$, which measures the total amount of risk that has accumulated over the study period. We can therefore describe $H(t)$ as the integral from 0 to $t$ of the hazard rate $h(t)$:
\begin{equation}\label{eqname:Ht} H(t) = \int_{0}^{t} h(u)du. \end{equation}
Since we have defined $h(t)$, we can now write:
\begin{equation} H(t) = \int_0^t \frac{f(u)}{S(u)} = - \int_0^t \frac{1}{S(u)} \left[ \frac{d}{du} S(u) \right] du = -\ln \left[(S(t)\right], \end{equation}
where $\frac{f’(x)}{f(x)} = \ln \left[f(x) \right]$.
Relationships of functions
Given one of the functions that we have seen above, we can determine one from the other three in the following way:
\begin{equation}\label{eqname:Sfunc} S(t)= \exp[-H(t)] \end{equation} \begin{equation}\label{eqname:Ffunc} F(t) = 1 - \exp[H(t)] \end{equation} \begin{equation} f(t) = h(t) \exp[-H(t)] \end{equation} \begin{equation}\label{eqname:Hfunc} H(t) = -\ln[S(t)]. \end{equation}
Example
Consider the data adapted from Lee at
al. for a cohort of
HIV-negative participants, who were followed for 55 months. The first column of Table
shows the time interval of 5 months, the second (n
) shows the number of
HIV-negative participants at the beginning of the interval, the third hiv
shows the
number of participants that tested HIV-positive in that interval. Below, I show the
R
code for calculating the survivor function (st
), the density function (ft
),
and the hazard function (ht
).
# Taken from Table 2.1 of Elisa Lee.
n <- c(40, 35, 28, 22, 18, 13, 9, 5, 5, 3, 2)
hiv <- c(5, 7, 6, 4, 5, 4, 4, 0, 2, 1, 2)
ldat <- data.frame(n, hiv)
time <- cut(c(0:55), breaks = seq(0, 55, 5),
right = FALSE, include.lowest = TRUE)
# Calculate s_t
ldat$st <- ldat$n / ldat$n[1]
# Calculate f_t
ldat$ft <- ldat$hiv / (ldat$n[1] * 5)
# Calculate h_t
ldat$ht <- round(ldat$ft / ldat$st, 3)
n hiv st ft ht
0- 40 5 1.000 0.025 0.025
5- 35 7 0.875 0.035 0.040
10- 28 6 0.700 0.030 0.043
15- 22 4 0.550 0.020 0.036
20- 18 5 0.450 0.025 0.056
25- 13 4 0.325 0.020 0.062
30- 9 4 0.225 0.020 0.089
35- 5 0 0.125 0.000 0.000
40- 5 2 0.125 0.010 0.080
45- 3 1 0.075 0.005 0.067
50- 2 2 0.050 0.010 0.200
I note that the hazard rate estimates ($\hat{h}[t]$) in this Table are an approximation since the time intervals are fixed at 5 months. Therefore, we do not involve a theoretical limit as the time measure approaches zero. I describe in greater detail the person-time incidence rate, which measures the rate of new failures for fixed intervals $(t, t + \Delta t)$, in another post.