Survivor and hazard functions (Survival Series 1)

Introduction
Survival analysis is a set of statistical procedures for studying the time to an event, such as a marriage, infection, or death. In this post I describe two key concepts in survival analysis, the survivor and hazard functions, and show how they can be derived from the familiar probability density and cumulative distribution functions.
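Let $T$ be a discrete random variable with probability density function $f(t) = P(T = t)$ and cumulative distribution function (cdf) $F(t) = P(T \le t)$.
Consider the procedure of flipping a coin $n = 10$ times and counting the number of tails, $t$.
For a fair coin, where the probability of tails is $p = 0.5$, the probability of observing exactly $t$ tails is given by the binomial density $f(t) = \binom{n}{t} p^t (1 - p)^{n - t}$, which we can compute in R: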
set.seed(100300)
n <- 10; p <- 0.5; t <- c(0:n)
dens_t <- choose(n, t) * (p^t) * (1-p)^(n-t)
We can now calculate the probability of observing, for example, exactly 4 tails in 10 flips. (It is $f(4) = 0.205$.)
| Tails, t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Probability, f(t) | 0.001 | 0.01 | 0.044 | 0.117 | 0.205 | 0.246 | 0.205 | 0.117 | 0.044 | 0.01 | 0.001 |
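As a cross-check on the hand-coded density, base R's `dbinom` function evaluates the same binomial density. The sketch below is self-contained; the names `n_flips`, `p_tail`, `tails`, `dens_manual`, and `dens_builtin` are introduced here for illustration and are not part of the original code.

```r
# Cross-check the hand-coded binomial density against base R's dbinom().
# All variable names here are illustrative, not from the original post.
n_flips <- 10
p_tail <- 0.5
tails <- 0:n_flips

# Density written out from the binomial formula
dens_manual <- choose(n_flips, tails) * p_tail^tails * (1 - p_tail)^(n_flips - tails)

# Same density from the built-in function
dens_builtin <- dbinom(tails, size = n_flips, prob = p_tail)

# The two versions agree up to floating-point error
print(isTRUE(all.equal(dens_manual, dens_builtin)))

# Probability of exactly 4 tails in 10 flips, rounded as in the table
print(round(dens_builtin[tails == 4], 3))
```

Using `all.equal` rather than `==` avoids spurious mismatches caused by floating-point rounding.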
The cdf represents the probability that the random variable $T$ takes a value less than or equal to $t$, that is, $F(t) = P(T \le t) = \sum_{k \le t} f(k)$.
We can easily calculate the cdf using the `cumsum` function.
cdf_t <- cumsum(dens_t)
| Tails, t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cumulative probability, F(t) | 0.001 | 0.011 | 0.055 | 0.172 | 0.377 | 0.623 | 0.828 | 0.945 | 0.989 | 0.999 | 1 |
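The cumulative sum can also be checked against base R's `pbinom`, which returns $P(T \le t)$ directly. This is a minimal sketch; `cdf_manual` and `cdf_builtin` are illustrative names, and the setup is repeated so the block runs on its own.

```r
# Compare the cumulative sum of the density with base R's pbinom().
# Variable names are illustrative, not from the original post.
n_flips <- 10
p_tail <- 0.5
tails <- 0:n_flips

cdf_manual <- cumsum(dbinom(tails, size = n_flips, prob = p_tail))
cdf_builtin <- pbinom(tails, size = n_flips, prob = p_tail)

# Both routes give the same cdf up to floating-point error
print(isTRUE(all.equal(cdf_manual, cdf_builtin)))

# Probability of at most 3 tails in 10 flips, rounded as in the table
print(round(cdf_builtin[tails == 3], 3))
```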
# Plot the cdf against the number of tails (0 to 10) rather than the vector index
plot(t, cdf_t, xlab = "Tails", ylab = "F(t)", type = "s", bty = "l",
main = "Cumulative distribution function", cex.main = 0.9)

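What is the probability of getting either 0 or 1 or 2 or 3 tails in 10 flips? It is $F(3) = 0.172$. We can also recover the density from the cdf by differencing it with the `diff` function (the reverse of the `cumsum` process we used to obtain the cdf).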
dens_tx <- diff(c(0, cdf_t))
# Compare with a floating-point tolerance instead of exact equality
isTRUE(all.equal(dens_tx, dens_t))
[1] TRUE
Survival data
Thus far, I have used a coin flipping example to demonstrate the relationship between the density function and the cdf. In epidemiological studies, we are typically interested in the times to an event, which do not follow a binomial distribution. To measure the event times, we enroll participants who are at risk of experiencing the event into a study and collect their data over time. Data collection stops when the event (the failure) is recorded or when the participant is lost to follow-up. We say that the participant is under observation from the time of enrollment until the last follow-up time. The event occurs only once, after which the participant is no longer observed.
Survivor function
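In survival analysis, we refer to the survivor function as $S(t)$.
The survivor function gives the probability of surviving beyond time $t$,

$$S(t) = P(T > t) = 1 - F(t),$$

with the properties:

$$S(0) = 1, \qquad \lim_{t \to \infty} S(t) = 0.$$

In words, every participant is event-free at the start of follow-up, and the probability of remaining event-free falls towards zero as follow-up time increases.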
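If the event times are known exactly, the survivor function can be estimated from the data using:

$$\hat{S}(t) = \frac{n_t}{n},$$

where $n_t$ is the number of participants who survive longer than $t$ and $n$ is the total number of participants.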
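For this reason, the survivor function is also called the cumulative survival rate. We can get the cdf from the survivor function as $F(t) = 1 - S(t)$,
so that the density function is $f(t) = \frac{d}{dt}F(t) = -\frac{d}{dt}S(t)$.
In words, the density is the rate at which the survivor function decreases at time $t$.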
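To make the estimator concrete, the sketch below computes the proportion of participants surviving beyond each time point from a small set of exactly observed event times. The event times are made up for illustration (they are not from the post or from Lee's data), and `surv_hat` is a hypothetical helper name. The block also checks the identity $S(t) = 1 - F(t)$ against base R's empirical cdf, `ecdf`.

```r
# Hypothetical, exactly observed event times in months (illustrative only)
event_times <- c(2, 3, 3, 5, 8, 9, 12, 12, 15, 20)

# Estimated survivor function: proportion of participants with T > t
surv_hat <- function(t, times) {
  sapply(t, function(u) mean(times > u))
}

t_grid <- c(0, 3, 8, 12, 20)
s_t <- surv_hat(t_grid, event_times)
print(data.frame(t = t_grid, s_t = s_t))

# S(t) = 1 - F(t), with F estimated by the empirical cdf
F_hat <- ecdf(event_times)
print(isTRUE(all.equal(s_t, 1 - F_hat(t_grid))))
```

This simple proportion only applies when every event time is observed; it does not handle participants lost to follow-up.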
Hazard function
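The hazard function is also known as the conditional failure rate. It measures how rapidly new failures occur during the observation period. Specifically, the hazard function gives the instantaneous potential per unit time for the failure to occur, given that the person has survived up until time $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}.$$

In words, the hazard is the probability that a person who is event-free at time $t$ fails in the short interval from $t$ to $t + \Delta t$, expressed per unit of time.
This definition shows that the hazard is a function of the density function $f(t)$ and the survivor function $S(t)$:

$$h(t) = \frac{f(t)}{S(t)}.$$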
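A quick way to see the ratio of the density to the survivor function in action is a distribution whose hazard is known in closed form. The sketch below assumes exponentially distributed event times, chosen purely for illustration since the post does not use this distribution. The exponential hazard is constant and equal to the rate parameter, so the ratio should return the rate at every time point.

```r
# Hazard as density divided by survivor function, h(t) = f(t) / S(t).
# Assumption: exponential event times with an arbitrary illustrative rate.
rate <- 0.2
t_grid <- seq(0.5, 5, by = 0.5)

f_t <- dexp(t_grid, rate = rate)                      # density
S_t <- pexp(t_grid, rate = rate, lower.tail = FALSE)  # survivor function, P(T > t)
h_t <- f_t / S_t                                      # hazard

# For the exponential distribution the hazard equals the rate at every t
print(isTRUE(all.equal(h_t, rep(rate, length(t_grid)))))
```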
Cumulative hazard function
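Another important measure related to the survivor function is the cumulative
hazard function, $H(t) = \int_0^t h(u)\,du$.
Since we have defined $h(t) = f(t)/S(t)$, and $f(t) = -\frac{d}{dt}S(t)$, the hazard can be written as $h(t) = -\frac{d}{dt}\log S(t)$. Integrating from $0$ to $t$ and using $S(0) = 1$ gives $H(t) = -\log S(t)$,
where $\log$ denotes the natural logarithm.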
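The link between the cumulative hazard and the survivor function, $H(t) = -\log S(t)$, can be checked numerically. The sketch below assumes Weibull event times with illustrative parameters (shape 2, scale 10), which are not from the post, and integrates the hazard with base R's `integrate`. The names `hazard`, `H_numeric`, and `H_closed` are hypothetical.

```r
# Check H(t) = -log S(t) by integrating the hazard numerically.
# Assumption: Weibull event times with illustrative shape and scale.
shape <- 2
scale <- 10
t_end <- 8

# Hazard function h(u) = f(u) / S(u)
hazard <- function(u) {
  dweibull(u, shape = shape, scale = scale) /
    pweibull(u, shape = shape, scale = scale, lower.tail = FALSE)
}

# Cumulative hazard by numerical integration of h from 0 to t_end
H_numeric <- integrate(hazard, lower = 0, upper = t_end)$value

# Cumulative hazard from the survivor function
H_closed <- -log(pweibull(t_end, shape = shape, scale = scale, lower.tail = FALSE))

print(isTRUE(all.equal(H_numeric, H_closed, tolerance = 1e-6)))
```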
Relationships of functions
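Given any one of the four functions that we have seen above, we can determine the other three in the following way:

$$f(t) = -\frac{d}{dt}S(t), \qquad h(t) = \frac{f(t)}{S(t)}, \qquad H(t) = -\log S(t), \qquad S(t) = \exp\{-H(t)\}.$$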
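As an illustration of moving between the functions, the sketch below starts from a hazard only and rebuilds the cumulative hazard, the survivor function, and the density. The linear hazard $h(t) = 0.02t$ is an assumption chosen for illustration; it corresponds to a Weibull distribution with shape 2 and scale 10, so the results can be compared with base R's `pweibull` and `dweibull`.

```r
# From the hazard to the other three functions.
# Assumption: an illustrative linear hazard h(t) = 0.02 * t.
t_grid <- seq(0, 20, by = 2)

h_t <- 0.02 * t_grid    # hazard (the starting point)
H_t <- 0.01 * t_grid^2  # cumulative hazard: integral of 0.02 * u from 0 to t
S_t <- exp(-H_t)        # survivor function: S(t) = exp(-H(t))
f_t <- h_t * S_t        # density: f(t) = h(t) * S(t)

# This hazard is that of a Weibull(shape = 2, scale = 10) distribution
print(isTRUE(all.equal(S_t, pweibull(t_grid, shape = 2, scale = 10, lower.tail = FALSE))))
print(isTRUE(all.equal(f_t, dweibull(t_grid, shape = 2, scale = 10))))
```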
Example
Consider the data adapted from Lee et al. for a cohort of HIV-negative participants, who were followed for 55 months. The first column of the table below shows the 5-month time interval, the second (`n`) shows the number of HIV-negative participants at the beginning of the interval, and the third (`hiv`) shows the number of participants who tested HIV-positive in that interval. Below, I show the R code for calculating the survivor function (`st`), the density function (`ft`), and the hazard function (`ht`).
# Taken from Table 2.1 of Elisa Lee.
n <- c(40, 35, 28, 22, 18, 13, 9, 5, 5, 3, 2)
hiv <- c(5, 7, 6, 4, 5, 4, 4, 0, 2, 1, 2)
ldat <- data.frame(n, hiv)
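# Label each row with the start of its 5-month interval (0-, 5-, ..., 50-)
rownames(ldat) <- paste0(seq(0, 50, by = 5), "-")
# Calculate s_t: proportion still HIV-negative at the start of each interval
ldat$st <- ldat$n / ldat$n[1]
# Calculate f_t: proportion testing HIV-positive in the interval, per month
ldat$ft <- ldat$hiv / (ldat$n[1] * 5)
# Calculate h_t: conditional failure rate per month, f_t / s_t
ldat$ht <- round(ldat$ft / ldat$st, 3)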
     n hiv    st    ft    ht
0-  40   5 1.000 0.025 0.025
5-  35   7 0.875 0.035 0.040
10- 28   6 0.700 0.030 0.043
15- 22   4 0.550 0.020 0.036
20- 18   5 0.450 0.025 0.056
25- 13   4 0.325 0.020 0.062
30-  9   4 0.225 0.020 0.089
35-  5   0 0.125 0.000 0.000
40-  5   2 0.125 0.010 0.080
45-  3   1 0.075 0.005 0.067
50-  2   2 0.050 0.010 0.200
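The same table can be used to check the relationships between the functions. The sketch below is self-contained and uses illustrative names (`n_risk`, `width`, `ft_from_st`, `Ht`) that are not in the original code. It recovers the density as the drop in the survivor function per month, a discrete analogue of $f(t) = -\frac{d}{dt}S(t)$, and computes the cumulative hazard at the start of each interval as $H(t) = -\log S(t)$.

```r
# Data from the table above; n_risk is the number at risk at the start of each interval
n_risk <- c(40, 35, 28, 22, 18, 13, 9, 5, 5, 3, 2)
hiv <- c(5, 7, 6, 4, 5, 4, 4, 0, 2, 1, 2)
width <- 5  # interval width in months

st <- n_risk / n_risk[1]          # survivor function at the start of each interval
ft <- hiv / (n_risk[1] * width)   # density per month

# Density as the decrease in the survivor function per month.
# The survivor function is 0 after the last interval because both
# remaining participants tested HIV-positive in it.
ft_from_st <- -diff(c(st, 0)) / width
print(isTRUE(all.equal(ft_from_st, ft)))

# Cumulative hazard at the start of each interval, H(t) = -log S(t)
Ht <- -log(st)
print(round(Ht, 3))
```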
I note that the hazard rate estimates (