Kernel smoothed hazard rates (Survival Series 4)

Introduction

In an earlier post, I introduced a kernel smoothing method for univariate data. The aim of this post is to describe how to use survival and kernel smoothing methods to estimate the hazard rate function. I do so using an example of 5 exact HIV-infection times.

Kernel smoothing

As a brief review, let X1,,Xn be a set of continuous random variables having a common probability density function f. The kernel density estimator is:

f^(x;b)=(nb)1i=1nK([xXi]/b)

where K is called the kernel and satisfies K(x),dx=1 and b is the bandwidth (for b>0). This equation can be written more compactly by rescaling Kb(u)=b1K(u/b) such that:

f^(x;b)=n1i=1nKb(xXi).

One common choice for Kb is the Gaussian density, N(0,σ2),

ϕ(x)=12πσ2exp(x22σ2),

which is N(0,1) for the standard normal density. Consider obtaining the kernel smoothed estimates with (bandwidth b=0.6) for 5 HIV-infection times:

X=1.5,2.0,2.2,3.8,4.5.

The kernel hazard functions for Xi (the dashed color lines) and the overall density estimate (the solid line) are shown in the Figure below.

Smoothing the hazard rates using survival weights

Previously, Wand and Jones proposed a smoothing approach to the hazard rate with survival weights. This is:

h^(x;b)=n1i=1nKb(xXi)1F~(Xi),

where

F~(x)=i=1n1Xixn1

is a slightly modifed empirical distribution function that avoids division by zero. The idea is that a survival weight at Xi (the denominator) is applied to each participant’s kernel density (the numerator) so that the weight increases through the ordered Xi’s as the probability of survival decreases. The kernel weights at each X are therefore inversely proportional to the height of 1F~(Xi).

In the R code below I show the step by step calculations. First, I define a function (edf) to calculate the empirical distribution.

# func to calculate edf
edf <- function(x, Xi, adjust = TRUE) {
  n <- length(Xi)
  if (adjust) n <- n + 1
  sXi <- sort(Xi)
  sum(sXi <= x) / n
}

# calculate surv func s(t)
surv <- 1 - sapply(c(0, Xi), edf, Xi = Xi)
surv
[1] 1.0000000 0.8333333 0.6666667 0.5000000 0.3333333 0.1666667

Then, for each Xi, I multiply the kernel estimate of all x by the inverse of the survival weight at Xi, which gives the hazard rate estimates, h^(x).

# plot
set1 <- RColorBrewer::brewer.pal(7, "Set1")
out <- matrix(NA, nrow = length(Xi), ncol = length(x))
plot(c(0, Xi), surv, type = "s", ylab = "Rate", xlab = "Time (x)", 
  bty = "n", ylim = c(0, 1), xlim = c(0, 6), lty = 2)
# collect density at xi
for (i in Xi) {
  pw <- 1 - edf(i, Xi) # get weight at Xi 
  haz <- kern(i, x, b = 0.6) / (pw * length(Xi))
  out[which(Xi == i), ] <- haz
  lines(x, haz, lty = 2, lwd =2, col = set1[which(Xi == i)])
}
points(Xi, rep(0, length(Xi)), pch = 18, cex = 1.8, 
   col = set1[seq(length(Xi))])
# get overall hazard rate, sum because already divide by length(Xi)
hazard <- apply(out, 2, sum)
lines(x, hazard)

The figure shows the hazard rate estimates (solid line), which is a function of the survivor function (dashed step line). An important point to make is that the kernel densities of the ordered observations increase as the probability of survival decreases over time. This is because the survival estimate at each Xi is being applied to the kernel density estimate. For example, the adjusted survival estimate at S^(1.5)=0.83, and hence the weight is 1/0.83=1.2, which is used to multiply all the kernel density estimates at x for observation X1. At x=4.5, the survival estimate is S^(4.5)=0.167 and so the survival weight for X5 is 1/0.1676. This explains why the maximum of the kernel density estimate for X5 is higher than the maximum estimate of X1.

Smoothing the hazard rate using the jump sizes of the cumulative hazard function

An alternative approach is kernel smoothed estimates derived from the cumulative hazard function (H^[t]). In Part 1 and Part 2 of this series, I introduced the cumulative hazard function and showed how to estimate it from the data:

H(t)=0th(u)du.

Theoretically, the above equation tells us that the hazard rate, h(t), is the derivative of H(t). However, since the cumulative hazard estimate is a step function, it is not differentiable. One solution is to estimate the hazard rate by obtaining the step or jump sizes of the Nelson-Aalen cumulative hazard rate. Thus, for each observed HIV infection time, the estimated hazard contribution is:

ΔH^(Xi)=H^(Xi)H^(Xi1).

We can then smooth using:

h^(x)=b1i=1nKb(xXib)ΔH^(Xi),

where n represents the number of infection times and b is the bandwidth.

Using the same 5 HIV-infection times as before, I first calculate the hazard rates by dividing the number of infections by the number at risk at time Xi. We can also extract the hazard rates from the survival package.

Xi <- c(1.5, 2.0, 2.2, 3.8, 4.5)
set_surv <- survival::Surv(Xi)
dat <- survival::survfit(set_surv ~ 1)
sdat <- summary(dat)
sdat
Call: survfit(formula = set_surv ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  1.5      5       1      0.8   0.179       0.5161            1
  2.0      4       1      0.6   0.219       0.2933            1
  2.2      3       1      0.4   0.219       0.1367            1
  3.8      2       1      0.2   0.179       0.0346            1
  4.5      1       1      0.0     NaN           NA           NA
ht <- sdat$n.event / sdat$n.risk
ht
[1] 0.2000000 0.2500000 0.3333333 0.5000000 1.0000000
# alternative
ht2 <- diff(c(0, sdat$cumhaz))
ht2
[1] 0.2000000 0.2500000 0.3333333 0.5000000 1.0000000

I now plot the data and show the individual density estimates for each Xi, which is multiplied by the hazard rate at Xi. The overall smoothed hazard rate function is shown in the solid line and the observed hazard rates and the non-smoothed hazard rate is shown in the dashed line.

x <- seq(0, 6, 0.1)
out <- matrix(NA, nrow = length(Xi), ncol = length(x))
# collect density at xi
plot(c(0, Xi, 6), c(0, ht, 1), type = "s", ylab = "Rate", xlab = "Time (x)", 
  bty = "n", ylim = c(0, 1.0), xlim = c(0, 6), lty = 2)
for (i in Xi) {
  haz <- kern(i, x, b = 0.5) * ht[which(Xi == i)] 
  out[which(Xi == i), ] <- haz
  lines(x, haz,
    lty = 2, lwd =2, col = set1[which(Xi == i)])
}
hazard <- apply(out, 2, sum) # note that kern already divided by b
lines(x, hazard)
points(Xi, rep(0, length(Xi)), pch = 18, cex = 1.8, 
   col = set1[seq(length(Xi))])
legend("topleft", c("Hazard rate", "Hazard rate smoothed"),
  lty = c(2, 1), col = "black", bty = "n")
Alain Vandormael
Alain Vandormael
Senior Data Scientist, PhD, MSc