Kernel smoothed hazard rates (Survival Series 5)

Introduction
In an earlier post, I showed how to estimate the hazard rate for 5 HIV-infection times using a Gaussian kernel. In this post, I use an Epanechnikov kernel to estimate the hazard rate for a real and a simulated dataset. R code is provided to illustrate the methods.
Leukemia transplant data
Consider the data given in Table 4.3 of Klein and Moeschberger, which gives the survival times of patients with acute leukemia who underwent a bone marrow transplant. The aim is to estimate the smoothed hazard rate. First, I enter the data, which consist of the event times (ti), the number of patients at risk at each time (ni), and the number that experienced the failure event (di).
# event times
ti <- c(1, 55, 74, 86, 104, 107, 109, 110, 122, 129, 172, 192, 194,
230, 276, 332, 383, 418, 466, 487, 526, 609, 662)
# at risk
ni <- c( 38, 37, 36, 35, 34, 33, 32, 31, 30, 28, 27, 26, 25, 23,
22, 21, 20, 19, 18, 17, 16, 14, 13)
# events
di <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1 )
As demonstrated in a previous post in this series, we can calculate the cumulative hazard function (the Nelson-Aalen estimator) and, by taking its successive differences, an approximate hazard rate at each event time.
# the Nelson-Aalen cumulative hazard rate
Ht <- cumsum(di / ni)
Ht
[1] 0.02631579 0.05334282 0.08112059 0.10969202 0.13910379 0.16940682
[7] 0.20065682 0.23291488 0.29958155 0.33529583 0.37233287 0.41079441
[13] 0.45079441 0.49427267 0.53972722 0.58734626 0.63734626 0.68997784
[19] 0.74553340 0.80435693 0.86685693 0.93828550 1.01520858
# approx hazard rate
ht <- diff(c(0, Ht))
ht
[1] 0.02631579 0.02702703 0.02777778 0.02857143 0.02941176 0.03030303
[7] 0.03125000 0.03225806 0.06666667 0.03571429 0.03703704 0.03846154
[13] 0.04000000 0.04347826 0.04545455 0.04761905 0.05000000 0.05263158
[19] 0.05555556 0.05882353 0.06250000 0.07142857 0.07692308
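Since Ht is the cumulative sum of di / ni, differencing it simply recovers the per-time increments. As a quick check (my addition, not in the original post):
# the Nelson-Aalen increments are just di / ni
all.equal(ht, di / ni)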
Let the kernel-smoothed estimate of the hazard rate be

$$\hat{h}(t) = \frac{1}{b} \sum_{i=1}^{D} K\!\left(\frac{t - t_i}{b}\right) \Delta \hat{H}(t_i),$$

where $b$ is the bandwidth, $\Delta \hat{H}(t_i)$ is the jump in the Nelson-Aalen estimator at $t_i$ (the approximate hazard rate calculated above), and $K(\cdot)$ is the Epanechnikov kernel

$$K(x) = \frac{3}{4}(1 - x^2), \qquad -1 \le x \le 1,$$

with $K(x) = 0$ outside this interval.
# Epanechnikov kernel
epikern <- function(x) 0.75 * (1 - x^2)
epi_sym <- function(x, f = epikern)
  return(ifelse(x < -1 | x > 1, 0, f(x)))
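As a quick check (my addition), the kernel should integrate to one over [-1, 1], and epi_sym is vectorised, so it can be passed straight to integrate():
# the Epanechnikov kernel is a proper density on [-1, 1]
integrate(epi_sym, lower = -1, upper = 1)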
According to the theory, when $t < b$ the symmetric kernel must be replaced by an asymmetric kernel. Setting $q = t/b$, the modified Epanechnikov kernel is

$$K_q(x) = \frac{3}{4}\left(\alpha_E(q) + \beta_E(q)\,x\right)\left(1 - x^2\right), \qquad -1 \le x \le q,$$

where

$$\alpha_E(q) = \frac{64\left(2 - 4q + 6q^2 - 3q^3\right)}{(1+q)^4\left(19 - 18q + 3q^2\right)}$$

and

$$\beta_E(q) = \frac{240\,(1 - q)^2}{(1+q)^4\left(19 - 18q + 3q^2\right)}.$$

Similarly, for times in the right-hand tail, $t_D - b < t < t_D$ (where $t_D$ is the largest event time), we set $q = (t_D - t)/b$ and use $K_q(-x)$.
# asymmetric kernel
epi_asym <- function(x, params) {
  if (x >= -1 & x <= params$q)
    est <- 0.75 * (params$alpha + (params$beta * x)) * (1 - x^2)
  else
    est <- 0
  return(est)
}
# calculate alpha and beta
epi_parms <- function(q) {
  num_alpha <- 64 * (2 - (4 * q) + (6 * q^2) - (3 * q^3))
  den_alpha <- (1 + q)^4 * (19 - (18 * q) + (3 * q^2))
  alpha <- num_alpha / den_alpha
  beta <- (240 * (1 - q)^2) / den_alpha
  return(list(q = q, alpha = alpha, beta = beta))
}
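A useful sanity check (my addition, not in the original post): at q = 1 the correction factors reduce to alpha = 1 and beta = 0, so the asymmetric kernel collapses back to the symmetric one.
# at q = 1 the correction vanishes and the two kernels agree
epi_parms(1)
all.equal(epi_asym(0.5, epi_parms(1)), epi_sym(0.5))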
These pieces are combined in do_calc, which applies the symmetric kernel when b <= t <= td - b, and otherwise uses the asymmetric kernel to handle observations that fall in the left (t < b) or right (t > td - b) tail.
do_calc <- function(t, ti, b) {
  # scaled distance between the grid time t and the event time ti
  val <- (t - ti) / b
  # symmetric kernel in the interior; td (the largest event time)
  # is taken from the global environment
  if (t >= b & t <= (td - b))
    est <- epi_sym(val)
  # asymmetric kernel in the left tail
  else if (t < b) {
    p <- epi_parms(t / b)
    est <- epi_asym(val, p)
  # asymmetric kernel in the right tail
  } else if (t > b) {
    p <- epi_parms((td - t) / b)
    est <- epi_asym(-val, p)
  }
  return(est)
}
kern_wt <- Vectorize(do_calc, vectorize.args = "ti")
I then multiply the kernel weights by the approximate hazard rate at each event time ti.
calc_haz <- function(t, ti, ht, b) kern_wt(t, ti, b) * ht
Using a bandwidth of b = 100 and a maximum event time of td = 662, the smoothed hazard rate can be estimated at any time point, for example:
b <- 100
td <- 662
# for h(150)
sum(calc_haz(150, ti = ti, ht = ht, b = 100)) / b
[1] 0.002569708
# for h(50)
sum(calc_haz(50, ti = ti, ht = ht, b = 100)) / b
[1] 0.001514734
# for h(600)
sum(calc_haz(600, ti = ti, ht = ht, b = 100)) / b
[1] 0.001323748
# plot for all vals 0, 700
xx <- seq(0, 700, 10)
dat <- sapply(xx, calc_haz, ti = ti, ht = ht, b = 100)
est <- colSums(dat) / b
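The original post displays a figure of this estimate here; a minimal base-R sketch of such a plot (the axis labels are my own choice) is:
# plot the smoothed hazard estimate over the grid
plot(xx, est, type = "l", xlab = "time (days)", ylab = "hazard rate", bty = "l")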
The smoothed hazard rate for the leukemia patients is shown below. We see that the hazard increases from around day 50 and peaks at around day 150 before dropping off and remaining flat. After day 600, the hazard of death or relapse increases dramatically.

Simulated data
Consider the following method of simulating observations from a hazard function, motivated by this post. The aim is to simulate failure times from a known piecewise hazard function given by:

$$h(t) = \begin{cases} e^{-0.3t} & 0 \le t \le 1 \\ e^{-0.3} & 1 < t \le 2.5 \\ e^{0.3(t - 3.5)} & t > 2.5. \end{cases}$$

I use the following R code to construct this function over a fine time grid.
tdom <- seq(0, 5, by = 0.01)
haz <- rep(0, length(tdom))
haz[tdom <= 1] <- exp(-0.3 * tdom[tdom <= 1])
haz[tdom > 1 & tdom <= 2.5] <- exp(-0.3)
haz[tdom > 2.5] <- exp(0.3 * (tdom[tdom > 2.5] - 3.5))
plot(tdom, haz, type = "l", xlab = "time", ylab = "hazard rate",
bty = "l", ylim = c(0, 1.7))

From this hazard, we can calculate the cumulative hazard and the corresponding survivor function over the time grid (tdom).
# approximate H(t) by the cumulative sum of h(t) * dt
cumhaz <- cumsum(haz * 0.01)
# survivor function S(t) = exp(-H(t))
Surv <- exp(-cumhaz)
Given the survivor function, we can now generate the failure times using the inverse transform method.
# generate random samples
set.seed(1005000)
u <- runif(10000)
# for each u, count how many S(t) values exceed u; this gives the index of
# the first time at which S(t) <= u, i.e. the inverse-transform failure time
failtimes <- tdom[colSums(outer(Surv, u, `>`))]
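As a quick sanity check (my addition, not in the original post), the empirical survival fraction at a given time should be close to the theoretical survivor function:
# empirical vs theoretical survival at t = 2 (these should be close)
mean(failtimes > 2)
Surv[which.min(abs(tdom - 2))]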
Using the survival package, I estimate and plot the survival function from the generated data.
library(survival)
event <- rep(1, length(failtimes))
# correction for oversampling at the last time point: censor 85% of the
# observations that piled up at t = 5
event[sample(which(failtimes == 5),
             size = 0.85 * sum(failtimes == 5))] <- 0
st <- survfit(Surv(failtimes, event) ~ 1)
# approx the hazard rate
ht <- diff(c(0, st$cumhaz))
Using the Epanechnikov kernel described in the first section, I estimate the smoothed hazard rates with a bandwidth of b = 1.5.
b <- 1.5
td <- 5.00
xx <- seq(0, td, 0.01)
dat <- sapply(xx, calc_haz, ti = st$time, ht = ht, b = b)
est <- colSums(dat) / b
As a comparison, I also plot the smoothed hazard rates using a Gaussian kernel with a bandwidth of b = 0.4.
# standard normal kernel
normfun <- function(x, sigma = 1)
  (2 * pi * sigma^2)^(-0.5) * exp(-x^2 / (2 * sigma^2))
calc_norm <- function(x, Xi, b)
  return(normfun((x - Xi) / b))
kern_wt_norm <- Vectorize(calc_norm, vectorize.args = "Xi")
calc_haz_norm <- function(x, Xi, ht, b) kern_wt_norm(x, Xi, b) * ht
# calc the estimates
b <- 0.4
dat_norm <- sapply(xx, calc_haz_norm, Xi = st$time, b = b, ht = ht)
est_norm <- colSums(dat_norm) / b
With a bandwidth of b = 0.4, the Gaussian kernel estimate is plotted below alongside the Epanechnikov estimate and the true hazard for comparison.
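The original post presents the comparison figure at this point; a minimal sketch of how such an overlay could be drawn (the colours and legend are my own choices) is:
# overlay the true hazard and the two kernel-smoothed estimates
plot(tdom, haz, type = "l", xlab = "time", ylab = "hazard rate",
     bty = "l", ylim = c(0, 1.7))
lines(xx, est, col = "red")        # Epanechnikov estimate
lines(xx, est_norm, col = "blue")  # Gaussian estimate
legend("topleft", legend = c("true hazard", "Epanechnikov", "Gaussian"),
       col = c("black", "red", "blue"), lty = 1, bty = "n")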
