Simulating event times (Simulation series 1)

Last updated on Nov 11, 2022 6 min read

Introduction

In this post, I describe how to simulate HIV infection times from an exponential distribution. I begin by simulating the unconditional infection times. Then, I show how to condition the infection times on covariate data using a Cox Proportional Hazards model. I specifically demonstrate this procedure with R code using two covariates (age, sex).

Simulating unconditional infection times

Let $T$ be a non-negative random variable denoting the time of HIV infection. We therefore have a probability density function (pdf): $f (t) = P r (T = t)$ and cumulative distribution function (cdf): $F (t) = P r (T \leq t)$ . Assume that the HIV infection times are exponentially distributed with rate $λ$ and pdf:

$f (t; λ) = λ \exp (- λ t)$

where $t \geq 0$ . To find the cdf of the exponential distribution, we integrate with respect to $x$ :

$\begin{aligned} F (t) & = \int_{0}^{1} λ \exp (- λ x) \\ = λ |_{0}^{x} - \exp (- λ x) \\ = 1 - \exp (- λ x) for x \geq 0. \end{aligned}$

To simulate the infection times, I use the Inverse Transform Theorem (ITT), which states that if $T$ is a continuous random variable having $F (T)$ , then $F (T) \sim Unif (0, 1)$ . The ITT proof is show at the end of this post. Define $U = Unif (0, 1)$ so that by ITT we have $U = F (T)$ . Then:

$\begin{aligned} U & = F (t) \\ U & = 1 - \exp (- λ T) \\ - λ T & = \ln (1 - U) \\ T & = \frac{- \ln (1 - U)}{λ}, \end{aligned}$

which gives the procedure for simulating the infection times.

Simulating conditional infection times

The choice of $T \sim Exp (λ)$ means that the participant’s infection time is independent of his or her individual characteristics or surrounding environment. This is often not the case in the real world. In the next steps, I simulate infection times that are conditional on covariate values.

It is common to model infection times using a Cox Proportional Hazards (PH) model:

$h (t | x) = h_{0} (t) \exp (β^{'} x)$

where $t$ is time, $x$ is a d-vector of covariates, $β$ is a vector of regression coefficients, and $h_{0} (t)$ is the baseline hazard function when $x = 0$ . In this way, the Cox PH model provides a method for simulating the infection times conditional on $x$ through the hazard rate. Following the theory presented in this post, we can write the cumulative hazard function for the Cox PH model as:

$\begin{aligned} H (t | x) & = \int_{0}^{t} h_{0} (u) \exp (β^{'} x) d u \\ = \exp (β^{'} x) \int_{0}^{t} h_{0} (u) d u \\ = H_{0} (t) \exp (β^{'} x) . \end{aligned}$

Then we have (see the referenced post):

$S (t | x) = \exp [- H_{0} (t) \exp (β^{'} x)]$

We can obtain $F (t | x)$ using

$F (t) = 1 - \exp [H (t)]$

and then invoke ITT to isolate $T$ :

$\begin{aligned} U & = 1 - \exp [- H_{0} (T) \exp (β^{'} x)] \\ \ln (1 - U) & = - H_{0} (T) \exp (β^{'} x) \\ \frac{- \ln (1 - U)}{\exp (β^{'} x)} & = H_{0} (T) \\ T & = H_{0}^{- 1} (\frac{- \ln (1 - U)}{\exp (β^{'} x)}) \end{aligned}$

We now have a generalized formula for generating infection times from a uniform distribution. The next step is to derive $H_{0}^{- 1} (t)$ , the inverse CHF, from the exponential distribution.

$\begin{aligned} H_{0} (t) & = - \ln [S (t)] \\ = - \ln [\exp (- λ t)] \\ = λ t \\ H_{0}^{- 1} [H_{0} (t)] & = H_{0}^{- 1} (λ t) \\ t & = λ H_{0}^{- 1} (t) \\ H_{0}^{- 1} (t) = \frac{t}{λ}, \end{aligned}$

for $t \geq 0$ . The infection times can be generated with (Reference):

$T = - \frac{\ln (1 - U)}{λ \exp (β^{'} x)} .$

HIV cohort: testing and censoring

The final step of this simulation exercise is to generate the $x$ covariate and right censoring values. Let $i$ represent the $i$ th participant for $i = 1, \dots, N$ . As per the definition of a cohort, each participant enters into the cohort with a negative status. Participants are then periodically tested for HIV until their first HIV-positive test or until the end of the observation period.

Let $δ$ denote the censoring indicator such that $δ = 1$ if the infection is observed during the observation period, otherwise $δ = 0$ . $δ$ is therefore a random variable and must be simulated. Here, I generate $δ$ by first simulating censoring times from an exponential distribution, with $λ_{c} = λ - 0.1$ , such that $T_{c} \sim exp (λ_{c})$ . Then, $δ = min (T, T_{c})$ .

Let $x$ be a $n \times d$ matrix of covariate values. For this exercise, $d = 2$ . Let $x_{1} \sim N(0, 4)$ , which represents the mean-centered age at infection with a standard deviation of 4 years. Let $x_{2} \sim Bern (0.60)$ indicate the sex of the participant, with women being over-represented in the cohort. I set the true/known parameters $β_{1} = 0.05$ and $β_{2} = 1.5$ . For this simulation exercise, I set $λ = 1$ .

R code

The R code below shows the procedure for generating the covariate values, the censoring indicator, and the infection times.

gen_data <- function(N, lambda, beta_1, beta_2) {
  # gen covariate data
  x1 <- rnorm(N, 0, 4)
  x2 <- rbinom(N, 1, 0.6)
  X <- as.matrix(cbind(x1, x2))
  betas <- matrix(c(beta_1, beta_2))
  # gen uniforms
  u <- runif(N)
  # gen the infections times
  itime <- -log(u)/(lambda * exp(X %*% betas))
  # Gen censoring times
  ctime <- rexp(N, rate = lambda - 0.1)
  # get the right censored time 
  time <- pmin(itime, ctime)
  # got HIV or not during obs period
  hiv <- as.numeric(itime <= ctime)
  dat <- data.frame(id = 1:N, time = time, hiv = hiv, x1 = x1, x2 = x2)
  return(dat)
}

Next, I pass the generated data to the coxph function. This function estimates $\hat{β_{1}}, {\hat{β}}_{2}$ .

library(survival)
do_sim <- function(N, lambda = 1, beta_1 = 0.05, beta_2 = 1.5) {
  # gen a dataset
  dat <- gen_data(N, lambda, beta_1, beta_2)
  # get the beta estimates
  bhat <- coxph(Surv(time, hiv) ~ x1 + x2, data = dat)  
  return(bhat$coefficients)
}

Let $k = 1, \dots, K$ represent the total number of datasets to be generated. In the code below, I pass the $k$ th dataset to the coxph function and extract the estimated coefficients in a $2 \times K$ matrix. I repeat steps 1 and 2 a total of $K$ times. I then took the average of the $β_{1}^{k}$ and $β_{2}^{k}$ estimates and plot the cumulative sum of the $k$ estimates divided by the corresponding number of estimates. For this simulation exercise, I select $N = 200$ and $K = 1000$ .

set.seed(100400)
K <- 1000

# take the mean of the estimates
bhats <- sapply(seq(K), function(x) do_sim(200))
beta_mn <- apply(bhats, 1, mean)
beta_sd <- apply(bhats, 1, sd)
beta_mn; beta_sd
        x1         x2 
0.05120952 1.51963760 
        x1         x2 
0.02239449 0.21762811 

# get the average over K runs
b1 <- cumsum(bhats[1, ])/(1:K)
b2 <- cumsum(bhats[2, ])/(1:K)

library(RColorBrewer)
Blues <- brewer.pal(4, "Blues")
plot(1:K,  b1, type = "l", bty = "l", lwd =2, ylim = c(0.02, 0.1),
  xlab = "Iterations (K)", ylab = "Estimate",
  main = "Age coefficient", col = Blues[3])
abline(h = 0.05, lty = 2)

plot(1:K,  b2, type = "l", bty = "l", lwd = 2, ylim = c(1.0, 2.0),
  xlab = "Iterations (K)", ylab = "Estimate",
  main = "Sex coefficient", col = Blues[3])
abline(h = 1.5, lty = 2)

The two figures show the results of the simulation exercise, where the left and right panels are the estimates for the Age and Sex coefficients, respectively. Both panels show that there is some variability in the estimates for $k < 150$ , thereafter the estimates converge approximately to the true parameters.

Proof of Inverse Transform Theorem:

Let $Y = F (X)$ , then the cdf of $Y$ is:

$\begin{aligned} P (Y \leq y) & = P (F (X) \leq y) \\ = P (X \leq F^{- 1} (y)) \\ = F (F^{- 1} (y)) = y, \end{aligned}$

which is the cdf of the $Unif (0, 1)$ .

R Statistics Data analytics