Simulating event times (Simulation series 2)

Last updated on Nov 11, 2022 7 min read

Introduction

I extend the work presented in the first post of this series to demonstrate how to simulate infections times from a Cox Proportional Hazards (CPH) model conditional on one dichotomous time-dependent covariate. I show the mathematical derivations and demonstrate a simulation example for the dichotomous covariate case using R code.

Simulating conditional infection times

Let $T$ be a non-negative random variable denoting the time of HIV infection. Consider adding a single time-dependent covariate to a CPH model, denoted by $z t$ , with the remaining time-independent covariates, denoted by $x$ . We now have the model:

$h (t | x (t)) = h_{0} (t), \exp (β_{t} z (t) + β^{'} x)$

with cumulative hazard function (CHF):

$H (t, x, z (t)) = \int_{0}^{t} \exp (B_{t} z (t) + β^{'} x) h (u) d u .$

Choose a dichotomous variable that can change its value at most once, for example: from treated to untreated, single to married, high school to college degree. Let $t_{0}$ denote the time at which the time-varying covariate changes from one state ( $Z = 0$ ) to the next state ( $Z = 1$ ). Then, $z (t) = 0$ for $t < t_{0}$ and $z (t) = 1$ for $t \geq t_{0}$ . This means that the CHF will be a piece-wise function having a different form over the two domains defined by $D_{1} = (0, t_{0})$ and $D_{2} = [t_{0}, \infty)$ . Assume that the infection times are exponentially distributed. If $t < t_{0}$ , then based on the last equation the CHF is:

$\begin{aligned} H (t, x, z (t)) & = \int_{0}^{t} \exp (β^{'} x) h (u) d u \\ = λ \exp (β^{'} x) \int_{0}^{t} d u \\ = λ \exp (β^{'} x) t, \end{aligned}$

where $β_{t} z (u) = 0$ and $λ = h (u)$ because the hazard (rate) of infection is constant at each time $t$ . If $t \geq t_{0}$ , the CHF is:

$\begin{aligned} H (t, x, z (t)) & = λ \exp (β^{'} x) \int_{0}^{t} \exp (β_{t} z (t)) d u \\ = λ \exp (β^{'} x) [\int_{0}^{t_{0}} d u + \int_{t_{0}}^{t} \exp (β_{t} z (t)) d u] \\ = λ \exp (β^{'} x) [t_{0} + \exp (β_{t}) (t - t_{0})] . \end{aligned}$

We therefore have:

$H (t, x, z (t)) = {\begin{cases} λ \exp (β^{'} x) t & if t < t_{0} \\ λ \exp (β^{'} x) [t_{0} + \exp (β_{t}) t - \exp (β_{t}) t_{0}] & if t \geq t_{0} . \end{cases}$

The two ranges of the piece-wise function correspond to $R_{1} = (0, λ \exp (β^{'} x) t_{0})$ and $R_{2} = [λ \exp (β^{'} x) t_{0}, \infty)$ . As in Post 1 of this series, we now have to find the inverse of $H (t, x, z (t))$ . Generally, we have that $F (t) = 1 - \exp (- H (t, x, z (t)))$ , and by the inverse transform theorem (ITT), $F (t) \sim Unif (0, 1)$ . Then $T = H^{- 1} (- \log (U))$ , so that to simulate infection times we have to find the inverse CHF. To do this, I deal with each part of the piece-wise function separately.

When $H (t, x, z (x)) < λ \exp (β^{'} x) t_{0}$ , then:

$\begin{aligned} H (t, x, z (t)) & = λ \exp (β^{'}) t \\ t & = \frac{H (t, x, z (t))}{λ \exp (β ‘) t} \\ H^{- 1} (t, x, z (t)) & = \frac{t}{λ \exp (β^{'} x)} . \end{aligned}$

For $H (t, x, z (t)) \geq λ \exp (β^{'} x) t_{0}$ , Austin shows that the inverse CHF is:

$H^{- 1} (t, x, z (t)) = \frac{t - λ \exp (β^{'} x) t_{0} + λ \exp (β^{'} x + β_{t}) t_{0}}{λ \exp (β^{'} x + β_{t})}$

if $t \geq λ \exp (β^{'} x) t_{0}$ .

Recalling that $T = H^{- 1} (- \log (u))$ by ITT, we again deal separately with the two piecewise functions. So if $- \log (u) < \exp (β^{'} x) t_{0}$ then:

$T = \frac{- l o g (u)}{λ \exp (β^{'} x)}$

and if $- \log (u) \geq λ \exp (β^{'} x) t_{0}$ then:

$T = \frac{- l o g (u) - λ \exp (β^{'} x) t_{0} + λ \exp (β^{'} x + β_{t}) t_{0}}{λ \exp (β^{'} x + β_{t})} .$

HIV cohort: testing and censoring

The next step of this simulation exercise is to generate the infection times conditional on one time-independent covariate $x$ , one time-dependent covariate $z (t)$ , and the right censoring values. Let $i$ represent the $i$ th participant for $i = 1, \dots, N$ . Because the data are time-varying, the $i$ th participant has $U_{j}$ follow-up times ( $j = 1, \dots, 10$ ), where $U_{1} <, \dots, < U_{10}$ .

Let $x$ be a $n \times d$ matrix of covariate values, where $n$ represents the total number of observation ( $n = \sum_{i}^{N} U_{i j}$ ) and $d = 2$ . Let the time-independent variable $x_{1} \sim Unif(0, 1)$ , which represents the mean-centered age at infection. Let the dichotomous time-dependent variable $x_{2} \sim Unif (1, 10) \times Bern (0.3)$ indicate the marital status of the participant, which can occur at any discrete time-point in the observation period with probability 0.3. I set the true/known parameters $β_{1} = 0.5$ and $β_{2} = - 0.5$ . For this simulation exercise, I set $λ = 0.2$ .

Let $δ$ denote the censoring indicator such that $δ = 1$ if the infection is observed during the observation period, otherwise $δ = 0$ . $δ$ is therefore a random variable and must be simulated. Here, I generate $δ$ by first simulating censoring times from an exponential distribution, with $λ_{c} = λ - 0.01$ , such that $T_{c} \sim Exp (λ_{c})$ . Then, $δ = min (T, T_{c})$ .

R code

The main function gen_dich generates a single dataset for the $i$ th participant with his/her infection times, follow-up times, one time-independent covariate (age at HIV infection) and one time-dependent covariate (marital status). A censoring indicator is also constructed.

library(survival)

# Func to gen the data
gen_dich <- function(i) {
  lambda <- 0.2
  # Number of FU times
  ntimes <- 10
  # Generate covariate vals
  x1 <- round(runif(1, 0, 1), 2)
  time <- 1:ntimes
  t0 <- sample(seq(ntimes), 1) * rbinom(1, 1, 0.3)
  x2 <- as.numeric(time > t0)
  u <- -log(runif(1))
  b1x1 <- as.vector(x1 %*% 0.5)
  # Generate infection time
  t1 <- lambda * exp(b1x1) * t0
  t2 <- lambda * exp(b1x1 + -0.5)
  if (u < t1) {
    itime <- u / (lambda * exp(b1x1))
  } else if (u >= t1) {
    itime <- (u - t1 + (t2 * t0)) / t2
  }
  # Make a censoring event
  ctime <- rexp(1, rate = lambda - 0.01)
  if (ctime < itime) {
    hiv <- 0
  } else {
    hiv <- as.numeric(itime < time)
  }
  # Get the censoring time
  mintime <- pmin(itime, ctime)
  start <- time - 1
  dat <- data.frame(id = i, start, time, ctime, itime, hiv, x1, x2)
  dat <- dat[time <= ceiling(mintime), ]
  return(dat)
}

An example of the dataset for the first participant is given below, which shows that x2, the marital status variable, changes over time for the fourth participant.

set.seed(187900)
lapply(seq(10), gen_dich)[4]
[[1]]
   id start time    ctime    itime hiv   x1 x2
1   4     0    1 11.29535 11.38225   0 0.14  0
2   4     1    2 11.29535 11.38225   0 0.14  0
3   4     2    3 11.29535 11.38225   0 0.14  0
4   4     3    4 11.29535 11.38225   0 0.14  0
5   4     4    5 11.29535 11.38225   0 0.14  0
6   4     5    6 11.29535 11.38225   0 0.14  0
7   4     6    7 11.29535 11.38225   0 0.14  0
8   4     7    8 11.29535 11.38225   0 0.14  1
9   4     8    9 11.29535 11.38225   0 0.14  1
10  4     9   10 11.29535 11.38225   0 0.14  1

Next, I write a function to pass the data to the coxph function. This function estimates $\hat{β_{1}}, {\hat{β}}_{2}$ .

# Function to run the simulations
do_sim <- function(n, f) {
  dat <- do.call(rbind, lapply(seq(n), f))
  bhat <- coxph(Surv(start, time, hiv) ~ x1 + x2 + cluster(id), data = dat)  
  return(bhat$coefficients)
}

Let $k = 1, \dots, K$ represent the total number of datasets to be generated. I run the simulation $K$ times to generate $K$ estimates and take the average of the $\hat{β}$ estimates. I then take the average of the $β_{1}^{k}$ and $β_{2}^{k}$ estimates and plot the cumulative sum of the $k$ estimates divided by the corresponding number of estimates. For this simulation exercise, I selected $K = 400$ .

set.seed(100400)
K <- 400
# Take the mean of the estimates
bhats <- sapply(seq(K), function(x) do_sim(400, gen_dich))
beta_mn <- apply(bhats, 1, mean)
beta_sd <- apply(bhats, 1, sd)
beta_mn; beta_sd

The two figures show the results of the simulation exercise, where the top and bottom panels are the estimates for the age and marital status coefficients, respectively. Both panels show that there is some variability in the estimates for $k < 150$ , thereafter the estimates converge approximately to the true parameters.