Simulating event times (Simulation series 2)

Introduction

I extend the work presented in the first post of this series to demonstrate how to simulate infections times from a Cox Proportional Hazards (CPH) model conditional on one dichotomous time-dependent covariate. I show the mathematical derivations and demonstrate a simulation example for the dichotomous covariate case using R code.

Simulating conditional infection times

Let T be a non-negative random variable denoting the time of HIV infection. Consider adding a single time-dependent covariate to a CPH model, denoted by zt, with the remaining time-independent covariates, denoted by x. We now have the model:

h(t|x(t))=h0(t),exp(βtz(t)+βx)

with cumulative hazard function (CHF):

H(t,x,z(t))=0texp(Btz(t)+βx)h(u)du.

Choose a dichotomous variable that can change its value at most once, for example: from treated to untreated, single to married, high school to college degree. Let t0 denote the time at which the time-varying covariate changes from one state (Z=0) to the next state (Z=1). Then, z(t)=0 for t<t0 and z(t)=1 for tt0. This means that the CHF will be a piece-wise function having a different form over the two domains defined by D1=(0,t0) and D2=[t0,). Assume that the infection times are exponentially distributed. If t<t0, then based on the last equation the CHF is:

H(t,x,z(t))=0texp(βx)h(u)du=λexp(βx)0tdu=λexp(βx)t,

where βtz(u)=0 and λ=h(u) because the hazard (rate) of infection is constant at each time t. If tt0, the CHF is:

H(t,x,z(t))=λexp(βx)0texp(βtz(t))du=λexp(βx)[0t0du+t0texp(βtz(t))du]=λexp(βx)[t0+exp(βt)(tt0)].

We therefore have:

H(t,x,z(t))={λexp(βx)tif t<t0λexp(βx)[t0+exp(βt)texp(βt)t0]if tt0.

The two ranges of the piece-wise function correspond to R1=(0,λexp(βx)t0) and R2=[λexp(βx)t0,). As in Post 1 of this series, we now have to find the inverse of H(t,x,z(t)). Generally, we have that F(t)=1exp(H(t,x,z(t))), and by the inverse transform theorem (ITT), F(t)Unif(0,1). Then T=H1(log(U)), so that to simulate infection times we have to find the inverse CHF. To do this, I deal with each part of the piece-wise function separately.

When H(t,x,z(x))<λexp(βx)t0, then:

H(t,x,z(t))=λexp(β)tt=H(t,x,z(t))λexp(β)tH1(t,x,z(t))=tλexp(βx).

For H(t,x,z(t))λexp(βx)t0, Austin shows that the inverse CHF is:

H1(t,x,z(t))=tλexp(βx)t0+λexp(βx+βt)t0λexp(βx+βt)

if tλexp(βx)t0.

Recalling that T=H1(log(u)) by ITT, we again deal separately with the two piecewise functions. So if log(u)<exp(βx)t0 then:

T=log(u)λexp(βx)

and if log(u)λexp(βx)t0 then:

T=log(u)λexp(βx)t0+λexp(βx+βt)t0λexp(βx+βt).

HIV cohort: testing and censoring

The next step of this simulation exercise is to generate the infection times conditional on one time-independent covariate x, one time-dependent covariate z(t), and the right censoring values. Let i represent the ith participant for i=1,,N. Because the data are time-varying, the ith participant has Uj follow-up times (j=1,,10), where U1<,,<U10.

Let x be a n×d matrix of covariate values, where n represents the total number of observation (n=iNUij) and d=2. Let the time-independent variable x1Unif(0, 1), which represents the mean-centered age at infection. Let the dichotomous time-dependent variable x2Unif(1,10)×Bern(0.3) indicate the marital status of the participant, which can occur at any discrete time-point in the observation period with probability 0.3. I set the true/known parameters β1=0.5 and β2=0.5. For this simulation exercise, I set λ=0.2.

Let δ denote the censoring indicator such that δ=1 if the infection is observed during the observation period, otherwise δ=0. δ is therefore a random variable and must be simulated. Here, I generate δ by first simulating censoring times from an exponential distribution, with λc=λ0.01, such that TcExp(λc). Then, δ=min(T,Tc).

R code

The main function gen_dich generates a single dataset for the ith participant with his/her infection times, follow-up times, one time-independent covariate (age at HIV infection) and one time-dependent covariate (marital status). A censoring indicator is also constructed.

library(survival)

# Func to gen the data
gen_dich <- function(i) {
  lambda <- 0.2
  # Number of FU times
  ntimes <- 10
  # Generate covariate vals
  x1 <- round(runif(1, 0, 1), 2)
  time <- 1:ntimes
  t0 <- sample(seq(ntimes), 1) * rbinom(1, 1, 0.3)
  x2 <- as.numeric(time > t0)
  u <- -log(runif(1))
  b1x1 <- as.vector(x1 %*% 0.5)
  # Generate infection time
  t1 <- lambda * exp(b1x1) * t0
  t2 <- lambda * exp(b1x1 + -0.5)
  if (u < t1) {
    itime <- u / (lambda * exp(b1x1))
  } else if (u >= t1) {
    itime <- (u - t1 + (t2 * t0)) / t2
  }
  # Make a censoring event
  ctime <- rexp(1, rate = lambda - 0.01)
  if (ctime < itime) {
    hiv <- 0
  } else {
    hiv <- as.numeric(itime < time)
  }
  # Get the censoring time
  mintime <- pmin(itime, ctime)
  start <- time - 1
  dat <- data.frame(id = i, start, time, ctime, itime, hiv, x1, x2)
  dat <- dat[time <= ceiling(mintime), ]
  return(dat)
}

An example of the dataset for the first participant is given below, which shows that x2, the marital status variable, changes over time for the fourth participant.

set.seed(187900)
lapply(seq(10), gen_dich)[4]
[[1]]
   id start time    ctime    itime hiv   x1 x2
1   4     0    1 11.29535 11.38225   0 0.14  0
2   4     1    2 11.29535 11.38225   0 0.14  0
3   4     2    3 11.29535 11.38225   0 0.14  0
4   4     3    4 11.29535 11.38225   0 0.14  0
5   4     4    5 11.29535 11.38225   0 0.14  0
6   4     5    6 11.29535 11.38225   0 0.14  0
7   4     6    7 11.29535 11.38225   0 0.14  0
8   4     7    8 11.29535 11.38225   0 0.14  1
9   4     8    9 11.29535 11.38225   0 0.14  1
10  4     9   10 11.29535 11.38225   0 0.14  1

Next, I write a function to pass the data to the coxph function. This function estimates β1^,β^2.

# Function to run the simulations
do_sim <- function(n, f) {
  dat <- do.call(rbind, lapply(seq(n), f))
  bhat <- coxph(Surv(start, time, hiv) ~ x1 + x2 + cluster(id), data = dat)  
  return(bhat$coefficients)
}

Let k=1,,K represent the total number of datasets to be generated. I run the simulation K times to generate K estimates and take the average of the β^ estimates. I then take the average of the β1k and β2k estimates and plot the cumulative sum of the k estimates divided by the corresponding number of estimates. For this simulation exercise, I selected K=400.

set.seed(100400)
K <- 400
# Take the mean of the estimates
bhats <- sapply(seq(K), function(x) do_sim(400, gen_dich))
beta_mn <- apply(bhats, 1, mean)
beta_sd <- apply(bhats, 1, sd)
beta_mn; beta_sd

The two figures show the results of the simulation exercise, where the top and bottom panels are the estimates for the age and marital status coefficients, respectively. Both panels show that there is some variability in the estimates for k<150, thereafter the estimates converge approximately to the true parameters.

Alain Vandormael
Alain Vandormael
Senior Data Scientist, PhD, MSc