Simulating event times (Simulation series 2)
Introduction
I extend the work presented in the first
post of this series to
demonstrate how to simulate infections times from a Cox Proportional Hazards (CPH)
model conditional on one dichotomous time-dependent covariate. I show the
mathematical derivations and demonstrate a simulation example for the dichotomous
covariate case using R
code.
Simulating conditional infection times
Let $T$ be a non-negative random variable denoting the time of HIV infection. Consider adding a single time-dependent covariate to a CPH model, denoted by $z t $, with the remaining time-independent covariates, denoted by $x$. We now have the model:
\begin{equation} h(t|x(t)) = h_0(t), \exp({\beta_t z(t) + \beta’ x}) \end{equation}
with cumulative hazard function (CHF):
\begin{equation}\label{eqname:Ht0t} H(t, x, z(t)) = \int_{0}^{t} \exp(B_t z(t) + \beta’ x) h(u)\,du. \end{equation}
Choose a dichotomous variable that can change its value at most once, for example: from treated to untreated, single to married, high school to college degree. Let $t_0$ denote the time at which the time-varying covariate changes from one state ($Z=0$) to the next state ($Z=1$). Then, $z(t) = 0$ for $t < t_0$ and $z(t)=1$ for $t \geq t_0$. This means that the CHF will be a piece-wise function having a different form over the two domains defined by $D_1 = (0, t_0)$ and $D_2 = [t_0, \infty)$. Assume that the infection times are exponentially distributed. If $t <t_0$, then based on the last equation the CHF is:
\begin{aligned} H(t, x, z(t)) & = \int_{0}^{t} \exp(\beta’ x) h(u) \,du \\ & = \lambda \exp(\beta’ x)\int_{0}^{t} \, du \\ & = \lambda \exp(\beta’x) t, \end{aligned}
where $\beta_t z(u) = 0$ and $\lambda = h(u)$ because the hazard (rate) of infection is constant at each time $t$. If $t\geq t_0$, the CHF is:
\begin{aligned} H(t, x, z(t)) &= \lambda \exp(\beta’ x)\int_{0}^{t} \exp( \beta_t z(t) ) \, du \\ &= \lambda \exp(\beta’ x)\left[ \int_{0}^{t_0} du + \int_{t_0}^t \exp(\beta_t z(t)) \, du \right] \\ &= \lambda \exp(\beta’ x)\left[t_0 + \exp(\beta_t)(t- t_0) \right]. \end{aligned}
We therefore have:
\begin{equation} H(t, x, z(t)) = \begin{cases} \lambda \exp(\beta’x) t & \text{if } t < t_0\\ \lambda \exp(\beta’x)[t_0 + \exp(\beta_t)t - \exp(\beta_t)t_0] & \text{if } t \geq t_0. \end{cases} \end{equation}
The two ranges of the piece-wise function correspond to $R_1 = (0, \lambda \exp(\beta’x)t_0)$ and $R_2 = [\lambda \exp(\beta’ x) t_0, \infty)$. As in Post 1 of this series, we now have to find the inverse of $H(t, x, z(t))$. Generally, we have that $F(t) = 1 - \exp(-H(t, x, z(t)))$, and by the inverse transform theorem (ITT), $F(t) \sim \text{Unif}(0, 1)$. Then $T = H^{-1}(-\log(U))$, so that to simulate infection times we have to find the inverse CHF. To do this, I deal with each part of the piece-wise function separately.
When $H(t, x, z(x)) < \lambda \exp(\beta’x)t_0$, then:
\begin{aligned} H(t, x, z(t)) &= \lambda \exp(\beta’)t \\ t & = \frac{H(t, x, z(t))}{\lambda \exp(\beta ‘)t} \\ H^{-1}(t, x, z(t)) &= \frac{t}{\lambda \exp(\beta’x) }. \end{aligned}
For $H(t, x, z(t)) \geq \lambda \exp(\beta’ x)t_0$, Austin shows that the inverse CHF is:
\begin{equation} H^{-1}(t, x, z(t))= \frac{t - \lambda \exp(\beta’x) t_0 + \lambda \exp(\beta’x + \beta_t) t_0}{\lambda \exp(\beta’x + \beta_t)} \end{equation}
if $t \geq \lambda \exp(\beta’ x)t_0$.
Recalling that $T = H^{-1}(-\log(u))$ by ITT, we again deal separately with the two piecewise functions. So if $-\log(u) < \exp(\beta’ x)t_0$ then:
\begin{equation} T = \frac{-log(u)}{\lambda \exp(\beta ’ x)} \end{equation}
and if $-\log(u) \geq \lambda \exp(\beta’ x)t_0$ then:
\begin{equation} T = \frac{-log(u) - \lambda \exp(\beta’x) t_0 + \lambda \exp(\beta’x + \beta_t) t_0}{\lambda \exp(\beta’x + \beta_t)}. \end{equation}
HIV cohort: testing and censoring
The next step of this simulation exercise is to generate the infection times conditional on one time-independent covariate $x$, one time-dependent covariate $z(t)$, and the right censoring values. Let $i$ represent the $i$th participant for $i = 1,\dots, N$. Because the data are time-varying, the $i$th participant has $U_j$ follow-up times ($j = 1,\dots,10$), where $U_1 <,\dots,<U_{10}$.
Let $x$ be a $n \times d$ matrix of covariate values, where $n$ represents the total number of observation ($n = \sum_i^N U_{ij}$) and $d = 2$. Let the time-independent variable $x_1 \sim \text{Unif(0, 1)}$, which represents the mean-centered age at infection. Let the dichotomous time-dependent variable $x_2 \sim \text{Unif}(1, 10) \times \text{Bern}(0.3)$ indicate the marital status of the participant, which can occur at any discrete time-point in the observation period with probability 0.3. I set the true/known parameters $\beta_1 = 0.5$ and $\beta_2 = -0.5$. For this simulation exercise, I set $\lambda =0.2$.
Let $\delta$ denote the censoring indicator such that $\delta = 1$ if the infection is observed during the observation period, otherwise $\delta = 0$. $\delta$ is therefore a random variable and must be simulated. Here, I generate $\delta$ by first simulating censoring times from an exponential distribution, with $\lambda_c = \lambda - 0.01$, such that $T_c \sim \text{Exp}(\lambda_c)$. Then, $\delta = \text{min}(T,\, T_c)$.
R code
The main function gen_dich
generates a single
dataset for the $i$th participant with his/her infection times, follow-up times,
one time-independent covariate (age at HIV infection) and one time-dependent
covariate (marital status). A censoring indicator is also constructed.
library(survival)
# Func to gen the data
gen_dich <- function(i) {
lambda <- 0.2
# Number of FU times
ntimes <- 10
# Generate covariate vals
x1 <- round(runif(1, 0, 1), 2)
time <- 1:ntimes
t0 <- sample(seq(ntimes), 1) * rbinom(1, 1, 0.3)
x2 <- as.numeric(time > t0)
u <- -log(runif(1))
b1x1 <- as.vector(x1 %*% 0.5)
# Generate infection time
t1 <- lambda * exp(b1x1) * t0
t2 <- lambda * exp(b1x1 + -0.5)
if (u < t1) {
itime <- u / (lambda * exp(b1x1))
} else if (u >= t1) {
itime <- (u - t1 + (t2 * t0)) / t2
}
# Make a censoring event
ctime <- rexp(1, rate = lambda - 0.01)
if (ctime < itime) {
hiv <- 0
} else {
hiv <- as.numeric(itime < time)
}
# Get the censoring time
mintime <- pmin(itime, ctime)
start <- time - 1
dat <- data.frame(id = i, start, time, ctime, itime, hiv, x1, x2)
dat <- dat[time <= ceiling(mintime), ]
return(dat)
}
An example of the dataset for the first participant is given below, which shows
that x2
, the marital status variable, changes over time for the
fourth participant.
set.seed(187900)
lapply(seq(10), gen_dich)[4]
[[1]]
id start time ctime itime hiv x1 x2
1 4 0 1 11.29535 11.38225 0 0.14 0
2 4 1 2 11.29535 11.38225 0 0.14 0
3 4 2 3 11.29535 11.38225 0 0.14 0
4 4 3 4 11.29535 11.38225 0 0.14 0
5 4 4 5 11.29535 11.38225 0 0.14 0
6 4 5 6 11.29535 11.38225 0 0.14 0
7 4 6 7 11.29535 11.38225 0 0.14 0
8 4 7 8 11.29535 11.38225 0 0.14 1
9 4 8 9 11.29535 11.38225 0 0.14 1
10 4 9 10 11.29535 11.38225 0 0.14 1
Next, I write a function to pass the data to the coxph
function. This
function estimates $\hat{\beta_1}, \hat\beta_2$.
# Function to run the simulations
do_sim <- function(n, f) {
dat <- do.call(rbind, lapply(seq(n), f))
bhat <- coxph(Surv(start, time, hiv) ~ x1 + x2 + cluster(id), data = dat)
return(bhat$coefficients)
}
Let $k = 1,\dots,K$ represent the total number of datasets to be generated. I run the simulation $K$ times to generate $K$ estimates and take the average of the $\hat\beta$ estimates. I then take the average of the $\beta_1^{k}$ and $\beta_2^{k}$ estimates and plot the cumulative sum of the $k$ estimates divided by the corresponding number of estimates. For this simulation exercise, I selected $K=400$.
set.seed(100400)
K <- 400
# Take the mean of the estimates
bhats <- sapply(seq(K), function(x) do_sim(400, gen_dich))
beta_mn <- apply(bhats, 1, mean)
beta_sd <- apply(bhats, 1, sd)
beta_mn; beta_sd
The two figures show the results of the simulation exercise, where the top and bottom panels are the estimates for the age and marital status coefficients, respectively. Both panels show that there is some variability in the estimates for $k<150$, thereafter the estimates converge approximately to the true parameters.