Simulating event times (Simulation series 1)
Introduction
In this post, I describe how to simulate HIV infection times from an exponential
distribution. I begin by simulating the unconditional infection times. Then, I show how to condition the infection times on covariate data using a Cox Proportional Hazards model. I specifically demonstrate this procedure with R
code using two covariates (age, sex).
Simulating unconditional infection times
Let $T$ be a non-negative random variable denoting the time of HIV infection. We therefore have a probability density function (pdf): $f(t)=Pr(T=t)$ and cumulative distribution function (cdf): $F(t)=Pr(T\leq t)$. Assume that the HIV infection times are exponentially distributed with rate $\lambda$ and pdf:
\begin{equation}\label{eqname:exp} f(t; \lambda) = \lambda \exp({-\lambda t}) \end{equation}
where $t \ge 0$. To find the cdf of the exponential distribution, we integrate with respect to $x$:
\begin{align} F(t) &= \int_0^1 \lambda \exp({-\lambda x}) \\ &= \lambda \Bigr\rvert_0^x - \exp({-\lambda x}) \\ & = 1 - \exp({-\lambda x}) \quad \text{for } x \geq 0. \end{align}
To simulate the infection times, I use the Inverse Transform Theorem (ITT), which states that if $T$ is a continuous random variable having $F(T)$, then $F(T) \sim \text{Unif}(0, 1)$. The ITT proof is show at the end of this post. Define $U = \text{Unif}(0, 1)$ so that by ITT we have $U = F(T)$. Then:
\begin{align} U & = F(t) \\ U & = 1 - \exp({-\lambda T}) \\ -\lambda T & = \ln(1-U) \\ T & = \frac{-\ln(1-U)}{\lambda}, \end{align}
which gives the procedure for simulating the infection times.
Simulating conditional infection times
The choice of $T \sim \text{Exp}(\lambda)$ means that the participant’s infection time is independent of his or her individual characteristics or surrounding environment. This is often not the case in the real world. In the next steps, I simulate infection times that are conditional on covariate values.
It is common to model infection times using a Cox Proportional Hazards (PH) model:
\begin{equation}\label{eqname:coxph} h(t|x) = h_0(t)\, \exp({\beta’ x}) \end{equation}
where $t$ is time, $x$ is a d-vector of covariates, $\beta$ is a vector of regression coefficients, and $h_0(t)$ is the baseline hazard function when $x = 0$. In this way, the Cox PH model provides a method for simulating the infection times conditional on $x$ through the hazard rate. Following the theory presented in this post, we can write the cumulative hazard function for the Cox PH model as:
\begin{align} H(t|x) &= \int_0^t h_0(u)\,\exp({\beta’ x})\, du \\ &= \exp({\beta’ x})\int_0^t h_0(u)\, du \\ &= H_0(t) \, \exp({\beta’ x}). \end{align}
Then we have (see the referenced post):
\begin{equation} S(t|x) = \exp[-H_0(t) \, \exp({\beta’ x})] \end{equation}
We can obtain $F(t|x)$ using
\begin{equation} F(t) = 1 - \exp[H(t)] \end{equation}
and then invoke ITT to isolate $T$:
\begin{align} U &= 1 - \exp[-H_0(T) \, \exp({\beta’ x})] \\ \ln(1 -U) &= -H_0(T) \, \exp({\beta’ x}) \\ \frac{-\ln(1 -U)}{\exp({\beta’ x})} &= H_0(T) \\ T &= H_0^{-1}\left(\frac{-\ln(1 -U)}{\exp({\beta’ x})}\right) \end{align}
We now have a generalized formula for generating infection times from a uniform distribution. The next step is to derive $H_0^{-1}(t)$, the inverse CHF, from the exponential distribution.
\begin{align} H_0(t) & = -\ln[S(t)] \\ & = -\ln[\exp({-\lambda t})] \\ & = \lambda t \\ H_0^{-1}[H_0(t)] & = H_0^{-1}(\lambda t) \\ t & = \lambda H_0^{-1}(t) \\ H_0^{-1}(t) = \frac{t}{\lambda},\\ \end{align}
for $t \geq 0$. The infection times can be generated with (Reference):
\begin{equation} T = -\frac{\ln(1 - U)}{\lambda \exp(\beta’ x)}. \end{equation}
HIV cohort: testing and censoring
The final step of this simulation exercise is to generate the $x$ covariate and right censoring values. Let $i$ represent the $i$th participant for $i = 1,\dots, N$. As per the definition of a cohort, each participant enters into the cohort with a negative status. Participants are then periodically tested for HIV until their first HIV-positive test or until the end of the observation period.
Let $\delta$ denote the censoring indicator such that $\delta = 1$ if the infection is observed during the observation period, otherwise $\delta = 0$. $\delta$ is therefore a random variable and must be simulated. Here, I generate $\delta$ by first simulating censoring times from an exponential distribution, with $\lambda_c = \lambda - 0.1$, such that $T_c \sim \text{exp}(\lambda_c)$. Then, $\delta = \text{min}(T,\, T_c)$.
Let $x$ be a $n \times d$ matrix of covariate values. For this exercise, $d = 2$. Let $x_1 \sim \text{N(0, 4)}$, which represents the mean-centered age at infection with a standard deviation of 4 years. Let $x_2 \sim \text{Bern}(0.60)$ indicate the sex of the participant, with women being over-represented in the cohort. I set the true/known parameters $\beta_1 = 0.05$ and $\beta_2 = 1.5$. For this simulation exercise, I set $\lambda=1$.
R code
The R
code below shows the procedure for generating the covariate values, the censoring indicator, and the infection times.
gen_data <- function(N, lambda, beta_1, beta_2) {
# gen covariate data
x1 <- rnorm(N, 0, 4)
x2 <- rbinom(N, 1, 0.6)
X <- as.matrix(cbind(x1, x2))
betas <- matrix(c(beta_1, beta_2))
# gen uniforms
u <- runif(N)
# gen the infections times
itime <- -log(u)/(lambda * exp(X %*% betas))
# Gen censoring times
ctime <- rexp(N, rate = lambda - 0.1)
# get the right censored time
time <- pmin(itime, ctime)
# got HIV or not during obs period
hiv <- as.numeric(itime <= ctime)
dat <- data.frame(id = 1:N, time = time, hiv = hiv, x1 = x1, x2 = x2)
return(dat)
}
Next, I pass the generated data to the coxph
function. This function estimates $\hat{\beta_1}, \hat\beta_2$.
library(survival)
do_sim <- function(N, lambda = 1, beta_1 = 0.05, beta_2 = 1.5) {
# gen a dataset
dat <- gen_data(N, lambda, beta_1, beta_2)
# get the beta estimates
bhat <- coxph(Surv(time, hiv) ~ x1 + x2, data = dat)
return(bhat$coefficients)
}
Let $k = 1,\dots,K$ represent the total number of datasets to be generated. In the
code below, I pass the $k$th dataset to the coxph
function and extract the
estimated coefficients in a $2 \times K$ matrix. I repeat steps 1 and 2 a total of
$K$ times. I then took the average of the $\beta_1^{k}$ and $\beta_2^{k}$ estimates
and plot the cumulative sum of the $k$ estimates divided by the corresponding
number of estimates. For this simulation exercise, I select $N=200$ and $K=1000$.
set.seed(100400)
K <- 1000
# take the mean of the estimates
bhats <- sapply(seq(K), function(x) do_sim(200))
beta_mn <- apply(bhats, 1, mean)
beta_sd <- apply(bhats, 1, sd)
beta_mn; beta_sd
x1 x2
0.05120952 1.51963760
x1 x2
0.02239449 0.21762811
# get the average over K runs
b1 <- cumsum(bhats[1, ])/(1:K)
b2 <- cumsum(bhats[2, ])/(1:K)
library(RColorBrewer)
Blues <- brewer.pal(4, "Blues")
plot(1:K, b1, type = "l", bty = "l", lwd =2, ylim = c(0.02, 0.1),
xlab = "Iterations (K)", ylab = "Estimate",
main = "Age coefficient", col = Blues[3])
abline(h = 0.05, lty = 2)
plot(1:K, b2, type = "l", bty = "l", lwd = 2, ylim = c(1.0, 2.0),
xlab = "Iterations (K)", ylab = "Estimate",
main = "Sex coefficient", col = Blues[3])
abline(h = 1.5, lty = 2)
The two figures show the results of the simulation exercise, where the left
and right panels are the estimates for the Age
and Sex
coefficients, respectively. Both panels show that there is some variability in
the estimates for $k<150$, thereafter the estimates converge approximately to the
true parameters.
Proof of Inverse Transform Theorem:
Let $Y = F(X)$, then the cdf of $Y$ is:
\begin{align} P(Y \leq y) &= P(F(X) \leq y) \\ &= P(X \leq F^{-1}(y)) \\ & = F(F^{-1}(y)) = y, \end{align}
which is the cdf of the $\text{Unif}(0, 1)$.