Kernel smoothing in R
Introduction
In this post I give a basic introduction to kernel smoothing in R. I motivate kernel smoothing with a short introduction to the histogram, which is the oldest and most widely used density estimator.
The histogram
The histogram is mainly used as a graphical tool for displaying the frequency distributions of univariate data. We can construct a graph by subdividing the axis of measurement into intervals or bins of equal width. Next, a rectangle is constructed over each bin, so that the height of the rectangle is proportional to the fraction of the total number of measurements falling in each cell.
Let $X_i$ represent the $i$th measurement and let $x$ represent the axis of measurement. Given an integer $k$, an origin $x_0$, and bin width $b$, the histogram has intervals of the form $[x_0 + kb, x_0 + (k+1)b)$, which are closed on the left and open on the right. In words, the histogram estimator is defined as
\begin{equation*} \hat{f}(x) = \frac{1}{n} \times \frac{\text{number of } X_i \text{ in same bin as } x}{\text{width of bin containing } x}. \end{equation*}
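The definition above can be coded directly. The sketch below is mine (the function name fhat_hist, the toy data, and the choice of origin and bin width are illustrative, not from the post):

```r
# direct implementation of the histogram estimator f-hat(x)
fhat_hist <- function(x, X, x0, b) {
  # index of the bin containing x, with bins of the form [x0 + k*b, x0 + (k+1)*b)
  k <- floor((x - x0) / b)
  in_bin <- X >= x0 + k * b & X < x0 + (k + 1) * b
  sum(in_bin) / (length(X) * b)
}

X <- c(0.2, 0.4, 0.7, 1.1, 1.6)
fhat_hist(0.3, X, x0 = 0, b = 0.5)  # 2 obs in [0, 0.5): 2 / (5 * 0.5) = 0.8
```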
Example
Consider constructing a histogram of $n=50$ infection times simulated from an exponential distribution with rate $\lambda = 2$.
# simulate data
set.seed(1234)
n <- 50
lambda <- 2
u <- runif(n)
itime <- -log(u) / lambda
itime[1:10]
[1] 1.08708094 0.23716697 0.24774300 0.23629994 0.07487953 0.22290095
[7] 2.32845514 0.72932392 0.20316993 0.33252177
We can easily make a plot using the hist function in R.
p1 <- hist(itime, breaks = 5, main = "Histogram: 5 bins",
xlab = "Time", col = "white", freq = FALSE)
R shows us the breaks that define the endpoints of the intervals and the number of $X_i$ in each bin.
p1$breaks
[1] 0.0 0.5 1.0 1.5 2.0 2.5
p1$counts
[1] 27 19 1 2 1
Dividing the counts in each bin by n, the total number of observations, gives the relative frequencies: the fraction of the observations $X_i$ falling in each bin.
rf <- with(p1, counts / n)
rf
[1] 0.54 0.38 0.02 0.04 0.02
These relative frequencies sum to 1, as we would expect: they form a discrete probability distribution (a probability mass function) over the bins.
sum(rf)
[1] 1
Relative frequencies are not the same as the density estimates, however. Since the densities are additionally scaled by the bin width $b$, the two coincide only when $b = 1$. For the first bin, the density is:
\begin{equation} \hat{f} = \frac{27}{50 \times 0.5} = 1.08. \end{equation}
The density estimates for all bins are:
p1$density
[1] 1.08 0.76 0.04 0.08 0.04
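We can confirm this relationship directly: dividing the counts by $n \times b$ reproduces the densities that hist reports. The chunk below re-simulates the data so that it runs on its own:

```r
set.seed(1234)
n <- 50
itime <- -log(runif(n)) / 2
p1 <- hist(itime, breaks = 5, plot = FALSE)
b <- diff(p1$breaks)[1]            # bin width, 0.5 here
all.equal(p1$counts / (n * b), p1$density)
# [1] TRUE
```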
To recover a total area of 1 from the densities, we multiply each one by the bin width before summing.
sum(p1$density * 0.5)
[1] 1
The choice of the number of bins clearly affects the bin width $b$ and changes the appearance of the histogram. Generally, a smaller bin width (which increases the variance) leads to a more jagged histogram, while a larger bin width (which increases the bias) leads to a smoother-looking histogram. For this reason, the bin width $b$ is often called a smoothing parameter. In the example below, we increase the number of bins from $k=5$ to $k = 10$.
hist(itime, breaks = 10, main = "Histogram: 10 bins",
xlab = "Time", col = "white", freq = FALSE)
The kernel density estimator
As with the histogram, kernel density smoothing is a method for finding structure in the data without the imposition of a parametric model. The kernel density estimator is given by:
\begin{equation} \hat{f}(x;h) = (nh)^{-1} \sum_{i = 1}^n K\{(x - X_i)/h\} \end{equation}
where $K$ is called the kernel and satisfies
\begin{equation} \int_{-\infty}^{\infty} K(x)\, dx =1 \end{equation}
and $h$ is called the smoothing parameter or bandwidth ($h > 0$). The above equation can be written more compactly by rescaling $K_h(u) = h^{-1} K(u/h)$ such that:
\begin{equation}\label{eq:fest} \hat{f}(x;h) = n^{-1} \sum_{i = 1}^n K_h(x - X_i). \end{equation}
The kernel estimate $\hat{f}(x)$ is obtained by centering a scaled kernel at each $X_i$ and then averaging the $n$ kernel ordinates at the point $x$. The more observations that fall near $x$ (relative to the bandwidth), the higher the estimate compared with regions with fewer observations. This approach is similar to the histogram method, except that instead of counting the number of observations in a bin, we sum the kernel ordinates over the window width $h$.
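A quick check that the rescaling $K_h(u) = h^{-1} K(u/h)$ preserves the unit integral: for any $h > 0$, $K_h$ still integrates to 1. A minimal sketch, using the Gaussian kernel via R's dnorm:

```r
# rescaled kernel K_h(u) = K(u / h) / h, with K the standard normal density
Kh <- function(u, h) dnorm(u / h) / h

# the integral is 1 regardless of the bandwidth
for (h in c(0.25, 1, 4)) {
  print(integrate(Kh, -Inf, Inf, h = h)$value)
}
# each value is 1 (up to numerical error)
```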
One common choice for $K$ is the Gaussian density:
\begin{equation} \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(\frac{-x^2}{2 \sigma^2}\right). \end{equation}
Here we take $\sigma = 1$, so the kernel is the standard normal density, $N(0, 1)$.
Example
To demonstrate the univariate kernel density estimator, consider the following five infection times: $X = \{3, 4.25, 5, 8, 9\}$, and let $K = \phi$ be the standard normal density.
# define the normal density
normfun <- function(x, sigma = 1)
(2 * pi * sigma^2)^(-0.5) * exp(-x^2 / (2 * sigma^2))
# rescale normfun
kern <- function(Xi, x, h = 1, sigma = 1)
normfun((x - Xi) / h, sigma = sigma) / h
# create the grid of evaluation points
left <- 0
right <- 12
step <- 0.05
x <- seq(left, right, step)
# the observations
Xi <- c(3, 4.25, 5, 8, 9)
# kernel ordinates centered at each X_i (one column per observation)
est_xi <- do.call(cbind, lapply(Xi, kern, x))
# average the n kernel ordinates at each grid point
est <- apply(est_xi, 1, mean)
# a similar result using the base density function
# density(Xi, kernel = "gaussian", bw = 1, from = left, to = right)
The figure below shows the kernel density estimate for these five data points (the diamonds). To produce this plot, I first constructed grid points on the x-axis with spacing 0.05 in the range $[0, 12]$; the bandwidth is $h = 1$. To get the estimates for $X_1$, for example, I calculated the kernel weight for each $x$ using $K_h(x - X_1)$. This result is shown by the red line of the figure. I used a similar procedure for $X_2,\dots,X_5$, as shown in their corresponding colors. The kernel density estimate (black line) is the average of the $X_i$ ordinates at $x$.
library(RColorBrewer)
set1 <- brewer.pal(9, "Set1")
# plot
plot(x, est, type = "l", ylab = "Density", xlab = "x", bty = "n")
for (i in seq_along(Xi)) {
  lines(x, est_xi[, i] / length(Xi), lty = 2, lwd = 2, col = set1[i])
}
points(Xi, rep(0, length(Xi)), pch = 18, cex = 1.8, col = set1[seq_along(Xi)])
Since the kernel density estimate is a smooth, continuous function, we have to integrate to obtain the total area. In the code below, I fit a spline function to the estimates and then integrate to show that the area is approximately 1; the small deficit is the mass that falls outside the $[0, 12]$ grid.
# Check that density integrates to 1
splxy <- splinefun(x, est)
integrate(splxy, lower = min(x), upper = max(x))
0.9994515 with absolute error < 2.1e-07
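A spline is not strictly necessary: because the grid is equally spaced, a simple Riemann sum (ordinates times grid spacing) gives essentially the same answer. A sketch, re-building the estimate so it runs standalone:

```r
x <- seq(0, 12, 0.05)
Xi <- c(3, 4.25, 5, 8, 9)
est <- rowMeans(sapply(Xi, function(xi) dnorm(x - xi)))

# approximate the integral as sum of ordinates times grid spacing
sum(est) * 0.05  # close to 1; the deficit is mass outside [0, 12]
```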