Kernel smoothing in R

Introduction

In this post I give a basic introduction to kernel smoothing in R. I motivate kernel smoothing with a short introduction to the histogram, which is the oldest and most widely used density estimator.

The histogram

The histogram is mainly used as a graphical tool for displaying the frequency distribution of univariate data. We can construct a graph by subdividing the axis of measurement into intervals or bins of equal width. Next, a rectangle is constructed over each bin, so that the height of the rectangle is proportional to the fraction of the total number of measurements falling in that bin.

Let $X_i$ represent the $i$th measurement and let $x$ represent the axis of measurement. Given an origin $x_0$, integer $k$, and bin width $b$, the histogram has intervals of the form $[x_0 + kb, x_0 + (k+1)b)$, which are closed on the left and open on the right. In words, the histogram estimator is defined as

\begin{equation*} \hat{f}(x) = \frac{1}{n} \times \frac{\text{number of } X_i \text{ in same bin as } x}{\text{width of bin containing } x}. \end{equation*}
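As a concrete sketch of this definition (assuming an origin x0 = 0 and left-closed, right-open bins), the estimator can be coded directly:

```r
# Histogram density estimate at a point x: count the X_i in the
# bin containing x, then divide by n times the bin width
hist_est <- function(x, X, x0 = 0, b = 0.5) {
  k <- floor((x - x0) / b)   # bin index of the bin containing x
  lo <- x0 + k * b           # left edge of that bin
  sum(X >= lo & X < lo + b) / (length(X) * b)
}
```

For example, hist_est(0.3, c(0.2, 0.4, 1.1, 1.3)) counts 2 of 4 observations in $[0, 0.5)$, giving $2 / (4 \times 0.5) = 1$.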

Example

Consider constructing a histogram of $n=50$ infection times simulated from an exponential distribution with $\lambda = 2$.

# simulate data
set.seed(1234)
n <- 50
lambda <- 2
u <- runif(n)
itime <- -log(u) / lambda
itime[1:10]
 [1] 1.08708094 0.23716697 0.24774300 0.23629994 0.07487953 0.22290095
 [7] 2.32845514 0.72932392 0.20316993 0.33252177

We can easily make a plot using the hist function in R.

p1 <- hist(itime, breaks = 5, main = "Histogram: 5 bins", 
     xlab = "Time", col = "white", freq = FALSE)

R shows us the breakpoints that define the intervals and the number of $X_i$ falling in each bin.

p1$breaks
[1] 0.0 0.5 1.0 1.5 2.0 2.5
p1$counts
[1] 27 19  1  2  1

Dividing the counts in each bin by $n$, the total number of observations, gives the relative frequencies: the fraction of the observations $X_i$ that fall in each bin.

rf <- with(p1, counts / n)
rf
[1] 0.54 0.38 0.02 0.04 0.02

These relative frequencies sum to 1, as we would expect for a discrete probability distribution.

sum(rf)
[1] 1

Relative frequencies are not the same as the density values reported by hist. Since the densities are additionally scaled by the bin width $b$, the two can only agree when $b = 1$. Thus, for the first bin, the density is:

\begin{equation} \hat{f} = \frac{27}{50 \times 0.5} = 1.08, \end{equation}

and for all bins, the density estimates are:

p1$density
[1] 1.08 0.76 0.04 0.08 0.04
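These values follow directly from the definition; as a quick check, using the counts and bin width from above:

```r
# densities = bin counts / (n * bin width); counts are p1$counts from above
counts <- c(27, 19, 1, 2, 1)
counts / (50 * 0.5)
# [1] 1.08 0.76 0.04 0.08 0.04
```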

To recover a total area of 1, we have to scale the densities by the bin width.

sum(p1$density * 0.5)
[1] 1

The choice of the number of bins will clearly affect the bin width $b$ and change the appearance of the histogram. Generally, a smaller bin width (which increases the variance) leads to a more jagged histogram and a larger bin width (which increases the bias) leads to a smoother-looking histogram. For this reason, the bin width $b$ is often called a smoothing parameter. In the example below, we increase the number of bins from $k = 5$ to $k = 10$.

hist(itime, breaks = 10, main = "Histogram: 10 bins", 
     xlab = "Time", col = "white", freq = FALSE)

The kernel density estimator

As with the histogram, kernel density smoothing is a method for finding structure in the data without the imposition of a parametric model. The kernel density estimator is given by:

\begin{equation} \hat{f}(x;h) = (nh)^{-1} \sum_{i = 1}^n K\{(x - X_i)/h\} \end{equation}

where $K$ is called the kernel and satisfies

\begin{equation} \int_{-\infty}^{\infty} K(x)\, dx =1 \end{equation}

and $h$ is called the smoothing parameter or bandwidth ($h > 0$). The above equation can be written more compactly by defining the rescaled kernel $K_h(u) = h^{-1} K(u/h)$, so that:

\begin{equation}\label{eq:fest} \hat{f}(x;h) = n^{-1} \sum_{i = 1}^n K_h(x - X_i). \end{equation}

The kernel estimate $\hat{f}(x)$ is obtained by centering a scaled kernel at each $X_i$ and then taking the average of the $n$ kernel ordinates at the point $x$. The more observations that fall within a bandwidth of $x$, the higher the estimate there compared with regions with fewer observations. This approach is similar to the histogram method, except that instead of counting the number of observations in a bin, we are now summing the kernel ordinates over the window width $h$.

One common choice for $K$ is the Gaussian density:

\begin{equation} \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(\frac{-x^2}{2 \sigma^2}\right). \end{equation}

With $\sigma = 1$ this is the standard normal density, so the kernel has scale $N(0, 1)$ and the rescaled kernel $K_h$ has scale $N(0, h^2)$.
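With the Gaussian kernel, the compact form of the estimator is a one-liner using R's built-in dnorm (a sketch; fhat is a hypothetical helper name, with default bandwidth h = 1):

```r
# Kernel density estimate f-hat(x; h) with a Gaussian kernel:
# average the scaled kernel ordinates K_h(x - X_i) over the sample
fhat <- function(x, X, h = 1)
  sapply(x, function(xx) mean(dnorm((xx - X) / h) / h))
```

For a single observation at 0 with h = 1, fhat(0, 0) reduces to dnorm(0), the standard normal density at its mode.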

Example

To demonstrate the univariate kernel density estimator, consider the following five infection times: $X = \{3, 4.25, 5, 8, 9\}$ and let $K = \phi$ be the standard normal density.

# define the normal density
normfun <- function(x, sigma = 1) 
  (2 * pi * sigma^2)^(-0.5) * exp(-x^2 / (2 * sigma^2))

# rescale normfun
kern <- function(Xi, x, h = 1, sigma = 1)
  normfun((x - Xi) / h, sigma = sigma) / h

## create the window width h
left <- 0
right <- 12
step <- 0.05
x <- seq(left, right, step)
# the observations
Xi <- c(3, 4.25, 5, 8, 9)
# get ith density
est_xi <- do.call(cbind, lapply(Xi, kern, x))
# get n density
est <- apply(est_xi, 1, mean)

# a similar result using the base function:
# density(Xi, kernel = "gaussian", bw = 1, from = left, to = right)

The figure below shows the kernel density estimate for these five data points (the diamonds). To produce this plot, I first constructed grid points on the x-axis with step 0.05 in the range $[0, 12]$. To get the contribution of $X_1$, for example, I calculated the kernel weight at each grid point $x$ using $K_h(x - X_1)$. This result is shown by the red line of the figure. I used a similar procedure for $X_2,\dots,X_5$, as shown in their corresponding colors. The kernel density estimate (black line) is the average of the $n$ kernel ordinates at each $x$.

library(RColorBrewer)
set1 <- brewer.pal(9, "Set1")
# plot
plot(x, est, type = "l", ylab = "Density", xlab = "x", bty = "n")
for (i in seq_along(Xi)) {
  lines(x, est_xi[, i] / length(Xi), lty = 2, lwd = 2, col = set1[i])
}
points(Xi, rep(0, length(Xi)), pch = 18, cex = 1.8, col = set1[seq_along(Xi)])

Since the kernel density estimate is a smooth, continuous function, we have to integrate to obtain the total area. In the code below, I fit a spline function to the estimates and then integrate to show that the area is (approximately) 1.

# Check that density integrates to 1
splxy <- splinefun(x, est)
integrate(splxy, lower = min(x), upper = max(x)) 
0.9994515 with absolute error < 2.1e-07
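A simpler sanity check, avoiding the spline fit, is a Riemann sum of the ordinates times the grid step (a self-contained sketch that recomputes the estimate from above):

```r
# Riemann-sum approximation of the area under the kernel estimate
Xi <- c(3, 4.25, 5, 8, 9)
x <- seq(0, 12, 0.05)
est <- rowMeans(sapply(Xi, function(Xj) dnorm(x - Xj)))
sum(est) * 0.05   # close to 1; a little mass lies outside [0, 12]
```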
Alain Vandormael
Senior Data Scientist, PhD, MSc