Kernel smoothing in R

Introduction
In this post I give a basic introduction to kernel smoothing in R. I motivate kernel smoothing with a short introduction to the histogram, which is the oldest and most widely used density estimator.
The histogram
The histogram is mainly used as a graphical tool for displaying the frequency distributions of univariate data. We can construct a graph by subdividing the axis of measurement into intervals or bins of equal width. Next, a rectangle is constructed over each bin, so that the height of the rectangle is proportional to the fraction of the total number of measurements falling in each cell.
Let
Example
Consider constructing a histogram of
# simulate data
set.seed(1234)
n <- 50
lambda <- 2
u <- runif(n)
itime <- -log(u) / lambda
itime[0:10]
[1] 1.08708094 0.23716697 0.24774300 0.23629994 0.07487953 0.22290095
[7] 2.32845514 0.72932392 0.20316993 0.33252177
We can easily make a plot using the hist
function in R
.
p1 <- hist(itime, breaks = 5, main = "Histogram: 5 bins",
xlab = "Time", col = "white", freq = FALSE)

R
shows us the breaks that define the left points of the intervals and the number
of
p1$breaks
[1] 0.0 0.5 1.0 1.5 2.0 2.5
p1$counts
[1] 27 19 1 2 1
The heights of the bins are obtained by dividing the counts
in each bin by n
, the
total number of observations. These heights are called relative frequencies and are
the fraction of the total number of observations (
rf <- with(p1, counts / n)
rf
[1] 0.54 0.38 0.02 0.04 0.02
These relative frequencies sum to 1, which we would expect if we describe this as a discrete probability density function (pdf).
sum(rf)
[1] 1
Relative frequencies are not the same as kernel densities.
Since the kernel densities are scaled by the bin width
and for all bins, the density estimates are:
p1$density
[1] 1.08 0.76 0.04 0.08 0.04
To have the kernel densities sum to 1, we have to scale by the bin width.
sum(p1$density * 0.5)
[1] 1
The choice of the number of bins will clearly affect the bin width
hist(itime, breaks = 10, main = "Histogram: 10 bins",
xlab = "Time", col = "white", freq = FALSE)

The kernel density estimator
As with the histogram, kernel density smoothing is a method for finding structure in the data without the imposition of a parametric model. The kernel density estimator is given by:
where
and
The kernel estimate
One common choice for
Here, the scale of the kernel is
Example
To demonstrate the univariate kernel density estimator, consider the following five infection times:
# define the normal density
normfun <- function(x, sigma = 1)
(2 * pi * sigma^2)^(-0.5) * exp(-x^2 / (2 * sigma^2))
# rescale normfun
kern <- function(Xi, x, h = 1, sigma = 1)
normfun((x - Xi) / h, sigma = sigma) / h
## create the window width h
left <- 0
right <- 12
step <- 0.05
x <- seq(left, right, step)
# the observations
Xi <- c(3, 4.25, 5, 8, 9)
# get ith density
est_xi = do.call(cbind, lapply(Xi, kern, x))
# get n density
est <- apply(est_xi, 1, mean)
# get similar result using base function
# density(Xi, kernel = "gaussian", bw = 1, from = left, to = right, cut = step)
The figure below shows the kernel density
estimate for these five datapoints (the diamonds). To produce this plot, I first constructed grid points on the x-axis with bandwidth
library(RColorBrewer)
set1 <- brewer.pal(9, "Set1")
# plot
plot(x, est, type = "l", ylab = "Density", xlab = "x", bty = "n")
for (i in Xi) {
lines(x, est_xi[, which(i == Xi)] / length(Xi), lty = 2, lwd = 2, col = set1[which(i == Xi)])
}
points(Xi, rep(0, length(Xi)), pch = 18, cex = 1.8, col = set1[seq(length(x))])

Since the kernel density is a smoothed, continuous function, we have to integrate to obtain the total area. In the code below, I fit a spline function to the estimates and then integrate to show that the area sums to 1.
# Check that density integrates to 1
splxy = splinefun(x, est)
integrate(splxy, lower = min(x), upper = max(x))
0.9994515 with absolute error < 2.1e-07