Distributions in R
Introduction
This post contains some of my notes on the Bernoulli, binomial, geometric, Poisson, and exponential distributions. I describe these distributions using R's functions and graphics. I continue to work on this post to improve my understanding and gain deeper insight into probability distributions, particularly through the use of worked examples in R.
Bernoulli distribution
Consider a simple experiment that results in either a “success” or a “failure”.
Let $X =1$ be the event that the experiment results in a “success” and let $X = 0$ be the event that it results in a “failure”.
We define $X$ as a discrete random variable having a Bernoulli distribution with parameter $p$, where $p$ is a probability ($0 \leq p \leq 1$). Formally, we denote $X \sim \text{Bern}(p)$, if $X$ has the probability mass function (pmf):
$$ \begin{aligned} P(X=1) &= p \\ P(X=0) &= 1-p. \end{aligned} $$
- A popular example of a Bernoulli distribution is a coin flip. We let $X=1$ be a “heads” up result and $X=0$ be a “tails” up result and assume the coin is fair ($p=0.5$).
- We can simulate a coin flip in R using the function `rbinom(n, size, prob)`, where `n` is the number of observations, `size` is the number of trials, and `prob` is the probability of success ($p$). The example below produces the realization $x=0$, a “tails” result.
set.seed(1200)
rbinom(n=1, size=1, prob=0.5)
[1] 0
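Over many repetitions, the proportion of successes should settle near $p$. As a quick sanity check (the 10,000 replications here are an arbitrary choice), we can average many simulated flips; the result should be close to 0.5.

# Average of 10,000 simulated Bernoulli trials; should be close to p = 0.5
mean(rbinom(n=10000, size=1, prob=0.5))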
Binomial distribution
The Bernoulli distribution is not that useful because it models only a single event, for example, one coin flip. When we put many Bernoulli trials together, we get a binomial experiment.
A binomial experiment consists of $n$ independent Bernoulli trials, where each trial results in a “success” or a “failure”. The probability of “success” on each trial is $p$, and of “failure” is $1-p$. The trials are independent of one another, meaning that the outcome of any subset of trials does not affect the outcomes of the remaining trials.
Let $Y$ count the number of successes in $n$ independent trials, each with probability $p$ of success ($0 \leq p \leq 1$). Then $Y$ has a binomial distribution with parameters $n$ and $p$, denoted $Y \sim \text{Binom}(n, p)$. Equivalently, $Y = X_1 + X_2 + \dots + X_n$, where $X_1, \dots, X_n$ are $n$ independent random variables, each with a $\text{Bern}(p)$ distribution. The pmf of the binomial distribution is:
$$ P(Y=y) = \binom{n}{y} p^y (1-p)^{n-y} \text{ for } y=0,\dots,n $$
- We can generate a realization of $Y \sim \text{Binom}(n, p)$ using the `rbinom` function. Continuing with the coin example, we flip a coin $n=10$ times and count the number of “heads” results (the 1s); an equivalent single-call draw is shown after the code.
x <- rbinom(n=10, size=1, prob=0.5)
x
[1] 0 0 1 0 1 1 1 1 1 1
sum(x)
[1] 7
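Equivalently, a single call with `size=10` draws the binomial count directly, without summing individual Bernoulli realizations:

# One draw of Y ~ Binom(10, 0.5): the number of successes in 10 trials
rbinom(n=1, size=10, prob=0.5)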
- The binomial distribution can be used to answer questions such as: what is the probability that we observe 3 “heads” in 10 coin flips? We can compute this probability using the function `dbinom(x, size, prob)`, where `x` is the number of successes and `size` is the number of trials. The probability of seeing 3 “heads” in 10 coin flips, expressed as $P(Y=3)$, is:
dbinom(x=3, size=10, prob=0.5)
[1] 0.1171875
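As a check, we can evaluate the pmf by hand, using choose for the binomial coefficient:

# P(Y=3) = C(10,3) * 0.5^3 * (1-0.5)^7
choose(10, 3) * 0.5^3 * (1-0.5)^7
[1] 0.1171875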
- We can plot a binomial distribution for different values of $(n, p)$ using the `dbinom` function.
plot(0:20, dbinom(0:20, 20, prob=0.5), type = "o",
     col = "indianred1", xlab = "x (number of successes)", xlim=c(0, 40),
     ylab = "Probability")
lines(0:40, dbinom(0:40, 40, prob=0.4), type="o", col="skyblue3")
lines(0:20, dbinom(0:20, 20, prob=0.6), type="o", col="darkseagreen3")
legend("topright", legend=c("n=20, p=0.5", "n=40, p=0.4", "n=20, p=0.6"),
       bty="n", lty=1, col=c("indianred1", "skyblue3", "darkseagreen3"))
- As $n$ gets large, the binomial distribution approaches a normal distribution with mean $np$ and variance $np(1-p)$; a quick visual check follows.
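A minimal sketch of this approximation (the choice $n=100$, $p=0.5$ is arbitrary), overlaying the binomial pmf with a normal density of matching mean and standard deviation:

n <- 100; p <- 0.5
# Binomial pmf as vertical bars
plot(0:n, dbinom(0:n, size=n, prob=p), type="h",
     xlab="x (number of successes)", ylab="Probability")
# Normal density with mean np and standard deviation sqrt(np(1-p))
curve(dnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=TRUE, col="skyblue3")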
Geometric distribution
In a geometric experiment, we ask how many Bernoulli trials must be run before we observe a “success” — for example, how many times must we flip a coin until we observe “heads”.
Again, $p$ is the probability of success, and we let $X$ count the number of failures before the first success.
The pmf of a geometric distribution is:
$$ P(X=x) = (1-p)^x p \text{ for } x=0,1,2,\dots $$
In a game of darts, suppose the probability of hitting the centre area is $p=0.05$. What is $P(X=8)$, the probability that you miss 8 times and then hit the centre area on your 9th throw?
We can use the `dgeom(x, prob)` function in R to calculate this probability, where `x` is the number of failures (missed throws) before the first success and `prob` is the probability of success (hitting the centre area).
dgeom(8, 0.05)
[1] 0.03317102
# or
(1-0.05)^8 * 0.05
[1] 0.03317102
- We can vary $p$ to obtain different geometric distributions.
plot(0:20, dgeom(0:20, prob=0.5), type = "o",
     col = "indianred1", xlab = "x (number of failures)",
     xlim=c(0, 20), ylim=c(0, 1), ylab = "Probability")
lines(0:20, dgeom(0:20, prob=0.2), type="o", col="skyblue3")
lines(0:20, dgeom(0:20, prob=0.8), type="o", col="darkseagreen3")
legend("topright", legend=c("p=0.5", "p=0.2", "p=0.8"),
       bty="n", lty=1, col=c("indianred1", "skyblue3", "darkseagreen3"))
- We can also ask for the probability of at most 8 misses before we hit the centre, that is, of hitting it within the first 9 throws of the dart. This is expressed as:
$$ P(X \leq x) = 1 - (1-p)^{x+1} $$
We can use the `pgeom` function to calculate this probability.
1 - (1-0.05)^(8+1)
[1] 0.3697506
# Or
pgeom(8, 0.05)
[1] 0.3697506
- We can also generate a realization of $X \sim \text{Geom}(p)$ using `rgeom(n, prob)`, where `n` is the number of observations to generate and `prob` is the probability of success. With $p=0.05$, one realization of the number of failures before the first success is:
rgeom(1, 0.05)
[1] 12
- Of course, since $X$ is a random variable with a geometric distribution, we will observe a different realization each time we repeat the experiment.
x <- rgeom(10, 0.05)
x
[1] 11 27 6 7 42 3 13 4 34 67
In this example, we perform $n=10$ different experiments; in the first experiment, 11 failures occur before we observe a success, in the second experiment, 27 failures, and so on.
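The mean of this geometric distribution (counting failures before the first success) is $(1-p)/p = 0.95/0.05 = 19$, so the average of many realizations should be close to 19 (the 10,000 replications are an arbitrary choice):

# Empirical mean of 10,000 geometric realizations; compare with (1-p)/p = 19
mean(rgeom(10000, 0.05))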
Exponential distribution
The geometric distribution models how many trials occur before something (i.e., a success) happens.
The exponential distribution is a continuous analogue of the geometric distribution. It models how long it takes before a new customer enters the shop, how long before the next earthquake strikes, or how long before the next piece of machinery breaks.
These questions involve an element of time, the time to wait until a given event occurs. (But not all exponential distributions involve a time component.) If the waiting time is unknown, then the random variable $X$ has an exponential distribution with a single parameter $\lambda$.
Since time is continuous, the exponential distribution has a probability density function (pdf):
$$ \text{Exp}(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} $$
The parameter $\lambda$ is a rate and represents the average number of events per interval. Let’s say that the average time for the next customer to arrive is $\mu=5$ minutes. Then the rate is one customer per 5 minutes, or $\lambda = 1/5 = 0.2$ customers per minute.
We can evaluate the density of the waiting time at $x=2$ minutes using the `dexp` function. (Because the distribution is continuous, this is a density, not a probability.)
(1/5) * exp(-(1/5 * 2))
[1] 0.134064
# Or
dexp(2, rate = 1/5)
[1] 0.134064
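A more natural question is the probability that a customer arrives within 2 minutes, $P(X \leq 2) = 1 - e^{-\lambda x} \approx 0.33$, which the pexp function computes:

# P(X <= 2): probability that the waiting time is at most 2 minutes
pexp(2, rate = 1/5)
# or directly from the cdf
1 - exp(-(1/5) * 2)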
- We can visualize the exponential distribution using different values for $\lambda$.
x <- seq(0, 5, by=0.05)
plot(x, dexp(x, rate = 1/5), type = "l",
     col = "indianred1", xlab = "x",
     xlim=c(0, 5), ylim=c(0, 2), ylab = "Probability density")
lines(x, dexp(x, rate=1/2), col="skyblue3")
lines(x, dexp(x, rate=2), col="darkseagreen3")
legend("topright",
       bty="n", lty=1, col=c("indianred1", "skyblue3", "darkseagreen3"),
       legend=c(expression(paste(lambda, " = 1/5")),
                expression(paste(lambda, " = 1/2")),
                expression(paste(lambda, " = 2"))))
Poisson distribution
A Poisson distribution is useful when we want to know the probability of observing $X$ successes (or events) in a given time period or interval. For example, we may define $X$ as the number of cars that pass in an hour or the number of new HIV infections that we observe in the calendar year.
The probability mass function of the Poisson distribution is given as:
$$ P(X = x|\lambda) = \frac{e^{-\lambda} \lambda^x}{x!} \text{ for } x = 0, 1, 2, \dots $$
$X$ is a random variable having the Poisson distribution, written as $X \sim \text{Poisson}(\lambda)$, where $\lambda$ represents the average number of events per interval. This is the same definition of $\lambda$ used in the exponential distribution.
Consider a store that receives, on average, 10 customers between 16:00 and 17:00. Let $X$ be the number of customers received in this interval; then $X$ has a Poisson distribution with $\lambda = 10$.
What is the probability of observing exactly 5 customers, $P(X=5 \mid \lambda=10)$? We can calculate this probability in R using the `dpois(x, lambda)` function.
dpois(5, lambda = 10)
[1] 0.03783327
# or
(exp(-10) * (10^5))/factorial(5)
[1] 0.03783327
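Questions about a range of counts use the cumulative distribution function instead; for example, the probability of receiving at most 5 customers in the interval, $P(X \leq 5)$, can be computed with ppois or by summing the pmf:

# P(X <= 5) for lambda = 10
ppois(5, lambda = 10)
# equivalently, sum the pmf over x = 0, ..., 5
sum(dpois(0:5, lambda = 10))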
- We can vary $x$ and $\lambda$ to generate different Poisson distributions.
x <- seq(0, 10)
plot(x, dpois(x, lambda=1.5), type = "o",
col = "indianred1", xlab = "x (number of events)",
xlim=c(0, 10), ylim=c(0, 0.7), ylab = "Probability")
lines(x, dpois(x, lambda=1.0), type="o", col="skyblue3")
lines(x, dpois(x, lambda=0.5), type="o", col="darkseagreen3")
legend("topright", bty="n", lty=1,
col=c("indianred1", "skyblue3", "darkseagreen3"),
legend=c(expression(paste(lambda, " = 1.5")),
expression(paste(lambda, " = 1.0")),
expression(paste(lambda, " = 0.5"))))
The Poisson model assumes that events are independent; that is, the occurrence of one event does not affect the probability that a second event will occur. It also assumes that the events are identically distributed, i.e., that the average rate of events is constant.
In addition, two events cannot occur at exactly the same instant. Instead, at each very small sub-interval, exactly one event either occurs or does not occur. In this way, the Poisson distribution is also the limit of a binomial distribution, for which the probability of success for each trial equals $\lambda$ divided by the number of trials, as the number of trials approaches infinity.$^{1,2}$
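We can check this limit numerically: with $n$ large and $p = \lambda/n$, the binomial pmf is close to the Poisson pmf (the choices $\lambda = 5$ and $n = 1000$ are arbitrary):

lambda <- 5; n <- 1000
# Binomial pmf with p = lambda/n versus the Poisson pmf
round(dbinom(0:10, size=n, prob=lambda/n), 4)
round(dpois(0:10, lambda), 4)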
The exponential distribution is the probability distribution of the time between events in a Poisson model. If arrivals (say, of customers) are Poisson with rate $\lambda$, then the time between arrivals, known as the inter-arrival time, follows an exponential distribution with $\mu = 1/\lambda$ as the average inter-arrival time. Conversely, if times between arrivals are exponentially distributed with average inter-arrival time $\mu = 1/\lambda$, then the arrivals follow a Poisson distribution with an average of $\lambda$ arrivals per unit time.
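A small simulation sketch of this connection (the rate $\lambda = 2$ and the 10,000 replications are arbitrary choices): generate exponential inter-arrival times, accumulate them into arrival times, and count how many arrivals land within one unit of time. The average count should be close to $\lambda$.

lambda <- 2
# Count arrivals in [0, 1] when inter-arrival times are Exponential(lambda)
arrivals <- replicate(10000, {
  gaps <- rexp(100, rate = lambda)  # more inter-arrival times than needed
  sum(cumsum(gaps) <= 1)            # arrivals within one unit of time
})
mean(arrivals)  # should be close to lambda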