Variable selection methods for linear regression in R
Introduction
In this post I cover variable selection methods in a linear regression framework. We often select variables for two reasons:
We wish to present the simplest model possible. By removing insignificant variables, the model may be easier to interpret.
We want to limit the number of variables when this number is close to, or larger than, the number of data points. If we do not limit the variables, the model may end up fitting random noise in the data rather than the real effects.
I will review four methods: stepwise selection, ridge regression, Lasso regression, and the elastic net.
Stepwise regression
Stepwise regression selects a subset of variables based on model fit and model complexity. A common criterion for model fit and complexity is the Akaike Information Criterion (AIC):
$$ \text{AIC} = n \log(\text{RSS}/n) + 2k, $$
where RSS is the residual sum of squares, $n$ is the number of data points, and $k$ is the number of estimated coefficients (including the intercept). Smaller AIC values are preferred, indicating a better trade-off between fit and complexity. With the AIC, we can use forward selection, backward elimination, or stepwise selection to select the subset of variables. I briefly describe each below, with a more thorough treatment given here.
Forward selection: Start with a model consisting of only the intercept and add the variable most highly correlated with the response, i.e. the one giving the smallest RSS. At each subsequent step, add the variable with the largest partial correlation with the response given the variables already in the model. Stop when adding another variable would increase the AIC.
Backward elimination: Start with all variables in the model. At each step, remove the variable whose deletion gives the best model, where best is defined by the smallest RSS. Continue until the next deletion would increase the AIC.
Stepwise selection: a combination of forward selection and backward elimination. We start with either all the variables or only the intercept; at each step a variable may be added or removed, whichever improves the AIC the most.
Stepwise selection is considered greedy: at each step it does the one thing that looks best at the time, without considering future options.
To demonstrate stepwise regression, I use US crime
data from 1960, which is available
here.
crime <- suppressMessages(
readr::read_delim("http://www.statsci.org/data/general/uscrime.txt", delim="\t"))
head(crime)
# A tibble: 6 x 16
M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob Time Crime
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 15.1 1 9.1 5.8 5.6 0.51 95 33 30.1 0.108 4.1 3940 26.1 0.0846 26.2 791
2 14.3 0 11.3 10.3 9.5 0.583 101. 13 10.2 0.096 3.6 5570 19.4 0.0296 25.3 1635
3 14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25 0.0834 24.3 578
4 13.6 0 12.1 14.9 14.1 0.577 99.4 157 8 0.102 3.9 6730 16.7 0.0158 29.9 1969
5 14.1 0 12.1 10.9 10.1 0.591 98.5 18 3 0.091 2 5780 17.4 0.0414 21.3 1234
6 12.1 0 11 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6 0.0342 21.0 682
I treat all variables as numeric and scale the data using R’s scale
function (with the
exception of So
[dichotomous variable] and Crime
[the response]).
crime[, c(1, 3:15)] <- lapply(crime[, c(1,3:15)], function(x) round(scale(x), 2))
head(crime)
# A tibble: 6 x 16
M[,1] So Ed[,1] Po1[,1] Po2[,1] LF[,1] M.F[,1] Pop[,1] NW[,1] U1[,1] U2[,1] Wealth[,1] Ineq[,1] Prob[,1] Time[,1] Crime
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.99 1 -1.31 -0.91 -0.87 -1.27 -1.12 -0.1 1.94 0.7 0.83 -1.36 1.68 1.65 -0.06 791
2 0.35 0 0.66 0.61 0.53 0.54 0.98 -0.62 0.01 0.03 0.24 0.33 0 -0.77 -0.18 1635
3 0.27 1 -1.49 -1.35 -1.3 -0.7 -0.48 -0.49 1.15 -0.08 -0.12 -2.15 1.4 1.6 -0.32 578
4 -0.2 0 1.37 2.15 2.17 0.39 0.37 3.16 -0.21 0.36 0.59 1.53 -0.68 -1.38 0.47 1969
5 0.19 0 1.37 0.81 0.74 0.74 0.07 -0.49 -0.69 -0.25 -1.66 0.55 -0.5 -0.25 -0.75 1234
6 -1.4 0 0.39 1.11 1.24 -0.35 -0.65 -0.31 -0.56 -0.64 -0.59 1.7 -1.7 -0.570 -0.79 682
Let's start with the intercept-only model and the full model with all variables. The respective AIC and $R^2$ values are shown below.
# Model with all variables
mod_all <- lm(Crime ~ ., data=crime)
AIC(mod_all)
[1] 650.1969
summary(mod_all)$r.squared
[1] 0.8023824
# Intercept only model
mod0 <- lm(Crime ~ 1, data=crime)
AIC(mod0)
[1] 696.4037
summary(mod0)$r.squared
[1] 0
I use the stepAIC function from the MASS library, which performs variable selection based on the AIC. Let's start with forward selection by setting the direction="forward" argument. I pass the intercept-only model (mod0) to the object argument and the full model (mod_all) to the scope argument.
library(MASS)
fmod <- stepAIC(object = mod0, direction = "forward",
scope = formula(mod_all), trace=FALSE)
coef(fmod)
(Intercept) Po1 Ineq Ed M Prob U2
905.06369 341.66723 269.46799 219.39474 132.39880 -86.33974 75.78490
AIC(fmod)
[1] 640.2218
summary(fmod)$r.squared
[1] 0.7655889
Using forward selection, 6 variables are selected with an $R^2=$ 0.77. Compared with the full model, the AIC is indeed lower, but the full model has a higher $R^2$. We can see the change in the AIC for each variable added:
fmod$anova
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
Crime ~ 1
Final Model:
Crime ~ Po1 + Ineq + Ed + M + Prob + U2
Step Df Deviance Resid. Df Resid. Dev AIC
1 46 6880928 561.0235
2 + Po1 1 3251554.7 45 3629373 532.9579
3 + Ineq 1 738010.4 44 2891363 524.2733
4 + Ed 1 586033.7 43 2305329 515.6276
5 + M 1 240654.7 42 2064674 512.4458
6 + Prob 1 257982.3 41 1806692 508.1724
7 + U2 1 193726.5 40 1612966 504.8416
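Note that the AIC values in this table are on a different scale from AIC(fmod) above: stepAIC computes them with extractAIC, which drops additive constants from the log-likelihood, so the values are smaller but rank models identically. As a quick sanity check (a sketch, assuming the formula given earlier):
# Reproduce the final AIC value in the table from AIC = n*log(RSS/n) + 2k,
# where k counts the estimated coefficients (including the intercept)
n <- nrow(crime)
rss <- sum(residuals(fmod)^2)
k <- length(coef(fmod))
n * log(rss / n) + 2 * k   # should match the last AIC value above (~504.8)
extractAIC(fmod)           # what stepAIC uses internally
AIC(fmod)                  # adds back the constant terms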
The output shows that the AIC gets smaller with the addition of each variable, until the procedure stops after adding U2 (Step 7). Let's try a backward elimination approach. This time I reverse the arguments to object and scope so that the model starts with all the variables and removes them one at a time, down to the intercept-only model if needed.
bmod <- stepAIC(mod_all, direction = "backward",
scope = formula(mod0), trace=FALSE)
coef(bmod)
(Intercept) M Ed Po1 M.F U1 U2 Ineq Prob
905.14017 117.41855 201.06383 305.09942 65.90717 -109.64063 158.30781 244.67357 -86.20085
AIC(bmod)
[1] 639.378
summary(bmod)$r.squared
[1] 0.7885439
Backward elimination selects 8 variables with an $R^2=$ 0.79. The change in AIC from removing each variable is shown below:
bmod$anova
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 +
U2 + Wealth + Ineq + Prob + Time
Final Model:
Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob
Step Df Deviance Resid. Df Resid. Dev AIC
1 31 1359793 514.8167
2 - So 1 1.300167 32 1359794 512.8167
3 - Time 1 9665.503811 33 1369459 511.1496
4 - LF 1 9782.997381 34 1379242 509.4842
5 - NW 1 11688.074179 35 1390930 507.8808
6 - Po2 1 16178.099053 36 1407108 506.4243
7 - Pop 1 21603.437077 37 1428712 505.1404
8 - Wealth 1 26302.520114 38 1455014 503.9978
Stepwise selection, which combines forward and backward moves, can be performed by setting the direction = "both" argument.
smod <- stepAIC(mod_all, direction = "both", trace=FALSE)
coef(smod)
(Intercept) M Ed Po1 M.F U1 U2 Ineq Prob
905.14017 117.41855 201.06383 305.09942 65.90717 -109.64063 158.30781 244.67357 -86.20085
AIC(smod)
[1] 639.378
summary(smod)$r.squared
[1] 0.7885439
The analysis thus far has used the complete dataset. To avoid fitting the model to noise in the training data, we can use a cross-validation approach. I do this using the caret library and its functions. Following the previous analysis, I select the backward elimination method and choose $k=10$ folds for the repeated k-fold cross-validation approach.
library(caret)
set.seed(100599)
set_train <- trainControl(method="repeatedcv", number=10, repeats=3)
cvmod <- train(Crime ~ ., data=crime, scope = formula(mod0),
method="lmStepAIC", direction="backward", trace=FALSE, trControl=set_train)
coef(cvmod$finalModel)
(Intercept) M Ed Po1 M.F U1 U2 Ineq Prob
905.14017 117.41855 201.06383 305.09942 65.90717 -109.64063 158.30781 244.67357 -86.20085
AIC(cvmod$finalModel)
[1] 639.378
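Since the model was tuned with repeated cross-validation, we can also inspect its estimated out-of-sample performance, which caret stores in the results element of the fitted train object (output omitted):
# Cross-validated RMSE, R-squared and MAE for the selected model,
# averaged over the 10-fold x 3-repeat resamples
cvmod$results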
The results show that the final model from the cross-validation approach includes the same variables as the first backward elimination model. Overall, stepwise regression is useful for preliminary analyses and for getting a better sense of which variables may or may not be important in your model. However, stepwise regression can settle on a set of variables that fits noise rather than real effects, so the selected model may not perform as well when applied to new data.
Ridge regression
A related shrinkage method is ridge regression. Strictly speaking, ridge regression does not select variables: it keeps all of them in the model but shrinks their coefficients toward zero. To understand this technique, let's begin with a review of least squares regression. For least squares, we estimate $\beta_0, \beta_1,\dots,\beta_p$ using values that minimize
$$ \text{RSS} = \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2. $$
In ridge regression, the coefficient estimates $\hat\beta^R$ are the values that minimize
$$ \text{RSS} + \lambda \sum_{j=1}^p \beta^2_j, $$
where $\lambda\geq 0$ is a tuning parameter. The second term, $\lambda \sum_{j=1}^p \beta^2_j$, is called a shrinkage penalty, and it is small when the estimates $\beta_j$ (for $j=1,\dots,p$) are small. We do not shrink the intercept ($\beta_0$) because it is an estimate of the mean response when the predictor values $x_{i1} = \dots = x_{ip} = 0$. When $\lambda=0$, the penalty term has no effect and we obtain the least squares estimates; as $\lambda$ gets bigger, the ridge estimates shrink toward zero. A separate set of estimates $\hat{\beta}^R_{\lambda}$ is produced for each value of $\lambda$, so the choice of $\lambda$ matters.
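Before turning to glmnet, here is a minimal sketch of the closed-form ridge solution for the penalised RSS above, with the predictors and response centred so that the intercept is left unpenalised. Note that glmnet scales its objective differently, so the $\lambda$ used here is not directly comparable to glmnet's $\lambda$.
# Closed-form ridge estimates: beta = (X'X + lambda*I)^(-1) X'y on centred data
Xc <- scale(as.matrix(dplyr::select(crime, -Crime)), center = TRUE, scale = FALSE)
yc <- crime$Crime - mean(crime$Crime)
ridge_beta <- function(lambda)
  drop(solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc)))
round(ridge_beta(0), 2)     # lambda = 0: the least squares slopes
round(ridge_beta(100), 2)   # larger lambda: coefficients shrink toward zero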
We can use the glmnet
library to run ridge regression by selecting $\alpha=0$ and
iterating through the $\lambda$ values.
library(glmnet)
y <- as.matrix(dplyr::select(crime, Crime))
x <- as.matrix(dplyr::select(crime, -Crime))
lgrid <- c(seq(0, 1, by=0.1), seq(2, 9, by=1), seq(10, 100, by=10))
rmod <- glmnet(x, y, alpha=0, lambda=lgrid, standardize=FALSE)
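Since ridge regression retains every variable, the more informative question is how far the coefficients have been shrunk at a given penalty. As a quick sketch (output omitted), the coefficients at, say, $\lambda = 50$ can be pulled straight from the fitted object:
# Ridge coefficients at lambda = 50: every variable is retained,
# but the estimates are shrunk relative to the least squares fit
coef(rmod)[, which(rmod$lambda == 50)]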
Printing a glmnet fit shows the percentage of deviance explained (%Dev) at each value of $\lambda$, but this in-sample measure always favours the smallest $\lambda$, so it is not a good way to choose the penalty. Instead, we can run the ridge model using a cross-validated approach, which I show below:
rmod <- train(Crime~., data = crime, method = 'glmnet',
tuneGrid = expand.grid(alpha=0, lambda=lgrid), trControl = set_train)
rmod
glmnet
47 samples
15 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 42, 44, 42, 42, 42, 41, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.0 248.9878 0.5699106 206.6500
0.1 248.9878 0.5699106 206.6500
0.2 248.9878 0.5699106 206.6500
0.3 248.9878 0.5699106 206.6500
0.4 248.9878 0.5699106 206.6500
0.5 248.9878 0.5699106 206.6500
0.6 248.9878 0.5699106 206.6500
0.7 248.9878 0.5699106 206.6500
0.8 248.9878 0.5699106 206.6500
0.9 248.9878 0.5699106 206.6500
1.0 248.9878 0.5699106 206.6500
2.0 248.9878 0.5699106 206.6500
3.0 248.9878 0.5699106 206.6500
4.0 248.9878 0.5699106 206.6500
5.0 248.9878 0.5699106 206.6500
6.0 248.9878 0.5699106 206.6500
7.0 248.9878 0.5699106 206.6500
8.0 248.9878 0.5699106 206.6500
9.0 248.9878 0.5699106 206.6500
10.0 248.9878 0.5699106 206.6500
20.0 248.9878 0.5699106 206.6500
30.0 248.1680 0.5700836 205.9538
40.0 247.0281 0.5654477 205.1688
50.0 246.7117 0.5616184 204.9661
60.0 246.8697 0.5586736 205.0718
70.0 247.3154 0.5563842 205.2987
80.0 247.9284 0.5546016 205.5623
90.0 248.6588 0.5531579 205.9801
100.0 249.4650 0.5519836 206.3832
Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 50.
Using this approach, $\lambda = 50$ is selected. To see how the $\beta$ coefficients shrink toward zero with increasing values of $\lambda$, I plot the coefficient paths of the final model below. When $\lambda$ is very large the coefficients are shrunk close to zero, although (unlike the Lasso) they never become exactly zero.
plot(rmod$finalModel, xvar="lambda", label=TRUE)
Ridge regression has an advantage over least squares regression in that variance is reduced as the coefficients shrink toward zero, as demonstrated in the above plot. However, this benefit comes at the cost of increased bias. In short, ridge regression tends to underestimate the coefficients of very predictive variables. Ridge regression is recommended in situations where the least squares estimates have high variance.
Lasso regression
Lasso regression is another variable selection approach that minimizes the quantity
$$ \text{RSS} + \lambda \sum_{j=1}^p |\beta_j|. $$
By using a different penalty term (an $\ell_1$ norm rather than an $\ell_2$ norm), some coefficients can be forced to exactly zero when $\lambda$ is sufficiently large. This differs from ridge regression, where all $p$ predictors remain in the model (unless $\lambda=\infty$). The glmnet function that we used for ridge regression can also be used for Lasso regression by specifying $\alpha = 1$.
lmod <- glmnet(x, y, alpha=1, lambda=lgrid, standardize=FALSE)
lmod
Call: glmnet(x = x, y = y, alpha = 1, lambda = lgrid, standardize = FALSE)
Df %Dev Lambda
1 1 40.28 100.0
2 1 41.61 90.0
3 1 42.79 80.0
4 3 44.40 70.0
5 4 48.12 60.0
6 5 54.85 50.0
7 5 60.68 40.0
8 6 65.37 30.0
9 8 71.60 20.0
10 10 76.76 10.0
11 10 77.25 9.0
12 11 77.69 8.0
13 11 78.15 7.0
14 11 78.55 6.0
15 11 78.89 5.0
16 11 79.16 4.0
17 12 79.39 3.0
18 13 79.55 2.0
19 15 79.80 1.0
20 15 79.88 0.9
21 15 79.95 0.8
22 15 80.01 0.7
23 15 80.07 0.6
24 15 80.12 0.5
25 15 80.16 0.4
26 15 80.19 0.3
27 15 80.21 0.2
28 15 80.23 0.1
29 15 80.24 0.0
We can now examine the coefficients for each $\lambda$. For example, with $\lambda=10$, we can already see some coefficients are exactly zero.
g <- which(lmod$lambda == 10)   # glmnet stores the lambda values in decreasing order
lmod$lambda[g]
[1] 10
coef(lmod)[, g]
(Intercept) M So Ed Po1 Po2 LF M.F Pop NW U1
905.070218 90.059221 0.000000 129.547911 306.323406 0.000000 0.000000 53.324806 0.000000 12.220954 -35.163512
U2 Wealth Ineq Prob Time
69.807741 4.605232 193.579685 -77.984628 0.000000
For $\lambda=100$, we have only one non-zero coefficient (Po1
).
g <- which(lmod$lambda == 100)
lmod$lambda[g]
[1] 100
coef(lmod)[, g]
(Intercept) M So Ed Po1 Po2 LF M.F Pop NW U1
905.0851 0.0000 0.0000 0.0000 163.6732 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
U2 Wealth Ineq Prob Time
0.0000 0.0000 0.0000 0.0000 0.0000
One disadvantage of Lasso regression is that it can arbitrarily eliminate variables that are strongly correlated with each other: if a dataset contains two strongly correlated variables, the Lasso will typically give one of them a non-zero coefficient and drop the other from the model, as the small simulation below illustrates.
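Here is a small illustration on simulated data (a hypothetical example, not the crime dataset):
# Two almost identical predictors: the Lasso typically gives one of them a
# non-zero coefficient and sets the other exactly to zero (which one it keeps
# is essentially arbitrary)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)   # near-duplicate of x1
x3 <- rnorm(100)
yy <- 2 * x1 + 2 * x2 + x3 + rnorm(100)
coef(glmnet(cbind(x1, x2, x3), yy, alpha = 1, lambda = 0.5))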
Elastic net
Elastic net regression combines the Lasso and ridge penalties. We estimate $\beta_0, \beta_1,\dots,\beta_p$ using the values that minimize
$$ \text{RSS} + \lambda \left( \alpha \sum_{j=1}^p |\beta_j| + (1 - \alpha) \sum_{j=1}^p \beta_j^2 \right), $$
where the mixing parameter $0 \leq \alpha \leq 1$ controls the balance between the $\ell_1$ (Lasso) and $\ell_2$ (ridge) penalties.
The elastic net combines the variable selection behaviour of Lasso regression with the prediction benefits of ridge regression. Let's implement this method with glmnet through caret, this time allowing the alpha parameter to range between 0 and 1.
emod <- train(Crime~., data = crime, method = 'glmnet',
tuneGrid = expand.grid(alpha=seq(0, 1, by=0.1), lambda=lgrid), trControl = set_train)
emod
glmnet
47 samples
15 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 42, 42, 42, 42, 43, 42, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.0 0.0 252.0577 0.6338331 209.5087
0.0 0.1 252.0577 0.6338331 209.5087
0.0 0.2 252.0577 0.6338331 209.5087
0.0 0.3 252.0577 0.6338331 209.5087
0.0 0.4 252.0577 0.6338331 209.5087
0.0 0.5 252.0577 0.6338331 209.5087
0.0 0.6 252.0577 0.6338331 209.5087
0.0 0.7 252.0577 0.6338331 209.5087
0.0 0.8 252.0577 0.6338331 209.5087
0.0 0.9 252.0577 0.6338331 209.5087
0.0 1.0 252.0577 0.6338331 209.5087
0.0 2.0 252.0577 0.6338331 209.5087
0.0 3.0 252.0577 0.6338331 209.5087
0.0 4.0 252.0577 0.6338331 209.5087
0.0 5.0 252.0577 0.6338331 209.5087
0.0 6.0 252.0577 0.6338331 209.5087
0.0 7.0 252.0577 0.6338331 209.5087
0.0 8.0 252.0577 0.6338331 209.5087
0.0 9.0 252.0577 0.6338331 209.5087
0.0 10.0 252.0577 0.6338331 209.5087
0.0 20.0 252.0577 0.6338331 209.5087
0.0 30.0 251.8833 0.6359965 209.0304
0.0 40.0 251.5029 0.6351448 208.5232
0.0 50.0 251.5239 0.6348445 208.7456
0.0 60.0 251.7620 0.6347860 208.9675
0.0 70.0 252.1285 0.6347130 209.2971
0.0 80.0 252.5891 0.6346010 209.6699
0.0 90.0 253.1020 0.6344293 210.1630
0.0 100.0 253.6557 0.6342201 210.6293
0.1 0.0 264.4560 0.6484636 216.1292
0.1 0.1 264.4560 0.6484636 216.1292
0.1 0.2 264.4560 0.6484636 216.1292
0.1 0.3 264.3354 0.6486511 216.1255
0.1 0.4 264.0564 0.6487558 216.0538
0.1 0.5 263.8210 0.6487729 216.0407
0.1 0.6 263.5958 0.6487715 216.0142
0.1 0.7 263.3870 0.6487389 216.0163
0.1 0.8 263.1792 0.6486855 215.9956
0.1 0.9 262.9730 0.6486171 215.9741
0.1 1.0 262.7747 0.6485257 215.9520
[ reached getOption("max.print") -- omitted 279 rows ]
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 7.
The values $\alpha =$ 1 and $\lambda =$ 7 are selected for the final model; since $\alpha = 1$ corresponds to the Lasso penalty, the selected model is effectively a Lasso fit. To illustrate an elastic net fit with an intermediate mixing parameter, I refit the model with glmnet at $\alpha = 0.3$ and show the coefficients at $\lambda = 7$ below.
enmod <- glmnet(x, y, alpha=0.3, lambda=seq(1, 10, by=1), standardize=FALSE)
coef(enmod)[, which(enmod$lambda==7)]
(Intercept) M So Ed Po1 Po2 LF M.F Pop NW U1
898.70955 102.19631 18.77561 165.81002 249.74394 34.17948 0.00000 59.31391 -15.96940 24.34366 -76.71121
U2 Wealth Ineq Prob Time
119.28741 58.38457 238.73346 -89.21639 0.00000
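As a final check, the coefficients of the model actually selected by caret ($\alpha = 1$, $\lambda = 7$) can be extracted directly from the glmnet object stored in the caret fit, rather than refitting by hand (a quick sketch, output omitted):
# Coefficients at the cross-validation-selected penalty, taken from the
# final glmnet model stored by caret
coef(emod$finalModel, s = emod$bestTune$lambda)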