Variable selection methods for linear regression in R

Introduction

In this post I cover variable selection methods in a linear regression framework. We often select variables for two reasons:

  • We wish to present the simplest model possible. Removing insignificant variables can make the model easier to interpret.

  • We want to limit the number of variables when this number is close to or larger than the number of data points. If we do not limit the variables, then the model may be fitted to random effects in the data, rather than the real effects.

I will review three variable selection methods: stepwise selection, Lasso regression, and Elastic net.

Stepwise regression

Stepwise regression selects a subset of variables based on model fit and model complexity. A common criterion for model fit and complexity is the Akaike Information Criterion (AIC):

$$ \text{AIC} = n \log(\text{RSS}/n) + 2k, $$

where RSS is the residual sum of squares, $n$ is the number of data points, and $k$ is the number of estimated parameters (the regression coefficients, including the intercept). Smaller AIC values indicate a better trade-off between model fit and complexity. With the AIC as the criterion, we can use forward selection, backward elimination, or stepwise selection to choose the subset of variables. I briefly describe each below, with a more thorough treatment given here.

  • Forward selection: Start with a model containing only the intercept and add the variable most highly correlated with the response, i.e. the one that gives the smallest RSS. At each subsequent step, add the variable with the largest partial correlation with the response, given the variables already in the model. Stop when adding another variable would increase the AIC.

  • Backward elimination: Start with all variables in the model. At each step, remove the variable whose removal gives the best model, where best is defined by the smallest RSS. Continue until the next deletion would increase the AIC.

  • Stepwise selection: This combines forward selection and backward elimination. We start either with all the variables or with only the intercept. As variables are added, variables that no longer reduce the AIC are eliminated again.

Stepwise selection is greedy: at each step it does the one thing that looks best at the time, without considering future options.
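
As a quick check of the AIC formula above, we can compute $n \log(\text{RSS}/n) + 2k$ by hand for a fitted linear model and compare it with extractAIC, which stepAIC uses internally. (A minimal sketch using the built-in mtcars data purely for illustration; note that stats::AIC includes the constant terms of the Gaussian log-likelihood, so for a fixed dataset its values differ from extractAIC by a constant.)

# Hand computation of AIC = n*log(RSS/n) + 2k for a small lm fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
k   <- length(coef(fit))     # number of estimated coefficients, including the intercept
n * log(rss / n) + 2 * k     # should match the second element of extractAIC(fit)
extractAIC(fit)              # returns c(edf, AIC) on the same scale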

To demonstrate stepwise regression, I use US crime data from 1960, which is available here.

crime <- suppressMessages(
  readr::read_delim("http://www.statsci.org/data/general/uscrime.txt", delim="\t"))
head(crime)
# A tibble: 6 x 16
      M    So    Ed   Po1   Po2    LF   M.F   Pop    NW    U1    U2 Wealth  Ineq   Prob  Time Crime
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
1  15.1     1   9.1   5.8   5.6 0.51   95      33  30.1 0.108   4.1   3940  26.1 0.0846  26.2   791
2  14.3     0  11.3  10.3   9.5 0.583 101.     13  10.2 0.096   3.6   5570  19.4 0.0296  25.3  1635
3  14.2     1   8.9   4.5   4.4 0.533  96.9    18  21.9 0.094   3.3   3180  25   0.0834  24.3   578
4  13.6     0  12.1  14.9  14.1 0.577  99.4   157   8   0.102   3.9   6730  16.7 0.0158  29.9  1969
5  14.1     0  12.1  10.9  10.1 0.591  98.5    18   3   0.091   2     5780  17.4 0.0414  21.3  1234
6  12.1     0  11    11.8  11.5 0.547  96.4    25   4.4 0.084   2.9   6890  12.6 0.0342  21.0   682

I treat all variables as numeric and standardize them using R’s scale function, with the exception of So (a dichotomous variable) and Crime (the response). Note that scale returns a one-column matrix, which is why the scaled columns print as M[,1], Ed[,1], and so on below.

crime[, c(1, 3:15)] <- lapply(crime[, c(1,3:15)], function(x) round(scale(x), 2))
head(crime)
# A tibble: 6 x 16
  M[,1]    So Ed[,1] Po1[,1] Po2[,1] LF[,1] M.F[,1] Pop[,1] NW[,1] U1[,1] U2[,1] Wealth[,1] Ineq[,1] Prob[,1] Time[,1] Crime
  <dbl> <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>      <dbl>    <dbl>    <dbl>    <dbl> <dbl>
1  0.99     1  -1.31   -0.91   -0.87  -1.27   -1.12   -0.1    1.94   0.7    0.83      -1.36     1.68    1.65     -0.06   791
2  0.35     0   0.66    0.61    0.53   0.54    0.98   -0.62   0.01   0.03   0.24       0.33     0      -0.77     -0.18  1635
3  0.27     1  -1.49   -1.35   -1.3   -0.7    -0.48   -0.49   1.15  -0.08  -0.12      -2.15     1.4     1.6      -0.32   578
4 -0.2      0   1.37    2.15    2.17   0.39    0.37    3.16  -0.21   0.36   0.59       1.53    -0.68   -1.38      0.47  1969
5  0.19     0   1.37    0.81    0.74   0.74    0.07   -0.49  -0.69  -0.25  -1.66       0.55    -0.5    -0.25     -0.75  1234
6 -1.4      0   0.39    1.11    1.24  -0.35   -0.65   -0.31  -0.56  -0.64  -0.59       1.7     -1.7    -0.570    -0.79   682

Let’s start with the intercept-only model and the full model with all variables. Their respective AIC and $R^2$ values are shown below.

# Model with all variables
mod_all <- lm(Crime ~ ., data=crime)
AIC(mod_all)
[1] 650.1969
summary(mod_all)$r.squared
[1] 0.8023824
# Intercept only model
mod0 <- lm(Crime ~ 1, data=crime)
AIC(mod0)
[1] 696.4037
summary(mod0)$r.squared
[1] 0

I use the stepAIC function from the MASS library, which performs variable selection based on the AIC. Let’s start with forward selection by setting the direction="forward" argument. I pass the intercept-only model (mod0) to the object argument and the formula of the full model (mod_all) to the scope argument.

library(MASS)
fmod <- stepAIC(object = mod0, direction = "forward", 
  scope = formula(mod_all), trace=FALSE)
coef(fmod)
(Intercept)         Po1        Ineq          Ed           M        Prob          U2 
  905.06369   341.66723   269.46799   219.39474   132.39880   -86.33974    75.78490 
AIC(fmod)
[1] 640.2218
summary(fmod)$r.squared
[1] 0.7655889

Using forward selection, 6 variables are selected with an $R^2=$ 0.77. Compared with the full model, the AIC is indeed lower, but the full model has a higher $R^2$. We can see the change in the AIC as each variable is added.

fmod$anova
Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
Crime ~ 1

Final Model:
Crime ~ Po1 + Ineq + Ed + M + Prob + U2

    Step Df  Deviance Resid. Df Resid. Dev      AIC
1                            46    6880928 561.0235
2  + Po1  1 3251554.7        45    3629373 532.9579
3 + Ineq  1  738010.4        44    2891363 524.2733
4   + Ed  1  586033.7        43    2305329 515.6276
5    + M  1  240654.7        42    2064674 512.4458
6 + Prob  1  257982.3        41    1806692 508.1724
7   + U2  1  193726.5        40    1612966 504.8416

The output shows that the AIC gets smaller with the addition of each variable, until the procedure stops after adding U2 (Step 7). Note that the AIC values in this table are computed with extractAIC, which drops constant terms, so they differ from the AIC() values reported above by a constant. Let’s now try the backward elimination approach. This time I reverse the arguments to object and scope, so that the model starts with all the variables and removes them one at a time (down to the intercept-only model, if needed).

bmod <- stepAIC(mod_all, direction = "backward", 
  scope = formula(mod0), trace=FALSE)
coef(bmod)
(Intercept)           M          Ed         Po1         M.F          U1          U2        Ineq        Prob 
  905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063   158.30781   244.67357   -86.20085 
AIC(bmod)
[1] 639.378
summary(bmod)$r.squared
[1] 0.7885439

Backward elimination selects 8 variables with an $R^2=$ 0.79. The change in AIC from removing each variable is shown below:

bmod$anova
Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + 
    U2 + Wealth + Ineq + Prob + Time

Final Model:
Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob

      Step Df     Deviance Resid. Df Resid. Dev      AIC
1                                 31    1359793 514.8167
2     - So  1     1.300167        32    1359794 512.8167
3   - Time  1  9665.503811        33    1369459 511.1496
4     - LF  1  9782.997381        34    1379242 509.4842
5     - NW  1 11688.074179        35    1390930 507.8808
6    - Po2  1 16178.099053        36    1407108 506.4243
7    - Pop  1 21603.437077        37    1428712 505.1404
8 - Wealth  1 26302.520114        38    1455014 503.9978

Stepwise selection, which combines the forward and backward steps, can be performed by setting the direction = "both" argument.

smod <- stepAIC(mod_all, direction = "both", trace=FALSE)
coef(smod)
(Intercept)           M          Ed         Po1         M.F          U1          U2        Ineq        Prob 
  905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063   158.30781   244.67357   -86.20085 
AIC(smod)
[1] 639.378
summary(smod)$r.squared
[1] 0.7885439

The analysis thus far has used the complete dataset. To avoid fitting the model to random effects in the training data, we can use a cross-validation approach. I do this with the caret library and its functions. Following the previous analysis, I select the backward elimination method and use 10-fold cross-validation repeated 3 times.

library(caret)
set.seed(100599)
set_train <- trainControl(method="repeatedcv", number=10, repeats=3)
cvmod <- train(Crime ~ ., data=crime,  scope = formula(mod0),
  method="lmStepAIC", direction="backward", trace=FALSE, trControl=set_train)
coef(cvmod$finalModel)
(Intercept)           M          Ed         Po1         M.F          U1          U2        Ineq        Prob 
  905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063   158.30781   244.67357   -86.20085 
AIC(cvmod$finalModel)
[1] 639.378

Results show the final model for the cross-validation approach includes the same variables as the first backward elimination model. Overall, stepwise regression is good for performing preliminary analyses and getting a better understanding of which variables may or may not be significant in your model. However, stepwise regression can select a final set of variables that fit more to random effects than real effects. This method may not perform as well when testing the final set of variables on other data.
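
One way to gauge how well the selected variables generalize is to compare the cross-validated error of the stepwise model with that of the full least squares model. A minimal sketch, reusing the set_train control object defined above (cv_full is a new object introduced here, and the exact numbers will depend on the random fold assignment):

# Cross-validated performance of the full model, for comparison with cvmod
cv_full <- train(Crime ~ ., data = crime, method = "lm", trControl = set_train)
cvmod$results[, c("RMSE", "Rsquared", "MAE")]    # stepwise-selected model
cv_full$results[, c("RMSE", "Rsquared", "MAE")]  # full least squares model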

Ridge regression

Before turning to the Lasso, it is useful to review ridge regression. Ridge regression is not strictly a variable selection method: it keeps all variables in the model and shrinks their coefficients toward zero. To understand this technique, let’s begin with a review of least squares regression. For least squares, we estimate $\beta_0, \beta_1,\dots,\beta_p$ using values that minimize

$$ \text{RSS} = \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2. $$

In ridge regression, the coefficient estimates $\hat\beta^R$ are the values that minimize

$$ \text{RSS} + \lambda \sum_{j=1}^p \beta^2_j, $$

where $\lambda\geq 0$ is a tuning parameter. The second term, $\lambda \sum_{j=1}^p \beta^2_j$, is called a shrinkage penalty, and it is small when the estimates $\beta_j$ (for $j=1,\dots,p$) are close to zero. We do not shrink the intercept ($\beta_0$) because it is an estimate of the mean response when $x_{i1} = \dots = x_{ip} = 0$. When $\lambda=0$, the penalty term has no effect and we recover the least squares estimates. As $\lambda$ gets bigger, the ridge estimates shrink toward zero. Because the solution depends on $\lambda$, a different set of estimates $\hat{\beta}^R_{\lambda}$ is generated for each value of $\lambda$.
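
For a fixed $\lambda$, the ridge estimates (excluding the intercept, and assuming centered predictors) have a closed-form solution:

$$ \hat{\beta}^R_{\lambda} = (X^\top X + \lambda I)^{-1} X^\top y, $$

which makes explicit how the penalty adds $\lambda$ to the diagonal of $X^\top X$ before inversion; this also stabilizes the inverse when predictors are highly correlated.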

We can use the glmnet library to run ridge regression by setting $\alpha=0$ and supplying a grid of $\lambda$ values.

library(glmnet)
y <- as.matrix(dplyr::select(crime, Crime))
x <- as.matrix(dplyr::select(crime, -Crime))
lgrid <- c(seq(0, 1, by=0.1), seq(2, 9, by=1), seq(10, 100, by=10))
rmod <- glmnet(x, y, alpha=0, lambda=lgrid, standardize=FALSE)

For each value of $\lambda$, glmnet reports the percentage of deviance explained (the dev.ratio component). On the training data this is always largest at $\lambda=0$, which reduces to the least squares fit, so the training fit alone cannot tell us which $\lambda$ to use. Instead, I choose $\lambda$ with a cross-validated approach, which I show below:

rmod <- train(Crime~., data = crime, method = 'glmnet',
   tuneGrid = expand.grid(alpha=0, lambda=lgrid), trControl = set_train)
rmod
glmnet 

47 samples
15 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 42, 44, 42, 42, 42, 41, ... 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE     
    0.0   248.9878  0.5699106  206.6500
    0.1   248.9878  0.5699106  206.6500
    0.2   248.9878  0.5699106  206.6500
    0.3   248.9878  0.5699106  206.6500
    0.4   248.9878  0.5699106  206.6500
    0.5   248.9878  0.5699106  206.6500
    0.6   248.9878  0.5699106  206.6500
    0.7   248.9878  0.5699106  206.6500
    0.8   248.9878  0.5699106  206.6500
    0.9   248.9878  0.5699106  206.6500
    1.0   248.9878  0.5699106  206.6500
    2.0   248.9878  0.5699106  206.6500
    3.0   248.9878  0.5699106  206.6500
    4.0   248.9878  0.5699106  206.6500
    5.0   248.9878  0.5699106  206.6500
    6.0   248.9878  0.5699106  206.6500
    7.0   248.9878  0.5699106  206.6500
    8.0   248.9878  0.5699106  206.6500
    9.0   248.9878  0.5699106  206.6500
   10.0   248.9878  0.5699106  206.6500
   20.0   248.9878  0.5699106  206.6500
   30.0   248.1680  0.5700836  205.9538
   40.0   247.0281  0.5654477  205.1688
   50.0   246.7117  0.5616184  204.9661
   60.0   246.8697  0.5586736  205.0718
   70.0   247.3154  0.5563842  205.2987
   80.0   247.9284  0.5546016  205.5623
   90.0   248.6588  0.5531579  205.9801
  100.0   249.4650  0.5519836  206.3832

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 50.

Using this approach, a value of $\lambda=50$ is selected. To see how the $\beta$ coefficients shrink toward zero with increasing values of $\lambda$, I plot the coefficient paths of the final model below. When $\lambda$ is large, the $\beta$ coefficients are close to zero, although with ridge regression they never reach exactly zero.

plot(rmod$finalModel, xvar="lambda", label=TRUE)

Ridge regression has an advantage over least squares regression in that variance is reduced as the coefficients shrink toward zero, as demonstrated in the above plot. However, this benefit comes at the cost of increased bias. In short, ridge regression tends to underestimate the coefficients of very predictive variables. Ridge regression is recommended in situations where the least squares estimates have high variance.
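
We can also inspect the coefficient values at the cross-validated choice of $\lambda$ by querying the glmnet fit stored by caret (a short sketch; bestTune holds the selected $\alpha$ and $\lambda$):

# Ridge coefficients at the lambda selected by cross-validation (lambda = 50 above)
coef(rmod$finalModel, s = rmod$bestTune$lambda)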

Lasso regression

Lasso regression is another variable selection approach that minimizes the quantity

$$ \text{RSS} + \lambda \sum_{j=1}^p |\beta_j|. $$

By using a different penalty term (an $\ell_1$ norm rather than an $\ell_2$ norm), some coefficients can be forced to exactly zero when $\lambda$ is sufficiently large. This differs from ridge regression, where all $p$ variables remain in the model (unless $\lambda=\infty$). The glmnet function that we used for ridge regression can also be used for Lasso regression by specifying $\alpha = 1$.

lmod <- glmnet(x, y, alpha=1, lambda=lgrid, standardize=FALSE)
lmod

Call:  glmnet(x = x, y = y, alpha = 1, lambda = lgrid, standardize = FALSE) 

   Df  %Dev Lambda
1   1 40.28  100.0
2   1 41.61   90.0
3   1 42.79   80.0
4   3 44.40   70.0
5   4 48.12   60.0
6   5 54.85   50.0
7   5 60.68   40.0
8   6 65.37   30.0
9   8 71.60   20.0
10 10 76.76   10.0
11 10 77.25    9.0
12 11 77.69    8.0
13 11 78.15    7.0
14 11 78.55    6.0
15 11 78.89    5.0
16 11 79.16    4.0
17 12 79.39    3.0
18 13 79.55    2.0
19 15 79.80    1.0
20 15 79.88    0.9
21 15 79.95    0.8
22 15 80.01    0.7
23 15 80.07    0.6
24 15 80.12    0.5
25 15 80.16    0.4
26 15 80.19    0.3
27 15 80.21    0.2
28 15 80.23    0.1
29 15 80.24    0.0

We can now examine the coefficients for each $\lambda$. For example, with $\lambda=10$, we can already see some coefficients are exactly zero.

g <- which(lmod$lambda == 10)
lmod$lambda[g]
[1] 10
coef(lmod)[, g]
(Intercept)           M          So          Ed         Po1         Po2          LF         M.F         Pop          NW          U1 
 905.070218   90.059221    0.000000  129.547911  306.323406    0.000000    0.000000   53.324806    0.000000   12.220954  -35.163512 
         U2      Wealth        Ineq        Prob        Time 
  69.807741    4.605232  193.579685  -77.984628    0.000000 

For $\lambda=100$, we have only one non-zero coefficient (Po1).

g <- which(lmod$lambda == 100)
lmod$lambda[g]
[1] 100
coef(lmod)[, g]
(Intercept)           M          So          Ed         Po1         Po2          LF         M.F         Pop          NW          U1 
   905.0851      0.0000      0.0000      0.0000    163.6732      0.0000      0.0000      0.0000      0.0000      0.0000      0.0000 
         U2      Wealth        Ineq        Prob        Time 
     0.0000      0.0000      0.0000      0.0000      0.0000 
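
In practice, $\lambda$ would be chosen by cross-validation rather than by inspecting individual values. A minimal sketch using glmnet’s built-in cv.glmnet function (10-fold cross-validation by default; the selected $\lambda$ will vary with the fold assignment):

# Cross-validate the Lasso path to choose lambda
set.seed(100599)
cv_lasso <- cv.glmnet(x, y, alpha=1, standardize=FALSE)
cv_lasso$lambda.min               # lambda with the smallest cross-validated error
coef(cv_lasso, s="lambda.min")    # coefficients at that lambda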

One disadvantage of Lasso regression is that it handles strongly correlated variables somewhat arbitrarily. For example, if a dataset has two variables that are strongly correlated, the Lasso will typically give one of them a non-zero coefficient and eliminate the other from the model, and which one survives can be essentially arbitrary.
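
A small simulated example illustrates this behaviour (a toy sketch with generated data; none of these objects are part of the crime analysis):

# Two nearly identical predictors: the Lasso tends to keep one and drop the other
set.seed(1)
n_sim <- 100
x1 <- rnorm(n_sim)
x2 <- x1 + rnorm(n_sim, sd=0.01)      # x2 is almost a copy of x1
y_sim <- 2*x1 + rnorm(n_sim)
sim_mod <- glmnet(cbind(x1, x2), y_sim, alpha=1)
coef(sim_mod, s=0.5)                  # typically only one of x1, x2 is non-zero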

Elastic net

Elastic net regression combines the Lasso and ridge penalties: we estimate $\beta_0, \beta_1,\dots,\beta_p$ using values that minimize

$$ \text{RSS} + \lambda \left( \alpha \sum_{j=1}^p |\beta_j| + (1-\alpha) \sum_{j=1}^p \beta_j^2 \right), $$

where the mixing parameter $0\leq\alpha\leq 1$ controls the balance between the two penalties: $\alpha=1$ gives the Lasso and $\alpha=0$ gives ridge regression. Elastic net combines the variable selection benefits of Lasso regression with the prediction benefits of ridge regression. Let’s implement this method with caret’s train function, which calls glmnet and lets the alpha parameter range between 0 and 1.

emod <- train(Crime~., data = crime, method = 'glmnet',
  tuneGrid = expand.grid(alpha=seq(0, 1, by=0.1), lambda=lgrid), trControl = set_train)
emod
glmnet 

47 samples
15 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 42, 42, 42, 42, 43, 42, ... 
Resampling results across tuning parameters:

  alpha  lambda  RMSE      Rsquared   MAE     
  0.0      0.0   252.0577  0.6338331  209.5087
  0.0      0.1   252.0577  0.6338331  209.5087
  0.0      0.2   252.0577  0.6338331  209.5087
  0.0      0.3   252.0577  0.6338331  209.5087
  0.0      0.4   252.0577  0.6338331  209.5087
  0.0      0.5   252.0577  0.6338331  209.5087
  0.0      0.6   252.0577  0.6338331  209.5087
  0.0      0.7   252.0577  0.6338331  209.5087
  0.0      0.8   252.0577  0.6338331  209.5087
  0.0      0.9   252.0577  0.6338331  209.5087
  0.0      1.0   252.0577  0.6338331  209.5087
  0.0      2.0   252.0577  0.6338331  209.5087
  0.0      3.0   252.0577  0.6338331  209.5087
  0.0      4.0   252.0577  0.6338331  209.5087
  0.0      5.0   252.0577  0.6338331  209.5087
  0.0      6.0   252.0577  0.6338331  209.5087
  0.0      7.0   252.0577  0.6338331  209.5087
  0.0      8.0   252.0577  0.6338331  209.5087
  0.0      9.0   252.0577  0.6338331  209.5087
  0.0     10.0   252.0577  0.6338331  209.5087
  0.0     20.0   252.0577  0.6338331  209.5087
  0.0     30.0   251.8833  0.6359965  209.0304
  0.0     40.0   251.5029  0.6351448  208.5232
  0.0     50.0   251.5239  0.6348445  208.7456
  0.0     60.0   251.7620  0.6347860  208.9675
  0.0     70.0   252.1285  0.6347130  209.2971
  0.0     80.0   252.5891  0.6346010  209.6699
  0.0     90.0   253.1020  0.6344293  210.1630
  0.0    100.0   253.6557  0.6342201  210.6293
  0.1      0.0   264.4560  0.6484636  216.1292
  0.1      0.1   264.4560  0.6484636  216.1292
  0.1      0.2   264.4560  0.6484636  216.1292
  0.1      0.3   264.3354  0.6486511  216.1255
  0.1      0.4   264.0564  0.6487558  216.0538
  0.1      0.5   263.8210  0.6487729  216.0407
  0.1      0.6   263.5958  0.6487715  216.0142
  0.1      0.7   263.3870  0.6487389  216.0163
  0.1      0.8   263.1792  0.6486855  215.9956
  0.1      0.9   262.9730  0.6486171  215.9741
  0.1      1.0   262.7747  0.6485257  215.9520
 [ reached getOption("max.print") -- omitted 279 rows ]

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 7.

The values $\alpha=$ 1 and $\lambda=$ 7 are selected for the final model. To see how the coefficients behave at a particular combination of $\alpha$ and $\lambda$, we can also refit glmnet directly; for example, with $\alpha=0.3$ and $\lambda=7$:

enmod <- glmnet(x, y, alpha=0.3, lambda=seq(1, 10, by=1), standardize=FALSE)
coef(enmod)[, which(enmod$lambda==7)]
(Intercept)           M          So          Ed         Po1         Po2          LF         M.F         Pop          NW          U1 
  898.70955   102.19631    18.77561   165.81002   249.74394    34.17948     0.00000    59.31391   -15.96940    24.34366   -76.71121 
         U2      Wealth        Ineq        Prob        Time 
  119.28741    58.38457   238.73346   -89.21639     0.00000 
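
The coefficients of the model actually selected by caret ($\alpha=1$ and $\lambda=7$ in the run above) can also be pulled directly from the fitted train object (a short sketch; bestTune stores the selected tuning values):

# Coefficients of the cross-validated elastic net model chosen by caret
coef(emod$finalModel, s = emod$bestTune$lambda)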

Alain Vandormael
Senior Data Scientist, PhD, MSc