Contents

Part I

Basic Supervised Learning (James Ch4)

  • Logistic Regression, LDA, QDA, Naive Bayes, KNN

  • Tutorial: Comparison of classification methods with “Stock Market”, “Caravan Insurance” data

Resampling (James Ch5)

  • Validation Set, Leave-One-Out, K-fold Cross Validation, Bootstrap

  • Tutorial: Validation of the performance of different classification methods and resampling methods with “Auto” data

Part II

Model Selection (James Ch6)

  • Subset selection (Best subset selection, Stepwise selection)

  • Shrinkage Methods (Ridge Regression, Lasso)

  • Dimension Reduction Methods (PCR, PLS)

  • Tutorial: Test of the subset selection methods with validation of feature selection & dimension reduction (“Hitters” data)

Model selection (s27)

In the regression setting, the standard linear model \[Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon\] is commonly used to describe the relationship between a response \(Y\) and a set of variables \(X_1, X_2,... ,X_p\). We have seen that one typically fits this model using least squares. Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.

Prediction Accuracy

  • \(n\gg p\): least squares has low bias and low variance, and hence performs well on the test set.

  • \(n \approx p\): least squares has high variance, resulting in overfitting.

  • \(p > n\): there is no longer a unique least squares coefficient estimate and the variance is infinite. By constraining or shrinking the estimated coefficients, we can reduce the variance at the cost of a negligible increase in bias.

In the \(p>n\) case, there is no least squares solution because \(X^TX\) is singular (not invertible), so we cannot compute the equation below. \[\hat\beta = (X^TX)^{-1}X^Ty\]
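To see the singularity concretely, here is a minimal sketch (not part of the original notes) with simulated data where \(p > n\): the normal equations cannot be solved, whereas adding a small ridge-type penalty (discussed later in this part) restores invertibility.

set.seed(1)
n <- 20; p <- 50                            # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
qr(crossprod(X))$rank                       # rank <= n = 20 < p, so X'X is singular
try(solve(crossprod(X), crossprod(X, y)))   # the normal equations fail
beta_ridge <- solve(crossprod(X) + diag(p), crossprod(X, y))  # a lambda = 1 penalty makes X'X + lambda*I invertible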

Model Interpretability

  • Irrelevant variables that are not associated with the response lead to unnecessary complexity in the resulting model.

  • By performing feature selection (variable selection), we can improve the interpretability of the model.

  • Some approaches, such as the lasso and elastic net, perform this selection automatically (ridge regression shrinks coefficients but does not set them exactly to zero).

Here we will discuss three important classes of methods

Subset selection: Identifying a subset of the \(p\) predictors that we believe to be related to the response, and then fitting a model using least squares on the reduced set of variables.

Shrinkage: The estimated coefficients are shrunken towards zero relative to the least squares estimates. (a.k.a Regularization)

Dimension Reduction: Projecting the \(p\) predictors into an \(M\)-dimensional subspace, where \(M<p\). This is achieved by computing \(M\) different linear combinations, or projections, of the variables.

Best subset selection (s28)

To perform best subset selection, we fit a separate least squares regression for each possible combination of the \(p\) predictors. That is, we fit all \(p\) models that contain exactly one predictor, all \(\begin{pmatrix} p \\ 2 \end{pmatrix} = \frac{p(p-1)}{2}\) models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best. The number of possible models is \(2^p\). For example, if we have three candidate predictors, then there are \(2^3 = 8\) possible regression models to consider: the null model with no variables, 3 one-variable models, 3 two-variable models, and 1 three-variable model. If \(p=10\), we have \(2^{10} = 1024\) possible models. Therefore, the problem of selecting the best model from among the \(2^p\) possibilities considered by best subset selection is not trivial. This is usually broken up into two stages, as described in the algorithm below.

In the above Algorithm, Step 2 identifies the best model (on the training data) for each subset size, in order to reduce the problem from one of \(2^p\) possible models to one of \(p + 1\) possible models, including the null model (which has 0 variables). Now, in order to select a single best model, we must simply choose among these \(p + 1\) options. Since we want the model that is best in terms of test error, in Step 3 we use cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\) in order to select among \(M_0, M_1, ..., M_p\). In the case of logistic regression, instead of ordering models by RSS, we use the deviance, a measure that plays the role of RSS for a broader class of models. It is defined as \[D(y, \hat \theta) = 2\big(\log[p(y|\hat\theta_s)] - \log[p(y|\hat\theta)]\big)\] where \(\hat\theta_s\) denotes the fitted parameters of the saturated model and \(\hat\theta\) denotes the fitted parameters of the model under consideration; the smaller the deviance, the better the fit.

Since the computational complexity of best subset selection grows as \(2^p\), it becomes computationally infeasible for values of \(p\) greater than around 40, even with extremely fast modern computers. There are computational shortcuts, so-called branch-and-bound techniques, for eliminating some choices, but these have their limitations as \(p\) gets large.
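Before the leaps-based tutorial, here is a hedged brute-force sketch of Steps 1 and 2 (not part of the original notes; the four predictors are an arbitrary illustrative subset of the Credit data), fitting every possible non-null model and keeping, for each size \(k\), the one with the smallest training RSS.

library(ISLR)
preds <- c("Income", "Limit", "Cards", "Age")   # small illustrative subset: 2^4 = 16 models (15 excluding the null model)
all_models <- unlist(lapply(seq_along(preds),
                            function(k) combn(preds, k, simplify = FALSE)),
                     recursive = FALSE)
rss <- sapply(all_models, function(v)
  deviance(lm(reformulate(v, "Balance"), data = Credit)))   # RSS of each candidate model
sizes <- lengths(all_models)
# M_k for k = 1, ..., 4: the lowest-RSS model of each size (Step 3 would then compare these)
tapply(seq_along(all_models), sizes, function(idx) all_models[[idx[which.min(rss[idx])]]])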

library(ISLR)
# credit dataset
library(leaps)
credit <- Credit[,-1]; dim(credit)
## [1] 400  11
regfit.credit <- regsubsets(Balance ~ ., data = credit, nvmax = dim(credit)[2]-1)
credit_summary <- summary(regfit.credit)

par(mfrow=c(1,2))
plot(credit_summary$rss, type = 'l', col = "red", ylim = c(0, 8e+07),
     xlab = "Number of Predictors", ylab = "Residual Sum of Squares")
points(credit_summary$rss, col = "red", pch = 16)
plot(credit_summary$rsq, type = 'l', col = "red", ylim = c(0,1), 
     xlab = "Number of Predictors", ylab = "R^2")
points(credit_summary$rsq, col = "red", pch = 16)

par(mfrow=c(1,1))

# inspect data
library(ISLR)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dim(Hitters)
## [1] 322  20
glimpse(Hitters)
## Observations: 322
## Variables: 20
## $ AtBat     <int> 293, 315, 479, 496, 321, 594, 185, 298, 323, 401, 57...
## $ Hits      <int> 66, 81, 130, 141, 87, 169, 37, 73, 81, 92, 159, 53, ...
## $ HmRun     <int> 1, 7, 18, 20, 10, 4, 1, 0, 6, 17, 21, 4, 13, 0, 7, 3...
## $ Runs      <int> 30, 24, 66, 65, 39, 74, 23, 24, 26, 49, 107, 31, 48,...
## $ RBI       <int> 29, 38, 72, 78, 42, 51, 8, 24, 32, 66, 75, 26, 61, 1...
## $ Walks     <int> 14, 39, 76, 37, 30, 35, 21, 7, 8, 65, 59, 27, 47, 22...
## $ Years     <int> 1, 14, 3, 11, 2, 11, 2, 3, 2, 13, 10, 9, 4, 6, 13, 3...
## $ CAtBat    <int> 293, 3449, 1624, 5628, 396, 4408, 214, 509, 341, 520...
## $ CHits     <int> 66, 835, 457, 1575, 101, 1133, 42, 108, 86, 1332, 13...
## $ CHmRun    <int> 1, 69, 63, 225, 12, 19, 1, 0, 6, 253, 90, 15, 41, 4,...
## $ CRuns     <int> 30, 321, 224, 828, 48, 501, 30, 41, 32, 784, 702, 19...
## $ CRBI      <int> 29, 414, 266, 838, 46, 336, 9, 37, 34, 890, 504, 186...
## $ CWalks    <int> 14, 375, 263, 354, 33, 194, 24, 12, 8, 866, 488, 161...
## $ League    <fct> A, N, A, N, N, A, N, A, N, A, A, N, N, A, N, A, N, A...
## $ Division  <fct> E, W, W, E, E, W, E, W, W, E, E, W, E, E, E, W, W, W...
## $ PutOuts   <int> 446, 632, 880, 200, 805, 282, 76, 121, 143, 0, 238, ...
## $ Assists   <int> 33, 43, 82, 11, 40, 421, 127, 283, 290, 0, 445, 45, ...
## $ Errors    <int> 20, 10, 14, 3, 4, 25, 7, 9, 19, 0, 22, 11, 7, 6, 8, ...
## $ Salary    <dbl> NA, 475.000, 480.000, 500.000, 91.500, 750.000, 70.0...
## $ NewLeague <fct> A, N, A, N, N, A, A, A, N, A, A, N, N, A, N, A, N, A...
sum(is.na(Hitters))
## [1] 59
hitters <- na.omit(Hitters)
sum(is.na(hitters))
## [1] 0
dim(hitters)
## [1] 263  20
# subset selection
regfit.full <- regsubsets(Salary ~ ., data = hitters, nvmax = 19)
regsummary <- summary(regfit.full)
regsummary$rsq # R^2
##  [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
##  [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
# plotting results
par(mfrow=c(2,2))
plot(regsummary$rss, xlab = "Number of Variables", ylab = "RSS", type = 'l')
plot(regsummary$adjr2, xlab = "Number of Variables", ylab = "Adjusted Rsq", type = 'l')
points(which.max(regsummary$adjr2), regsummary$adjr2[which.max(regsummary$adjr2)], 
      col = "red", cex = 2, pch = 20)
plot(regsummary$cp, xlab = "Number of Variables", ylab = "Cp", type = 'l')
points(which.min(regsummary$cp), regsummary$cp[which.min(regsummary$cp)], 
      col = "red", cex = 2, pch = 20)
plot(regsummary$bic, xlab = "Number of Variables", ylab = "BIC", type = 'l')
points(which.min(regsummary$bic), regsummary$bic[which.min(regsummary$bic)], 
      col = "red", cex = 2, pch = 20)

par(mfrow=c(1,1))

# plot R^2 with the ggvis package
library(ggvis)
rsq <- as.data.frame(regsummary$rsq)
names(rsq) <- "R2"
rsq %>%
        ggvis(x=~ c(1:nrow(rsq)), y= ~ R2 ) %>%
        layer_points(fill = ~ R2 ) %>%
        add_axis("y", title = "R2") %>%
        add_axis("x", title = "Number of variables")
# built-in plot method
plot(regfit.full, scale = "r2", col = "red")

plot(regfit.full, scale = "adjr2", col = "red")

plot(regfit.full, scale = "Cp", col = "red")

plot(regfit.full, scale = "bic", col = "red")

# coefficients of the best model selected by BIC
coef(regfit.full, which.min(regsummary$bic))
##  (Intercept)        AtBat         Hits        Walks         CRBI 
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 
##    DivisionW      PutOuts 
## -122.9515338    0.2643076

Stepwise selection (s29)

For computational reasons, best subset selection cannot be applied with very large \(p\). Best subset selection may also suffer from statistical problems when \(p\) is large, because an enormous search space can lead to overfitting and high variance of the coefficient estimates. For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. More formally, the forward stepwise selection procedure is given in Algorithm 2 below.

 

Forward Stepwise Selection

Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.

Unlike best subset selection, which involved fitting \(2^p\) models, forward stepwise selection involves fitting one null model, along with \(p - k\) models in the \(k\)th iteration, for \(k = 0, ...,p - 1\). This amounts to a total of \(1 + \sum_{k=0}^{p - 1}(p - k) = 1+\frac{p(p+1)}{2}\) models. This is a substantial difference: when \(p = 20\), best subset selection requires fitting \(2^{20} = 1,048,576\) models, whereas forward stepwise selection requires fitting only \(1+\frac{20(21)}{2}=211\) models. However, forward stepwise selection is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the \(p\) predictors: for example, the best possible two-variable model may not contain the variable selected in the best one-variable model, and since forward stepwise selection must keep every previously added variable, it can never reach that two-variable model.

Forward stepwise selection can be applied even in the high-dimensional setting where \(n<p\); however, in this case, it is possible to construct submodels \(M_0, ... ,M_{n-1}\) only, since each submodel is fit using least squares, which will not yield a unique solution if \(p \geq n\).
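To make the greedy nature of the procedure explicit, here is a hedged sketch (not part of the original notes; `fwd_manual` is a hypothetical helper, and `leaps::regsubsets(..., method = "forward")` used in the tutorial below is the practical tool) that performs forward stepwise selection by hand, adding at each step the predictor whose inclusion gives the lowest training RSS.

fwd_manual <- function(data, response) {
  remaining <- setdiff(names(data), response)
  chosen <- character(0)
  models <- vector("list", length(remaining))
  for (k in seq_along(models)) {
    # RSS of each candidate model obtained by adding one more variable
    rss <- sapply(remaining, function(v)
      deviance(lm(reformulate(c(chosen, v), response), data = data)))
    best <- names(which.min(rss))
    chosen <- c(chosen, best)          # keep every previously selected variable
    remaining <- setdiff(remaining, best)
    models[[k]] <- chosen              # M_k: the best k-variable model found greedily
  }
  models
}
fwd_manual(hitters, "Salary")[1:3]     # the first three models M_1, M_2, M_3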

 

Backward Stepwise Selection

Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all \(p\) predictors, and then iteratively removes the least useful predictor, one-at-a-time.

Like forward stepwise selection, the backward selection approach searches through only \(1+\frac{p(p+1)}{2}\) models, and so can be applied in settings where \(p\) is too large to apply best subset selection. Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the \(p\) predictors. Backward selection requires that the number of samples \(n\) is larger than the number of variables \(p\) (so that the full model can be fit). In contrast, forward stepwise can be used even when \(n<p\), and so it is the only viable subset method when \(p\) is very large.

 

Hybrid Approaches

The best subset, forward stepwise, and backward stepwise selection approaches generally give similar but not identical models. As another alternative, hybrid versions of forward and backward stepwise selection are available, in which variables are added to the model sequentially, in analogy to forward selection. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.

credit <- Credit[,-1]; dim(credit)
## [1] 400  11
library(leaps)

# best subset
regfit.best <- regsubsets(Balance ~ ., data = credit, 
                          nvmax = dim(credit)[2]-1, method = "exhaustive") # default, best
res.best <- summary(regfit.best)

# forward
regfit.fwd <- regsubsets(Balance~.,data=credit, 
                         nvmax=dim(credit)[2]-1, method="forward")
res.fwd <- summary(regfit.fwd)

# backward
regfit.bwd <- regsubsets(Balance~.,data=credit, 
                         nvmax=dim(credit)[2]-1, method="backward")
res.bwd <- summary(regfit.bwd)

# hybrid
regfit.hbd <- regsubsets(Balance~.,data=credit, 
                         nvmax=dim(credit)[2]-1, method="seqrep")
res.hbd <- summary(regfit.hbd)

## Which variables are selected by each method?
# best subset selection
subset_list <- c()
for (i in 1:(dim(credit)[2]-1)) {
  subset_list[i] <- list(colnames(res.best$which[,res.best$which[i,]]))
} 
message("Best Subset Selection")
## Best Subset Selection
names(subset_list) <- paste(1:10, "Variables"); subset_list
## $`1 Variables`
## [1] "(Intercept)" "Rating"     
## 
## $`2 Variables`
## [1] "(Intercept)" "Income"      "Rating"     
## 
## $`3 Variables`
## [1] "(Intercept)" "Income"      "Rating"      "StudentYes" 
## 
## $`4 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Cards"       "StudentYes" 
## 
## $`5 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
## [6] "StudentYes" 
## 
## $`6 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
## [6] "Age"         "StudentYes" 
## 
## $`7 Variables`
## [1] "(Intercept)"  "Income"       "Limit"        "Rating"      
## [5] "Cards"        "Age"          "GenderFemale" "StudentYes"  
## 
## $`8 Variables`
## [1] "(Intercept)"    "Income"         "Limit"          "Rating"        
## [5] "Cards"          "Age"            "GenderFemale"   "StudentYes"    
## [9] "EthnicityAsian"
## 
## $`9 Variables`
##  [1] "(Intercept)"    "Income"         "Limit"          "Rating"        
##  [5] "Cards"          "Age"            "GenderFemale"   "StudentYes"    
##  [9] "MarriedYes"     "EthnicityAsian"
## 
## $`10 Variables`
##  [1] "(Intercept)"        "Income"             "Limit"             
##  [4] "Rating"             "Cards"              "Age"               
##  [7] "GenderFemale"       "StudentYes"         "MarriedYes"        
## [10] "EthnicityAsian"     "EthnicityCaucasian"
# foward selection
fwd_list <- c()
for (i in 1:(dim(credit)[2]-1)) {
  fwd_list[i] <- list(colnames(res.fwd$which[,res.fwd$which[i,]]))
} 
message("Forward Selection")
## Forward Selection
names(fwd_list) <- paste(1:10, "Variables"); fwd_list
## $`1 Variables`
## [1] "(Intercept)" "Rating"     
## 
## $`2 Variables`
## [1] "(Intercept)" "Income"      "Rating"     
## 
## $`3 Variables`
## [1] "(Intercept)" "Income"      "Rating"      "StudentYes" 
## 
## $`4 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "StudentYes" 
## 
## $`5 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
## [6] "StudentYes" 
## 
## $`6 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
## [6] "Age"         "StudentYes" 
## 
## $`7 Variables`
## [1] "(Intercept)"  "Income"       "Limit"        "Rating"      
## [5] "Cards"        "Age"          "GenderFemale" "StudentYes"  
## 
## $`8 Variables`
## [1] "(Intercept)"    "Income"         "Limit"          "Rating"        
## [5] "Cards"          "Age"            "GenderFemale"   "StudentYes"    
## [9] "EthnicityAsian"
## 
## $`9 Variables`
##  [1] "(Intercept)"    "Income"         "Limit"          "Rating"        
##  [5] "Cards"          "Age"            "GenderFemale"   "StudentYes"    
##  [9] "MarriedYes"     "EthnicityAsian"
## 
## $`10 Variables`
##  [1] "(Intercept)"        "Income"             "Limit"             
##  [4] "Rating"             "Cards"              "Age"               
##  [7] "GenderFemale"       "StudentYes"         "MarriedYes"        
## [10] "EthnicityAsian"     "EthnicityCaucasian"
# backward selection
bwd_list <- c()
for (i in 1:(dim(credit)[2]-1)) {
  bwd_list[i] <- list(colnames(res.bwd$which[,res.bwd$which[i,]]))
} 
message("Backward Selection")
## Backward Selection
names(bwd_list) <- paste(1:10, "Variables"); bwd_list
## $`1 Variables`
## [1] "(Intercept)" "Limit"      
## 
## $`2 Variables`
## [1] "(Intercept)" "Income"      "Limit"      
## 
## $`3 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "StudentYes" 
## 
## $`4 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Cards"       "StudentYes" 
## 
## $`5 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
## [6] "StudentYes" 
## 
## $`6 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
## [6] "Age"         "StudentYes" 
## 
## $`7 Variables`
## [1] "(Intercept)"  "Income"       "Limit"        "Rating"      
## [5] "Cards"        "Age"          "GenderFemale" "StudentYes"  
## 
## $`8 Variables`
## [1] "(Intercept)"    "Income"         "Limit"          "Rating"        
## [5] "Cards"          "Age"            "GenderFemale"   "StudentYes"    
## [9] "EthnicityAsian"
## 
## $`9 Variables`
##  [1] "(Intercept)"    "Income"         "Limit"          "Rating"        
##  [5] "Cards"          "Age"            "GenderFemale"   "StudentYes"    
##  [9] "MarriedYes"     "EthnicityAsian"
## 
## $`10 Variables`
##  [1] "(Intercept)"        "Income"             "Limit"             
##  [4] "Rating"             "Cards"              "Age"               
##  [7] "GenderFemale"       "StudentYes"         "MarriedYes"        
## [10] "EthnicityAsian"     "EthnicityCaucasian"
# Hybrid selection
hbd_list <- c()
for (i in 1:(dim(credit)[2]-1)) {
  hbd_list[i] <- list(colnames(res.hbd$which[,res.hbd$which[i,]]))
} 
message("Hybrid Approach")
## Hybrid Approach
names(hbd_list) <- paste(1:10, "Variables"); hbd_list
## $`1 Variables`
## [1] "(Intercept)" "Rating"     
## 
## $`2 Variables`
## [1] "(Intercept)" "Income"      "Limit"      
## 
## $`3 Variables`
## [1] "(Intercept)" "Income"      "Rating"      "StudentYes" 
## 
## $`4 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Cards"       "StudentYes" 
## 
## $`5 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
## [6] "Age"        
## 
## $`6 Variables`
## [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
## [6] "Age"         "Education"  
## 
## $`7 Variables`
## [1] "(Intercept)"  "Income"       "Limit"        "Rating"      
## [5] "Cards"        "Age"          "GenderFemale" "StudentYes"  
## 
## $`8 Variables`
## [1] "(Intercept)"    "Income"         "Limit"          "Rating"        
## [5] "Cards"          "Age"            "GenderFemale"   "StudentYes"    
## [9] "EthnicityAsian"
## 
## $`9 Variables`
##  [1] "(Intercept)"    "Income"         "Limit"          "Rating"        
##  [5] "Cards"          "Age"            "GenderFemale"   "StudentYes"    
##  [9] "MarriedYes"     "EthnicityAsian"
## 
## $`10 Variables`
##  [1] "(Intercept)"        "Income"             "Limit"             
##  [4] "Rating"             "Cards"              "Age"               
##  [7] "GenderFemale"       "StudentYes"         "MarriedYes"        
## [10] "EthnicityAsian"     "EthnicityCaucasian"

Choosing the optimal model - Adjustment for the estimation of test error (s30)

Now we want to determine which of these models is best. As we discussed in the previous section, the model containing all of the predictors will always have the smallest RSS and the largest \(R^2\), since these quantities are related to the training error. Instead, we wish to choose a model with a low test error. There are two common approaches:

  1. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.

  2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach.

A number of techniques for adjusting the training error for the model size are available. These approaches can be used to select among a set of models with different numbers of variables. We now consider four such approaches: \(C_p\), Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted \(R^2\).

 

Mallows’ \(C_p\)

Mallows’ \(C_p\) is defined as \[C_p = \frac{1}{n} (RSS + 2d\hat\sigma^2) = MSE_{train} + \frac{2d}{n}\hat\sigma^2\] where \(\hat\sigma^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement in the regression model. Typically \(\hat\sigma^2\) is estimated using the full model containing all predictors. Essentially, the \(C_p\) statistic adds a penalty of \(2d\hat\sigma^2\) to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. If \(\hat\sigma^2\) is an unbiased estimate of \(\sigma^2\), then \(C_p\) is an unbiased estimate of test MSE. We choose the model with the lowest \(C_p\) value. Alternatively, \(C_p\) is defined as \[C_p' = \frac{RSS}{\hat\sigma^2}+2d-n = \frac{n}{\hat\sigma^2}\bigg[MSE_{train} + \bigg(\frac{2d}{n} -1\bigg)\hat\sigma^2\bigg] \] This is equivalent to the definition given above in the sense that \(C_p = \frac{1}{n}\hat\sigma^2(C_p'+n) = \hat\sigma^2 + C_p'\frac{\hat\sigma^2}{n}\), and so the model with the smallest \(C_p\) also has the smallest \(C_p'\).

 

Akaike Information Criterion (AIC)

The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing. In this case AIC is given by \[\begin{aligned}AIC &= \frac{1}{n\hat\sigma^2}(RSS + 2d\hat\sigma^2) + Const. \\ &= \frac{MSE_{train}}{\hat\sigma^2} + \frac{2d}{n} + Const. \\ &= \frac{C_p}{\hat\sigma^2} + Const.\end{aligned}\] where, for simplicity, we have omitted an additive constant. Hence for least squares models, \(C_p\) and AIC are proportional to each other.

 

Bayesian Information Criterion (BIC)

BIC is derived from a Bayesian point of view, but ends up looking similar to \(C_p\) (and AIC) as well. For the least squares model with \(d\) predictors, the BIC is, up to irrelevant constants, given by \[\begin{aligned} BIC &= \frac{1}{n\hat\sigma^2}(RSS+\log(n)d\hat\sigma^2) \\ &= \frac{1}{\hat\sigma^2}\bigg[MSE_{train}+\frac{\log(n)d}{n}\hat\sigma^2\bigg] \end{aligned}\] Like \(C_p\), the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Since \(\log n > 2\) for any \(n >7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\). (Here \(\log\) means \(\log_e\), i.e. \(\ln\).)

 

Adjusted \(R^2\)

The adjusted \(R^2\) statistic is another popular approach for selecting among a set of models that contain different numbers of variables. Recall that usual \(R^2\) is defined as \(1-\frac{RSS}{TSS}\), where \(TSS = \sum_{i=1}^n(y_i - \bar y)^2\) is the total sum of squares for the response. Since RSS always decreases as more variables are added to the model, the \(R^2\) always increases as more variables are added. For a least squares model with \(d\) variables, the adjusted \(R^2\) statistic is calculated as \[\text{Adjusted} \ \ R^2 = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\] Unlike \(C_p\), AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted \(R^2\) indicates a model with a small test error. Maximizing the adjusted \(R^2\) is equivalent to minimizing \(\frac{RSS}{n-d-1}\). While RSS always decreases as the number of variables in the model increases, \(\frac{RSS}{n-d-1}\) may increase or decrease, due to the presence of \(d\) in the denominator. The intuition behind the adjusted \(R^2\) is that once all of the correct variables have been included in the model, adding additional noise variables will lead to only a very small decrease in RSS. Since adding noise variables leads to an increase in \(d\), such variables will lead to an increase in \(\frac{RSS}{n-d-1}\), and consequently a decrease in the adjusted \(R^2\). Therefore, in theory, the model with the largest adjusted \(R^2\) will have only correct variables and no noise variables. Unlike the \(R^2\) statistic, the adjusted \(R^2\) statistic pays a price for the inclusion of unnecessary variables in the model.

# define predict function for regsubsets object
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}

# best subset
regfit.best <- regsubsets(Balance ~ ., data = credit, 
                          nvmax = (dim(credit)[2]-1)) # default, best
res.best <- summary(regfit.best)

# Cp
par(mfrow = c(1,3))
plot(res.best$cp, xlab = "Number of Predictors", 
     ylab = "C_p", col = "hot pink", type = 'l')
points(res.best$cp, col = "blue", pch = 16)
points(which.min(res.best$cp), min(res.best$cp), pch = 4, col = "blue", cex = 3)

# BIC
# BIC of null model 
null.bic <- BIC(glm(Balance ~ 1, data = credit))
# True BIC of each model 
bic.actual <- null.bic + res.best$bic

plot(bic.actual, xlab = "Number of Predictors", 
     ylab = "BIC", col = "hot pink", type = 'l')
points(bic.actual, col = "blue", pch = 16)
points(which.min(bic.actual), min(bic.actual), pch = 4, col = "blue", cex = 3)

# adjust R^2
plot(res.best$adjr2, xlab = "Number of Predictors", 
     ylab = "Adjusted R^2", col = "hot pink", type = 'l')
points(res.best$adjr2, col = "blue", pch = 16)
points(which.max(res.best$adjr2), max(res.best$adjr2), pch = 4, col = "blue", cex = 3)

# final models chosen by each criterion (adjusted R^2 is maximized, the others minimized)
coef(regfit.best, which.min(res.best$cp))
coef(regfit.best, which.min(res.best$bic))
coef(regfit.best, which.max(res.best$adjr2))
# Manual calculation of each criterion
# Perform the manual calculation & compare with the results above
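# A rough manual sketch (not part of the original notes): recompute C_p, BIC, and adjusted
# R^2 with the formulas from the text, using the RSS values stored in res.best. Constant and
# offset conventions differ between these formulas and the values reported by leaps, so we
# compare the location of the optimum rather than the raw numbers.
n <- nrow(credit)
d <- 1:(dim(credit)[2]-1)                                     # number of predictors in each model
sigma2.hat <- summary(lm(Balance ~ ., data = credit))$sigma^2 # sigma^2 estimated from the full model
TSS <- sum((credit$Balance - mean(credit$Balance))^2)

cp.manual    <- (res.best$rss + 2*d*sigma2.hat)/n
bic.manual   <- (res.best$rss + log(n)*d*sigma2.hat)/(n*sigma2.hat)
adjr2.manual <- 1 - (res.best$rss/(n - d - 1))/(TSS/(n - 1))

c(cp = which.min(cp.manual), bic = which.min(bic.manual), adjr2 = which.max(adjr2.manual))
c(which.min(res.best$cp), which.min(res.best$bic), which.max(res.best$adjr2)) # as reported by leaps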

# validation set approach
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(credit), replace = TRUE) # random ~50/50 split into train/test
test <- (!train)

regfit.best <- regsubsets(Balance ~ ., data = credit[train,], nvmax = dim(credit)[2]-1)
test.mat <- model.matrix(Balance ~ ., data = credit[test,])
val.errors <- rep(NA, dim(credit)[2]-1)

for (i in 1:(dim(credit)[2]-1)){
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((credit$Balance[test]-pred)^2)
}
val.errors
##  [1] 51754.40 24326.78 12905.05 11616.12 11582.29 12176.73 12041.24
##  [8] 12024.54 11980.66 11930.86
# Cross-validation
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(credit), replace = TRUE)
cv.errors <- matrix(NA, k, (dim(credit)[2]-1), dimnames = list(NULL, paste(1:(dim(credit)[2]-1))))

for (j in 1:k){ # for each jth fold,
  best.fit <- regsubsets(Balance ~ ., data = credit[folds !=j,], 
                         nvmax = (dim(credit)[2]-1)) # training without jth fold
  for (i in 1:(dim(credit)[2]-1)) { # for each model 
    pred = stats::predict(best.fit, credit[folds == j,], id = i) # test with jth fold
    cv.errors[j, i] <- mean((credit$Balance[folds==j]-pred)^2) # test MSE
  }
}

# find minimum cross-validation error 
mean.cv.errors <- apply(cv.errors, 2, mean) # column mean
mean.cv.errors; colMeans(cv.errors) # identical
##        1        2        3        4        5        6        7        8 
## 54501.60 27339.31 11389.85 10208.96 10314.11 10077.26 10246.64 10363.61 
##        9       10 
## 10391.83 10333.97
##        1        2        3        4        5        6        7        8 
## 54501.60 27339.31 11389.85 10208.96 10314.11 10077.26 10246.64 10363.61 
##        9       10 
## 10391.83 10333.97
### Plotting all errors
plot(sqrt(bic.actual), type = 'l', col = "hot pink", 
     xlab = "Number of Predictors", ylab = "Square Root of BIC")
points(sqrt(bic.actual), pch = 16, col = "blue")
points(which.min(bic.actual), min(sqrt(bic.actual)), pch = 4, col = "blue", cex = 3)
# Validation set
plot(val.errors, type = 'l', col = "hot pink", 
     xlab = "Number of Predictors", ylab = "Validation Set Error")
points(val.errors, pch = 16, col = "blue")
points(which.min(val.errors), min(val.errors), pch = 4, col = "blue", cex = 3)

# Cross-validation error
plot(mean.cv.errors, type = 'l', col = "hot pink", 
     xlab = "Number of Predictors", ylab = "Cross-Validation Error")
points(mean.cv.errors, pch = 16, col = "blue")
points(which.min(mean.cv.errors), min(mean.cv.errors), pch = 4, col = "blue", cex = 3)

par(mfrow = c(1,1))

 

Validation and Cross-validation

As we saw in the previous section, the validation and cross-validation approaches can be used for model selection. Here, we must use only the training observations to perform all aspects of model fitting, including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.

## cleaning data
hitters <- na.omit(Hitters)

## validation set approach
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(hitters), replace = TRUE)
test <- (!train)

library(leaps)
regfit.best <- regsubsets(Salary ~ ., data = hitters[train,], nvmax = 19)
test.mat <- model.matrix(Salary ~ ., data = hitters[test,])
val.errors <- rep(NA, 19)
for (i in 1:19){
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((hitters$Salary[test]-pred)^2)
}

val.errors
##  [1] 220968.0 169157.1 178518.2 163426.1 168418.1 171270.6 162377.1
##  [8] 157909.3 154055.7 148162.1 151156.4 151742.5 152214.5 157358.7
## [15] 158541.4 158743.3 159972.7 159859.8 160105.6
which.min(val.errors)
## [1] 10
coef(regfit.best, which.min(val.errors))
## (Intercept)       AtBat        Hits       Walks      CAtBat       CHits 
## -80.2751499  -1.4683816   7.1625314   3.6430345  -0.1855698   1.1053238 
##      CHmRun      CWalks     LeagueN   DivisionW     PutOuts 
##   1.3844863  -0.7483170  84.5576103 -53.0289658   0.2381662
# full-data regsubsets fit: extract the best 10-variable model
# (its coefficients differ somewhat from the training-set fit shown above)
reg.fit.best <- regsubsets(Salary ~ ., data = hitters, nvmax = 19)
coef(reg.fit.best, 10)
## Cross-Validation
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))

for (j in 1:k){ # for each jth fold,
  best.fit <- regsubsets(Salary ~ ., data = hitters[folds !=j,], 
                         nvmax = 19) # training without jth fold
  for (i in 1:19) { # for each model 
    pred = predict(best.fit, hitters[folds == j,], id = i) # test with jth fold
    cv.errors[j, i] <- mean((hitters$Salary[folds==j]-pred)^2) # test MSE
  }
}

# find minimum cross-validation error 
mean.cv.errors <- apply(cv.errors, 2, mean) # column mean
mean.cv.errors; colMeans(cv.errors) # identical
##        1        2        3        4        5        6        7        8 
## 160093.5 140196.8 153117.0 151159.3 146841.3 138302.6 144346.2 130207.7 
##        9       10       11       12       13       14       15       16 
## 129459.6 125334.7 125153.8 128273.5 133461.0 133974.6 131825.7 131882.8 
##       17       18       19 
## 132750.9 133096.2 132804.7
##        1        2        3        4        5        6        7        8 
## 160093.5 140196.8 153117.0 151159.3 146841.3 138302.6 144346.2 130207.7 
##        9       10       11       12       13       14       15       16 
## 129459.6 125334.7 125153.8 128273.5 133461.0 133974.6 131825.7 131882.8 
##       17       18       19 
## 132750.9 133096.2 132804.7
plot(mean.cv.errors, type = 'b', ylab = "Mean Cross-Validation Errors", 
     xlab = "Number of Predictors", pch=16)

# full model & get the best coefficients for our final model
reg.best <- regsubsets(Salary ~ ., data = hitters, nvmax = 19)
coef(reg.best, which.min(mean.cv.errors))
##  (Intercept)        AtBat         Hits        Walks       CAtBat 
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914 
##        CRuns         CRBI       CWalks      LeagueN    DivisionW 
##    1.4553310    0.7852528   -0.8228559   43.1116152 -111.1460252 
##      PutOuts      Assists 
##    0.2894087    0.2688277

Shrinkage Methods - Ridge Regression (s31)

As an alternative to the least squares method, we can fit a model containing all \(p\) predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

Recall that the least squares fitting procedure estimates \(\beta_0, \beta_1, ..., \beta_p\) using the values that minimize

\[RSS = \sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p\beta_j x_{ij}\Bigg)^2\] Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates \(\hat\beta^R\) are the values that minimize \[\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p\beta_j x_{ij}\Bigg)^2+\lambda\sum_{j=1}^p\beta_j^2 = RSS + \lambda\sum_{j=1}^p\beta_j^2\] where \(\lambda \geq 0\) is a tuning parameter, to be determined separately. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, \(\lambda\sum_{j=1}^p\beta_j^2\), called a shrinkage penalty, is small when \(\beta_1, ..., \beta_p\) are close to zero, and so it has the effect of shrinking the estimates of \(\beta_j\) towards zero. The tuning parameter \(\lambda\) serves to control the relative impact of these two terms on the regression coefficient estimates. When \(\lambda=0\), the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as \(\lambda \rightarrow \infty\), the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, \(\hat\beta_{\lambda}^R\), for each value of \(\lambda\). Selecting a good value for \(\lambda\) is critical. Note that in the above equation, the shrinkage penalty is applied to \(\beta_1, ..., \beta_p\), but not to the intercept \(\beta_0\). We want to shrink the estimated association of each variable with the response. If we assume that the variables, that is, the columns of the data matrix \(X\), have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form \(\hat\beta_0=\bar y=\sum_{i=1}^n \frac{y_i}{n}\).
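Before turning to the glmnet-based tutorial below, here is a hedged sketch (not part of the original notes) of the closed-form ridge solution \(\hat\beta^R_\lambda = (X^TX + \lambda I)^{-1}X^Ty\) on standardized predictors and a centered response. Note that glmnet scales its penalty differently (by \(1/n\) in its objective), so its \(\lambda\) values are not directly comparable; this is only meant to show the shrinkage effect as \(\lambda\) grows.

Xs <- scale(model.matrix(Balance ~ ., credit)[,-1])   # standardized predictors
yc <- credit$Balance - mean(credit$Balance)           # centered response (the intercept is not penalized)
ridge_manual <- function(lambda)
  drop(solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc))
round(cbind(lambda_0 = ridge_manual(0), lambda_1e3 = ridge_manual(1e3), lambda_1e5 = ridge_manual(1e5)), 3)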

library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:ggvis':
## 
##     band
## Loading required package: foreach
## Loaded glmnet 2.0-16
# 0-1 normalization
normalize <- function(x)
{
    return((x- min(x)) /(max(x)-min(x)))
}

# Prep data
credit <- Credit[,-1] 
x <- model.matrix(Balance ~ ., credit)[,-1]
y <- credit$Balance

# scaling of data
# x[,1:6] <- apply(x[,1:6],2, scale) - different normalization scheme for dummy variable
x <- scale(x)

# fit model
grid <- 10^seq(10, -2, length = 100)
fit <-  glmnet(x, y, alpha = 0, lambda = grid, standardize = TRUE) # ridge regression
fit.lsq <- lm(y ~ x)

# calculating L2 norm
fit$lambda # display all lambda values
##   [1] 1.000000e+10 7.564633e+09 5.722368e+09 4.328761e+09 3.274549e+09
##   [6] 2.477076e+09 1.873817e+09 1.417474e+09 1.072267e+09 8.111308e+08
##  [11] 6.135907e+08 4.641589e+08 3.511192e+08 2.656088e+08 2.009233e+08
##  [16] 1.519911e+08 1.149757e+08 8.697490e+07 6.579332e+07 4.977024e+07
##  [21] 3.764936e+07 2.848036e+07 2.154435e+07 1.629751e+07 1.232847e+07
##  [26] 9.326033e+06 7.054802e+06 5.336699e+06 4.037017e+06 3.053856e+06
##  [31] 2.310130e+06 1.747528e+06 1.321941e+06 1.000000e+06 7.564633e+05
##  [36] 5.722368e+05 4.328761e+05 3.274549e+05 2.477076e+05 1.873817e+05
##  [41] 1.417474e+05 1.072267e+05 8.111308e+04 6.135907e+04 4.641589e+04
##  [46] 3.511192e+04 2.656088e+04 2.009233e+04 1.519911e+04 1.149757e+04
##  [51] 8.697490e+03 6.579332e+03 4.977024e+03 3.764936e+03 2.848036e+03
##  [56] 2.154435e+03 1.629751e+03 1.232847e+03 9.326033e+02 7.054802e+02
##  [61] 5.336699e+02 4.037017e+02 3.053856e+02 2.310130e+02 1.747528e+02
##  [66] 1.321941e+02 1.000000e+02 7.564633e+01 5.722368e+01 4.328761e+01
##  [71] 3.274549e+01 2.477076e+01 1.873817e+01 1.417474e+01 1.072267e+01
##  [76] 8.111308e+00 6.135907e+00 4.641589e+00 3.511192e+00 2.656088e+00
##  [81] 2.009233e+00 1.519911e+00 1.149757e+00 8.697490e-01 6.579332e-01
##  [86] 4.977024e-01 3.764936e-01 2.848036e-01 2.154435e-01 1.629751e-01
##  [91] 1.232847e-01 9.326033e-02 7.054802e-02 5.336699e-02 4.037017e-02
##  [96] 3.053856e-02 2.310130e-02 1.747528e-02 1.321941e-02 1.000000e-02
coef(fit)[1:5, 1:5] # display some of coefficients
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                       s0           s1           s2           s3
## (Intercept) 5.200150e+02 5.200150e+02 5.200150e+02 5.200150e+02
## Income      9.788429e-06 1.293973e-05 1.710555e-05 2.261253e-05
## Limit       1.819162e-05 2.404825e-05 3.179037e-05 4.202500e-05
## Rating      1.823232e-05 2.410205e-05 3.186149e-05 4.211902e-05
## Cards       1.825213e-06 2.412824e-06 3.189611e-06 4.216478e-06
##                       s4
## (Intercept) 5.200150e+02
## Income      2.989244e-05
## Limit       5.555457e-05
## Rating      5.567886e-05
## Cards       5.573936e-06
fit$l2 <- sqrt(colSums(coef(fit)[-1,]^2)) # L2 Norms
# Normalized L2 Norms (w.r.t L2 Norm of least square)
fit$norml2 <- fit$l2/sqrt(sum(coef(fit.lsq)[-1]^2))


# plot results
par(mfrow = c(1,2))
library(RColorBrewer)
palette <- brewer.pal(11, "Spectral")

# lambda vs coefficient
plot(fit, xvar = "lambda", col = palette, xlim = c(-5,6),
     ylab = "Standardized Coefficients")
legend(x=-5, y=-50, fit$beta@Dimnames[[1]], 
       col = palette, lty = rep(1,11), cex = 0.35)

# L2 Norm vs coefficients
plot(fit$norml2, matrix(fit$beta@x, nrow=11)[1,], type = 'l', col = palette[1], xlim = c(0, 1),
     ylim = c(-300, 450), xlab = "Normalized L2Norm", ylab = "Standardized Coefficients")
for (i in 2:11) lines(fit$norml2, matrix(fit$beta@x, nrow=11)[i,], type = 'l', col = palette[[i]])
legend(x=0, y=-50, fit$beta@Dimnames[[1]], 
       col = palette, lty = rep(1,11), cex = 0.35)

par(mfrow = c(1,1))

# training & testing for ridge regression
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]

# prediction with test data set with parameter lambda = 4
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred <- predict(ridge.mod, s=4, newx = x[test,]) # lambda = 4
mean((ridge.pred - y.test)^2) # MSE
## [1] 10634.78
mean((mean(y[train]) - y.test)^2) # MSE for lambda -> infty, using only intercept
## [1] 194734.7
ridge.pred <- predict(ridge.mod, s=1e10, newx = x[test,]) # very large lambda (1e10)
mean((ridge.pred - y.test)^2) # MSE
## [1] 194734.7
leastsq.pred <- predict(glmnet(x, y, alpha = 0, lambda = 4), 
                        newx = x[test,], s=4, exact = TRUE) # ridge with lambda = 4, refit on the full data
mean((leastsq.pred - y.test)^2) # MSE of the full-data ridge fit (lambda = 4)
## [1] 10165.56
lm(y ~ x, subset = train)
## 
## Call:
## lm(formula = y ~ x, subset = train)
## 
## Coefficients:
##         (Intercept)              xIncome               xLimit  
##             517.397             -278.198              486.398  
##             xRating               xCards                 xAge  
##             142.636               29.462              -10.110  
##          xEducation        xGenderFemale          xStudentYes  
##              -5.943              -10.043              131.612  
##         xMarriedYes      xEthnicityAsian  xEthnicityCaucasian  
##              -5.132               13.299               11.423
predict(glmnet(x, y, alpha = 0, lambda = 4), s=4, exact = T, type = "coefficients")[1:12,]
##        (Intercept)             Income              Limit 
##         520.015000        -263.241642         334.820628 
##             Rating              Cards                Age 
##         270.035821          20.431666         -11.542037 
##          Education       GenderFemale         StudentYes 
##          -2.746844          -4.965227         125.834139 
##         MarriedYes     EthnicityAsian EthnicityCaucasian 
##          -5.025895           7.555101           5.068985
# cv of ridge regression
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)

bestlambda <- cv.out$lambda.min; bestlambda
## [1] 45.57503
ridge.pred <- predict(ridge.mod, s = bestlambda, newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 14871.68
# final model 
out <- glmnet(x, y, alpha = 0, lambda = grid)
predict(out, type = "coefficients", s = bestlambda)[1:12,]
##        (Intercept)             Income              Limit 
##         520.015000        -172.698485         258.498797 
##             Rating              Cards                Age 
##         252.257647          21.853938         -16.947831 
##          Education       GenderFemale         StudentYes 
##          -1.353316          -2.131229         113.079064 
##         MarriedYes     EthnicityAsian EthnicityCaucasian 
##          -5.952610           5.490850           4.475777

In the figures above, the normalized \(\ell_2\) norm of the ridge regression coefficients, \(||\hat\beta_{\lambda}^R||_2/||\hat\beta||_2\), indicates the amount that the ridge regression coefficient estimates have been shrunken towards zero. The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant \(c\) simply leads to a scaling of the least squares coefficient estimates by a factor of \(\frac{1}{c}\). In other words, regardless of how the \(j\)th predictor is scaled, \(X_j\hat\beta_j\) will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant. That is, \(X_j\hat\beta_{j,\lambda}^R\) will depend not only on the value of \(\lambda\), but also on the scaling of the \(j\)th predictor. In fact, the value of \(X_j\hat\beta_{j, \lambda}^R\) may even depend on the scaling of the other predictors! Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula \[\tilde x_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar x_j)^2}}\] so that they are all on the same scale. (Note that our figure differs slightly from the one in the textbook because we used a different scaling scheme, but the results are almost the same.)
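As a small aside (a sketch under the assumption that `credit` is the data frame prepared above, not part of the original notes), the formula above divides by a standard deviation computed with a \(1/n\) factor, whereas R's `scale()` uses the sample standard deviation with \(1/(n-1)\); this is exactly the kind of small difference in scaling scheme mentioned in the parenthetical note.

x_raw <- model.matrix(Balance ~ ., credit)[,-1]
n <- nrow(x_raw)
std_pop <- function(v) v / sqrt(mean((v - mean(v))^2))  # the text's standardization (1/n denominator)
x_tilde <- apply(x_raw, 2, std_pop)
apply(x_tilde, 2, sd)   # each column has sd sqrt(n/(n-1)), i.e. approximately but not exactly 1
sqrt(n/(n-1))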

Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.

As you can see in the figures above, because ridge regression reduces variance and increases bias, for some values of \(\lambda\) it reduces the test error effectively. In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. In particular, when the number of variables \(p\) is almost as large as the number of observations \(n\), the least squares estimates will be extremely variable, but ridge regression can still perform well in this situation, even when \(p>n\), by decreasing variance at the cost of a slight increase in bias. In addition, ridge regression is computationally favorable; in fact, the computations required to solve the ridge regression equation, simultaneously for all values of \(\lambda\), are almost identical to those for fitting a model using least squares.

Shrinkage Methods - Lasso (s32)

Ridge regression does have one obvious disadvantage - it will include all \(p\) predictors in the final model. Unless \(\lambda = \infty\), the penalty \(\lambda\sum_{j}^p \beta_j^2\) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero. The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat\beta_{\lambda}^L\), minimize the quantity

\[\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p\beta_j x_{ij}\Bigg)^2+\lambda\sum_{j=1}^p|\beta_j| = RSS + \lambda\sum_{j=1}^p|\beta_j|\] The only difference between ridge regression and the lasso is that the \(\beta_j^2\) term in the ridge regression penalty has been replaced by \(|\beta_j|\) in the lasso penalty. In statistical language, the lasso uses an \(\ell_1\) penalty instead of an \(\ell_2\) penalty. The \(\ell_1\) norm of a coefficient vector \(\beta\) is given by \(||\beta||_1 = \sum_{j}^p|\beta_j|\). Unlike ridge regression, the lasso performs variable selection when the tuning parameter \(\lambda\) is sufficiently large. We say that the lasso yields sparse models - that is, models that involve only a subset of the variables.

grid <- 10^seq(10, -2, length = 100)
fit <-  glmnet(x, y, alpha = 1, lambda = grid, standardize = TRUE) # lasso regression
fit.lsq <- lm(y ~ x)

# calculating L1 norm
fit$lambda # display all lambda values
##   [1] 1.000000e+10 7.564633e+09 5.722368e+09 4.328761e+09 3.274549e+09
##   [6] 2.477076e+09 1.873817e+09 1.417474e+09 1.072267e+09 8.111308e+08
##  [11] 6.135907e+08 4.641589e+08 3.511192e+08 2.656088e+08 2.009233e+08
##  [16] 1.519911e+08 1.149757e+08 8.697490e+07 6.579332e+07 4.977024e+07
##  [21] 3.764936e+07 2.848036e+07 2.154435e+07 1.629751e+07 1.232847e+07
##  [26] 9.326033e+06 7.054802e+06 5.336699e+06 4.037017e+06 3.053856e+06
##  [31] 2.310130e+06 1.747528e+06 1.321941e+06 1.000000e+06 7.564633e+05
##  [36] 5.722368e+05 4.328761e+05 3.274549e+05 2.477076e+05 1.873817e+05
##  [41] 1.417474e+05 1.072267e+05 8.111308e+04 6.135907e+04 4.641589e+04
##  [46] 3.511192e+04 2.656088e+04 2.009233e+04 1.519911e+04 1.149757e+04
##  [51] 8.697490e+03 6.579332e+03 4.977024e+03 3.764936e+03 2.848036e+03
##  [56] 2.154435e+03 1.629751e+03 1.232847e+03 9.326033e+02 7.054802e+02
##  [61] 5.336699e+02 4.037017e+02 3.053856e+02 2.310130e+02 1.747528e+02
##  [66] 1.321941e+02 1.000000e+02 7.564633e+01 5.722368e+01 4.328761e+01
##  [71] 3.274549e+01 2.477076e+01 1.873817e+01 1.417474e+01 1.072267e+01
##  [76] 8.111308e+00 6.135907e+00 4.641589e+00 3.511192e+00 2.656088e+00
##  [81] 2.009233e+00 1.519911e+00 1.149757e+00 8.697490e-01 6.579332e-01
##  [86] 4.977024e-01 3.764936e-01 2.848036e-01 2.154435e-01 1.629751e-01
##  [91] 1.232847e-01 9.326033e-02 7.054802e-02 5.336699e-02 4.037017e-02
##  [96] 3.053856e-02 2.310130e-02 1.747528e-02 1.321941e-02 1.000000e-02
coef(fit)[1:5, 1:5] # display some of coefficients
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                  s0      s1      s2      s3      s4
## (Intercept) 520.015 520.015 520.015 520.015 520.015
## Income        .       .       .       .       .    
## Limit         .       .       .       .       .    
## Rating        .       .       .       .       .    
## Cards         .       .       .       .       .
fit$l1 <- colSums(abs(as.matrix(coef(fit)[-1,]))) # L1 Norms
# Normalized L1 Norms (w.r.t L1 Norm of least square)
fit$norml1 <- fit$l1/sum(abs(coef(fit.lsq)[-1]))

# plot results
par(mfrow = c(1,2))
library(RColorBrewer)
palette <- brewer.pal(11, "Spectral")

# lambda vs coefficient
plot(fit, xvar = "lambda", col = palette, xlim = c(-5,7), 
     ylab = "Standardized Coefficients")
legend(x=-5, y=-25, fit$beta@Dimnames[[1]], 
       col = palette, lty = rep(1,11), cex = 0.35)

# L1 Norm vs coefficients
res_beta <- as.matrix(fit$beta)
plot(fit$norml1, res_beta[1,], type = 'l', col = palette[1], xlim = c(0, 1),
     ylim = c(-300, 400), xlab = "Normalized L1Norm", ylab = "Standardized Coefficients")
for (i in 2:11) lines(fit$norml1, res_beta[i,], type = 'l', col = palette[[i]])
legend(x=0, y=-25, fit$beta@Dimnames[[1]], 
       col = palette, lty = rep(1,11), cex = 0.35)

par(mfrow = c(1,1))

# lasso model
lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = grid)

# cv of lasso
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 1)
plot(cv.out)

bestlambda <- cv.out$lambda.min; bestlambda
## [1] 0.6767748
lasso.pred <- predict(lasso.mod, s = bestlambda, newx = x[test,])
mean((lasso.pred - y.test)^2)
## [1] 10616.67
# final model 
out <- glmnet(x, y, alpha = 1, lambda = grid)
predict(out, type = "coefficients", s = bestlambda)[1:12,]
##        (Intercept)             Income              Limit 
##         520.015000        -271.797813         401.236258 
##             Rating              Cards                Age 
##         211.932352          22.192181         -10.225534 
##          Education       GenderFemale         StudentYes 
##          -2.580207          -4.540163         126.872596 
##         MarriedYes     EthnicityAsian EthnicityCaucasian 
##          -3.665522           5.784060           3.434557

One can show that the lasso and ridge regression coefficient estimates solve the problems \[\min_{\beta} \Bigg\{\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij}\Bigg)^2\Bigg\} \hspace{1cm} \text{subject to} \hspace{1cm} \sum_{j=1}^p|\beta_j|\leq s\] and \[\min_{\beta} \Bigg\{\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij}\Bigg)^2\Bigg\} \hspace{1cm} \text{subject to} \hspace{1cm} \sum_{j=1}^p\beta_j^2\leq s\] respectively. When \(p=2\), the lasso coefficient estimates have the smallest RSS out of all points that lie within the diamond defined by \(|\beta_1|+|\beta_2|\leq s\). Similarly, the ridge regression estimates have the smallest RSS out of all points that lie within the circle defined by \(\beta_1^2+\beta_2^2\leq s\). Consider the problem \[\min_{\beta} \Bigg\{\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij}\Bigg)^2\Bigg\} \hspace{1cm} \text{subject to} \hspace{1cm} \sum_{j=1}^p I(\beta_j\neq 0)\leq s\] Here \(I(\beta_j \neq 0)\) is an indicator variable. It takes on a value of 1 if \(\beta_j \neq 0\), and equals zero otherwise. This is equivalent to best subset selection. Unfortunately, solving the above equation is computationally infeasible when \(p\) is large, since it requires considering all \(p \choose s\) models containing \(s\) predictors. Therefore, we can interpret ridge regression and the lasso as computationally feasible alternatives to best subset selection that replace the intractable form of the best subset selection.

As you can see in the picture above, the constant-RSS contours marked by the ellipses intersect the circular constraint region away from the axes, and therefore ridge regression has non-zero \(\beta_1\) and \(\beta_2\). On the other hand, for the lasso, the intersection occurs on the \(\beta_2\) axis, so only the \(\beta_2\) coefficient is nonzero and \(\beta_1\) is exactly zero. When \(p = 3\), the constraint region for ridge regression becomes a sphere, and the constraint region for the lasso becomes a polyhedron. When \(p > 3\), the constraint for ridge regression becomes a hypersphere, and the constraint for the lasso becomes a polytope. However, the key ideas still hold. In particular, the lasso leads to feature selection when \(p > 2\) due to the sharp corners of the polyhedron or polytope.

So which is the better method, the lasso or ridge regression?

In general, the lasso tends to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that are related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set, as the short sketch below illustrates.
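Here is a brief sketch (not part of the original notes, reusing the `x`, `y`, and `train` objects defined above) that compares the minimum cross-validation error of ridge (`alpha = 0`) and the lasso (`alpha = 1`) on the same training data.

set.seed(1)
cv.ridge <- cv.glmnet(x[train,], y[train], alpha = 0)
cv.lasso <- cv.glmnet(x[train,], y[train], alpha = 1)
c(ridge = min(cv.ridge$cvm), lasso = min(cv.lasso$cvm))  # the smaller CV error suggests the better method here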

In order to obtain a better intuition about the behavior of ridge regression and the lasso, consider a simple special case with \(n=p\) and \(X = I\), the \(n \times n\) identity matrix, and assume we perform regression without an intercept. Then the least squares problem simplifies to finding \(\beta_1, ..., \beta_p\) that minimize \[\sum_{j=1}^p(y_j - \beta_j)^2\] and the least squares solution is given by \[\hat\beta_j = y_j\] In this setting, the ridge regression estimates take the form \[\hat\beta_j^R = \frac{y_j}{(1+\lambda)}\] and the lasso estimates take the form \[\hat\beta_j^L =\Bigg\{\begin{array}{lll} {y_j-\lambda/2} \hspace{1cm} \text{if } y_j > \lambda/2;\\ {y_j+\lambda/2} \hspace{1cm} \text{if } y_j < -\lambda/2;\\ 0 \hspace{2.3cm} \text{if } |y_j| \leq \lambda/2.\end{array} \]

As illustrated in the above figure, in ridge regression each least squares coefficient estimate is shrunken by the same proportion. In contrast, the lasso shrinks each least squares coefficient towards zero by a constant amount, \(\lambda/2\); the least squares coefficients that are less than \(\lambda/2\) in absolute value are shrunken entirely to zero. The type of shrinkage performed by the lasso in this simple setting is known as soft-thresholding. The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection. In general, though the situation is a little more complicated, the main ideas still hold approximately: ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.
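The following small sketch (not part of the original notes; the vector of \(y_j\) values is made up purely for illustration) computes both sets of estimates in this special case, showing the proportional shrinkage of ridge versus the soft-thresholding of the lasso.

# special case n = p, X = I, no intercept: least squares gives beta_hat_j = y_j
y_j <- c(-3, -0.4, 0.2, 1, 2.5)     # hypothetical least squares estimates
lambda <- 1
ridge_est <- y_j / (1 + lambda)                           # shrink every estimate by the same proportion
lasso_est <- sign(y_j) * pmax(abs(y_j) - lambda/2, 0)     # soft-thresholding by lambda/2
rbind(least_squares = y_j, ridge = ridge_est, lasso = lasso_est)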

Ridge regression and the lasso can also be interpreted in a Bayesian way. Let’s assume that the coefficient vector \(\beta\) has some prior distribution, say \(p(\beta)\), where \(\beta=(\beta_0, \beta_1, ..., \beta_p)^T\). The likelihood of the data can be written as \(f(Y |X, \beta)\), where \(X = (X_1,...,X_p)\). Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form \[p(\beta|X, Y) \propto f(Y|X,\beta)p(\beta|X) = f(Y|X, \beta)p(\beta)\] We assume the usual linear model \[Y = \beta_0 + X_1\beta_1 + ... + X_p\beta_p + \epsilon\] and suppose that the errors are independent and drawn from a normal distribution. Furthermore, assume that \(p(\beta) = \prod_{j=1}^p g(\beta_j)\), for some density function \(g\). It turns out that ridge regression and the lasso follow naturally from two special cases of \(g\):

  • If \(g\) is a Gaussian distribution with mean zero and standard deviation a function of \(\lambda\), then it follows that the posterior mode for \(\beta\) - that is, the most likely value for \(\beta\), given the data - is given by the ridge regression solution. (In fact, the ridge regression solution is also the posterior mean.)

  • If \(g\) is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of \(\lambda\), then it follows that the posterior mode for \(\beta\) is the lasso solution. (However, the lasso solution is not the posterior mean, and in fact, the posterior mean does not yield a sparse coefficient vector)

Therefore, from a Bayesian viewpoint, ridge regression and the lasso follow directly from assuming the usual linear model with normal errors, together with a simple prior distribution for \(\beta\). Notice that the lasso prior is steeply peaked at zero, while the Gaussian prior is flatter and fatter at zero. Hence, the lasso expects a priori that many of the coefficients are exactly zero, while ridge regression assumes the coefficients are randomly distributed about zero.
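As a quick visual check of this intuition, the base-R sketch below overlays a Laplace density (the lasso prior) and a Gaussian density (the ridge prior), both centered at zero; the unit scale parameters are chosen purely for illustration.

# Sketch: the lasso's Laplace prior is sharply peaked at zero, while the ridge
# regression Gaussian prior is flatter around zero (unit scales, for illustration)
beta    <- seq(-3, 3, length.out = 200)
gauss   <- dnorm(beta, mean = 0, sd = 1)
laplace <- 0.5 * exp(-abs(beta))                  # Laplace density with scale b = 1
plot(beta, laplace, type = "l", col = "firebrick",
     xlab = expression(beta), ylab = "prior density")
lines(beta, gauss, col = "cornflowerblue")
legend("topright", legend = c("Laplace (lasso)", "Gaussian (ridge)"),
       col = c("firebrick", "cornflowerblue"), lty = 1)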

So far we have discussed how ridge regression and the lasso work. To apply these methods properly, we need to choose an appropriate tuning parameter \(\lambda\) for each specific problem. Cross-validation provides a simple way to tackle this: as in the example above, we can choose the value of \(\lambda\) at which the minimum cross-validation error occurs.
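For reference, a brief sketch of this tuning step with cv.glmnet is given below; it assumes a model matrix x and a response vector y are already defined (the closing Hitters example later performs this in full).

# Sketch: choose lambda by 10-fold cross-validation (assumes x and y exist)
library(glmnet)
set.seed(1)
cv.ridge <- cv.glmnet(x, y, alpha = 0)   # alpha = 0 for ridge, alpha = 1 for the lasso
cv.ridge$lambda.min                      # lambda at the minimum cross-validation error
plot(cv.ridge)                           # CV error curve across the lambda path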

Dimension Reduction Methods - PCR, PLS (s33)

So far we have discussed methods that are defined using the original predictors. We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods. Let \(Z_1, Z_2, ..., Z_M\) represent \(M<p\) linear combinations of our original \(p\) predictors. That is, \[Z_m = \sum_{j=1}^p \phi_{jm}X_j\] for some constants \(\phi_{1m},\phi_{2m},..., \phi_{pm}\), \(m=1, ..., M\). We can then fit the linear regression model \[y_i = \theta_0 + \sum_{m=1}^M \theta_m Z_{im}+\epsilon_i, \hspace{1cm} i=1,...,n\] using least squares. The term dimension reduction comes from the fact that this approach reduces the problem of estimating the \(p+1\) coefficients \(\beta_0, \beta_1, ..., \beta_p\) to the simpler problem of estimating the \(M + 1\) coefficients \(\theta_0, \theta_1, ..., \theta_M\), where \(M<p\). Notice that \[\sum_{m=1}^M\theta_m Z_{im} = \sum_{m=1}^M \theta_m \sum_{j=1}^p \phi_{jm}x_{ij} = \sum_{j=1}^p\sum_{m=1}^M \theta_m \phi_{jm}x_{ij}=\sum_{j=1}^p\beta_jx_{ij}\] where \[\beta_j = \sum_{m=1}^M \theta_m\phi_{jm}\] Hence the regression above can be thought of as a special case of the original linear regression model. Dimension reduction serves to constrain the estimated \(\beta_j\) coefficients, since now they must take the form above. This constraint on the form of the coefficients has the potential to bias the coefficient estimates. However, in situations where \(p\) is large relative to \(n\), selecting a value of \(M \ll p\) can significantly reduce the variance of the fitted coefficients.
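The identity \(\beta_j = \sum_{m=1}^M \theta_m\phi_{jm}\) can be checked directly. The sketch below uses simulated data (purely for illustration) to fit a regression on the first \(M\) principal component scores and then recover the implied coefficients on the original predictors.

# Sketch: regression on M principal components is a constrained linear model in
# the original predictors, with beta_j = sum_m theta_m * phi_jm (simulated data)
set.seed(10)
n <- 50; p <- 4; M <- 2
X  <- matrix(rnorm(n * p), n, p)
y  <- X %*% c(1, -1, 0.5, 0) + rnorm(n)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
Z  <- pc$x[, 1:M]                              # the M transformed predictors Z_1, ..., Z_M
theta <- coef(lm(y ~ Z))[-1]                   # theta_1, ..., theta_M
beta_implied <- pc$rotation[, 1:M] %*% theta   # implied coefficients beta_j on the X_j
t(beta_implied)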

All dimension reduction methods work in two steps. First, the transformed predictors \(Z_1, Z_2,...,Z_M\) are obtained. Second, the model is fit using these \(M\) predictors. However, the choice of \(Z_1, Z_2,...,Z_M\) , or equivalently, the selection of the \(\phi_{jm}\)’s, can be achieved in different ways. We will consider principal components and partial least squares here.

 

Principal Components Regression

Principal components analysis (PCA) is a technique for reducing the dimension of an \(n\times p\) data matrix \(X\). The first principal component direction of the data is that along which the observations vary the most. For example, with \(X \in R^2\), the first PC can be represented as \[Z_1 = \phi_{11} \times (X_1 - \bar X_1) + \phi_{21} \times (X_2 - \bar X_2)\] Here \(\phi_{11}\) and \(\phi_{21}\) are the principal component loadings, which define the direction of the first principal component. The idea is that out of every possible linear combination of \(X_1\) and \(X_2\) such that \(\phi_{11}^2 + \phi_{21}^2=1\), this particular linear combination yields the highest variance; i.e., this is the linear combination for which \(Var(Z_1)\) is maximized. If the two loadings are both positive and of similar size, then \(Z_1\) is almost an average of the two variables. The values \(z_{11},...,z_{n1}\) are known as the principal component scores. There is also another interpretation of PCA: the first principal component vector defines the line that is as close as possible to the data.

The Principal components regression (PCR) approach involves constructing the first \(M\) principal components \(Z_1, ..., Z_M\), and then using these components as the predictors in a linear regression model that is fit using least squares. The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response. In other words, we assume that the directions in which \(X_1,...,X_p\) show the most variation are the directions that are associated with \(Y\).

We note that even though PCR provides a simple way to perform regression using \(M<p\) predictors, it is not a feature selection method. This is because each of the \(M\) principal components used in the regression is a linear combination of all \(p\) of the original features. In this sense, PCR is more closely related to ridge regression than to the lasso. In fact, one can show that PCR and ridge regression are very closely related. One can even think of ridge regression as a continuous version of PCR.

In PCR, the number of principal components, \(M\), is typically chosen by cross-validation. When performing PCR, we generally recommend standardizing each predictor, prior to generating the principal components. This standardization ensures that all variables are on the same scale. In the absence of standardization, the high-variance variables will tend to play a larger role in the principal components obtained, and the scale on which the variables are measured will ultimately have an effect on the final PCR model.

# toy example
library(boot)
data(bigcity); head(bigcity) # 1920 and 1930 US population data set
##     u   x
## 1 138 143
## 2  93 104
## 3  61  69
## 4 179 260
## 5  48  75
## 6  37  63
city.pca <- prcomp(bigcity, center = TRUE, scale.=TRUE)
summary(city.pca)
## Importance of components:
##                           PC1     PC2
## Standard deviation     1.4077 0.13512
## Proportion of Variance 0.9909 0.00913
## Cumulative Proportion  0.9909 1.00000
# manually compute PC scores from the loadings (rotation matrix) applied to the
# centered and scaled data; equivalent to city.pca$x[, 1] and city.pca$x[, 2]
pc1_score <- as.numeric(city.pca$rotation[1, 1] * scale(bigcity$u) +
                          city.pca$rotation[2, 1] * scale(bigcity$x))
pc2_score <- as.numeric(city.pca$rotation[1, 2] * scale(bigcity$u) +
                          city.pca$rotation[2, 2] * scale(bigcity$x))
df1 <- data.frame(u = bigcity$u, pc1_score, x = bigcity$x); df1 <- df1[order(df1$u),]
df2 <- data.frame(u = bigcity$u, pc2_score, x = bigcity$x); df2 <- df2[order(df2$u),]

# plot the data with the first principal component direction overlaid
# (the line that is as close as possible to the data)
plot(x ~ u, data = bigcity, pch = 16, col = "cornflowerblue",
     xlab = "US population in 1920", ylab = "US population in 1930")
pc1.slope <- (city.pca$rotation[2, 1] * sd(bigcity$x)) /
  (city.pca$rotation[1, 1] * sd(bigcity$u))
abline(a = mean(bigcity$x) - pc1.slope * mean(bigcity$u), b = pc1.slope, col = "firebrick")

# PC vs predictors
par(mfrow=c(2,2))
plot(df1$pc1_score[order(df1$pc1_score)], df1$u[order(df1$pc1_score)], col = "firebrick", 
     pch = 16, xlab = "1st Principal Components", ylab = "US population in 1920")
plot(df1$pc1_score[order(df1$pc1_score)], df1$x[order(df1$pc1_score)], col = "firebrick", 
     pch = 16, xlab = "1st Principal Components", ylab = "US population in 1930")
plot(df2$pc2_score[order(df2$pc2_score)], df2$u[order(df2$pc2_score)], col = "firebrick", 
     pch = 16, xlab = "2nd Principal Components", ylab = "US population in 1920")
plot(df2$pc2_score[order(df2$pc2_score)], df2$x[order(df2$pc2_score)], col = "firebrick", 
     pch = 16, xlab = "2nd Principal Components", ylab = "US population in 1930")

par(mfrow=c(1,1))

# visualize pca
# library(devtools)
# install_github("vqv/ggbiplot")
library(ggbiplot)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:ggvis':
## 
##     resolution
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:ggvis':
## 
##     fullseq, zero_range
## Loading required package: grid
ggbiplot(city.pca, labels = rownames(bigcity))

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(2)
# Prep data
credit <- Credit[,-1] 
x <- model.matrix(Balance ~ ., credit)[,-1]
y <- credit$Balance
x <- scale(x)

pcr.fit <- pcr(y ~ x, scale = TRUE, validation = "CV") # 10-fold CV
summary(pcr.fit)
## Data:    X dimension: 400 11 
##  Y dimension: 400 1
## Fit method: svdpc
## Number of components considered: 11
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           460.3    299.2    299.4    293.3    291.6    289.0    286.6
## adjCV        460.3    298.9    299.1    288.7    291.9    290.9    293.6
##        7 comps  8 comps  9 comps  10 comps  11 comps
## CV       264.4    264.9    265.8     100.3     100.3
## adjCV    263.8    264.5    265.5     100.1     100.1
## 
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X    25.05    39.64    49.73    59.74    68.89    77.73    86.43    93.91
## y    58.07    58.37    60.78    60.90    61.46    63.11    68.70    68.71
##    9 comps  10 comps  11 comps
## X    97.60     99.98    100.00
## y    68.72     95.47     95.51
validationplot(pcr.fit, val.type = "MSEP")

# train vs test
set.seed(1)
pcr.fit <- pcr(Balance ~., data = credit, subset = train, scale = TRUE, validation = "CV") # 10-fold CV
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred1 <- predict(pcr.fit, x[test,], ncomp = 10)
pcr.pred2 <- predict(pcr.fit, x[test,], ncomp = 1)
mean((pcr.pred1 - y.test)^2)
## [1] 1357772
mean((pcr.pred2 - y.test)^2)
## [1] 945116.9
pcr.fit <- pcr(y ~ x, scale = TRUE, ncomp = 10)
summary(pcr.fit)
## Data:    X dimension: 400 11 
##  Y dimension: 400 1
## Fit method: svdpc
## Number of components considered: 10
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X    25.05    39.64    49.73    59.74    68.89    77.73    86.43    93.91
## y    58.07    58.37    60.78    60.90    61.46    63.11    68.70    68.71
##    9 comps  10 comps
## X    97.60     99.98
## y    68.72     95.47

 

Partial Least Squares

The PCR approach that we just described involves identifying linear combinations, or directions, that best represent the predictors \(X_1,...,X_p\). We now present partial least squares (PLS), a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new set of features \(Z_1,...,Z_M\) that are linear combinations of the original features, and then fits a linear model via least squares using these \(M\) new features. But unlike PCR, PLS identifies these new features in a supervised way - that is, it makes use of the response \(Y\) in order to identify new features that not only approximate the old features well, but also that are related to the response. Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.

PLS computes the first direction \(Z_1\) by setting each \(\phi_{j1}\) in the PCA equation equal to the coefficient from the simple linear regression of \(Y\) onto \(X_j\). This coefficient is proportional to the correlation between \(Y\) and \(X_j\). Hence, in computing \(Z_1 = \sum_{j=1}^p \phi_{j1}X_j\), PLS places the highest weight on the variables that are most strongly related to the response. The second PLS direction can be identified by first adjusting each of the variables for \(Z_1\): we regress each variable on \(Z_1\) and take the residuals. These residuals can be interpreted as the remaining information that has not been explained by the first PLS direction. We then compute \(Z_2\) using this orthogonalized data in exactly the same fashion as \(Z_1\) was computed from the original data. This iterative approach can be repeated \(M\) times to identify multiple PLS components \(Z_1,...,Z_M\). Finally, at the end of this procedure, we use least squares to fit a linear model to predict \(Y\) using \(Z_1,...,Z_M\), exactly as for PCR. As with PCR, the number \(M\) of partial least squares directions used in PLS is a tuning parameter that is typically chosen by cross-validation. We generally standardize the predictors and response before performing PLS.
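To make the construction of the first PLS direction concrete, the sketch below works on standardized simulated data; it is an illustration of the idea rather than the pls package's internal algorithm. Each weight \(\phi_{j1}\) is taken from the simple regression of \(Y\) on \(X_j\), and the resulting component is compared with the first component returned by plsr().

# Sketch: build the first PLS direction from the simple-regression coefficients
# of Y on each standardized X_j (simulated data; agrees with plsr's first score
# up to sign and scale)
set.seed(4)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))
Y <- scale(X %*% c(2, 0.5, 0) + rnorm(n))
phi1 <- apply(X, 2, function(xj) coef(lm(Y ~ xj))[2])  # phi_j1, proportional to cor(Y, X_j)
Z1   <- X %*% phi1                                     # first PLS component
pls.check <- plsr(Y ~ X, ncomp = 1)                    # pls package fit for comparison
cor(Z1, scores(pls.check)[, 1])                        # essentially +/- 1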

set.seed(1)
# Prep data
credit <- Credit[,-1] 
x <- model.matrix(Balance ~ ., credit)[,-1]
y <- credit$Balance
x <- scale(x)

pls.fit <- plsr(y ~ x, subset = train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 200 11 
##  Y dimension: 200 1
## Fit method: kernelpls
## Number of components considered: 11
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           480.1    266.0    198.8    110.4    101.4    100.4    100.4
## adjCV        480.1    265.6    197.7    107.5    100.9    100.0    100.0
##        7 comps  8 comps  9 comps  10 comps  11 comps
## CV      100.26    100.2    99.90    100.14    100.23
## adjCV    99.89     99.8    99.44     99.74     99.83
## 
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X    25.41    34.17    38.13    50.48    59.30    64.01    71.70    80.47
## y    70.98    85.09    95.77    96.14    96.19    96.19    96.19    96.20
##    9 comps  10 comps  11 comps
## X    82.66     90.41    100.00
## y    96.24     96.24     96.24
validationplot(pls.fit, val.type = "MSEP")

pls.pred <- predict(pls.fit, x[test,], ncomp=3)
mean((pls.pred - y.test)^2)
## [1] 12191.2
pls.fit <- plsr(y ~ x, scale = TRUE, ncomp = 3)
summary(pls.fit)
## Data:    X dimension: 400 11 
##  Y dimension: 400 1
## Fit method: kernelpls
## Number of components considered: 3
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps
## X    24.58    32.53    37.84
## y    69.67    86.53    94.95

Considerations in High Dimensions

High-Dimensional Data

Most traditional statistical techniques for regression and classification are intended for the low-dimensional setting in which \(n\), the number of observations, is much greater than \(p\), the number of features. In the past 20 years, new technologies have changed the way that data are collected in fields as diverse as finance, marketing, and medicine. It is now commonplace to collect an almost unlimited number of feature measurements (\(p\) very large). Data sets containing more features than observations are often referred to as high-dimensional.

What goes wrong in high dimensions?

When the number of features \(p\) is as large as, or larger than, the number of observations \(n\), least squares cannot (or rather, should not) be performed. The reason is simple: regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are zero, as you can see in the figure below. In other words, when \(p>n\) or \(p \approx n\), a simple least squares regression line is too flexible and hence overfits the data.
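A tiny simulation (hypothetical data, with predictors that are pure noise) illustrates the point: when \(p\) is close to \(n\), least squares drives the training residuals to zero even though there is no signal at all.

# Sketch: least squares on pure noise with p = n - 1 irrelevant predictors still
# gives a "perfect" training fit (R^2 = 1, zero residuals)
set.seed(1)
n <- 20
X <- matrix(rnorm(n * (n - 1)), nrow = n)   # p = n - 1 features, unrelated to y
y <- rnorm(n)                               # response is pure noise
fit <- lm(y ~ X)
summary(fit)$r.squared                      # equals 1 even though nothing is related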

Regression in High Dimensions

The figure above shows the results of linear regression on features that are completely unrelated to the response. As shown there, the model \(R^2\) increases to 1 and the training set MSE decreases to 0 as the number of features included in the model grows, even though the features carry no information about the response. On the other hand, the MSE on an independent test set becomes extremely large as the number of features included in the model increases, because including the additional predictors leads to a vast increase in the variance of the coefficient estimates. Looking at the test set MSE, it is clear that the best model contains at most a few variables. However, someone who carelessly examines only the \(R^2\) or the training set MSE might erroneously conclude that the model with the greatest number of variables is best. This indicates the importance of applying extra care when analyzing data sets with a large number of variables, and of always evaluating model performance on an independent test set. Unfortunately, the \(C_p\), AIC, and BIC approaches are not appropriate in the high-dimensional setting, because estimating \(\hat\sigma^2\) is problematic.

It turns out that many of the methods seen in this tutorial for fitting less flexible least squares models, such as forward stepwise selection, ridge regression, the lasso, and principal components regression, are particularly useful for performing regression in the high-dimensional setting. Essentially, these approaches avoid overfitting by using a less flexible fitting approach than least squares.

The figure above illustrates a setting in which the lasso was performed with \(n = 100\) observations and three values of \(p\), the number of features. Of the \(p\) features, 20 were associated with the response. As the figure shows, the lasso performs well when \(p\) is relatively close to the number of truly associated features, but it loses the ability to discriminate the relevant features when \(p\) is far larger than the number of true features. The figure highlights three important points:

  1. regularization or shrinkage plays a key role in high-dimensional problems,
  2. appropriate tuning parameter selection is crucial for good predictive performance
  3. the test error tends to increase as the dimensionality of the problem (i.e. the number of features or predictors) increases, unless the additional features are truly associated with the response.

The third point above is in fact a key principle in the analysis of high-dimensional data, which is known as the curse of dimensionality. In general, adding additional signal features that are truly associated with the response will improve the fitted model, in the sense of leading to a reduction in test set error. However, adding noise features that are not truly associated with the response will lead to a deterioration in the fitted model, and consequently an increased test set error.

Interpreting Results in High Dimensions

When we perform the lasso, ridge regression, or other regression procedures in the high-dimensional setting, we must be quite cautious in the way that we report the results obtained. In the high-dimensional setting, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all of the other variables in the model. Essentially, this means that we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients for use in the regression.

For instance, suppose that we are trying to predict blood pressure on the basis of half a million SNPs, and that forward stepwise selection indicates that 17 of those SNPs lead to a good predictive model on the training data. It would be incorrect to conclude that these 17 SNPs predict blood pressure more effectively than the other SNPs not included in the model. There are likely to be many sets of 17 SNPs that would predict blood pressure just as well as the selected model. If we were to obtain an independent data set and perform forward stepwise selection on that data set, we would likely obtain a model containing a different, and perhaps even non-overlapping, set of SNPs. This does not detract from the value of the model obtained - for instance, the model might turn out to be very effective in predicting blood pressure on an independent set of patients, and might be clinically useful for physicians. But we must be careful not to overstate the results obtained, and to make it clear that what we have identified is simply one of many possible models for predicting blood pressure, and that it must be further validated on independent data sets.

It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. We have seen that when \(p>n\), it is easy to obtain a useless model that has zero residuals. Therefore, one should never use sum of squared errors, p-values, \(R^2\) statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting. Reporting this fact might mislead others into thinking that a statistically valid and useful model has been obtained, whereas in fact this provides absolutely no evidence of a compelling model. It is important to instead report results on an independent test set, or cross-validation errors. For instance, the MSE or \(R^2\) on an independent test set is a valid measure of model fit, but the MSE on the training set certainly is not.

Closing example - Hitter data set

As a closing example, we are going to apply all the methods we have learned to analyze the Hitters data set. Here we wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year. The analysis pipeline is given below.

  1. preprocessing - remove NA values (player which do not have salary information)
  2. Best subset, Forward and Backward stepwise selection
  3. Validation of model (from best subset selection method) using validation set approach and cross validation approach
  4. Ridge regression & The Lasso
  5. PCR & PLS
  6. Which method is the best?
## 1. preprocessing ################
# data cleaning
library(ISLR)
sum(is.na(Hitters))
## [1] 59
hitters <- na.omit(Hitters)
names(hitters)
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"      
##  [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"   
## [11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division" 
## [16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"
## 2. Best subset selection ################
# model fit
library(leaps)
regfit.full <- regsubsets(Salary~., hitters, nvmax=19)
reg.summary <- summary(regfit.full)
names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
# print R^2
reg.summary$rsq
##  [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
##  [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
# plot rss and adjr2
par(mfrow = c(2,2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type ='l')
points(which.min(reg.summary$rss), 
       reg.summary$rss[which.min(reg.summary$rss)], col = "red", cex = 2, pch=20)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted Rsq", type ='l')
points(which.max(reg.summary$adjr2), 
       reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch=20)

# plot cp and bic
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type ='l')
points(which.min(reg.summary$cp), 
       reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch=20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type ='l')
points(which.min(reg.summary$bic), 
       reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch=20)

par(mfrow = c(1,1))

# Using base plot
plot(regfit.full, scale = "r2")

plot(regfit.full, scale = "adjr2")

plot(regfit.full, scale = "Cp")

plot(regfit.full, scale = "bic")

# confirm coefficient 
coef(regfit.full, which.min(reg.summary$bic))
##  (Intercept)        AtBat         Hits        Walks         CRBI 
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 
##    DivisionW      PutOuts 
## -122.9515338    0.2643076
## 2. Forward and Backward Stepwise selection ################
regfit.fwd <- regsubsets(Salary~., data = hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"
regfit.bwd <- regsubsets(Salary~., data = hitters, nvmax = 19, method = "backward")
regfit.hbd <- regsubsets(Salary~., data = hitters, nvmax = 19, method = "seqrep")
coef(regfit.full, 7)
##  (Intercept)         Hits        Walks       CAtBat        CHits 
##   79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073 
##       CHmRun    DivisionW      PutOuts 
##    1.4420538 -129.9866432    0.2366813
coef(regfit.fwd, 7)
##  (Intercept)        AtBat         Hits        Walks         CRBI 
##  109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622 
##       CWalks    DivisionW      PutOuts 
##   -0.3053070 -127.1223928    0.2533404
coef(regfit.bwd, 7)
##  (Intercept)        AtBat         Hits        Walks        CRuns 
##  105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095 
##       CWalks    DivisionW      PutOuts 
##   -0.7163346 -116.1692169    0.3028847
coef(regfit.hbd, 7)
##  (Intercept)        AtBat         Hits        Walks        CRuns 
##  105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095 
##       CWalks    DivisionW      PutOuts 
##   -0.7163346 -116.1692169    0.3028847
## 3. validation set & cross-validation
# validation set
set.seed(1)
p <- dim(hitters)[2]-1
train <- sample(c(TRUE, FALSE), nrow(hitters), rep = TRUE)
test <- (!train)
regfit.best <- regsubsets(Salary ~ ., data = hitters[train,], nvmax = 19)
test.mat <- model.matrix(Salary~., data = hitters[test,])
val.errors <- rep(NA, p)
for (i in 1:p){
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((hitters$Salary[test] - pred)^2)
}
val.errors
##  [1] 220968.0 169157.1 178518.2 163426.1 168418.1 171270.6 162377.1
##  [8] 157909.3 154055.7 148162.1 151156.4 151742.5 152214.5 157358.7
## [15] 158541.4 158743.3 159972.7 159859.8 160105.6
coef(regfit.best, which.min(val.errors))
## (Intercept)       AtBat        Hits       Walks      CAtBat       CHits 
## -80.2751499  -1.4683816   7.1625314   3.6430345  -0.1855698   1.1053238 
##      CHmRun      CWalks     LeagueN   DivisionW     PutOuts 
##   1.3844863  -0.7483170  84.5576103 -53.0289658   0.2381662
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars = names(coefi)
  mat[, xvars]%*%coefi
}
regfit.best <- regsubsets(Salary~., data = hitters, nvmax = p)
coef(regfit.best, which.min(val.errors))
##  (Intercept)        AtBat         Hits        Walks       CAtBat 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798 
##        CRuns         CRBI       CWalks    DivisionW      PutOuts 
##    1.4082490    0.7743122   -0.8308264 -112.3800575    0.2973726 
##      Assists 
##    0.2831680
# CV
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:19)))
for(j in 1:k){
  best.fit <- regsubsets(Salary~., data = hitters[folds !=j,], nvmax = 19)
  for(i in 1:19){
    pred <- predict(best.fit, hitters[folds == j,], id=i)
    cv.errors[j,i] = mean((hitters$Salary[folds==j]-pred)^2)
  }
}

mean.cv.errors <- apply(cv.errors, 2, mean)
reg.best <- regsubsets(Salary~., data = hitters, nvmax=19)
coef(reg.best, which.min(mean.cv.errors))
##  (Intercept)        AtBat         Hits        Walks       CAtBat 
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914 
##        CRuns         CRBI       CWalks      LeagueN    DivisionW 
##    1.4553310    0.7852528   -0.8228559   43.1116152 -111.1460252 
##      PutOuts      Assists 
##    0.2894087    0.2688277
## 4. Ridge regression and the lasso ################
# Ridge regression
x <- model.matrix(Salary~., hitters)[, -1]
y <- hitters$Salary

library(glmnet)
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x,y,alpha = 0, lambda = grid)
dim(coef(ridge.mod))
## [1]  20 100
ridge.mod$lambda[50]
## [1] 11497.57
coef(ridge.mod)[,50]
##   (Intercept)         AtBat          Hits         HmRun          Runs 
## 407.356050200   0.036957182   0.138180344   0.524629976   0.230701523 
##           RBI         Walks         Years        CAtBat         CHits 
##   0.239841459   0.289618741   1.107702929   0.003131815   0.011653637 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##   0.087545670   0.023379882   0.024138320   0.025015421   0.085028114 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
##  -6.215440973   0.016482577   0.002612988  -0.020502690   0.301433531
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.360612
ridge.mod$lambda[60]
## [1] 705.4802
coef(ridge.mod)[,60]
##  (Intercept)        AtBat         Hits        HmRun         Runs 
##  54.32519950   0.11211115   0.65622409   1.17980910   0.93769713 
##          RBI        Walks        Years       CAtBat        CHits 
##   0.84718546   1.31987948   2.59640425   0.01083413   0.04674557 
##       CHmRun        CRuns         CRBI       CWalks      LeagueN 
##   0.33777318   0.09355528   0.09780402   0.07189612  13.68370191 
##    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
## -54.65877750   0.11852289   0.01606037  -0.70358655   8.61181213
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 57.11001
predict(ridge.mod, s=50, type = "coefficients")[1:20,]
##   (Intercept)         AtBat          Hits         HmRun          Runs 
##  4.876610e+01 -3.580999e-01  1.969359e+00 -1.278248e+00  1.145892e+00 
##           RBI         Walks         Years        CAtBat         CHits 
##  8.038292e-01  2.716186e+00 -6.218319e+00  5.447837e-03  1.064895e-01 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##  6.244860e-01  2.214985e-01  2.186914e-01 -1.500245e-01  4.592589e+01 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
## -1.182011e+02  2.502322e-01  1.215665e-01 -3.278600e+00 -9.496680e+00
# ridge validation set
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]
ridge.mod <- glmnet(x[train,], y[train], alpha=0, lambda=grid, thresh=1e-12)
ridge.pred <- predict(ridge.mod, s=4, newx = x[test,])
mean((ridge.pred-y.test)^2) # test error
## [1] 101036.8
mean((mean(y[train])-y.test)^2) # test error of intercept
## [1] 193253.1
ridge.pred <- predict(ridge.mod, s= 1e10, newx = x[test,]) # test error of lambda = infty
mean((ridge.pred - y.test)^2)
## [1] 193253.1
ridge.pred <- predict(ridge.mod, s=0, newx = x[test,], exact = TRUE, x=x[train,], y=y[train])
mean((ridge.pred - y.test)^2)
## [1] 114783.1
lm(y~x, subset = train)
## 
## Call:
## lm(formula = y ~ x, subset = train)
## 
## Coefficients:
## (Intercept)       xAtBat        xHits       xHmRun        xRuns  
##   299.42849     -2.54027      8.36682     11.64512     -9.09923  
##        xRBI       xWalks       xYears      xCAtBat       xCHits  
##     2.44105      9.23440    -22.93673     -0.18154     -0.11598  
##     xCHmRun       xCRuns        xCRBI      xCWalks     xLeagueN  
##    -1.33888      3.32838      0.07536     -1.07841     59.76065  
##  xDivisionW     xPutOuts     xAssists      xErrors  xNewLeagueN  
##   -98.86233      0.34087      0.34165     -0.64207     -0.67442
predict(ridge.mod, alpha=0, type="coefficients", exact=TRUE, x=x[train,], y=y[train])[1:20,1:4]
## 20 x 4 sparse Matrix of class "dgCMatrix"
##                        s0            s1            s2            s3
## (Intercept)  5.479846e+02  5.479845e+02  5.479844e+02  5.479843e+02
## AtBat        5.192368e-08  6.864003e-08  9.073807e-08  1.199504e-07
## Hits         1.790580e-07  2.367041e-07  3.129088e-07  4.136470e-07
## HmRun        8.926999e-07  1.180096e-06  1.560018e-06  2.062251e-06
## Runs         3.141580e-07  4.152984e-07  5.489999e-07  7.257455e-07
## RBI          3.533162e-07  4.670632e-07  6.174300e-07  8.162059e-07
## Walks        4.732016e-07  6.255446e-07  8.269331e-07  1.093157e-06
## Years        1.198689e-06  1.584595e-06  2.094742e-06  2.769125e-06
## CAtBat       3.981558e-09  5.263384e-09  6.957883e-09  9.197911e-09
## CHits        1.501273e-08  1.984594e-08  2.623516e-08  3.468133e-08
## CHmRun       9.980324e-08  1.319340e-07  1.744089e-07  2.305583e-07
## CRuns        3.232510e-08  4.273187e-08  5.648901e-08  7.467513e-08
## CRBI         2.741340e-08  3.623890e-08  4.790568e-08  6.332848e-08
## CWalks       3.625065e-08  4.792122e-08  6.334902e-08  8.374366e-08
## LeagueN      3.471972e-06  4.589742e-06  6.067369e-06  8.020705e-06
## DivisionW   -8.090096e-06 -1.069463e-05 -1.413767e-05 -1.868917e-05
## PutOuts      2.727067e-08  3.605021e-08  4.765626e-08  6.299876e-08
## Assists      1.288616e-08  1.703474e-08  2.251892e-08  2.976869e-08
## Errors       2.804015e-07  3.706743e-07  4.900095e-07  6.477636e-07
## NewLeagueN   3.730442e-06  4.931425e-06  6.519053e-06  8.617804e-06
# ridge CV
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha=0)
plot(cv.out)

bestlam = cv.out$lambda.min
bestlam
## [1] 211.7416
ridge.pred <- predict(ridge.mod, s=bestlam, newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 96015.51
out <- glmnet(x,y,alpha=0)
predict(out, type="coefficients", s=bestlam)[1:20,]
##  (Intercept)        AtBat         Hits        HmRun         Runs 
##   9.88487157   0.03143991   1.00882875   0.13927624   1.11320781 
##          RBI        Walks        Years       CAtBat        CHits 
##   0.87318990   1.80410229   0.13074383   0.01113978   0.06489843 
##       CHmRun        CRuns         CRBI       CWalks      LeagueN 
##   0.45158546   0.12900049   0.13737712   0.02908572  27.18227527 
##    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
## -91.63411282   0.19149252   0.04254536  -1.81244470   7.21208394
# The lasso
lasso.mod <- glmnet(x[train,], y[train], alpha=1, lambda=grid)
plot(lasso.mod)

# lasso CV
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha=1)
plot(cv.out)

bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s=bestlam, newx = x[test,])
mean((lasso.pred - y.test)^2)
## [1] 100743.4
out <- glmnet(x, y, alpha=1, lambda = grid)
lasso.coef <- predict(out, type = "coefficients", s= bestlam)[1:20, ]
lasso.coef
##  (Intercept)        AtBat         Hits        HmRun         Runs 
##   18.5394844    0.0000000    1.8735390    0.0000000    0.0000000 
##          RBI        Walks        Years       CAtBat        CHits 
##    0.0000000    2.2178444    0.0000000    0.0000000    0.0000000 
##       CHmRun        CRuns         CRBI       CWalks      LeagueN 
##    0.0000000    0.2071252    0.4130132    0.0000000    3.2666677 
##    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
## -103.4845458    0.2204284    0.0000000    0.0000000    0.0000000
## 5. PCR and PLS Regression ################
# PCR
library(pls)
set.seed(2)

# cv error for full model
pcr.fit <- pcr(Salary~., data=hitters, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV             452    348.9    352.2    353.5    352.8    350.1    349.1
## adjCV          452    348.7    351.8    352.9    352.1    349.3    348.0
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       349.6    350.9    352.9     353.8     355.0     356.2     363.5
## adjCV    348.5    349.8    351.6     352.3     353.4     354.5     361.6
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        355.2     357.4     347.6     350.1     349.2     352.6
## adjCV     352.8     355.2     345.5     347.6     346.7     349.8
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         94.96    96.28     97.26     97.98     98.65     99.15     99.47
## Salary    46.75    46.86     47.76     47.82     47.85     48.10     50.40
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.75     99.89     99.97     99.99    100.00
## Salary     50.55     53.01     53.85     54.61     54.61
validationplot(pcr.fit, val.type = "MSEP")

# test CV error
set.seed(1)
pcr.fit <- pcr(Salary~., data=hitters, subset = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

summary(pcr.fit) # inspect minimum cv error components
## Data:    X dimension: 131 19 
##  Y dimension: 131 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           464.6    406.1    397.1    399.1    398.6    395.2    386.9
## adjCV        464.6    405.2    396.3    398.1    397.4    394.5    384.5
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       384.8    386.5    394.1     406.1     406.5     412.3     407.7
## adjCV    383.3    384.8    392.0     403.4     403.7     409.3     404.6
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        406.2     417.8     417.6     413.0     407.0     410.2
## adjCV     402.8     413.9     413.5     408.3     402.4     405.5
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.89    60.25    70.85    79.06    84.01    88.51    92.61
## Salary    28.44    31.33    32.53    33.69    36.64    40.28    40.41
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         95.20    96.78     97.63     98.27     98.89     99.27     99.56
## Salary    41.07    41.25     41.27     41.41     41.44     43.20     44.24
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.78     99.91     99.97    100.00    100.00
## Salary     44.30     45.50     49.66     51.13     51.18
pcr.pred <- predict(pcr.fit, x[test,], ncomp = 7)
mean((pcr.pred - y.test)^2)
## [1] 96556.22
pcr.fit <- pcr(y ~ x, scale = TRUE, ncomp = 7)
summary(pcr.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X    38.31    60.16    70.84    79.03    84.29    88.63    92.26
## y    40.63    41.58    42.17    43.22    44.90    46.48    46.69
# PLS
set.seed(1)
pls.fit <- plsr(Salary~., data=hitters, subset = train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 131 19 
##  Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           464.6    394.2    391.5    393.1    395.0    415.0    424.0
## adjCV        464.6    393.4    390.2    391.1    392.9    411.5    418.8
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       424.5    415.8    404.6     407.1     412.0     414.4     410.3
## adjCV    418.9    411.4    400.7     402.2     407.2     409.3     405.6
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        406.2     408.6     410.5     408.8     407.8     410.2
## adjCV     401.8     403.9     405.6     404.1     403.2     405.5
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.12    53.46    66.05    74.49    79.33    84.56    87.09
## Salary    33.58    38.96    41.57    42.43    44.04    45.59    47.05
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         90.74    92.55     93.94     97.23     97.88     98.35     98.85
## Salary    47.53    48.42     49.68     50.04     50.54     50.78     50.92
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.11     99.43     99.78     99.99    100.00
## Salary     51.04     51.11     51.15     51.16     51.18
pls.pred <- predict(pls.fit, x[test,], ncomp = 2)
mean((pls.pred - y.test)^2)
## [1] 101417.5
pls.fit <- plsr(Salary~., data = hitters, scale = TRUE, ncomp = 2)
summary(pls.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
##         1 comps  2 comps
## X         38.08    51.03
## Salary    43.05    46.40
## 6. Conclusion ################
error.bestsubset <- min(mean.cv.errors)
error.ridge <- mean((ridge.pred-y.test)^2)
error.lasso <- mean((lasso.pred - y.test)^2)
error.pcr <- mean((pcr.pred - y.test)^2)
error.pls <- mean((pls.pred - y.test)^2)

# CV foward
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:19)))
for(j in 1:k){
  best.fit <- regsubsets(Salary~., data = hitters[folds !=j,], nvmax = 19, method = "forward")
  for(i in 1:19){
    pred <- predict(best.fit, hitters[folds == j,], id=i)
    cv.errors[j,i] = mean((hitters$Salary[folds==j]-pred)^2)
  }
}
error.forward <- min(apply(cv.errors, 2, mean))

# CV backward
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:19)))
for(j in 1:k){
  best.fit <- regsubsets(Salary~., data = hitters[folds !=j,], nvmax = 19, method = "backward")
  for(i in 1:19){
    pred <- predict(best.fit, hitters[folds == j,], id=i)
    cv.errors[j,i] = mean((hitters$Salary[folds==j]-pred)^2)
  }
}
error.backward <- min(apply(cv.errors, 2, mean))

# CV hybrid
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:19)))
for(j in 1:k){
  best.fit <- regsubsets(Salary~., data = hitters[folds !=j,], nvmax = 19, method = "seqrep")
  for(i in 1:19){
    pred <- predict(best.fit, hitters[folds == j,], id=i)
    cv.errors[j,i] = mean((hitters$Salary[folds==j]-pred)^2)
  }
}
error.hybrid <- min(apply(cv.errors, 2, mean))

# print result
result <- data.frame(bestsub = error.bestsubset, fwd = error.forward, 
          bwd = error.backward, hyb = error.hybrid, ridge = error.ridge,
          lasso = error.lasso, pcr = error.pcr, pls = error.pls)

barplot(as.numeric(result), names.arg = colnames(result), ylab = "Test MSE", col = rainbow(8))

knitr::kable(result)
|  bestsub|      fwd|      bwd|      hyb|    ridge|    lasso|      pcr|      pls|
|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| 125153.8| 125654.9| 127701.7| 120589.3| 96015.51| 100743.4| 96556.22| 101417.5|

In conclusion, based on these error estimates, ridge regression performs best for this data set.

Advanced Topic

# Install the Gurobi optimizer & the Gurobi R package prior to installing the bestsubset package.
# See http://www.gurobi.com/index for more information.
# library(devtools)
# install_github(repo="ryantibs/best-subset", subdir="bestsubset")
library(bestsubset)
## Loading required package: gurobi
## Loading required package: slam
library(dplyr)  # for select() and the %>% pipe
X <- as.matrix(hitters %>% select(-c(Salary, League, Division, NewLeague)))  # numeric predictor matrix
Y <- hitters$Salary
res <- bestsubset::bs(X, Y)

Ref: Hastie, Trevor, Robert Tibshirani, and Ryan J. Tibshirani. “Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso.” arXiv preprint arXiv:1707.08692 (2017).

You can practice these techniques with the datasets provided in this Databank.