Basic Supervised Learning (James Ch4)
Logistic Regression, LDA, QDA, Naive Bayes, KNN
Tutorial: Comparison of classification methods with “Stock Market”, “Caravan Insurance” data
Resampling (James Ch5)
Validation Set, Leave-One-Out, K-fold Cross Validation, Bootstrap
Tutorial: Validation of the performance of different classification methods and resampling methods with “Auto” data
Model Selection (James Ch6)
Subset selection (Best subset selection, Stepwise selection)
Shrinkage Methods (Ridge Regression, Lasso)
Dimension Reduction Methods (PCR, PLS)
Tutorial: Test of the subset selection methods with validation of feature selection & dimension reduction (“Hitters” data)
In the regression setting, the standard linear model \[Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon\] is commonly used to describe the relationship between a response \(Y\) and a set of variables \(X_1, X_2, ..., X_p\). We have seen that one typically fits this model using least squares. Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.
Prediction Accuracy
When \(n \gg p\), least squares has low bias and low variance, and hence tends to perform well on the test set.
When \(n \approx p\), least squares has high variance, resulting in overfitting and poor test performance.
When \(p > n\), there is no longer a unique least squares coefficient estimate and the variance is infinite. By constraining or shrinking the estimated coefficients, we can reduce the variance at the cost of a negligible increase in bias.
In the \(p>n\) case, there is no least squares solution because \(X^TX\) is singular (not invertible), so we cannot compute \[\hat\beta = (X^TX)^{-1}X^Ty\] as illustrated in the short sketch below.
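A short illustration with simulated data (a hypothetical example, not part of the tutorials below): when \(p>n\), \(X^TX\) is rank-deficient and cannot be inverted.
# simulated design matrix with p = 10 predictors but only n = 5 observations
set.seed(1)
n <- 5; p <- 10
X <- matrix(rnorm(n * p), n, p)
qr(crossprod(X))$rank    # rank is at most n = 5, not p = 10
try(solve(crossprod(X))) # inversion fails: X^T X is (numerically) singular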
Model Interpretability
Irrelevant variables that are not associated with the response lead to unnecessary complexity in the resulting model.
By performing feature selection (variable selection), we can improve the interpretability of the model.
Some approaches, such as the lasso, ridge regression, and the elastic net, perform shrinkage and (in the case of the lasso and elastic net) variable selection automatically.
Here we will discuss three important classes of methods:
Subset selection: Identifying a subset of the \(p\) predictors that we believe to be related to the response.
Shrinkage: The estimated coefficients are shrunken towards zero relative to the least squares estimates. (a.k.a Regularization)
Dimension Reduction: Projecting the \(p\) predictors into a \(M\)-dimensional subspace, where \(M<p\). This is achieved by computing \(M\) different linear combinations, or projections, of the variables.
To perform best subset selection, we fit a separate least squares regression for each possible combination of the \(p\) predictors. That is, we fit all \(p\) models that contain exactly one predictor, all \(\begin{pmatrix} p \\ 2 \end{pmatrix} = \frac{p(p-1)}{2}\) models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best. The number of possible models is \(2^p\). For example, if we have three candidate predictors, there are \(2^3 = 8\) possible regression models we can consider: the null model, three one-variable models, three two-variable models, and the full three-variable model. If \(p=10\), we have \(2^{10} = 1024\) possible models. Therefore, the problem of selecting the best model from among the \(2^p\) possibilities considered by best subset selection is not trivial. This is usually broken up into two stages, as described in the algorithm below.
In the above algorithm, Step 2 identifies the best model (on the training data) for each subset size, in order to reduce the problem from one of \(2^p\) possible models to one of \(p + 1\) possible models, including the null model (which contains zero variables). Now, in order to select a single best model, we must simply choose among these \(p + 1\) options. Since we want the model that is best in terms of test error, in Step 3 we use cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\) in order to select among \(M_0, M_1, ..., M_p\). In the case of logistic regression, instead of ordering models by RSS we use the deviance, a measure that plays the role of RSS for a broader class of models. It is defined as \[D(y, \hat \theta) = 2\big(\log[p(y|\hat\theta_s)] - \log[p(y|\hat\theta)]\big)\] where \(\hat\theta_s\) denotes the fitted parameters of the saturated model and \(\hat\theta\) those of the model under consideration; the smaller the deviance, the better the fit.
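As a quick check of this definition (simulated binary data, not part of the labs below): for ungrouped 0/1 responses the saturated model fits each observation exactly, so its log-likelihood is zero and the deviance reported by glm() reduces to \(-2\) times the maximized log-likelihood of the fitted model.
set.seed(1)
xx <- rnorm(100)
yy <- rbinom(100, 1, plogis(0.5 * xx))
fit.dev <- glm(yy ~ xx, family = binomial)
deviance(fit.dev)
-2 * as.numeric(logLik(fit.dev))   # identical to deviance(fit.dev)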
Since the computational complexity of best subset selection grows as \(2^p\), it becomes computationally infeasible for values of \(p\) greater than around 40, even with extremely fast modern computers. There are computational shortcuts, so-called branch-and-bound techniques, for eliminating some choices, but these have their limitations as \(p\) gets large.
library(ISLR)
# credit dataset
library(leaps)
credit <- Credit[,-1]; dim(credit)
## [1] 400 11
regfit.credit <- regsubsets(Balance ~ ., data = credit, nvmax = dim(credit)[2]-1)
credit_summary <- summary(regfit.credit)
par(mfrow=c(1,2))
plot(credit_summary$rss, type = 'l', col = "red", ylim = c(0, 8e+07),
xlab = "Number of Predictors", ylab = "Residual Sum of Squares")
points(credit_summary$rss, col = "red", pch = 16)
plot(credit_summary$rsq, type = 'l', col = "red", ylim = c(0,1),
xlab = "Number of Predictors", ylab = "R^2")
points(credit_summary$rsq, col = "red", pch = 16)
par(mfrow=c(1,1))
# inspect data
library(ISLR)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dim(Hitters)
## [1] 322 20
glimpse(Hitters)
## Observations: 322
## Variables: 20
## $ AtBat <int> 293, 315, 479, 496, 321, 594, 185, 298, 323, 401, 57...
## $ Hits <int> 66, 81, 130, 141, 87, 169, 37, 73, 81, 92, 159, 53, ...
## $ HmRun <int> 1, 7, 18, 20, 10, 4, 1, 0, 6, 17, 21, 4, 13, 0, 7, 3...
## $ Runs <int> 30, 24, 66, 65, 39, 74, 23, 24, 26, 49, 107, 31, 48,...
## $ RBI <int> 29, 38, 72, 78, 42, 51, 8, 24, 32, 66, 75, 26, 61, 1...
## $ Walks <int> 14, 39, 76, 37, 30, 35, 21, 7, 8, 65, 59, 27, 47, 22...
## $ Years <int> 1, 14, 3, 11, 2, 11, 2, 3, 2, 13, 10, 9, 4, 6, 13, 3...
## $ CAtBat <int> 293, 3449, 1624, 5628, 396, 4408, 214, 509, 341, 520...
## $ CHits <int> 66, 835, 457, 1575, 101, 1133, 42, 108, 86, 1332, 13...
## $ CHmRun <int> 1, 69, 63, 225, 12, 19, 1, 0, 6, 253, 90, 15, 41, 4,...
## $ CRuns <int> 30, 321, 224, 828, 48, 501, 30, 41, 32, 784, 702, 19...
## $ CRBI <int> 29, 414, 266, 838, 46, 336, 9, 37, 34, 890, 504, 186...
## $ CWalks <int> 14, 375, 263, 354, 33, 194, 24, 12, 8, 866, 488, 161...
## $ League <fct> A, N, A, N, N, A, N, A, N, A, A, N, N, A, N, A, N, A...
## $ Division <fct> E, W, W, E, E, W, E, W, W, E, E, W, E, E, E, W, W, W...
## $ PutOuts <int> 446, 632, 880, 200, 805, 282, 76, 121, 143, 0, 238, ...
## $ Assists <int> 33, 43, 82, 11, 40, 421, 127, 283, 290, 0, 445, 45, ...
## $ Errors <int> 20, 10, 14, 3, 4, 25, 7, 9, 19, 0, 22, 11, 7, 6, 8, ...
## $ Salary <dbl> NA, 475.000, 480.000, 500.000, 91.500, 750.000, 70.0...
## $ NewLeague <fct> A, N, A, N, N, A, A, A, N, A, A, N, N, A, N, A, N, A...
sum(is.na(Hitters))
## [1] 59
hitters <- na.omit(Hitters)
sum(is.na(hitters))
## [1] 0
dim(hitters)
## [1] 263 20
# subset selection
regfit.full <- regsubsets(Salary ~ ., data = hitters, nvmax = 19)
regsummary <- summary(regfit.full)
regsummary$rsq # R^2
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
# plotting results
par(mfrow=c(2,2))
plot(regsummary$rss, xlab = "Number of Variables", ylab = "RSS", type = 'l')
plot(regsummary$adjr2, xlab = "Number of Variables", ylab = "Adjusted Rsq", type = 'l')
points(which.max(regsummary$adjr2), regsummary$adjr2[which.max(regsummary$adjr2)],
col = "red", cex = 2, pch = 20)
plot(regsummary$cp, xlab = "Number of Variables", ylab = "Cp", type = 'l')
points.default(which.min(regsummary$cp), regsummary$cp[which.min(regsummary$cp)],
col = "red", cex = 2, pch = 20)
plot(regsummary$bic, xlab = "Number of Variables", ylab = "BIC", type = 'l')
points(which.min(regsummary$bic), regsummary$bic[which.min(regsummary$bic)],
col = "red", cex = 2, pch = 20)
par(mfrow=c(1,1))
# plot R^2 with the ggvis package
library(ggvis)
rsq <- as.data.frame(regsummary$rsq)
names(rsq) <- "R2"
rsq %>%
ggvis(x=~ c(1:nrow(rsq)), y= ~ R2 ) %>%
layer_points(fill = ~ R2 ) %>%
add_axis("y", title = "R2") %>%
add_axis("x", title = "Number of variables")
# built-in plot method
plot(regfit.full, scale = "r2", col = "red")
plot(regfit.full, scale = "adjr2", col = "red")
plot(regfit.full, scale = "Cp", col = "red")
plot(regfit.full, scale = "bic", col = "red")
# best model according to the BIC criterion
coef(regfit.full, which.min(regsummary$bic))
## (Intercept) AtBat Hits Walks CRBI
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169
## DivisionW PutOuts
## -122.9515338 0.2643076
For computational reasons, best subset selection cannot be applied with very large \(p\). Best subset selection may also suffer from statistical problems when \(p\) is large, because an enormous search space can lead to overfitting and high variance of the coefficient estimates. For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. More formally, the forward stepwise selection procedure is given in Algorithm 2 below.
Forward Stepwise Selection
Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
Unlike best subset selection, which involved fitting \(2^p\) models, forward stepwise selection involves fitting one null model, along with \(p - k\) models in the \(k_{th}\) iteration, for \(k = 0, ...,p - 1\). This amounts to a total of \(1 + \sum_{k=0}^{p - 1}(p - k) = 1+\frac{p(p+1)}{2}\) models. This is a substantial difference: when \(p = 20\), best subset selection requires fitting \(2^{20} = 1,048,576\) models, whereas forward stepwise selection requires fitting only \(1+\frac{20(21)}{2}=211\) models. However, forward stepwise selection is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the \(p\) predictors, because the best model of a given size need not contain the variables selected at earlier steps. For instance, the best two-variable model may not contain the single best predictor, in which case forward stepwise selection can never reach it.
Forward stepwise selection can be applied even in the high-dimensional setting where \(n<p\); however, in this case, it is possible to construct submodels \(M_0, ... ,M_{n-1}\) only, since each submodel is fit using least squares, which will not yield a unique solution if \(p \geq n\).
Backward Stepwise Selection
Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all \(p\) predictors, and then iteratively removes the least useful predictor, one-at-a-time.
Like forward stepwise selection, the backward selection approach searches through only \(1+\frac{p(p+1)}{2}\) models, and so can be applied in settings where \(p\) is too large to apply best subset selection. Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the \(p\) predictors. Backward selection requires that the number of samples \(n\) is larger than the number of variables \(p\) (so that the full model can be fit). In contrast, forward stepwise can be used even when \(n<p\), and so it is the only viable subset method when \(p\) is very large.
Hybrid Approaches
The best subset, forward stepwise, and backward stepwise selection approaches generally give similar but not identical models. As another alternative, hybrid versions of forward and backward stepwise selection are available, in which variables are added to the model sequentially, in analogy to forward selection. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.
credit <- Credit[,-1]; dim(credit)
## [1] 400 11
library(leaps)
# best subset
regfit.best <- regsubsets(Balance ~ ., data = credit,
nvmax = dim(credit)[2]-1, method = "exhaustive") # default, best
res.best <- summary(regfit.best)
# forward
regfit.fwd <- regsubsets(Balance~.,data=credit,
nvmax=dim(credit)[2]-1, method="forward")
res.fwd <- summary(regfit.fwd)
# backward
regfit.bwd <- regsubsets(Balance~.,data=credit,
nvmax=dim(credit)[2]-1, method="backward")
res.bwd <- summary(regfit.bwd)
# hybrid
regfit.hbd <- regsubsets(Balance~.,data=credit,
nvmax=dim(credit)[2]-1, method="seqrep")
res.hbd <- summary(regfit.hbd)
## Which variables are selected by each method?
# best subset selection
subset_list <- c()
for (i in 1:(dim(credit)[2]-1)) {
subset_list[i] <- list(colnames(res.best$which[,res.best$which[i,]]))
}
message("Best Subset Selection")
## Best Subset Selection
names(subset_list) <- paste(1:10, "Variables"); subset_list
## $`1 Variables`
## [1] "(Intercept)" "Rating"
##
## $`2 Variables`
## [1] "(Intercept)" "Income" "Rating"
##
## $`3 Variables`
## [1] "(Intercept)" "Income" "Rating" "StudentYes"
##
## $`4 Variables`
## [1] "(Intercept)" "Income" "Limit" "Cards" "StudentYes"
##
## $`5 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "StudentYes"
##
## $`6 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "Age" "StudentYes"
##
## $`7 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
##
## $`8 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
## [9] "EthnicityAsian"
##
## $`9 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
## [9] "MarriedYes" "EthnicityAsian"
##
## $`10 Variables`
## [1] "(Intercept)" "Income" "Limit"
## [4] "Rating" "Cards" "Age"
## [7] "GenderFemale" "StudentYes" "MarriedYes"
## [10] "EthnicityAsian" "EthnicityCaucasian"
# foward selection
fwd_list <- c()
for (i in 1:(dim(credit)[2]-1)) {
fwd_list[i] <- list(colnames(res.fwd$which[,res.fwd$which[i,]]))
}
message("Forward Selection")
## Forward Selection
names(fwd_list) <- paste(1:10, "Variables"); fwd_list
## $`1 Variables`
## [1] "(Intercept)" "Rating"
##
## $`2 Variables`
## [1] "(Intercept)" "Income" "Rating"
##
## $`3 Variables`
## [1] "(Intercept)" "Income" "Rating" "StudentYes"
##
## $`4 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "StudentYes"
##
## $`5 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "StudentYes"
##
## $`6 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "Age" "StudentYes"
##
## $`7 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
##
## $`8 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
## [9] "EthnicityAsian"
##
## $`9 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
## [9] "MarriedYes" "EthnicityAsian"
##
## $`10 Variables`
## [1] "(Intercept)" "Income" "Limit"
## [4] "Rating" "Cards" "Age"
## [7] "GenderFemale" "StudentYes" "MarriedYes"
## [10] "EthnicityAsian" "EthnicityCaucasian"
# backward selection
bwd_list <- c()
for (i in 1:(dim(credit)[2]-1)) {
bwd_list[i] <- list(colnames(res.bwd$which[,res.bwd$which[i,]]))
}
message("Backward Selection")
## Backward Selection
names(bwd_list) <- paste(1:10, "Variables"); bwd_list
## $`1 Variables`
## [1] "(Intercept)" "Limit"
##
## $`2 Variables`
## [1] "(Intercept)" "Income" "Limit"
##
## $`3 Variables`
## [1] "(Intercept)" "Income" "Limit" "StudentYes"
##
## $`4 Variables`
## [1] "(Intercept)" "Income" "Limit" "Cards" "StudentYes"
##
## $`5 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "StudentYes"
##
## $`6 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "Age" "StudentYes"
##
## $`7 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
##
## $`8 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
## [9] "EthnicityAsian"
##
## $`9 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
## [9] "MarriedYes" "EthnicityAsian"
##
## $`10 Variables`
## [1] "(Intercept)" "Income" "Limit"
## [4] "Rating" "Cards" "Age"
## [7] "GenderFemale" "StudentYes" "MarriedYes"
## [10] "EthnicityAsian" "EthnicityCaucasian"
# Hybrid selection
hbd_list <- c()
for (i in 1:(dim(credit)[2]-1)) {
hbd_list[i] <- list(colnames(res.hbd$which[,res.hbd$which[i,]]))
}
message("Hybrid Approach")
## Hybrid Approach
names(hbd_list) <- paste(1:10, "Variables"); hbd_list
## $`1 Variables`
## [1] "(Intercept)" "Rating"
##
## $`2 Variables`
## [1] "(Intercept)" "Income" "Limit"
##
## $`3 Variables`
## [1] "(Intercept)" "Income" "Rating" "StudentYes"
##
## $`4 Variables`
## [1] "(Intercept)" "Income" "Limit" "Cards" "StudentYes"
##
## $`5 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "Age"
##
## $`6 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "Age" "Education"
##
## $`7 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
##
## $`8 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
## [9] "EthnicityAsian"
##
## $`9 Variables`
## [1] "(Intercept)" "Income" "Limit" "Rating"
## [5] "Cards" "Age" "GenderFemale" "StudentYes"
## [9] "MarriedYes" "EthnicityAsian"
##
## $`10 Variables`
## [1] "(Intercept)" "Income" "Limit"
## [4] "Rating" "Cards" "Age"
## [7] "GenderFemale" "StudentYes" "MarriedYes"
## [10] "EthnicityAsian" "EthnicityCaucasian"
Now we want to determine which of these models is best. As we discussed in the previous section, the model containing all of the predictors will always have the smallest RSS and the largest \(R^2\), since these quantities are related to the training error. Instead, we wish to choose a model with a low test error. There are two common approaches:
We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
We can directly estimate the test error, using either a validation set approach or a cross-validation approach.
A number of techniques for adjusting the training error for the model size are available. These approaches can be used to select among a set of models with different numbers of variables. We now consider four such approaches: \(C_p\), the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and adjusted \(R^2\).
Mallows’ \(C_p\)
Mallows’ \(C_p\) is defined as \[C_p = \frac{1}{n} (RSS + 2d\hat\sigma^2) = MSE_{train} + \frac{2d}{n}\hat\sigma^2\] where \(\hat\sigma^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement in the regression model. Typically \(\hat\sigma^2\) is estimated using the full model containing all predictors. Essentially, the \(C_p\) statistic adds a penalty of \(2d\hat\sigma^2\) to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. If \(\hat\sigma^2\) is an unbiased estimate of \(\sigma^2\), then \(C_p\) is an unbiased estimate of the test MSE. We choose the model with the lowest \(C_p\) value. Alternatively, \(C_p\) is defined as \[C_p' = \frac{RSS}{\hat\sigma^2}+2d-n = \frac{n}{\hat\sigma^2}\bigg[MSE_{train} + \bigg(\frac{2d}{n} -1\bigg)\hat\sigma^2\bigg] \] This is equivalent to the definition given above in the sense that \(C_p = \frac{1}{n}\hat\sigma^2(C_p'+n) = \hat\sigma^2 + C_p'\frac{\hat\sigma^2}{n}\), and so the model with the smallest \(C_p\) also has the smallest \(C_p'\).
Akaike Information Criterion (AIC)
The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing. In this case AIC is given by \[\begin{aligned}AIC &= \frac{1}{n\hat\sigma^2}(RSS + 2d\hat\sigma^2) + Const. \\ &= \frac{MSE_{train}}{\hat\sigma^2} + \frac{2d}{n} + Const. \\ &= \frac{C_p}{\hat\sigma^2} + Const.\end{aligned}\] where, for simplicity, we have omitted an additive constant. Hence for least squares models, \(C_p\) and AIC are proportional to each other.
Bayesian Information Criterion (BIC)
BIC is derived from a Bayesian point of view, but ends up looking similar to \(C_p\) (and AIC) as well. For the least squares model with \(d\) predictors, the BIC is, up to irrelevant constants, given by \[\begin{aligned} BIC &= \frac{1}{n\hat\sigma^2}(RSS+\log(n)d\hat\sigma^2) \\ &= \frac{1}{\hat\sigma^2}\bigg[MSE_{train}+\frac{\log(n)d}{n}\hat\sigma^2\bigg] \end{aligned}\] Like \(C_p\), the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Since \(\log n > 2\) for any \(n >7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\). (Here \(\log\) means \(\log_e\) or \(\ln\).)
Adjusted \(R^2\)
The adjusted \(R^2\) statistic is another popular approach for selecting among a set of models that contain different numbers of variables. Recall that the usual \(R^2\) is defined as \(1-\frac{RSS}{TSS}\), where \(TSS = \sum_{i=1}^n(y_i - \bar y)^2\) is the total sum of squares for the response. Since RSS always decreases as more variables are added to the model, the \(R^2\) always increases as more variables are added. For a least squares model with \(d\) variables, the adjusted \(R^2\) statistic is calculated as \[\text{Adjusted} \ \ R^2 = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\] Unlike \(C_p\), AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted \(R^2\) indicates a model with a small test error. Maximizing the adjusted \(R^2\) is equivalent to minimizing \(\frac{RSS}{n-d-1}\). While RSS always decreases as the number of variables in the model increases, \(\frac{RSS}{n-d-1}\) may increase or decrease, due to the presence of \(d\) in the denominator. The intuition behind the adjusted \(R^2\) is that once all of the correct variables have been included in the model, adding additional noise variables will lead to only a very small decrease in RSS. Since adding noise variables leads to an increase in \(d\), such variables will lead to an increase in \(\frac{RSS}{n-d-1}\), and consequently a decrease in the adjusted \(R^2\). Therefore, in theory, the model with the largest adjusted \(R^2\) will have only correct variables and no noise variables. Unlike the \(R^2\) statistic, the adjusted \(R^2\) statistic pays a price for the inclusion of unnecessary variables in the model.
# define predict function for regsubsets object
predict.regsubsets <- function(object, newdata, id, ...){
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
# best subset
regfit.best <- regsubsets(Balance ~ ., data = credit,
nvmax = (dim(credit)[2]-1)) # default, best
res.best <- summary(regfit.best)
# Cp
par(mfrow = c(1,3))
plot(res.best$cp, xlab = "Number of Predictors",
ylab = "C_p", col = "hot pink", type = 'l')
points(res.best$cp, col = "blue", pch = 16)
points(which.min(res.best$cp), min(res.best$cp), pch = 4, col = "blue", cex = 3)
# BIC
# BIC of null model
null.bic <- BIC(glm(Balance ~ 1, data = credit))
# True BIC of each model
bic.actual <- null.bic + res.best$bic
plot(bic.actual, xlab = "Number of Predictors",
ylab = "BIC", col = "hot pink", type = 'l')
points(bic.actual, col = "blue", pch = 16)
points(which.min(bic.actual), min(bic.actual), pch = 4, col = "blue", cex = 3)
# adjust R^2
plot(res.best$adjr2, xlab = "Number of Predictors",
ylab = "Adjusted R^2", col = "hot pink", type = 'l')
points(res.best$adjr2, col = "blue", pch = 16)
points(which.max(res.best$adjr2), max(res.best$adjr2), pch = 4, col = "blue", cex = 3)
# final models
coef(regfit.best, which.min(res.hbd$cp))
## (Intercept) Income Limit Rating Cards
## -488.6158695 -7.8036338 0.1936237 1.0940490 18.1091708
## Age GenderFemale StudentYes
## -0.6206538 -10.4531521 426.5812620
coef(regfit.best, which.min(res.hbd$bic))
## (Intercept) Income Limit Cards StudentYes
## -499.7272117 -7.8392288 0.2666445 23.1753794 429.6064203
coef(regfit.best, which.max(res.best$adjr2)) # adjusted R^2 should be maximized, not minimized
# Manual calculation of each criterion (a sketch) & comparison with the leaps output
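# C_p and adjusted R^2 recomputed from the RSS using the formulas above;
# sigma^2 is estimated from the full least squares model. leaps reports C_p in its
# classical form RSS/sigma^2 + 2d - n, a monotone transformation of the definition
# above, so both versions rank the models identically.
n <- nrow(credit)
d <- 1:(dim(credit)[2]-1)                      # number of predictors in each model
rss <- res.best$rss
sigma2.hat <- summary(lm(Balance ~ ., data = credit))$sigma^2
tss <- sum((credit$Balance - mean(credit$Balance))^2)
cp.manual <- (rss + 2 * d * sigma2.hat) / n
adjr2.manual <- 1 - (rss / (n - d - 1)) / (tss / (n - 1))
which.min(cp.manual); which.min(res.best$cp)   # both select the same model
cbind(adjr2.manual, leaps = res.best$adjr2)    # compare with the leaps output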
# validation set approach
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(credit), replace = TRUE) # random (roughly 50/50) train/test split
test <- (!train)
regfit.best <- regsubsets(Balance ~ ., data = credit[train,], nvmax = dim(credit)[2]-1)
test.mat <- model.matrix(Balance ~ ., data = credit[test,])
val.errors <- rep(NA, dim(credit)[2]-1)
for (i in 1:(dim(credit)[2]-1)){
coefi <- coef(regfit.best, id = i)
pred <- test.mat[, names(coefi)] %*% coefi
val.errors[i] <- mean((credit$Balance[test]-pred)^2)
}
val.errors
## [1] 51754.40 24326.78 12905.05 11616.12 11582.29 12176.73 12041.24
## [8] 12024.54 11980.66 11930.86
# Cross-validation
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(credit), replace = TRUE)
cv.errors <- matrix(NA, k, (dim(credit)[2]-1), dimnames = list(NULL, paste(1:(dim(credit)[2]-1))))
for (j in 1:k){ # for each jth fold,
best.fit <- regsubsets(Balance ~ ., data = credit[folds !=j,],
nvmax = (dim(credit)[2]-1)) # training without jth fold
for (i in 1:(dim(credit)[2]-1)) { # for each model
pred = stats::predict(best.fit, credit[folds == j,], id = i) # test with jth fold
cv.errors[j, i] <- mean((credit$Balance[folds==j]-pred)^2) # test MSE
}
}
# find minimum cross-validation error
mean.cv.errors <- apply(cv.errors, 2, mean) # column mean
mean.cv.errors; colMeans(cv.errors) # identical
## 1 2 3 4 5 6 7 8
## 54501.60 27339.31 11389.85 10208.96 10314.11 10077.26 10246.64 10363.61
## 9 10
## 10391.83 10333.97
## 1 2 3 4 5 6 7 8
## 54501.60 27339.31 11389.85 10208.96 10314.11 10077.26 10246.64 10363.61
## 9 10
## 10391.83 10333.97
### Plotting BIC, validation set, and cross-validation errors side by side
par(mfrow = c(1,3))
plot(sqrt(bic.actual), type = 'l', col = "hot pink",
xlab = "Number of Predictors", ylab = "Square Root of BIC")
points(sqrt(bic.actual), pch = 16, col = "blue")
points(which.min(bic.actual), min(sqrt(bic.actual)), pch = 4, col = "blue", cex = 3)
# Validation set
plot(val.errors, type = 'l', col = "hot pink",
xlab = "Number of Predictors", ylab = "Validation Set Error")
points(val.errors, pch = 16, col = "blue")
points(which.min(val.errors), min(val.errors), pch = 4, col = "blue", cex = 3)
# Cross-validation error
plot(mean.cv.errors, type = 'l', col = "hot pink",
xlab = "Number of Predictors", ylab = "Cross-Validation Error")
points(mean.cv.errors, pch = 16, col = "blue")
points(which.min(mean.cv.errors), min(mean.cv.errors), pch = 4, col = "blue", cex = 3)
par(mfrow = c(1,1))
Validation and Cross-validation
As we saw in the previous section, the validation set and cross-validation approaches can be used for model selection. Here, we must use only the training observations to perform all aspects of model fitting, including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.
## cleaning data
hitters <- na.omit(Hitters)
## validation set approach
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(hitters), rep = TRUE)
test <- (!train)
library(leaps)
regfit.best <- regsubsets(Salary ~ ., data = hitters[train,], nvmax = 19)
test.mat <- model.matrix(Salary ~ ., data = hitters[test,])
val.errors <- rep(NA, 19)
for (i in 1:19){
coefi <- coef(regfit.best, id = i)
pred <- test.mat[, names(coefi)] %*% coefi
val.errors[i] <- mean((hitters$Salary[test]-pred)^2)
}
val.errors
## [1] 220968.0 169157.1 178518.2 163426.1 168418.1 171270.6 162377.1
## [8] 157909.3 154055.7 148162.1 151156.4 151742.5 152214.5 157358.7
## [15] 158541.4 158743.3 159972.7 159859.8 160105.6
which.min(val.errors)
## [1] 10
coef(regfit.best, which.min(val.errors))
## (Intercept) AtBat Hits Walks CAtBat CHits
## -80.2751499 -1.4683816 7.1625314 3.6430345 -0.1855698 1.1053238
## CHmRun CWalks LeagueN DivisionW PutOuts
## 1.3844863 -0.7483170 84.5576103 -53.0289658 0.2381662
# refit best subset selection on the full data set; the variables selected for the
# 10-variable model (and their coefficients) can differ from the training-set fit above
reg.fit.best <- regsubsets(Salary ~ ., data = hitters, nvmax = 19)
coef(reg.fit.best, 10)
## Cross-Validation
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k){ # for each jth fold,
best.fit <- regsubsets(Salary ~ ., data = hitters[folds !=j,],
nvmax = 19) # training without jth fold
for (i in 1:19) { # for each model
pred = predict(best.fit, hitters[folds == j,], id = i) # test with jth fold
cv.errors[j, i] <- mean((hitters$Salary[folds==j]-pred)^2) # test MSE
}
}
# find minimum cross-validation error
mean.cv.errors <- apply(cv.errors, 2, mean) # column mean
mean.cv.errors; colMeans(cv.errors) # identical
## 1 2 3 4 5 6 7 8
## 160093.5 140196.8 153117.0 151159.3 146841.3 138302.6 144346.2 130207.7
## 9 10 11 12 13 14 15 16
## 129459.6 125334.7 125153.8 128273.5 133461.0 133974.6 131825.7 131882.8
## 17 18 19
## 132750.9 133096.2 132804.7
## 1 2 3 4 5 6 7 8
## 160093.5 140196.8 153117.0 151159.3 146841.3 138302.6 144346.2 130207.7
## 9 10 11 12 13 14 15 16
## 129459.6 125334.7 125153.8 128273.5 133461.0 133974.6 131825.7 131882.8
## 17 18 19
## 132750.9 133096.2 132804.7
plot(mean.cv.errors, type = 'b', ylab = "Mean Cross-Validation Errors",
xlab = "Number of Predictors", pch=16)
# full model & get the best coefficients for our final model
reg.best <- regsubsets(Salary ~ ., data = hitters, nvmax = 19)
coef(reg.best, which.min(mean.cv.errors))
## (Intercept) AtBat Hits Walks CAtBat
## 135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914
## CRuns CRBI CWalks LeagueN DivisionW
## 1.4553310 0.7852528 -0.8228559 43.1116152 -111.1460252
## PutOuts Assists
## 0.2894087 0.2688277
As an alternative to the least squares method, we can fit a model containing all \(p\) predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.
Recall that the least squares fitting procedure estimates \(\beta_0, \beta_1, ..., \beta_p\) using the values that minimize
\[RSS = \sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p\beta_j x_{ij}\Bigg)^2\] Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates \(\hat\beta^R\) are the values that minimize \[\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p\beta_j x_{ij}\Bigg)^2+\lambda\sum_{j=1}^p\beta_j^2 = RSS + \lambda\sum_{j=1}^p\beta_j^2\] where \(\lambda \geq 0\) is a tuning parameter, to be determined separately. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, \(\lambda\sum_{j=1}^p\beta_j^2\), called a shrinkage penalty, is small when \(\beta_1, ..., \beta_p\) are close to zero, and so it has the effect of shrinking the estimates of \(\beta_j\) towards zero. The tuning parameter \(\lambda\) serves to control the relative impact of these two terms on the regression coefficient estimates. When \(\lambda=0\), the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as \(\lambda \rightarrow \infty\), the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, \(\hat\beta_{\lambda}^R\), for each value of \(\lambda\). Selecting a good value for \(\lambda\) is critical. Note that in the above equation, the shrinkage penalty is applied to \(\beta_1, ..., \beta_p\), but not to the intercept \(\beta_0\). We want to shrink the estimated association of each variable with the response. If we assume that the variables (that is, the columns of the data matrix \(X\)) have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form \(\hat\beta_0=\bar y=\sum_{i=1}^n \frac{y_i}{n}\).
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:ggvis':
##
## band
## Loading required package: foreach
## Loaded glmnet 2.0-16
# 0-1 normalization
normalize <- function(x)
{
return((x- min(x)) /(max(x)-min(x)))
}
# Prep data
credit <- Credit[,-1]
x <- model.matrix(Balance ~ ., credit)[,-1]
y <- credit$Balance
# scaling of data
# x[,1:6] <- apply(x[,1:6],2, scale) - different normalization scheme for dummy variable
x <- scale(x)
# fit model
grid <- 10^seq(10, -2, length = 100)
fit <- glmnet(x, y, alpha = 0, lambda = grid, standardize = TRUE) # ridge regression
fit.lsq <- lm(y ~ x)
# calculating L2 norm
fit$lambda # display all lambda values
## [1] 1.000000e+10 7.564633e+09 5.722368e+09 4.328761e+09 3.274549e+09
## [6] 2.477076e+09 1.873817e+09 1.417474e+09 1.072267e+09 8.111308e+08
## [11] 6.135907e+08 4.641589e+08 3.511192e+08 2.656088e+08 2.009233e+08
## [16] 1.519911e+08 1.149757e+08 8.697490e+07 6.579332e+07 4.977024e+07
## [21] 3.764936e+07 2.848036e+07 2.154435e+07 1.629751e+07 1.232847e+07
## [26] 9.326033e+06 7.054802e+06 5.336699e+06 4.037017e+06 3.053856e+06
## [31] 2.310130e+06 1.747528e+06 1.321941e+06 1.000000e+06 7.564633e+05
## [36] 5.722368e+05 4.328761e+05 3.274549e+05 2.477076e+05 1.873817e+05
## [41] 1.417474e+05 1.072267e+05 8.111308e+04 6.135907e+04 4.641589e+04
## [46] 3.511192e+04 2.656088e+04 2.009233e+04 1.519911e+04 1.149757e+04
## [51] 8.697490e+03 6.579332e+03 4.977024e+03 3.764936e+03 2.848036e+03
## [56] 2.154435e+03 1.629751e+03 1.232847e+03 9.326033e+02 7.054802e+02
## [61] 5.336699e+02 4.037017e+02 3.053856e+02 2.310130e+02 1.747528e+02
## [66] 1.321941e+02 1.000000e+02 7.564633e+01 5.722368e+01 4.328761e+01
## [71] 3.274549e+01 2.477076e+01 1.873817e+01 1.417474e+01 1.072267e+01
## [76] 8.111308e+00 6.135907e+00 4.641589e+00 3.511192e+00 2.656088e+00
## [81] 2.009233e+00 1.519911e+00 1.149757e+00 8.697490e-01 6.579332e-01
## [86] 4.977024e-01 3.764936e-01 2.848036e-01 2.154435e-01 1.629751e-01
## [91] 1.232847e-01 9.326033e-02 7.054802e-02 5.336699e-02 4.037017e-02
## [96] 3.053856e-02 2.310130e-02 1.747528e-02 1.321941e-02 1.000000e-02
coef(fit)[1:5, 1:5] # display some of coefficients
## 5 x 5 sparse Matrix of class "dgCMatrix"
## s0 s1 s2 s3
## (Intercept) 5.200150e+02 5.200150e+02 5.200150e+02 5.200150e+02
## Income 9.788429e-06 1.293973e-05 1.710555e-05 2.261253e-05
## Limit 1.819162e-05 2.404825e-05 3.179037e-05 4.202500e-05
## Rating 1.823232e-05 2.410205e-05 3.186149e-05 4.211902e-05
## Cards 1.825213e-06 2.412824e-06 3.189611e-06 4.216478e-06
## s4
## (Intercept) 5.200150e+02
## Income 2.989244e-05
## Limit 5.555457e-05
## Rating 5.567886e-05
## Cards 5.573936e-06
fit$l2 <- sqrt(colSums(coef(fit)[-1,]^2)) # L2 Norms
# Normalized L2 Norms (w.r.t L2 Norm of least square)
fit$norml2 <- fit$l2/sqrt(sum(coef(fit.lsq)[-1]^2))
# plot results
par(mfrow = c(1,2))
library(RColorBrewer)
palette <- brewer.pal(11, "Spectral")
# lambda vs coefficient
plot(fit, xvar = "lambda", col = palette, xlim = c(-5,6),
ylab = "Standardized Coefficients")
legend(x=-5, y=-50, fit$beta@Dimnames[[1]],
col = palette, lty = rep(1,11), cex = 0.35)
# L2 Norm vs coefficients
plot(fit$norml2, matrix(fit$beta@x, nrow=11)[1,], type = 'l', col = palette[1], xlim = c(0, 1),
ylim = c(-300, 450), xlab = "Normalized L2Norm", ylab = "Standardized Coefficients")
for (i in 2:11) lines(fit$norml2, matrix(fit$beta@x, nrow=11)[i,], type = 'l', col = palette[[i]])
legend(x=0, y=-50, fit$beta@Dimnames[[1]],
col = palette, lty = rep(1,11), cex = 0.35)
par(mfrow = c(1,1))
# training & testing for ridge regression
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]
# prediction with test data set with parameter lambda = 4
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred <- predict(ridge.mod, s=4, newx = x[test,]) # lambda = 4
mean((ridge.pred - y.test)^2) # MSE
## [1] 10634.78
mean((mean(y[train]) - y.test)^2) # MSE for lambda -> infty, using only intercept
## [1] 194734.7
ridge.pred <- predict(ridge.mod, s=1e10, newx = x[test,]) # lambda = 10^10, essentially the intercept-only model
mean((ridge.pred - y.test)^2) # MSE
## [1] 194734.7
# ridge regression with lambda = 4 refit on the full data set (train + test), for comparison
ridge.full.pred <- predict(glmnet(x, y, alpha = 0, lambda = 4),
newx = x[test,], s=4, exact = TRUE)
mean((ridge.full.pred - y.test)^2) # test MSE (optimistic, since the test observations were used in fitting)
## [1] 10165.56
lm(y ~ x, subset = train)
##
## Call:
## lm(formula = y ~ x, subset = train)
##
## Coefficients:
## (Intercept) xIncome xLimit
## 517.397 -278.198 486.398
## xRating xCards xAge
## 142.636 29.462 -10.110
## xEducation xGenderFemale xStudentYes
## -5.943 -10.043 131.612
## xMarriedYes xEthnicityAsian xEthnicityCaucasian
## -5.132 13.299 11.423
predict(glmnet(x, y, alpha = 0, lambda = 4), s=4, exact = T, type = "coefficients")[1:12,]
## (Intercept) Income Limit
## 520.015000 -263.241642 334.820628
## Rating Cards Age
## 270.035821 20.431666 -11.542037
## Education GenderFemale StudentYes
## -2.746844 -4.965227 125.834139
## MarriedYes EthnicityAsian EthnicityCaucasian
## -5.025895 7.555101 5.068985
# cv of ridge regression
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)
bestlambda <- cv.out$lambda.min; bestlambda
## [1] 45.57503
ridge.pred <- predict(ridge.mod, s = bestlambda, newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 14871.68
# final model
out <- glmnet(x, y, alpha = 0, lambda = grid)
predict(out, type = "coefficients", s = bestlambda)[1:12,]
## (Intercept) Income Limit
## 520.015000 -172.698485 258.498797
## Rating Cards Age
## 252.257647 21.853938 -16.947831
## Education GenderFemale StudentYes
## -1.353316 -2.131229 113.079064
## MarriedYes EthnicityAsian EthnicityCaucasian
## -5.952610 5.490850 4.475777
In the above figures, the normalized \(\ell_2\) norm of the ridge regression parameters, denoted \(||\hat\beta_{\lambda}^R||_2/||\hat\beta||_2\), indicates the amount that the ridge regression coefficient estimates have been shrunken towards zero. The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant \(c\) simply leads to a scaling of the least squares coefficient estimates by a factor of \(\frac{1}{c}\). In other words, regardless of how the \(j_{th}\) predictor is scaled, \(X_j\hat\beta_j\) will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant. That is, \(X_j\hat\beta_{j,\lambda}^R\) will depend not only on the value of \(\lambda\), but also on the scaling of the \(j_{th}\) predictor. In fact, the value of \(X_j\hat\beta_{j, \lambda}^R\) may even depend on the scaling of the other predictors! Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula \[\tilde x_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar x_j)^2}}\] so that they are all on the same scale. (Note that our figure and the figure in the textbook are a bit different because the scaling scheme we used differs from the one used in the textbook, but the results are almost the same.)
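As a small sanity check (using the Income column of the Credit data as an example), note that the formula above divides by the standard deviation computed with denominator \(n\), whereas R’s sd() and scale() use \(n-1\); for \(n=400\) the difference is negligible.
xj <- Credit$Income
xj.tilde <- xj / sqrt(mean((xj - mean(xj))^2))            # formula above (denominator n)
head(cbind(formula = xj.tilde, sample.sd = xj / sd(xj)))  # columns are nearly identical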
Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
As you can see in the figures above, because ridge regression reduces variance while increasing bias, for some values of \(\lambda\) it reduces the test error effectively. In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. In particular, when the number of variables \(p\) is almost as large as the number of observations \(n\), the least squares estimates will be extremely variable, but ridge regression can still do well, even when \(p>n\), by decreasing variance at the cost of a slight increase in bias. In addition, ridge regression is computationally favorable: the computations required to solve the ridge regression problem, simultaneously for all values of \(\lambda\), are almost identical to those for fitting a model using least squares.
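To make the computational point concrete, here is a minimal sketch of the textbook ridge solution \(\hat\beta_{\lambda}^R = (X^TX+\lambda I)^{-1}X^Ty\) on simulated, standardized data with a centered response (so no intercept is penalized). Note that glmnet parameterizes \(\lambda\) differently, so these values are not directly comparable to the glmnet fits above.
set.seed(1)
ns <- 50; ps <- 5
Xs <- scale(matrix(rnorm(ns * ps), ns, ps))   # standardized simulated predictors
ys <- Xs[, 1] + rnorm(ns)
ysc <- ys - mean(ys)                          # centered response, so the intercept drops out
lam <- 10
beta.ridge <- solve(crossprod(Xs) + lam * diag(ps), crossprod(Xs, ysc))
beta.ols <- solve(crossprod(Xs), crossprod(Xs, ysc))
data.frame(ols = as.numeric(beta.ols), ridge = as.numeric(beta.ridge))  # ridge estimates are shrunken toward zero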
Ridge regression does have one obvious disadvantage: it will include all \(p\) predictors in the final model. Unless \(\lambda = \infty\), the penalty \(\lambda\sum_{j=1}^p \beta_j^2\) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero. The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat\beta_{\lambda}^L\), minimize the quantity
\[\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p\beta_j x_{ij}\Bigg)^2+\lambda\sum_{j=1}^p|\beta_j| = RSS + \lambda\sum_{j=1}^p|\beta_j|\] The only difference from ridge regression is that the \(\beta_j^2\) term in the ridge penalty has been replaced by \(|\beta_j|\) in the lasso penalty. In statistical language, the lasso uses an \(\ell_1\) penalty instead of an \(\ell_2\) penalty. The \(\ell_1\) norm of a coefficient vector \(\beta\) is given by \(||\beta||_1 = \sum_{j=1}^p|\beta_j|\). Unlike ridge regression, the lasso performs variable selection when the tuning parameter \(\lambda\) is sufficiently large. We say that the lasso yields sparse models, that is, models that involve only a subset of the variables.
grid <- 10^seq(10, -2, length = 100)
fit <- glmnet(x, y, alpha = 1, lambda = grid, standardize = TRUE) # lasso regression
fit.lsq <- lm(y ~ x)
# calculating L1 norm
fit$lambda # display all lambda values
## [1] 1.000000e+10 7.564633e+09 5.722368e+09 4.328761e+09 3.274549e+09
## [6] 2.477076e+09 1.873817e+09 1.417474e+09 1.072267e+09 8.111308e+08
## [11] 6.135907e+08 4.641589e+08 3.511192e+08 2.656088e+08 2.009233e+08
## [16] 1.519911e+08 1.149757e+08 8.697490e+07 6.579332e+07 4.977024e+07
## [21] 3.764936e+07 2.848036e+07 2.154435e+07 1.629751e+07 1.232847e+07
## [26] 9.326033e+06 7.054802e+06 5.336699e+06 4.037017e+06 3.053856e+06
## [31] 2.310130e+06 1.747528e+06 1.321941e+06 1.000000e+06 7.564633e+05
## [36] 5.722368e+05 4.328761e+05 3.274549e+05 2.477076e+05 1.873817e+05
## [41] 1.417474e+05 1.072267e+05 8.111308e+04 6.135907e+04 4.641589e+04
## [46] 3.511192e+04 2.656088e+04 2.009233e+04 1.519911e+04 1.149757e+04
## [51] 8.697490e+03 6.579332e+03 4.977024e+03 3.764936e+03 2.848036e+03
## [56] 2.154435e+03 1.629751e+03 1.232847e+03 9.326033e+02 7.054802e+02
## [61] 5.336699e+02 4.037017e+02 3.053856e+02 2.310130e+02 1.747528e+02
## [66] 1.321941e+02 1.000000e+02 7.564633e+01 5.722368e+01 4.328761e+01
## [71] 3.274549e+01 2.477076e+01 1.873817e+01 1.417474e+01 1.072267e+01
## [76] 8.111308e+00 6.135907e+00 4.641589e+00 3.511192e+00 2.656088e+00
## [81] 2.009233e+00 1.519911e+00 1.149757e+00 8.697490e-01 6.579332e-01
## [86] 4.977024e-01 3.764936e-01 2.848036e-01 2.154435e-01 1.629751e-01
## [91] 1.232847e-01 9.326033e-02 7.054802e-02 5.336699e-02 4.037017e-02
## [96] 3.053856e-02 2.310130e-02 1.747528e-02 1.321941e-02 1.000000e-02
coef(fit)[1:5, 1:5] # display some of coefficients
## 5 x 5 sparse Matrix of class "dgCMatrix"
## s0 s1 s2 s3 s4
## (Intercept) 520.015 520.015 520.015 520.015 520.015
## Income . . . . .
## Limit . . . . .
## Rating . . . . .
## Cards . . . . .
fit$l1 <- colSums(abs(as.matrix(coef(fit)[-1,]))) # L1 Norms
# Normalized L1 Norms (w.r.t L1 Norm of least square)
fit$norml1 <- fit$l1/sum(abs(coef(fit.lsq)[-1]))
# plot results
par(mfrow = c(1,2))
library(RColorBrewer)
palette <- brewer.pal(11, "Spectral")
# lambda vs coefficient
plot(fit, xvar = "lambda", col = palette, xlim = c(-5,7),
ylab = "Standardized Coefficients")
legend(x=-5, y=-25, fit$beta@Dimnames[[1]],
col = palette, lty = rep(1,11), cex = 0.35)
# L1 Norm vs coefficients
res_beta <- as.matrix(fit$beta)
plot(fit$norml1, res_beta[1,], type = 'l', col = palette[1], xlim = c(0, 1),
ylim = c(-300, 400), xlab = "Normalized L1Norm", ylab = "Standardized Coefficients")
for (i in 2:11) lines(fit$norml1, res_beta[i,], type = 'l', col = palette[[i]])
legend(x=0, y=-25, fit$beta@Dimnames[[1]],
col = palette, lty = rep(1,11), cex = 0.35)
par(mfrow = c(1,1))
# lasso model
lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = grid)
# cv of lasso
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 1)
plot(cv.out)
bestlambda <- cv.out$lambda.min; bestlambda
## [1] 0.6767748
lasso.pred <- predict(lasso.mod, s = bestlambda, newx = x[test,])
mean((lasso.pred - y.test)^2)
## [1] 10616.67
# final model
out <- glmnet(x, y, alpha = 1, lambda = grid)
predict(out, type = "coefficients", s = bestlambda)[1:12,]
## (Intercept) Income Limit
## 520.015000 -271.797813 401.236258
## Rating Cards Age
## 211.932352 22.192181 -10.225534
## Education GenderFemale StudentYes
## -2.580207 -4.540163 126.872596
## MarriedYes EthnicityAsian EthnicityCaucasian
## -3.665522 5.784060 3.434557
One can show that the lasso and ridge regression coefficient estimates solve the problems \[\min_{\beta} \Bigg\{\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij}\Bigg)^2\Bigg\} \hspace{1cm} \text{subject to} \hspace{1cm} \sum_{j=1}^p|\beta_j|\leq s\] and \[\min_{\beta} \Bigg\{\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij}\Bigg)^2\Bigg\} \hspace{1cm} \text{subject to} \hspace{1cm} \sum_{j=1}^p\beta_j^2\leq s\] respectively. When \(p=2\), the lasso coefficient estimates have the smallest RSS out of all points that lie within the diamond defined by \(|\beta_1|+|\beta_2|\leq s\). Similarly, the ridge regression estimates have the smallest RSS out of all points that lie within the circle defined by \(\beta_1^2+\beta_2^2\leq s\). Consider the problem \[\min_{\beta} \Bigg\{\sum_{i=1}^n \Bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij}\Bigg)^2\Bigg\} \hspace{1cm} \text{subject to} \hspace{1cm} \sum_{j=1}^p I(\beta_j\neq 0)\leq s\] Here \(I(\beta_j \neq 0)\) is an indicator variable. It takes on a value of 1 if \(\beta_j \neq 0\), and equals zero otherwise. This is equivalent to best subset selection. Unfortunately, solving the above equation is computationally infeasible when \(p\) is large, since it requires considering all \(p \choose s\) models containing \(s\) predictors. Therefore, we can interpret ridge regression and the lasso as computationally feasible alternatives to best subset selection that replace the intractable form of the best subset selection.
As you can see in the picture above, the contours of constant RSS (the ellipses) intersect the circular ridge constraint region at a point where both \(\beta_1\) and \(\beta_2\) are nonzero. For the lasso, on the other hand, the intersection occurs on the \(\beta_2\) axis, so \(\beta_2\) is nonzero while \(\beta_1\) is exactly zero. When \(p = 3\), the constraint region for ridge regression becomes a sphere, and the constraint region for the lasso becomes a polyhedron. When \(p > 3\), the constraint for ridge regression becomes a hypersphere, and the constraint for the lasso becomes a polytope. However, the key ideas still hold. In particular, the lasso leads to feature selection when \(p > 2\) due to the sharp corners of the polyhedron or polytope.
So which is the better method, the lasso or ridge regression?
In general, the lasso can be expected to perform better in a setting where a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or equal to zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used to determine which approach is better on a particular data set.
In order to obtain a better intuition about the behavior of ridge regression and the lasso, consider a simple special case with \(n=p\) and \(X = I\), the identity matrix, and assume we perform regression without an intercept. Then the least squares problem simplifies to finding \(\beta_1, ..., \beta_p\) that minimize \[\sum_{j=1}^p(y_j - \beta_j)^2\] and the least squares solution is given by \[\hat\beta_j = y_j\] In this setting, the ridge regression estimates take the form \[\hat\beta_j^R = \frac{y_j}{(1+\lambda)}\] and the lasso estimates take the form \[\hat\beta_j^L =\Bigg\{\begin{array}{lll} {y_j-\lambda/2} \hspace{1cm} \text{if } y_j > \lambda/2;\\ {y_j+\lambda/2} \hspace{1cm} \text{if } y_j < -\lambda/2;\\ 0 \hspace{2.3cm} \text{if } |y_j| \leq \lambda/2.\end{array} \]
As illustrated in the above figure, in ridge regression each least squares coefficient estimate is shrunken by the same proportion. In contrast, the lasso shrinks each least squares coefficient towards zero by a constant amount, \(\lambda/2\); the least squares coefficients that are less than \(\lambda/2\) in absolute value are shrunken entirely to zero. The type of shrinkage performed by the lasso in this simple setting is known as soft-thresholding. The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection. In general, though the situation is a little more complicated, the main ideas still hold approximately: ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.
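As a minimal sketch of this comparison (an arbitrary \(\lambda\) on a grid of least squares estimates, mirroring the special case above):
# ridge shrinks every least squares estimate y_j by the factor 1/(1 + lambda),
# while the lasso soft-thresholds it by lambda/2
lambda <- 2
yj <- seq(-4, 4, length.out = 200)                 # least squares estimates
ridge <- yj / (1 + lambda)
lasso <- sign(yj) * pmax(abs(yj) - lambda / 2, 0)  # soft-thresholding
plot(yj, yj, type = 'l', lty = 3, xlab = "Least Squares Estimate", ylab = "Shrunken Estimate")
lines(yj, ridge, col = "blue")
lines(yj, lasso, col = "red")
legend("topleft", c("no shrinkage", "ridge", "lasso"), lty = c(3, 1, 1), col = c("black", "blue", "red"))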
Ridge regression and the lasso can also be interpreted in a Bayesian way. Let’s assume that the coefficient vector \(\beta\) has some prior distribution, say \(p(\beta)\), where \(\beta=(\beta_0, \beta_1, ..., \beta_p)^T\). The likelihood of the data can be written as \(f(Y |X, \beta)\), where \(X = (X_1,...,X_p)\). Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form \[p(\beta|X, Y) \propto f(Y|X,\beta)p(\beta|X) = f(Y|X, \beta)p(\beta)\] We assume the usual linear model \[Y = \beta_0 + X_1\beta_1 + ... + X_p\beta_p + \epsilon\] and suppose that the errors are independent and drawn from a normal distribution. Furthermore, assume that \(p(\beta) = \prod_{j=1}^p g(\beta_j)\), for some density function \(g\). It turns out that ridge regression and the lasso follow naturally from two special cases of \(g\):
If \(g\) is a Gaussian distribution with mean zero and standard deviation a function of \(\lambda\), then it follows that the posterior mode for \(\beta\) - that is, the most likely value for \(\beta\), given the data - is given by the ridge regression solution. (In fact, the ridge regression solution is also the posterior mean.)
If \(g\) is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of \(\lambda\), then it follows that the posterior mode for \(\beta\) is the lasso solution. (However, the lasso solution is not the posterior mean, and in fact, the posterior mean does not yield a sparse coefficient vector)
Therefore, from a Bayesian viewpoint, ridge regression and the lasso follow directly from assuming the usual linear model with normal errors, together with a simple prior distribution for \(\beta\). Notice that the lasso prior is steeply peaked at zero, while the Gaussian is flatter and fatter at zero. Hence, the lasso expects a priori that many of the coefficients are (exactly) zero, while ridge assumes the coefficients are randomly distributed about zero. (for more information, see this post and this material)
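A quick sketch of the two priors makes this visible (the Laplace density is written out by hand here, since it is not in base R; both densities are scaled to unit variance):
curve(dnorm(x, sd = 1), from = -4, to = 4, lty = 2, ylim = c(0, 0.75), ylab = "Prior density")
dlaplace <- function(x, b) exp(-abs(x) / b) / (2 * b)   # double-exponential (Laplace) density
curve(dlaplace(x, b = 1 / sqrt(2)), add = TRUE)          # variance 2b^2 = 1, same as N(0, 1)
legend("topright", c("Gaussian (ridge)", "Laplace (lasso)"), lty = c(2, 1))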
So far we have discussed how ridge regression and the lasso work. To apply these methods properly, we need to choose the right tuning parameter \(\lambda\) for each specific problem. Cross-validation provides a simple way to tackle this: as we have already seen in the examples above, we choose the value of \(\lambda\) at which the minimum cross-validation error occurs.
We have discussed methods that are defined using the original predictors. We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods. Let \(Z_1, Z_2, ..., Z_M\) represent \(M<p\) linear combinations of our original \(p\) predictors. That is, \[Z_m = \sum_{j=1}^p \phi_{jm}X_j\] for some constants \(\phi_{1m},\phi_{2m},..., \phi_{pm}\), \(m=1, ..., M\). We can then fit the linear regression model \[y_i = \theta_0 + \sum_{m=1}^M \theta_mZ_{im}+\epsilon_i, \hspace{1cm} i=1,...,n\] using least squares. The term dimension reduction comes from the fact that this approach reduces the problem of estimating the \(p+1\) coefficients \(\beta_0, \beta_1, ..., \beta_p\) to the simpler problem of estimating the \(M + 1\) coefficients \(\theta_0, \theta_1, ..., \theta_M\), where \(M<p\). Notice that \[\sum_{m=1}^M\theta_mZ_{im} = \sum_{m=1}^M \theta_m \sum_{j=1}^p \phi_{jm}x_{ij} = \sum_{j=1}^p\sum_{m=1}^M \theta_m \phi_{jm}x_{ij}=\sum_{j=1}^p\beta_jx_{ij}\] where \[\beta_j = \sum_{m=1}^M \theta_m\phi_{jm}\] Hence the above regression equation can be thought of as a special case of the original linear regression model. Dimension reduction serves to constrain the estimated \(\beta_j\) coefficients, since now they must take the form above. This constraint on the form of the coefficients has the potential to bias the coefficient estimates. However, in situations where \(p\) is large relative to \(n\), selecting a value of \(M \ll p\) can significantly reduce the variance of the fitted coefficients.
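A small simulated check of the identity above (hypothetical loadings \(\phi_{jm}\), just to illustrate the algebra): regressing on the \(M\) linear combinations is the same as fitting the original predictors with coefficients constrained to \(\beta_j = \sum_m \theta_m\phi_{jm}\).
set.seed(1)
nd <- 100; pd <- 5; Md <- 2
Xd <- matrix(rnorm(nd * pd), nd, pd)
phi <- matrix(rnorm(pd * Md), pd, Md)   # hypothetical loadings phi_jm
yd <- rnorm(nd)
Zd <- Xd %*% phi                        # transformed predictors Z_1, ..., Z_M
fit.z <- lm(yd ~ Zd)                    # estimates theta_0, theta_1, ..., theta_M
theta <- coef(fit.z)[-1]
beta <- phi %*% theta                   # implied constrained coefficients beta_j
all.equal(as.numeric(Xd %*% beta + coef(fit.z)[1]), as.numeric(fitted(fit.z)))  # TRUE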
All dimension reduction methods work in two steps. First, the transformed predictors \(Z_1, Z_2,...,Z_M\) are obtained. Second, the model is fit using these \(M\) predictors. However, the choice of \(Z_1, Z_2,...,Z_M\) , or equivalently, the selection of the \(\phi_{jm}\)’s, can be achieved in different ways. We will consider principal components and partial least squares here.
Principal Components Regression
Principal components analysis (PCA) is a technique for reducing the dimension of an \(n\times p\) data matrix \(X\). The first principal component direction of the data is that along which the observations vary the most. For example, for \(X \in R^2\), the first principal component can be represented as \[Z_1 = \phi_{11} \times (X_1 - \bar X_1) + \phi_{21} \times (X_2 - \bar X_2)\] Here \(\phi_{11}\) and \(\phi_{21}\) are the principal component loadings, which define the direction of the component. The idea is that out of every possible linear combination of \(X_1\) and \(X_2\) such that \(\phi_{11}^2 + \phi_{21}^2=1\), this particular linear combination yields the highest variance: i.e., this is the linear combination for which \(Var(Z_1)\) is maximized. If the two loadings are both positive and of similar size, then \(Z_1\) is almost an average of the two variables. The values \(z_{11},...,z_{n1}\) are known as the principal component scores. There is also another interpretation of PCA: the first principal component vector defines the line that is as close as possible to the data.
The Principal components regression (PCR) approach involves constructing the first \(M\) principal components \(Z_1, ..., Z_M\), and then using these components as the predictors in a linear regression model that is fit using least squares. The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response. In other words, we assume that the directions in which \(X_1,...,X_p\) show the most variation are the directions that are associated with \(Y\).
We note that even though PCR provides a simple way to perform regression using \(M<p\) predictors, it is not a feature selection method. This is because each of the \(M\) principal components used in the regression is a linear combination of all \(p\) of the original features. In this sense, PCR is more closely related to ridge regression than to the lasso. In fact, one can show that PCR and ridge regression are very closely related. One can even think of ridge regression as a continuous version of PCR.
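The connection with ridge regression can be made concrete through the singular value decomposition \(X = UDV^T\): ridge shrinks the least squares fit along the \(j\)-th principal component direction by the factor \(d_j^2/(d_j^2+\lambda)\), whereas PCR keeps the first \(M\) components unshrunk and discards the rest. The sketch below (simulated data; all names and values are illustrative assumptions) verifies the SVD form of the ridge solution and prints these shrinkage factors.
# ridge via the SVD: beta_ridge = V diag(d/(d^2+lambda)) U'y
set.seed(1)
n <- 50; p <- 4
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)  # centered predictors
y <- rnorm(n); y <- y - mean(y)
sv <- svd(X)
lambda <- 10
beta_svd    <- drop(sv$v %*% ((sv$d / (sv$d^2 + lambda)) * crossprod(sv$u, y)))
beta_direct <- drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
round(rbind(beta_svd, beta_direct), 6)   # identical
round(sv$d^2 / (sv$d^2 + lambda), 3)     # per-component shrinkage; PCR would use 1 or 0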
In PCR, the number of principal components, \(M\), is typically chosen by cross-validation. When performing PCR, we generally recommend standardizing each predictor prior to generating the principal components. This standardization ensures that all variables are on the same scale. In the absence of standardization, the high-variance variables tend to play a larger role in the principal components obtained, and the scale on which the variables are measured will ultimately affect the final PCR model.
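The effect of standardization is easiest to see on a data set whose variables are measured on very different scales; the short snippet below uses the built-in USArrests data (an arbitrary choice for illustration).
apply(USArrests, 2, var)                                  # Assault has by far the largest variance
round(prcomp(USArrests)$rotation[, 1], 3)                 # without scaling, PC1 is dominated by Assault
round(prcomp(USArrests, scale. = TRUE)$rotation[, 1], 3)  # scaling gives balanced loadings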
# toy example
library(boot)
data(bigcity); head(bigcity) # 1920 and 1930 US population data set
## u x
## 1 138 143
## 2 93 104
## 3 61 69
## 4 179 260
## 5 48 75
## 6 37 63
city.pca <- prcomp(bigcity, center = TRUE, scale.=TRUE)
summary(city.pca)
## Importance of components:
## PC1 PC2
## Standard deviation 1.4077 0.13512
## Proportion of Variance 0.9909 0.00913
## Cumulative Proportion 0.9909 1.00000
# approximate PC scores on the original (centered) scale, using the loadings in
# city.pca$rotation; the exact standardized scores are available as city.pca$x
pc1_score <- city.pca$rotation[1,1]*(bigcity$u-mean(bigcity$u)) +
  city.pca$rotation[2,1] * (bigcity$x-mean(bigcity$x))
pc2_score <- city.pca$rotation[1,2]*(bigcity$u-mean(bigcity$u)) +
  city.pca$rotation[2,2] * (bigcity$x-mean(bigcity$x))
df1 <- data.frame(u = bigcity$u, pc1_score, x = bigcity$x); df1 <- df1[order(df1$u),]
df2 <- data.frame(u = bigcity$u, pc2_score, x = bigcity$x); df2 <- df2[order(df2$u),]
# plot regression line
plot(pc1_score ~ u, data=df1, type = 'l', col = "firebrick",
xlab = "US population in 1920", ylab = "US population in 1930")
points(bigcity$u[order(bigcity$u)], bigcity$x[order(bigcity$u)],
pch=16, col = "cornflowerblue")
# PC vs predictors
par(mfrow=c(2,2))
plot(df1$pc1_score[order(df1$pc1_score)], df1$u[order(df1$pc1_score)], col = "firebrick",
pch = 16, xlab = "1st Principal Components", ylab = "US population in 1920")
plot(df1$pc1_score[order(df1$pc1_score)], df1$x[order(df1$pc1_score)], col = "firebrick",
pch = 16, xlab = "1st Principal Components", ylab = "US population in 1930")
plot(df2$pc2_score[order(df2$pc2_score)], df2$u[order(df2$pc2_score)], col = "firebrick",
pch = 16, xlab = "2nd Principal Components", ylab = "US population in 1920")
plot(df2$pc2_score[order(df2$pc2_score)], df2$x[order(df2$pc2_score)], col = "firebrick",
pch = 16, xlab = "2nd Principal Components", ylab = "US population in 1930")
par(mfrow=c(1,1))
# visualize pca
# library(devtools)
# install_github("vqv/ggbiplot")
library(ggbiplot)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:ggvis':
##
## resolution
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Loading required package: scales
##
## Attaching package: 'scales'
## The following objects are masked from 'package:ggvis':
##
## fullseq, zero_range
## Loading required package: grid
ggbiplot(city.pca, labels = rownames(bigcity))
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(2)
# Prep data (Credit data from the ISLR package)
library(ISLR)
credit <- Credit[,-1]
x <- model.matrix(Balance ~ ., credit)[,-1]
y <- credit$Balance
x <- scale(x)
pcr.fit <- pcr(y ~ x, scale = TRUE, validation = "CV") # 10-fold CV
summary(pcr.fit)
## Data: X dimension: 400 11
## Y dimension: 400 1
## Fit method: svdpc
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 460.3 299.2 299.4 293.3 291.6 289.0 286.6
## adjCV 460.3 298.9 299.1 288.7 291.9 290.9 293.6
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 264.4 264.9 265.8 100.3 100.3
## adjCV 263.8 264.5 265.5 100.1 100.1
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.05 39.64 49.73 59.74 68.89 77.73 86.43 93.91
## y 58.07 58.37 60.78 60.90 61.46 63.11 68.70 68.71
## 9 comps 10 comps 11 comps
## X 97.60 99.98 100.00
## y 68.72 95.47 95.51
validationplot(pcr.fit, val.type = "MSEP")
# train vs test
set.seed(1)
pcr.fit <- pcr(Balance ~., data = credit, subset = train, scale = TRUE, validation = "CV") # 10-fold CV
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred1 <- predict(pcr.fit, x[test,], ncomp = 10)
pcr.pred2 <- predict(pcr.fit, x[test,], ncomp = 1)
mean((pcr.pred1 - y.test)^2)
## [1] 1357772
mean((pcr.pred2 - y.test)^2)
## [1] 945116.9
pcr.fit <- pcr(y ~ x, scale = TRUE, ncomp = 10)
summary(pcr.fit)
## Data: X dimension: 400 11
## Y dimension: 400 1
## Fit method: svdpc
## Number of components considered: 10
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.05 39.64 49.73 59.74 68.89 77.73 86.43 93.91
## y 58.07 58.37 60.78 60.90 61.46 63.11 68.70 68.71
## 9 comps 10 comps
## X 97.60 99.98
## y 68.72 95.47
Partial Least Squares
The PCR approach that we just described involves identifying linear combinations, or directions, that best represent the predictors \(X_1,...,X_p\). We now present partial least squares (PLS), a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new set of features \(Z_1,...,Z_M\) that are linear combinations of the original features, and then fits a linear model via least squares using these \(M\) new features. But unlike PCR, PLS identifies these new features in a supervised way - that is, it makes use of the response \(Y\) in order to identify new features that not only approximate the old features well, but also are related to the response. Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
PLS computes the first direction \(Z_1\) by setting each \(\phi_{j1}\) in the PCA equation equal to the coefficient from the simple linear regression of \(Y\) onto \(X_j\). This coefficient is proportional to the correlation between \(Y\) and \(X_j\). Hence, in computing \(Z_1 = \sum_{j=1}^p \phi_{j1}X_j\), PLS places the highest weight on the variables that are most strongly related to the response. The second PLS direction is identified by first adjusting each of the variables for \(Z_1\): each variable is regressed on \(Z_1\) and the residuals are taken. These residuals can be interpreted as the remaining information that has not been explained by the first PLS direction. We then compute \(Z_2\) using this orthogonalized data in exactly the same fashion as \(Z_1\) was computed from the original data. This iterative approach can be repeated \(M\) times to identify multiple PLS components \(Z_1,...,Z_M\). Finally, at the end of this procedure, we use least squares to fit a linear model to predict \(Y\) using \(Z_1,...,Z_M\), in exactly the same fashion as for PCR. As with PCR, the number \(M\) of partial least squares directions is a tuning parameter that is typically chosen by cross-validation. We generally standardize the predictors and response before performing PLS.
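To make the construction of the first PLS direction concrete, the sketch below computes \(\phi_{j1}\) by hand as (something proportional to) the simple regression coefficient of \(Y\) on each standardized \(X_j\) for the Credit data, and compares it, after normalization, with the first column of loading weights returned by plsr(). This is an illustrative check, not output reproduced from the original analysis.
library(pls); library(ISLR)
credit <- Credit[, -1]
x <- scale(model.matrix(Balance ~ ., credit)[, -1])  # standardized predictors
y <- credit$Balance
phi1_manual <- drop(crossprod(x, y - mean(y)))       # proportional to cor(Y, X_j)
phi1_manual <- phi1_manual / sqrt(sum(phi1_manual^2))
pls1 <- plsr(y ~ x, ncomp = 1, scale = TRUE)
phi1_pls <- drop(pls1$loading.weights[, 1])
phi1_pls <- phi1_pls / sqrt(sum(phi1_pls^2))
round(cbind(manual = phi1_manual, plsr = phi1_pls), 3)  # same direction (up to sign)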
set.seed(1)
# Prep data
credit <- Credit[,-1]
x <- model.matrix(Balance ~ ., credit)[,-1]
y <- credit$Balance
x <- scale(x)
pls.fit <- plsr(y ~ x, subset = train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 200 11
## Y dimension: 200 1
## Fit method: kernelpls
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 480.1 266.0 198.8 110.4 101.4 100.4 100.4
## adjCV 480.1 265.6 197.7 107.5 100.9 100.0 100.0
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 100.26 100.2 99.90 100.14 100.23
## adjCV 99.89 99.8 99.44 99.74 99.83
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.41 34.17 38.13 50.48 59.30 64.01 71.70 80.47
## y 70.98 85.09 95.77 96.14 96.19 96.19 96.19 96.20
## 9 comps 10 comps 11 comps
## X 82.66 90.41 100.00
## y 96.24 96.24 96.24
validationplot(pls.fit, val.type = "MSEP")
pls.pred <- predict(pls.fit, x[test,], ncomp=3)
mean((pls.pred - y.test)^2)
## [1] 12191.2
pls.fit <- plsr(y ~ x, scale = TRUE, ncomp = 3)
summary(pls.fit)
## Data: X dimension: 400 11
## Y dimension: 400 1
## Fit method: kernelpls
## Number of components considered: 3
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 24.58 32.53 37.84
## y 69.67 86.53 94.95
Most traditional statistical techniques for regression and classification are intended for the low-dimensional setting in which \(n\), the number of observations, is much greater than \(p\), the number of features. In the past 20 years, new technologies have changed the way that data are collected in fields as diverse as finance, marketing, and medicine. It is now commonplace to collect an almost unlimited number of feature measurements (\(p\) very large). Data sets containing more features than observations are often referred to as high-dimensional.
When the number of features \(p\) is as large as, or larger than, the number of observations \(n\), least squares cannot (or rather, should not) be performed. The reason is simple: regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, with zero residuals, as the figure below illustrates. The problem is that when \(p>n\) or \(p \approx n\), a simple least squares regression line is too flexible and hence overfits the data.
The figure above shows the results of fitting least squares regression to features that are completely unrelated to the response. As shown, the model \(R^2\) increases to 1 as the number of features included in the model increases, and correspondingly the training set MSE decreases to 0, even though the features are completely unrelated to the response. On the other hand, the MSE on an independent test set becomes extremely large as the number of features increases, because including the additional predictors leads to a vast increase in the variance of the coefficient estimates. Looking at the test set MSE, it is clear that the best model contains at most a few variables. However, someone who carelessly examines only the \(R^2\) or the training set MSE might erroneously conclude that the model with the greatest number of variables is best. This illustrates the importance of applying extra care when analyzing data sets with a large number of variables, and of always evaluating model performance on an independent test set. Unfortunately, the \(C_p\), AIC, and BIC approaches are not appropriate in the high-dimensional setting, because estimating \(\hat\sigma^2\) is problematic.
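Both phenomena are easy to reproduce with simulated noise features (a sketch; the sample size and feature counts below are arbitrary assumptions): as unrelated predictors are added, \(R^2\) climbs to 1 and the training MSE collapses toward zero, while the test MSE explodes.
set.seed(1)
n <- 40
y_tr <- rnorm(n); y_te <- rnorm(n)                      # responses unrelated to any feature
x_tr <- matrix(rnorm(n * 39), n); x_te <- matrix(rnorm(n * 39), n)
for (p in c(5, 15, 30, 39)) {
  fit  <- lm(y_tr ~ x_tr[, 1:p, drop = FALSE])
  pred <- cbind(1, x_te[, 1:p, drop = FALSE]) %*% coef(fit)
  cat(sprintf("p = %2d  R2 = %5.3f  train MSE = %6.3f  test MSE = %10.3f\n",
              p, summary(fit)$r.squared, mean(residuals(fit)^2), mean((y_te - pred)^2)))
}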
It turns out that many of the methods seen in this tutorial for fitting less flexible least squares models, such as forward stepwise selection, ridge regression, the lasso, and principal components regression, are particularly useful for performing regression in the high-dimensional setting. Essentially, these approaches avoid overfitting by using a less flexible fitting approach than least squares.
The figure above shows the results of the lasso fit with \(n = 100\) observations and three values of \(p\), the number of features; of the \(p\) features, 20 were truly associated with the response. As the figure shows, the lasso performs well when \(p\) is close to the number of truly associated features, but it loses the power to discriminate the relevant features when \(p\) is much larger than the number of true features. The figure highlights three important points: (1) regularization or shrinkage plays a key role in high-dimensional problems, (2) appropriate tuning parameter selection is crucial for good predictive performance, and (3) the test error tends to increase as the dimensionality of the problem increases, unless the additional features are truly associated with the response.
The third point above is in fact a key principle in the analysis of high-dimensional data, known as the curse of dimensionality. In general, adding signal features that are truly associated with the response will improve the fitted model, in the sense of leading to a reduction in test set error. However, adding noise features that are not truly associated with the response will lead to a deterioration in the fitted model, and consequently an increased test set error.
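This degradation can be sketched with glmnet (simulated data; the dimensions and signal strength below are assumptions chosen only to mimic the setting described above): as \(p\) grows while the number of truly relevant features stays fixed at 20, the lasso selects more irrelevant variables and recovers fewer of the true ones.
library(glmnet)
set.seed(1)
n <- 100
for (p in c(20, 50, 2000)) {
  x <- matrix(rnorm(n * p), n, p)
  beta <- c(rep(1, 20), rep(0, p - 20))      # only the first 20 features matter
  y <- drop(x %*% beta) + rnorm(n)
  cvfit <- cv.glmnet(x, y)
  b <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]
  cat(sprintf("p = %4d: %2d nonzero coefficients, %2d of the 20 true features recovered\n",
              p, sum(b != 0), sum(b[1:20] != 0)))
}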
When we perform the lasso, ridge regression, or other regression procedures in the high-dimensional setting, we must be quite cautious in the way that we report the results obtained. In the high-dimensional setting, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all of the other variables in the model. Essentially, this means that we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients for use in the regression.
For instance, suppose that we are trying to predict blood pressure on the basis of half a million SNPs, and that forward stepwise selection indicates that 17 of those SNPs lead to a good predictive model on the training data. It would be incorrect to conclude that these 17 SNPs predict blood pressure more effectively than the other SNPs not included in the model. There are likely to be many sets of 17 SNPs that would predict blood pressure just as well as the selected model. If we were to obtain an independent data set and perform forward stepwise selection on that data set, we would likely obtain a model containing a different, and perhaps even non-overlapping, set of SNPs. This does not detract from the value of the model obtained - for instance, the model might turn out to be very effective in predicting blood pressure on an independent set of patients, and might be clinically useful for physicians. But we must be careful not to overstate the results obtained, and to make it clear that what we have identified is simply one of many possible models for predicting blood pressure, and that it must be further validated on independent data sets.
It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. We have seen that when \(p>n\), it is easy to obtain a useless model that has zero residuals. Therefore, one should never use the sum of squared errors, p-values, \(R^2\) statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting. Doing so might mislead others into thinking that a statistically valid and useful model has been obtained, whereas in fact it provides absolutely no evidence of a compelling model. It is important to instead report results on an independent test set, or cross-validation errors. For instance, the MSE or \(R^2\) on an independent test set is a valid measure of model fit, but the MSE on the training set certainly is not.
As a closing example, we apply all of the methods we have learned to the Hitters data set. Here we wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year. The analysis pipeline is given below.
## 1. preprocessing ################
# data cleaning
library(ISLR)
sum(is.na(Hitters))
## [1] 59
hitters <- na.omit(Hitters)
names(hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
## 2. Best subset selection ################
# model fit
library(leaps)
regfit.full <- regsubsets(Salary~., hitters, nvmax=19)
reg.summary <- summary(regfit.full)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
# print R^2
reg.summary$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
# plot rss and adjr2
par(mfrow = c(2,2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type ='l')
points(which.min(reg.summary$rss),
reg.summary$rss[which.min(reg.summary$rss)], col = "red", cex = 2, pch=20)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted Rsq", type ='l')
points(which.max(reg.summary$adjr2),
reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch=20)
# plot cp and bic
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type ='l')
points(which.min(reg.summary$cp),
reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch=20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type ='l')
points(which.min(reg.summary$bic),
reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch=20)
par(mfrow = c(1,1))
# Using base plot
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")
# confirm coefficient
coef(regfit.full, which.min(reg.summary$bic))
## (Intercept) AtBat Hits Walks CRBI
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169
## DivisionW PutOuts
## -122.9515338 0.2643076
## 3. Forward and Backward Stepwise selection ################
regfit.fwd <- regsubsets(Salary~., data = hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
regfit.bwd <- regsubsets(Salary~., data = hitters, nvmax = 19, method = "backward")
regfit.hbd <- regsubsets(Salary~., data = hitters, nvmax = 19, method = "seqrep")
coef(regfit.full, 7)
## (Intercept) Hits Walks CAtBat CHits
## 79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073
## CHmRun DivisionW PutOuts
## 1.4420538 -129.9866432 0.2366813
coef(regfit.fwd, 7)
## (Intercept) AtBat Hits Walks CRBI
## 109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622
## CWalks DivisionW PutOuts
## -0.3053070 -127.1223928 0.2533404
coef(regfit.bwd, 7)
## (Intercept) AtBat Hits Walks CRuns
## 105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095
## CWalks DivisionW PutOuts
## -0.7163346 -116.1692169 0.3028847
coef(regfit.hbd, 7)
## (Intercept) AtBat Hits Walks CRuns
## 105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095
## CWalks DivisionW PutOuts
## -0.7163346 -116.1692169 0.3028847
## 4. Validation set & cross-validation ################
# validation set
set.seed(1)
p <- dim(hitters)[2]-1
train <- sample(c(TRUE, FALSE), nrow(hitters), rep = TRUE)
test <- (!train)
regfit.best <- regsubsets(Salary ~ ., data = hitters[train,], nvmax = 19)
test.mat <- model.matrix(Salary~., data = hitters[test,])
val.errors <- rep(NA, p)
for (i in 1:p){
coefi <- coef(regfit.best, id = i)
pred <- test.mat[, names(coefi)] %*% coefi
val.errors[i] <- mean((hitters$Salary[test] - pred)^2)
}
val.errors
## [1] 220968.0 169157.1 178518.2 163426.1 168418.1 171270.6 162377.1
## [8] 157909.3 154055.7 148162.1 151156.4 151742.5 152214.5 157358.7
## [15] 158541.4 158743.3 159972.7 159859.8 160105.6
coef(regfit.best, which.min(val.errors))
## (Intercept) AtBat Hits Walks CAtBat CHits
## -80.2751499 -1.4683816 7.1625314 3.6430345 -0.1855698 1.1053238
## CHmRun CWalks LeagueN DivisionW PutOuts
## 1.3844863 -0.7483170 84.5576103 -53.0289658 0.2381662
# regsubsets() has no predict() method, so define one for later use
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])   # formula used in the original fit
  mat <- model.matrix(form, newdata)     # design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the size-id model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
regfit.best <- regsubsets(Salary~., data = hitters, nvmax = p)
coef(regfit.best, which.min(val.errors))
## (Intercept) AtBat Hits Walks CAtBat
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798
## CRuns CRBI CWalks DivisionW PutOuts
## 1.4082490 0.7743122 -0.8308264 -112.3800575 0.2973726
## Assists
## 0.2831680
# CV
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:19)))
for(j in 1:k){
best.fit <- regsubsets(Salary~., data = hitters[folds !=j,], nvmax = 19)
for(i in 1:19){
pred <- predict(best.fit, hitters[folds == j,], id=i)
cv.errors[j,i] = mean((hitters$Salary[folds==j]-pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
reg.best <- regsubsets(Salary~., data = hitters, nvmax=19)
coef(reg.best, which.min(mean.cv.errors))
## (Intercept) AtBat Hits Walks CAtBat
## 135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914
## CRuns CRBI CWalks LeagueN DivisionW
## 1.4553310 0.7852528 -0.8228559 43.1116152 -111.1460252
## PutOuts Assists
## 0.2894087 0.2688277
## 5. Ridge regression and the lasso ################
# Ridge regression
x <- model.matrix(Salary~., hitters)[, -1]
y <- hitters$Salary
library(glmnet)
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x,y,alpha = 0, lambda = grid)
dim(coef(ridge.mod))
## [1] 20 100
ridge.mod$lambda[50]
## [1] 11497.57
coef(ridge.mod)[,50]
## (Intercept) AtBat Hits HmRun Runs
## 407.356050200 0.036957182 0.138180344 0.524629976 0.230701523
## RBI Walks Years CAtBat CHits
## 0.239841459 0.289618741 1.107702929 0.003131815 0.011653637
## CHmRun CRuns CRBI CWalks LeagueN
## 0.087545670 0.023379882 0.024138320 0.025015421 0.085028114
## DivisionW PutOuts Assists Errors NewLeagueN
## -6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.360612
ridge.mod$lambda[60]
## [1] 705.4802
coef(ridge.mod)[,60]
## (Intercept) AtBat Hits HmRun Runs
## 54.32519950 0.11211115 0.65622409 1.17980910 0.93769713
## RBI Walks Years CAtBat CHits
## 0.84718546 1.31987948 2.59640425 0.01083413 0.04674557
## CHmRun CRuns CRBI CWalks LeagueN
## 0.33777318 0.09355528 0.09780402 0.07189612 13.68370191
## DivisionW PutOuts Assists Errors NewLeagueN
## -54.65877750 0.11852289 0.01606037 -0.70358655 8.61181213
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 57.11001
predict(ridge.mod, s=50, type = "coefficients")[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
## RBI Walks Years CAtBat CHits
## 8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00
# ridge validation set
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]
ridge.mod <- glmnet(x[train,], y[train], alpha=0, lambda=grid, thresh=1e-12)
ridge.pred <- predict(ridge.mod, s=4, newx = x[test,])
mean((ridge.pred-y.test)^2) # test error
## [1] 101036.8
mean((mean(y[train])-y.test)^2) # test error of intercept
## [1] 193253.1
ridge.pred <- predict(ridge.mod, s= 1e10, newx = x[test,]) # test error of lambda = infty
mean((ridge.pred - y.test)^2)
## [1] 193253.1
ridge.pred <- predict(ridge.mod, s=0, newx = x[test,], exact = TRUE, x=x[train,], y=y[train])
mean((ridge.pred - y.test)^2)
## [1] 114783.1
lm(y~x, subset = train)
##
## Call:
## lm(formula = y ~ x, subset = train)
##
## Coefficients:
## (Intercept) xAtBat xHits xHmRun xRuns
## 299.42849 -2.54027 8.36682 11.64512 -9.09923
## xRBI xWalks xYears xCAtBat xCHits
## 2.44105 9.23440 -22.93673 -0.18154 -0.11598
## xCHmRun xCRuns xCRBI xCWalks xLeagueN
## -1.33888 3.32838 0.07536 -1.07841 59.76065
## xDivisionW xPutOuts xAssists xErrors xNewLeagueN
## -98.86233 0.34087 0.34165 -0.64207 -0.67442
predict(ridge.mod, alpha=0, type="coefficients", exact=TRUE, x=x[train,], y=y[train])[1:20,1:4]
## 20 x 4 sparse Matrix of class "dgCMatrix"
## s0 s1 s2 s3
## (Intercept) 5.479846e+02 5.479845e+02 5.479844e+02 5.479843e+02
## AtBat 5.192368e-08 6.864003e-08 9.073807e-08 1.199504e-07
## Hits 1.790580e-07 2.367041e-07 3.129088e-07 4.136470e-07
## HmRun 8.926999e-07 1.180096e-06 1.560018e-06 2.062251e-06
## Runs 3.141580e-07 4.152984e-07 5.489999e-07 7.257455e-07
## RBI 3.533162e-07 4.670632e-07 6.174300e-07 8.162059e-07
## Walks 4.732016e-07 6.255446e-07 8.269331e-07 1.093157e-06
## Years 1.198689e-06 1.584595e-06 2.094742e-06 2.769125e-06
## CAtBat 3.981558e-09 5.263384e-09 6.957883e-09 9.197911e-09
## CHits 1.501273e-08 1.984594e-08 2.623516e-08 3.468133e-08
## CHmRun 9.980324e-08 1.319340e-07 1.744089e-07 2.305583e-07
## CRuns 3.232510e-08 4.273187e-08 5.648901e-08 7.467513e-08
## CRBI 2.741340e-08 3.623890e-08 4.790568e-08 6.332848e-08
## CWalks 3.625065e-08 4.792122e-08 6.334902e-08 8.374366e-08
## LeagueN 3.471972e-06 4.589742e-06 6.067369e-06 8.020705e-06
## DivisionW -8.090096e-06 -1.069463e-05 -1.413767e-05 -1.868917e-05
## PutOuts 2.727067e-08 3.605021e-08 4.765626e-08 6.299876e-08
## Assists 1.288616e-08 1.703474e-08 2.251892e-08 2.976869e-08
## Errors 2.804015e-07 3.706743e-07 4.900095e-07 6.477636e-07
## NewLeagueN 3.730442e-06 4.931425e-06 6.519053e-06 8.617804e-06
# ridge CV
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha=0)
plot(cv.out)
bestlam = cv.out$lambda.min
bestlam
## [1] 211.7416
ridge.pred <- predict(ridge.mod, s=bestlam, newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 96015.51
out <- glmnet(x,y,alpha=0)
predict(out, type="coefficients", s=bestlam)[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 9.88487157 0.03143991 1.00882875 0.13927624 1.11320781
## RBI Walks Years CAtBat CHits
## 0.87318990 1.80410229 0.13074383 0.01113978 0.06489843
## CHmRun CRuns CRBI CWalks LeagueN
## 0.45158546 0.12900049 0.13737712 0.02908572 27.18227527
## DivisionW PutOuts Assists Errors NewLeagueN
## -91.63411282 0.19149252 0.04254536 -1.81244470 7.21208394
# The lasso
lasso.mod <- glmnet(x[train,], y[train], alpha=1, lambda=grid)
plot(lasso.mod)
# lasso CV
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha=1)
plot(cv.out)
bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s=bestlam, newx = x[test,])
mean((lasso.pred - y.test)^2)
## [1] 100743.4
out <- glmnet(x, y, alpha=1, lambda = grid)
lasso.coef <- predict(out, type = "coefficients", s= bestlam)[1:20, ]
lasso.coef
## (Intercept) AtBat Hits HmRun Runs
## 18.5394844 0.0000000 1.8735390 0.0000000 0.0000000
## RBI Walks Years CAtBat CHits
## 0.0000000 2.2178444 0.0000000 0.0000000 0.0000000
## CHmRun CRuns CRBI CWalks LeagueN
## 0.0000000 0.2071252 0.4130132 0.0000000 3.2666677
## DivisionW PutOuts Assists Errors NewLeagueN
## -103.4845458 0.2204284 0.0000000 0.0000000 0.0000000
## 6. PCR and PLS Regression ################
# PCR
library(pls)
set.seed(2)
# cv error for full model
pcr.fit <- pcr(Salary~., data=hitters, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 348.9 352.2 353.5 352.8 350.1 349.1
## adjCV 452 348.7 351.8 352.9 352.1 349.3 348.0
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 349.6 350.9 352.9 353.8 355.0 356.2 363.5
## adjCV 348.5 349.8 351.6 352.3 353.4 354.5 361.6
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 355.2 357.4 347.6 350.1 349.2 352.6
## adjCV 352.8 355.2 345.5 347.6 346.7 349.8
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 94.96 96.28 97.26 97.98 98.65 99.15 99.47
## Salary 46.75 46.86 47.76 47.82 47.85 48.10 50.40
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.75 99.89 99.97 99.99 100.00
## Salary 50.55 53.01 53.85 54.61 54.61
validationplot(pcr.fit, val.type = "MSEP")
# test CV error
set.seed(1)
pcr.fit <- pcr(Salary~., data=hitters, subset = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
summary(pcr.fit) # inspect minimum cv error components
## Data: X dimension: 131 19
## Y dimension: 131 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 464.6 406.1 397.1 399.1 398.6 395.2 386.9
## adjCV 464.6 405.2 396.3 398.1 397.4 394.5 384.5
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 384.8 386.5 394.1 406.1 406.5 412.3 407.7
## adjCV 383.3 384.8 392.0 403.4 403.7 409.3 404.6
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 406.2 417.8 417.6 413.0 407.0 410.2
## adjCV 402.8 413.9 413.5 408.3 402.4 405.5
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.89 60.25 70.85 79.06 84.01 88.51 92.61
## Salary 28.44 31.33 32.53 33.69 36.64 40.28 40.41
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 95.20 96.78 97.63 98.27 98.89 99.27 99.56
## Salary 41.07 41.25 41.27 41.41 41.44 43.20 44.24
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.78 99.91 99.97 100.00 100.00
## Salary 44.30 45.50 49.66 51.13 51.18
pcr.pred <- predict(pcr.fit, x[test,], ncomp = 7)
mean((pcr.pred - y.test)^2)
## [1] 96556.22
pcr.fit <- pcr(y ~ x, scale = TRUE, ncomp = 7)
summary(pcr.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## y 40.63 41.58 42.17 43.22 44.90 46.48 46.69
# PLS
set.seed(1)
pls.fit <- plsr(Salary~., data=hitters, subset = train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 131 19
## Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 464.6 394.2 391.5 393.1 395.0 415.0 424.0
## adjCV 464.6 393.4 390.2 391.1 392.9 411.5 418.8
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 424.5 415.8 404.6 407.1 412.0 414.4 410.3
## adjCV 418.9 411.4 400.7 402.2 407.2 409.3 405.6
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 406.2 408.6 410.5 408.8 407.8 410.2
## adjCV 401.8 403.9 405.6 404.1 403.2 405.5
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.12 53.46 66.05 74.49 79.33 84.56 87.09
## Salary 33.58 38.96 41.57 42.43 44.04 45.59 47.05
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 90.74 92.55 93.94 97.23 97.88 98.35 98.85
## Salary 47.53 48.42 49.68 50.04 50.54 50.78 50.92
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.11 99.43 99.78 99.99 100.00
## Salary 51.04 51.11 51.15 51.16 51.18
pls.pred <- predict(pls.fit, x[test,], ncomp = 2)
mean((pls.pred - y.test)^2)
## [1] 101417.5
pls.fit <- plsr(Salary~., data = hitters, scale = TRUE, ncomp = 2)
summary(pls.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
## 1 comps 2 comps
## X 38.08 51.03
## Salary 43.05 46.40
## 7. Conclusion ################
# note: the subset-selection errors below are 10-fold CV estimates on the full data,
# while the ridge/lasso/PCR/PLS errors are validation-set test MSEs
error.bestsubset <- min(mean.cv.errors)
error.ridge <- mean((ridge.pred-y.test)^2)
error.lasso <- mean((lasso.pred - y.test)^2)
error.pcr <- mean((pcr.pred - y.test)^2)
error.pls <- mean((pls.pred - y.test)^2)
# CV foward
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:19)))
for(j in 1:k){
best.fit <- regsubsets(Salary~., data = hitters[folds !=j,], nvmax = 19, method = "forward")
for(i in 1:19){
pred <- predict(best.fit, hitters[folds == j,], id=i)
cv.errors[j,i] = mean((hitters$Salary[folds==j]-pred)^2)
}
}
error.forward <- min(apply(cv.errors, 2, mean))
# CV backward
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:19)))
for(j in 1:k){
best.fit <- regsubsets(Salary~., data = hitters[folds !=j,], nvmax = 19, method = "backward")
for(i in 1:19){
pred <- predict(best.fit, hitters[folds == j,], id=i)
cv.errors[j,i] = mean((hitters$Salary[folds==j]-pred)^2)
}
}
error.backward <- min(apply(cv.errors, 2, mean))
# CV hybrid
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(hitters), replace = TRUE)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:19)))
for(j in 1:k){
best.fit <- regsubsets(Salary~., data = hitters[folds !=j,], nvmax = 19, method = "seqrep")
for(i in 1:19){
pred <- predict(best.fit, hitters[folds == j,], id=i)
cv.errors[j,i] = mean((hitters$Salary[folds==j]-pred)^2)
}
}
error.hybrid <- min(apply(cv.errors, 2, mean))
# print result
result <- data.frame(bestsub = error.bestsubset, fwd = error.forward,
bwd = error.backward, hyb = error.hybrid, ridge = error.ridge,
lasso = error.lasso, pcr = error.pcr, pls = error.pls)
barplot(as.numeric(result), names.arg = colnames(result), ylab = "Test MSE", col = rainbow(8))
knitr::kable(result)
| bestsub | fwd | bwd | hyb | ridge | lasso | pcr | pls |
|---|---|---|---|---|---|---|---|
| 125153.8 | 125654.9 | 127701.7 | 120589.3 | 96015.51 | 100743.4 | 96556.22 | 101417.5 |
In this comparison, ridge regression attains the lowest error for the Hitters data, although the subset selection errors are 10-fold cross-validation estimates while the ridge, lasso, PCR, and PLS errors are validation-set test MSEs, so the numbers are not strictly comparable.
# Install the Gurobi optimizer and the gurobi R package prior to installing the bestsubset package.
# See http://www.gurobi.com/index for more information.
# library(devtools)
# install_github(repo="ryantibs/best-subset", subdir="bestsubset")
library(bestsubset)
## Loading required package: gurobi
## Loading required package: slam
library(dplyr)  # for %>% and select() (dplyr may already be attached earlier in the document)
X <- hitters %>% select(-c(Salary, League, Division, NewLeague))
Y <- hitters$Salary
res <- bestsubset::bs(X, Y)
Ref: Hastie, Trevor, Robert Tibshirani, and Ryan J. Tibshirani. “Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso.” arXiv preprint arXiv:1707.08692 (2017).
You can practice these techniques with the datasets provided in this Databank.