Contents

Part I

Basic Supervised Learning (James Ch4)

  • Logistic Regression, LDA, QDA, Naive Bayes, KNN

  • Tutorial: Comparison of classification methods with “Stock Market”, “Caravan Insurance” data

Resampling (James Ch5)

  • Validation Set, Leave-One-Out, K-fold Cross Validation, Bootstrap

  • Tutorial: Validation of the performance of different classification methods and resampling methods with “Auto” data

Part II

Model Selection (James Ch6)

  • Subset selection (Best subset selection, Stepwise selection)

  • Shrinkage Methods (Ridge Regression, Lasso)

  • Dimension Reduction Methods (PCR, PLS)

  • Tutorial: Test of the subset selection methods with validation of feature selection & dimension reduction (“Hitters” data)

Tutorial

Logistic Regression (s4 ~ s7)

Logistic regression for a binary class (s4 ~ s5)

A linear regression model that predicts a binary outcome is given as \[p(X) = \beta_0 + \beta_1 X\] and this model may return \(p(X) < 0\) or \(p(X) > 1\) for some values of \(X\). To avoid this problem, we use the logistic function, \[p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1+e^{\beta_0 + \beta_1 X}}\] and this equation can be rearranged as \[\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1 X}\] The quantity \(\frac{p(X)}{1-p(X)}\) is called the odds, and can take on any value between \(0\) and \(\infty\). Odds are traditionally used instead of probabilities in horse-racing, since they relate more naturally to the correct betting strategy (the probability of winning versus the probability of losing). By taking the logarithm of both sides, we arrive at \[\log \bigg( \frac{p(X)}{1-p(X)} \bigg) = \beta_0 + \beta_1X\] The left-hand side is called the log-odds or logit. We see that the logistic regression model has a logit that is linear in \(X\).
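
As a quick numerical aside (a minimal sketch; the probability value is arbitrary), base R’s plogis() and qlogis() implement the logistic function and the logit, so the relationship between probabilities, odds, and log-odds can be checked directly:

# quick numeric illustration of odds and log-odds (the probability is arbitrary)
p <- 0.8
odds <- p / (1 - p)    # 4: the event is four times as likely to occur as not
log(odds)              # log-odds (logit); identical to qlogis(p)
plogis(log(odds))      # the logistic function inverts the logit and recovers 0.8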

Based on the available training data, we need to estimate the coefficients \(\beta_0\) and \(\beta_1\) by maximizing the likelihood function. The likelihood function is given as \[L(\beta_0, \beta_1) = \prod_{i= 1}^np(x_i)^{y_i}(1-p(x_{i}))^{1-y_i}\] The estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to maximize this likelihood function. To solve the above optimization problem, we first take the logarithm; the log-likelihood function then becomes \[\begin{aligned}\ell(\beta_0, \beta_1) &= \sum_{i= 1}^n y_i \log p(x_i) + (1-y_i) \log(1-p(x_{i})) \\ & = \sum_{i= 1}^n \log (1-p(x_i)) + \sum_{i= 1}^n y_i\log \bigg(\frac{p(x_i)}{1-p(x_{i})}\bigg) \\ & = \sum_{i= 1}^n \log \bigg(1-\frac{e^{\beta_0 + \beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}}\bigg) + \sum_{i= 1}^n y_i(\beta_0 + x_i \beta_1) \\ & = -\sum_{i= 1}^n \log (1 + e^{\beta_0 + \beta_1 x_i}) + \sum_{i= 1}^n y_i(\beta_0 + x_i \beta_1) \\ \end{aligned}\] To find the maximum likelihood estimates, we differentiate the log-likelihood with respect to the parameters and set the derivatives equal to zero. For \(\beta_1\), \[\frac{\partial \ell}{\partial \beta_1} = -\sum_{i= 1}^n x_i\frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} + \sum_{i= 1}^n y_ix_i = \sum_{i= 1}^n (y_i - p(x_i;\beta_0, \beta_1))x_i=0\] This equation has no explicit (closed-form) solution because of the nonlinearity of the logistic sigmoid function, but it can be solved by numerical methods such as the Newton-Raphson method.
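
To make the optimization concrete, here is a minimal sketch (with a made-up six-point toy data set, not from the slides) that maximizes the log-likelihood numerically with optim() and compares the result to glm():

# toy data (assumed for illustration only)
x <- c(0.5, 1.2, 2.3, 3.1, 4.0, 5.2)
y <- c(0, 0, 1, 0, 1, 1)

# negative log-likelihood of the logistic model (optim() minimizes)
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  sum(log(1 + exp(eta))) - sum(y * eta)
}

fit <- optim(c(0, 0), negloglik)        # numerical maximization of the log-likelihood
fit$par                                 # beta_0 and beta_1 found by optim
coef(glm(y ~ x, family = binomial))     # should be close to fit$par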

Multiple logistic regression (s5)

We now consider the problem of predicting a binary response using multiple predictors by generalizing the logistic regression for binary class. \[\log \bigg(\frac{p(X)}{1-p(X)}\bigg) = \beta_0 + \beta_1 X_1 + ... + \beta_pX_p\] where \(X = (X_1, ..., X_p)\) are \(p\) predictors. Then, \[p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + ... + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1 X_1 + ... + \beta_pX_p}}\] and, again, we can use the maximum likelihood method to estimate \(\beta_0, \beta_1, ..., \beta_p\).

Note that logistic regression can be extended to more than two response classes, but in practice such models tend not to be used all that often. Instead, we use other methods such as discriminant analysis.

Example of Logistic Regression - default vs balance (s6 ~ 7)

# training data set (s6)
training <- data.frame(balance = c(1000,800,100,50), default = c("Yes","Yes","No","No"))

glm.fits <- glm(default ~ balance, data = training, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fits)
## 
## Call:
## glm(formula = default ~ balance, family = binomial, data = training)
## 
## Deviance Residuals: 
##          1           2           3           4  
##  2.110e-08   1.105e-05  -1.020e-05  -1.890e-06  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.042e+01  9.490e+04       0        1
## balance      6.743e-02  1.616e+02       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5.5452e+00  on 3  degrees of freedom
## Residual deviance: 2.2972e-10  on 2  degrees of freedom
## AIC: 4
## 
## Number of Fisher Scoring iterations: 23
b0 <- coef(glm.fits)[1]
b1 <- coef(glm.fits)[2]

b0;b1
## (Intercept) 
##   -30.42224
##    balance 
## 0.06742674
# likelihood evaluated at the fitted coefficients (the maximized likelihood)
likelihood <- exp(b0+1000*b1)/(1+exp(b0+1000*b1)) * 
  exp(b0+800*b1)/(1+exp(b0+800*b1)) * 
  1/(1+exp(b0+100*b1)) * 
  1/(1+exp(b0+50*b1)); likelihood
## (Intercept) 
##           1
# prediction
pred <- predict(glm.fits, type = "response")

# accuracy
table(pred>0.5, training$default)
##        
##         No Yes
##   FALSE  2   0
##   TRUE   0   2
## Large default data set (s7)
library(ISLR)
data(Default)

glm.fits <- glm(default ~ balance, data = Default, family = binomial)
summary(glm.fits)
## 
## Call:
## glm(formula = default ~ balance, family = binomial, data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
## balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8
pred <- predict(glm.fits, type = "response")
predict(glm.fits, newdata = data.frame(balance = 1000), type = "response")
##           1 
## 0.005752145
table(pred>0.5, Default$default)
##        
##           No  Yes
##   FALSE 9625  233
##   TRUE    42  100
# Multiple Logistic regression (s7)
glm.fits <- glm(default ~ balance + income + student, data = Default, family = binomial)
summary(glm.fits)
## 
## Call:
## glm(formula = default ~ balance + income + student, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8
pred <- predict(glm.fits, type = "response")
predict(glm.fits, newdata = data.frame(balance = 1500, 
                                       income = 40000, student = "Yes"), type = "response")
##          1 
## 0.05788194
predict(glm.fits, newdata = data.frame(balance = 1500, 
                                       income = 40000, student = "No"), type = "response")
##         1 
## 0.1049919
table(pred>0.5, Default$default)
##        
##           No  Yes
##   FALSE 9627  228
##   TRUE    40  105
# cf) Logistic regression with binary explanatory variable
glm.fits <- glm(default ~ student, data = Default, family = binomial)
summary(glm.fits)
## 
## Call:
## glm(formula = default ~ student, family = binomial, data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2970  -0.2970  -0.2434  -0.2434   2.6585  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
## studentYes   0.40489    0.11502    3.52 0.000431 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 2908.7  on 9998  degrees of freedom
## AIC: 2912.7
## 
## Number of Fisher Scoring iterations: 6
pred <- predict(glm.fits, type = "response")
table(pred>mean(unique(pred)), Default$default)
##        
##           No  Yes
##   FALSE 6850  206
##   TRUE  2817  127
# cf) Logistic regression with training and test set
train <- sample(10000, 7000)
test <- Default[-train,]

glm.fit <- glm(default ~ balance + income + student, data = Default, family = binomial, subset = train)
pred <- predict(glm.fit,test, type = "response")
d <- table(pred>0.5, test$default)
sum(diag(d))/sum(d)
## [1] 0.9746667

Solution

  1. \(\beta_0\) and \(\beta_1\) for 4 training samples

    \(\beta_0 = -30.42, \beta_1 = 0.067\), but they lack statistical significance (the four training points are perfectly separable, hence the huge standard errors).

  2. \(p(\$1,000)\) for the logistic regression model which trained on the large default data set

    \(p(\$1,000)= 0.0058\).

  3. \(p(\$1,500, \$40,000, Student = Yes)\) for the multiple logistic regression model

    \(p(\$1,500, \$40,000, Student = Yes)= 0.058\),

    \(p(\$1,500, \$40,000, Student = No)= 0.105\).

    Notice that unlike our lecture slide, we did not scale income in units of \(\$1,000\).

  4. Overall conclusion

    From our modeling, you can see that a student is riskier than a non-student when no information about the student’s credit card balance is available. However, a student is less risky than a non-student with the same credit card balance! (In other words, balance acts as a confounding variable: compare the coefficient of student in the simple logistic regression with its coefficient in the multiple logistic regression model.)

Linear Discriminant Analysis (s8 ~ s11)

Introduction to LDA (s8 ~ s9)

Suppose that we wish to classify an observation into one of \(K\) classes, where \(K\geq 2\). Let \(\pi_k\) represent the overall or prior probability that a randomly chosen observation comes from the \(k_{th}\) class, and let \(f_k(x) \equiv Pr(X=x | Y=k)\) denote the density function of \(X\) for an observation that comes from the \(k_{th}\) class. Then Bayes’ theorem states that \[Pr(Y=k|X=x) = p_k(X) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}\] Assume \(p=1\) and \(f_k(x)\) is normal (Gaussian). Then, \[f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k} \exp\bigg(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2\bigg)\] where \(\mu_k\) and \(\sigma_k^2\) are the mean and variance parameters for the \(k_{th}\) class. Let us further assume that \(\sigma_1^2 = ... = \sigma_K^2 = \sigma^2\). Then, \[p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{1}{2\sigma^2}(x-\mu_k)^2\Big)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{1}{2\sigma^2}(x-\mu_l)^2\Big)}\] To maximize \(p_k(x)\), we only need to maximize the numerator of the above equation. By taking the logarithm and discarding the terms that do not depend on \(k\), we obtain the linear discriminant functions \[\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)\]

When \(K=2\) and \(\pi_1 = \pi_2 = 0.5\), the Bayes classifier assigns an observation to class 1 if \(2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2\). When \(K=2\) and \(\pi_1 \neq \pi_2\), the Bayes classifier assigns an observation to class 1 if \(2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2 + 2\sigma^2\log(\frac{\pi_2}{\pi_1})\).
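
As a small numerical sketch of these one-dimensional discriminant functions (the means, common variance, and priors below are made up for illustration), we can evaluate \(\delta_1\) and \(\delta_2\) at a point and recover the midpoint decision boundary for equal priors:

# illustrative one-dimensional LDA with two classes (all parameter values are assumed)
mu    <- c(-1, 2)      # class means mu_1 and mu_2
sigma2 <- 1.5          # common variance sigma^2
prior <- c(0.5, 0.5)   # pi_1 and pi_2

# delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
delta <- function(x, k) x * mu[k] / sigma2 - mu[k]^2 / (2 * sigma2) + log(prior[k])

x0 <- 0.3
which.max(c(delta(x0, 1), delta(x0, 2)))   # class assigned to x0 (class 1 here)

# with equal priors, the decision boundary is the midpoint of the two means
(mu[1] + mu[2]) / 2                        # 0.5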

Cancer classification - Manual calculation of the decision boundary

Let’s see an example of cancer classification (AML vs ALL) with three different gene expression levels. We will assume that the expression of each gene is independent and follows a Gaussian distribution.

cancer <- read.table("http://bisyn.kaist.ac.kr/bis335/12-Classification_Basic-ALLnAML.csv", 
                     sep = "\t")

mu1 <- colMeans(cancer[1:10, 1:3])  #mu1 = mu_a = (-0.486587  0.624654 -0.095455)
mu2 <- colMeans(cancer[11:20, 1:3]) #mu2 = mu_b = (1.622521 1.239964 1.415364)
mu <- colMeans(cancer[1:20, 1:3])  #mu = (0.5679670 0.9323090 0.6599545)

# making covariance matrix
Ga <- cancer[1:10, ] # ALL
Gb <- cancer[11:20, ] # AML
Ga <- as.matrix(Ga)
Gb <- as.matrix(Gb)
# make the mean-corrected matrix of each group (note: here the overall mean mu is subtracted; subtracting the group means mu1 and mu2 is the usual choice for a within-group covariance)
for (i in 1:3){
  Ga[, i] <- Ga[, i] - mu[i]
  Gb[, i] <- Gb[, i] - mu[i]
}

# make covariance matrix of each group
Ca <- t(Ga)%*%Ga
Cb <- t(Gb)%*%Gb

# make pooled within group covariance matrix
Cp <- 0.5*(Ca+Cb)

# make inverse Cp matrix
Cpi <- solve(Cp)
round(Cpi%*%Cp,10)  # check identity matrix
##       ZYX FAH APLP2
## ZYX     1   0     0
## FAH     0   1     0
## APLP2   0   0     1
## a) prior probabilities
p <- c(0.5, 0.5)

## b) Find linear discriminant function for two cancer types
# for group a
mu1%*%Cpi                           # coefficient for ALL
##             ZYX     FAH     APLP2
## [1,] -0.7774935 1.70777 0.4954968
- (0.5*mu1%*%Cpi%*%mu1) + log(p[1]) # intercept for ALL
##          [,1]
## [1,] -1.39204
# for group b
mu2%*%Cpi                            # coefficient for AML
##            ZYX      FAH     APLP2
## [1,] -0.752471 1.995457 0.5786091
- (0.5*mu2%*%Cpi%*%mu2) + log(p[2])  # intercept for AML
##           [,1]
## [1,] -1.729316
## c) Decision boundary of X1
(as.numeric(mu1[1])+as.numeric(mu2[1]))/2
## [1] 0.567967

Cancer classification - Using R function

Now let’s perform the same analysis with R packages.

# LDA with equal number of samples
# Univariate Gaussian - Gene #1, X_ZYX
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
G1_ALL <- rnorm(n=10000, mean = -0.49, sd = sqrt(14.07))
G1_AML <- rnorm(n=10000, mean = 1.62, sd = sqrt(14.07))
G1_ALL <- data.frame(gex = G1_ALL, cl = rep("ALL", length(G1_ALL)))
G1_AML <- data.frame(gex = G1_AML, cl = rep("AML", length(G1_AML)))
G1_data <- rbind(G1_ALL, G1_AML)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
G1.fit <- lda(cl ~ gex, data=G1_data); G1.fit
## Call:
## lda(cl ~ gex, data = G1_data)
## 
## Prior probabilities of groups:
## ALL AML 
## 0.5 0.5 
## 
## Group means:
##            gex
## ALL -0.4667464
## AML  1.5848551
## 
## Coefficients of linear discriminants:
##           LD1
## gex 0.2659218
# Univariate Gaussian - Gene #2, X_FAH
G2_ALL <- rnorm(n=10000, mean = 0.63, sd = sqrt(1.31))
G2_AML <- rnorm(n=10000, mean = 1.24, sd = sqrt(1.31))
G2_ALL <- data.frame(gex = G2_ALL, cl = rep("ALL", length(G2_ALL)))
G2_AML <- data.frame(gex = G2_AML, cl = rep("AML", length(G2_AML)))
G2_data <- rbind(G2_ALL, G2_AML)

G2.fit <- lda(cl ~ gex, data=G2_data); G2.fit
## Call:
## lda(cl ~ gex, data = G2_data)
## 
## Prior probabilities of groups:
## ALL AML 
## 0.5 0.5 
## 
## Group means:
##           gex
## ALL 0.6035368
## AML 1.2425651
## 
## Coefficients of linear discriminants:
##           LD1
## gex 0.8687246
# Univariate Gaussian - Gene #3, X_APLP2
G3_ALL <- rnorm(n=10000, mean = -0.1, sd = sqrt(8.67))
G3_AML <- rnorm(n=10000, mean = 1.42, sd = sqrt(8.67))
G3_ALL <- data.frame(gex = G3_ALL, cl = rep("ALL", length(G3_ALL)))
G3_AML <- data.frame(gex = G3_AML, cl = rep("AML", length(G3_AML)))
G3_data <- rbind(G3_ALL, G3_AML)

G3.fit <- lda(cl ~ gex, data=G3_data); G3.fit
## Call:
## lda(cl ~ gex, data = G3_data)
## 
## Prior probabilities of groups:
## ALL AML 
## 0.5 0.5 
## 
## Group means:
##             gex
## ALL -0.09498448
## AML  1.45780014
## 
## Coefficients of linear discriminants:
##           LD1
## gex 0.3410177
# LDA with different number of samples
# Univariate Gaussian - Gene #1, X_ZYX
G1_ALL <- rnorm(n=15000, mean = -0.49, sd = sqrt(14.07))
G1_AML <- rnorm(n=5000, mean = 1.62, sd = sqrt(14.07))
G1_ALL <- data.frame(gex = G1_ALL, cl = rep("ALL", length(G1_ALL)))
G1_AML <- data.frame(gex = G1_AML, cl = rep("AML", length(G1_AML)))
G1_data <- rbind(G1_ALL, G1_AML)

G1.fit <- lda(cl ~ gex, data=G1_data, cv=TRUE); G1.fit  # note: leave-one-out CV in MASS::lda is requested with CV=TRUE (capitalized); the lowercase cv here is silently ignored
## Call:
## lda(cl ~ gex, data = G1_data, cv = TRUE)
## 
## Prior probabilities of groups:
##  ALL  AML 
## 0.75 0.25 
## 
## Group means:
##            gex
## ALL -0.4932498
## AML  1.5315561
## 
## Coefficients of linear discriminants:
##           LD1
## gex 0.2633891
# Univariate Gaussian - Gene #2, X_FAH
G2_ALL <- rnorm(n=15000, mean = 0.63, sd = sqrt(1.31))
G2_AML <- rnorm(n=5000, mean = 1.24, sd = sqrt(1.31))
G2_ALL <- data.frame(gex = G2_ALL, cl = rep("ALL", length(G2_ALL)))
G2_AML <- data.frame(gex = G2_AML, cl = rep("AML", length(G2_AML)))
G2_data <- rbind(G2_ALL, G2_AML)

G2.fit <- lda(cl ~ gex, data=G2_data); G2.fit
## Call:
## lda(cl ~ gex, data = G2_data)
## 
## Prior probabilities of groups:
##  ALL  AML 
## 0.75 0.25 
## 
## Group means:
##          gex
## ALL 0.644304
## AML 1.250402
## 
## Coefficients of linear discriminants:
##           LD1
## gex 0.8711163
# Univariate Gaussian - Gene #3, X_APLP2
G3_ALL <- rnorm(n=15000, mean = -0.1, sd = sqrt(8.67))
G3_AML <- rnorm(n=5000, mean = 1.42, sd = sqrt(8.67))
G3_ALL <- data.frame(gex = G3_ALL, cl = rep("ALL", length(G3_ALL)))
G3_AML <- data.frame(gex = G3_AML, cl = rep("AML", length(G3_AML)))
G3_data <- rbind(G3_ALL, G3_AML)

G3.fit <- lda(cl ~ gex, data=G3_data); G3.fit
## Call:
## lda(cl ~ gex, data = G3_data)
## 
## Prior probabilities of groups:
##  ALL  AML 
## 0.75 0.25 
## 
## Group means:
##            gex
## ALL -0.1177763
## AML  1.3699436
## 
## Coefficients of linear discriminants:
##           LD1
## gex 0.3375547

LDA for p > 1 (s10)

Suppose a \(p\)-dimensional random variable \(X = (X_1, X_2, ..., X_p)\) has a multivariate Gaussian distribution. Then we write \(X \sim N(\mu, \Sigma)\). Here \(E(X) = \mu\) is the mean of \(X\) and \(Cov(X) = \Sigma\) is the \(p \times p\) covariance matrix of \(X\). Formally, the multivariate Gaussian density is defined as

\[f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\bigg(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\bigg)\] Here \((x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\) is called the (squared) Mahalanobis distance of \(x\) from \(\mu_k\). “Distance” in the sense that if the mean \(\mu\) is the zero vector, you can think of \(x^T\Sigma^{-1}x\) as a measure of how far \(x\) is from zero in “units of standard deviation” in that direction. The covariance matrix \(\Sigma\) cannot be an arbitrary matrix: it has to be symmetric and “positive definite”. The latter means that \(x^T\Sigma x > 0\) for any nonzero real vector \(x\). Suppose for the moment that \(\Sigma\) is a scalar rather than a matrix. If \(x^2 \Sigma > 0\) for all nonzero \(x\), then \(\Sigma\) itself would have to be positive, i.e. \(\Sigma > 0\). Positive definiteness is the generalization of this to matrices.
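
As a small sketch (with an arbitrary mean vector and covariance matrix), the quadratic form above is exactly what the base-R function mahalanobis() computes:

# squared Mahalanobis distance: quadratic form versus the built-in function
mu    <- c(0, 0)
Sigma <- matrix(c(2, 0.5, 0.5, 1), ncol = 2)   # symmetric, positive definite
x     <- c(1, 2)

d2_manual <- t(x - mu) %*% solve(Sigma) %*% (x - mu)   # (x - mu)^T Sigma^{-1} (x - mu)
d2_builtin <- mahalanobis(x, center = mu, cov = Sigma)
c(d2_manual, d2_builtin)                       # the two values agree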

You can visualize this distribution as shown below.

# visualize multivariate random variables
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
mu1 <- c(-0.49, 0.63, -0.1)
sigma1 <- matrix(c(14.07, 3.27, 9.8, 3.27, 1.31, 1.89, 9.81, 1.89, 8.67), 
                 byrow = TRUE, ncol=3); sigma1
##       [,1] [,2] [,3]
## [1,] 14.07 3.27 9.80
## [2,]  3.27 1.31 1.89
## [3,]  9.81 1.89 8.67
sigma2 <- matrix(c(14.07, 0, 0, 0, 1.31, 0, 0, 0, 8.67), 
                 byrow = TRUE, ncol=3); sigma2
##       [,1] [,2] [,3]
## [1,] 14.07 0.00 0.00
## [2,]  0.00 1.31 0.00
## [3,]  0.00 0.00 8.67
ALL1 <- data.frame(mvrnorm(n=10000, mu=mu1, Sigma = sigma1))
ALL2 <- data.frame(mvrnorm(n=10000, mu=mu1, Sigma = sigma2))

## ALL1
kd <- kde2d(ALL1$X1, ALL1$X2, n = 100)

# this line is not visible in pdf format because it uses html
p <- plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface();#p

## ALL2
kd <- kde2d(ALL2$X1, ALL2$X2, n = 100)
p <- plot_ly(x = kd$x, y = kd$y, z = kd$z) %>% add_surface;#p

In the case of \(p>1\) predictors, the LDA classifier assumes that the observations in the \(k_{th}\) class are drawn from a multivariate Gaussian distribution \(N(\mu_k, \Sigma)\). By Bayes’ rule, the probability of category \(k\), given the input \(x\), is \[P(Y=k|X=x) = \frac{f_k(x)\pi_k}{P(X=x)}\] Because the denominator does not depend on the category \(k\), we can absorb it into a constant, \[P(Y=k|X=x) = C \times f_k(x)\pi_k\] Now, expanding \(f_k(x)\), \[P(Y=k|X=x) = \frac{C\pi_k}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\bigg(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\bigg)\] Absorbing all terms that do not depend on \(k\) into a constant \(C'\), \[P(Y=k|X=x) = C'\pi_k\exp\bigg(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\bigg)\] and taking the logarithm of both sides, \[\log P(Y=k|X=x) = \log C' + \log \pi_k - \frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\] The constant terms are the same for every category \(k\), so we want to find the \(k\) that maximizes the remaining expression.

\[\begin{aligned} &\log C' + \log \pi_k - \frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) \\ &= \log C' + \log \pi_k - \frac{1}{2}[x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k]+x^T\Sigma^{-1}\mu_k \\ &= C'' + \log \pi_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+x^T\Sigma^{-1}\mu_k \end{aligned}\]

Then the Bayes classifier assigns an observation \(X=x\) to the class for which \[\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k\] is largest. The Bayes decision boundaries are given by \(\delta_k(x) = \delta_{\ell}(x)\), i.e. \[x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k +\log \pi_k = x^T\Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l+\log \pi_l\]
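
The discriminant above is straightforward to evaluate with matrix algebra; the sketch below (with made-up means, covariance matrix, and priors) assigns a new observation to whichever class gives the larger \(\delta_k(x)\):

# illustrative two-class LDA with p = 2 predictors (all values are assumed)
mu1   <- c(0, 0); mu2 <- c(2, 1)
Sigma <- matrix(c(1, 0.3, 0.3, 1), ncol = 2)
prior <- c(0.5, 0.5)
Sinv  <- solve(Sigma)

# delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 * mu_k^T Sigma^{-1} mu_k + log(pi_k)
delta <- function(x, mu, pi) as.numeric(t(x) %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(pi))

x0 <- c(1.2, 0.4)
d  <- c(delta(x0, mu1, prior[1]), delta(x0, mu2, prior[2]))
d; which.max(d)   # assign x0 to the class with the largest discriminant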

Example of LDA (S11, Default Dataset)

Perform LDA on the Default data set we discussed in the logistic regression part. Use all 10,000 samples for training. Is this model accurate? What is the problem with this model?

library(ISLR)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
data(Default)

lda.fit <- lda(default ~ student + balance, data=Default)
d <- table(predict(lda.fit)$`class`, Default$default); d
##      
##         No  Yes
##   No  9644  252
##   Yes   23   81
ACC <- sum(diag(d))/sum(d); ACC  # Accuracy = (TP+TN)/(TP+TN+FP+FN)
## [1] 0.9725
mean(predict(lda.fit)$`class` == Default$default)
## [1] 0.9725
error_rate <- 1-ACC; error_rate*100 # percent error_rate
## [1] 2.75
d <- as.data.table(d)
FPR <- d$N[2]/sum(d$N[1:2])
FNR <- d$N[3]/sum(d$N[3:4])
TPR <- 1-FNR # Sensitivity
TNR <- 1-FPR # Specificity

# trivial classifier
d <- table(rep("No", length(Default$default)), Default$default);d
##     
##        No  Yes
##   No 9667  333
err <- 1-sum(diag(d))/sum(d); err*100 # percent error_rate of trivial classifier
## [1] 3.33
# Changing threshold
d <- table(predict(lda.fit)$posterior[,2]>0.2, Default$default); d
##        
##           No  Yes
##   FALSE 9432  138
##   TRUE   235  195
1-sum(diag(d))/sum(d) 
## [1] 0.0373
# Error_rate vs threshold
error_rate <- c()
FNR <- c(); FPR <- c()
for (i in 1:51) {
  if(i == 1){
    threshold <- (i-1)*0.01
    d <- table(rep("No", length(Default$default)), Default$default)
    d <- as.data.table(d)
    error_rate[i]<- 1-d$N[2]/sum(d$N[1:2])
    FPR[i] <- 1-d$N[2]/sum(d$N[1:2])
    FNR[i] <- 0
  } else {
  threshold <- (i-1)*0.01
  d <- table(predict(lda.fit)$posterior[,2]>threshold, Default$default)
  error_rate[i] <- 1-sum(diag(d))/sum(d) 
  d <- as.data.table(d)
  FPR[i] <- d$N[2]/sum(d$N[1:2])
  FNR[i] <- d$N[3]/sum(d$N[3:4])
  }
}
# error_rate
plot(seq(0, 0.5,0.01), error_rate, type = 'l', ylim = c(0,0.75), lwd = 3,
     xlab = "Threshold", ylab= "Error Rate", cex = 1.2)
lines(seq(0,0.5,0.01), FNR, col = "blue", lwd = 3, lty = 2)
points(seq(0,0.5,0.01), FPR, col = "orange", pch = 16)
legend(0.35, 0.4, c("Error rate", "FPR", "FNR"), lty = c(1,2,2), 
        col = c("black", "orange","blue"), lwd = c(3, 3, 3))

# ROC
plot(c(FPR,0), c(1-FNR,0), type = 'l', ylim = c(0,1), main = "ROC Curve",
     xlab = "False positive rate", ylab= "True positive rate", cex = 1.2)
lines(x = c(-0.2,1.2), y = c(-0.2,1.2), lty = 2)

# using R package
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
input <- list(predictions = predict(lda.fit)$posterior[,2], labels = Default$default)
pred <- prediction(input$predictions, input$labels)
perf <- performance(pred,"tpr","fpr")
plot(perf, colorize = TRUE, main = "ROC Curve") # cutoff means the cutoff of the posterior probability
lines(x = c(-0.2,1.2), y = c(-0.2,1.2), lty = 2)

# cf) hold out validation
# training vs test (7:3)
train <- sample(1:10000, 7000)
testset <- Default[-train,]

# check the sampling results
dim(Default[train,]); table(Default[train,]$default)
## [1] 7000    4
## 
##   No  Yes 
## 6773  227
dim(testset); table(testset$default)
## [1] 3000    4
## 
##   No  Yes 
## 2894  106
# Fitting model 
lda.fit.test <- lda(default ~ student + balance, data=Default, subset = train)
plot(lda.fit.test)

lda.prob.test <- predict(lda.fit.test, testset, type = "response")

# Check test performance
input <- list(predictions = predict(lda.fit)$posterior[,2], labels = Default$default)
pred <- prediction(input$predictions, input$labels)
perf <- performance(pred,"tpr","fpr")
input <- list(predictions = lda.prob.test$posterior[,2], labels = testset$default)
pred2 <- prediction(input$predictions, input$labels)
perf2 <- performance(pred2,"tpr","fpr")
plot(perf, main = "ROC Curve") # training
plot(perf2, col = "red", add = TRUE) # testing
lines(x = c(-0.2,1.2), y = c(-0.2,1.2), lty = 2)

Answer

Since the model has only two predictors (\(p=2\)) and the number of training examples is \(n=10,000\), we do not expect overfitting to cause any problem. Instead, because only \(3.33\%\) of the individuals in the training sample defaulted, a simple but useless classifier that always predicts that an individual will not default, regardless of the predictors, already achieves a \(3.33\%\) error rate. This is the so-called “class imbalance problem”.

Why does LDA do such a poor job of classifying the customers who default? In other words, why does it have such a low sensitivity? As we have seen, LDA is trying to approximate the Bayes classifier, which has the lowest total error rate \((1-ACC)\) out of all classifiers (if the Gaussian model is correct). In other words, a classifier that minimizes the total error rate is not well suited to a class imbalance problem. One solution is to change the threshold on the classification (posterior) probability of LDA, but this will increase the FPR.

How can we determine suitable thresholds? One popular method is the receiver operating characteristic (ROC) curve. Under the ROC analysis framework, performance is often evaluated by the area under the (ROC) curve (AUC); a larger AUC indicates better performance. For a better understanding of these performance metrics, see James pp. 148-149.

Example of LDA (s16)

Now we have another example from our lecture slides: a classification problem with three gene expression levels. We are going to assume the gene expression levels follow a multivariate Gaussian distribution. Fit an LDA model and evaluate its performance.

# Simulate multivariate Gaussian with p=3, equal number of samples
mu1 <- c(-0.49, 0.63, -0.1)
mu2 <- c(1.62, 1.24, 1.42)
sigma <- matrix(c(14.07, 3.27, 9.8, 3.27, 1.31, 1.89, 9.81, 1.89, 8.67), 
                byrow = TRUE, ncol=3)
ALL <- mvrnorm(n=10000, mu=mu1, Sigma = sigma)
ALL <- data.frame(ALL, cl = rep("ALL", length(ALL[,1])))
AML <- mvrnorm(n=10000, mu=mu2, Sigma = sigma)
AML <- data.frame(AML, cl = rep("AML", length(AML[,1])))
data <- rbind(ALL, AML)

# training vs test #1 (7:3)
train1 <- sample(1:20000, 20000*0.7)
testset1 <- data[-train1,]

lda.fit1 <- lda(cl ~ X1+X2+X3, data=data, subset = train1); lda.fit1
## Call:
## lda(cl ~ X1 + X2 + X3, data = data, subset = train1)
## 
## Prior probabilities of groups:
##       ALL       AML 
## 0.4997143 0.5002857 
## 
## Group means:
##            X1        X2          X3
## ALL -0.403609 0.6499094 -0.04168185
## AML  1.695341 1.2448830  1.49134427
## 
## Coefficients of linear discriminants:
##           LD1
## X1 0.03604153
## X2 0.45397505
## X3 0.16183504
predict(lda.fit1, data.frame(X1 = 0.5, X2 = 0.5, X3 = 0.5))
## $class
## [1] ALL
## Levels: ALL AML
## 
## $posterior
##         ALL       AML
## 1 0.5359881 0.4640119
## 
## $x
##          LD1
## 1 -0.2449192
lda.fit1.test <- predict(lda.fit1, testset1, type = "response")

# Simulate multivariate Gaussian with p=3, different number of samples
ALL <- mvrnorm(n=15000, mu=mu1, Sigma = sigma)
ALL <- data.frame(ALL, cl = rep("ALL", length(ALL[,1])))
AML <- mvrnorm(n=5000, mu=mu2, Sigma = sigma)
AML <- data.frame(AML, cl = rep("AML", length(AML[,1])))
data <- rbind(ALL, AML)

# training vs test #2 (7:3)
train2 <- sample(1:20000, 20000*0.7)
testset2 <- data[-train2,]

# training vs test (7:3)
lda.fit2 <- lda(cl ~ X1+X2+X3, data=data, subset = train2); lda.fit2
## Call:
## lda(cl ~ X1 + X2 + X3, data = data, subset = train2)
## 
## Prior probabilities of groups:
##       ALL       AML 
## 0.7505714 0.2494286 
## 
## Group means:
##             X1        X2          X3
## ALL -0.4551191 0.6423014 -0.07042739
## AML  1.6759000 1.2504863  1.43901894
## 
## Coefficients of linear discriminants:
##           LD1
## X1 0.07074947
## X2 0.43056872
## X3 0.11820324
predict(lda.fit2, data.frame(X1 = 0.5, X2 = 0.5, X3 = 0.5))
## $class
## [1] ALL
## Levels: ALL AML
## 
## $posterior
##         ALL       AML
## 1 0.7743007 0.2256993
## 
## $x
##           LD1
## 1 -0.07369608
lda.fit2.test <- predict(lda.fit2, testset2, type = "response")

length(testset1$cl)
## [1] 6000
# ROC
input <- list(predictions1 = lda.fit1.test$posterior[,2], 
              predictions2 = lda.fit2.test$posterior[,2], 
              labels1 = testset1$cl, labels2 = testset2$cl)
pred1 <- prediction(input$predictions1, input$labels1)
pred2 <- prediction(input$predictions2, input$labels2)
perf1 <- performance(pred1,"tpr","fpr")
perf2 <- performance(pred2,"tpr","fpr")
plot(perf1, col = "blue", main = "ROC Curve") # equal sample
plot(perf2, add = TRUE, col = "red") # unequal sample
lines(x = c(-0.2,1.2), y = c(-0.2,1.2), lty = 2)

Quadratic Discriminant Analysis (QDA) (s12)

Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the \(k_{th}\) class is of the form \(X \sim N(\mu_k, \Sigma_k)\), where \(\Sigma_k\) is a covariance matrix for the \(k_{th}\) class. Under this assumption, the Bayes classifier assigns an observation \(X = x\) to the class for which \[\begin{aligned} \delta_k(x) &= -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}\log |\Sigma_k| + \log \pi_k \\ &= -\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\log |\Sigma_k| + \log \pi_k \\ \end{aligned}\] is largest. So the QDA classifier involves plugging estimates for \(\Sigma_k, \mu_k\), and \(\pi_k\).

Why does it matter whether or not we assume that the \(K\) classes share a common covariance matrix? In other words, why would one prefer LDA to QDA, or vice-versa? The answer lies in the bias-variance trade-off. When there are \(p\) predictors, then estimating a covariance matrix requires estimating \(\frac{p(p+1)}{2}\) parameters. QDA estimates a separate covariance matrix for each class, for a total of \(K\frac{p(p+1)}{2}\) parameters. With \(50\) predictors this is some multiple of \(1,275\), which is a lot of parameters.

By instead assuming that the \(K\) classes share a common covariance matrix, the LDA model becomes linear in \(x\), which means there are \(Kp\) linear coefficients to estimate. Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance. But there is a trade-off: if LDA’s assumption that the \(K\) classes share a common covariance matrix is badly off, then LDA can suffer from high bias.

Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the \(K\) classes is clearly untenable.
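
This trade-off can be illustrated with a quick simulation (a sketch with arbitrary parameters, not taken from the slides): when the two classes truly have different covariance matrices, QDA is expected to beat LDA on a large test set, at the cost of estimating more parameters.

# sketch: two Gaussian classes with *different* covariance matrices
library(MASS)
set.seed(1)
S1 <- matrix(c(1, 0.8, 0.8, 1), ncol = 2)      # class A covariance
S2 <- matrix(c(1, -0.8, -0.8, 1), ncol = 2)    # class B covariance
sim <- function(n) {
  X <- rbind(mvrnorm(n, mu = c(0, 0), Sigma = S1), mvrnorm(n, mu = c(1, 1), Sigma = S2))
  data.frame(X, cl = factor(rep(c("A", "B"), each = n)))
}
tr <- sim(500); te <- sim(5000)

lda.err <- mean(predict(lda(cl ~ X1 + X2, data = tr), te)$class != te$cl)
qda.err <- mean(predict(qda(cl ~ X1 + X2, data = tr), te)$class != te$cl)
c(LDA = lda.err, QDA = qda.err)                # QDA is expected to have the lower test error here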

Naive Bayes (s13)

Naive Bayes is especially appropriate when the dimension \(p\) of the feature space is high. The naive Bayes model assumes that, given a class \(k\), the features \(X_j\) are independent: \[f_k(X) = \prod_{j=1}^p f_{kj}(x_j)\] The discriminant function of naive Bayes is given as

\[\delta_k(x) \propto \log \Bigg[\pi_k \prod_{j=1}^p f_{kj}(x_j)\Bigg] = -\frac{1}{2}\sum_{j=1}^p \frac{(x_j-\mu_{kj})^2}{\sigma^2_{kj}}+\log\pi_k\]

While this assumption is generally not true, it does simplify the estimation dramatically.

  • The individual class-conditional marginal densities \(f_{kj}\) can each be estimated separately (low computational complexity).

  • If a component \(X_j\) of \(X\) is discrete, then an appropriate histogram estimate can be used. This provides a seamless way of mixing variable types in a feature vector.

Despite these rather optimistic assumptions, naive Bayes classifiers often outperform far more sophisticated alternatives. The reason is that although the individual class density estimates may be biased, this bias might not hurt the posterior probabilities as much, especially near the decision regions.

Below, we introduce a simple toy example of a naive Bayes spam classifier.

# generate toy example
train <- data.frame(class=c("spam","ham","ham","ham"), 
                    viagra=c("yes","no","no","yes"))
train
##   class viagra
## 1  spam    yes
## 2   ham     no
## 3   ham     no
## 4   ham    yes
# make model
library(e1071)
classifier <- naiveBayes(class ~ viagra,train)
classifier
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  ham spam 
## 0.75 0.25 
## 
## Conditional probabilities:
##       viagra
## Y             no       yes
##   ham  0.6666667 0.3333333
##   spam 0.0000000 1.0000000
# test a sample that contains the word "viagra"
test <- data.frame(viagra=c("yes"))
test$viagra <- factor(test$viagra, levels=c("no","yes"))
test
##   viagra
## 1    yes
# P(spam | viagra = yes) = 0.5
prediction <- predict(classifier, test ,type="raw")
prediction
##      ham spam
## [1,] 0.5  0.5
# Add the "meet" variable to the training data set 
train <- data.frame(type=c("spam","ham","ham","ham"), 
                    viagra=c("yes","no","no","yes"),
                    meet=c("yes","yes","yes", "no"))
train
##   type viagra meet
## 1 spam    yes  yes
## 2  ham     no  yes
## 3  ham     no  yes
## 4  ham    yes   no
# Two independent variables (assumed in this model)
classifier <- naiveBayes(type ~ viagra + meet,train)
classifier
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  ham spam 
## 0.75 0.25 
## 
## Conditional probabilities:
##       viagra
## Y             no       yes
##   ham  0.6666667 0.3333333
##   spam 0.0000000 1.0000000
## 
##       meet
## Y             no       yes
##   ham  0.3333333 0.6666667
##   spam 0.0000000 1.0000000
# test a sample that contains the words "viagra" & "meet"
test <- data.frame(viagra=c("yes"), meet=c("yes"))
test$viagra <- factor(test$viagra, levels=c("no","yes"))
test$meet <- factor(test$meet, levels=c("no","yes"))
test
##   viagra meet
## 1    yes  yes
# P(spam | viagra = yes, meet = yes) = 0.6
prediction <- predict(classifier, test ,type="raw")
prediction
##      ham spam
## [1,] 0.4  0.6
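
We can reproduce the last posterior by hand from the conditional probability tables printed above: under the independence assumption, \(P(spam \mid viagra = yes, meet = yes)\) is proportional to \(P(spam)P(viagra=yes \mid spam)P(meet=yes \mid spam)\), and similarly for ham, followed by normalization.

# manual check of the naive Bayes posterior, using the tables printed above
p_spam <- 0.25 * 1 * 1           # P(spam) * P(viagra=yes|spam) * P(meet=yes|spam)
p_ham  <- 0.75 * (1/3) * (2/3)   # P(ham)  * P(viagra=yes|ham)  * P(meet=yes|ham)
p_spam / (p_spam + p_ham)        # 0.6, matching the predict() output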

KNN classifier (s14)

We discussed this classifier in the last tutorial.

Comparison of classification methods (s15)

We have studied five classification methods: KNN, logistic regression, LDA, QDA, and naive Bayes. We now consider the types of scenarios in which one approach might dominate the others.

  1. Logistic regression versus LDA. When \(p=1\), the log odds of LDA is given as \[\begin{aligned}\log \bigg(\frac{p(X)}{1-p(X)}\bigg) &= \log \bigg(\frac{p_1(X)}{p_2(X)}\bigg) \\ &= \log \Bigg[\frac{\pi_1\exp[-(x-\mu_1)^2/2\sigma^2]}{\pi_2\exp[-(x-\mu_2)^2/2\sigma^2]}\Bigg] \\ &= \log \frac{\pi_1}{\pi_2} - \frac{(x-\mu_1)^2}{2\sigma^2} +\frac{(x-\mu_2)^2}{2\sigma^2} \\ &= \log \frac{\pi_1}{\pi_2} + \frac{\mu_2^2 - \mu_1^2}{2\sigma^2} +\frac{\mu_1 - \mu_2}{\sigma^2}x \\ &= c_0 + c_1x \end{aligned}\]

The logistic regression equation is given as \[\log \bigg(\frac{p_1}{1-p_1}\bigg) = \beta_0 + \beta_1x\] As you can see, both models are linear functions of \(x\); apart from their parameter estimation methods (maximum likelihood for logistic regression versus estimated Gaussian means and variances for LDA), they are quite similar. This holds when \(p>1\) as well. (A small simulation after the list below illustrates this similarity.)

Therefore,

  • Logistic regression is better when Gaussian assumptions are not met but LDA is better when they are met.

  • Since KNN is a non-parametric approach, it may outperform logistic regression and LDA when the true decision boundary is highly nonlinear.

  • QDA serves as a compromise between KNN and LDA/logistic regression.
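
Here is the small simulation mentioned above (a sketch; the parameter values are arbitrary): when the data really are Gaussian with a common variance and equal priors, the logistic regression coefficients are close to the LDA-implied values \(c_0\) and \(c_1\).

# sketch: logistic regression versus the LDA-implied log-odds line (arbitrary parameters)
set.seed(42)
n <- 5000; mu1 <- 2; mu2 <- 0; sigma <- 1.5    # equal priors, common standard deviation
x <- c(rnorm(n, mu1, sigma), rnorm(n, mu2, sigma))
y <- rep(c(1, 0), each = n)                    # y = 1 codes class 1 (mean mu1)

coef(glm(y ~ x, family = binomial))            # estimated beta_0, beta_1

# c_0 and c_1 from the derivation above, plugging in the true parameters (log(pi_1/pi_2) = 0)
c(c0 = (mu2^2 - mu1^2) / (2 * sigma^2), c1 = (mu1 - mu2) / sigma^2)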

The relationship between each classification method and the linearity of the decision boundary is summarized in the six scenarios below.

  1. Uncorrelated (covariance = 0) multivariate Gaussian with a different mean in each class. The decision boundary is linear.
    • Performance: \(LDA \approx Logistic \ regression > QDA > KNN\)
  2. Correlated (correlation = -0.5) multivariate Gaussian with a different mean in each class. The decision boundary is linear.
    • Performance: \(LDA \approx Logistic \ regression > QDA > KNN\)
  3. Correlated (correlation = -0.5) t-distribution with a different mean in each class. The decision boundary is linear.
    • Performance: \(Logistic \ regression > LDA > KNN > QDA\)
  4. Alternating correlation, multivariate Gaussian with a different mean in each class. The decision boundary is quadratic.
    • Performance: \(QDA > KNN > LDA \approx Logistic \ regression\)
  5. Uncorrelated (covariance = 0) multivariate Gaussian with a different mean in each class, but with responses sampled from a logistic function. The decision boundary is quadratic.
    • Performance: \(QDA > KNN > LDA \approx Logistic \ regression\)
  6. Uncorrelated (covariance = 0) multivariate Gaussian with a different mean in each class, but with responses sampled from a highly non-linear function. The decision boundary is highly non-linear.
    • Performance: \(KNN > QDA > LDA \approx Logistic \ regression\)

These six examples illustrate that no one method will dominate the others in every situation. When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well. When the boundaries are moderately non-linear, QDA may give better results. Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.

Performance comparison on The stock market Data

The stock market data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date).

  • Year: 2001 ~ 2005
  • LagX: X day before

Predict the Direction of the market based on Lag1 through Lag5 and Volume, using the five classification methods you learned from this tutorial. Use a hold-out validation strategy; in other words, use the year 2001 to 2004 data for training and the year 2005 data for testing.

# Inspect data
glimpse(Smarket)
## Observations: 1,250
## Variables: 9
## $ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001...
## $ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0...
## $ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1...
## $ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, ...
## $ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623,...
## $ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, ...
## $ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.44...
## $ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0...
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Do...
attach(Smarket)
# dividing training and test data set
train <- (Year < 2005)
test <- Smarket[!train,]
dim(test)
## [1] 252   9
direction_test <- Direction[!train]

# Logistic regression
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket,
               family = binomial, subset = train)
glm.prob <- predict(glm.fit, test, type = "response")
glm.pred <- rep("Down", length(glm.prob))
glm.pred[glm.prob > 0.5] <- "Up"
table(glm.pred, direction_test)
##         direction_test
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
mean(glm.pred == direction_test) # ACC
## [1] 0.4801587
mean(glm.pred != direction_test) # Error
## [1] 0.5198413
contrasts(Direction) # check dummy variable
##      Up
## Down  0
## Up    1
# Logistic regression with only Lag1 and Lag2
glm.fit2 <- glm(Direction ~ Lag1 + Lag2, data = Smarket, subset = train,
               family = binomial)
summary(glm.fit2)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Smarket, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.345  -1.188   1.074   1.164   1.326  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.03222    0.06338   0.508    0.611
## Lag1        -0.05562    0.05171  -1.076    0.282
## Lag2        -0.04449    0.05166  -0.861    0.389
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1381.4  on 995  degrees of freedom
## AIC: 1387.4
## 
## Number of Fisher Scoring iterations: 3
glm.prob2 <- predict(glm.fit2, test, type = "response")
glm.pred2 <- rep("Down", length(glm.prob2))
glm.pred2[glm.prob2 > 0.5] <- "Up"
table(glm.pred2, direction_test)
##          direction_test
## glm.pred2 Down  Up
##      Down   35  35
##      Up     76 106
mean(glm.pred2 == direction_test) # ACC
## [1] 0.5595238
mean(glm.pred2 != direction_test) # Error
## [1] 0.4404762
# accuracy among the days for which logistic regression predicts "Up" (compare to the overall ACC above)
106/(106+76)
## [1] 0.5824176
# QDA & LDA
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train); lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train); qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
lda.pred <- predict(lda.fit, test, type = "response")
qda.pred <- predict(qda.fit, test, type = "response")

# Naive bayes
bayes.fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
bayes.pred <- predict(bayes.fit, test, type = "raw")

# KNN with k=1
library(class)
train.knn <- cbind(Lag1, Lag2)[train,]
test.knn <- cbind(Lag1, Lag2)[!train,]
train.direction <- Direction[train]
set.seed(123)
knn.pred <- knn(train.knn, test.knn, train.direction, k=1, prob = TRUE)
table(knn.pred, direction_test)
##         direction_test
## knn.pred Down Up
##     Down   43 58
##     Up     68 83
# k=3
knn.pred2 <- knn(train.knn, test.knn, train.direction, k=3, prob = TRUE)
table(knn.pred2, direction_test)
##          direction_test
## knn.pred2 Down Up
##      Down   48 55
##      Up     63 86
# ROC - making input list
input <- list(predictions1 = glm.prob, 
              predictions2 = lda.pred$posterior[,2],
              predictions3 = qda.pred$posterior[,2],
              predictions4 = bayes.pred[,2],
              labels = direction_test)
# create prediction object to create performance object
pred1 <- prediction(input$predictions1, input$labels)
pred2 <- prediction(input$predictions2, input$labels)
pred3 <- prediction(input$predictions3, input$labels)
pred4 <- prediction(input$predictions4, input$labels)
# create Performance object
perf1 <- performance(pred1,"tpr","fpr")
perf2 <- performance(pred2,"tpr","fpr")
perf3 <- performance(pred3,"tpr","fpr")
perf4 <- performance(pred4,"tpr","fpr")
# Plot ROC
plot(perf1, main = "ROC Curve")
plot(perf2, add = TRUE, col = "red")
plot(perf3, add = TRUE, col = "blue")
plot(perf4, add = TRUE, col = "cyan")
lines(x = c(-0.2,1.2), y = c(-0.2,1.2), lty = 2)
legend("bottomright", c("Logit", "LDA", "QDA", "NB"), 
       col = c("black", "red", "blue", "cyan"), lty= rep(1,4))

detach(Smarket)

Extra example - Issue of scale, Caravan insurance data

The Caravan data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only \(6 \%\) of people purchased caravan insurance. When you classify this data set with KNN, because KNN relies on distances between observations, the scale of the variables makes a significant difference in classification performance. A good way to handle this problem is to standardize the data so that all variables have a mean of zero and a standard deviation of one.

dim(Caravan)
## [1] 5822   86
summary(Caravan$Purchase)
##   No  Yes 
## 5474  348
glimpse(Caravan[1:7, 1:7])
## Observations: 7
## Variables: 7
## $ MOSTYPE  <dbl> 33, 37, 37, 9, 40, 23, 39
## $ MAANTHUI <dbl> 1, 1, 1, 1, 1, 1, 2
## $ MGEMOMV  <dbl> 3, 2, 2, 3, 4, 2, 3
## $ MGEMLEEF <dbl> 2, 2, 2, 3, 2, 1, 2
## $ MOSHOOFD <dbl> 8, 8, 8, 3, 10, 5, 9
## $ MGODRK   <dbl> 0, 1, 0, 2, 1, 0, 2
## $ MGODPR   <dbl> 5, 4, 4, 3, 4, 5, 2
# Standardize data set
std_caravan <- scale(Caravan[,-dim(Caravan)[2]])

# Check results
var(Caravan[,1])
## [1] 165.0378
var(Caravan[,2])
## [1] 0.1647078
var(std_caravan[,1])
## [1] 1
var(std_caravan[,2])
## [1] 1
# split data set
test <- 1:1000
trainset <- std_caravan[-test,]
testset <- std_caravan[test,]
train.purchase <- Caravan$Purchase[-test]
test.purchase <- Caravan$Purchase[test]

# initialize & perform KNN
set.seed(1)
knn.pred <- knn(trainset, testset, train.purchase, k=1)
knn.pred3 <- knn(trainset, testset, train.purchase, k=3)
knn.pred5 <- knn(trainset, testset, train.purchase, k=5)
mean(test.purchase != knn.pred) # error rate
## [1] 0.118
mean(test.purchase != knn.pred3) # error rate
## [1] 0.075
mean(test.purchase != knn.pred5) # error rate
## [1] 0.066
mean(test.purchase != "No") # trivial classifier that always predicts "No"
## [1] 0.059
# PPV (precision) for each value of K
d1 <- table(knn.pred, test.purchase)
d3 <- table(knn.pred3, test.purchase)
d5 <- table(knn.pred5, test.purchase)
d1 <- as.data.table(d1)
d3 <- as.data.table(d3)
d5 <- as.data.table(d5)

# k=1
PPV1 <- d1$N[4]/sum(d1$N[c(2,4)]); PPV1
## [1] 0.1168831
# k=3
PPV3 <- d3$N[4]/sum(d3$N[c(2,4)]); PPV3
## [1] 0.1923077
# k=5
PPV5 <- d5$N[4]/sum(d5$N[c(2,4)]); PPV5
## [1] 0.2666667
# Logistic regression
glm.fit <- glm(Purchase ~., data = Caravan, family = binomial, subset = -test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.prob <- predict(glm.fit, Caravan[test,], type="response")
glm.pred <- rep("No", length(glm.prob))
glm.pred[glm.prob > 0.5] <- "Yes"
d <- table(glm.pred, test.purchase); d
##         test.purchase
## glm.pred  No Yes
##      No  934  59
##      Yes   7   0
glm.pred <- rep("No", length(glm.prob))
glm.pred[glm.prob > 0.25] <- "Yes"
d <- table(glm.pred, test.purchase); d
##         test.purchase
## glm.pred  No Yes
##      No  919  48
##      Yes  22  11
d <- as.data.table(d)
PPV <- d$N[4]/sum(d$N[c(2,4)]); PPV
## [1] 0.3333333
# ROC & AUC
input <- list(predictions = glm.prob, labels = test.purchase)
pred <- prediction(input$predictions, input$labels)
perf <- performance(pred,"tpr","fpr")
perf2 <- performance(pred,"auc")
plot(perf, colorize = TRUE, main = "ROC Curve")
lines(x = c(-0.2,1.2), y = c(-0.2,1.2), lty = 2)

print(paste0("AUC: ", round(perf2@y.values[[1]],3)))
## [1] "AUC: 0.741"
# perform LDA, QDA, NB classification by yourself

As you can see in this example, if the company tries to sell insurance to a random selection of customers, the success rate will be only \(6 \%\), which may be far too low given the costs involved. Instead, the company would like to sell insurance only to customers who are likely to buy it, so the overall error rate is not of interest. Rather, the fraction of individuals that are correctly predicted to buy insurance is of interest (PPV, positive predictive value, or precision). It turns out that KNN with \(K = 1\) does far better than random guessing among the customers that are predicted to buy insurance: among 77 such customers, 9, or about \(11.7 \%\), actually do purchase insurance. This is double the rate one would obtain from random guessing. When you perform the above classification with logistic regression, you can get better results by varying the threshold of your classifier; with a threshold of 0.25, the PPV rises to \(33\%\). Try the other classifiers by yourself.

Validation and Resampling (s17)

Introduction (s18)

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. Resampling approaches can be computationally expensive but, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. We are going to discuss two of the most commonly used resampling methods, cross-validation and the bootstrap.

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection. The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method. In this tutorial we first review some hold-out validation methods; later we will discuss mathematical adjustments to the training error rate that can be used to estimate the test error rate.

The Validation set approach (s19)

The validation set approach involves randomly dividing the available set of observations into two parts, a training set and a validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate - typically assessed using the MSE in the case of a quantitative response - provides an estimate of the test error rate. Let’s see a simple example with the “Auto” data set. In this data set, there is a non-linear relationship between mpg and horsepower, so including the square of horsepower in the model gives better results than a model that uses only a linear term. It is natural to wonder whether a cubic or higher-order fit might provide even better results.

lm.fit <- lm(mpg ~ poly(horsepower,1), data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 1), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           23.4459     0.2478   94.62   <2e-16 ***
## poly(horsepower, 1) -120.1377     4.9058  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
lm.fit <- lm(mpg ~ poly(horsepower,2), data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
lm.fit <- lm(mpg ~ poly(horsepower,3), data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 3), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7039  -2.4491  -0.1519   2.2035  15.8159 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.446      0.221 106.105   <2e-16 ***
## poly(horsepower, 3)1 -120.138      4.375 -27.460   <2e-16 ***
## poly(horsepower, 3)2   44.090      4.375  10.078   <2e-16 ***
## poly(horsepower, 3)3   -3.949      4.375  -0.903    0.367    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.375 on 388 degrees of freedom
## Multiple R-squared:  0.6882, Adjusted R-squared:  0.6858 
## F-statistic: 285.5 on 3 and 388 DF,  p-value: < 2.2e-16

As we discussed in the previous lecture, because the addition of the cubic term does not improve the \(R^2\) statistic much, higher-order terms do not help us obtain a better result.

We can also answer this question using the validation set approach.

set.seed(123)
train <- sample(length(Auto$horsepower), length(Auto$horsepower) * 0.5)
test <- Auto[-train, ]
train.mse <- test.mse <- c()

# Using a single random split
for (i in 1:10){
  lm.fit <- lm(mpg ~ poly(horsepower,i), data =Auto, subset = train)
  res <- summary(lm.fit)
  pred.lm <- predict(lm.fit, test, type = "response")
  train.mse[i] <- mean(res$residuals^2)
  test.mse[i] <- mean((test$mpg - pred.lm)^2) 
}
plot(1:10, train.mse, col = "red", cex = 1.2, ylim = c(16, 28),
     ylab = "Mean Squared Error", xlab = "Degree of Polynomial", type = 'l')
points(1:10, train.mse, col = "red", pch = 16)
points(1:10, test.mse, col = "blue", pch = 16)
lines(1:10, test.mse, col = "blue", lty=2)
legend("topright", c("MSE_train", "MSE_test"), col = c("red", "blue"), lty = c(1,2),
       pch = c(16,16))

# iterative sampling
set.seed(123)
temp_train.mse <- temp_test.mse <- c()
train.mse <- test.mse <- data.frame()

for (j in 1:10){
  train <- sample(length(Auto$horsepower), length(Auto$horsepower) * 0.5)
  test <- Auto[-train,]
  for (i in 1:10){
  lm.fit <- lm(mpg ~ poly(horsepower,i), data =Auto, subset = train)
  res <- summary(lm.fit)
  pred.lm <- predict(lm.fit, test, type = "response")
  temp_train.mse[i] <- mean(res$residuals^2)
  temp_test.mse[i] <- mean((test$mpg - pred.lm)^2) 
  }
  train.mse <- rbind(train.mse, temp_train.mse)
  test.mse <- rbind(test.mse, temp_test.mse)
}
colnames(train.mse) <- colnames(test.mse) <- paste0("power_",1:10)
rownames(train.mse) <- rownames(test.mse) <- paste0("sample",1:10)

library(RColorBrewer)
# display.brewer.pal(n = 10, name = 'Spectral') # display palette
cols <- brewer.pal(n = 10, name = 'Spectral')

par(mfrow = c(1,2))
matplot(t(train.mse), type = "l", ylim = c(16, 28), col = cols, lty = 1,
        ylab = "Mean Squared Error (train)", xlab = "Degree of Polynomial")
matplot(t(test.mse), type = "l", ylim = c(16, 28), col = cols, lty = 1,
        ylab = "Mean Squared Error (test)", xlab = "Degree of Polynomial")

par(mfrow = c(1,1))

As shown in the graphs above, the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set. In addition, because the model is trained on fewer observations than the entire data set, it tends to perform worse. Therefore the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

Leave-One-Out Cross Validation (LOOCV) (s20)

 

Leave-one-out cross-validation (LOOCV) is closely related to the validation set approach, but it attempts to address that method’s drawbacks. Like the validation set approach, LOOCV involves splitting the set of observations into two parts. However, instead of creating two subsets of comparable size, a single observation \((x_1, y_1)\) is used for the validation set, and the remaining observations \(\{(x_2, y_2),...,(x_n, y_n)\}\) make up the training set. The statistical learning method is fit on the \(n-1\) training observations, and a prediction \(\hat y_1\) is made for the excluded observation, using its value \(x_1\). Since \((x_1, y_1)\) was not used in the fitting process, \(MSE_1 = (y_1-\hat y_1)^2\) provides an approximately unbiased estimate for the test error. But even though \(MSE_1\) is (approximately) unbiased for the test error, it is a poor estimate because it is highly variable, being based on a single observation \((x_1, y_1)\). By repeating this procedure for every observation, we obtain the LOOCV estimate for the test MSE, \[CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_i\] LOOCV has a couple of major advantages over the validation set approach: (1) it has far less bias; (2) performing LOOCV multiple times will always yield the same results, since there is no randomness in the splits.
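This procedure can be written as an explicit loop. A minimal sketch for a simple linear fit on the Auto data (the code further below uses cv.glm(), which automates the same computation):

library(ISLR)                                            # provides the Auto data
n <- nrow(Auto)
mse_i <- numeric(n)
for (i in 1:n){
  fit_i  <- lm(mpg ~ horsepower, data = Auto[-i, ])      # fit on the n-1 remaining observations
  yhat_i <- predict(fit_i, Auto[i, , drop = FALSE])      # predict the held-out observation
  mse_i[i] <- (Auto$mpg[i] - yhat_i)^2                   # MSE_i
}
mean(mse_i)   # CV_(n); should agree with the cv.glm() result for the degree-1 fit (about 24.23)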

LOOCV has the potential to be expensive to implement, since the model has to be fit \(n\) times. This can be very time consuming if \(n\) is large and each individual model is slow to fit. With least squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit. The following formula holds: \[CV_{(n)} = \frac{1}{n}\sum_{i=1}^n\Bigg(\frac{y_i - \hat y_i}{1-h_i}\Bigg)^2\] where \(\hat y_i\) is the \(i_{th}\) fitted value from the original least squares fit, and \(h_i\) is the leverage. This is like the ordinary MSE, except that the \(i_{th}\) residual is divided by \(1-h_i\). The leverage lies between \(\frac{1}{n}\) and 1, and reflects the amount that an observation influences its own fit. Hence the residuals for high-leverage points are inflated in this formula by exactly the right amount for the equality to hold. The formula above, however, does not hold in general, in which case the model has to be refit \(n\) times.
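A minimal sketch of this shortcut, using the leverages returned by hatvalues() for the linear fit on Auto; it should reproduce the LOOCV estimate without refitting the model \(n\) times:

fit <- lm(mpg ~ horsepower, data = Auto)
h   <- hatvalues(fit)                                    # leverages h_i
mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)             # CV_(n) via the shortcut formula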

# LOOCV
library(boot)

# Allocating result vector
cv.error <- c()

# calculate LOOCV error
for (i in 1:10){
 glm.fit <- glm(mpg~poly(horsepower,i), data = Auto)
 cv.error[i] <- cv.glm(Auto,glm.fit)$delta[1]
 }

# Plot results
plot(1:10, cv.error, col = "blue", cex = 1.2, ylim = c(16, 28), main = "LOOCV",
     ylab = "Mean Squared Error", xlab = "Degree of Polynomial", type = 'l')
points(1:10, cv.error, col = "blue", pch = 16)

K-fold Cross Validation (s21)

 

An alternative to LOOCV is \(k\)-fold CV. This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining \(k - 1\) folds. The mean squared error, \(MSE_1\), is then computed on the observations in the held-out fold. This procedure is repeated \(k\) times; each time, a different group of observations is treated as the validation set. The \(k\)-fold CV estimate is computed by \[CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\] LOOCV is a special case of \(k\)-fold CV in which \(k\) is set equal to \(n\). In practice, one typically performs \(k\)-fold CV using \(k=5\) or \(k=10\), which is computationally inexpensive compared to LOOCV. In addition to the computational advantage, \(k\)-fold CV often gives more accurate estimates of the test error rate than does LOOCV. Since each training set in \(k\)-fold CV contains about \(\frac{n(k-1)}{k}\) observations, the estimate is more biased than the LOOCV estimate. However, LOOCV has higher variance: the \(n\) fitted models are trained on almost identical sets of observations, so their outputs are highly (positively) correlated with each other. When we perform \(k\)-fold CV with \(k<n\), this effect is reduced, so the variance of the test error estimate tends to be lower than in the LOOCV case. Empirically, \(k=5\) or \(k=10\) yields estimates with acceptable bias and variance, so we recommend using \(k=5\) or \(k=10\).
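The fold assignment is also easy to write by hand. A minimal sketch of 10-fold CV for the quadratic fit (the code below uses cv.glm() with the K argument instead):

set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Auto)))       # random fold labels 1..k
mse_k <- numeric(k)
for (j in 1:k){
  fit_j  <- lm(mpg ~ poly(horsepower, 2), data = Auto[folds != j, ])
  pred_j <- predict(fit_j, Auto[folds == j, ])
  mse_k[j] <- mean((Auto$mpg[folds == j] - pred_j)^2)    # MSE_j on the held-out fold
}
mean(mse_k)                                              # CV_(k)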

# 5-fold and 10-fold CV
library(boot)

# Allocating result vector
cv.error10 <- cv.error5 <- c()
err10 <- err5 <- data.frame()

# calculate 5-fold and 10-fold CV errors (repeated 10 times)
for (j in 1:10){
for (i in 1:10){
 glm.fit <- glm(mpg~poly(horsepower,i), data = Auto)
 cv.error5[i] <- cv.glm(Auto,glm.fit, K = 5)$delta[1]
 cv.error10[i] <- cv.glm(Auto,glm.fit, K = 10)$delta[1]
}
  err5 <- rbind(err5, cv.error5)
  err10 <- rbind(err10, cv.error10)
}
colnames(err5) <- colnames(err10) <- paste0("power_",1:10)
rownames(err5) <- rownames(err10) <- paste0("iteration",1:10)

# Plot results (single line)
plot(1:10, cv.error10, col = "blue", cex = 1.2, ylim = c(16, 28), main = "5, 10-fold CV",
     ylab = "Mean Squared Error", xlab = "Degree of Polynomial", type = 'l', lty=2)
points(1:10, cv.error10, col = "blue", pch = 16)
points(1:10, cv.error5, col = "red", pch = 16)
lines(1:10, cv.error5, col = "red")
legend("topright", c("5-fold", "10-fold"), col = c("red", "blue"), lty = c(1,2),
       pch = c(16,16))

# plot iterations
par(mfrow = c(1,2))
matplot(t(err5), type = "l", ylim = c(16, 28), col = cols, lty = 1, main = "5-fold CV",
        ylab = "Mean Squared Error", xlab = "Degree of Polynomial")
matplot(t(err10), type = "l", ylim = c(16, 28), col = cols, lty = 1, main = "10-fold CV",
        ylab = "Mean Squared Error", xlab = "Degree of Polynomial")

par(mfrow = c(1,1))

 

Cross-validation on classification problems

So far, we have illustrated the use of cross-validation in the regression setting where the outcome \(Y\) is quantitative, and so have used MSE to quantify test error. But cross-validation can also be a very useful approach in the classification setting when \(Y\) is qualitative. In this setting, cross-validation works just as described earlier, except that rather than using MSE to quantify test error, we instead use the number of misclassified observations. In the LOOCV case, the error rate is given as \[CV_{(n)} = \frac{1}{n} \sum_{i=1}^n Err_i\] where \(Err_i = I(y_i \neq \hat y_i)\).
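For example, a misclassification cost can be passed to cv.glm() through its cost argument. The following is a minimal, self-contained sketch with a hypothetical binary response (high- vs. low-mileage cars in the Auto data); auto_cl, high, and cost_fn are names introduced only for this illustration:

library(ISLR); library(boot)
auto_cl <- data.frame(high = as.numeric(Auto$mpg > median(Auto$mpg)),  # hypothetical binary label
                      horsepower = Auto$horsepower)
glm.cl  <- glm(high ~ horsepower, data = auto_cl, family = binomial)
cost_fn <- function(y, p_hat) mean((p_hat > 0.5) != y)                 # fraction misclassified
cv.glm(auto_cl, glm.cl, cost = cost_fn, K = 10)$delta[1]               # 10-fold CV error rate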

As an example, think about the model we studied in the statistical learning lecture. By extending the model with quadratic terms, we can improve the accuracy of our model. That is, \[\log \bigg(\frac{p}{1-p}\bigg) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_2^2\] As we increase the degree of the polynomial terms (linear, quadratic, cubic, quartic, …), the test error rate of the model may decrease, but in practice the true test error rate cannot be computed directly. By inspecting the 10-fold cross-validation error rate, we can estimate this trend, as shown below.

library(ElemStatLearn)
glimpse(mixture.example)
## List of 8
##  $ x       : num [1:200, 1:2] 2.5261 0.367 0.7682 0.6934 -0.0198 ...
##  $ y       : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
##  $ xnew    : 'matrix' num [1:6831, 1:2] -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6831] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:2] "x1" "x2"
##  $ prob    : num [1:6831] 3.55e-05 3.05e-05 2.63e-05 2.27e-05 1.96e-05 ...
##   ..- attr(*, ".Names")= chr [1:6831] "1" "2" "3" "4" ...
##  $ marginal: num [1:6831] 6.65e-15 2.31e-14 7.62e-14 2.39e-13 7.15e-13 ...
##   ..- attr(*, ".Names")= chr [1:6831] "1" "2" "3" "4" ...
##  $ px1     : num [1:69] -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 ...
##  $ px2     : num [1:99] -2 -1.95 -1.9 -1.85 -1.8 -1.75 -1.7 -1.65 -1.6 -1.55 ...
##  $ means   : num [1:20, 1:2] -0.2534 0.2667 2.0965 -0.0613 2.7035 ...
x <- mixture.example$x # X1 and X2
g <- mixture.example$y # class label
xnew <- mixture.example$xnew # grid points
cl <- as.numeric(mixture.example$prob>0.5)
new_data <- data.frame(xnew,cl)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob <- mixture.example$prob
prob.bayes <- matrix(prob, length(px1), length(px2)) # bayes decision boundary
  
# logistic regression
par(mfrow = c(2,2))
glm.grid <- glm(cl~x1+x2, data=new_data, family = binomial)# fitting model
prob.glm <- predict(glm.grid, type = "response")
prob1 <- matrix(prob.glm, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob1, levels=0.5, labels="", xlab="", ylab="", 
        main= "logistic regression (Degree = 1)", axes=FALSE, col = "red")
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob1>0.5, "coral", "cornflowerblue"))
box()

# 2nd order
glm.grid2 <- glm(cl~poly(x1,2, raw = TRUE) + poly(x2,2, raw = TRUE), 
                 data=new_data, family = binomial)# fitting model 
prob.glm2 <- predict(glm.grid2, type = "response")
prob2 <- matrix(prob.glm2, length(px1), length(px2))

par(mar=rep(2,4))
contour(px1, px2, prob2, levels=0.5, labels="", xlab="", ylab="", 
        main= "logistic regression (Degree = 2)", axes=FALSE, col = "red")
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob2>0.5, "coral", "cornflowerblue"))
box()

# 3rd order
glm.grid3 <- glm(cl~poly(x1,3, raw = TRUE) + poly(x2,3, raw = TRUE), 
                 data=new_data, family = binomial)# fitting model 
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
prob.glm3 <- predict(glm.grid3, type = "response")
prob3 <- matrix(prob.glm3, length(px1), length(px2))

par(mar=rep(2,4))
contour(px1, px2, prob3, levels=0.5, labels="", xlab="", ylab="", 
        main= "logistic regression (Degree = 3)", axes=FALSE, col = "red")
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob3>0.5, "coral", "cornflowerblue"))
box()

# 4th order
glm.grid4 <- glm(cl~poly(x1,4, raw = TRUE) + poly(x2,4, raw = TRUE), 
                 data=new_data, family = binomial)# fitting model 
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
prob.glm4 <- predict(glm.grid4, type = "response")
prob4 <- matrix(prob.glm4, length(px1), length(px2))

par(mar=rep(2,4))
contour(px1, px2, prob4, levels=0.5, labels="", xlab="", ylab="", 
        main= "logistic regression (Degree = 4)", axes=FALSE, col = "red")
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob4>0.5, "coral", "cornflowerblue"))
box()

par(mfrow = c(1,1))

# Performance of each model (fig5.8)
# Allocating result vector
cv_error.train <- cv_error.test <- train_err <- test_err <- c()

# calculate 10 fold CV error
for (j in 1:10){
for (i in 1:10){
 indices <- (((i-1)*round((1/10)*nrow(new_data))) + 1):(i*round((1/10)*nrow(new_data)))
 train <- new_data[-indices,]
 test <- new_data[indices,]
 glm.fit <- suppressWarnings(glm(cl ~ poly(x1, j, raw = TRUE) + 
                                   poly(x2, j, raw = TRUE), data = new_data, 
                subset = -indices, family = binomial)) # fit degree-j model, leaving out fold i
 pred.train <- predict(glm.fit, type = "response")
 pred.test <- predict(glm.fit, test, type = "response")
 conf.train <- table(train$cl, pred.train>0.5) 
 conf.test <- table(test$cl, pred.test>0.5)
 acc.train <- sum(diag(conf.train))/sum(conf.train)
 acc.test <- sum(diag(conf.test))/sum(conf.test)
 train_err[i] <- 1-acc.train
 test_err[i] <- 1-acc.test
}
 cv_error.train[j] <- mean(train_err)
 cv_error.test[j] <- mean(test_err)
}

# Plot results (single line)
plot(1:10, cv_error.test, col = "blue", cex = 1.2, main = "10-fold CV", ylim=c(0.05,0.43),
     ylab = "Error Rate", xlab = "Degree of Polynomial", type = 'l', lty=2)
points(1:10, cv_error.test, col = "blue", pch = 16)
lines(1:10, cv_error.train, col = "red")
points(1:10, cv_error.train, col = "red", pch = 16)

# Try the KNN case by yourself

See MASS4 (Modern Applied Statistics with S, 4th Ed.), pp. 197-198, if you want to know more about the warnings in this fit.

Bootstrap (s24)

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. The power of the bootstrap lies in the fact that it can be easily applied to a wide range of statistical learning methods, including some for which a measure of variability is otherwise difficult to obtain and is not automatically output by statistical software.

Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of \(X\) and \(Y\), respectively, where \(X\) and \(Y\) are random quantities. We will invest a fraction \(\alpha\) of our money in \(X\), and will invest the remaining \(1-\alpha\) in \(Y\). Since there is variability associated with the returns on these two assets, we wish to choose \(\alpha\) to minimize the total risk, or variance, of our investment. In other words, we want to minimize \[Var(\alpha X + (1-\alpha) Y)\] One can show that the value that minimizes the risk is given by \[\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2-2\sigma_{XY}}\] where \(\sigma_X^2 = Var(X), \sigma_Y^2 = Var(Y)\), and \(\sigma_{XY} = Cov(X,Y)\).

We can use the properties of the variance to write the expression we seek to minimize as \[Var(\alpha X + (1-\alpha) Y) = \alpha^2 Var(X) + (1-\alpha)^2 Var(Y) + 2\alpha(1-\alpha) Cov(X,Y)\] To minimize this with respect to \(\alpha\), we take the derivative with respect to \(\alpha\), set the result equal to zero, and solve for \(\alpha\). Then we get \[ 2 \alpha Var(X) - 2(1-\alpha)Var(Y) + 2(1-2\alpha) Cov(X,Y) = 0\\ \alpha = \frac{Var(Y) - Cov(X, Y)}{Var(X) + Var(Y) - 2Cov(X, Y)}\]
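As a quick numerical check with illustrative values (not taken from any particular data set): if \(Var(X) = 1\), \(Var(Y) = 1.25\), and \(Cov(X,Y) = 0.5\), the formula gives \(\alpha = \frac{1.25 - 0.5}{1 + 1.25 - 2(0.5)} = \frac{0.75}{1.25} = 0.6\), so 60% of the money would be invested in \(X\) and 40% in \(Y\).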

Since the true population parameters are unknown, we estimate the value of \(\alpha\) that minimizes the variance of our investment using \[\hat\alpha = \frac{\hat\sigma_Y^2 - \hat\sigma_{XY}}{\hat\sigma_X^2 + \hat\sigma_Y^2-2\hat\sigma_{XY}}\] We can quantify the accuracy of this estimate by computing the standard error of \(\hat\alpha\). In practice, we cannot generate new samples from the original population. However, the bootstrap approach allows us to use a computer to emulate the process of obtaining new sample sets, so that we can estimate the variability of \(\hat\alpha\) without generating additional samples. Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set.

The bootstrap procedure can be illustrated by the figure below.

[Figure: schematic of the bootstrap procedure on a small data set \(Z\) containing \(n = 3\) observations]

Suppose we have a simple data set, which we call \(Z\), that contains only \(n=3\) observations. We randomly select \(n\) observations from the data set in order to produce a bootstrap data set, \(Z^{*1}\). The sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set. We can use \(Z^{*1}\) to produce a new bootstrap estimate for \(\alpha\), which we call \(\hat\alpha^{*1}\). This procedure is repeated \(B\) times for some large value of \(B\), in order to produce \(B\) different bootstrap data sets, \(Z^{*1}, Z^{*2}, ..., Z^{*B}\), and \(B\) corresponding \(\alpha\) estimates, \(\hat\alpha^{*1}, \hat\alpha^{*2}, ..., \hat\alpha^{*B}\). We can compute the standard error of these bootstrap estimates using the formula \[SE_B(\hat\alpha) = \sqrt{\frac{1}{B-1} \sum_{r=1}^B \Bigg(\hat\alpha^{*r} - \frac{1}{B}\sum_{r'=1}^B \hat\alpha^{*r'}\Bigg)^2}\] This serves as an estimate of the standard error of \(\hat\alpha\) estimated from the original data set. As you can see in the figure below, the bootstrap approach shows quite good agreement with the approach of generating new samples directly from the population.

[Figure: distribution of the bootstrap estimates of \(\alpha\) compared with the estimates obtained from new samples from the population]

The probability that the \(j_{th}\) observation is not selected in a single bootstrap draw is \(1-\frac{1}{n}\), so the probability that it does not appear in any of the \(n\) draws of a bootstrap sample is \(\Big(1-\frac{1}{n}\Big)^n\), and hence the probability that the \(j_{th}\) observation is in a given bootstrap sample is \(1-\Big(1-\frac{1}{n}\Big)^n\). Therefore, as \(n \rightarrow \infty\), the probability that the \(j_{th}\) observation is in a given bootstrap sample becomes \[1-\lim_{n\rightarrow \infty} \Big(1-\frac{1}{n}\Big)^n = 1-e^{-1} = 0.6321206\] Recall that by applying L’Hopital’s rule to the expression below, we get \[\begin{aligned}&\ln\Bigg[\lim_{n\rightarrow \infty} \Big(1+\frac{x}{n}\Big)^n\Bigg] \\ &= \lim_{n\rightarrow \infty} \frac{\ln\Big(1+\frac{x}{n}\Big)}{\frac{1}{n}} \\ &= \lim_{n\rightarrow \infty} \frac{\frac{1}{1+\frac{x}{n}} \cdot \frac{-x}{n^2}}{-\frac{1}{n^2}} \\ &= \lim_{n\rightarrow \infty} \frac{x}{1+\frac{x}{n}} = x \\ \\ &\text{Therefore,} \\ &\lim_{n\rightarrow \infty} \Big(1+\frac{x}{n}\Big)^n = e^x \end{aligned}\]
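A quick numerical check of this limit:

n <- c(5, 100, 10000, 1e6)
data.frame(n = n, prob_in_sample = 1 - (1 - 1/n)^n)   # approaches 1 - exp(-1)
1 - exp(-1)                                           # 0.6321206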

library(ISLR)
par(mfrow=c(2,2))
# each panel shows one bootstrap sample of the (X, Y) pairs in Portfolio,
# obtained by resampling rows with replacement
for (b in 1:4){
  idx <- sample(nrow(Portfolio), replace = TRUE)
  plot(Portfolio$X[idx], Portfolio$Y[idx],
       xlab = "X", ylab = "Y", col = "#07976C", pch = 16)
}

par(mfrow=c(1,1))

# alpha
alpha.fn <- function (data ,index){
   X <- data$X[index]
   Y <- data$Y[index]
   return ((var(Y)-cov(X,Y))/(var(X)+var(Y) -2*cov(X,Y)))
}
alpha.fn(Portfolio, 1:100)
## [1] 0.5758321
# alpha_hat
alpha.fn(Portfolio, sample(100,100, replace=TRUE))
## [1] 0.6851776
# r = 1000
set.seed(1)
boot(Portfolio, alpha.fn, R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.5758321 6.936399e-05  0.08868935
# with modelr & purrr R package
fn_alpha = function(data){
  # coerce to a data frame (modelr's bootstrap() stores resample objects in the strap column)
  data <- as.data.frame(data)
  data %>% summarize(alpha = (var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))  %>%  .$alpha
}

# alpha
Portfolio %>% fn_alpha()
## [1] 0.5758321
# alpha_hat
set.seed(1)
Portfolio %>%
  sample_n(100, replace=TRUE) %>%
  fn_alpha()
## [1] 0.5963833
# bootstrap
bootstrap_Portfolio <- Portfolio %>% modelr::bootstrap(1000) 

# get alpha estimate
library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:data.table':
## 
##     transpose
res <- bootstrap_Portfolio %>% mutate(alpha = map_dbl(strap, fn_alpha))

hist(res$alpha, col = "cornflowerblue", xlab = "alpha", main = "")
abline(v = 0.5758, col = "hotpink2", lwd = 3)

Summary Example - Auto dataset

Let’s summarize this chapter. If you have studied the material above, you should be able to follow this code without further explanation.

library(ISLR)
set.seed(1)
train <- sample(dim(Auto)[1], dim(Auto)[1]*0.5)

# linear regression
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
# test mse
mean((Auto$mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 26.14142
# quadratic regression
lm.fit <- lm(mpg ~ poly(horsepower,2), data = Auto, subset = train)
# test mse
mean((Auto$mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 19.82259
# cubic regression
lm.fit <- lm(mpg ~ poly(horsepower,3), data = Auto, subset = train)
# test mse
mean((Auto$mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 19.78252
## from different seed
set.seed(2)
train <- sample(dim(Auto)[1], dim(Auto)[1]*0.5)

# linear regression
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
# test mse
mean((Auto$mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 23.29559
# quadratic regression
lm.fit <- lm(mpg ~ poly(horsepower,2), data = Auto, subset = train)
# test mse
mean((Auto$mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 18.90124
# cubic regression
lm.fit <- lm(mpg ~ poly(horsepower,3), data = Auto, subset = train)
# test mse
mean((Auto$mpg-predict(lm.fit, Auto))[-train]^2)
## [1] 19.2574
## LOOCV
# linear fit
glm.fit <- glm(mpg ~ horsepower, data = Auto) # use glm() instead of lm() so that cv.glm() can be applied
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta
## [1] 24.23151 24.23114
# polynomial fits (1~5)
cv.err <- rep(0,5)
for (i in 1:5){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.err[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.err
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
## k-fold CV
# polynomial fits (1~10)
set.seed(17)
cv.err10 <- cv.err10.2 <- rep(0,10)
for (i in 1:10){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.err10[i] <- cv.glm(Auto, glm.fit, K=10)$delta[1]
  cv.err10.2[i] <- cv.glm(Auto, glm.fit, K=10)$delta[2]
}
cv.err10 # standard k-fold CV estimate
##  [1] 24.20520 19.24813 19.23734 19.46938 19.09800 19.14942 19.00590
##  [8] 18.68185 19.07743 19.65433
cv.err10.2 # bias-corrected k-fold CV estimate
##  [1] 24.22677 19.25471 19.22640 19.49643 19.15860 18.74707 18.99148
##  [8] 19.23236 19.06929 19.52286
## bootstrap
set.seed(1)
alpha.fn(Portfolio, 1:100)
## [1] 0.5758321
alpha.fn(Portfolio, sample(100, 100, replace=TRUE))
## [1] 0.5963833
boot(Portfolio, alpha.fn, R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original        bias    std. error
## t1* 0.5758321 -7.315422e-05  0.08861826
# Assess variability of the regression coefficient estimates
boot.fn <- function(data, index) return(coef(lm(mpg~horsepower, data=Auto, subset = index)))
boot.fn(Auto, 1:dim(Auto)[1])
## (Intercept)  horsepower 
##  39.9358610  -0.1578447
set.seed(1)
boot.fn(Auto, sample(dim(Auto)[1], dim(Auto)[1], replace = TRUE))
## (Intercept)  horsepower 
##  38.7387134  -0.1481952
boot.fn(Auto, sample(dim(Auto)[1], dim(Auto)[1], replace = TRUE))
## (Intercept)  horsepower 
##  40.0383086  -0.1596104
boot(Auto, boot.fn, 1000) # Bootstrap, SE(b0_hat) = 0.86, SE(b1_hat) = 0.0074
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original      bias    std. error
## t1* 39.9358610  0.02972191 0.860007896
## t2* -0.1578447 -0.00030823 0.007404467
summary(lm(mpg~horsepower, data = Auto))$coef # least sq, SE(b0_hat) = 0.717, SE(b1_hat) = 0.0064
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81
# the bootstrap standard error estimates may be more accurate because
# they do not rely on the linear-model assumptions (e.g., constant error variance).

# Quadratic model
boot.fn <- function(data, index) {
  return(coefficients(lm(mpg~horsepower + I(horsepower^2), data=Auto, subset = index)))
} # I() means 'as is': it inhibits formula interpretation, so horsepower^2 is computed literally.
set.seed(1)
boot(Auto, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 56.900099702  6.098115e-03 2.0944855842
## t2* -0.466189630 -1.777108e-04 0.0334123802
## t3*  0.001230536  1.324315e-06 0.0001208339
summary(lm(mpg~horsepower + I(horsepower^2), data = Auto))$coef
##                     Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)     56.900099702 1.8004268063  31.60367 1.740911e-109
## horsepower      -0.466189630 0.0311246171 -14.97816  2.289429e-40
## I(horsepower^2)  0.001230536 0.0001220759  10.08009  2.196340e-21

For more about the bias-corrected k-fold CV estimate, see Tibshirani, R.J. and Tibshirani, R., 2009. A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, pp. 822-829.

Extra-Other Resampling methods (s25)

See supplementary material.