Introduction to Statistical Learning

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product (scaled by 1/1000 units). The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets (scaled by 1/1000 dollars) for the product in each of those markets for three different media: TV, radio, and newspaper. The first few rows of the data are displayed below. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

   TV     Radio   Newspaper   Sales
  230.1    37.8      69.2      22.1
   44.5    39.3      45.1      10.4
   17.2    45.9      69.3       9.3
  151.5    41.3      58.5      18.5
  180.8    10.8      58.4      12.9
    8.7    48.9      75.0       7.2

In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted using the symbol \(X\), with a subscript to distinguish them. So \(X_1\) might be the TV budget, \(X_2\) the radio budget, and \(X_3\) the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable, in this case sales, is often called the response or dependent variable, and is typically denoted using the symbol \(Y\). Using this data set, one may be interested in answering questions such as the following:

 

  1. Is there a relationship between advertising budget and sales? (Analysis of Variance)

  2. How strong is the relationship between advertising budget and sales? (Quantifying linear relationship with RSE, F, \(R^2\))

  3. Which media contribute to sales? (Compare regression coefficients)

  4. How accurately can we estimate the effect of each medium on sales? (predict \(E(Y)\))

  5. How accurately can we predict future sales? (predict particular value of \(y\))

  6. Is the relationship linear? (Test validity of linear regression model)

  7. Is there synergy among the advertising media? (Interaction effect)

Using a regression model, try to answer these questions yourself.

Simple Linear Regression

Recall that the simple linear regression model, or population regression line, with a single predictor variable is defined as \[Y = \beta_0 + \beta_1 X + \epsilon\] and this relationship can be approximated by the fitted line \[\hat y = \hat \beta_0 + \hat \beta_1 x\] To estimate the true relationship, we measure \(n\) predictor-response pairs, denoted \((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\).

Recall that the sample mean \(\hat\mu\) estimates the population mean \(\mu\), and that it is unbiased, i.e., \(E(\hat \mu) = \mu\). To quantify how far \(\hat \mu\) typically is from \(\mu\), we compute the standard error of \(\hat \mu\), written as \(SE(\hat \mu)\) and given by \[Var(\hat \mu) = SE(\hat\mu)^2 = \frac{\sigma^2}{n}\] In the same fashion, we can quantify how close our estimated regression coefficients \(\hat \beta_0\) and \(\hat \beta_1\) are to the true values \(\beta_0\) and \(\beta_1\). To compute the standard errors associated with \(\hat \beta_0\) and \(\hat\beta_1\), we use the following formulas, as introduced in the regression lecture: \[SE(\hat\beta_0)^2 = \hat\sigma^2 \bigg[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}\bigg], \hspace{1cm} SE(\hat\beta_1)^2 = \frac{\hat\sigma^2} {\sum_{i=1}^n (x_i - \bar x)^2}\] where \(\sigma^2 = Var(\epsilon)\). For these formulas to be strictly valid, we need to assume that the errors \(\epsilon_i\) for each observation are uncorrelated with common variance \(\sigma^2\). This is often not true, as in our example, but the formulas still give a good approximation. In general \(\sigma^2\) is not known, but it can be estimated from the data. The estimate of \(\sigma\) is known as the residual standard error, and is given by the formula \[RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_{i=1}^n e_i^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}}\]

A \(1-\alpha\) confidence interval is defined as a range of values such that, with probability \(1-\alpha\), the range will contain the true unknown value of the parameter. For simple linear regression, the 95% confidence intervals for \(\beta_1\) and \(\beta_0\) are approximately \[\hat\beta_1 \pm 2 \times SE(\hat\beta_1) \hspace{1cm} \hat\beta_0 \pm 2\times SE(\hat\beta_0)\] We can also perform hypothesis tests on the coefficients, testing \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\) using the standard error. If \(\beta_1 = 0\), the model reduces to \(Y = \beta_0 + \epsilon\), and \(X\) is not associated with \(Y\). To test \(H_0\), we need to determine whether \(\hat \beta_1\) is sufficiently far from zero that we can be confident that \(\beta_1\) is non-zero. This of course depends on the accuracy of \(\hat \beta_1\), that is, on \(SE(\hat \beta_1)\). We compute a \(t{\text-}statistic\) given by \[t = \frac{\hat \beta_1 - 0}{SE(\hat \beta_1)}\] which measures the number of standard deviations that \(\hat\beta_1\) is away from 0.
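
To make these formulas concrete, the following sketch computes the least squares estimates, \(SE(\hat\beta_0)\), \(SE(\hat\beta_1)\), the RSE, an approximate 95% confidence interval, and the \(t\)-statistic by hand for the TV model and compares them with the output of summary(lm()). It is a minimal sketch that assumes only that the Advertising CSV used throughout this tutorial is available at the URL shown.

library(readr)
Advertising <- read_csv("http://bisyn.kaist.ac.kr/bis335/11-StatLearning-Advertising.csv")
x <- Advertising$TV
y <- Advertising$Sales
n <- length(y)

# least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# residual standard error (estimate of sigma) and standard errors from the formulas above
e     <- y - (b0 + b1 * x)
rse   <- sqrt(sum(e^2) / (n - 2))
se_b1 <- rse / sqrt(sum((x - mean(x))^2))
se_b0 <- rse * sqrt(1 / n + mean(x)^2 / sum((x - mean(x))^2))

# approximate 95% confidence interval for beta_1 and its t-statistic
c(b1 - 2 * se_b1, b1 + 2 * se_b1)
b1 / se_b1

# compare with the standard output
summary(lm(Sales ~ TV, data = Advertising))$coefficients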

We can also assess the validity of our model using the residual standard error (RSE) and \(R^2\). The RSE is a measure of the lack of fit of the model, but because it is measured in the units of \(Y\), it is not always clear what constitutes a good RSE. The \(R^2\) statistic provides an alternative measure of fit, given as \[R^2 = \frac{TSS - RSS}{TSS} = 1- \frac{RSS}{TSS}\] where \(TSS = \sum_{i=1}^n (y_i - \bar y)^2\). \(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\). In the simple regression setting, \(R^2\) is the same as the square of the Pearson correlation coefficient, i.e., \(R^2 = r^2\).
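
As a quick check of these identities, the sketch below computes \(R^2\) from its definition and compares it with the squared Pearson correlation and with the value reported by summary(); it reuses the Advertising data loaded in the sketch above.

fit <- lm(Sales ~ TV, data = Advertising)
rss <- sum(residuals(fit)^2)
tss <- sum((Advertising$Sales - mean(Advertising$Sales))^2)
1 - rss / tss                              # R^2 from the definition
cor(Advertising$TV, Advertising$Sales)^2   # squared Pearson correlation r^2
summary(fit)$r.squared                     # value reported by summary()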

library(readr)
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling() and the %>% pipe
Advertising <- read_csv("http://bisyn.kaist.ac.kr/bis335/11-StatLearning-Advertising.csv")
## Parsed with column specification:
## cols(
##   TV = col_double(),
##   Radio = col_double(),
##   Newspaper = col_double(),
##   Sales = col_double()
## )
# fit linear model
fit1 <- lm(Sales ~ TV, data = Advertising)
fit2 <- lm(Sales ~ Radio, data = Advertising)
fit3 <- lm(Sales ~ Newspaper, data = Advertising)

# model list & result summary
mod <- list(fit1, fit2, fit3)
res <- lapply(mod, summary)

# 1. ANOVA
f <- unlist(lapply(1:3, function(x) res[[x]]$fstatistic[1]))
sigma <- unlist(lapply(1:3, function(x) res[[x]]$sigma))
df1 <- unlist(lapply(1:3, function(x) res[[x]]$fstatistic[2]))
df2 <- unlist(lapply(1:3, function(x) res[[x]]$fstatistic[3]))
p <- pf(f, df1, df2, lower.tail = FALSE) 

# 2. R^2
r2 <- unlist(lapply(1:3, function(x) res[[x]]$r.squared))

# 3. Compare regression coefficient
b0 <- unlist(lapply(1:3, function(x) res[[x]]$coefficient[1]))
b1 <- unlist(lapply(1:3, function(x) res[[x]]$coefficient[2])) 
names(f) <- names(p) <- names(r2) <- names(b0) <- names(b1) <- c("fit1", "fit2", "fit3")

## Print results
fit_res <- t(data.frame(round(b0, 3), round(b1,3), 
                        round(sigma,3), round(r2,3), round(f,3), round(p,3)))
rownames(fit_res) <- c("$\\beta_0$", "$\\beta_1$", "$RSE(\\hat\\sigma)$","$R^2$", "$F$", "$p$")
kable(fit_res,
      row.names = TRUE,
      col.names = c("TV", "Radio", "Newspaper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 4. confidence interval of b0, b1
a <- 0.05

## Data
conf_fit1 <- confint(fit1, level = 1-a)
conf_fit2 <- confint(fit2, level = 1-a)
conf_fit3 <- confint(fit3, level = 1-a)

## Print results
conf_res <- do.call(rbind, lapply(list(conf_fit1, conf_fit2, conf_fit3), round, 3))
rownames(conf_res) <-  c("$\\beta_{0_{TV}}$", 
                         "$\\beta_{1_{TV}}$", 
                         "$\\beta_{0_{Radio}}$", 
                         "$\\beta_{1_{Radio}}$", 
                         "$\\beta_{0_{Newspaper}}$", 
                         "$\\beta_{1_{Newspaper}}$")
kable(conf_res,
      row.names = TRUE,
      col.names = c("Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 5. Predict E(Y)
newdata.fit1 <- data.frame(TV = 100) 
newdata.fit2 <- data.frame(Radio = 100) 
newdata.fit3 <- data.frame(Newspaper = 100)

conf_fit1_yhat <- predict(fit1, newdata.fit1, level=1-a, interval = "confidence")
conf_fit2_yhat <- predict(fit2, newdata.fit2, level=1-a, interval = "confidence")
conf_fit3_yhat <- predict(fit3, newdata.fit3, level=1-a, interval = "confidence")
conf_res_print <- rbind(conf_fit1_yhat, conf_fit2_yhat, conf_fit3_yhat)
conf_res <- data.frame(Advertising, predict(fit1, level=1-a, interval = "confidence"))

# Print results
rownames(conf_res_print) <- c("TV", "Radio", "Newspaper")
kable(conf_res_print,
      row.names = TRUE,
      col.names = c("$\\hat y$", "Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 6. Predict y*
pred_fit1_y <- predict(fit1, newdata.fit1, level=1-a, interval = "prediction")
pred_fit2_y <- predict(fit2, newdata.fit2, level=1-a, interval = "prediction")
pred_fit3_y <- predict(fit3, newdata.fit3, level=1-a, interval = "prediction")
pred_res_print <- rbind(pred_fit1_y, pred_fit2_y, pred_fit3_y)
pred_res <- cbind(Advertising, predict(fit1, level=1-a, interval = "prediction"))
## Warning in predict.lm(fit1, level = 1 - a, interval = "prediction"): predictions on current data refer to _future_ responses
# Print results
rownames(pred_res_print) <- c("TV", "Radio", "Newspaper")
kable(pred_res_print,
      row.names = TRUE,
      col.names = c("$\\hat y$", "Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 7. Test validity of linear model (H0: b1 = 0)
t0 <- unlist(lapply(1:3, function(x) res[[x]]$coefficient[5]))
t1 <- unlist(lapply(1:3, function(x) res[[x]]$coefficient[6])) 
p0 <- unlist(lapply(1:3, function(x) res[[x]]$coefficient[7]))
p1 <- unlist(lapply(1:3, function(x) res[[x]]$coefficient[8])) 
names(t0) <- names(t1) <- names(p0) <- names(p1) <- c("fit1", "fit2", "fit3")

## print results
test_res <- rbind(t0, t1, p0, p1)
rownames(test_res) <-  c("$t_{\\beta_0}$", 
                         "$t_{\\beta_1}$", 
                         "$p_{\\beta_0}$", 
                         "$p_{\\beta_1}$")
kable(test_res,
      row.names = TRUE,
      col.names = c("TV", "Radio", "Newspaper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 8. Visualize
library(ggplot2)
ggplot(conf_res, aes(x = TV, y = Sales)) + 
  geom_point(col = "red3", size = 1.5) + 
  geom_line(aes(x = TV, y = fit1$fitted.values), col = "royal blue", size = 1) + 
  geom_segment(aes(xend = TV, yend = fit1$fitted.values), col = "grey") + 
  geom_ribbon(aes(x = TV, ymin = conf_res[,6], ymax = conf_res[,7]), alpha = 0.2) + 
  geom_ribbon(aes(x = pred_res$TV, ymin = pred_res[,6], ymax = pred_res[,7]), alpha = 0.1) +
  theme_bw()

## or simply
ggplot(Advertising, aes(x = TV, y = Sales)) + 
  geom_point() + 
  geom_smooth(method = "lm")

# 9. visualize effect of x-x_bar in confidence interval & prediction interval 
conf_res <- conf_res[order(conf_res[,5]),]
plot(conf_res[,7] - conf_res[,6], type='l', ylab = 'confidence interval')

pred_res <- pred_res[order(pred_res[,5]),]
plot(pred_res[,7] - pred_res[,6], type='l', ylab = 'Prediction interval')

 

Answer questions in terms of simple linear regression

  1. Is there a relationship between advertising budget and sales? (Analysis of Variance)

    This question can be answered by fitting a simple regression model of Sales onto TV, Radio and Newspaper. As we have seen above, there is clear evidence of a relationship between advertising and sales.

  2. How strong is the relationship between advertising budget and sales? (Quantifying linear relationship with RSE, \(R^2\))

    The RSE of the linear model \(7.03 + 0.0475 \times TV\) is 3.26. In other words, actual sales in each market deviate from the true regression line by approximately 3,260 units (since the data are scaled by 1/1000), on average. Another way to think about this is that any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average. Of course, whether or not 3,260 units is an acceptable prediction error depends on the problem context. In this data set, the mean value of sales over all markets is approximately 14,000 units, so the percentage error is \(\frac{3,260}{14,000} = 23 \%\). The \(R^2\) of the same model was 0.61, so just under two-thirds of the variability in sales is explained by a linear regression on TV.

  3. Which media contribute to sales the most? (Compare regression coefficients)

    According to the linear model \(7.03 + 0.0475 \times TV\), an additional \(\$1,000\) spent on TV advertising is associated with selling approximately 47.5 additional units of the product on average. In the case of Radio and Newspaper advertising, the corresponding simple regressions associate an additional \(\$1,000\) with approximately 202 and 55 additional units sold, respectively. Therefore, Radio contributes to sales the most.

  4. How accurately can we estimate the effect of each medium on sales? (predict \(E(Y)\))

    According to the confidence intervals for \(\beta_0\) and \(\beta_1\) in the TV model, which are \([6.130, 7.935]\) and \([0.042, 0.053]\), we can conclude that in the absence of any TV advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units. Furthermore, for each \(\$1,000\) increase in TV advertising, there will be an average increase in sales of between 42 and 53 units.

  5. How accurately can we predict future sales? (predict particular value of \(y\))

    The prediction interval of the simple linear regression model for TV is \([11.27, 12.30]\) when we spend \(\$100,000\) on TV advertising.

  6. Is the relationship linear? (Test validity of linear regression model)

    According to the test results for the regression parameters \(\beta_0\) and \(\beta_1\), we can reject the null hypotheses and conclude that \(\beta_0 \neq 0\) and \(\beta_1 \neq 0\).

  7. Is there synergy among the advertising media? (Interaction effect)

    This question cannot be addressed in the simple linear regression setting; we return to it when we consider interaction terms below.

 

Multiple Linear Regression

Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor. To extend our analysis, one could run three separate simple linear regressions, each of which uses a different advertising medium as a predictor. However, fitting a separate simple linear regression model for each predictor is not entirely satisfactory, since:

  • It is unclear how to make a single prediction of sales given levels of the three advertising media budgets (they may not be independent).

  • Each of the three regression equations ignores the other two media in forming estimates for the regression coefficients (they may be correlated).

In the multiple linear regression setting, suppose that we have \(p\) distinct predictors. Then we fit the model \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon\] where \(X_j\) represents the \(j\)th predictor and \(\beta_j\) quantifies the association between that variable and the response. We interpret \(\beta_j\) as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed.

We fit the model as \[\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \hat \beta_2 x_2 + ... + \hat \beta_p x_p\] choosing the estimates \(\hat\beta_0, \hat\beta_1, ..., \hat\beta_p\) to minimize the sum of squared residuals \[RSS = \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_{i1} - \hat \beta_2 x_{i2} - ... - \hat \beta_p x_{ip})^2\] If there are correlations among the predictor variables, a predictor may be significant in simple linear regression but lose its significance in the multiple linear regression setting. In this case, one predictor may function as a surrogate for another: although the predictor does not itself affect the response, it appears to do so because it is correlated with a variable that does. Once we adjust for the correlated variable, the effect of the surrogate variable may disappear.
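
To make the least squares criterion concrete, the sketch below minimizes the RSS above numerically with optim() and compares the result with the coefficients returned by lm(); it assumes the Advertising data loaded earlier and is only an illustration of the objective being minimized.

# RSS as a function of the coefficient vector b = (b0, b1, b2, b3)
rss_fun <- function(b, X, y) sum((y - b[1] - X %*% b[-1])^2)

X <- as.matrix(Advertising[, c("TV", "Radio", "Newspaper")])
y <- Advertising$Sales

opt <- optim(par = rep(0, 4), fn = rss_fun, X = X, y = y, method = "BFGS")
round(opt$par, 3)   # numerical minimizer of the RSS (should closely match lm below)
round(coef(lm(Sales ~ TV + Radio + Newspaper, data = Advertising)), 3)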

We may test the significance of the relationship between the predictors and the response with \[H_0: \beta_1 = \beta_2 = ... = \beta_p = 0 \hspace{1cm} versus \hspace{1cm} H_a: at \ least \ one \ \beta_j \ is \ non{\text -}zero\] The hypothesis test is performed by computing the \(F{\text-}statistic\), \[F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}\] where \(TSS = \sum_{i=1}^n(y_i - \bar y)^2\) and \(RSS = \sum_{i=1}^n(y_i - \hat y_i)^2\). If the linear model assumptions are correct, \[E\{RSS/(n-p-1)\} = \sigma^2\] and, provided \(H_0\) is true, \[E\{(TSS-RSS)/p\} = \sigma^2\] so an F-statistic near 1 is consistent with \(H_0\), while a value much larger than 1 provides evidence against it.

If we want to test whether a particular subset of \(q\) of the coefficients is zero, the corresponding null hypothesis is \[H_0: \beta_{p-q+1} = \beta_{p-q+2}=...=\beta_p = 0\] where for convenience we have put the variables chosen for omission at the end of the list. In this case we fit a second model that uses all the variables except those last \(q\). Suppose that the residual sum of squares for that model is \(RSS_0\). Then the appropriate \(F{\text-}statistic\) is \[F = \frac{(RSS_0 - RSS)/q}{RSS/(n-p-1)}\] When \(q = 1\) this is equivalent to the \(t{\text-}statistic\) reported in the coefficient table (the square of each \(t{\text-}statistic\) is the corresponding F-statistic), so it reports the partial effect of adding that variable to the model.
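
This partial F-test can be carried out directly with anova(), which compares the reduced model (giving \(RSS_0\)) with the full model; a minimal sketch with \(q = 1\), dropping Newspaper from the Advertising fit used earlier:

full    <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
reduced <- lm(Sales ~ TV + Radio, data = Advertising)   # omits the last q = 1 variable

anova(reduced, full)   # F = ((RSS_0 - RSS)/q) / (RSS/(n - p - 1))
summary(full)$coefficients["Newspaper", "t value"]^2    # equals the F-statistic above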

When we have a large number of predictors, say \(p=100\), even if there is no association between the predictors and the response, we expect to see approximately five small p-values at the 5% significance level simply by chance. Hence, if we use the individual \(t{\text-}statistics\) and associated p-values to decide whether there is any association between the variables and the response, there is a very high chance that we will incorrectly conclude that there is a relationship. The F-statistic does not suffer from this problem because it adjusts for the number of predictors.
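
A small simulation illustrates this point: with \(p = 100\) predictors that are pure noise, roughly five of the individual t-tests come out below 0.05 by chance, while the overall F-statistic correctly shows no evidence of association. The sample size and seed below are arbitrary choices for the sketch.

set.seed(1)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)   # predictors unrelated to the response
y <- rnorm(n)

fit   <- lm(y ~ X)
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]   # drop the intercept row
sum(pvals < 0.05)          # roughly 5 "significant" coefficients by chance alone
summary(fit)$fstatistic    # the overall F-statistic adjusts for the number of predictors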

When we have a very large number of predictors, say \(p > n\) (the high-dimensional setting), there are more coefficients \(\beta_j\) to estimate than observations from which to estimate them. In this case we cannot even fit the multiple linear regression model using least squares, so the F-statistic cannot be used, and neither can most of the other concepts we have seen so far. Instead we can use approaches such as subset selection, shrinkage methods, or dimension reduction methods.

The task of fitting a single model that involves only meaningful predictors is referred to as variable selection. Various statistics can be used to judge the quality of a model, including Mallow's \(C_p\), the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and adjusted \(R^2\). We can also compare models by performing residual analysis.

Unfortunately, there are a total of \(2^p\) models that contain subsets of \(p\) variables, so a brute-force search over all of them is impractical. There are three classical approaches to this task:

  • Forward selection: starting from the null model (no predictors), add the best predictor one at a time, as measured by the reduction in RSS, until some stopping criterion is satisfied.
  • Backward selection: starting from the full model (every predictor), remove the least significant predictor, as measured by its p-value, until some stopping criterion is satisfied.
  • Mixed selection: a combination of forward and backward selection. Proceed as in forward selection, minimizing RSS, but remove any predictor whose individual p-value becomes large as new variables are added.

We cannot apply backward selection when \(p>n\) (here \(n\) is the number of samples). Forward selection can always be used, but it is a greedy procedure that may include redundant variables. Mixed selection can remedy this problem.
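
As a sketch of how these strategies look in R, the base function step() performs forward, backward, and mixed (both-direction) selection; note that step() uses AIC rather than raw RSS or individual p-values as its criterion, so it is an approximation of the procedures described above. It assumes the Advertising data loaded earlier.

null_model <- lm(Sales ~ 1, data = Advertising)                       # no predictors
full_model <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)  # every predictor

step(null_model, scope = formula(full_model), direction = "forward")   # forward selection
step(full_model, direction = "backward")                               # backward selection
step(null_model, scope = formula(full_model), direction = "both")      # mixed selection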

Now we review how to assess model fit in multiple linear regression. Two of the most common measures of fit are the RSE and \(R^2\). In multiple linear regression, \(R^2\) equals \(Cor(Y, \hat Y)^2\), the square of the correlation between the response and the fitted values. The RSE in multiple linear regression is defined as \[RSE = \sqrt{\frac{RSS}{n-p-1}}\] which reduces to the simple linear regression formula when \(p=1\).
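
A quick check of the claim that \(R^2 = Cor(Y, \hat Y)^2\), using the full Advertising fit (data assumed loaded as above):

fit_all <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
cor(Advertising$Sales, fitted(fit_all))^2   # squared correlation between response and fitted values
summary(fit_all)$r.squared                  # R^2 reported by summary()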

Once we have fitted the regression line, three kinds of uncertainty affect the quality of the model. First, there is inaccuracy in the coefficient estimates; this is part of the reducible error. Second, the model may be biased; in other words, the true relationship may not be well described by the MLR model. Third, there is the random error \(\epsilon\); this is the irreducible error. We can assess each source of uncertainty by measuring the following quantities.

  • Reducible error: Measured with a confidence interval
  • Bias: In linear regression, we ignore this bias and operate as if the linear model were correct
  • Irreducible error: Measured with a prediction interval

We use a confidence interval to quantify the uncertainty surrounding the average sales over a large number of cities. On the other hand, a prediction interval can be used to quantify the uncertainty surrounding sales for a particular city.

library(readr)
Advertising <- read_csv("http://bisyn.kaist.ac.kr/bis335/11-StatLearning-Advertising.csv")
## Parsed with column specification:
## cols(
##   TV = col_double(),
##   Radio = col_double(),
##   Newspaper = col_double(),
##   Sales = col_double()
## )
# fit linear model
fit12 <- lm(Sales ~ TV + Radio, data = Advertising)
fit123 <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)

# model list & result summary
res12 <- summary(fit12)
res123 <- summary(fit123)

# 1. ANOVA
f_12 <- res12$fstatistic
f_123 <- res123$fstatistic
sigma_12 <- res12$sigma
sigma_123 <- res123$sigma
p_12 <- pf(f_12[1], f_12[2], f_12[3], lower.tail = FALSE) 
p_123 <- pf(f_123[1], f_123[2], f_123[3], lower.tail = FALSE) 

# 2. R^2
r2_12 <- res12$r.squared
r2_123 <- res123$r.squared

# 3. Compare regression coefficient
b0_12 <- res12$coefficient[1]
b1_12 <- res12$coefficient[2:3]
b0_123 <- res123$coefficient[1]
b1_123 <- res123$coefficient[2:4]

## Print results
fit_res <- t(data.frame(round(b0_12, 3), round(b1_12[1],3), round(b1_12[2],3), 
                        round(b0_123, 3), round(b1_123[1],3), round(b1_123[2],3), round(b1_123[3],3),
                        round(sigma_12,3), round(sigma_123,3), 
                        round(r2_12,3), round(r2_123,3), 
                        round(f_12[1],3), round(f_123[1],3),
                        round(p_12,3), round(p_123[1],3)))
rownames(fit_res) <- c("$\\beta_{0_{TV|Radio}}$", "$\\beta_{1_{TV}}$",  "$\\beta_{1_{Radio}}$",
                       "$\\beta_{0_{TV|Radio|Newspaper}}$", "$\\beta_{1_{TV}}$",  
                       "$\\beta_{1_{Radio}}$", "$\\beta_{1_{Newspaper}}$",
                       "$RSE(\\hat\\sigma_{TV|Radio})$", "$RSE(\\hat\\sigma_{TV|Radio|Newspaper})$",
                       "$R^2_{TV|Radio}$", "$R^2_{TV|Radio|Newspaper}$",
                       "$F_{TV|Radio}$", "$F_{TV|Radio|Newspaper}$",
                       "$p_{TV|Radio}$", "$p_{TV|Radio|Newspaper}$")
kable(fit_res,
      row.names = TRUE,
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 4. confidence interval of b0, b1
a <- 0.05
conf_fit12 <- confint(fit12, level = 1-a)
conf_fit123 <- confint(fit123, level = 1-a)

## Print results
conf_res <- rbind(round(conf_fit12, 3), round(conf_fit123, 3))
rownames(conf_res) <-  c("$\\beta_{0_{TV|Radio}}$", 
                         "$\\beta_{1_{TR,TV}}$", 
                         "$\\beta_{1_{TR,Radio}}$", 
                         "$\\beta_{0_{TV|Radio|Newspaper}}$", 
                         "$\\beta_{1_{TRN,TV}}$", 
                         "$\\beta_{1_{TRN,Radio}}$", 
                         "$\\beta_{1_{TRN,Newspaper}}$")
kable(conf_res,
      row.names = TRUE,
      col.names = c("Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 5. Predict E(Y)
newdata.fit12 <- data.frame(TV = 100, Radio = 20) 
newdata.fit123 <- data.frame(TV = 100, Radio = 20, Newspaper = 0) 
conf_fit12_yhat <- predict(fit12, newdata.fit12, level=1-a, interval = "confidence")
conf_fit123_yhat <- predict(fit123, newdata.fit123, level=1-a, interval = "confidence")
conf_res_print <- rbind(conf_fit12_yhat, conf_fit123_yhat)

## print result
rownames(conf_res_print) <- c("TV|Radio", "TV|Radio|Newspaper")
kable(conf_res_print,
      row.names = TRUE,
      col.names = c("$\\hat y$", "Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 6. Predict y*
pred_fit12_y <- predict(fit12, newdata.fit12, level=1-a, interval = "prediction")
pred_fit123_y <- predict(fit123, newdata.fit123, level=1-a, interval = "prediction")
pred_res_print <- rbind(pred_fit12_y, pred_fit123_y)

## print result
rownames(pred_res_print) <- c("TV|Radio", "TV|Radio|Newspaper")
kable(pred_res_print,
      row.names = TRUE,
      col.names = c("$\\hat y$", "Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 7. Test validity of linear model (H0: b1 = 0)
coef_res_print <- rbind(res12$coefficients, res123$coefficients)
rownames(coef_res_print) <-  c("$\\beta_{0_{TV|Radio}}$", 
                               "$\\beta_{1_{TR,TV}}$", 
                               "$\\beta_{1_{TR,Radio}}$", 
                               "$\\beta_{0_{TV|Radio|Newspaper}}$", 
                               "$\\beta_{1_{TRN,TV}}$", 
                               "$\\beta_{1_{TRN,Radio}}$", 
                               "$\\beta_{1_{TRN,Newspaper}}$")

kable(coef_res_print,
      row.names = TRUE,
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 7. Correlation of response and predictor variables
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(fit123)
##        TV     Radio Newspaper 
##  1.004611  1.144952  1.145187
# 8. Plot
plot(fit123)

 

Extension of the Linear Model

The standard regression model provides interpretable results and works quite well on many real-world problems. However, two of its most important assumptions, additivity and linearity, are often violated in practice. The additive assumption means that the effect of changes in a predictor \(X_j\) on the response \(Y\) is independent of the values of the other predictors. The linear assumption states that the change in the response \(Y\) due to a one-unit change in \(X_j\) is constant, regardless of the value of \(X_j\).

Consider the interaction effect (or synergy effect) of two predictors, say TV and Radio. Now we have two main effects and one interaction effect, and our model becomes \[\begin{aligned} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon \\ & = \beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_2 X_2 + \epsilon \\ & = \beta_0 + \tilde \beta_1 X_1 + \beta_2 X_2 + \epsilon \end{aligned}\]

Here \(\beta_3\) denotes the increase in the effectiveness of \(X_1\) for a one-unit increase in \(X_2\), or vice versa. (In our example, increasing radio advertising by one unit increases the effectiveness of TV advertising by \(\beta_3\).)
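
The two-predictor interaction model above can be fit directly in R; the sketch below fits Sales ~ TV * Radio (the model referred to in question 7 of the answers further down) and compares its \(R^2\) with the main-effects-only model, assuming the Advertising data loaded earlier.

fit_main <- lm(Sales ~ TV + Radio, data = Advertising)   # main effects only
fit_int  <- lm(Sales ~ TV * Radio, data = Advertising)   # adds the TV:Radio interaction term

summary(fit_int)$coefficients   # the TV:Radio row is the estimate of beta_3
c(main = summary(fit_main)$r.squared, interaction = summary(fit_int)$r.squared)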

When an interaction term has a very small p-value, but the associated main effects do not, we need to consider hierarchical principle. The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. In other words, if the interaction between \(X_1\) and \(X_2\) seems important, then we should include both \(X_1\) and \(X_2\) in the model even if their coefficient estimates have large p-values.

# fit linear model
fit123_int <- lm(Sales ~ TV * Radio * Newspaper, data = Advertising)

# model list & result summary
res <- summary(fit123_int)

# 1. ANOVA
f <- res$fstatistic
sigma <- res$sigma
p <- pf(f[1], f[2], f[3], lower.tail = FALSE) 

# 2. R^2
r2 <- res$r.squared

# 3. Compare regression coefficient
b0 <- res$coefficient[1]
b1 <- res$coefficient[2:7]

## Print results
fit_res <- t(data.frame(round(b0, 3), round(b1[1],3), round(b1[2],3), 
                        round(b1[3],3), round(b1[4],3), round(b1[5],3), round(b1[6],3),
                        round(sigma,3), round(r2,3), round(f[1],3), round(p,3)))
rownames(fit_res) <- c("$\\beta_0$", "$\\beta_{1_{TV}}$",  "$\\beta_{1_{Radio}}$",  
                       "$\\beta_{1_{TV|Radio}}$","$\\beta_{1_{TV|Newspaper}}$",
                       "$\\beta_{1_{Radio|Newspaper}}$", "$\\beta_{1_{All}}$",
                       "$RSE(\\hat\\sigma)$","$R^2$", "$F$", "$p$")
kable(fit_res,
      row.names = TRUE,
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 4. confidence interval of b0, b1
a <- 0.05
conf_fit123_int <- confint(fit123_int, level = 1-a)

## Print results
conf_res <- round(conf_fit123_int, 3)
rownames(conf_res) <-  c("$\\beta_0$", 
                         "$\\beta_{1_{TV}}$", 
                         "$\\beta_{1_{Radio}}$", 
                         "$\\beta_{1_{Newspaper}}$", 
                         "$\\beta_{1_{TV|Radio}}$",
                         "$\\beta_{1_{TV|Newspaper}}$",
                         "$\\beta_{1_{Radio|Newspaper}}$", 
                         "$\\beta_{1_{All}}$")
kable(conf_res,
      row.names = TRUE,
      col.names = c("Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 5. Predict E(Y)
newdata.fit123_int <- data.frame(TV = 100, Radio = 20, Newspaper = 0) 
conf_fit123_int_yhat <- predict(fit123_int, newdata.fit123_int, level=1-a, interval = "confidence")
conf_res <- data.frame(Advertising, conf_fit123_int_yhat)
## Warning in data.frame(Advertising, conf_fit123_int_yhat): row names were
## found from a short variable and have been discarded
## print result
kable(conf_fit123_int_yhat,
      row.names = TRUE,
      col.names = c("$\\hat y$", "Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 6. Predict y*
pred_fit123_int_y <- predict(fit123_int, newdata.fit123_int, level=1-a, interval = "prediction")

## print result
kable(pred_fit123_int_y,
      row.names = TRUE,
      col.names = c("$\\hat y$", "Lower", "Upper"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 7. Test validity of linear model (H0: b1 = 0)
coef_res_print <- as.data.frame(res$coefficient[25:32])
rownames(coef_res_print) <-  c("$\\beta_{0}$", 
                               "$\\beta_{1_{TV}}$", 
                               "$\\beta_{1_{Radio}}$", 
                               "$\\beta_{1_{Newspaper}}$", 
                               "$\\beta_{1_{TV, Radio}}$", 
                               "$\\beta_{1_{TV, Newspaper}}$", 
                               "$\\beta_{1_{Radio, Newspaper}}$", 
                               "$\\beta_{1_{TV, Radio, Newspaper}}$")
kable(coef_res_print,
      row.names = TRUE,
      col.names = c("p-value"),
      escape = FALSE,  
      align = 'c',
      format = 'latex',
      booktabs = TRUE) %>%
  kable_styling(position = "center")
# 7. Correlation of response and predictor variables
cor(Advertising)
##                   TV      Radio  Newspaper     Sales
## TV        1.00000000 0.05480866 0.05664787 0.7822244
## Radio     0.05480866 1.00000000 0.35410375 0.5762226
## Newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## Sales     0.78222442 0.57622257 0.22829903 1.0000000
library(car)
vif(fit123) # variance inflation factor, see later tutorial
##        TV     Radio Newspaper 
##  1.004611  1.144952  1.145187

 

Answer questions in terms of multiple linear regression

  1. Is there a relationship between advertising budget and sales? (Analysis of Variance)

    This question can be answered by fitting a multiple regression model of Sales onto TV, Radio and Newspaper. As we have seen above, there is clear evidence of a relationship between advertising and sales.

  2. How strong is the relationship between advertising budget and sales? (Quantifying linear relationship with RSE, \(R^2\))

    The RSE of the model is 1,681 units while the mean value for the response is \(14,022\), indicating a percentage error of roughly \(\frac{1,681}{14022} = 12 \%\). The \(R^2\) statistic shows that the predictors explain almost \(90\%\) of the variance in sales.

  3. Which media contribute to sales the most? (Compare regression coefficients)

    According to the multiple linear regression model, \(2.939 + 0.046 \times TV + 0.189 \times Radio - 0.001 \times Newspaper\), for a given amount of TV and newspaper advertising, spending an additional \(\$1,000\) on radio advertising leads to an increase in sales of approximately 189 units. Unlike in the simple linear regression case, Newspaper does not affect sales; in the simple regression it acted as a surrogate for Radio. In other words, because of the correlation between Radio and Newspaper \((0.35)\), newspaper advertising got credit for the effect of radio advertising on sales.

  4. How accurately can we estimate the effect of each medium on sales? (predict \(E(Y)\))

    The \(95\%\) confidence interval for each coefficient is \([0.043, 0.049]\) for TV, \([0.172, 0.206]\) for Radio, and \([-0.013, 0.011]\) for Newspaper. Therefore, the associations between sales and TV advertising and between sales and Radio advertising are statistically significant. In addition, since the VIF scores are \(1.005\), \(1.145\), and \(1.145\) for TV, Radio, and Newspaper, there is no evidence of collinearity.

  5. How accurately can we predict future sales? (predict particular value of \(y\))

    The prediction interval of the multiple linear regression model is \([10.86, 11.71]\) when we spend \(\$100,000\) on TV and \(\$20,000\) on Radio.

  6. Is the relationship linear? (Test validity of linear regression model)

    The residual plot suggests that there is a non-linear effect that our model does not capture.

  7. Is there synergy among the advertising media? (Interaction effect)

    According to the \(TV \times Radio\) model, \(R^2\) increases from 89.7% to 96.8% compared to the \(TV + Radio\) model, so \(\frac{(96.8-89.7)}{(100-89.7)} = 69 \%\) of the previously unexplained variance is explained by the interaction effect. In other words, an additional \(\$1,000\) spent on TV advertising is associated with increased sales of \((\hat \beta_1 + \hat \beta_3 \times Radio) \times 1,000 = 19 + 1.1 \times Radio\) units, and an increase in Radio advertising of \(\$1,000\) is associated with an increase in sales of \((\hat\beta_2 + \hat\beta_3 \times TV)\times 1,000 = 29 + 1.1 \times TV\) units.

Potential Problems

When we fit a linear regression model to a particular data set, many problems may occur. The most common are listed below, along with ways to inspect and address each one.

  1. Non-linearity of the response-predictor relationships.
  • Problem: High bias
  • Inspect: Residual plot
  • Solution: Variable transformation
  2. Correlation of error terms.
  • Problem: Underestimation of the true standard errors (overly optimistic conclusions about the model)
  • Inspect: Residual plot with respect to the correlating factor (often time); adjacent residuals may have similar values
  • Solution: Experimental design (block the correlating factor)
  3. Non-constant variance of error terms (heteroscedasticity).
  • Problem: Unreliable estimates of statistics such as standard errors, confidence intervals, and hypothesis tests
  • Inspect: Funnel shape in the residual plot
  • Solution: Variable transformation, weighted least squares
  4. Outliers.
  • Problem: Reduced power of the model (wide confidence intervals and high p-values)
  • Inspect: Studentized residual plot (studentized residual > 3)
  • Solution: Remove the observation with caution, or winsorize it
  5. High-leverage points.
  • Problem: The entire model relies on a few points
  • Inspect: Leverage statistic (> (p+1)/n), studentized residual vs \(h_i\) plot
  • Solution: Remove the observation with caution, or winsorize it
  6. Collinearity.
  • Problem: Two or more predictor variables are closely related to one another; coefficient estimates become unreliable and the power of the hypothesis tests is reduced
  • Inspect: Examine the correlation matrix of the predictors. If there is multicollinearity, meaning that three or more variables are correlated even though no single pair is highly correlated, compute the variance inflation factor (VIF), given as \(VIF(\hat\beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}\), where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) from a regression of \(X_j\) onto all of the other predictors. A value exceeding 5 or 10 indicates a problematic amount of collinearity; the smallest possible value is \(\min(VIF(\hat\beta_j)) = 1\). See the sketch after this list.
  • Solution: Drop one of the problematic variables, or combine the collinear variables into a single predictor
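
As a sketch of the VIF computation referenced in the collinearity item above, the code below obtains each predictor's VIF by regressing it on the other two predictors and compares the result with car::vif(); it assumes the Advertising data loaded earlier in this tutorial.

# VIF by hand: 1 / (1 - R^2 of X_j regressed on all other predictors)
r2_tv        <- summary(lm(TV ~ Radio + Newspaper, data = Advertising))$r.squared
r2_radio     <- summary(lm(Radio ~ TV + Newspaper, data = Advertising))$r.squared
r2_newspaper <- summary(lm(Newspaper ~ TV + Radio, data = Advertising))$r.squared
1 / (1 - c(TV = r2_tv, Radio = r2_radio, Newspaper = r2_newspaper))

# compare with the car package
library(car)
vif(lm(Sales ~ TV + Radio + Newspaper, data = Advertising))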

Qualitative predictors

We have assumed that all variables in our linear regression model are quantitative. But in practice, some predictors are often qualitative. Let's inspect the Credit data shown below. As you can see, we can perform linear regression with a categorical variable such as gender; gender is simply encoded as a dummy variable.

Credit <- read.csv("http://bisyn.kaist.ac.kr/bis335/11-StatLearning-Credit.csv")
Credit <- Credit[,-1]   # drop the first (row index) column
library(dplyr)          # provides glimpse()
glimpse(Credit)
## Observations: 400
## Variables: 11
## $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 2...
## $ Limit     <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300...
## $ Rating    <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 58...
## $ Cards     <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3...
## $ Age       <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, ...
## $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9,...
## $ Gender    <fct> Male, Female, Male, Female, Male, Male, Female, Male...
## $ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No...
## $ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No,...
## $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian...
## $ Balance   <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, ...
pairs(Credit[,c(11,5, 4, 6, 1:3)], col = "dark blue", cex = 0.1)

fit <- lm(Balance ~ Gender, data = Credit)
res <- summary(fit)
res$coefficients
##              Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 529.53623   31.98819 16.5541153 3.312981e-47
## GenderMale  -19.73312   46.05121 -0.4285039 6.685161e-01
  • balance: average credit card debt for a number of individuals
  • cards: number of credit cards
  • education: years of education
  • income: in thousands of dollars
  • limit: credit limit
  • rating: credit rating

Predictors with only two levels

Suppose that we wish to investigate differences in credit card balance between males and females, ignoring the other variables for the moment. We can create a new variable that takes the form \[x_i = \bigg\{\begin{array}{ll} 1 \hspace{1cm} if \ i_{th} \ person \ is \ male \\ 0 \hspace{1cm} if \ i_{th} \ person \ is \ female \end{array} \] and use this variable as a predictor in the regression equation. This results in the model

\[y_i = \beta_0 + \beta_1x_i + \epsilon_i = \bigg\{\begin{array}{ll} \beta_0 + \beta_1 + \epsilon_i \hspace{1cm} if \ i_{th} \ person \ is \ male \\ \beta_0 + \epsilon_i \hspace{1.9cm} if \ i_{th} \ person \ is \ female \end{array} \]

Now \(\beta_0\) can be interpreted as the average credit card balance among females, \(\beta_0 + \beta_1\) as the average credit card balance among males, and \(\beta_1\) as the average difference in credit card balance between males and females.

Alternatively, we could create a dummy variable

\[x_i = \bigg\{\begin{array}{ll} 1 \hspace{1cm} if \ i_{th} \ person \ is \ male \\ -1 \hspace{0.7cm} if \ i_{th} \ person \ is \ female \end{array} \]

and use this variable in the regression equation. This results in the model

\[y_i = \beta_0 + \beta_1x_i + \epsilon_i = \bigg\{\begin{array}{ll} \beta_0 + \beta_1 + \epsilon_i \hspace{1cm} if \ i_{th} \ person \ is \ male \\ \beta_0 - \beta_1 + \epsilon_i \hspace{1cm} if \ i_{th} \ person \ is \ female \end{array} \]

As you can see in this example, the choice of coding scheme does not affect the fit, but it does affect the way the coefficients are interpreted.
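
This can be checked in R by refitting the Gender model with a +1/-1 (sum-to-zero) coding; a minimal sketch, assuming the Credit data loaded above. The fitted values are identical under the two codings; only the coefficient interpretation changes.

Credit$Gender <- factor(Credit$Gender)           # make sure Gender is a factor
fit_01 <- lm(Balance ~ Gender, data = Credit)    # default 0/1 dummy coding
contrasts(Credit$Gender) <- contr.sum(2)         # switch to +1/-1 coding
fit_pm <- lm(Balance ~ Gender, data = Credit)

coef(fit_01)   # slope = difference from the baseline level
coef(fit_pm)   # slope = (plus or minus) half that difference
all.equal(as.numeric(fitted(fit_01)), as.numeric(fitted(fit_pm)))   # identical fitted values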

Qualitative predictors with more than two levels

When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. Instead, we can create additional dummy variables. For example, the ethnicity variable can be encoded as \[x_{i1} = \bigg\{\begin{array}{ll} 1 \hspace{1cm} if \ i_{th} \ person \ is \ Asian \\ 0 \hspace{1cm} if \ i_{th} \ person \ is \ not \ Asian \end{array} \] and \[x_{i2} = \bigg\{\begin{array}{ll} 1 \hspace{1cm} if \ i_{th} \ person \ is \ Caucasian \\ 0 \hspace{1cm} if \ i_{th} \ person \ is \ not \ Caucasian \end{array} \] Then the model becomes

\[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i = \Bigg\{\begin{array}{lll} \beta_0 + \beta_1 + \epsilon_i \hspace{1cm} if \ i_{th} \ person \ is \ Asian \\ \beta_0 + \beta_2 + \epsilon_i \hspace{1cm} if \ i_{th} \ person \ is \ Caucasian \\ \beta_0 + \epsilon_i \hspace{1.9cm} if \ i_{th} \ person \ is \ African \ American \end{array} \]

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable is known as baseline.

fit <- lm(Balance ~ Ethnicity, data = Credit)
res <- summary(fit)
res$coefficients
##                     Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)        531.00000   46.31868 11.4640565 1.774117e-26
## EthnicityAsian     -18.68627   65.02107 -0.2873880 7.739652e-01
## EthnicityCaucasian -12.50251   56.68104 -0.2205766 8.255355e-01
res
## 
## Call:
## lm(formula = Balance ~ Ethnicity, data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -531.00 -457.08  -63.25  339.25 1480.50 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          531.00      46.32  11.464   <2e-16 ***
## EthnicityAsian       -18.69      65.02  -0.287    0.774    
## EthnicityCaucasian   -12.50      56.68  -0.221    0.826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared:  0.0002188,  Adjusted R-squared:  -0.004818 
## F-statistic: 0.04344 on 2 and 397 DF,  p-value: 0.9575

The table above shows that the estimated balance for the baseline category, African American, is \(\$531.00\). The Asian category is estimated to carry \(\$18.69\) less debt than the baseline, and the Caucasian category \(\$12.50\) less than the baseline. However, the p-values associated with the coefficient estimates for the two dummy variables are very large, suggesting no statistical evidence of a real difference in credit card balance between the ethnicities. You can also see that the F-test has a p-value of 0.96, indicating that we cannot reject the null hypothesis of no relationship between balance and ethnicity.

The concept of interactions also applies to qualitative variables, and to combinations of qualitative and quantitative variables. In fact, an interaction between a qualitative variable and a quantitative variable has a particularly nice interpretation. Suppose we wish to predict balance using the income (quantitative) and student (qualitative) variables. In the absence of an interaction term, the model takes the form

\[\begin{aligned} balance_i &\approx \beta_0 + \beta_1 \times income_i + \bigg\{\begin{array}{ll} \beta_2 \hspace{1cm} if \ i_{th} \ person \ is \ a \ student \\ 0 \hspace{1.2cm} if \ i_{th} \ person \ is \ not \ a \ student \end{array} \\ &= \beta_1 \times income_i + \bigg\{\begin{array}{ll} \beta_0+\beta_2 \hspace{1cm} if \ i_{th} \ person \ is \ a \ student \\ \beta_0 \hspace{1.9cm} if \ i_{th} \ person \ is \ not \ a \ student \end{array} \end{aligned}\]

Notice that this amounts to fitting two parallel lines to the data, one for students and one for non-students. The lines for students and non-students have different intercepts, \(\beta_0 + \beta_2\) versus \(\beta_0\), but the same slope, \(\beta_1\). This is a serious limitation, because it forces the average effect on balance of a one-unit increase in income to be the same regardless of whether or not the individual is a student. By multiplying income by the dummy variable for student, we can address this limitation. Our model now becomes

\[\begin{aligned} balance_i &\approx \beta_0 + \beta_1 \times income_i + \bigg\{\begin{array}{ll} \beta_2 + \beta_3 \times income_i \hspace{1cm} if \ student \\ 0 \hspace{4cm} if \ not \ student \end{array} \\ &= \bigg\{\begin{array}{ll} (\beta_0+\beta_2) + (\beta_1 + \beta_3) \times income_i \hspace{2cm} if \ student \\ \beta_0+\beta_1 \times income_i \hspace{4.7cm} if \ not \ student \end{array} \end{aligned}\]

Now our model has different intercepts as well as different slopes. This allows for the possibility that changes in income affect the credit card balances of students and non-students differently. We note that the slope for students is lower than the slope for non-students, suggesting that increases in income are associated with smaller increases in credit card balance among students than among non-students (in other words, for the same increase in income, non-students increase their balance more than students do).
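
A minimal sketch of this qualitative-quantitative interaction, fitting both the parallel-lines model and the interaction model to the Credit data loaded earlier:

fit_parallel <- lm(Balance ~ Income + Student, data = Credit)   # common slope, different intercepts
fit_interact <- lm(Balance ~ Income * Student, data = Credit)   # different intercepts and slopes

coef(fit_parallel)
coef(fit_interact)   # the Income:StudentYes term is the difference in slope for students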

K-Nearest Neighbors (KNN)

Let us turn our attention from regression to a classification problem for a while. As you have already studied, the K-nearest neighbors (KNN) classifier estimates the probability of each class by observing the classes of the \(K\) nearest neighbors. Recall that the conditional probability for class \(j\) is given as \(Pr(Y = j|X=x_0) = \frac{1}{K}\sum_{i \in N_0} I(y_i = j)\), where \(N_0\) is the set of \(K\) points in the training data that are closest to \(x_0\).
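
To make this formula concrete, the sketch below estimates \(Pr(Y = j \mid X = x_0)\) by hand for one query point on simulated two-dimensional data (the simulation settings and \(K = 5\) are arbitrary choices for the illustration):

set.seed(10)
n <- 100
x <- matrix(rnorm(2 * n), n, 2)   # training inputs in two dimensions
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, "ORANGE", "BLUE"))

x0 <- c(0.25, -0.10)   # query point
K  <- 5
d  <- sqrt(rowSums((x - matrix(x0, n, 2, byrow = TRUE))^2))   # Euclidean distances to x0
N0 <- order(d)[1:K]                                           # indices of the K nearest neighbors
table(y[N0]) / K                                              # estimated class probabilities at x0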

The figure below shows a scatterplot of training data on a pair of inputs \(X_1\) and \(X_2\). The data are simulated, and for the present the simulation model is not important. The output class variable \(G\) has the values \(\color{blue}{BLUE}\) or \(\color{orange}{ORANGE}\), and is represented as such in the scatterplot. There are 100 points in each of the two classes.

library(ElemStatLearn)  # mixture.example data; the package is archived on CRAN, install from the archive if needed

require(class)
## Loading required package: class
x <- mixture.example$x # X1 and X2
g <- mixture.example$y # class label
xnew <- mixture.example$xnew # new data points

# KNN = 1
mod1 <- knn(x, xnew, g, k=1, prob=TRUE)
prob <- attr(mod1, "prob")
prob <- ifelse(mod1=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob1 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob1, levels=0.5, labels="", xlab="", ylab="", 
        main= "1-nearest neighbour", axes=FALSE)
prob <- mixture.example$prob
prob.bayes <- matrix(prob, length(px1), length(px2)) # bayes decision boundary
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob1>0.5, "coral", "cornflowerblue"))
box()

# KNN = 5
mod5 <- knn(x, xnew, g, k=5, prob=TRUE)
prob <- attr(mod5, "prob")
prob <- ifelse(mod5=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob5 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob5, levels=0.5, labels="", xlab="", ylab="", 
        main= "5-nearest neighbour", axes=FALSE)
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob5>0.5, "coral", "cornflowerblue"))
box()

# KNN = 15
mod15 <- knn(x, xnew, g, k=15, prob=TRUE)
prob <- attr(mod15, "prob")
prob <- ifelse(mod15=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob15 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob15, levels=0.5, labels="", xlab="", ylab="", 
        main= "15-nearest neighbour", axes=FALSE)
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob15>0.5, "coral", "cornflowerblue"))
box()

# KNN = 100
mod100 <- knn(x, xnew, g, k=100, prob=TRUE)
prob <- attr(mod100, "prob")
prob <- ifelse(mod100=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob100 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob100, levels=0.5, labels="", xlab="", ylab="", 
        main="100-nearest neighbour", axes=FALSE)
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob100>0.5, "coral", "cornflowerblue"))
box()

# logistic regression on the grid labels (gives a linear decision boundary)
xnew <- mixture.example$xnew # grid points
cl <- as.numeric(mixture.example$prob>0.5)
new_data <- data.frame(xnew,cl)
glm.grid <- glm(cl~xnew, data=new_data, family = binomial)# fitting model
prob.glm <- predict(glm.grid, type = "response")
prob1 <- matrix(prob.glm, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob1, levels=0.5, labels="", xlab="", ylab="", 
        main= "Linear regression", axes=FALSE)
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob1>0.5, "coral", "cornflowerblue"))
box()

# Bayes classifier
prob <- mixture.example$prob
prob.bayes <- matrix(prob, length(px1), length(px2))
contour(px1, px2, prob.bayes, levels=0.5, labels="", xlab="x1",
        ylab="x2",
        main="Bayes decision boundary")
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob.bayes>0.5, "coral", "cornflowerblue"))

Assessing KNN model performance

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
set.seed(123)
centers <- c(sample(1:10, 5000, replace=TRUE),
             sample(11:20, 5000, replace=TRUE))
means <- mixture.example$means
means <- means[centers, ]
mix.test <- mvrnorm(10000, c(0,0), 0.2*diag(2))
mix.test <- mix.test + means
cltest <- c(rep(0, 5000), rep(1, 5000))
ks <- c(1,3,5,7,9,11,15,17,23,25,35,45,55,83,101,151)

# nearest neighbours to try
nks <- length(ks)
misclass.train <- numeric(length=nks)
misclass.test <- numeric(length=nks)
names(misclass.train) <- names(misclass.test) <- ks

for (i in seq(along=ks)) {
  mod.train <- knn(x,x,k=ks[i],cl=g)
  mod.test <- knn(x, mix.test,k= ks[i],cl= g)
  misclass.train[i] <- 1 - sum(mod.train==factor(g))/200
  misclass.test[i] <- 1 - sum(mod.test==factor(cltest))/10000
}
print(cbind(misclass.train, misclass.test))
##     misclass.train misclass.test
## 1            0.000        0.2980
## 3            0.130        0.2415
## 5            0.130        0.2288
## 7            0.145        0.2241
## 9            0.155        0.2276
## 11           0.185        0.2463
## 15           0.155        0.2512
## 17           0.175        0.2486
## 23           0.175        0.2525
## 25           0.170        0.2536
## 35           0.200        0.2611
## 45           0.210        0.2575
## 55           0.200        0.2689
## 83           0.265        0.2827
## 101          0.305        0.3068
## 151          0.305        0.3245
df <- round(rev(200/ks),1)
plot(rev(misclass.train),xlab="Flexibility (degrees of freedom, N/K)",ylab="Misclassification error",
     type="n",xaxt="n", ylim = c(0.05, 0.33))
axis(1, 1:length(df), as.character(df))
axis(3, 1:length(ks), as.character(rev(ks)))
mtext("Number of Nearest Neighbor (K)", side=3, line = -1)
lines(rev(misclass.test),type="b",col='blue',pch=20)
lines(rev(misclass.train),type="b",col='red',pch=20)
abline(h=0.21,col='purple')
legend("bottomleft",lty=1,col=c("red","blue", "purple"),legend = c("train ", "test", "Bayes"))

Comparison of Linear regression with K-Nearest Neighbors

Let's consider the best-known nonparametric regression method, K-nearest neighbors regression (KNN regression). The KNN regression method is closely related to the KNN classifier discussed above. Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(N_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(N_0\). In other words, \[\hat f(x_0) = \frac{1}{K} \sum_{x_i \in N_0}y_i\] When \(K\) is small, the fitted regression function looks like a step function, but when \(K\) is large it becomes smooth. The optimal value for \(K\) will depend on the bias-variance tradeoff.
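
A minimal sketch of this formula, predicting sales at a single TV budget by averaging the \(K\) nearest training responses; it assumes the Advertising data loaded earlier (FNN::knn.reg, used further below, automates this for every point).

x0 <- 150   # TV budget at which to predict
K  <- 5
d  <- abs(Advertising$TV - x0)    # distances in the single predictor TV
N0 <- order(d)[1:K]               # the K closest training observations
mean(Advertising$Sales[N0])       # KNN regression estimate of f(x0)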

In what setting will a parametric approach such as least squares linear regression outperform a non-parametric approach such as KNN regression? The answer is simple: the parametric approach will outperform the nonparametric approach if the parametric form that has been selected is close to the true form of \(f\).

As the dimension of the data set increases, the performance of KNN deteriorates dramatically. This is a common problem of KNN: spreading 100 observations over \(p=20\) dimensions results in a phenomenon in which a given observation has no nearby neighbors; this is the so-called curse of dimensionality. Linear regression scales better than KNN because a parametric method is often more powerful when there are only a small number of observations per dimension.

library(readr)
Advertising <- read_csv("http://bisyn.kaist.ac.kr/bis335/11-StatLearning-Advertising.csv")
## Parsed with column specification:
## cols(
##   TV = col_double(),
##   Radio = col_double(),
##   Newspaper = col_double(),
##   Sales = col_double()
## )
library(FNN)
## 
## Attaching package: 'FNN'
## The following objects are masked from 'package:class':
## 
##     knn, knn.cv
res_knn1 <- knn.reg(train = Advertising$TV, y = Advertising$Sales, k=1)
res_knn5 <- knn.reg(train = Advertising$TV, y = Advertising$Sales, k=5)
res_knn10 <- knn.reg(train = Advertising$TV, y = Advertising$Sales, k=10)
plot(Advertising$TV, Advertising$Sales, col = "sky blue", pch = 16, ylab = "Sales", xlab = "TV")
lines(Advertising$TV, fit1$fitted.values, col = "blue")
lines(Advertising$TV[order(Advertising$TV)], res_knn1$pred[order(Advertising$TV)], col = "yellow")
lines(Advertising$TV[order(Advertising$TV)], res_knn5$pred[order(Advertising$TV)], col = "green")
lines(Advertising$TV[order(Advertising$TV)], res_knn10$pred[order(Advertising$TV)], col = "red")
legend("topleft", legend = c("linear", "KNN, K=1", "KNN, K=5", "KNN, K=10"), 
       col = c("blue", "yellow", "green", "red"), lty = rep(1,4))

Datasets & Examples

  • You can get the following descriptions and more details with 'help(name of dataset)'
  • help(Auto)

  • Gas mileage, horsepower, and other information for 392 vehicles
  • help(College)

  • Statistics for a large number of US Colleges from the 1995 issue of US News and World Report
  • help(Boston)

  • The Boston data frame has 506 rows and 14 columns
#install.packages("ISLR")
#install.packages("MASS")
library(ISLR); data(Auto); data(College)
## 
## Attaching package: 'ISLR'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Credit
library(MASS); data(Boston)

Formula syntax: Interaction terms in model

  • The syntax lstat:rm tells R to include an interaction term between lstat and rm. The syntax lstat*rm simultaneously includes lstat, rm, and the interaction term lstat:rm as predictors.
##Formula syntax: Interaction terms in model
lm.fit = lm(medv~lstat*rm,data=Boston)
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat * rm, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.2349  -2.6897  -0.6158   1.9663  31.6141 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29.12452    3.34250  -8.713   <2e-16 ***
## lstat         2.19398    0.20570  10.666   <2e-16 ***
## rm            9.70126    0.50023  19.393   <2e-16 ***
## lstat:rm     -0.48494    0.03459 -14.018   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.701 on 502 degrees of freedom
## Multiple R-squared:  0.7402, Adjusted R-squared:  0.7387 
## F-statistic: 476.9 on 3 and 502 DF,  p-value: < 2.2e-16

Plotting Interaction terms

  • We are going to use the plot3D package to draw a 3D scatterplot and the multiple linear regression surface in that space, so that the effect of the interaction term on the response is visible.
  • To draw the surface of the regression model, we first build a grid (mesh) matrix for each axis.
  • Predictions are made for each pair of grid points and stored in Z.
##Plotting Interaction terms (1)
#Plotting
#install.packages("plot3D")
library(plot3D)

#Create grid matrices
X <- seq(3, 40, length.out = 50)
Y <- seq(3, 9, length.out = 50)
M <- mesh(X, Y)
lstat = M$x
rm = M$y
M.df <- data.frame(lstat = c(M$x), rm = c(M$y))
Z <- predict(lm.fit, newdata = M.df)
Z <- matrix(Z, ncol = ncol(rm))
  • You can see the curved form of the surface, caused by the effect of the interaction term on the response.
  • The surface fits the data points well.
##Plotting Interaction terms (2)

scatter3D(Boston$lstat,Boston$rm,Boston$medv,
          col = 'black', pch = '.', cex = 3,
          theta = 20,phi = 17, type = 'p')
surf3D(lstat,rm,Z,colvar = Z, add = T,
       colkey = FALSE, box = FALSE, facets = FALSE,
       theta = 20, phi = 17)

scatter3D(Boston$lstat, Boston$rm, Boston$medv,
          col = 'black', pch = '.', cex = 3,
          theta = -15, phi = 15, type = 'p')
surf3D(lstat, rm, Z, colvar = Z, add = T,
       colkey = FALSE, box = FALSE, facets = FALSE,
       theta = -15, phi = 15)

Formula syntax: Non-linear transformations of the predictors

  • Given a predictor X, we can create a predictor X^2 using 'I(X^2)'.
  • The function 'I()' is needed since '^' has a special meaning in a formula:
  • (a+b)^2 is identical to (a+b)*(a+b)
  • (a+b)^2-a:b is identical to a+b, and a^2 is identical to a
  • Using the poly() function, you can create polynomial terms at once, e.g. poly(lstat, 2) fits the same model as lstat + I(lstat^2) (poly() uses orthogonal polynomials by default, so the coefficients differ but the fitted values are identical).
##Formula syntax: Non-linear transformations of the predictors
lm.fit2 = lm(medv~lstat+I(lstat^2), data = Boston)
summary(lm.fit2)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

Polynomial regression and performance test

##Polynomial regression and performance test
set.seed(6)
train <- sample(dim(College)[1],518,replace=F)
train.lm = College[train,]; test.lm = College[-train,] #separate training set and test set
correlation <- numeric(); train.correlation <- numeric()
for (p in 1:10){
  lm.fit3 <- lm(Apps ~ poly(F.Undergrad, p), data = train.lm)
  pred.lm <- predict(lm.fit3, newdata = test.lm)
  pred.lm.train <- predict(lm.fit3)
  train.correlation[p] <- cor(train.lm$Apps, pred.lm.train)
  correlation[p] <- cor(test.lm$Apps, pred.lm) #training corr. & test corr.
}

p.degree = 1:10
plot(p.degree, train.correlation, col = 'blue', type = 'l', ylim = c(0.8, 0.85))
lines(p.degree, correlation, col = 'red')

Plots for residuals, error terms

##Plots for residuals, error terms
lm.fit = lm(medv ~ lstat, data = Boston)

par(mfrow = c(2,2))
plot(lm.fit)

plot(predict(lm.fit), residuals(lm.fit)) #residuals
plot(predict(lm.fit), rstudent(lm.fit)) # Studentized residuals
plot(predict(lm.fit), rstandard(lm.fit)) #standardized residuals
#hat values for each observations
plot(hatvalues(lm.fit))

classification (KNN)

##classification (KNN)
attach(College)
library(class)
set.seed(1)
train <- sample(dim(College)[1], 518, replace = F)
train.x = College[train,-1]; test.x = College[-train,-1]
train.y = College[train, 1]; test.y = College[-train, 1]
set.seed(2)
knn.pred = knn(train.x,test.x,train.y,k=2)
table(knn.pred, test.y)
##         test.y
## knn.pred  No Yes
##      No   58  17
##      Yes   8 176
mean(knn.pred!=test.y)
## [1] 0.0965251
set.seed(2)
knn.pred = knn(train.x,test.x,train.y,k=8)
table(knn.pred, test.y)
##         test.y
## knn.pred  No Yes
##      No   56   5
##      Yes  10 188
mean(knn.pred!=test.y)
## [1] 0.05791506
with(College[train,],
     plot(Grad.Rate, Top25perc, col = as.integer(Private), pch=19))
with(College[-train,],
     points(Grad.Rate, Top25perc, col = as.integer(Private), pch = 1))
with(College[-train,],
     points(Grad.Rate, Top25perc, col = as.integer(knn.pred), cex = 1.5, pch = 0))