We studied simple and multiple linear regression in the previous lecture. Now, instead of focusing on regression alone, we consider statistical modelling more broadly, with regression and classification as examples.
Suppose we observe a quantitative response \(Y\) and \(p\) different predictors, \(X_1, X_2, ..., X_p\). We assume that there is some relationship between \(Y\) and \(X = (X_1, X_2, ..., X_p)\), which can be written in the very general form \[Y = f(X) + \epsilon\] Here \(f\) is some fixed but unknown function of \(X_1, ..., X_p\), and \(\epsilon\) is a random error term, which is independent of \(X\) and has mean zero. In this formulation \(f\) represents the systematic information that \(X\) provides about \(Y\).
In essence, statistical learning refers to a set of approaches for estimating \(f\). There are two main reasons that we may wish to estimate \(f\): prediction and inference.
In many situations, a set of inputs \(X\) are readily available, but the output \(Y\) cannot be easily obtained. In this setting, since the error term averages to zero, we can predict \(Y\) using
\[\hat Y = \hat f(X)\] where \(\hat f\) represents our estimate for \(f\), and \(\hat Y\) represents the resulting prediction for \(Y\). In this setting, \(\hat f\) is often treated as a black box, in the sense that one is not typically concerned with the exact form of \(\hat f\), provided that it yields accurate predictions for \(Y\).
The accuracy of \(\hat Y\) as a prediction for \(Y\) depends on two quantities, which we will call the reducible error and the irreducible error. In general, \(\hat f\) will not be a perfect estimate for \(f\), and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of \(\hat f\) by using the most appropriate statistical learning technique to estimate \(f\). However, even if it were possible to form a perfect estimate for \(f\), so that our estimated response took the form \(\hat Y = f(X)\), our prediction would still have some error in it! This is because \(Y\) is also a function of \(\epsilon\), which, by definition, cannot be predicted using \(X\). Therefore, variability associated with \(\epsilon\) also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate \(f\), we cannot reduce the error introduced by \(\epsilon\). Assume for a moment that both \(\hat f\) and \(X\) are fixed. Then, it is easy to show that
\[\begin{aligned} E(Y - \hat Y)^2 &= E[f(X) + \epsilon - \hat f(X)]^2 \\ &= \underbrace{[f(X) - \hat f(X)]^2}_{\text{Reducible}} + \underbrace{Var(\epsilon)}_{\text{Irreducible}} \end{aligned}\]
where \(E(Y-\hat Y)^2\) represents the average, or expected value, of the squared difference between the predicted and actual value of \(Y\), and \(Var(\epsilon)\) represents the variance associated with the error term \(\epsilon\). The focus of this tutorial is on techniques for estimating \(f\) with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for \(Y\). This bound is almost always unknown in practice.
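As a quick sanity check, this decomposition can be verified by simulation. The sketch below uses an arbitrary choice of \(f\), noise level, and a deliberately rigid estimate \(\hat f\); it is only meant to make the reducible and irreducible pieces tangible.
# A minimal simulation of the reducible/irreducible error decomposition.
# The true f, noise level, and sample size below are arbitrary choices.
set.seed(1)
n <- 10000
x <- runif(n, 0, 10)
f <- function(x) 2 + 3 * sin(x)        # assumed "true" f
y <- f(x) + rnorm(n, sd = 1)           # Var(epsilon) = 1 is the irreducible error
fhat <- lm(y ~ x)                      # a deliberately rigid estimate of f
mean((y - predict(fhat))^2)            # total error: roughly the reducible part + 1
mean((f(x) - predict(fhat))^2)         # reducible part: average of [f(x) - fhat(x)]^2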
We are often interested in understanding the way that \(Y\) is affected as \(X_1,...,X_p\) change. In this situation we wish to estimate \(f\), but our goal is not necessarily to make predictions for \(Y\). We instead want to understand the relationship between \(X\) and \(Y\), or more specifically, to understand how \(Y\) changes as a function of \(X_1,...,X_p\). Now \(\hat f\) cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:
Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Can the relationship between \(Y\) and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Depending on whether our ultimate goal is prediction, inference or a combination of the two, different methods for estimating \(f\) may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches can potentially provide quite accurate predictions for \(Y\) , but this comes at the expense of a less interpretable model for which inference is more challenging.
Although the approaches to this problem, also known as function identification or function approximation, can be very different, they share certain characteristics, summarized below.
Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function \(f\). In other words, we want to find a function \(\hat f\) such that \(Y \approx \hat f(X)\) for any observation \((X, Y)\). Broadly speaking, these methods are either parametric or non-parametric.
Parametric methods involve a two-step model-based approach.
First, we make an assumption about the functional form, or shape, of \(f\). For example, if we assume that \(f\) is linear, \(f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... +\beta_p X_p\), the problem is greatly simplified: instead of having to estimate an entirely arbitrary \(p\)-dimensional function \(f(X)\), one only needs to estimate the \(p+1\) coefficients \(\beta_0, \beta_1, ..., \beta_p\).
Second, we fit or train the model. In the case of the linear model, we need to estimate the parameters \(\beta_0, \beta_1, ..., \beta_p\). That is, we want to find values of these parameters such that \[Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p\] The most common approach to fitting this model is referred to as (ordinary) least squares, but it is only one of many possible ways to fit the model.
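To make the fitting step concrete, recall that ordinary least squares has the closed-form solution \(\hat\beta = (X^TX)^{-1}X^Ty\). The sketch below, on simulated data with made-up coefficients, computes this directly and checks it against lm().
# Ordinary least squares via the normal equations, checked against lm().
# The simulated data and true coefficients are arbitrary.
set.seed(1)
n <- 100
X1 <- rnorm(n); X2 <- rnorm(n)
y <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n)
X <- cbind(1, X1, X2)                      # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(y ~ X1 + X2)))     # the two estimates agree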
The model-based (parametric) approach reduces the problem of estimating \(f\) down to one of estimating a set of parameters. The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of \(f\). We can try to address this problem by choosing flexible models that can fit many different possible functional forms for \(f\). But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely. Let’s see one example.
Consider the Advertising data shown below. One may be interested in answering questions such as:
Which media generate the biggest boost in sales?
How much increase in sales is associated with a given increase in TV advertising?
TV | Radio | Newspaper | Sales |
---|---|---|---|
230.1 | 37.8 | 69.2 | 22.1 |
44.5 | 39.3 | 45.1 | 10.4 |
17.2 | 45.9 | 69.3 | 9.3 |
151.5 | 41.3 | 58.5 | 18.5 |
180.8 | 10.8 | 58.4 | 12.9 |
8.7 | 48.9 | 75.0 | 7.2 |
library(readr)
Advertising <- read_csv("http://bisyn.kaist.ac.kr/bis335/11-StatLearning-Advertising.csv")
# fit linear model
fit <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
summary(fit)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
library(tidyr)
library(dplyr)
library(ggplot2)
df <- Advertising %>% gather(key = "key", value = "value", TV, Radio, Newspaper)  # reshape to long format: one row per (media, value) pair
df$key <- factor(df$key, levels = c("TV", "Radio", "Newspaper"))
ggplot(df, mapping = aes(x = value, y = Sales, color = key) ) +
geom_point() +
facet_grid(.~key, scales = "free") +
geom_smooth(method = "lm")
Non-parametric methods do not make explicit assumptions about the functional form of \(f\). Instead they seek an estimate of \(f\) that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for \(f\), they have the potential to accurately fit a wider range of possible shapes for \(f\). But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for \(f\).
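As a concrete contrast with the linear fit above, the sketch below fits Sales against TV with loess(), a local-regression smoother that assumes no particular functional form for \(f\); the span value is an arbitrary choice controlling how wiggly the fit is.
# A non-parametric fit of Sales on TV using local regression (loess).
# span = 0.3 is an arbitrary choice; smaller values give a more wiggly fit.
fit_np <- loess(Sales ~ TV, data = Advertising, span = 0.3)
ggplot(Advertising, aes(x = TV, y = Sales)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.3, se = FALSE)  # same smoother, drawn by ggplot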
For inference, model interpretability needs to be considered, so a more restrictive model may be adequate. For prediction, flexible approaches may have more predictive power, but this is not always the case. (Why?)
It is useful to think of a highly interpretable model as a linear model with a small number of parameters, of a highly flexible model as a non-linear model with a large number of parameters, and of other models as lying somewhere in between.
Most statistical learning problems fall into one of two categories: supervised or unsupervised. In supervised learning, for each observation of the predictor measurement(s) \(x_i\), \(i = 1,...,n\), there is an associated response measurement \(y_i\). We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).
In contrast, unsupervised learning describes the somewhat more challenging situation in which for every observation \(i = 1,...,n\), we observe a vector of measurements \(x_i\) but no associated response \(y_i\). It is not possible to fit a linear regression model, since there is no response variable to predict. One statistical learning tool that we may use in this setting is cluster analysis, or clustering. The goal of cluster analysis is to ascertain, on the basis of \(x_1,...,x_n\), whether the observations fall into relatively distinct groups, as sketched below.
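A minimal clustering sketch: with no response available, kmeans() groups observations using only the \(x_i\). The two well-separated simulated groups and the choice of two clusters below are arbitrary.
# Unsupervised example: cluster observations that have no associated response.
set.seed(1)
xu <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # 50 points around (0, 0)
            matrix(rnorm(100, mean = 3), ncol = 2))   # 50 points around (3, 3)
km <- kmeans(xu, centers = 2, nstart = 20)
plot(xu, col = km$cluster, pch = 20, xlab = "X1", ylab = "X2")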
Many problems fall naturally into the supervised or unsupervised learning paradigms. However, sometimes the question of whether an analysis should be considered supervised or unsupervised is less clear-cut. For instance, suppose that we have a set of \(n\) observations. For \(m\) of the observations, where \(m < n\), we have both predictor measurements and a response measurement. For the remaining \(n - m\) observations, we have predictor measurements but no response measurement. Such a scenario can arise if the predictors can be measured relatively cheaply but the corresponding responses are much more expensive to collect. We refer to this setting as a semi-supervised learning problem.
Variables can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values. In contrast, qualitative variables take on values in one of \(K\) different classes, or categories. We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. However, the distinction is not always that crisp. Least squares linear regression is used with a quantitative response, whereas logistic regression is typically used with a qualitative (two-class, or binary) response. Whether the predictors are qualitative or quantitative is generally considered less important. Most of the statistical learning methods discussed in here can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed.
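For instance, a qualitative predictor is usually coded as a set of dummy variables before fitting; in R, factor() together with model.matrix() (which lm() uses internally) does this automatically. The toy region/income values below are made up purely for illustration.
# Coding a qualitative predictor: R expands a factor into dummy variables.
region <- factor(c("East", "West", "North", "East", "North"))   # qualitative predictor
income <- c(50, 62, 45, 58, 49)                                 # made-up quantitative predictor
model.matrix(~ region + income)   # dummy columns for region, as lm() would construct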
In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by
\[MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat f(x_i))^2\] where \(\hat f(x_i)\) is the prediction that \(\hat f\) gives for the \(i_{th}\) observation.
The MSE in above equation is computed using the training data that was used to fit the model, and so should more accurately be referred to as the training MSE. But in general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.
To state it more mathematically, suppose that we fit our statistical learning method on our training observations \(\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\), and we obtain the estimate \(\hat f\). We can then compute \(\hat f(x_1), \hat f(x_2), ..., \hat f(x_n)\). If these are approximately equal to \(y_1, y_2, ..., y_n\), then the training MSE is small. However, we are not really interested in whether \(\hat f(x_i) \approx y_i\); instead, we want to know whether \(\hat f(x_0)\) is approximately equal to \(y_0\), where \((x_0, y_0)\) is a previously unseen test observation not used to train the statistical learning method. We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. In other words, if we had a large number of test observations, we could compute
\[Ave(y_0 - \hat f(x_0))^2\] the average squared prediction error for these test observations \((x_0, y_0)\). We would like to select the model for which the average of this quantity - the test MSE - is as small as possible. We will discuss this topic later in this tutorial with specific example.
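With the Advertising data, a simple hold-out split illustrates the difference between training and test MSE; the 50/50 split and the random seed below are arbitrary choices.
# Training vs test MSE on the Advertising data with a random hold-out split.
set.seed(1)
train_id <- sample(nrow(Advertising), nrow(Advertising) / 2)
train <- Advertising[train_id, ]
test <- Advertising[-train_id, ]
fit_tr <- lm(Sales ~ TV + Radio + Newspaper, data = train)
mse_train <- mean((train$Sales - predict(fit_tr))^2)                # training MSE
mse_test <- mean((test$Sales - predict(fit_tr, newdata = test))^2)  # test MSE
c(train = mse_train, test = mse_test)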
The expected test MSE, for a given value \(x_0\), can always be decomposed into the sum of three fundamental quantities: the variance of \(\hat f(x_0)\), the squared bias of \(\hat f(x_0)\) and the variance of the error terms \(\epsilon\). That is,
\[E\bigg(y_0 - \hat f(x_0)\bigg)^2 = Var(\hat f(x_0)) + [Bias(\hat f(x_0))]^2 + Var(\epsilon)\]
Here the notation \(E\bigg(y_0 - \hat f(x_0)\bigg)^2\) defines the expected test MSE, and refers to the average test MSE that we would obtain if we repeatedly estimated \(f\) using a large number of training sets, and tested each at \(x_0\). The overall expected test MSE can be computed by averaging \(E\bigg(y_0 - \hat f(x_0)\bigg)^2\) over all possible values of \(x_0\) in the test set. The equation above tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below \(Var(\epsilon)\), the irreducible error.
What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which \(\hat f\) would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different \(\hat f\). Ideally the estimate for \(f\) should not vary too much between training sets; however, if a method has high variance, then small changes in the training data can result in large changes in \(\hat f\). In general, more flexible statistical methods have higher variance.
Bias, on the other hand, refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between \(Y\) and \(X_1, X_2,...,X_p\). It is unlikely that any real-life problem truly has such a simple linear relationship, so performing linear regression will undoubtedly result in some bias in the estimate of \(f\). If the true \(f\) is substantially non-linear, then no matter how many training observations we are given, it will not be possible to produce an accurate estimate using linear regression; in other words, linear regression results in high bias in this case. Generally, more flexible methods result in less bias.
As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases, so the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance, and the test MSE increases. The relationship between bias, variance, and test set MSE given in the equation above is referred to as the bias-variance trade-off. In a real-life situation in which \(f\) is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method; nevertheless, one should always keep the bias-variance trade-off in mind.
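A small simulation makes the trade-off visible: fitting polynomials of increasing degree to data generated from a non-linear \(f\), the training MSE keeps decreasing with flexibility while the test MSE traces the usual U-shape. The particular \(f\), noise level, and range of degrees below are arbitrary.
# Bias-variance trade-off: training vs test MSE as model flexibility grows.
set.seed(1)
f <- function(x) sin(2 * x)                                   # arbitrary non-linear truth
x <- runif(100, 0, 3);   y <- f(x) + rnorm(100, sd = 0.3)     # training set
x0 <- runif(1000, 0, 3); y0 <- f(x0) + rnorm(1000, sd = 0.3)  # test set
degrees <- 1:10
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(train = mean((y - predict(fit))^2),
    test = mean((y0 - predict(fit, newdata = data.frame(x = x0)))^2))
}))
matplot(degrees, mse, type = "b", pch = 20, lty = 1, col = c("red", "blue"),
        xlab = "Polynomial degree (flexibility)", ylab = "MSE")
legend("topright", legend = c("train", "test"), col = c("red", "blue"), lty = 1)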
Thus far, our discussion of model accuracy has focused on the regression setting. But many of the concepts that we have encountered, such as the bias-variance trade-off, transfer over to the classification setting with only some modifications, because \(y_i\) is no longer numerical. Suppose that we seek to estimate \(f\) on the basis of training observations \(\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\), where now \(y_1, y_2, ..., y_n\) are qualitative. The most common approach for quantifying the accuracy of our estimate \(\hat f\) is the training error rate, the proportion of mistakes that are made if we apply our estimate \(\hat f\) to the training observations: \[\frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat y_i)\] Here \(\hat y_i\) is the predicted class label for the \(i_{th}\) observation using \(\hat f\), and \(I(y_i \neq \hat y_i)\) is an indicator variable that equals 1 if \(y_i \neq \hat y_i\) and zero if \(y_i = \hat y_i\). If \(I(y_i \neq \hat y_i) = 0\), then the \(i_{th}\) observation was classified correctly by our classification method; otherwise it was misclassified.
The equation above is referred to as the training error rate because it is computed based on the data that was used to train our classifier. The test error rate associated with a set of test observations of the form \((x_0, y_0)\) is given by \[Ave(I(y_0 \neq \hat y_0))\] where \(\hat y_0\) is the predicted class label that results from applying the classifier to a test observation with predictor \(x_0\). A good classifier is one for which the test error is smallest.
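In R, both error rates reduce to the mean of an indicator. The vectors of true and predicted labels below are hypothetical stand-ins:
# Error rate as the average of the indicator I(y != yhat).
# y_true and y_pred are hypothetical observed and predicted class labels.
y_true <- factor(c("a", "b", "a", "a", "b"))
y_pred <- factor(c("a", "b", "b", "a", "b"))
mean(y_true != y_pred)   # proportion of misclassified observations (0.2 here)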
It is possible to show that the test error rate given above is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector \(x_0\) to the class \(j\) for which \[Pr(Y=j | X = x_0)\] is largest. Note that the above expression is a conditional probability: it is the probability that \(Y = j\), given the observed predictor vector \(x_0\). This very simple classifier is called the Bayes classifier. In a two-class problem where there are only two possible response values, say class 1 or class 2, the Bayes classifier corresponds to predicting class one if \(P(Y = 1|X = x_0) > 0.5\), and class two otherwise.
The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. Since the Bayes classifier will always choose the class for which the expression above is largest, the error rate at \(X = x_0\) will be \(1 - \max_{j} P(Y = j|X = x_0)\). In general, the overall Bayes error rate is given by \[1 - E \bigg(\max_j P(Y=j|X)\bigg)\] where the expectation averages the probability over all possible values of \(X\). The Bayes error rate is analogous to the irreducible error, discussed earlier.
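When the class-conditional distributions are actually known, the Bayes error rate can be estimated by Monte Carlo. The two-Gaussian, equal-prior setup below is an arbitrary example chosen only to illustrate the formula.
# Monte Carlo estimate of the Bayes error rate for two known one-dimensional
# Gaussian classes, N(-1, 1) and N(+1, 1), with equal prior probabilities.
set.seed(1)
n <- 1e5
cl <- rbinom(n, 1, 0.5)                          # true class labels (0 or 1)
xb <- rnorm(n, mean = ifelse(cl == 1, 1, -1))    # X | Y = j
p1 <- dnorm(xb, 1) * 0.5 / (dnorm(xb, 1) * 0.5 + dnorm(xb, -1) * 0.5)  # P(Y = 1 | X)
mean(1 - pmax(p1, 1 - p1))                       # estimate of 1 - E(max_j P(Y = j | X))
mean((p1 > 0.5) != (cl == 1))                    # error rate of the Bayes rule itself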
In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of \(Y\) given \(X\), and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods. Many approaches attempt to estimate the conditional distribution of \(Y\) given \(X\), and then classify a given observation to the class with highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier. Given a positive integer \(K\) and a test observation \(x_0\), the KNN classifier first identifies the K points in the training data that are closest to \(x_0\), represented by \(N_0\). It then estimates the conditional probability for class \(j\) as the fraction of points in \(N_0\) whose response values equal \(j\): \[P(Y=j | X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)\] Finally, KNN applies Bayes rule and classifies the test observation \(x_0\) to the class with the largest probability.
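Before turning to class::knn below, the estimate \(P(Y=j | X = x_0)\) can be written directly in a few lines; the Euclidean distance and the toy training data in this sketch are assumptions made only for illustration.
# A from-scratch KNN class-probability estimate mirroring the formula above;
# class::knn (used below) does the same job efficiently.
knn_prob <- function(x_train, y_train, x0, K) {
  d <- sqrt(rowSums(sweep(x_train, 2, x0)^2))   # Euclidean distances to x0
  N0 <- order(d)[1:K]                           # indices of the K nearest neighbours
  table(y_train[N0]) / K                        # estimated P(Y = j | X = x0)
}
set.seed(1)
x_train <- matrix(rnorm(40), ncol = 2)                       # toy training inputs
y_train <- factor(sample(c("1", "2"), 20, replace = TRUE))   # toy class labels
knn_prob(x_train, y_train, x0 = c(0, 0), K = 5)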
library(ElemStatLearn)
require(class)
## Loading required package: class
x <- mixture.example$x # X1 and X2
g <- mixture.example$y # class label
xnew <- mixture.example$xnew # new data points
# KNN = 1
mod1 <- knn(x, xnew, g, k=1, prob=TRUE)
prob <- attr(mod1, "prob")
prob <- ifelse(mod1=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob1 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob1, levels=0.5, labels="", xlab="", ylab="",
main= "1-nearest neighbour", axes=FALSE)
prob <- mixture.example$prob
prob.bayes <- matrix(prob, length(px1), length(px2)) # bayes decision boundary
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob1>0.5, "coral", "cornflowerblue"))
box()
# KNN = 5
mod5 <- knn(x, xnew, g, k=5, prob=TRUE)
prob <- attr(mod5, "prob")
prob <- ifelse(mod5=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob5 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob5, levels=0.5, labels="", xlab="", ylab="",
main= "5-nearest neighbour", axes=FALSE)
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob5>0.5, "coral", "cornflowerblue"))
box()
# KNN = 15
mod15 <- knn(x, xnew, g, k=15, prob=TRUE)
prob <- attr(mod15, "prob")
prob <- ifelse(mod15=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob15 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob15, levels=0.5, labels="", xlab="", ylab="",
main= "15-nearest neighbour", axes=FALSE)
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob15>0.5, "coral", "cornflowerblue"))
box()
# KNN = 100
mod100 <- knn(x, xnew, g, k=100, prob=TRUE)
prob <- attr(mod100, "prob")
prob <- ifelse(mod100=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob100 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob100, levels=0.5, labels="", xlab="", ylab="",
main="100-nearest neighbour", axes=FALSE)
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob100>0.5, "coral", "cornflowerblue"))
box()
# linear decision boundary (fit by logistic regression on the grid)
xnew <- mixture.example$xnew # grid points
cl <- as.numeric(mixture.example$prob>0.5)
new_data <- data.frame(xnew,cl)
glm.grid <- glm(cl ~ xnew, data = new_data, family = binomial)  # fit logistic regression to the grid labels
prob.glm <- predict(glm.grid, type = "response")
prob1 <- matrix(prob.glm, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob1, levels=0.5, labels="", xlab="", ylab="",
main= "Linear regression", axes=FALSE)
contour(px1, px2, prob.bayes, levels = 0.5, col = "purple", lty = 2, add=TRUE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob1>0.5, "coral", "cornflowerblue"))
box()
# Bayes classifier
prob <- mixture.example$prob
prob.bayes <- matrix(prob, length(px1), length(px2))
contour(px1, px2, prob.bayes, levels=0.5, labels="", xlab="x1",
ylab="x2",
main="Bayes decision boundary")
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob.bayes>0.5, "coral", "cornflowerblue"))
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
set.seed(123)
centers <- c(sample(1:10, 5000, replace=TRUE),
sample(11:20, 5000, replace=TRUE))
means <- mixture.example$means
means <- means[centers, ]
mix.test <- mvrnorm(10000, c(0,0), 0.2*diag(2))
mix.test <- mix.test + means
cltest <- c(rep(0, 5000), rep(1, 5000))
ks <- c(1,3,5,7,9,11,15,17,23,25,35,45,55,83,101,151)
# nearest neighbours to try
nks <- length(ks)
misclass.train <- numeric(length=nks)
misclass.test <- numeric(length=nks)
names(misclass.train) <- names(misclass.test) <- ks
for (i in seq(along = ks)) {
  mod.train <- knn(train = x, test = x, cl = g, k = ks[i])
  mod.test <- knn(train = x, test = mix.test, cl = g, k = ks[i])
  misclass.train[i] <- 1 - sum(mod.train == factor(g)) / 200        # 200 training observations
  misclass.test[i] <- 1 - sum(mod.test == factor(cltest)) / 10000   # 10000 test observations
}
print(cbind(misclass.train, misclass.test))
## misclass.train misclass.test
## 1 0.000 0.2980
## 3 0.130 0.2415
## 5 0.130 0.2288
## 7 0.145 0.2241
## 9 0.155 0.2276
## 11 0.185 0.2463
## 15 0.155 0.2512
## 17 0.175 0.2486
## 23 0.175 0.2525
## 25 0.170 0.2536
## 35 0.200 0.2611
## 45 0.210 0.2575
## 55 0.200 0.2689
## 83 0.265 0.2827
## 101 0.305 0.3068
## 151 0.305 0.3245
plot(rev(misclass.train),xlab="Degrees of Freedom - N/K",ylab="Misclassification error",type="n",xaxt="n", ylim = c(0.05, 0.33))
axis(1, 1:length(ks), as.character(ks))
lines(rev(misclass.test),type="b",col='blue',pch=20)
lines(rev(misclass.train),type="b",col='red',pch=20)
abline(h=0.21,col='purple')
legend("bottomleft",lty=1,col=c("red","blue", "purple"),legend = c("train ", "test", "Bayes"))