Linear regression (s3)

Summary of simple linear regression

Consider the data value \(y_i\) as being the observed value of a random variable \(Y_i\), whose distribution depends upon a (fixed) value \(x_i\). The simple linear regression model is shown below. \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] Here the observed value of the dependent variable \(y_i\) is composed of a linear function \(\beta_0 + \beta_1 x_i\) of the explanatory variable \(x_i\), together with an error term \(\epsilon_i\). The error terms \(\epsilon_1, ..., \epsilon_n\) are generally taken to be independent observations from a \(N(0, \sigma^2)\) distribution, for some error variance \(\sigma^2\). This implies that the values \(y_1, ..., y_n\) are observations from the independent random variables \[Y_i \sim N(\beta_0 + \beta_1x_i, \sigma^2)\]

Notice that

\[E(Y_i) = \beta_0 + \beta_1 x_i\] The three unknown parameters, the intercept parameter \(\beta_0\), the slope parameter \(\beta_1\), and the error variance \(\sigma^2\), are estimated from the data set.

The regression line \(y = \beta_0 + \beta_1x\) is fitted to the data points \((x_1, y_1), ..., (x_n, y_n)\) by finding the line that is “closest” to the data points in some sense. In a least squares fit, we consider the vertical deviations between the line and the data points \[y_i - (\beta_0 + \beta_1 x_i) \hspace{1cm} 1\leq i \leq n\]

and minimize the sum of squares of these vertical deviations \[Q = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1x_i))^2\]

Although this definition of closeness is in a sense arbitrary, with normally distributed error terms it results in parameter estimates \(\hat\beta_0\) and \(\hat\beta_1\) that are maximum likelihood estimates. This is because an error term \(\epsilon_i\) has a probability density function \[\frac {1}{\sqrt{2\pi}\sigma} e^{-\epsilon_i^2/2\sigma^2}\] So that the joint density of the error terms \(\epsilon_1, ..., \epsilon_n\) is \[\bigg(\frac {1}{\sqrt{2\pi}\sigma}\bigg)^n e^{-\sum_{i=1}^n\epsilon_i^2/2\sigma^2}\] This likelihood is maximized by minimizing \[\sum_{i=1}^n \epsilon_i^2\] which is equal to \(Q\) because \[\epsilon_i = y_i - (\beta_0 + \beta_1x_i)\] The parameter estimates \(\hat\beta_0\) and \(\hat\beta_1\) are therefore the values that minimize the quantity \(Q\). They are easily found by taking partial derivatives of \(Q\) with respect to \(\beta_0\) and \(\beta_1\) and setting the resulting expression equal to 0, as we have already seen above.

\[\frac{\partial Q}{\partial \beta_0} = -\sum_{i=1}^n 2(y_i - (\beta_0 + \beta_1x_i)), \hspace{1cm} \frac{\partial Q}{\partial \beta_1} = -\sum_{i=1}^n 2x_i(y_i - (\beta_0 + \beta_1x_i))\] The parameter estimates \(\hat\beta_0\) and \(\hat\beta_1\) are thus the solutions to the equations

\[\sum_{i=1}^n y_i = n\beta_0 + \beta_1\sum_{i=1}^n x_i, \hspace{1cm} \sum_{i=1}^n x_iy_i = \beta_0\sum_{i=1}^nx_i + \beta_1\sum_{i=1}^n x_i^2\] These equations are known as the normal equations. The normal equations can be solved to give \[\hat\beta_1 = \frac{\sum_{i=1}^n x_iy_i - \frac{(\sum_{i=1}^nx_i)(\sum_{i=1}^ny_i)}{n}}{\sum_{i=1}^nx_i^2 - \frac{(\sum_{i=1}^nx_i)^2}{n}}=\frac{S_{xy}}{S_{xx}}, \hspace{1cm} \hat\beta_0 = \bar y - \hat\beta_1 \bar x\] For a specific value of the explanatory variable \(x^*\), this equation provides a fitted value \[\hat y|x^* = \hat\beta_0 + \hat\beta_1x^*\] and for data points \(x_i\), the fitted values are \[\hat y_i = \hat\beta_0 + \hat\beta_1x_i\]

The error variance \(\sigma^2\) can be estimated by considering the deviations between the observed data values \(y_i\) and their fitted values \(\hat y_i\).

\[\begin{aligned}\hat\sigma^2 &= \frac{SSE}{n-2} \\ &= \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}\\ &= \frac{\sum_{i=1}^n (y_i - (\hat\beta_0 + \hat\beta_1 x_i))^2}{n-2} \\ &= \frac{\sum_{i=1}^n y_i^2 - \hat\beta_0\sum_{i=1}^n y_i - \hat\beta_1\sum_{i=1}^n x_iy_i}{n-2} \\ \end{aligned}\]

Notice that the error variance can be estimated using only the six quantities \[n \hspace{1cm} \sum_{i=1}^n x_i\hspace{1cm} \sum_{i=1}^n y_i\hspace{1cm} \sum_{i=1}^n x_i^2 \hspace{1cm} \sum_{i=1}^n y_i^2 \hspace{1cm} \sum_{i=1}^n x_iy_i\]
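For illustration, the sketch below (the helper name `slr_from_sums` is ours, not from the text) computes \(\hat\beta_1\), \(\hat\beta_0\), and \(\hat\sigma^2\) in R from these six quantities alone; with the five data points used later in this tutorial it reproduces the values 0.7, 1, and 0.367.

# A minimal sketch (hypothetical helper, not from the original text):
# least-squares estimates computed from the six summary quantities only.
slr_from_sums <- function(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy) {
  Sxx <- sum_x2 - sum_x^2 / n
  Sxy <- sum_xy - sum_x * sum_y / n
  b1  <- Sxy / Sxx
  b0  <- sum_y / n - b1 * sum_x / n
  sse <- sum_y2 - b0 * sum_y - b1 * sum_xy
  c(b0 = b0, b1 = b1, sigma2_hat = sse / (n - 2))
}

# Example with the five data points used later (x = -2..2, y = 0, 0, 1, 1, 3)
slr_from_sums(n = 5, sum_x = 0, sum_y = 5, sum_x2 = 10, sum_y2 = 11, sum_xy = 7)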

Introduction to linear regression (s4)

Regression is a process of finding the relationship between an independent variable and a dependent variable. Here we can use two categories of models: the deterministic model and the probabilistic model. Suppose that \(Y\) and \(X\) are related according to the equation

\[Y_i = \beta_0 + \beta_1 X_i\] where \(\beta_0\) and \(\beta_1\) are unknown parameters. This model is called a deterministic model because it does not allow for any error in predicting \(Y\) as a function of \(X\).

In contrast, statisticians use probabilistic models given as below.

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\] where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon_i\) is the error term to be estimated.

\(X_i\): independent variable; \(Y_i\): dependent “random” variable with error distribution

\[E(\epsilon_i) = 0, \ Var(\epsilon_i)=\sigma^2, \ \epsilon_i \sim N(0, \sigma^2)\] \[E(Y_i) = \beta_0 + \beta_1 X_i, \ Var(Y_i) = \sigma^2, \ Y_i \sim N(\beta_0+\beta_1 X_i, \sigma^2)\]

 

 

Although infinitely many different functions can be used to model the mean value of the response variable \(Y\) as a function of one or more independent variables, we will concentrate on a set of models called linear statistical models. When we say we have a linear statistical model for \(Y\), we mean that \(E(Y)\) is a linear function of the unknown parameters \(\beta_0\) and \(\beta_1\), not necessarily a linear function of \(X\). Thus, \(Y=\beta_0 + \beta_1 \ln X + \epsilon\) is a linear model.
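For instance, a model that is nonlinear in the original variable but linear in the parameters can still be fitted by ordinary linear regression. The sketch below uses simulated data (the values and names are illustrative assumptions, not from the text) and fits \(Y=\beta_0 + \beta_1 \ln w + \epsilon\) with `lm()` simply by transforming the explanatory variable.

# A minimal sketch with simulated data: the model is nonlinear in w
# but linear in the parameters, so lm() still applies.
set.seed(1)
w <- runif(50, min = 1, max = 10)          # some positive explanatory variable
y <- 2 + 3 * log(w) + rnorm(50, sd = 0.5)  # assumed true intercept 2, true slope 3
fit_log <- lm(y ~ log(w))                  # linear regression on the transformed variable
coef(fit_log)                              # estimates should be close to 2 and 3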

Simple Linear Regression (s5)

If the model relates \(E(Y)\) as a linear function of the unknown parameters \(\beta_0\) and \(\beta_1\) only, the model is called a simple linear regression model.

\[E(Y) = \beta_0 + \beta_1 X_i\]

If more than one independent variable - say, \(x_1, x_2, ..., x_k\) - is of interest and we model \(E(Y)\) by

\[E(Y)=\beta_0 + \beta_1 X_1 + ... + \beta_k X_k\] the model is called a multiple linear regression model. We will cover this topic at the end of this tutorial.

The simple linear regression model states that the following four conditions must hold (LINE):

  • The mean of the responses, \(E(Y_i)\), is a Linear function of the \(x_i\)

  • The errors, \(\epsilon_i\), and hence the responses \(Y_i\), are Independent

  • The errors, \(\epsilon_i\), and hence the responses \(Y_i\), are Normally distributed

  • The errors, \(\epsilon_i\), and hence the responses \(Y_i\), have Equal variance (\(\sigma^2\)) for all \(x\) values

The method of least squares (s6)

Suppose that we wish to fit the model

\[E(Y) = \beta_0 + \beta_1 X_i\]

to the set of data points as shown in the figure below. (Note that the independent variable \(x\) could be \(w^2\) or \(w^{\frac{1}{2}}\) or \(\ln(w)\) and so on, for some other independent variable \(w\).)

 

 

The least-squares procedure for fitting a line through a set of \(n\) data points is similar to the method that we might use if we fit a line by eye. A convenient way to accomplish this, and one that yields estimators with good properties, is to minimize the sum of squares of the vertical deviations from the fitted line. Thus, if

\[\hat{y_i}=\hat\beta_0 + \hat\beta_1x_i\] is the predicted value of the \(i\)th \(y\) value (when \(x = x_i\)), then the deviation (or error) of the observed value \(y_i\) from \(\hat{y_i}=\hat\beta_0 + \hat\beta_1x_i\) is the difference \(y_i - \hat y_i\), and the sum of squares of deviations to be minimized is

\[SSE = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\] To minimize this quantity, we take partial derivatives of SSE with respect to \(\hat\beta_0\) and \(\hat\beta_1\) and set them equal to zero. Thus,

\[\frac{\partial SSE}{\partial \hat\beta_0} = \frac{\partial \{ \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\}}{\partial \hat\beta_0} = -\sum_{i=1}^n 2[y_i - (\hat\beta_0 + \hat\beta_1x_i)] \\ = -2 \bigg(\sum_{i=1}^n y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^nx_i \bigg) = 0\] and

\[\frac{\partial SSE}{\partial \hat\beta_1} = \frac{\partial \{ \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\}}{\partial \hat\beta_1} = -\sum_{i=1}^n 2[y_i - (\hat\beta_0 + \hat\beta_1x_i)]x_i \\ = -2 \bigg(\sum_{i=1}^n x_iy_i - \hat\beta_0\sum_{i=1}^n x_i - \hat\beta_1\sum_{i=1}^nx_i^2 \bigg) = 0\]

The equations \(\frac{\partial SSE}{\partial \hat\beta_0} = 0\) and \(\frac{\partial SSE}{\partial \hat\beta_1} = 0\) are called the least-squares equations for estimating the parameters of a line. The least-squares equations are linear in \(\hat\beta_0\) and \(\hat\beta_1\) and hence can be solved simultaneously. You can verify that the solutions are

\[\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n}\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \bigg( \sum_{i=1}^n x_i \bigg)^2}\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x\] Note that

\(\sum_{i=1}^n \bar x y_i = \frac{1}{n} \sum_{i=1}^n x_i \sum_{i=1}^n y_i\)

and

\(\sum_{i=1}^n \bar x \bar y - \sum_{i=1}^n \bar y x_i = \bar y \sum_{i=1}^n (\bar x - x_i) = 0\),

because

\(\sum_{i=1}^n (\bar x - x_i) = 0\)

We will denote by \(S_{xy}\) the quantity \(\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\),

by \(S_{xx}\) the quantity \(\sum_{i=1}^n (x_i - \bar x)^2\),

and by \(S_{yy}\) the quantity \(\sum_{i=1}^n (y_i - \bar y)^2\).

Then the above equations become \[\hat \beta_1 = \frac{S_{xy}}{S_{xx}}, \ \ \hat\beta_0 = \bar y - \hat \beta_1 \bar x\] where \[\begin{aligned}S_{xx} &= \sum_{i=1}^n (x_i - \bar x)^2 \\ &= \sum_{i=1}^n x_i^2 - 2n\bar x^2 + n \bar x^2 \\ &= \sum_{i=1}^n x_i^2 - n \bar x^2 \\ &= \sum_{i=1}^n x_i^2 - n \bigg(\frac{\sum_{i=1}^n x_i}{n} \bigg)^2 \\ &= \sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}\end{aligned}\]

\[\begin{aligned} S_{xy} &= \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) \\ &= \sum_{i=1}^n x_iy_i - n \bar x \bar y \\ &= \sum_{i=1}^n x_iy_i - n \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n^2} \\ &= \sum_{i=1}^n x_iy_i - \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n}\end{aligned}\] and \[SSE = S_{yy} - \hat \beta_1 S_{xy}\]

Maximum Likelihood Estimation (MLE) Approach

Recall the model, \[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\epsilon_i \sim N(0, \sigma^2)\)

Then we can find the mean and variance of each \(Y_i\) \[E[Y_i|X_i = x_i] = \beta_0 + \beta_1x_i\] and \[Var[Y_i|X_i = x_i] = \sigma^2\] Additionally, the \(Y_i\) follow a normal distribution conditioned on the \(x_i\), \[Y_i|X_i \sim N(\beta_0+\beta_1 x_i, \sigma^2)\] Recall that the P.D.F of a random variable \(X \sim N(\mu, \sigma^2)\) is given by \[f_X(x;\mu, \sigma^2) = \frac{1}{\sqrt{{2\pi\sigma^2}}}\exp\bigg[-\frac{1}{2} \bigg(\frac{x-\mu}{\sigma}\bigg)^2\bigg]\] Then we can write the P.D.F of each of the \(Y_i\) as \[f_{Y_i}(y_i,x_i;\beta_0, \beta_1,\sigma^2) = \frac{1}{\sqrt{{2\pi\sigma^2}}} \exp\bigg[-\frac{1}{2} \bigg(\frac{y_i-(\beta_0+\beta_1x_i)}{\sigma}\bigg)^2\bigg]\] Given \(n\) data points \((x_i, y_i)\) we can write the likelihood, which is a function of the three parameters \(\beta_0, \beta_1\) and \(\sigma^2\). Since the data have been observed, we use lower case \(y_i\) to denote that these values are no longer random. \[L(\beta_0, \beta_1,\sigma^2) = \prod_{i=1}^n\frac{1}{\sqrt{{2\pi\sigma^2}}} \exp\bigg[-\frac{1}{2} \bigg(\frac{y_i-\beta_0-\beta_1x_i}{\sigma}\bigg)^2\bigg]\] Our goal is to find values of \(\beta_0, \beta_1\), and \(\sigma^2\) which maximize this function, which is a straightforward multivariate calculus problem.

We’ll start by doing a bit of rearranging to make our task easier. \[L(\beta_0, \beta_1,\sigma^2) = \bigg(\frac{1}{\sqrt{{2\pi\sigma^2}}}\bigg)^n \exp\bigg[-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2\bigg]\] Then, as is often the case when finding MLEs, for mathematical convenience we will take the natural logarithm of the likelihood function, since log is a monotonically increasing function. Then we will proceed to maximize the log-likelihood, and the resulting estimates will be the same as if we had not taken the log. \[\log L(\beta_0, \beta_1,\sigma^2) = -\frac{n}{2} \log{(2\pi)} - \frac{n}{2} \log{(\sigma^2)} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2\] Note that we use log to mean the natural logarithm (or \(\ln\)). We now take a partial derivative with respect to each of the parameters.

\[\begin{aligned} \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \beta_0} &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) \\ \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \beta_1} &= \frac{1}{\sigma^2} \sum_{i=1}^n x_i (y_i-\beta_0-\beta_1x_i) \\ \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2 \end{aligned}\]

We then set each of the partial derivatives equal to zero and solve the resulting system of equations.

\[\begin{aligned} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) &= 0\\ \sum_{i=1}^n x_i (y_i-\beta_0-\beta_1x_i) &= 0\\ -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2 &=0 \end{aligned}\]

You may notice that the first two equations also appear in the least squares approach. Then, skipping the issue of actually checking whether we have found a maximum, we arrive at our estimates. We call these estimates the maximum likelihood estimates.

\[\begin{aligned} \hat\beta_1 &= \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n} (\sum_{i=1}^n x_i) (\sum_{i=1}^n y_i)}{\sum_{i=1}^n x_i^2 - \frac{1}{n} (\sum_{i=1}^n x_i)^2} = \frac{S_{xy}}{S_{xx}}\\ \hat\beta_0 &= \bar y - \hat\beta_1 \bar x \\ \hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^n (y_i-\hat y_i)^2 \end{aligned}\]

Note that \(\hat\beta_0\) and \(\hat\beta_1\) are the same as the least squares estimates. However we now have a new estimate of \(\sigma^2\), that is \(\hat\sigma^2\). So we now have two different estimates of \(\sigma^2\).

\[\begin{aligned}S_e^2 &= \frac{1}{n-2} \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n-2} \sum_{i=1}^{n}e_i^2 \hspace{1cm} Least \ \ Squares \\ \hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n} \sum_{i=1}^{n}e_i^2 \hspace{1cm} MLE \end{aligned}\]

As we will see in a later section, the least squares estimate is unbiased while the MLE is biased, because the expectation of \(S_e^2\) equals \(\sigma^2\) whereas the expectation of \(\hat\sigma^2\) does not.
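As a quick numerical check of these two estimates and of the MLE derivation above, the sketch below maximizes the log-likelihood with `optim()` on the five-point example data used later in this tutorial; the helper name `negloglik` is ours, introduced for illustration.

# A minimal sketch (hypothetical helper names): numerical MLE for the five-point example.
x <- c(-2, -1, 0, 1, 2)
y <- c( 0,  0, 1, 1, 3)

negloglik <- function(par, x, y) {
  b0 <- par[1]; b1 <- par[2]; sigma2 <- exp(par[3])  # log-parameterized variance
  -sum(dnorm(y, mean = b0 + b1 * x, sd = sqrt(sigma2), log = TRUE))
}

opt <- optim(c(0, 0, 0), negloglik, x = x, y = y)
opt$par[1:2]       # should be close to the least squares estimates (1 and 0.7)
exp(opt$par[3])    # should be close to SSE/n = 1.1/5 = 0.22, the (biased) MLE of sigma^2
coef(lm(y ~ x))    # least squares estimates for comparison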

 

Question: Is bias always a bad thing?

Try to answer it yourself and see these posts to get more ideas.

Example of parameter estimation in simple linear regression (s7)

Use the method of least squares to fit a straight line to the \(n=5\) data points given in the table below. (Wackerly p571)

x y
-2 0
-1 0
0 1
1 1
2 3

Solution

We commence computation of the least-squares estimates for the slope and intercept of the fitted line by constructing the table below.

Using the results from this table, we obtain

\[\hat \beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n}\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \bigg( \sum_{i=1}^n x_i \bigg)^2} = \frac {7 - \frac{1}{5} (0)(5)}{10 - \frac{1}{5} (0)^2} = 0.7\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x = \frac{5}{5} - (0.7)(0) = 1\] and the fitted line is \[\hat y = 1 + 0.7x\]

slr <- function(x){1+0.7*x}
plot(data$x[1:5], data$y[1:5], xlab = "x", ylab = "y", ylim = c(-1.5, 3))
curve(slr, add = TRUE)

# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

Variance estimation (s8)

Expectation of \(\hat\beta_1\)

Assume that \(n\) independent observations are to be made on the model

\[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

Then the regression coefficient \(\hat\beta_1\) can be estimated by

\[\hat \beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}\] which can be written as

\[\hat \beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)y_i - \bar y \sum_{i=1}^n (x_i - \bar x)}{S_{xx}} = \frac{\sum_{i=1}^n (x_i - \bar x)y_i}{S_{xx}}\] Then we have

\[\begin{aligned} E(\hat \beta_1) &= E \bigg [ \frac{\sum_{i=1}^n (x_i - \bar x)y_i}{S_{xx}} \bigg] \\ &= \frac{\sum_{i=1}^n (x_i - \bar x) E(y_i)}{S_{xx}} \\ &= \frac{\sum_{i=1}^n (x_i - \bar x) (\beta_0 + \beta_1 x_i)}{S_{xx}} \\ &= \beta_0 \frac{\sum_{i=1}^n (x_i - \bar x)}{S_{xx}} + \beta_1 \frac{\sum_{i=1}^n (x_i - \bar x)x_i}{S_{xx}}\end{aligned}\] because \(\sum_{i=1}^n (x_i - \bar x) = 0\) and \(S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n (x_i - \bar x)x_i\),

we have

\[E(\hat \beta_1) = 0 + \beta_1 \frac{S_{xx}}{S_{xx}} = \beta_1\] Thus, \(\hat \beta_1\) is an unbiased estimator of \(\beta_1\)

 

Note: Definition of unbiasedness

If the following holds: \[E[u(X_1, X_2, ..., X_n)] = \theta\] then the statistic \(u(X_1, X_2, ..., X_n)\) is an unbiased estimator of the parameter \(\theta\). Otherwise, \(u(X_1, X_2, ..., X_n)\) is a biased estimator of \(\theta\)

 

Variance of \(\hat\beta_1\)

Recall that, for random variables \(X_1, ..., X_n\) and \(Y_1, ..., Y_m\) with \(E(X_i) = \mu_i\) and \(E(Y_j) = \xi_j\), define

\[U_1 = \sum_{i = 1}^n a_iX_i, \ \ \ U_2 = \sum_{j = 1}^m b_jY_j\] for constants \(a_1, ..., a_n\) and \(b_1, ..., b_m\). Then the following hold:

  • \(E(U_1) = \sum_{i=1}^{n} a_i\mu_i\)

  • \(V(U_1) = \sum_{i=1}^{n} a_i^2 V(X_i) + 2\sum\sum_{1 \leq i<j\leq n} a_ia_j Cov(X_i, X_j)\), where the double sum is over all pairs \((i, j)\) with \(i < j\)

  • \(Cov(U_1, U_2) = \sum_{i=1}^{n}\sum_{j=1}^{m} a_ib_j Cov(X_i, Y_j)\)

With these properties, we can find \(V(\hat \beta_1)\) as

\[V(\hat \beta_1) = V \bigg[\frac{\sum_{i=1}^n (x_i - \bar x)y_i}{S_{xx}} \bigg] = \bigg[ \frac{1}{S_{xx}} \bigg ]^2 \sum_{i=1}^n V[(x_i - \bar x)y_i] = \bigg[ \frac{1}{S_{xx}} \bigg ]^2 \sum_{i=1}^n (x_i - \bar x)^2V(y_i)\] Because \(V(y_i) = \sigma^2\) (equal variance assumption), for \(i = 1, 2, ..., n\), \[V(\hat \beta_1) = \frac{\sigma^2}{S_{xx}}\]

Expectation and Variance of \(\hat\beta_0\)

Now let us find the expected value and variance of \(\hat \beta_0\), where \(\hat \beta_0 = \bar y - \hat \beta_1 \bar x\). From the same theorem, we have

\[V(\hat \beta_0) = V(\bar y) + \bar x^2 V(\hat \beta_1) - 2\bar x Cov(\bar y, \hat \beta_1)\] Consequently, we must find \(V(\bar y)\) and \(Cov(\bar y, \hat \beta_1)\) in order to obtain \(V(\hat \beta_0)\). Since \(Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), we see that \[\bar y = \frac{1}{n} \sum_{i=1}^n y_i = \beta_0 + \beta_1 \bar x + \bar\epsilon\] Also \(E(\bar\epsilon) = \frac{1}{n}\sum_{i=1}^n E(\epsilon_i)=0\). Thus \[E(\bar Y) = \beta_0 + \beta_1 \bar x + E(\bar \epsilon) = \beta_0 + \beta_1 \bar x\] and \[V(\bar y) = V(\bar \epsilon) = \frac{1}{n^2}\sum_{i=1}^n V(\epsilon_i) = \frac{\sigma^2}{n}\]

To find \(Cov(\bar y, \hat \beta_1)\), rewrite the expression for \(\hat\beta_1\) as \[\hat\beta_1 = \sum_{i=1}^n c_iy_i\] where \[c_i = \frac{x_i - \bar x}{S_{xx}}\] (Notice that \(\sum_{i=1}^n c_i = 0\).) Then, \[Cov(\bar y, \hat \beta_1) = Cov \bigg[\sum_{i=1}^n \bigg(\frac{1}{n}\bigg)y_i, \sum_{i=1}^nc_iy_i\bigg] = \sum_{i=1}^n \bigg(\frac{c_i}{n}\bigg)V(y_i) + {\sum\sum}_{i \neq j}\bigg(\frac{c_i}{n}\bigg)Cov(y_i,y_j)\] Because \(y_i\) and \(y_j\), where \(i \neq j\), are independent, \(Cov(y_i, y_j) = 0\). Also, \(V(y_i) = \sigma^2\), hence

\[Cov(\bar y, \hat \beta_1) = \frac{\sigma^2}{n} \sum_{i=1}^n c_i = \frac{\sigma^2}{n} \sum_{i=1}^n \bigg(\frac{x_i-\bar x}{S_{xx}}\bigg)=0\] Returning to our original task of finding the expected value and variance of \(\hat \beta_0 = \bar y - \hat \beta_1 \bar x\), we apply expectation theorems to obtain \[E(\hat \beta_0) = E(\bar y) - E(\hat \beta_1)\bar x = \beta_0 + \beta_1 \bar x - \beta_1 \bar x = \beta_0\] Thus, we have shown that both \(\hat\beta_0\) and \(\hat\beta_1\) are unbiased estimators of their respective parameters. Because we have derived \(V(\bar y)\), \(V(\hat \beta_1)\), and \(Cov(\bar y, \hat \beta_1)\), we are ready to find \(V(\hat \beta_0)\). From the same properties \[V(\hat\beta_0) = V(\bar y) + \bar x^2 V(\hat \beta_1) - 2 \bar x Cov(\bar y, \hat \beta_1)\] Substituting the values for \(V(\bar y)\), \(V(\hat\beta_1)\), and \(Cov(\bar y, \hat \beta_1)\), we obtain \[V(\hat\beta_0) = \frac{\sigma^2}{n} + \bar x^2 \bigg(\frac{\sigma^2}{S_{xx}} \bigg) - 0 = \sigma^2 \bigg(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\bigg) = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{nS_{xx}}\] Furthermore, \[Cov(\hat\beta_0, \hat\beta_1) = \frac{-\bar x \sigma^2}{S_{xx}}\] Note that \(\hat\beta_0\) and \(\hat\beta_1\) are correlated (and therefore dependent) unless \(\bar x = 0\).
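These expectation and variance results can also be checked by simulation. The sketch below assumes illustrative true values (\(\beta_0 = 1\), \(\beta_1 = 0.7\), \(\sigma = 1\), not from the text) and compares the empirical moments of the estimators with the formulas above.

# A minimal simulation sketch (assumed true values beta0 = 1, beta1 = 0.7, sigma = 1).
set.seed(42)
x <- c(-2, -1, 0, 1, 2); n <- length(x)
beta0 <- 1; beta1 <- 0.7; sigma <- 1
Sxx <- sum((x - mean(x))^2)

est <- replicate(10000, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  coef(lm(y ~ x))
})

rowMeans(est)                       # empirical E(b0), E(b1): close to 1 and 0.7 (unbiasedness)
apply(est, 1, var)                  # empirical V(b0), V(b1)
c(sigma^2 * sum(x^2) / (n * Sxx),   # theoretical V(b0)
  sigma^2 / Sxx)                    # theoretical V(b1)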

Usually the value of \(\sigma^2\) is unknown, so we will need to make use of the sample observations to estimate \(\sigma^2\). If \(\bar y\) is used to estimate the mean, we can use the expression below to estimate the population variance \(\sigma^2\).

\[\bigg(\frac{1}{n-1}\bigg) \sum_{i=1}^n (y_i - \bar y)^2\] Because we are now using \(\hat y_i\) to estimate \(E(y_i)\), it seems natural to base an estimate of \(\sigma^2\) on \(SSE=\sum_{i=1}^n (y_i - \hat y_i)^2\). Indeed, we will show that

\[S^2 = \bigg(\frac{1}{n-2}\bigg)\sum_{i=1}^n(y_i-\hat y_i)^2 = \bigg(\frac{1}{n-2}\bigg)SSE\] provides an unbiased estimator for \(\sigma^2\). Notice that the 2 occurring in the denominator of \(S^2\) corresponds to the number of \(\beta\) parameters estimated in the model.

Because \[E(S^2) = E\bigg[\bigg(\frac{1}{n-2}\bigg) SSE\bigg] = \bigg(\frac{1}{n-2}\bigg) E(SSE),\] it is necessary to find \(E(SSE)\) in order to verify that \(E(S^2) = \sigma^2\)

Notice that \[\begin{aligned} E(SSE) &= E\bigg[\sum_{i=1}^n (y_i-\hat y_i)^2\bigg] \\ &= E\bigg[\sum_{i=1}^n (y_i-\hat\beta_0 - \hat\beta_1x_i)^2\bigg] \\ &= E\bigg[\sum_{i=1}^n (y_i- \bar y + \hat\beta_1 \bar x - \hat\beta_1 x_i)^2\bigg] \\ &= E\bigg[\sum_{i=1}^n [(y_i- \bar y) - \hat\beta_1 (x_i - \bar x)]^2\bigg] \\ &= E\bigg[\sum_{i=1}^n (y_i- \bar y)^2 + \hat\beta_1^2 \sum_{i=1}^n(x_i - \bar x)^2 -2\hat\beta_1 \sum_{i=1}^n(x_i - \bar x)(y_i - \bar y) \bigg] \end{aligned}\]

Because \(\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)=\sum_{i=1}^n(x_i - \bar x)^2\hat\beta_1\) (recall that \(\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}\)), the last two terms in the expectation combine to give \(-\hat\beta_1^2 \sum_{i=1}^n (x_i - \bar x)^2 = -\hat\beta_1^2 S_{xx}\). Also,

\[\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n y_i^2 - n\bar y^2,\]

therefore,

\[\begin{aligned} E\bigg[\sum_{i=1}^n (y_i - \hat y_i)^2\bigg] &= E\bigg[\sum_{i=1}^n y_i^2 - n\bar y^2 - \hat\beta_1^2 S_{xx}\bigg] \\&=\sum_{i=1}^n E(y_i^2) - nE(\bar y^2) - S_{xx}E(\hat\beta_1^2) \end{aligned}\]

Noting that, for any random variable \(U\), \(E(U^2) = V(U) + [E(U)]^2\), we see that \[\begin{aligned} E\bigg[\sum_{i=1}^n (y_i-\hat y_i)^2\bigg] &=\sum_{i=1}^n E(y_i^2) - nE(\bar y^2) - S_{xx}E(\hat\beta_1^2) \\ &= \sum_{i=1}^n \{V(y_i) + [E(y_i)]^2\} - n \{V(\bar y) + [E(\bar y)]^2\} - S_{xx}\{V(\hat\beta_1) + [E(\hat\beta_1)]^2\} \\ &= n\sigma^2 + \sum_{i=1}^n(\beta_0+\beta_1x_i)^2 - n \bigg[\frac{\sigma^2}{n} + (\beta_0 + \beta_1 \bar x)^2 \bigg] - S_{xx} \bigg(\frac{\sigma^2}{S_{xx}} + \beta_1^2\bigg) \\ &=(n-2)\sigma^2 +\bigg[\sum_{i=1}^n \beta_0^2 +2\beta_0\beta_1\sum_{i=1}^n x_i + \beta_1^2\sum_{i=1}^n x_i^2\bigg] - \bigg[n\beta_0^2 + 2n\beta_0\beta_1 \bar x + n\beta_1^2\bar x^2\bigg] \\ & \ \ - \bigg[\beta_1^2\sum_{i=1}^n x_i^2 -2\beta_1^2 \bar x\sum_{i=1}^n x_i + \beta_1^2\sum_{i=1}^n \bar x^2\bigg] \\ &=(n-2)\sigma^2 +\bigg[n\beta_0^2 +2n\beta_0\beta_1 \bar x + \beta_1^2\sum_{i=1}^n x_i^2\bigg] - \bigg[n\beta_0^2 + 2n\beta_0\beta_1 \bar x + n\beta_1^2\bar x^2\bigg] - \bigg[\beta_1^2\sum_{i=1}^n x_i^2 -2n\beta_1^2 \bar x^2 + n\beta_1^2 \bar x^2\bigg] \\ &=(n-2)\sigma^2 \end{aligned}\]

As shown above, the expression simplifies to \((n-2)\sigma^2\). Thus, we find that an unbiased estimator of \(\sigma^2\) is given by \[S^2 = \bigg(\frac{1}{n-2}\bigg) \sum_{i=1}^n (y_i - \hat y_i)^2 = \bigg(\frac{1}{n-2}\bigg)SSE\] One task remains, finding an easy way to calculate \(\sum_{i=1}^n(y_i - \hat y_i)^2 = SSE\).

\[\begin{aligned} SSE = \sum_{i=1}^n(y_i - \hat y_i)^2 &= \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \\ &= \sum_{i=1}^n(y_i - \bar y)^2 - \hat\beta_1\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = S_{yy} - \hat\beta_1S_{xy} \end{aligned}\]

Example of variance estimation in simple linear regression (s9, Wackerly p579)

Find the variances of the estimators \(\hat\beta_0\) and \(\hat\beta_1\) for the table below.

x y
-2 0
-1 0
0 1
1 1
2 3

Solution In the previous example we found that \[n = 5, \hspace{0.5cm} \sum_{i=1}^n x_i = 0, \hspace{0.5cm} \sum_{i=1}^n x_i^2 = 10, \hspace{0.5cm} S_{xx} = 10, \hspace{0.5cm} and \hspace{0.5cm} \bar x = 0\] Thus, \[V(\hat\beta_0) = \frac{\sigma^2\sum_{i=1}^n x_i^2}{nS_{xx}} = \frac{\sigma^2 (10)}{5(10)} = \bigg(\frac{1}{5}\bigg)\sigma^2\] and \[V(\hat\beta_1) = \bigg(\frac{1}{10}\bigg)\sigma^2\] Notice that \(Cov(\hat\beta_0, \hat\beta_1) = 0\) in this case since \(\sum_{i=1}^n x_i = 0\)

To get the answer, we need to estimate \(\sigma^2\). For these data, \(n = 5\) and we have already determined that \[\sum_{i=1}^n y_i = 5, \hspace{0.5cm} S_{xy} = 7, \hspace{0.5cm} and \hspace{0.5cm} \hat\beta_1 = 0.7\] It is easily determined that \(\sum_{i=1}^n y_i^2 = 11\) and that \[S_{yy} = \sum_{i=1}^n(y_i - \bar y)^2 = \sum_{i=1}^ny_i^2 - n(\bar y)^2 = 11 - 5(1)^2 = 6.0\] Therefore, \[SSE = S_{yy} - \hat\beta_1 S_{xy} = 6.0 - (0.7)(7) = 1.1\] and

\[S^2 = \frac{SSE}{n-2} = \frac{1.1}{5-2} = \frac{1.1}{3} = 0.367\]

x y xy \(x^2\) \(y^2\)
1 -2 0 0 4 0
2 -1 0 0 1 0
3 0 1 0 0 1
4 1 1 1 1 1
5 2 3 6 4 9
Sum 0 5 7 10 11
n <- 5
# row 6 of `data` holds the column sums; Sxx = sum(x^2) - n*xbar^2, etc.
Sxx <- data$x2[6] - n * (data$x[6]/n)^2
Syy <- data$y2[6] - n * (data$y[6]/n)^2
Sxy <- data$xy[6] - (data$x[6] * data$y[6])/n
b1 <- 0.7
b0 <- 1

# estimation of population variance
SSE <- Syy - b1*Sxy
S2 <- SSE/(n-2)
S <- sqrt(S2)

# cf) MLE
S2_mle <- SSE/n
S_mle <- sqrt(S2_mle)

# Formatting
knitr::kable(data.frame(Sxx, Syy, Sxy, SSE, S2, S, S2_mle, S_mle))
Sxx Syy Sxy SSE S2 S S2_mle S_mle
10 6 7 1.1 0.3666667 0.6055301 0.22 0.4690416
# variance of regression coefficient
V_b1 <- S2/Sxx; V_b1; S2/10
## [1] 0.03666667
## [1] 0.03666667
V_b0 <- (S2*data$x2[6])/(n*Sxx); V_b0; S2/5
## [1] 0.07333333
## [1] 0.07333333
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

 

These derivations establish the means and variances of the estimators \(\hat\beta_0\) and \(\hat\beta_1\) and show that \(S^2 = \frac{SSE}{(n-2)}\) is an unbiased estimator for the parameter \(\sigma^2\). Thus far, the only assumptions that we have made about the error term \(\epsilon\) in the model \(Y = \beta_0 + \beta_1 x + \epsilon\) are that \(E(\epsilon) = 0\) and that \(V(\epsilon) = \sigma^2\), independent of \(x\). The form of the sampling distributions for \(\hat\beta_0\) and \(\hat\beta_1\) depends on the distribution of the error term \(\epsilon\). Because of the common occurrence of the normal distribution in nature, it is often reasonable to assume that \(\epsilon\) is normally distributed with mean 0 and variance \(\sigma^2\), or \(\epsilon \sim N(0, \sigma^2)\).

If this assumption of normality is warranted, it follows that \(Y_i\) is normally distributed with mean \(\beta_0 + \beta_1x_i\) and variance \(\sigma^2\). Because both \(\hat\beta_0\) and \(\hat\beta_1\) are linear functions of \(y_1, y_2, ..., y_n\), the estimators are normally distributed, with means and variances as previously derived. Further, if the assumption of normality is warranted, it follows that \[\frac{(n-2)S^2}{\sigma^2} = \frac{SSE}{\sigma^2}\] has a \(\chi^2\) distribution with \(n-2\) degrees of freedom (df). The assumption of normality of the distribution of the error term \(\epsilon\) and the resulting normal distributions for \(\hat\beta_0\) and \(\hat\beta_1\) will allow us to develop tests and confidence intervals based on the \(t\) distribution.
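A quick simulation can illustrate this \(\chi^2\) result. The sketch below uses assumed parameter values (not from the text) and repeatedly computes \(SSE/\sigma^2\) from data generated under the model.

# A minimal simulation sketch (assumed values): SSE/sigma^2 should look like chi-square(n-2).
set.seed(1)
xv <- c(-2, -1, 0, 1, 2); n <- length(xv); sigma <- 1

sse_scaled <- replicate(10000, {
  y <- 1 + 0.7 * xv + rnorm(n, sd = sigma)
  sum(residuals(lm(y ~ xv))^2) / sigma^2
})

mean(sse_scaled)                          # should be close to n - 2 = 3
hist(sse_scaled, freq = FALSE, breaks = 50)
curve(dchisq(x, df = n - 2), add = TRUE)  # overlay the chi-square(3) density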

Statistical inference of parameters (s10)

Suppose that a researcher has fit the model \[Y = \beta_0 + \beta_1 x + \epsilon\] where \(Y\) is the strength of gene expression after 28 hrs and \(x\) is the drug/DMSO ratio used in the experiment. If the strength of gene expression does not change with the drug/DMSO ratio, then \(\beta_1 = 0\). Thus the researcher may wish to test \(H_0:\beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\) in order to assess whether the independent variable has an influence on the dependent variable. Or the researcher may wish to estimate the mean rate of change \(\beta_1\) in \(E(Y)\) for a 1-unit change in the drug/DMSO ratio \(x\).

In general, for any linear regression model, if the random error \(\epsilon\) is normally distributed, we have established that \(\hat\beta_i\) is an unbiased, normally distributed estimator of \(\beta_i\) with \[V(\hat\beta_0) = C_{00}\sigma^2, \hspace{1cm} where \ \ C_{00} = \frac{\sum_{i=1}^n x_i^2}{nS_{xx}}\] and \[V(\hat\beta_1) = C_{11}\sigma^2, \hspace{1cm} where \ \ C_{11} = \frac{1}{S_{xx}}\]

That is, the variances of both estimators are constant multiples of \(\sigma^2\), the variance of the error term in the model. Using this information, we can construct a test of the hypothesis \(H_0:\beta_i = \beta_{i0}\) (\(\beta_{i0}\) is a specified value of \(\beta_i\)), using the test statistic

\[Z = \frac{\hat\beta_i - \beta_{i0}}{\sigma\sqrt{c_{ii}}}\]

The rejection region for a two-tailed test is given by \[|Z| \geq Z_{\alpha/2}\]

If we estimate \(\sigma\) with \(S = \sqrt{\frac{SSE}{n-2}}\), then resulting quantity

\[T = \frac{\hat\beta_i - \beta_{i0}}{S\sqrt{c_{ii}}}\] can be shown to possess a Student’s t distribution with \(n-2\) df.

In summary, to test hypothesis for \(\beta_i\)

  • \(H_0\): \(\beta_i = \beta_{i0}\)

  • \(H_{a1}\): \(\beta_i > \beta_{i0}\) (upper-tail rejection region),

  • \(H_{a2}\): \(\beta_i < \beta_{i0}\) (lower-tail rejection region),

  • \(H_{a3}\): \(\beta_i \neq \beta_{i0}\) (two-tailed rejection region),

  • Test statistic: \(T = \frac{\hat\beta_i - \beta_{i0}}{S\sqrt{c_{ii}}}\)

  • Rejection region:

    • \(RR_1\): \(t > t_{\alpha}\) (upper-tail alternative),

    • \(RR_2\): \(t < -t_{\alpha}\) (lower-tail alternative),

    • \(RR_3\): \(|t| > t_{\alpha/2}\) (two-tailed alternative),

where \[C_{00} = \frac{\sum_{i=1}^n x_i^2}{nS_{xx}} \hspace{1cm} and \hspace{1cm} C_{11} = \frac{1}{S_{xx}}\] Notice that \(t_{\alpha}\) is based on \((n-2)\) df.

Example of inference of parameter (s11, Wackerly p585)

Do the data of previous example present sufficient evidence to indicate that the slope differs from 0? Test using \(\alpha = 0.05\) and give bounds for the attained significance level.

Solution

From previous example, we can get

\[C_{11} = \frac{1}{S_{xx}} = \frac{1}{10} = 0.1, \hspace{1cm} S = \sqrt{S^2} = \sqrt{0.367} = 0.606\] and \[t = \frac{\hat\beta_1 - 0}{s\sqrt{C_{11}}} = \frac{0.7-0}{0.606\sqrt{0.1}} = 3.65\] If we take \(\alpha = 0.05\), the value of \(t_{\alpha/2} = t_{0.025}\) for 3 df is 3.182 from the t-table, and the rejection region is \[reject \ \ if \ \ |t| \geq 3.182\] Therefore, we reject the null hypothesis that \(\beta_1 = 0\) at the \(\alpha = 0.05\) level of significance.

df <- n-2
c11 <- 1/Sxx
S <- sqrt(S2)
t <- (b1-0)/(S * sqrt(c11))
t_crit <- qt(0.975, df)
pval <- 2*pt(t, df, lower.tail = FALSE)

knitr::kable(data.frame(df, c11, S, t, t_crit, pval))
df c11 S t t_crit pval
3 0.1 0.6055301 3.655631 3.182446 0.0353528
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

Example of inference of interval for \(\beta_i\) (s10, Wackerly p586)

Calculate a 95% confidence interval for the parameter \(\beta_1\) of previous example.

Solution From the t-table, we can get the critical t-value \(t_{0.025}\), based on 3 df, which is 3.182. Then the 95% confidence interval for \(\beta_1\) is

\[\hat\beta_1 \pm t_{0.025}s\sqrt{C_{11}}\] Substituting, we get \[0.7 \pm (3.182)(0.606)\sqrt{0.1} \hspace{1cm} or \hspace{1cm} 0.7 \pm 0.610\] If we wish to estimate \(\beta_1\) correct to within 0.15 unit, it is obvious that the confidence interval is too wide and that the sample size must be increased.

b1 <- 0.7
t_crit <- qt(0.975, df)
S <- sqrt(S2)
c11 <- 1/Sxx

lower <- b1 - (t_crit*S*sqrt(c11))
upper <- b1 + (t_crit*S*sqrt(c11))

knitr::kable(data.frame(lower, upper))
lower upper
0.0906079 1.309392
# base R
confint(slr2, "x[1:5]", level=0.95)
##             2.5 %   97.5 %
## x[1:5] 0.09060793 1.309392

Statistical inference of linear function of parameters \(E(y)\) (s12)

In addition to making inferences about a single \(\beta_i\), we frequently are interested in making inferences about linear functions of the model parameters \(\beta_0\) and \(\beta_1\). For example, we might wish to estimate \(E(Y)\), given by \[E(Y) = \beta_0 + \beta_1x\] where \(E(Y)\) represents the mean yield of a biochemical process for the settings of a controlled process variable \(x\) such as the concentration of an enzyme. Suppose that we wish to make an inference about the linear function \[\theta = a_0\beta_0 + a_1\beta_1\] where \(a_0\) and \(a_1\) are constants (one of which may equal zero). Then, the same linear function of the parameter estimators \[\hat \theta = a_0\hat\beta_0 + a_1\hat\beta_1\] is an unbiased estimator of \(\theta\) because \[E(\hat\theta) = a_0E(\hat\beta_0) + a_1E(\hat\beta_1) = a_0\beta_0 + a_1\beta_1 = \theta\] and \[V(\hat\theta) = a_0^2V(\hat\beta_0) + a_1^2V(\hat\beta_1) + 2a_0a_1Cov(\hat\beta_0, \hat\beta_1)\] where \(V(\hat\beta_i)\) = \(c_{ii}\sigma^2\) and \(Cov(\hat\beta_0, \hat\beta_1) = c_{01}\sigma^2\), with \[c_{00} = \frac{\sum_{i=1}^n x_i^2}{nS_{xx}}, \hspace{0.5cm} c_{11} = \frac{1}{S_{xx}}, \hspace{0.5cm} c_{01}=\frac{-\bar x}{S_{xx}}\] Therefore, \[V(\hat\theta) = a_0^2\frac{\sum_{i=1}^n x_i^2}{nS_{xx}}\sigma^2 + a_1^2\frac{1}{S_{xx}}\sigma^2+2a_0a_1\frac{-\bar x}{S_{xx}}\sigma^2 = \bigg(\frac{a_0^2\frac{\sum_{i=1}^n x_i^2}{n}+a_1^2-2a_0a_1\bar x}{S_{xx}}\bigg)\sigma^2\] Finally, recalling that \(\hat\beta_0\) and \(\hat\beta_1\) are normally distributed in repeated sampling, it is clear that \(\hat\theta\) is a linear function of normally distributed random variables, implying that \(\hat\theta\) is normally distributed. Thus, we conclude that \[Z = \frac{\hat\theta - \theta}{\sigma_{\hat\theta}}\] has a standard normal distribution and could be employed to test the hypothesis \[H_0: \theta = \theta_0\] when \(\theta_0\) is some specified value of \(\theta = a_0\beta_0 + a_1\beta_1\). Likewise, a \(100(1-\alpha)\%\) confidence interval for \(\theta = a_0\beta_0 + a_1\beta_1\) is \[\hat\theta \pm z_{\alpha/2}\sigma_{\hat\theta}\] where \(\sigma_{\hat\theta} = \sqrt{V(\hat\theta)}\). If we substitute \(S\) for \(\sigma\) in the expression for \(Z\), the resulting expression possesses a Student’s \(t\) distribution in repeated sampling, with \(n-2\) df, and provides a test statistic to test hypotheses about \(\theta = a_0\beta_0 + a_1\beta_1\).

In summary,

  • A Test for \(\theta = a_0\beta_0 + a_1\beta_1\)

  • \(H_0: \theta = \theta_0\)

  • \(H_{a1}: \theta > \theta_0\)

  • \(H_{a2}: \theta < \theta_0\)

  • \(H_{a3}: \theta \neq \theta_0\)

  • Test statistic: \(T = \frac{\hat\theta - \theta_0}{S\sqrt{\frac{a_0^2\frac{\sum_{i=1}^n x_i^2}{n}+a_1^2-2a_0a_1\bar x}{S_{xx}}}}\)

  • Rejection region:

    • \(RR_1: t > t_{\alpha}\)

    • \(RR_2: t < - t_{\alpha}\)

    • \(RR_3: |t| > t_{\alpha/2}\)

Here, \(t_{\alpha}\) and \(t_{\alpha/2}\) are based on \(n-2\) df.

The corresponding \(100(1-\alpha)\%\) confidence interval is \[\hat\theta \pm t_{\alpha/2} S\sqrt{\bigg(\frac{a_0^2\frac{\sum_{i=1}^n x_i^2}{n}+a_1^2-2a_0a_1\bar x}{S_{xx}}\bigg)}\] where \(t_{\alpha/2}\) is based on \(n-2\) df.
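A sketch of how such an interval can be computed in R for a general \(\theta = a_0\beta_0 + a_1\beta_1\), using the estimated coefficient covariance matrix; the data are the five-point example, and the refit object `fit5` is our own name introduced for illustration.

# A minimal sketch: confidence interval for theta = a0*beta0 + a1*beta1
# (fit5 is a hypothetical refit of the five-point example data).
x <- c(-2, -1, 0, 1, 2)
y <- c( 0,  0, 1, 1, 3)
fit5 <- lm(y ~ x)

a <- c(a0 = 1, a1 = 1)                                    # e.g. theta = beta0 + beta1*1 = E(Y) at x* = 1
theta_hat <- sum(a * coef(fit5))                          # point estimate of theta
se_theta  <- sqrt(as.numeric(t(a) %*% vcov(fit5) %*% a))  # equals S*sqrt(...) in the formula above
t_crit    <- qt(0.975, df = df.residual(fit5))
theta_hat + c(lower = -1, upper = 1) * t_crit * se_theta

With \(a_0 = 1\) and \(a_1 = x^*\) this reduces to the confidence interval for \(E(Y)\) derived next.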

We can apply the above equation to estimate \(E(Y)\), the mean value of \(Y\). For a particular value of \(x^*\), \[E(Y) = \beta_0 + \beta_1x^*\] Notice that \(E(Y)\) is a special case of \(a_0\beta_0 + a_1\beta_1\), with \(a_0=1\) and \(a_1=x^*\). Then the part of the equation under the square root above becomes as follows.

\[\begin{aligned}\bigg(\frac{a_0^2\frac{\sum_{i=1}^n x_i^2}{n}+a_1^2-2a_0a_1\bar x}{S_{xx}}\bigg) &= \frac{\frac{\sum_{i=1}^n x_i^2}{n}+{x^*}^2-2x^*\bar x}{S_{xx}}\\ &= \frac{\frac{\sum_{i=1}^n x_i^2}{n} - \bar x^2}{S_{xx}} + \frac{(x^* - \bar x)^2}{S_{xx}}\\ &= \frac{1}{n}\frac{\sum_{i=1}^n x_i^2 - n \bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} + \frac{(x^* - \bar x)^2}{S_{xx}}\\ &= \frac{1}{n}\frac{\sum_{i=1}^n x_i^2 - n \bar x^2}{\sum_{i=1}^n x_i^2 - 2\bar x\sum_{i=1}^n x_i + n \bar x^2} + \frac{(x^* - \bar x)^2}{S_{xx}}\\ &= \frac{1}{n}\frac{\sum_{i=1}^n x_i^2 - n \bar x^2}{\sum_{i=1}^n x_i^2 - 2n\bar x^2 + n \bar x^2} + \frac{(x^* - \bar x)^2}{S_{xx}}\\ &= \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} \end{aligned}\]

A confidence interval for the mean value of \(Y\) when \(x = x^*\), a particular value of \(x\), is as follows. \[\hat\beta_0 + \hat\beta_1x^* \pm t_{\alpha/2}S\sqrt{\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}\] where \(t_{\alpha/2}\) is based on \(n-2\) df.

This formula makes it easy to see that for a fixed value of \(n\) and for given \(x\)-values, the shortest confidence interval for \(E(Y)\) is obtained when \(x^* = \bar x\), the average of the \(x\)-values used in the experiment. If our objective is to plan an experiment that yields short confidence intervals for \(E(Y)\) when \(x = x^*\), \(n\) should be large, \(S_{xx}\) should be large (if possible), and \(\bar x\) should be near \(x^*\). The physical interpretation of a large \(S_{xx}\) is that when possible the values of \(x\) used in the experiment should be spread out as much as possible.

Example of inference of \(E(Y)\) (s13, Wackerly p592)

For the same example, find a 90% confidence interval for \(E(Y)\) when \(x=1\). (wackerly p592)

Solution For the model of the previous example \[E(Y) = \beta_0 + \beta_1x\] To estimate \(E(Y)\) for any fixed value \(x=x^*\), we use the unbiased estimator \(\hat{E(Y)} = \hat\beta_0 + \hat\beta_1x^*\). Then, \[\hat\beta_0 +\hat\beta_1x^* = 1 + 0.7x^*\] For the case of \(x^* = 1\), and because \(n=5, \ \bar x = 0,\) and \(S_{xx} = 10\), it follows that \[\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} = \frac{1}{5} + \frac{(1-0)^2}{10} = 0.3\] We already know that \(S^2 = 0.367\) or \(S = 0.606\) and \(t_{0.05}\) with \(n-2 = 3\) df is 2.353. Thus, the confidence interval for \(E(Y)\) when \(x=1\) is \[\hat\beta_0 + \hat\beta_1x^* \pm t_{\alpha/2}S\sqrt{\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}\] \[[1 + (0.7)(1)] \pm (2.353)(0.606)\sqrt{0.3} = 1.7 \pm 0.781\] That is, we are 90% confident that, when the independent variable takes on the value \(x=1\), the mean value \(E(Y)\) of the dependent variable is between 0.919 and 2.481. This interval obviously is very wide, but remember that it is based on only 5 data points.

n <- 5
b0 <- 1
b1 <- 0.7
df <- n-2
t_crit <- qt(0.95, df)
x <- 1

y <- b0+b1*x
# note: data$x[6] is the column sum, so data$x[6]/n gives xbar
lower <- b0+b1*x - (t_crit*S*sqrt((1/n)+((x - data$x[6]/n)^2/Sxx)))
upper <- b0+b1*x + (t_crit*S*sqrt((1/n)+((x - data$x[6]/n)^2/Sxx)))

knitr::kable(data.frame(y, lower, upper))
y lower upper
1.7 0.9194776 2.480522
# base R
new.data <- data.frame(x = 1)
predict(slr2, new.data, level=0.90, interval = "confidence")[1,]
## Warning: 'newdata' had 1 row but variables found have 5 rows
##       fit       lwr       upr 
## 1.7000000 0.9194776 2.4805224

Prediction of particular value of Y (s14)

Suppose that for a fixed pressure the yield \(Y\) for a biochemical experiment is a function of the temperature \(x\) at which the experiment is run. Assume that a linear model of the form \[y = \beta_0 + \beta_1x + \epsilon\] adequately represents the response function traced by \(y\) over the experimental region of interest. Now, instead of estimating the mean yield at \(x^*\), we wish to predict the particular response \(y\) that we will observe if the experiment is run at some time in the future (for example, next Monday). This situation would occur if, for some reason, the response next Monday held a special significance to us. Prediction problems frequently occur when we are interested in a specific issue.

Notice that \(y\) is a random variable, not a parameter; predicting its value therefore represents a departure from our previous objective of making inferences about population parameters. If it is reasonable to assume that \(\epsilon\) is normally distributed with mean 0 and variance \(\sigma^2\), it follows that \(y\) is normally distributed with mean \(\beta_0 + \beta_1x\) and variance \(\sigma^2\).

If we are interested in the value of \(y\) when \(x = x^*\), call it \(y^*\), we could employ \(\hat{y^*} = \hat\beta_0 + \hat\beta_1x^*\) as a predictor of the particular value of \(y^*\) and as an estimator of \(E(y)\) as well. The error of prediction is defined as \[error = y^* - \hat y^*\] Notice that the error is also normally distributed. Let’s investigate the properties of this error in repeated sampling. Applying the expectation theorems we have utilized so far, we get \[E(error)= E(y^* - \hat y^*)\] and because \(E(\hat y^*) = \beta_0 + \beta_1 x^* = E(y^*)\), \[E(error) = 0\] Likewise, \[V(error) = V(y^* - \hat y^*)= V(y^*) + V(\hat y^*) - 2Cov(y^*, \hat y^*)\] Because we are predicting a future value \(y^*\) that is not employed in the computation of \(\hat y^*\), it follows that \(y^*\) and \(\hat y^*\) are independent and hence that \(Cov(y^*, \hat y^*) = 0\). Then, \[\begin{aligned} V(error)=V(y^*) + V(\hat y^*) &= \sigma^2 + V(\hat\beta_0 + \hat\beta_1x^*)\\ &= \sigma^2 + \bigg(\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}\bigg)\sigma^2 \\ &= \sigma^2 \bigg[1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}\bigg] \end{aligned}\]

We have shown that the error of predicting a particular value of \(Y\) is normally distributed with mean 0 and variance as given in the above equation. It follows that \[Z = \frac{y^* - \hat y^*}{\sigma \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}}\] has a standard normal distribution. Furthermore, if \(S\) is substituted for \(\sigma\), it can be shown that \[T = \frac{y^* - \hat y^*}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}}\] possesses a Student’s \(t\) distribution with \(n-2\) df. We can use this result to construct a prediction interval for the random variable \(y^*\). \[P(-t_{\alpha/2} < T < t_{\alpha/2}) = 1-\alpha\] Substituting for T, we obtain \[P\Bigg[-t_{\alpha/2} < \frac{y^* - \hat y^*}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}}< t_{\alpha/2}\Bigg] = 1-\alpha\] In other words, in repeated sampling the inequality within the brackets will hold with a probability equal to \((1-\alpha)\). By manipulating terms, we get

\[P\Bigg[\hat y^*-t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}} < y^*< \hat y^*+t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}}\Bigg] = 1-\alpha\] Thus, we have placed an interval about \(\hat y^*\) that in repeated sampling will contain the actual value of \(y^*\) with probability \(1-\alpha\). The \(100(1-\alpha)\%\) prediction interval for \(y^*\) is given as

\[\hat\beta_0+ \hat\beta_1x^* \pm t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}}\] In other words, we would expect the error to be less in absolute value than \[t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}}\] with probability equal to \((1-\alpha)\)

Notice that the length of a confidence interval for \(E(Y)\) when \(x=x^*\) is given by \[2 \times t_{\alpha/2}{S \sqrt{\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}}\] whereas the length of a prediction interval for an actual value of \(y\) when \(x=x^*\) is given by \[2 \times t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}}\] Thus, we observe that prediction intervals for the actual value of \(y\) are longer than confidence intervals for \(E(y)\) if both are determined for the same value of \(x^*\)

Example of inference of particular value of \(y\) (s14, Wackerly p596)

Suppose that the experiment that generated the data of previous example is to be run again with \(x=2\). Predict the particular value of \(y\) with \(1-\alpha= 0.90\) (Wackerly p596)

Solution The predicted value of \(y\) with \(x=2\) is \[\hat\beta_0 + \hat\beta_1x^* = 1 + (0.7)(2) = 2.4\] Further, with \(x^*=2\), \[\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} = \frac{1}{5} + \frac{(2-0)^2}{10} = 0.6\] Thus the prediction interval is \[\hat\beta_0+ \hat\beta_1x^* \pm t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}} = 2.4 \pm (2.353)(0.606)\sqrt{1+0.6} = 2.4 \pm 1.804\]

n <- 5
b0 <- 1
b1 <- 0.7
df <- n-2
t_crit <- qt(0.95, df)
x <- 2

y <- b0+b1*x
# note: data$x[6] is the column sum, so data$x[6]/n gives xbar
lower <- b0+b1*x - (t_crit*S*sqrt(1+(1/n)+((x - data$x[6]/n)^2/Sxx)))
upper <- b0+b1*x + (t_crit*S*sqrt(1+(1/n)+((x - data$x[6]/n)^2/Sxx)))

knitr::kable(data.frame(y, lower, upper))
y lower upper
2.4 0.5974608 4.202539
# base R
new.data <- data.frame(x = 2)
predict(slr2, new.data, level=0.90, interval = "prediction")[1,]
## Warning: 'newdata' had 1 row but variables found have 5 rows
##       fit       lwr       upr 
## 2.4000000 0.5974608 4.2025392
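To visualize this comparison, the following sketch refits the five-point example (the data frame and the `fit5` object are our own names, introduced so that `predict()` accepts new data without the indexing warning seen above) and draws 90% confidence and prediction bands.

# A minimal sketch: confidence band for E(Y) and prediction band for y
# (fit5 is a hypothetical refit of the five-point example).
d5   <- data.frame(x = c(-2, -1, 0, 1, 2), y = c(0, 0, 1, 1, 3))
fit5 <- lm(y ~ x, data = d5)
grid <- data.frame(x = seq(-2, 2, by = 0.1))

conf <- predict(fit5, grid, interval = "confidence", level = 0.90)
pred <- predict(fit5, grid, interval = "prediction", level = 0.90)

plot(d5$x, d5$y, xlab = "x", ylab = "y", ylim = range(pred))
lines(grid$x, conf[, "fit"])                                                  # fitted line
lines(grid$x, conf[, "lwr"], lty = 2); lines(grid$x, conf[, "upr"], lty = 2)  # confidence band
lines(grid$x, pred[, "lwr"], lty = 3); lines(grid$x, pred[, "upr"], lty = 3)  # prediction band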

 

As you can see in the figure above, the confidence bands for \(E(Y)\) are narrower than the prediction bands for \(y\).

Fitness of model (s15)

The analysis of variance table (s17)

An ANOVA table can also be constructed for a simple linear regression analysis. Here, \[H_0: \beta_1 = 0, \hspace{1cm} H_a: \beta_1 \neq 0\] The total sum of squares (or simply \(S_{yy}\)) \[SST = \sum_{i=1}^n (y_i - \bar y)^2\] measures the total variability in the values of the dependent variable. It has \(n-1\) degrees of freedom and is partitioned into the sum of squares for regression (SSR) with 1 degree of freedom and the sum of squares for error (SSE) with \(n-2\) degrees of freedom. SSR measures the amount of variability in the dependent variable that is accounted for by the fitted regression line, and SSE measures the variability about the fitted regression line. \[SST = SSR + SSE\] \[\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2\] The figure below illustrates these three sources of variability

The mean squares are \[MSR = \frac{SSR}{1} = SSR \hspace{1cm} and \hspace{1cm} MSE = \frac{SSE}{n-2}=\hat\sigma^2 \sim \sigma^2 \frac{\chi^2_{n-2}}{n-2}\] (Think about why \(\hat\sigma^2\) follows a \(\chi^2\)-based distribution.)

An analysis of variance table can be constructed as shown below.

Source df SS MS F p-value
Regression 1 SSR MSR = SSR/1 F = MSR/MSE \(P(F_{1,n-2} > F)\)
Error n-2 SSE MSE = SSE/(n-2)
Total n-1 SST

(compare what we performed in the project and the table above)

The analysis procedure is as follows.

  • \(H_0: \beta_1=0, \hspace{1cm} H_a:\beta_1\neq0\)
  • \(SSR/\sigma^2 \sim \chi^2_1\) (under \(H_0\)), \(SSE/\sigma^2 \sim \chi^2_{n-2}\)
  • \(F = \frac{MSR}{MSE} \sim F_{1,n-2}\)
  • p-value = \(P(X > F)\) (two-sided)
  • \(t = \frac{\hat\beta_1\sqrt{S_{xx}}}{\hat\sigma}\)
  • \(t^2 = \frac{\hat\beta_1^2S_{xx}}{\hat\sigma^2} =\frac{MSR}{MSE} = F\) (for more information about this equality, see this post)

Therefore, in simple linear regression problem, the p-value given in the ANOVA table and the p-value given for the slope parameter are always exactly the same.

The proportion of the total variability in the dependent variable \(y\) that is accounted for by the regression line is \[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{1}{1+\frac{SSE}{SSR}}\] which is known as the coefficient of determination (COD). This coefficient takes a value between 0 and 1, and the closer it is to one the smaller is the sum of squares for error SSE in relation to the sum of squares for regression SSR. Thus, larger values of \(R^2\) tend to indicate that the data points are closer to the fitted regression line. Nevertheless, a low value of \(R^2\) should not necessarily be interpreted as implying that the fitted regression line is not appropriate or is not useful. A fitted regression line may be accurate and informative even though a small value of \(R^2\) is obtained because of a large error variance \(\sigma^2\).

Example of the analysis of variance table (Hayter p546)

The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon the plant’s production. The data set shown below is compiled and provides the plant’s production and electrical usage for each month of the previous year. The electrical usage is in units of a million kilowatt-hours, and the production is measured as the value in million-dollar units of the cars produced in that month.

Production Usage
Jan 4.51 2.48
Feb 3.58 2.26
Mar 4.31 2.47
Apr 5.06 2.77
May 5.64 2.99
Jun 4.99 3.05
Jul 5.29 3.18
Aug 5.83 3.46
Sep 4.70 3.03
Oct 5.61 3.26
Nov 4.90 2.67
Dec 4.20 2.53

As shown in the scatter plot of the data set, the manager concludes that it is sensible to fit a straight line to the data points. The electricity usage is taken to be the dependent variable \(y\), and the production is taken to be the explanatory variable \(x\). The linear model \(y = \beta_0 + \beta_1x\) will then allow a month’s electrical usage to be estimated as a function of the month’s production. We can perform an ANOVA test for the regression line.

# Fitting linear model
fit <- lm(Usage ~ Production, data = data)
summary(fit)
## 
## Call:
## lm(formula = Usage ~ Production, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23245 -0.16704  0.03919  0.13480  0.27645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.40905    0.38599   1.060    0.314    
## Production   0.49883    0.07835   6.367 8.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1729 on 10 degrees of freedom
## Multiple R-squared:  0.8021, Adjusted R-squared:  0.7823 
## F-statistic: 40.53 on 1 and 10 DF,  p-value: 8.176e-05
# ANOVA
res <- summary(aov(fit)); res
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Production   1 1.2124  1.2124   40.53 8.18e-05 ***
## Residuals   10 0.2991  0.0299                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res <- unlist(res)

# R^2 = SSR/SST
R2 <- res[3]/(res[3]+res[4]); names(R2) <- "R2"; R2
##        R2 
## 0.8021094
# visualize
library(ggplot2)
qplot(data$Production, data$Usage, ylab = "Electricity Usage", 
      xlab = "Production", main = "Production vs Electricity Usage") + stat_smooth(method = "lm")

Residual Analysis (s17)

The residuals are defined to be \[e_i = y_i - \hat y_i, \hspace{1cm} 1\leq i \leq n\] so that they are the differences between the observed values of the dependent variable \(y_i\) and the corresponding fitted values \(\hat y_i\). A property of these residuals is that they sum to 0 \[\sum_{i=1}^n e_i = 0\] Residual analysis can be used to

  • Identify data points that are outliers (residuals vs leverage plot, Cook’s distance)

  • Check whether the fitted model is appropriate (Scale-location plot)

  • Check whether the error variance is constant (residual plots, standardized residual plots)

  • Check whether the error terms are normally distributed (normal probability plot)

Here we list some residual plots that can be used as diagnostics for the regression model:

  • Random scatter and no patterns - best fit

  • Random scatter with outliers

  • A nonlinear relationship exists - SLR is not appropriate

  • A funnel-shaped residual plot - non-constant error variance

  • A straight line in a normal scores plot - normally distributed errors

  • A nonlinear pattern in a normal scores plot - non-normality of the errors

Plot the residuals \(e_i\) against the values of the explanatory variable \(x_i\)

# Residuals vs Fitted
plot(fit$fitted.values, fit$residuals, xlab = "Fitted value", 
     ylab = "Residuals", main = "Residuals vs Fitted")
abline(h=0, col="grey", lty = 2)
abline(h=0.2, col="grey", lty = 2); abline(h=-0.2, col="grey", lty = 2)

# base R - plot.lm
plot(fit, which=1)

Notice that the red line in the residuals vs fitted values plot represents the average value of the residuals at each fitted value (a scatterplot smoother). From this plot, we can identify outliers or a nonlinear pattern in the data.

Plot the standardized residual \(\frac{e_i}{\hat\sigma}\) against the values of the explanatory variable \(x_i\) or leverage \(h_i\)

Standardized residual

The standardized residual is simply the ratio of the residual to the estimated error standard deviation, as shown below. \[\frac{e_i}{\hat\sigma}\] We can take the square root of the absolute value of the standardized residuals to diminish the skewness of their distribution (as in the scale-location plot below).

If we additionally take the leverage \(h_i\) of each observation into account when scaling the residual, we get the Studentized residual. \[\frac{e_i}{S \sqrt{1-h_i}}\]

Leverage (\(h_i\))

An outlier is a data point whose response \(y\) does not follow the general trend of the rest of the data. A data point has high leverage if it has “extreme” predictor \(x\) values. A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results.

The statistic called leverage is given as follows \[h_i = \frac{1}{n} + \frac{(x_i-\bar x)^2}{\sum_{j=1}^n (x_j-\bar x)^2}\] and indicates the proportion of the total sum of squares of the \(x\) values contributed by the \(i\)th point. The properties of leverage are summarized below (a numerical check follows the list):

  • The leverage \(h_i\) is a measure of the distance between the \(x\) value for the \(i\)th data point and the mean of the \(x\) values for all \(n\) data points.
  • The leverage \(h_i\) is a number between 0 and 1, inclusive.
  • The sum of the \(h_i\) equals \(p\), the number of parameters (regression coefficients including the intercept). Therefore, in the simple linear regression case, \(p = 2\).
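As a numerical check of the leverage formula above (assuming the `fit` object and the `data` frame from the electricity-usage example are still in the workspace), we can compare a manual computation with R’s `hatvalues()`:

# A sketch: leverage computed from the formula above vs. R's hatvalues()
# (assumes `fit` and `data` from the electricity-usage example are still available).
x_prod   <- data$Production
h_manual <- 1/length(x_prod) + (x_prod - mean(x_prod))^2 / sum((x_prod - mean(x_prod))^2)
all.equal(unname(hatvalues(fit)), h_manual)   # should be TRUE
sum(h_manual)                                 # should equal p = 2 in simple linear regression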

 

Cook’s Distance (\(D_i\))

Cook’s distance is a statistic that quantifies the influence of data points. The definition and properties of Cook’s distance are summarized below; a numerical check follows the list.

\[D_i = \frac{(y_i - \hat y_i)^2}{2\times MSE}\bigg[\frac{h_{i}}{(1-h_{i})^2}\bigg] = \frac{e_i^2}{2\times MSE}\bigg[\frac{h_{i}}{(1-h_{i})^2}\bigg]\]

  • \(D_i\) directly summarizes how much all of the fitted values change when the \(i\)th observation is deleted
  • A data point having a large \(D_i\) indicates that the data point strongly influences the fitted values
    • If \(D_i\) is greater than 0.5, then the \(i_{th}\) data point is worthy of further investigation as it may be influential
    • If \(D_i\) is greater than 1, then the \(i_{th}\) data point is quite likely to be influential
    • or if \(D_i\) stands out clearly from the other \(D_i\) values, it is almost certainly influential.
  • For more information on \(D_i\), please see this page
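Cook’s distance can likewise be recomputed from the residuals, the leverages, and the MSE (again assuming the `fit` object from the example above) and compared with R’s `cooks.distance()`:

# A sketch: Cook's distance from the formula above vs. R's cooks.distance()
# (assumes `fit` from the electricity-usage example; 2 = number of estimated coefficients).
e   <- residuals(fit)
h   <- hatvalues(fit)
mse <- summary(fit)$sigma^2
D_manual <- (e^2 / (2 * mse)) * h / (1 - h)^2
all.equal(unname(cooks.distance(fit)), unname(D_manual))   # should be TRUE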
# Standardized residual vs Fitted
fit.stdres <- rstandard(fit)
plot(fit$fitted.values, fit.stdres, xlab = "Fitted value", 
     ylab = "Standardized Residuals", ylim = c(-2, 2), main = "Standardized residuals vs Fitted")
abline(h=0, col="grey", lty = 3)
abline(h=1, col="grey", lty = 2); abline(h=-1, col="grey", lty = 2)
abline(h=2, col="grey", lty = 2); abline(h=-2, col="grey", lty = 2)

# Square root |Standardized residual| vs Fitted
plot(fit$fitted.values, sqrt(abs(fit.stdres)), xlab = "Fitted value", ylim = c(0, 1.3),
     ylab = bquote(sqrt(abs("Standardized residuals"))), main = "Scale-Location")

plot(fit, which=3)

# Cook's distance - Influential outlier detection
plot(fit, which=4)

# Standardized residual vs leverage - Influential outlier detection
plot(fit, which=5)

Plot a normal probability plot (also referred to as a normal scores plot, or Q-Q plot)

The normal scores plot is used to check whether the error terms \(\epsilon_i\) appear to be normally distributed. In such a plot the residuals \(e_i\) are plotted against their “normal scores”, which can be thought of as their “expected values” if they have a normal distribution. The normal score of the \(i_{th}\) smallest residual is \[\Phi^{-1}\Bigg(\frac{i - \frac{3}{8}}{n+\frac{1}{4}}\Bigg)\]
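
These normal scores can also be computed directly from the formula; a minimal sketch (assuming the standardized residuals fit.stdres computed above and no ties among them):

# Normal scores via the formula above, plotted against the standardized residuals
n_res  <- length(fit.stdres)
scores <- qnorm((rank(fit.stdres) - 3/8) / (n_res + 1/4))
plot(scores, fit.stdres, xlab = "Normal scores", ylab = "Standardized residuals")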

qqnorm(fit.stdres, ylab = "Standardize residuals", xlab = "Normal scores", 
       main = "Normal Probability Plot", ylim = c(-1.5, 1.9))
qqline(fit.stdres, lty=3, col="grey")

# Q-Q plot
plot(fit, which=2)

Lack of fit test (s17)

A perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model gives a zero error sum of squares, i.e. \(SSE = 0\). However, if you repeat the experiment with the same explanatory variable values \(x_i\) and record the response variable \(y_i\) multiple times, the observations from the repeated trials will deviate from this perfect model. This error originates from purely random variation, or noise. The sum of squares due to this error, called the sum of squares of pure error and abbreviated \(SSPE\), quantifies these variations. \(SSPE\) is calculated by taking repeated observations at some or all values of \(x_i\) and adding up the squared deviations from the group mean at each level of \(x_i\). The remaining error is called the sum of squares of lack of fit and is abbreviated \(SSLF\). Therefore,

\[SSE = SSPE + SSLF\] As the example below shows, in the lack of fit test we decompose the residual error \(SSE\) into \(SSLF\) and \(SSPE\) and compute a mean square for each part separately.
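
The corresponding test statistic (a standard formulation, with \(c\) denoting the number of distinct \(x\) levels) is \[F = \frac{MSLF}{MSPE} = \frac{SSLF/(c-2)}{SSPE/(n-c)}\] which is compared with the \(F_{c-2, \ n-c}\) distribution; a significantly large value indicates lack of fit.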

Example of lack of fit test (s18)

Suppose that we built a simple linear regression model between the size of a cancer mass (\(x\)) and its increase in size (\(y\)) over a given time, despite the curved appearance of the scatter plot shown below. The regression gives \(\hat\beta_0=50.7\) and \(\hat\beta_1=0.49\). (Also calculate \(\hat\beta_0\), \(\hat\beta_1\), the error variance, and the confidence intervals of \(\beta_0\), \(\beta_1\), and \(\hat y\) by yourself; set \(\alpha=0.05\).)

 

Solution

 

cancer_mass <- read.table("http://bisyn.kaist.ac.kr/bis335/10-Regression-tumorsize.txt", 
                          sep = "\t", header = T)
colnames(cancer_mass) <- c("Mass", "Increase_t1", "Increase_t2")

# Find b0, b1, yhat
fit <- lm(Increase_t1 ~ Mass, data=cancer_mass)
summary(fit)
## 
## Call:
## lm(formula = Increase_t1 ~ Mass, data = cancer_mass)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.23 -34.06  12.61  32.44  48.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  50.7225    39.3979   1.287     0.23
## Mass          0.4867     0.2747   1.772     0.11
## 
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared:  0.2586, Adjusted R-squared:  0.1762 
## F-statistic: 3.139 on 1 and 9 DF,  p-value: 0.1102
knitr::kable(data.frame(Mass = cancer_mass$Mass, y_hat = fit$fitted.values))
Mass y_hat
75 87.22513
75 87.22513
100 99.39267
100 99.39267
125 111.56021
125 111.56021
150 123.72775
175 135.89529
175 135.89529
200 148.06283
200 148.06283
# Manual calculation of Lack of fit ANOVA table
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- cancer_mass

# Degrees of freedom
df.SSPE <- length(data$Mass) - length(unique(data$Mass))
df.SSLF <- length(unique(data$Mass)) - 2

# calculation of SSPE & SSLF  
data2 <- data %>% group_by(Mass) %>% summarise(Group_mean = mean(Increase_t1))
data <- left_join(data, data2, by="Mass")
data$SSPE <- (data$Increase_t1 - data$Group_mean)^2
data$Pred <- data$Mass * fit$coefficients[2] + fit$coefficients[1]
data$SSLF <- (data$Group_mean - data$Pred)^2
SSPE <- sum(data$SSPE)
SSLF <- sum(data$SSLF)

# calculation of F statistic
MSLF <- SSLF/df.SSLF
MSPE <- SSPE/df.SSPE
a <- 0.05

f <- round(MSLF / MSPE, 4)
f_crit <- qf(1-a, df1 = df.SSLF, df2 = df.SSPE); f_crit
## [1] 5.192168
# pvalue
p <- round(1 - pf(f, df.SSLF, df.SSPE),4)

# summary table
knitr::kable(data.frame(cbind(DF = c(df.SSLF, df.SSPE), 
                              SS = c(SSLF, SSPE), 
                              MS = c(MSLF, MSPE), 
                              F = c(f,NA), 
                              p = c(p, NA)
                              ) # end of cbind
                        ) # end of data.frame
             ) # end of kable
DF SS MS F p
4 13593.57 3398.393 14.8014 0.0056
5 1148.00 229.600 NA NA
# with R package
library(alr3)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
fit <- lm(Increase_t1 ~ Mass, data = cancer_mass)
summary(fit)
## 
## Call:
## lm(formula = Increase_t1 ~ Mass, data = cancer_mass)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.23 -34.06  12.61  32.44  48.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  50.7225    39.3979   1.287     0.23
## Mass          0.4867     0.2747   1.772     0.11
## 
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared:  0.2586, Adjusted R-squared:  0.1762 
## F-statistic: 3.139 on 1 and 9 DF,  p-value: 0.1102
anova(fit)
## Analysis of Variance Table
## 
## Response: Increase_t1
##           Df  Sum Sq Mean Sq F value Pr(>F)
## Mass       1  5141.3  5141.3  3.1389 0.1102
## Residuals  9 14741.6  1638.0
pureErrorAnova(fit)
## Analysis of Variance Table
## 
## Response: Increase_t1
##              Df  Sum Sq Mean Sq F value   Pr(>F)   
## Mass          1  5141.3  5141.3  22.393 0.005186 **
## Residuals     9 14741.6  1638.0                    
##  Lack of fit  4 13593.6  3398.4  14.801 0.005594 **
##  Pure Error   5  1148.0   229.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value is less than the significance level of our test, we can conclude that there is lack of fit in the simple linear regression model.

Correlation analysis (s19, s22)

The linear model \[Y = \beta_0 + \beta_1x + \epsilon\] implies that \[E(Y) = \beta_0 + \beta_1 x\] when the value of \(x\) is completely controlled by the experimenter. If the variable \(x\) is instead the observed value of a random variable \(X\) that is not controllable by the experimenter, the same model implies that \[E(Y | X=x) = \beta_0 + \beta_1x\] In this setting, we generally assume that the vector random variable \((X,Y)\) has a bivariate normal distribution with \(E(X) = \mu_X, \ E(Y) = \mu_Y, \ V(X) = \sigma^2_X, \ V(Y) = \sigma^2_Y\), and correlation coefficient \(\rho\), in which case it can be shown that \[E(Y|X=x) = \beta_0 + \beta_1 x, \hspace{1cm} where \hspace{0.5cm} \beta_1 = \frac{\sigma_Y}{\sigma_X}\rho\] The statistical theory for making inferences about the parameters \(\beta_0\) and \(\beta_1\) is exactly the same in both cases, but the differences in model interpretation should be kept in mind.

In the second case, the experimenter may want to know only whether the random variables \(X\) and \(Y\) are independent. If \((X, Y)\) has a bivariate normal distribution, then testing for independence is equivalent to testing whether the correlation coefficient \(\rho\) is equal to zero.

Let \((X_1, Y_1), (X_2, Y_2),...,(X_n, Y_n)\) denote a random sample from a bivariate normal distribution. The maximum-likelihood estimator of \(\rho\) is given by the sample correlation coefficient \[r = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2 \sum_{i=1}^n(Y_i - \bar Y)^2}}\] Notice that we can express \(r\) in terms of familiar quantities \[r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \hat\beta_1\sqrt{\frac{S_{xx}}{S_{yy}}}\] It follows that \(r\) and \(\hat\beta_1\) have the same sign.

We can test hypothesis as follows

  • \(H_0: \rho = 0\)

  • \(H_{a1}: \rho > 0\) (which is equivalent to testing \(H_{a}: \beta_1 >0\) against \(H_{0}: \beta_1 = 0\))

  • \(H_{a2}: \rho < 0\)

  • \(H_{a3}: \rho \neq 0\)

  • statistic: \(t = \frac{\hat\beta_1 -0}{\frac{S}{\sqrt{S_{xx}}}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)

    (Note: \(\hat\beta_1 =r\sqrt{\frac{S_{yy}}{S_{xx}}}, \hspace{0.3cm} S_{xy} = \hat\beta_1S_{xx}, \hspace{0.3cm} \hat\beta_1^2S_{xx} = r^2S_{yy}, \hspace{0.3cm}S=\sqrt{\frac{SSE}{n-2}}, \hspace{0.3cm} SSE = S_{yy} - \hat\beta_1S_{xy}\))

  • degrees of freedom: \(n-2\)

We can also use \(Z\) statistic for large sample case, by using fact that \[\frac{1}{2}ln\bigg[\frac{1+r}{1-r}\bigg] \sim N\bigg(\frac{1}{2}ln\bigg[\frac{1+\rho}{1-\rho}\bigg], \frac{1}{n-3} \bigg)\] Thus, for testing the hypothesis \(H_0: \rho = \rho_0\), we can employ a \(Z\)-test in which \[\begin{aligned}Z &= \frac{\frac{1}{2} \Bigg(ln\bigg[\frac{1+r}{1-r}\bigg] - ln\bigg[\frac{1+\rho_0}{1-\rho_0}\bigg]\Bigg)} {\frac{1}{\sqrt{n-3}}} \\ &= \frac{\sqrt{n-3} \ ln \bigg[\frac{(1+r)(1-\rho_0)}{(1-r)(1+\rho_0)}\bigg]}{2} \end{aligned}\]

The rejection regions are as follows

  • \(H_a: \rho > \rho_0\), \(RR_1: z > z_\alpha\)

  • \(H_a: \rho < \rho_0\), \(RR_2: z < -z_\alpha\)

  • \(H_a: \rho \neq \rho_0\), \(RR_3: |z| > z_{\alpha/2}\)
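
Putting the \(Z\) statistic and rejection regions together, here is a minimal sketch in R (the function name fisher_z_test is just for illustration; \(r\), \(n\), and \(\rho_0\) are assumed to be available):

# Large-sample Z test for H0: rho = rho0 based on Fisher's transformation
fisher_z_test <- function(r, n, rho0 = 0) {
  z <- sqrt(n - 3) / 2 * log(((1 + r) * (1 - rho0)) / ((1 - r) * (1 + rho0)))
  p <- 2 * pnorm(-abs(z))  # two-sided p-value
  c(z = z, p = p)
}
fisher_z_test(r = 0.84, n = 10)  # e.g. with the r and n from the example below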

Example of Correlation analysis (s20, s22, Wackerly p600)

The data in the table below represent a sample of mathematics achievement test scores and calculus grades for ten independently selected college freshmen. From this evidence, would you say that the achievement test scores and calculus grades are independent? Use \(\alpha = 0.05\). Identify the corresponding attained significance level.

Student Math_Test_Score Final_Calculus_Grade
1 39 65
2 43 78
3 21 52
4 64 82
5 57 92
6 47 89
7 28 73
8 75 98
9 34 56
10 52 75

Solution

  • \(H_0: \rho = 0\)

  • \(H_a: \rho \neq 0\)

  • test statistic: \(t = \frac {r\sqrt{n-2}}{\sqrt{1-r^2}}\)

data <- data.frame(student = 1:10, 
                   x = c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52), 
                   y = c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
                   )
data2 <- data
# setting parameters
n <- length(data$student)
df <- n-2
a <- 0.05

# calculation of SS
data2$xy <- data2$x * data$y
data2$x2 <- data2$x^2
data2$y2 <- data2$y^2

SS <- colSums(data2)[-1]
Sxx <- SS[4] - SS[1]^2/n
Syy <- SS[5] - SS[2]^2/n
Sxy <- SS[3] - SS[1]*SS[2]/n

knitr::kable(data.frame(Sxx = as.numeric(Sxx), 
                        Syy = as.numeric(Syy), 
                        Sxy = as.numeric(Sxy)))
Sxx Syy Sxy
2474 2056 1894
# Calculation of r and t
r <- as.numeric(Sxy/sqrt(Sxx*Syy))
t <- (r * sqrt(n-2))/(sqrt(1 - r^2))
t_crit <- qt(1-(a/2), df)
p <- 2 * pt(t, df, lower.tail = FALSE)

knitr::kable(data.frame(r = round(r,3), 
                        t = round(t,3), 
                        t_crit = round(t_crit,3), 
                        p = round(p,5)), 
             align='c')
r t t_crit p
0.84 4.375 2.306 0.00236
# base R
cor.test(data$x, data$y)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x and data$y
## t = 4.375, df = 8, p-value = 0.002365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4459632 0.9611846
## sample estimates:
##       cor 
## 0.8397859
# cf) Spearman rank correlation (for more information, see nonparametric test tutorial and the example below)
cor.test(data$x, data$y, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  data$x and data$y
## S = 20, p-value = 0.001977
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8787879

Example of Nonparametric Spearman’s Ranked Correlation Test

We discussed Spearman’s ranked correlation in the lecture on nonparametric tests. As you already know, the statistic, abbreviated \(r_S\), is a nonparametric version of Pearson’s correlation statistic. (The formula is the same as Pearson’s formula except that it uses the ranked values instead of the actual values of \(x_i\) and \(y_i\).) When there are no ties in either the \(x\) observations or the \(y\) observations, the expression for \(r_S\) algebraically reduces to a simpler expression:

\[r_S = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}, \hspace{1cm} where \ d_i =R(x_i) - R(y_i)\] The statistical significance of Spearman’s correlation statistic can be determined by the t statistic as well as Pearson correlation statistic. Now, let’s consider simple example. The correlation between IQ and the hours of TV watching per week.

IQTV <- data.frame(IQ = c(86, 97, 99, 100, 101, 103, 106, 110, 112, 113), 
                   TV = c(0, 20, 28, 27, 50, 29, 7, 17, 6, 12))

n <- length(IQTV$IQ)
df <- n-2
IQTV$rank_IQ <- rank(IQTV$IQ)
IQTV$rank_TV <-  rank(IQTV$TV)
IQTV$d <- IQTV$rank_IQ - IQTV$rank_TV
IQTV$d2 <- IQTV$d^2

knitr::kable(IQTV)
IQ TV rank_IQ rank_TV d d2
86 0 1 1 0 0
97 20 2 6 -4 16
99 28 3 8 -5 25
100 27 4 7 -3 9
101 50 5 10 -5 25
103 29 6 9 -3 9
106 7 7 3 4 16
110 17 8 5 3 9
112 6 9 2 7 49
113 12 10 4 6 36
rs <- 1-6*sum(IQTV$d2)/(n*(n^2-1))
t <- (rs * sqrt(n-2))/(sqrt(1 - rs^2))
t_crit <- qt(a/2, df)       # lower-tail critical value (a = 0.05 from above)
p <- 2*pt(-abs(t), df)      # two-sided p-value

knitr::kable(data.frame(rs = round(rs,3), 
                        t = round(t,3), 
                        t_crit = round(t_crit,3), 
                        p = round(p,5)), 
             align='c')
rs t t_crit p
-0.176 -0.505 -2.306 0.62719
# base R
cor.test(IQTV$IQ, IQTV$TV, method = "spearman", conf.level = 0.95)
## 
##  Spearman's rank correlation rho
## 
## data:  IQTV$IQ and IQTV$TV
## S = 194, p-value = 0.632
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1757576

The test result implies that there is no significant correlation between IQ and hours of TV watching per week; the negative sign of \(r_S\) suggests that any association, if present, is weakly negative.

Extension of Linear Model

Nonlinear regression: Variable transformation (s24)

Let’s consider the exponential model \[y = \gamma_0e^{\gamma_1x}\] By taking logarithms of both sides, we obtain the linear form below. \[ln(y) = ln(\gamma_0) + \gamma_1 x\] A linear regression model can then be fitted to the data points \((x_i, ln(y_i))\). In the fitted linear model, the slope parameter \(\beta_1\) corresponds to \(\gamma_1\), and the intercept parameter \(\beta_0\) corresponds to \(ln(\gamma_0)\). Notice that the usual error assumptions are required for the linear relationship; thus, for the exponential model example, it is assumed that \[ln(y_i) = ln(\gamma_0) + \gamma_1 x_i + \epsilon_i\] where the error terms \(\epsilon_i\) are independent observations from a \(N(0, \sigma^2)\) distribution. This assumption can be checked in the usual manner through residual analysis. However, notice that in terms of the original exponential model, it implies that \[y_i = \gamma_0e^{\gamma_1x_i}\epsilon_i^*\] so that there are multiplicative error terms \(\epsilon_i^* = e^{\epsilon_i}\).

A model that can be transformed into a linear relationship is known as an intrinsically linear model. However, there are some models that cannot be transformed into a linear format; for example, \[y = \gamma_0 + e^{\gamma_1x}\] is not intrinsically linear and needs to be handled with nonlinear regression techniques.

Here are some examples of intrinsically linear models.
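
For example (two standard cases), the power model and the reciprocal model can both be linearized by transforming the variables: \[y = \gamma_0 x^{\gamma_1} \ \Rightarrow \ ln(y) = ln(\gamma_0) + \gamma_1 ln(x), \hspace{1cm} y = \frac{1}{\gamma_0 + \gamma_1 x} \ \Rightarrow \ \frac{1}{y} = \gamma_0 + \gamma_1 x\]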

Example of Variable transformation (s25)

A bioengineer measures the growth rate of a substance by counting the number of cells \(N\) present at various times \(t\) as shown below. Fit the model \(N=\gamma_0e^{\gamma_1t}\) and calculate a two-sided 95% confidence interval for the unknown parameter \(\gamma_1\).

time Num_cell
1 12
2 31
3 42
4 75
5 119
6 221
7 327
8 546

Solution

The model \(N=\gamma_0e^{\gamma_1t}\) with two unknown parameters \(\gamma_0\) and \(\gamma_1\) is commonly used for this problem. With a logarithmic transformation, the linear relationship \[y = \beta_0 + \beta_1 x\] is obtained with \(y = ln(N)\), \(\beta_0 = ln(\gamma_0)\), \(\beta_1 = \gamma_1\), \(x = t\).

A straight line is fitted to the transformed data points \((x_i, y_i)\) and the fitted regression line is \[y = 0.523x + 2.176\] The equation can be transformed back into a model in terms of the original variables: since \(\beta_0 = ln(\gamma_0) = 2.176\), \[N = e^{ln(\gamma_0) + \gamma_1t}=e^{2.176 + 0.523t}\] and the confidence interval for the slope gives \[\beta_1 = \gamma_1 \in (0.473, 0.573)\]

cell_growth <- data.frame(time = 1:8, Num_cell = c(12, 31, 42, 75, 119, 221, 327, 546))

# transform of variables
cell_growth$lnt <- log(cell_growth$time)
cell_growth$lnN <- log(cell_growth$Num_cell)

# fit linear model directly
model1 <- lm(Num_cell ~ time, data = cell_growth)
summary(model1)
## 
## Call:
## lm(formula = Num_cell ~ time, data = cell_growth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -87.14 -56.16 -21.64  44.47 132.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -139.04      64.20  -2.166  0.07350 . 
## time           69.04      12.71   5.430  0.00162 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.39 on 6 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8027 
## F-statistic: 29.49 on 1 and 6 DF,  p-value: 0.001617
plot(Num_cell ~ time, data = cell_growth, pch = 16, 
     ylab = "N", xlab = "t")
lines(cell_growth$time, model1$fitted.values)

# fit linear model with transformed variable
model2 <- lm(lnN ~ time, data = cell_growth)
summary(model2)
## 
## Call:
## lm(formula = lnN ~ time, data = cell_growth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21446 -0.05110 -0.01050  0.05717  0.21144 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.17619    0.10357   21.01 7.57e-07 ***
## time         0.52318    0.02051   25.51 2.39e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1329 on 6 degrees of freedom
## Multiple R-squared:  0.9909, Adjusted R-squared:  0.9893 
## F-statistic: 650.7 on 1 and 6 DF,  p-value: 2.391e-07
plot(lnN ~ time, data = cell_growth, pch = 16, 
     ylab = "ln(N)", xlab = "t")
lines(cell_growth$time, model2$fitted.values)

slr <- function(x) exp(2.176 + 0.523*x)
plot(cell_growth$time, cell_growth$Num_cell, pch = 16, 
     ylab = "N", xlab = "t")
curve(slr, add = TRUE)

# confidence interval
confint(model2, "time", level=0.95)
##         2.5 %    97.5 %
## time 0.472994 0.5733623

Summary example 1 - \(VO_2-max\) Aerobic Fitness Measurements (s27, Hayter p560)

The table below is a data set concerning the aerobic fitness of a sample of 20 male subjects. An exercising individual breathes through an apparatus that measures the amount of oxygen in the inhaled air that is used by the individual. The maximum value per unit time of the utilized oxygen is then scaled by the person’s body weight to come up with a variable \(VO_2-max\), which is a general indication of the aerobic fitness of the individual. In other words, \(VO_2-max\) quantifies the maximum rate of oxygen consumption measured during incremental exercise. Fit a linear regression model with \(VO_2-max\) as the dependent variable and age as the explanatory variable. Solve questions below.

VO2.max Age
23 59
45 47
29 44
55 32
48 45
42 61
32 71
33 32
34 28
52 36
40 36
35 51
45 31
47 44
26 73
42 47
35 40
41 43
29 68
38 40
  1. Estimate intercept (\(\hat\beta_0\)) and slope (\(\hat\beta_1\)) parameters.
vo2 <- read.csv("http://bisyn.kaist.ac.kr/bis335/10-Regression-vo2max-aerobic-fitness.csv")

# prep
colnames(vo2) <- c("y", "x")
data <- vo2 %>% select(x, y) %>% mutate(xy = x*y, x2 = x^2, y2 = y^2)

# SS
n <- length(data$x)
SS <- colSums(data)
Sxx <- SS[4] - SS[1]^2/n
Syy <- SS[5] - SS[2]^2/n
Sxy <- SS[3] - SS[1]*SS[2]/n
knitr::kable(data.frame(Sxx = as.numeric(Sxx), Syy = as.numeric(Syy), Sxy = as.numeric(Sxy)))
Sxx Syy Sxy
3486.8 1428.95 -1177.4
# Slope & Intercept
b1 <- Sxy/Sxx
b0 <- mean(data$y) - b1*mean(data$x)

knitr::kable(data.frame(intercept = as.numeric(round(b0,4)), 
                        slope = as.numeric(round(b1,4))))
intercept slope
54.2181 -0.3377
  2. Calculate \(r^2\), \(r\), and determine the significance of \(r\) using \(t\)-statistic
df <- n - 2
SSE <- as.numeric(Syy - b1*Sxy)
S2 <- SSE/df # error variance

# R^2 & r
SSR <- b1*Sxy
SST <- Syy
r2 <- SSR/SST
r <- Sxy/sqrt(Sxx*Syy)

# pvalue (two-sided)
t <- (r * sqrt(n-2))/(sqrt(1 - r^2))
t_crit <- qt(1-(a/2), df)
p <- 2 * pt(abs(t), df, lower.tail = FALSE)

library(kableExtra)
kable(data.frame(S2 = as.numeric(round(S2,4)),
                        r2 = as.numeric(round(r2,4)), 
                        r = as.numeric(round(r,4)),
                        sqrt_r2 = as.numeric(round(-sqrt(r2),4)),
                        p = as.numeric(round(p,4))),
             col.names = c("$S^2$", "$r^2$", "r", "$-\\sqrt {r^2}$", "p"), 
             align = 'c', escape = FALSE, format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(S^2\) \(r^2\) r \(-\sqrt {r^2}\) p
57.2985 0.2782 -0.5275 -0.5275 0.0168
  3. Find one-sided 95% confidence interval for \(\hat\beta_1\)
a <- 0.05
t_crit <- qt(1 - a, df)
S <- sqrt(S2)
c11 <- 1/Sxx

upper <- b1 + (t_crit*S*sqrt(c11)); upper
##         xy 
## -0.1153818
kable(data.frame(lower = "-$\\infty$", 
                 upper = as.numeric(round(upper,4))), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
lower upper
-\(\infty\) -0.1154
  4. Find two-sided 95% confidence and prediction intervals for a 50-year-old male
# Confidence interval
t_crit <- qt(1 - (a/2), df)
x <- 50
y_hat <- b0+b1*x
K <- (1/n)+((x - mean(vo2$x))^2/Sxx)

lower <- y_hat - (t_crit*S*sqrt(K))
upper <- y_hat + (t_crit*S*sqrt(K))

kable(data.frame(y_hat = as.numeric(round(y_hat, 4)), 
                 lower = as.numeric(round(lower, 4)),
                 upper = as.numeric(round(upper,4))),
      col.names = c("$\\hat y$", "lower", "upper"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(\hat y\) lower upper
37.3344 33.6485 41.0202
# Prediction interval
lower <- y_hat - (t_crit*S*sqrt(1+K))
upper <- y_hat + (t_crit*S*sqrt(1+K))

kable(data.frame(y_hat = as.numeric(round(y_hat, 4)), 
                 lower = as.numeric(round(lower, 4)),
                 upper = as.numeric(round(upper,4))),
      col.names = c("$\\hat y$", "lower", "upper"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(\hat y\) lower upper
37.3344 21.0097 53.659
  5. Using base R functions, perform the above analysis
vo2 <- read.csv("http://bisyn.kaist.ac.kr/bis335/10-Regression-vo2max-aerobic-fitness.csv")

# 1. lm
fit <- lm(VO2.max ~ Age, data = vo2)
fit$coefficients # b0, b1
## (Intercept)         Age 
##  54.2180509  -0.3376735
# 2. r^2, r
## r^2
r2 <- summary(fit)
kable(data.frame(r2 = as.numeric(round(r2$r.squared, 4)), 
                 f = as.numeric(round(r2$f[1], 4)),
                 p = as.numeric(round(r2$coefficients[2,4],4)),
                 sigma = as.numeric(round(r2$sigma,4))),
      col.names = c("$r^2$", "F", "p", "$\\hat\\sigma$"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(r^2\) F p \(\hat\sigma\)
0.2782 6.9387 0.0168 7.5696
## r, t, p
r <- cor.test(vo2$VO2.max, vo2$Age)

kable(data.frame(r = as.numeric(round(r$estimate, 4)), 
                 t = as.numeric(round(r$statistic, 4)),
                 p = as.numeric(round(r$p.value,4))),
      col.names = c("$r$", "t", "p"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(r\) t p
-0.5275 -2.6341 0.0168
# 3. conf.int for b1 (one-sided, upper)
confint(fit, "Age", level = 1-2*a, "one-sided")[2]
## [1] -0.1153818
# 4. conf.int and pred.int for y_hat
newdata <- data.frame(Age = 50)
predict(fit, newdata, level=1-a, interval = "confidence")[1,]
##      fit      lwr      upr 
## 37.33438 33.64853 41.02022
predict(fit, newdata, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 37.33438 21.00974 53.65901
# 5. visualize
library(ggplot2)
qplot(vo2$Age, vo2$VO2.max, ylab = "VO2 Max", 
      xlab = "Age", main = "VO2 max vs Age") + stat_smooth(method = "lm")

  6. Questions
  • Does the sign of \(\hat\beta_1\) surprise you? What does your model predict as the change in \(VO_2-max\) for an additional five years of age? \[-0.3377 \times 5 = -1.6885\]
  • What does the model tell you about the aerobic fitness of a 15-year-old boy?

If the model were used for a 15-year-old boy, it would be extrapolation well outside the observed age range (28-73 years), so the prediction may be unreliable.

Summary example 2 - Initial mass of tumor vs Increased size of tumor

The following table summarizes the initial mass (\(S_0\)) and the increased mass (\(S_1\)) of a tumor over a given time. From the scatter plot of the data set, you may notice that the relationship between \(S_0\) and \(S_1\) appears nonlinear. Find the best linear fit for these data using variable transformation and answer the questions below. You can also use the variable \(\Delta S = S_1 - S_0\) to look for the best fit between the (possibly transformed) initial mass and the increase in mass.

tumor <- data.frame(s0 = c(55,55,80,80,105,105,130,155,155,180,180),
                    s1 = c(73,87,182,206,255,245,272,301,269,294,274))
kable(tumor, col.names = c("$S_0$", "$S_1$"), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(S_0\) \(S_1\)
55 73
55 87
80 182
80 206
105 255
105 245
130 272
155 301
155 269
180 294
180 274
plot(s1 ~ s0, data = tumor)

  1. Calculate the \(R^2\) for \((S_0, \Delta S)\), \((S_0, S_1)\)
# prep
tumor$sd <- tumor$s1 - tumor$s0

# fit models on the original scale
fit1 <- lm(sd ~ s0, data = tumor)
fit2 <- lm(s1 ~ s0, data = tumor)

# r^2
summary(fit1)$r.squared
## [1] 0.2585808
summary(fit2)$r.squared
## [1] 0.7649424
  2. Perform appropriate variable transformation and calculate \(R^2\) for \((S_0^*, \Delta S)\), \((S_0^*, S_1)\)
# prep
tumor$s0_t <- log(tumor$s0) # ln(s_0)

# fit models with the transformed variable
fit3 <- lm(sd ~ s0_t, data = tumor)
fit4 <- lm(s1 ~ s0_t, data = tumor)

# r^2
summary(fit3)$r.squared
## [1] 0.4075046
summary(fit4)$r.squared
## [1] 0.8827907
  3. Test which variable pairs among the four relations above are significantly dependent with \(\alpha\) level of 0.05 and 0.01.
# Extract the statistic
t <- data.frame(fit1 = summary(fit1)$coefficients[2,3],
                fit2 = summary(fit2)$coefficients[2,3],
                fit3 = summary(fit3)$coefficients[2,3],
                fit4 = summary(fit4)$coefficients[2,3])

# Setting parameters
a1 <- 0.05 
a2 <- 0.01
df <- length(tumor$s0)-2

# Calculate critical t (two-sided)
t_crit1 <- qt(1-(a1/2), df)
t_crit2 <- qt(1-(a2/2), df)

# result
abs(t) > t_crit1   # two-sided: compare |t| with the critical value
##       fit1 fit2 fit3 fit4
## [1,] FALSE TRUE TRUE TRUE
abs(t) > t_crit2
##       fit1 fit2  fit3 fit4
## [1,] FALSE TRUE FALSE TRUE
# cf) R^2 cutoff
r2_crit1 <- t_crit1^2 / (t_crit1^2+df)
r2_crit2 <- t_crit2^2 / (t_crit2^2+df)

r2_crit <- data.frame(r2_crit1, r2_crit2); r2_crit
##    r2_crit1 r2_crit2
## 1 0.3624868 0.539911

Notice that for testing significance, \(\hat\beta_1\) and \(r\) use the same \(t\)-statistic (see the theory part above).

Recall that \[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{1}{1+\frac{SSE}{SSR}}\]

  4. Test whether there is “lack of fit” in the models you created, with significance levels of 0.05 and 0.01

library(alr3)
f_lof <- data.frame(fit1 = pureErrorAnova(fit1)[3,4],
                    fit2 = pureErrorAnova(fit2)[3,4],
                    fit3 = pureErrorAnova(fit3)[3,4],
                    fit4 = pureErrorAnova(fit4)[3,4])

# setting parameters
df1 <- pureErrorAnova(fit1)[3,1]
df2 <- pureErrorAnova(fit1)[4,1]
a1 <- 0.05
a2 <- 0.01

# critical F statistic
f_crit1 <- qf(1-a1, df1, df2)
f_crit2 <- qf(1-a2, df1, df2)

# result
f_lof > f_crit1
##      fit1 fit2 fit3 fit4
## [1,] TRUE TRUE TRUE TRUE
f_lof > f_crit2
##      fit1 fit2 fit3  fit4
## [1,] TRUE TRUE TRUE FALSE
  5. Calculate the prediction intervals of the models where \(x_0 = \bar x\) and \(x_0 = \bar x + 2S\) with significance level of 0.05
# prep
S <- data.frame(fit1 = summary(fit1)$sigma,
                fit2 = summary(fit2)$sigma,
                fit3 = summary(fit3)$sigma,
                fit4 = summary(fit4)$sigma)
a <- 0.05

# for x0 = xbar
newdata.fit12 <- data.frame(s0 = mean(tumor$s0))
newdata.fit34 <- data.frame(s0_t = mean(tumor$s0_t))

# prediction interval for x0 = xbar
predict(fit1, newdata.fit12, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 107.09091  11.46674 202.71508
predict(fit2, newdata.fit12, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 223.4545 127.8304 319.0787
predict(fit3, newdata.fit34, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 107.09091  21.60821 192.57361
predict(fit4, newdata.fit34, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 223.4545 155.9300 290.9791
# for x0 = xbar + 2S
newdata.fit1 <- data.frame(s0 = mean(tumor$s0) + 2 * as.numeric(summary(fit1)$sigma))
newdata.fit2 <- data.frame(s0 = mean(tumor$s0) + 2 * as.numeric(summary(fit2)$sigma))
newdata.fit3 <- data.frame(s0_t = mean(tumor$s0_t) + 2 * as.numeric(summary(fit3)$sigma))
newdata.fit4 <- data.frame(s0_t = mean(tumor$s0_t) + 2 * as.numeric(summary(fit4)$sigma))

# prediction interval for x0 = xbar + 2S
predict(fit1, newdata.fit1, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 146.48611  38.43892 254.53331
predict(fit2, newdata.fit2, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 343.7930 235.7458 451.8402
predict(fit3, newdata.fit3, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 4820.8822  534.0701 9107.6942
predict(fit4, newdata.fit4, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
##  9956.777  7281.596 12631.958
  6. Visualize the best regression line in the transformed space and in the original space.
# transformed space
plot(tumor$s0_t, tumor$s1, pch = 16, xlab = "S0_t", ylab = "S1")
lines(tumor$s0_t, fit4$fitted.values)

# original space
slrv <- function(x){fit4$coefficients[2]*log(x)+fit4$coefficients[1]}
plot(tumor$s0, tumor$s1, pch = 16, xlab = "S0", ylab = "S1")
curve(slrv, add = TRUE)

Simple linear regression with matrix notation

For \(n\) observations from a simple linear model of the form \[Y = \beta_0 + \beta_1 x + \epsilon\] we have

\[ Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \end{bmatrix}, \hspace{1cm} \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \hspace{1cm} \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots\\ \epsilon_n \end{bmatrix}\]

The least-squares equations for \(\beta_0\) and \(\beta_1\) were given below \[n \hat\beta_0 + \hat\beta_1\sum_{i=1}^n x_i = \sum_{i=1}^n y_i, \hspace{1cm} \hat\beta_0\sum_{i=1}^n x_i + \hat\beta_1\sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_iy_i\] and because \[X'X = \begin{bmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}\] and \[X'Y = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_iy_i \end{bmatrix}\] if \[\hat\beta = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix}\] We see that the least-squares equations are given by \[(X'X)\hat\beta = X'Y\] Hence, \[\hat\beta = (X'X)^{-1}X'Y\]

Example of simple linear regression with matrix notation (Wackerly p611)

Using matrix notation, perform simple linear regression with the \(n=5\) data points given in the table below.

x y
-2 0
-1 0
0 1
1 1
2 3

Solution

We can see that \[ Y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} 1 & -2 \\ 1 & -1 \\ 1 &\phantom{-}0 \\ 1 &\phantom{-}1 \\ 1 &\phantom{-}2 \end{bmatrix}\] It follows that

\[X'X =\begin{bmatrix} 5 & 0 \\ 0 & 10 \end{bmatrix}, \hspace{1cm} X'Y =\begin{bmatrix} 5 \\ 7 \end{bmatrix},\hspace{1cm} (X'X)^{-1} =\begin{bmatrix} 1/5 & 0 \\ 0 & 1/10 \end{bmatrix} \] Thus, \[\hat\beta = (X'X)^{-1}X'Y =\begin{bmatrix} 1/5 & 0 \\ 0 & 1/10 \end{bmatrix} \begin{bmatrix} 5 \\ 7 \end{bmatrix} =\begin{bmatrix} 1 \\ 0.7 \end{bmatrix} \] or \(\hat\beta_0 = 1\) and \(\hat\beta_1 = 0.7\). Thus, \[\hat y = 1+0.7x\] which is the same result as the one given by the least-squares formulas for \(\hat\beta_0\) and \(\hat\beta_1\).

toy <- data.frame(x = c(-2,-1,0,1,2), y = c(0,0,1,1,3))

X <- as.matrix(data.frame(rep(1,5), toy$x))
Y <- as.matrix(toy$y)
  
tX_X <- t(X) %*% X 
tX_Y <- t(X) %*% Y

inv_tx_x <- solve(tX_X) # inverse matrix

b <- inv_tx_x %*% tX_Y
b <- as.data.frame(t(b))

kable(b, col.names = c("$\\hat \\beta_0$", "$\\hat \\beta_1$"), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(\hat \beta_0\) \(\hat \beta_1\)
1 0.7

Multiple linear regression (extra)

Suppose that we have the linear model \[Y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \epsilon\]

and we make \(n\) independent observations, \(y_1, y_2, ..., y_n\), on \(Y\). We can write the observation \(y_i\) as \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_k x_{ik} + \epsilon_i\] where \(x_{ij}\) is the setting of the \(j_{th}\) independent variable for the \(i_{th}\) observation, \(i=1,2,...,n\).

The coefficients \(\beta_0, ..., \beta_k\) are unknown parameters, and the error terms \(\epsilon_i, 1 \leq i \leq n\), are taken to be independent observations from a \(N(0, \sigma^2)\) distribution.

The expected value of the response variable at \(x = (x_1, ... , x_k)\) is \[E(Y|x) = \beta_0 + \beta_1x_1 + ... + \beta_k x_k\] For example, with \(k = 2\) the expected values of the response variable lie on the plane \(y = \beta_0 + \beta_1x_1 + \beta_2x_2\). The parameter \(\beta_0\) is referred to as the intercept parameter, and the parameter \(\beta_i\) determines how the \(i_{th}\) input variable \(x_i\) influences the response variable when the other input variables are kept fixed. If \(\beta_i > 0\), then the expected value of the response variable increases as the \(i_{th}\) input variable increases (with the other input variables kept fixed), and if \(\beta_i < 0\), then the expected value of the response variable decreases as the \(i_{th}\) input variable increases (with the other input variables kept fixed). If \(\beta_i = 0\), then the response variable is not influenced by changes in the \(i_{th}\) input variable (with the other input variables kept fixed).

To fit the MLR model, we use the method of least squares just as in the SLR model. With normally distributed error terms \(\epsilon_i\), the method of least squares produces the maximum likelihood estimates of the parameters. The sum of squares is

\[Q = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{1i} + ... + \beta_kx_{ki}))^2\] which can be minimized by taking derivatives with respect to each of the unknown parameters, setting the resulting expressions equal to 0, and then simultaneously solving the equations to obtain the parameter estimates \(\hat\beta_i\). The derivative of \(Q\) with respect to \(\beta_0\) is

\[\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{1i}+...+\beta_kx_{ki}))\] and the derivative of \(Q\) with respect to \(\beta_j\), \(1\leq j\leq k\), is \[\frac{\partial Q}{\partial \beta_j} = -2 \sum_{i=1}^{n} x_{ji} (y_i - (\beta_0 + \beta_1 x_{1i}+...+\beta_kx_{ki}))\] Setting these expressions equal to 0 results in the \(k + 1\) equations

\[\begin {aligned} \sum_{i=1}^n y_i &= n\beta_0 + \beta_1\sum_{i=1}^nx_{1i} + \beta_2\sum_{i=1}^nx_{2i} + ... + \beta_k\sum_{i=1}^nx_{ki}\\ \sum_{i=1}^n y_ix_{1i} &= \beta_0\sum_{i=1}^nx_{1i} + \beta_1\sum_{i=1}^nx_{1i}^2 + \beta_2\sum_{i=1}^nx_{2i}x_{1i} + ... + \beta_k\sum_{i=1}^nx_{ki}x_{1i} \\ & \vdots \\ \sum_{i=1}^n y_ix_{ki} &= \beta_0\sum_{i=1}^nx_{ki} + \beta_1\sum_{i=1}^nx_{1i}x_{ki}+ \beta_2\sum_{i=1}^nx_{2i}x_{ki} + ... + \beta_k\sum_{i=1}^nx_{ki}^2 \end{aligned}\] The parameter estimates \(\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k\) are the solutions to these equations, which are known as the normal equations.

The fitted value of the response variable at the data point \((x_{1i}, ..., x_{ki})\) is \[\hat y_i = \hat\beta_0 + \hat\beta_1 x_{1i} + ... + \hat\beta_k x_{ki}\] and the \(i_{th}\) residual is \[e_i = y_i - \hat y_i\] As in simple linear regression, the sum of squares for error is defined to be \[SSE = \sum_{i=1}^n (y_i-\hat y_i)^2 = \sum_{i=1}^n e_i^2\] The estimate of the error variance \(\sigma^2\) is \[MSE = \hat\sigma^2 = \frac{SSE}{n-k-1}\] (where the degrees of freedom for error \(n-k-1\) are equal to the sample size \(n\) minus the number of parameters estimated \(k+1\)), which is distributed as \[\hat\sigma^2 \sim \sigma^2 \frac{\chi^2_{n-k-1}}{n-k-1}\]

Also, as with simple linear regression, the total sum of squares \[SST = \sum_{i = 1}^n (y_i - \bar y)^2\] can be decomposed as \[SST = SSR + SSE\] where the regression sum of squares is \[SSR = \sum_{i=1}^n(\hat y_i - \bar y)^2\] Unlike simple linear regression, for MLR we now have \(k\) degrees of freedom for regression and \(n-k-1\) degrees of freedom for error.

The \(p-value\) in the analysis of variance table is for the null hypothesis \[H_0: \beta_1 = ... = \beta_k = 0\] (with the alternative hypothesis that at least one of these \(\beta_i\) is nonzero), which has the interpretation that the response variable is not related to any of the \(k\) input variables. A large \(p-value\) therefore indicates that there is no evidence that any of the input variables affects the distribution of the response variable \(y\). The coefficient of multiple determination \[R^2 = \frac{SSR}{SST}\] takes a value between 0 and 1 and indicates the amount of total variability in the values of the response variable that is accounted for by the fitted regression model. The F-statistic in the ANOVA table can be written as \[F = \frac{(n-k-1)R^2}{k(1-R^2)}\]

A small \(p-value\) in the analysis of variance table indicates that the response variable is related to at least one of the input variables. Computer output provides the user with the parameter estimates \(\hat\beta_0, \hat\beta_1, ..., \hat\beta_k\) together with their standard errors \(s.e.(\hat\beta_0), s.e.(\hat\beta_1), ... , s.e.(\hat\beta_k)\) and these can be used to determine which input variables are needed in the regression model. If \(\beta_i = 0\), then the distribution of the response variable does not directly depend on the input variable \(x_i\), which can therefore be “dropped” from the model. Consequently, it is useful to test the hypotheses \[H_0 : \beta_i = 0 \ \ versus \ \ H_A : \beta_i \neq 0\] If the null hypothesis is accepted, then there is no evidence that the response variable is directly related to the input variable \(x_i\), and it can therefore be dropped from the model. If the null hypothesis is rejected, then there is evidence that the response variable is related to the input variable, and it should therefore be kept in the model. These hypotheses are tested with the \(t\)-statistics \[t = \frac{\hat\beta_i}{s.e.(\hat\beta_i)}\] which are compared to a t-distribution with \(n-k-1\) degrees of freedom. The (two-sided) \(p-value\) is therefore \[p-value = 2 \times P(X>|t|)\] where the random variable \(X\) has a \(t\)-distribution with \(n-k-1\) degrees of freedom. More generally, the hypotheses \[H_0: \beta_i = b_i \ \ versus \ \ H_A:\beta_i \neq b_i\] for a fixed value \(b_i\) of interest can be tested with the \(t\)-statistic \[t = \frac{\hat\beta_i - b_i}{s.e.(\hat\beta_i)}\] which is again compared to a t-distribution with \(n-k-1\) degrees of freedom.

Typically, model fitting may be performed in the following manner. First, an experimenter starts by fitting all \(k\) input variables. If all variables are needed in the model, then no reduction is necessary. If one or more input variables has a p-value larger than \(10\%\), the variable with the largest p-value (smallest absolute value of the t-statistic) is removed. The reduced model with \(k-1\) input variables is then fitted, and the process is repeated. This is known as a backwards elimination modeling procedure and most computer packages will perform it automatically on request.

Alternatively, a forward selection modeling procedure can be employed whereby the model starts off without any variables, and input variables are then added one at a time until no further additions are needed. Most computer packages can also implement a stepwise procedure that is a combination of the backward elimination and forward selection procedures, and that allows either the addition of new variables or the removal of variables already in the model.
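
In R, this kind of automatic variable selection can be sketched with the base function step(); note that step() adds or drops terms by AIC rather than by the p-value rule described above, and the data set mydata with inputs x1, x2, x3 below is purely hypothetical:

# Backward elimination / stepwise selection with step() (illustrative sketch)
full_model <- lm(y ~ x1 + x2 + x3, data = mydata)       # start from the full model
backward   <- step(full_model, direction = "backward")  # remove one term at a time
stepwise   <- step(full_model, direction = "both")      # allow additions and removals
summary(backward)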

Inferences on the response variable

Suppose that the final model includes the \(k\) input variables \(x_1, ..., x_k\), which may be only a subset of the initial set of input variables that were investigated. The fitted value of the response variable at a specific set of values of the input variables \(x_i^*\) of interest is \[\hat y|_{x^*} = \hat\beta_0 + \hat\beta_1x_1^* + ... + \hat\beta_kx_k^*\] which is an estimate of the expected value of the response variable \(y\) at these values of the input variables. A \(1-\alpha\) confidence level confidence interval for the expected value of the response variable at \(x_i^*\) is \[\hat\beta_0 + \hat\beta_1x_1^* + ... + \hat\beta_kx_k^* \in (\hat y|_{x^*} - t_{\alpha/2, n-k-1}s.e.(\hat y|_{x^*}), \ \hat y|_{x^*} + t_{\alpha/2, n-k-1}s.e.(\hat y|_{x^*}))\]

We now define the following matrices, with \(x_0 = 1\):

\[Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} x_0 & x_{11} & x_{12} & ... & x_{1k} \\ x_0 & x_{21} & x_{22} & ... & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ x_0 & x_{n1} & x_{n2} & ... & x_{nk} \end{bmatrix}, \hspace{1cm} \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_k \end{bmatrix}, \hspace{1cm} \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots\\ \epsilon_n \end{bmatrix}\]

Thus, the n equations representing \(y_i\) as a function of the x’s, \(\beta\)’s, and \(\epsilon\)’s can be simultaneously written as

\[Y = X\beta + \epsilon\] As in simple linear regression, the least-squares equations are given by \[(X'X)\hat\beta = X'Y\]

Hence, \[\hat\beta = (X'X)^{-1}X'Y\] The \(n \times 1\) vector of fitted values is \[\hat Y = \begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{pmatrix} = X\hat\beta\] and the \(n \times 1\) vector of residuals is \[e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = Y - \hat Y = \begin{pmatrix} y_1 - \hat y_1 \\ y_2 - \hat y_2 \\ \vdots \\ y_n - \hat y_n \end{pmatrix}\] The sum of squares for error can then be calculated as \[SSE = e^{'}e = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\] and

\[\hat\sigma^2 = MSE = \frac{SSE}{n-k-1}\]

Summary

Least-squares equations and solutions for a general linear model

  • Equations: \((X'X)\hat\beta = X'Y\)

  • Solutions: \(\hat\beta = (X'X)^{-1} X'Y\)

Nonlinear regression: “Curvilinear” regression (s26)

While linear regression modeling is very versatile and provides useful models for many important applications, it is sometimes necessary to fit a nonlinear model to a data set. In nonlinear regression a function \[f(x_1, ..., x_k; \theta_1, ..., \theta_p)\] is specified that relates the values of k input variables \(x_1, ..., x_k\) to the expected value of a response variable y. This function depends upon a set of unknown parameters \[\theta_1, ..., \theta_p\] As with linear regression, the criterion of least squares is generally employed to fit the model. With a data set

\[(y_1, x_{11}, x_{21}, ..., x_{k1})\\ \vdots \\ (y_n, x_{1n}, x_{2n}, ..., x_{kn})\]

consisting of \(n\) sets of values of the response variable and the \(k\) input variables, the parameter estimates \[\hat\theta_1, ..., \hat\theta_p\] are therefore chosen to minimize the sum of squares \[\sum_{i=1}^n (y_i - f(x_{1i}, ..., x_{ki}; \theta_1,..., \theta_p))^2\] However, unlike linear regression, for a nonlinear regression problem there is in general no simple expression that can be used to calculate the parameter estimates, and in practice they need to be calculated by an iterative computer search procedure. These procedures generally require the user to specify “initial guesses” of the parameter values, which need to be suitably close to the actual values that minimize the sum of squares. The calculation of standard errors for the parameter estimates is also rather awkward for nonlinear regression problems; they are usually based on general asymptotic arguments and should be treated as giving only a general idea of the sensitivity of the parameter estimates. Residual analysis can be used in a similar manner to linear regression problems to estimate the error variance and to assess whether the modeling assumptions appear to be appropriate.
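
For instance, the exponential model from the cell-growth example above could also be fitted directly by nonlinear least squares with nls(); a minimal sketch, using the transformed linear fit as the source of the initial guesses (assumes the cell_growth data frame defined earlier):

# Direct nonlinear least-squares fit of N = g0 * exp(g1 * t)
nls_fit <- nls(Num_cell ~ g0 * exp(g1 * time), data = cell_growth,
               start = list(g0 = exp(2.176), g1 = 0.523))
summary(nls_fit)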

Example of multiple linear regression - Curvilinear Regression Model (Polynomial) (s25, Wackerly p612)

x y
-2 0
-1 0
0 1
1 1
2 3

Fit a parabola to the data above, using the model \[Y = \beta_0 + \beta_1x + \beta_2x^2 +\epsilon\]

Solution

The \(X\) matrix for this example differs from that of the simple linear regression case only by the addition of a third column corresponding to \(x^2\). (Notice that \(x_1 = x\), \(x_2 = x^2\), and \(k = 2\) in the notation of the general linear model.) Thus,

\[Y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & \phantom{-}0 & 0 \\ 1 & \phantom{-}1 & 1 \\ 1 & \phantom{-}2 & 4 \end{bmatrix}\]

Thus, for the first measurement, \(y = 0, x_0 = 1, x = -2\), and \(x^2 = 4\) and for the second measurement, \(y = 0, x_0 = 1, x = -1\), and \(x^2 = 1\). Succeeding rows of the \(Y\) and \(X\) matrices are obtained in a similar manner. The matrix products \(X'X\) and \(X'Y\) are

\[X'X =\begin{bmatrix} \phantom{-}1 & \phantom{-}1 & 1 & 1 & 1 \\ -2 & -1 & 0 & 1 &2 \\ \phantom{-}4 & \phantom{-}1 & 0 & 1 & 4\end{bmatrix} \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & \phantom{-} 0 & 0\\ 1 & \phantom{-} 1 & 1\\ 1 & \phantom{-} 2 & 4\\ \end{bmatrix} =\begin{bmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{bmatrix}\]

\[X'Y =\begin{bmatrix} \phantom{-}1 & \phantom{-}1 & 1 & 1 & 1 \\ -2 & -1 & 0 & 1 &2 \\ \phantom{-}4 & \phantom{-}1 & 0 & 1 & 4\end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix} =\begin{bmatrix} 5 \\ 7 \\ 13 \end{bmatrix}\]

By inverting \(X'X\), we get, \[(X'X)^{-1} =\begin{bmatrix} \phantom{-} 17/35 & 0 & -1/7 \\ \phantom{-} 0 & 1/10 & 0 \\ -1/7 & 0 & \phantom{-} 1/14\end{bmatrix}\]

Finally, \[\begin{aligned} \hat\beta &= (X'X)^{-1} X'Y \\ &=\begin{bmatrix} \phantom{-} 17/35 & 0 & -1/7 \\ \phantom{-} 0 & 1/10 & 0 \\ -1/7 & 0 & \phantom{-} 1/14\end{bmatrix} \begin{bmatrix} 5 \\ 7 \\ 13 \end{bmatrix} =\begin{bmatrix} 4/7 \\ 7/10 \\ 3/14 \end{bmatrix} \approx \begin{bmatrix} 0.571 \\ 0.700 \\ 0.214 \end{bmatrix}\end{aligned}\]

Hence, \(\hat\beta_0 = 0.571, \ \hat\beta_1 = 0.7\), and \(\hat\beta_2 = 0.214\). and the prediction equation is \[\hat y = 0.571 + 0.7x + 0.214 x^2\]

toy <- data.frame(x = c(-2,-1,0,1,2), y = c(0,0,1,1,3))
toy$x2 <- toy$x^2
X <- as.matrix(data.frame(x0 = rep(1,5), x = toy$x, x2 = toy$x2))
Y <- as.matrix(toy$y)
n <- length(Y)
k <- dim(X)[2]-1

# B_hat matrix
B_hat <- solve(t(X) %*% X) %*% t(X) %*% Y; B_hat
##         [,1]
## x0 0.5714286
## x  0.7000000
## x2 0.2142857
Y_hat <- X %*% B_hat # prediction
e <- Y-Y_hat # residual
SSE <- t(e) %*% e # sum of square error
S_hat <- sqrt(SSE/(n-k-1)) # residual standard error (square root of MSE)


# base R
fit <- lm(y ~ x+x2, data = toy)
summary(fit)
## 
## Call:
## lm(formula = y ~ x + x2, data = toy)
## 
## Residuals:
##        1        2        3        4        5 
## -0.02857 -0.08571  0.42857 -0.48571  0.17143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.5714     0.3332   1.715   0.2285  
## x             0.7000     0.1512   4.630   0.0436 *
## x2            0.2143     0.1278   1.677   0.2355  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4781 on 2 degrees of freedom
## Multiple R-squared:  0.9238, Adjusted R-squared:  0.8476 
## F-statistic: 12.13 on 2 and 2 DF,  p-value: 0.07619

For more information, see Wackerly p615 ~ p634.

Logistic regression (extra)
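
As a brief orientation for the code below: the menarche data from the MASS package record, for each age group, the number of girls who have reached menarche (Menarche) out of the number surveyed (Total). We model the proportion with a binomial GLM using the logit link, \[ln\bigg(\frac{p}{1-p}\bigg) = \beta_0 + \beta_1 \times Age\] which is fitted in R by passing cbind(successes, failures) as the response to glm() with family = binomial(logit).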

##Logistic Regression(1)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
## The following object is masked from 'package:dplyr':
## 
##     select
data(menarche)
menarche
##      Age Total Menarche
## 1   9.21   376        0
## 2  10.21   200        0
## 3  10.58    93        0
## 4  10.83   120        2
## 5  11.08    90        2
## 6  11.33    88        5
## 7  11.58   105       10
## 8  11.83   111       17
## 9  12.08   100       16
## 10 12.33    93       29
## 11 12.58   100       39
## 12 12.83   108       51
## 13 13.08    99       47
## 14 13.33   106       67
## 15 13.58   105       81
## 16 13.83   117       88
## 17 14.08    98       79
## 18 14.33    97       90
## 19 14.58   120      113
## 20 14.83   102       95
## 21 15.08   122      117
## 22 15.33   111      107
## 23 15.58    94       92
## 24 15.83   114      112
## 25 17.58  1049     1049
str(menarche)
## 'data.frame':    25 obs. of  3 variables:
##  $ Age     : num  9.21 10.21 10.58 10.83 11.08 ...
##  $ Total   : num  376 200 93 120 90 88 105 111 100 93 ...
##  $ Menarche: num  0 0 0 2 2 5 10 17 16 29 ...
summary(menarche)
##       Age            Total           Menarche      
##  Min.   : 9.21   Min.   :  88.0   Min.   :   0.00  
##  1st Qu.:11.58   1st Qu.:  98.0   1st Qu.:  10.00  
##  Median :13.08   Median : 105.0   Median :  51.00  
##  Mean   :13.10   Mean   : 156.7   Mean   :  92.32  
##  3rd Qu.:14.58   3rd Qu.: 117.0   3rd Qu.:  92.00  
##  Max.   :17.58   Max.   :1049.0   Max.   :1049.00
plot(Menarche/Total~Age, data=menarche)

##Logistic Regression(2)
glm1 <- glm(cbind(Menarche, Total-Menarche)~Age, family=binomial(logit), data=menarche)
glm1
## 
## Call:  glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit), 
##     data = menarche)
## 
## Coefficients:
## (Intercept)          Age  
##     -21.226        1.632  
## 
## Degrees of Freedom: 24 Total (i.e. Null);  23 Residual
## Null Deviance:       3694 
## Residual Deviance: 26.7  AIC: 114.8
plot(Menarche/Total~Age, data = menarche)
lines(menarche$Age, glm1$fitted.values, type='l', col="red")
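
A short usage note on the fitted model (based on the coefficients printed above): the age at which the fitted probability reaches 0.5 is \(-\hat\beta_0/\hat\beta_1 \approx 13\) years, which can be checked with predict().

# Fitted median age at menarche and the predicted probability at age 13
-coef(glm1)[1] / coef(glm1)[2]                                    # about 13.0
predict(glm1, newdata = data.frame(Age = 13), type = "response")  # about 0.5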