Contents

Part I: Linear Regression Model

  • Regression Model, Simple Linear Regression, Least Square Method
  • Inference of Regression Parameters and the Function

Part II: Fitness of Regression Model

  • Coefficient of Determination \(R^2\)
  • Residuals Analysis
  • Correlation Analysis
  • Nonparametric Spearman’s Ranked Correlation Analysis

Part III: Extension of Linear Model

  • Variable Transformation
  • Curvilinear Regression
  • Multiple Linear Regression

Part IV: Summary

  • Summary example 1 - Age vs \(VO_2\)-max
  • Summary example 2 - Initial mass of tumor vs Increased size of tumor

Supplements

  • Simple Linear Regression - eruption
  • Multiple Linear Regression - stackloss
  • Logistic Linear Regression - Menarche

Tutorial

Part I: Linear Regression Model

The method of least squares (s6)

Suppose that we wish to fit the model

\[E(Y) = \beta_0 + \beta_1 x\]

to the set of data points as shown in the figure below. (Note that the independent variable \(x\) could be \(w^2\) or \(w^{1/2}\) or \(\ln(w)\) and so on, for some other independent variable \(w\).)

 

 

The least-squares procedure for fitting a line through a set of \(n\) data points is similar to the method that we might use if we fit a line by eye. A convenient way to accomplish this, and one that yields estimators with good properties, is to minimize the sum of squares of the vertical deviations from the fitted line. Thus, if

\[\hat{y_i}=\hat\beta_0 + \hat\beta_1x_i\] is the predicted value of the \(i_{th}\) \(y\) value (when \(x = x_i\)), then the deviation (or error) of the observed value \(y_i\) from \(\hat{y_i}=\hat\beta_0 + \hat\beta_1x_i\) is the difference \(y_i - \hat y_i\), and the sum of squares of deviations to be minimized is

\[SSE = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\] To minimize SSE, we take partial derivatives with respect to \(\hat\beta_0\) and \(\hat\beta_1\) and set them equal to zero. Thus,

\[\frac{\partial SSE}{\partial \hat\beta_0} = \frac{\partial \{ \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\}}{\partial \hat\beta_0} = -\sum_{i=1}^n 2[y_i - (\hat\beta_0 + \hat\beta_1x_i)] \\ = -2 \bigg(\sum_{i=1}^n y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^nx_i \bigg) = 0\] and

\[\frac{\partial SSE}{\partial \hat\beta_1} = \frac{\partial \{ \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\}}{\partial \hat\beta_1} = -\sum_{i=1}^n 2[y_i - (\hat\beta_0 + \hat\beta_1x_i)]x_i \\ = -2 \bigg(\sum_{i=1}^n x_iy_i - \hat\beta_0\sum_{i=1}^n x_i - \hat\beta_1\sum_{i=1}^nx_i^2 \bigg) = 0\]

The equations \(\frac{\partial SSE}{\partial \hat\beta_0} = 0\) and \(\frac{\partial SSE}{\partial \hat\beta_1} = 0\) are called the least-squares equations for estimating the parameters of a line. The least-squares equations are linear in \(\hat\beta_0\) and \(\hat\beta_1\) and hence can be solved simultaneously. You can verify that the solutions are

\[\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n}\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \bigg( \sum_{i=1}^n x_i \bigg)^2}\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x\] Note that

\(\sum_{i=1}^n \bar x y_i = \frac{1}{n} \sum_{i=1}^n x_i \sum_{i=1}^n y_i\)

and

\(\sum_{i=1}^n \bar x \bar y - \sum_{i=1}^n \bar y x_i = \bar y \sum_{i=1}^n (\bar x - x_i) = 0\),

because

\(\sum_{i=1}^n (\bar x - x_i) = 0\)

We will denote by \(S_{xy}\) the quantity \(\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\),

by \(S_{xx}\) the quantity \(\sum_{i=1}^n (x_i - \bar x)^2\),

and by \(S_{yy}\) the quantity \(\sum_{i=1}^n (y_i - \bar y)^2\).

Then the above equations become \[\hat \beta_1 = \frac{S_{xy}}{S_{xx}}, \ \ \hat\beta_0 = \bar y - \hat \beta_1 \bar x\] where \[\begin{aligned}S_{xx} &= \sum_{i=1}^n (x_i - \bar x)^2 \\ &= \sum_{i=1}^n x_i^2 - 2n\bar x^2 + n \bar x^2 \\ &= \sum_{i=1}^n x_i^2 - n \bar x^2 \\ &= \sum_{i=1}^n x_i^2 - n \bigg(\frac{\sum_{i=1}^n x_i}{n} \bigg)^2 \\ &= \sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}\end{aligned}\]

\[\begin{aligned} S_{xy} &= \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) \\ &= \sum_{i=1}^n x_iy_i - n \bar x \bar y \\ &= \sum_{i=1}^n x_iy_i - n \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n^2} \\ &= \sum_{i=1}^n x_iy_i - \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n}\end{aligned}\] and \[SSE = S_{yy} - \hat \beta_1 S_{xy}\]
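As a quick illustration of these shortcut formulas, they can be coded directly in R. This is a minimal sketch; the helper name `ls_fit` is introduced here for illustration, and the five toy points are the ones used in the examples that follow.

# Least-squares estimates from the shortcut formulas (illustrative sketch)
ls_fit <- function(x, y) {
  n   <- length(x)
  Sxx <- sum(x^2) - sum(x)^2 / n
  Syy <- sum(y^2) - sum(y)^2 / n
  Sxy <- sum(x * y) - sum(x) * sum(y) / n
  b1  <- Sxy / Sxx
  b0  <- mean(y) - b1 * mean(x)
  SSE <- Syy - b1 * Sxy
  list(b0 = b0, b1 = b1, SSE = SSE)
}

# toy data used later in this tutorial; should give b0 = 1, b1 = 0.7, SSE = 1.1
ls_fit(x = c(-2, -1, 0, 1, 2), y = c(0, 0, 1, 1, 3))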

Maximum Likelihood Estimation (MLE) Approach

Recall the model, \[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\epsilon_i \sim N(0, \sigma^2)\)

Then we can find the mean and variance of each \(Y_i\) \[E[Y_i|X_i = x_i] = \beta_0 + \beta_1x_i\] and \[Var[Y_i|X_i = x_i] = \sigma^2\] Additionally, the \(Y_i\) follow a normal distribution conditioned on the \(x_i\), \[Y_i|X_i \sim N(\beta_0+\beta_1 x_i, \sigma^2)\] Recall that the P.D.F. of a random variable \(X \sim N(\mu, \sigma^2)\) is given by \[f_X(x;\mu, \sigma^2) = \frac{1}{\sqrt{{2\pi\sigma^2}}}\exp\bigg[-\frac{1}{2} \bigg(\frac{x-\mu}{\sigma}\bigg)^2\bigg]\] Then we can write the P.D.F. of each of the \(Y_i\) as \[f_{Y_i}(y_i,x_i;\beta_0, \beta_1,\sigma^2) = \frac{1}{\sqrt{{2\pi\sigma^2}}} \exp\bigg[-\frac{1}{2} \bigg(\frac{y_i-(\beta_0+\beta_1x_i)}{\sigma}\bigg)^2\bigg]\] Given \(n\) data points \((x_i, y_i)\) we can write the likelihood, which is a function of the three parameters \(\beta_0, \beta_1\), and \(\sigma^2\). Since the data have been observed, we use lower case \(y_i\) to denote that these values are no longer random. \[L(\beta_0, \beta_1,\sigma^2) = \prod_{i=1}^n\frac{1}{\sqrt{{2\pi\sigma^2}}} \exp\bigg[-\frac{1}{2} \bigg(\frac{y_i-\beta_0-\beta_1x_i}{\sigma}\bigg)^2\bigg]\] Our goal is to find the values of \(\beta_0, \beta_1\), and \(\sigma^2\) that maximize this function, which is a straightforward multivariate calculus problem.

We’ll start by doing a bit of rearranging to make our task easier. \[L(\beta_0, \beta_1,\sigma^2) = \bigg(\frac{1}{\sqrt{{2\pi\sigma^2}}}\bigg)^n \exp\bigg[-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2\bigg]\] Then, as is often the case when finding MLEs, for mathematical convenience we take the natural logarithm of the likelihood function, since log is a monotonically increasing function. We then proceed to maximize the log-likelihood, and the resulting estimates will be the same as if we had not taken the log. \[\log L(\beta_0, \beta_1,\sigma^2) = -\frac{n}{2} \log{(2\pi)} - \frac{n}{2} \log{(\sigma^2)} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2\] Note that we use log to mean the natural logarithm (or \(\ln\)). We now take a partial derivative with respect to each of the parameters.

\[\begin{aligned} \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \beta_0} &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) \\ \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \beta_1} &= \frac{1}{\sigma^2} \sum_{i=1}^n x_i (y_i-\beta_0-\beta_1x_i) \\ \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2 \end{aligned}\]

We then set each of the partial derivatives equal to zero and solve the resulting system of equations.

\[\begin{aligned} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) &= 0\\ \sum_{i=1}^n x_i (y_i-\beta_0-\beta_1x_i) &= 0\\ -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2 &=0 \end{aligned}\]

You may notice that the first two equations also appear in the least-squares approach. Skipping the issue of actually checking that we have found a maximum, we then arrive at our estimates. We call these estimates the maximum likelihood estimates.

\[\begin{aligned} \hat\beta_1 &= \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n} (\sum_{i=1}^n x_i) (\sum_{i=1}^n y_i)}{\sum_{i=1}^n x_i^2 - \frac{1}{n} (\sum_{i=1}^n x_i)^2} = \frac{S_{xy}}{S_{xx}}\\ \hat\beta_0 &= \bar y - \hat\beta_1 \bar x \\ \hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^n (y_i-\hat y_i)^2 \end{aligned}\]
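As a sanity check on these closed forms, the log-likelihood can also be maximized numerically. The sketch below is illustrative only: the vectors `x` and `y` are the five toy points used in the examples later in this part, and `negloglik` is a name introduced here.

# Numerically maximize the log-likelihood and compare with the closed-form MLEs
x <- c(-2, -1, 0, 1, 2); y <- c(0, 0, 1, 1, 3)
negloglik <- function(par) {
  b0 <- par[1]; b1 <- par[2]; s2 <- exp(par[3])  # sigma^2 kept positive via the log scale
  -sum(dnorm(y, mean = b0 + b1 * x, sd = sqrt(s2), log = TRUE))
}
opt <- optim(c(0, 0, 0), negloglik)
c(b0 = opt$par[1], b1 = opt$par[2], sigma2 = exp(opt$par[3]))
# should be close to b0 = 1, b1 = 0.7, sigma2 = SSE/n = 0.22 for this toy data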

Note that \(\hat\beta_0\) and \(\hat\beta_1\) are the same as the least squares estimates. However, we now have a new estimate of \(\sigma^2\), namely \(\hat\sigma^2\). So we now have two different estimates of \(\sigma^2\).

\[\begin{aligned}S_e^2 &= \frac{1}{n-2} \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n-2} \sum_{i=1}^{n}e_i^2 \hspace{1cm} Least \ \ Squares \\ \hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n} \sum_{i=1}^{n}e_i^2 \hspace{1cm} MLE \end{aligned}\]

The least-squares estimate \(S_e^2\) is unbiased, while the MLE \(\hat\sigma^2\) is biased, since the expectation of \(S_e^2\) equals \(\sigma^2\) but the expectation of \(\hat\sigma^2\) does not (see the supplementary tutorial).

 

Question: Is bias always a bad thing?

Try to answer it yourself and see these posts to get more ideas.

Example of parameter estimation in simple linear regression (s7)

Use the method of least squares to fit a straight line to the \(n=5\) data points given in the table below. (Wackerly p571)

x y
-2 0
-1 0
0 1
1 1
2 3

Solution

We commence computation of the least-squares estimates for the slope and intercept of the fitted line by constructing the table below.

Using the results from this table, we obtain

\[\hat \beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n}\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \bigg( \sum_{i=1}^n x_i \bigg)^2} = \frac {7 - \frac{1}{5} (0)(5)}{10 - \frac{1}{5} (0)^2} = 0.7\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x = \frac{5}{5} - (0.7)(0) = 1\] and the fitted line is

\[\hat y = 1 + 0.7x\]
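The code below (and the later chunks of this example) uses a data frame `data` whose construction is not shown in this excerpt. Based on the computation table given in the next example, a minimal construction would be the following (an assumption, not part of the original material):

# Assumed construction of the `data` object used below:
# the five (x, y) points plus xy, x^2, y^2 columns, with row 6 holding the column sums
data <- data.frame(x = c(-2, -1, 0, 1, 2), y = c(0, 0, 1, 1, 3))
data$xy <- data$x * data$y
data$x2 <- data$x^2
data$y2 <- data$y^2
data[6, ] <- colSums(data)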

slr <- function(x){1+0.7*x}
plot(data$x[1:5], data$y[1:5], xlab = "x", ylab = "y", ylim = c(-1.5, 3))
curve(slr, add = TRUE)

# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

Example of variance estimation in simple linear regression (s9)

Find the variances of the estimators \(\hat\beta_0\) and \(\hat\beta_1\) for the table below. (Wackerly p579)

x y
-2 0
-1 0
0 1
1 1
2 3

Solution From the previous example we found that \[n = 5, \hspace{0.5cm} \sum_{i=1}^n x_i = 0, \hspace{0.5cm} \sum_{i=1}^n x_i^2 = 10, \hspace{0.5cm} S_{xx} = 10, \hspace{0.5cm} and \hspace{0.5cm} \bar x = 0\] Thus, \[V(\hat\beta_0) = \frac{\sigma^2\sum_{i=1}^n x_i^2}{nS_{xx}} = \frac{\sigma^2 (10)}{5(10)} = \bigg(\frac{1}{5}\bigg)\sigma^2\] and \[V(\hat\beta_1) = \bigg(\frac{1}{10}\bigg)\sigma^2\] Notice that \(Cov(\hat\beta_0, \hat\beta_1) = 0\) in this case since \(\sum_{i=1}^n x_i = 0\).

To get the answer, we need to estimate \(\sigma^2\). For these data, \(n = 5\) and we have already determined that \[\sum_{i=1}^n y_i = 5, \hspace{0.5cm} S_{xy} = 7, \hspace{0.5cm} and \hspace{0.5cm} \hat\beta_1 = 0.7\] It is easily determined that \(\sum_{i=1}^n y_i^2 = 11\) and that \[S_{yy} = \sum_{i=1}^n(y_i - \bar y)^2 = \sum_{i=1}^ny_i^2 - n(\bar y)^2 = 11 - 5(1)^2 = 6.0\] Therefore, \[SSE = S_{yy} - \hat\beta_1 S_{xy} = 6.0 - (0.7)(7) = 1.1\] and

\[S^2 = \frac{SSE}{n-2} = \frac{1.1}{5-2} = \frac{1.1}{3} = 0.367\]

i x y xy \(x^2\) \(y^2\)
1 -2 0 0 4 0
2 -1 0 0 1 0
3 0 1 0 0 1
4 1 1 1 1 1
5 2 3 6 4 9
Sum 0 5 7 10 11
n <- 5
Sxx <- data$x2[6] - n * (data$x[6]/n)^2
Syy <- data$y2[6] - n * (data$y[6]/n)^2
Sxy <- data$xy[6] - data$x[6] * data$y[6] / n
b1 <- 0.7
b0 <- 1

# estimation of population variance
SSE <- Syy - b1*Sxy
S2 <- SSE/(n-2)
S <- sqrt(S2)

# cf) MLE
S2_mle <- SSE/n
S_mle <- sqrt(S2_mle)

# Formatting
knitr::kable(data.frame(Sxx, Syy, Sxy, SSE, S2, S, S2_mle, S_mle))
Sxx Syy Sxy SSE S2 S S2_mle S_mle
10 6 7 1.1 0.3666667 0.6055301 0.22 0.4690416
# variance of regression coefficient
V_b1 <- S2/Sxx; V_b1; S2/10
## [1] 0.03666667
## [1] 0.03666667
V_b0 <- (S2*data$x2[6])/(n*Sxx); V_b0; S2/5
## [1] 0.07333333
## [1] 0.07333333
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

Example of inference of parameter (s11)

Do the data of previous example present sufficient evidence to indicate that the slope differs from 0? Test using \(\alpha = 0.05\) and give bounds for the attained significance level. (Wackerly p585)

Solution

From the previous example, we can get

\[C_{11} = \frac{1}{S_{xx}} = \frac{1}{10} = 0.1, \hspace{1cm} S = \sqrt{S^2} = \sqrt{0.367} = 0.606\] and \[t = \frac{\hat\beta_1 - 0}{s\sqrt{C_{11}}} = \frac{0.7-0}{0.606\sqrt{0.1}} = 3.65\] If we take \(\alpha = 0.05\), the value of \(t_{\alpha/2} = t_{0.025}\) for 3 df is 3.182 from the t-table, and the rejection region is \[reject \ \ if \ \ |t| \geq 3.182\] Therefore, we reject the null hypothesis that \(\beta_1 = 0\) at the \(\alpha = 0.05\) level of significance.

df <- n-2
c11 <- 1/Sxx
S <- sqrt(S2)
t <- (b1-0)/(S * sqrt(c11))
t_crit <- qt(0.975, df)
pval <- 2*pt(t, df, lower.tail = FALSE)

knitr::kable(data.frame(df, c11, S, t, t_crit, pval))
df c11 S t t_crit pval
3 0.1 0.6055301 3.655631 3.182446 0.0353528
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

Example of inference of interval for \(\beta_i\) (s10)

Calculate a 95% confidence interval for the parameter \(\beta_1\) of the previous example. (Wackerly p586)

Solution From the t-table, we can get the critical t-value, \(t_{0.025}\), based on 3 df: 3.182. Then the 95% confidence interval for \(\beta_1\) is

\[\hat\beta_1 \pm t_{0.025}s\sqrt{C_{11}}\] Substituting, we get \[0.7 \pm (3.182)(0.606)\sqrt{0.1} \hspace{1cm} or \hspace{1cm} 0.7 \pm 0.610\] If we wish to estimate \(\beta_1\) to within 0.15 unit, it is obvious that the confidence interval is too wide and that the sample size must be increased.

b1 <- 0.7
t_crit <- qt(0.975, df)
S <- sqrt(S2)
c11 <- 1/Sxx

lower <- b1 - (t_crit*S*sqrt(c11))
upper <- b1 + (t_crit*S*sqrt(c11))

knitr::kable(data.frame(lower, upper))
lower upper
0.0906079 1.309392
# base R
confint(slr2, "x[1:5]", level=0.95)
##             2.5 %   97.5 %
## x[1:5] 0.09060793 1.309392

Example of inference of \(E(Y)\) (s13)

For the same example, find a 90% confidence interval for \(E(Y)\) when \(x=1\). (Wackerly p592)

Solution For the model of the previous example \[E(Y) = \beta_0 + \beta_1x\] To estimate \(E(Y)\) for any fixed value \(x=x^*\), we use the unbiased estimator \(\hat{E(Y)} = \hat\beta_0 + \hat\beta_1x^*\). Then, \[\hat\beta_0 +\hat\beta_1x^* = 1 + 0.7x^*\] For the case of \(x^* = 1\), and because \(n=5, \ \bar x = 0,\) and \(S_{xx} = 10\), it follows that \[\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} = \frac{1}{5} + \frac{(1-0)^2}{10} = 0.3\] We already know that \(S^2 = 0.367\) or \(S = 0.606\) and \(t_{0.05}\) with \(n-2 = 3\) df is 2.353. Thus, the confidence interval for \(E(Y)\) when \(x=1\) is \[\hat\beta_0 + \hat\beta_1x^* \pm t_{\alpha/2}S\sqrt{\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}\] \[[1 + (0.7)(1)] \pm (2.353)(0.606)\sqrt{0.3} = 1.7 \pm 0.781\] That is, we are 90% confident that, when the independent variable takes on the value \(x=1\), the mean value \(E(Y)\) of the dependent variable is between 0.919 and 2.481. This interval obviously is very wide, but remember that it is based on only 5 data points.

a <- 0.1
n <- 5
b0 <- 1
b1 <- 0.7
df <- n-2
t_crit <- qt(1 - (a/2), df)
x <- 1
y_hat <- b0+b1*x

lower <- y_hat - (t_crit*S*sqrt((1/n)+((x - data$x[6]/n)^2/Sxx)))  # data$x[6]/n is the sample mean of x
upper <- y_hat + (t_crit*S*sqrt((1/n)+((x - data$x[6]/n)^2/Sxx)))

knitr::kable(data.frame(y_hat, lower, upper))
y_hat lower upper
1.7 0.9194776 2.480522
# base R
new.data <- data.frame(x = 1)
predict(slr2, new.data, level=0.90, interval = "confidence")[1,]
## Warning: 'newdata' had 1 row but variables found have 5 rows
##       fit       lwr       upr 
## 1.7000000 0.9194776 2.4805224
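The warning above appears because the model formula indexes its variables (`y[1:5] ~ x[1:5]`), so `predict()` cannot cleanly match the one-row `newdata`; here the first returned row still happens to correspond to the prediction at \(x=1\). One way to avoid the warning, assuming the same `data` frame as above, is to refit on a subset with plain column names (a minimal sketch; `slr3` is a name introduced here):

# Refit with plain column names so that newdata is used as intended
slr3 <- lm(y ~ x, data = data[1:5, ])
predict(slr3, data.frame(x = 1), level = 0.90, interval = "confidence")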

Example of inference of particular value of \(y\) (s14)

Suppose that the experiment that generated the data of previous example is to be run again with \(x=2\). Predict the particular value of \(y\) with \(1-\alpha= 0.90\) (Wackerly p596)

Solution The predicted value of \(y\) with \(x=2\) is \[\hat\beta_0 + \hat\beta_1x^* = 1 + (0.7)(2) = 2.4\] Further, with \(x^*=2\), \[\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} = \frac{1}{5} + \frac{(2-0)^2}{10} = 0.6\] Thus the prediction interval is \[\hat\beta_0+ \hat\beta_1x^* \pm t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}} = 2.4 \pm (2.353)(0.606)\sqrt{1+0.6} = 2.4 \pm 1.804\]

n <- 5
b0 <- 1
b1 <- 0.7
df <- n-2
t_crit <- qt(0.95, df)
x <- 2

y <- b0+b1*x
lower <- b0+b1*x - (t_crit*S*sqrt(1+(1/n)+((x - data$x[6]/n)^2/Sxx)))  # data$x[6]/n is the sample mean of x
upper <- b0+b1*x + (t_crit*S*sqrt(1+(1/n)+((x - data$x[6]/n)^2/Sxx)))

knitr::kable(data.frame(y, lower, upper))
y lower upper
2.4 0.5974608 4.202539
# base R
new.data <- data.frame(x = 2)
predict(slr2, new.data, level=0.90, interval = "prediction")[1,]
## Warning: 'newdata' had 1 row but variables found have 5 rows
##       fit       lwr       upr 
## 2.4000000 0.5974608 4.2025392

As you can see in the figure above, the confidence bands for \(E(Y)\) are narrower than the prediction bands for \(y\).

Part II: Fitness of Regression Model

Example of the analysis of variance table (s17, Hayter p546)

The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon the plant’s production. The data set shown below is compiled and provides the plant’s production and electrical usage for each month of the previous year. The electrical usage is in units of a million kilowatt-hours, and the production is measured as the value in million-dollar units of the cars produced in that month.

Production Usage
Jan 4.51 2.48
Feb 3.58 2.26
Mar 4.31 2.47
Apr 5.06 2.77
May 5.64 2.99
Jun 4.99 3.05
Jul 5.29 3.18
Aug 5.83 3.46
Sep 4.70 3.03
Oct 5.61 3.26
Nov 4.90 2.67
Dec 4.20 2.53

As shown in the scatter plot of the data set, the manager concludes that it is sensible to fit a straight line to the data points. The electricity usage is taken to be the dependent variable \(y\), and the production is taken to be the explanatory variable \(x\). The linear model \(y = \beta_0 + \beta_1x\) will then allow a month’s electrical usage to be estimated as a function of the month’s production. We can perform an ANOVA test for the regression line.

# Fitting linear model
fit <- lm(Usage ~ Production, data = data); summary(fit)
## 
## Call:
## lm(formula = Usage ~ Production, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23245 -0.16704  0.03919  0.13480  0.27645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.40905    0.38599   1.060    0.314    
## Production   0.49883    0.07835   6.367 8.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1729 on 10 degrees of freedom
## Multiple R-squared:  0.8021, Adjusted R-squared:  0.7823 
## F-statistic: 40.53 on 1 and 10 DF,  p-value: 8.176e-05
# ANOVA
res <- summary(aov(fit)); res
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Production   1 1.2124  1.2124   40.53 8.18e-05 ***
## Residuals   10 0.2991  0.0299                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res <- unlist(res)

# R^2 = SSR/SST
R2 <- res[3]/(res[3]+res[4]); names(R2) <- "R2"; R2
##        R2 
## 0.8021094
# visualize
library(ggplot2)
qplot(data$Production, data$Usage, ylab = "Electricity Usage", 
      xlab = "Production", main = "Production vs Electricity Usage") + 
  stat_smooth(method = "lm")

Residual Analysis (s18, Hayter p585)

The residuals are defined to be \[e_i = y_i - \hat y_i, \hspace{1cm} 1\leq i \leq n\] so that they are the differences between the observed values of the dependent variable \(y_i\) and the corresponding fitted values \(\hat y_i\). A property of these residuals is that they sum to 0, \[\sum_{i=1}^n e_i = 0\] as checked numerically after the list below. Residual analysis can be used to:

  • Identify data points that are outliers (residual-leverage plot, Cook’s distance)

  • Check whether the fitted model is appropriate (Scale-location plot)

  • Check whether the error variance is constant (residual plots, standardized residual plots)

  • Check whether the error terms are normally distributed (normal probability plot)
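
A quick numerical check of the zero-sum property, using the `fit` object from the electricity-usage example above:

# The residuals of a fitted lm object sum to (numerically) zero
sum(residuals(fit))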

Here I list some residual plots that can be used as diagnostics for the regression model:

  • Random scatter and no patterns - best fit

  • Random scatter with outliers

  • Nonlinear relationship exists - simple linear regression is not appropriate

  • A funnel-shaped residual plot - non-constant error variance

  • Straight line in a normal scores plot - normally distributed errors

  • Nonlinear line in a normal scores plot - nonlinear patterns

Plot the residuals \(e_i\) against the values of the explanatory variable \(x_i\)

# Residuals vs Fitted
plot(fit$fitted.values, fit$residuals, xlab = "Fitted value", 
     ylab = "Residuals", main = "Residuals vs Fitted")
abline(h=0, col="grey", lty = 2)
abline(h=0.2, col="grey", lty = 2); abline(h=-0.2, col="grey", lty = 2)

# base R - plot.lm
plot(fit, which=1)

Notice that the red line in the residuals vs fitted values plot represents the average value of the residuals at each fitted value (a scatterplot smoother). From this plot, we can identify outliers or nonlinear patterns in the data.

Plot the standardized residual \(\frac{e_i}{\hat\sigma}\) against the values of the explanatory variable \(x_i\) or leverage \(h_i\)

Standardized residual

The standardized residual is simply the ratio of the residual to the estimated error standard deviation, as shown below. \[\frac{e_i}{\hat\sigma}\] We can take the square root of the absolute value of the standardized residual to diminish the skewness of its distribution, as in the scale-location plot.

If we instead scale each residual by the sample estimate \(S\) and take its leverage into account, we get the studentized residual. \[\frac{e_i}{S \sqrt{1-h_i}}\]
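
Both quantities are available directly in R for a fitted `lm` object; note that what R’s `rstandard()` returns is \(e_i/(S\sqrt{1-h_i})\), i.e. what the text above calls the studentized residual. A minimal sketch using the `fit` object from the example above:

# Residuals scaled by the residual standard error, and the leverage-adjusted versions
head(residuals(fit) / sigma(fit))  # e_i / sigma_hat
head(rstandard(fit))               # e_i / (S * sqrt(1 - h_i)), "standardized" in R's terminology
head(rstudent(fit))                # leave-one-out (externally studentized) version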

Leverage (\(h_i\))

An outlier is a data point whose response \(y\) does not follow the general trend of the rest of the data. A data point has high leverage if it has “extreme” predictor \(x\) values. A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results.

The statistic called leverage is given as follows \[h_i = \frac{1}{n} + \frac{(x_i-\bar x)^2}{\sum_{j=1}^n (x_j-\bar x)^2}\] and indicates the proportion of the SST of the \(x\) values contributed by the \(i_{th}\) point. Here I summarize the properties of leverage (a numerical check follows the list below):

  • The leverage \(h_i\) is a measure of the distance between the \(x\) value for the \(i_{th}\) data point and the mean of the \(x\) values for all \(n\) data points.
  • The leverage \(h_i\) is a number between 0 and 1, inclusive.
  • The sum of the \(h_i\) equals \(p\), the number of parameters (regression coefficients including the intercept). Therefore, in simple linear regression case, \(p = 2\).
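
A minimal sketch checking the leverage formula against R’s built-in `hatvalues()`, using the `fit` object from the example above:

# Leverages: built-in versus the formula above
h <- hatvalues(fit)
x <- model.frame(fit)$Production
h_manual <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(h), unname(h_manual))  # TRUE: the formula matches hatvalues()
sum(h)                                  # equals p = 2 for simple linear regression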

 

Cook’s Distance (\(D_i\))

Cook’s distance is a statistic that quantifies the influence of data points. I summarize the definition and properties of Cook’s distance below.

\[D_i = \frac{(y_i - \hat y_i)^2}{2\times MSE}\bigg[\frac{h_{i}}{(1-h_{i})^2}\bigg] = \frac{e_i^2}{2\times MSE}\bigg[\frac{h_{i}}{(1-h_{i})^2}\bigg]\]

  • \(D_i\) directly summarizes how much all of the fitted values change when the \(i_{th}\) observation is deleted

  • A data point having a large \(D_i\) indicates that the data point strongly influences the fitted values
    • If \(D_i\) is greater than 0.5, then the \(i_{th}\) data point is worthy of further investigation as it may be influential
    • If \(D_i\) is greater than 1, then the \(i_{th}\) data point is quite likely to be influential
    • or, if \(D_i\) clearly stands out from the other \(D_i\) values, it is almost certainly influential.
  • For more information on \(D_i\), please see this page (a numerical check against R’s built-in function follows the plots below)
# Standardized residual vs Fitted
fit.stdres <- rstandard(fit)
plot(fit$fitted.values, fit.stdres, xlab = "Fitted value", 
     ylab = "Standardized Residuals", ylim = c(-2, 2), main = "Standardized residuals vs Fitted")
abline(h=0, col="grey", lty = 3)
abline(h=1, col="grey", lty = 2); abline(h=-1, col="grey", lty = 2)
abline(h=2, col="grey", lty = 2); abline(h=-2, col="grey", lty = 2)

# Square root |Standardized residual| vs Fitted
plot(fit$fitted.values, sqrt(abs(fit.stdres)), xlab = "Fitted value", ylim = c(0, 1.3),
     ylab = bquote(sqrt(abs("Standardized residuals"))), main = "Scale-Location")

plot(fit, which=3)

# Cook's distance - Influential outlier detection
plot(fit, which=4)

# Standardized residual vs leverage - Influential outlier detection
plot(fit, which=5)
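
As a numerical check on the definition of \(D_i\) above, Cook’s distances can be recomputed from the residuals, the MSE, and the leverages (a sketch using the same `fit` object; the factor 2 is \(p\), the number of regression parameters):

# Cook's distance from its definition, compared with the built-in cooks.distance()
e <- residuals(fit)
h <- hatvalues(fit)
MSE <- sum(e^2) / df.residual(fit)
D_manual <- (e^2 / (2 * MSE)) * (h / (1 - h)^2)
all.equal(unname(D_manual), unname(cooks.distance(fit)))  # TRUE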

Plot a normal probability plot (also referred to as a normal scores plot, or Q-Q plot)

The normal scores plot is used to check whether the error terms \(\epsilon_i\) appear to be normally distributed. In such a plot the residuals \(e_i\) are plotted against their “normal scores”, which can be thought of as their “expected values” if they follow a normal distribution. The normal score of the \(i_{th}\) smallest residual is \[\Phi^{-1}\Bigg(\frac{i - \frac{3}{8}}{n+\frac{1}{4}}\Bigg)\]
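
These normal scores can be computed directly with `qnorm()` and plotted against the sorted residuals. This is a minimal sketch using the residuals of the `fit` object above; note that `qqnorm()` itself relies on `ppoints()`, whose plotting-position constant differs slightly for larger samples.

# Normal scores from the (i - 3/8)/(n + 1/4) formula versus the sorted residuals
e_sorted <- sort(residuals(fit))
n <- length(e_sorted)
scores <- qnorm((seq_len(n) - 3/8) / (n + 1/4))
plot(scores, e_sorted, xlab = "Normal scores", ylab = "Sorted residuals")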

qqnorm(fit.stdres, ylab = "Standardize residuals", xlab = "Normal scores", 
       main = "Normal Probability Plot", ylim = c(-1.5, 1.9))
qqline(fit.stdres, lty=3, col="grey")

# Q-Q plot
plot(fit, which=2)

Example of lack of fit test (s18)

A perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model would give a zero error sum of squares, i.e. \(SSE = 0\). However, if you conduct the same experiment with the same explanatory variable \(x_i\) and record the response variable \(y_i\) multiple times, the observations from the second trial may deviate from the perfect model. This error originates from “purely” random variation or noise. The sum of squares due to this error, called the sum of squares due to pure error and abbreviated \(SSPE\), quantifies these variations. \(SSPE\) is calculated by taking repeated observations at some or all values of \(x_i\) and adding up the squared deviations at each level of \(x_i\). The remaining error is called the sum of squares due to lack of fit and is abbreviated \(SSLF\). Therefore,

\[SSE = SSPE + SSLF\] As you can see in the table below, in the lack of fit test we analyze the residual error SSE in terms of SSLF and SSPE by taking the mean square of each error separately.

Suppose that we built a simple linear regression model between the size of a cancer mass (\(x\)) and its increase in size (\(y\)) over a given time, despite its curved distribution as shown below. The regression showed that \(\hat\beta_0=50.7\) and \(\hat\beta_1=0.49\). (Calculate \(\beta_0\) and \(\beta_1\), the variance, and the confidence intervals of \(\beta_0\), \(\beta_1\), and \(\hat y\) by yourself too; set \(\alpha=0.05\).)

 

Solution

 

Mass Increase_t1 Increase_t2
75 28 85
75 42 88
100 112 97
100 136 100
125 160 108
125 150 113
150 152 126
175 156 134
175 124 139
200 124 147
200 104 150

cancer_mass <- read.table("http://bisyn.kaist.ac.kr/bis335/10-Regression-tumorsize.txt", 
                          sep = "\t", header = T)
colnames(cancer_mass) <- c("Mass", "Increase_t1", "Increase_t2")

# Find b0, b1, yhat
fit <- lm(Increase_t1 ~ Mass, data=cancer_mass)
summary(fit)
## 
## Call:
## lm(formula = Increase_t1 ~ Mass, data = cancer_mass)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.23 -34.06  12.61  32.44  48.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  50.7225    39.3979   1.287     0.23
## Mass          0.4867     0.2747   1.772     0.11
## 
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared:  0.2586, Adjusted R-squared:  0.1762 
## F-statistic: 3.139 on 1 and 9 DF,  p-value: 0.1102
knitr::kable(data.frame(x = cancer_mass$Mass, y_hat = fit$fitted.values))
x y_hat
75 87.22513
75 87.22513
100 99.39267
100 99.39267
125 111.56021
125 111.56021
150 123.72775
175 135.89529
175 135.89529
200 148.06283
200 148.06283
# Manual calculation of Lack of fit ANOVA table
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- cancer_mass

# Degrees of freedom
df.SSPE <- length(data$Mass) - length(unique(data$Mass))
df.SSLF <- length(unique(data$Mass)) - 2

# calculation of SSPE & SSLF  
data2 <- data %>% group_by(Mass) %>% summarise(Group_mean = mean(Increase_t1))
data <- left_join(data, data2, by="Mass")
data$SSPE <- (data$Increase_t1 - data$Group_mean)^2
data$Pred <- data$Mass * fit$coefficients[2] + fit$coefficients[1]
data$SSLF <- (data$Group_mean - data$Pred)^2
SSPE <- sum(data$SSPE)
SSLF <- sum(data$SSLF)

# calculation of F statistic
MSLF <- SSLF/df.SSLF
MSPE <- SSPE/df.SSPE
a <- 0.05

f <- round(MSLF / MSPE, 4)
f_crit <- qf(1-a, df1 = df.SSLF, df2 = df.SSPE); f_crit
## [1] 5.192168
# pvalue
p <- round(1 - pf(f, df.SSLF, df.SSPE),4)

# summary table
knitr::kable(data.frame(cbind(DF = c(df.SSLF, df.SSPE), 
                              SS = c(SSLF, SSPE), 
                              MS = c(MSLF, MSPE), 
                              F = c(f,NA), 
                              p = c(p, NA)
                              ) # end of cbind
                        ) # end of data.frame
             ) # end of kable
DF SS MS F p
4 13593.57 3398.393 14.8014 0.0056
5 1148.00 229.600 NA NA
# with R package
library(alr3)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
fit <- lm(Increase_t1 ~ Mass, data = cancer_mass)
summary(fit)
## 
## Call:
## lm(formula = Increase_t1 ~ Mass, data = cancer_mass)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.23 -34.06  12.61  32.44  48.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  50.7225    39.3979   1.287     0.23
## Mass          0.4867     0.2747   1.772     0.11
## 
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared:  0.2586, Adjusted R-squared:  0.1762 
## F-statistic: 3.139 on 1 and 9 DF,  p-value: 0.1102
anova(fit)
## Analysis of Variance Table
## 
## Response: Increase_t1
##           Df  Sum Sq Mean Sq F value Pr(>F)
## Mass       1  5141.3  5141.3  3.1389 0.1102
## Residuals  9 14741.6  1638.0
pureErrorAnova(fit)
## Analysis of Variance Table
## 
## Response: Increase_t1
##              Df  Sum Sq Mean Sq F value   Pr(>F)   
## Mass          1  5141.3  5141.3  22.393 0.005186 **
## Residuals     9 14741.6  1638.0                    
##  Lack of fit  4 13593.6  3398.4  14.801 0.005594 **
##  Pure Error   5  1148.0   229.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value is less than the significance level of our test, we conclude that there is a lack of fit in the simple linear regression model.

Example of Correlation analysis (s20, s22, Wackerly p600)

The data in the table below represent a sample of mathematics achievement test scores and calculus grades for ten independently selected college freshmen. From this evidence, would you say that the achievement test scores and calculus grades are independent? Use \(\alpha = 0.05\). Identify the corresponding attained significance level.

Student Math_Test_Score Final_Calculus_Grade
1 39 65
2 43 78
3 21 52
4 64 82
5 57 92
6 47 89
7 28 73
8 75 98
9 34 56
10 52 75

Solution

  • \(H_0: \rho = 0\)

  • \(H_a: \rho \neq 0\)

  • test statistic: \(t = \frac {r\sqrt{n-2}}{\sqrt{1-r^2}}\)

data <- data.frame(student = 1:10, 
                   x = c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52), 
                   y = c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
                   )
data2 <- data
# setting parameters
n <- length(data$student)
df <- n-2
a <- 0.05

# calculation of SS
data2$xy <- data2$x * data$y
data2$x2 <- data2$x^2
data2$y2 <- data2$y^2
data2[length(data2$student)+1,] <- colSums(data2)

SS <- data2[11,-1]; SS
##      x   y    xy    x2    y2
## 11 460 760 36854 23634 59816
Sxx <- n*SS[4] - SS[1]^2  # note: these are n*Sxx, n*Syy, n*Sxy; the common factor n cancels in r
Syy <- n*SS[5] - SS[2]^2
Sxy <- n*SS[3] - SS[1]*SS[2]

knitr::kable(data.frame(Sxx = as.numeric(Sxx), 
                        Syy = as.numeric(Syy), 
                        Sxy = as.numeric(Sxy)))
Sxx Syy Sxy
24740 20560 18940
# Calculation of r and t
r <- as.numeric(Sxy/sqrt(Sxx*Syy))
t <- (r * sqrt(n-2))/(sqrt(1 - r^2))
t_crit <- qt(1-(a/2), df)
p <- 2 * pt(t, df, lower.tail = FALSE)

knitr::kable(data.frame(r = round(r,3), 
                        t = round(t,3), 
                        t_crit = round(t_crit,3), 
                        p = round(p,5)), 
             align='c')
r t t_crit p
0.84 4.375 2.306 0.00236
# base R
cor.test(data$x, data$y)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x and data$y
## t = 4.375, df = 8, p-value = 0.002365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4459632 0.9611846
## sample estimates:
##       cor 
## 0.8397859
# cf) Spearman rank correlation (for more information, see nonparametric test tutorial)
cor.test(data$x, data$y, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  data$x and data$y
## S = 20, p-value = 0.001977
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8787879

Example of Nonparametric Spearman’s Ranked Correlation Test

We discussed Spearman’s ranked correlation in the lecture on nonparametric tests. As you already know, the statistic, abbreviated \(r_S\), is a nonparametric version of Pearson’s correlation statistic. (The formula is the same as Pearson’s formula except that it uses the ranks instead of the actual values of \(x_i\) and \(y_i\).) When there are no ties in either the \(x\) observations or the \(y\) observations, the expression for \(r_S\) algebraically reduces to a simpler expression:

\[r_S = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}, \hspace{1cm} where \ d_i =R(x_i) - R(y_i)\] The statistical significance of Spearman’s correlation statistic can be determined with the same t statistic used for Pearson’s correlation statistic. Now, let’s consider a simple example: the correlation between IQ and the hours of TV watched per week.

IQTV <- data.frame(IQ = c(86, 97, 99, 100, 101, 103, 106, 110, 112, 113), 
                   TV = c(0, 20, 28, 27, 50, 29, 7, 17, 6, 12))

n <- length(IQTV$IQ)
df <- n-2
IQTV$rank_IQ <- rank(IQTV$IQ)
IQTV$rank_TV <-  rank(IQTV$TV)
IQTV$d <- IQTV$rank_IQ - IQTV$rank_TV
IQTV$d2 <- IQTV$d^2

knitr::kable(IQTV)
IQ TV rank_IQ rank_TV d d2
86 0 1 1 0 0
97 20 2 6 -4 16
99 28 3 8 -5 25
100 27 4 7 -3 9
101 50 5 10 -5 25
103 29 6 9 -3 9
106 7 7 3 4 16
110 17 8 5 3 9
112 6 9 2 7 49
113 12 10 4 6 36
rs <- 1-6*sum(IQTV$d2)/(n*(n^2-1))
t <- (rs * sqrt(n-2))/(sqrt(1 - rs^2))
t_crit <- -qt(1-(a/2),df)
p <- 2*pt(t, df)

knitr::kable(data.frame(rs = round(rs,3), 
                        t = round(t,3), 
                        t_crit = round(t_crit,3), 
                        p = round(p,5)), 
             align='c')
rs t t_crit p
-0.176 -0.505 -2.306 0.62719
# base R
cor.test(IQTV$IQ, IQTV$TV, method = "spearman", conf.level = 0.95)
## 
##  Spearman's rank correlation rho
## 
## data:  IQTV$IQ and IQTV$TV
## S = 194, p-value = 0.632
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1757576

The test result implies that there is no significant correlation between IQ and hours of TV watched per week, although the negative sign suggests there could be a weak negative correlation between the two.

Part III: Extension of Linear Model

Example of Variable transformation (s25)

A bioengineer measures the growth rate of a substance by counting the number of cells \(N\) present at various times \(t\), as shown below. Fit the model \(N=\gamma_0e^{\gamma_1t}\) and calculate a two-sided 95% confidence interval for the unknown parameter \(\gamma_1\).

time Num_cell
1 12
2 31
3 42
4 75
5 119
6 221
7 327
8 546

Solution The model \(N=\gamma_0e^{\gamma_1t}\) with two unknown parameters \(\gamma_0\) and \(\gamma_1\) is commonly used for this problem. With a logarithmic transformation, the linear relationship \[y = \beta_0 + \beta_1 x\] is obtained with \(y = \ln(N)\), \(\beta_0 = \ln(\gamma_0)\), \(\beta_1 = \gamma_1\), \(x = t\).

A straight line is fitted to the transformed data points \((x_i, y_i)\) and the fitted regression line is \[y = 0.523x + 2.176\] The equation can be transformed back into a model in terms of the original variables \[N = e^{\ln(\gamma_0) + \gamma_1t}=e^{2.176 + 0.523t}\] \[\beta_1 = \gamma_1 \in (0.472, 0.573)\]

cell_growth <- data.frame(time = 1:8, Num_cell = c(12, 31, 42, 75, 119, 221, 327, 546))

# transform of variables
cell_growth$lnt <- log(cell_growth$time)
cell_growth$lnN <- log(cell_growth$Num_cell)

# fit linear model directly
model1 <- lm(Num_cell ~ time, data = cell_growth)
summary(model1)
## 
## Call:
## lm(formula = Num_cell ~ time, data = cell_growth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -87.14 -56.16 -21.64  44.47 132.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -139.04      64.20  -2.166  0.07350 . 
## time           69.04      12.71   5.430  0.00162 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.39 on 6 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8027 
## F-statistic: 29.49 on 1 and 6 DF,  p-value: 0.001617
plot(Num_cell ~ time, data = cell_growth, pch = 16, 
     ylab = "N", xlab = "t")
lines(cell_growth$time, model1$fitted.values)

# fit linear model with transformed variable
model2 <- lm(lnN ~ time, data = cell_growth)
summary(model2)
## 
## Call:
## lm(formula = lnN ~ time, data = cell_growth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21446 -0.05110 -0.01050  0.05717  0.21144 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.17619    0.10357   21.01 7.57e-07 ***
## time         0.52318    0.02051   25.51 2.39e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1329 on 6 degrees of freedom
## Multiple R-squared:  0.9909, Adjusted R-squared:  0.9893 
## F-statistic: 650.7 on 1 and 6 DF,  p-value: 2.391e-07
plot(lnN ~ time, data = cell_growth, pch = 16, 
     ylab = "ln(N)", xlab = "t")
lines(cell_growth$time, model2$fitted.values)

slr <- function(x) exp(2.176 + 0.523*x)
plot(cell_growth$time, cell_growth$Num_cell, pch = 16, 
     ylab = "N", xlab = "t")
curve(slr, add = TRUE)

# confidence interval
confint(model2, "time", level=0.95)
##         2.5 %    97.5 %
## time 0.472994 0.5733623

Example of Multiple Linear Regression (s25)

For multiple linear regression, we define the following matrices, with \(x_0 = 1\):

\[Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} x_0 & x_{11} & x_{12} & ... & x_{1k} \\ x_0 & x_{21} & x_{22} & ... & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ x_0 & x_{n1} & x_{n2} & ... & x_{nk} \end{bmatrix}, \hspace{1cm} \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_k \end{bmatrix}, \hspace{1cm} \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots\\ \epsilon_n \end{bmatrix}\]

Thus, the \(n\) equations representing \(y_i\) as a function of the \(x\)’s, \(\beta\)’s, and \(\epsilon\)’s can be written simultaneously as

\[Y = X\beta + \epsilon\] As in simple linear regression, the least-squares equations are given by \[(X'X)\hat\beta = X'Y\]

Hence, \[\hat\beta = (X'X)^{-1}X'Y\] The \(n \times 1\) vector of fitted values is \[ \hat Y = \begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{pmatrix} = X\hat\beta\] and the \(n \times 1\) vector of residuals is \[ e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = Y - \hat Y = \begin{pmatrix} y_1 - \hat y_1 \\ y_2 - \hat y_2 \\ \vdots \\ y_n - \hat y_n \end{pmatrix}\] The sum of squares for error can then be calculated as \[SSE = e^{'}e = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\] and \[\hat\sigma^2 = MSE = \frac{SSE}{n-k-1}\] where \(k\) indicates number of input variables.

Example of multiple linear regression - Curvilinear Regression Model (Polynomial) (s25, Wackerly p612)

x y
-2 0
-1 0
0 1
1 1
2 3

Fit a parabola to the data above, using the model \[Y = \beta_0 + \beta_1x + \beta_2x^2 +\epsilon\]

Solution The \(X\) matrix for this example differs from that of the simple linear regression case only by the addition of a third column corresponding to \(x^2\). (Notice that \(x_1 = x\), \(x_2 = x^2\), and \(k = 2\) in the notation of the general linear model.) Thus,

\[Y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & \phantom{-}0 & 0 \\ 1 & \phantom{-}1 & 1 \\ 1 & \phantom{-}2 & 4 \end{bmatrix}\]

Thus, for the first measurement, \(y = 0, x_0 = 1, x = -2\), and \(x^2 = 4\) and for the second measurement, \(y = 0, x_0 = 1, x = -1\), and \(x^2 = 1\). Succeeding rows of the \(Y\) and \(X\) matrices are obtained in a similar manner. The matrix products \(X'X\) and \(X'Y\) are

\[X'X =\begin{bmatrix} \phantom{-}1 & \phantom{-}1 & 1 & 1 & 1 \\ -2 & -1 & 0 & 1 &2 \\ \phantom{-}4 & \phantom{-}1 & 0 & 1 & 4\end{bmatrix} \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & \phantom{-} 0 & 0\\ 1 & \phantom{-} 1 & 1\\ 1 & \phantom{-} 2 & 4\\ \end{bmatrix} =\begin{bmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{bmatrix}\]

\[X'Y =\begin{bmatrix} \phantom{-}1 & \phantom{-}1 & 1 & 1 & 1 \\ -2 & -1 & 0 & 1 &2 \\ \phantom{-}4 & \phantom{-}1 & 0 & 1 & 4\end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix} =\begin{bmatrix} 5 \\ 7 \\ 13 \end{bmatrix}\]

By inverting \(X'X\), we get, \[(X'X)^{-1} =\begin{bmatrix} \phantom{-} 17/35 & 0 & -1/7 \\ \phantom{-} 0 & 1/10 & 0 \\ -1/7 & 0 & \phantom{-} 1/14\end{bmatrix}\]

Finally, \[\begin{aligned} \hat\beta &= (X'X)^{-1} X'Y \\ &=\begin{bmatrix} \phantom{-} 17/35 & 0 & -1/7 \\ \phantom{-} 0 & 1/10 & 0 \\ -1/7 & 0 & \phantom{-} 1/14\end{bmatrix} \begin{bmatrix} 5 \\ 7 \\ 13 \end{bmatrix} =\begin{bmatrix} 4/7 \\ 7/10 \\ 3/14 \end{bmatrix} \approx \begin{bmatrix} 0.571 \\ 0.700 \\ 0.214 \end{bmatrix}\end{aligned}\]

Hence, \(\hat\beta_0 = 0.571, \ \hat\beta_1 = 0.7\), and \(\hat\beta_2 = 0.214\). and the prediction equation is \[\hat y = 0.571 + 0.7x + 0.214 x^2\]

toy <- data.frame(x = c(-2,-1,0,1,2), y = c(0,0,1,1,3))
toy$x2 <- toy$x^2
X <- as.matrix(data.frame(x0 = rep(1,5), x = toy$x, x2 = toy$x2))
Y <- as.matrix(toy$y)
n <- length(Y)
k <- dim(X)[2]-1

# B_hat matrix
B_hat <- solve(t(X) %*% X) %*% t(X) %*% Y; B_hat
##         [,1]
## x0 0.5714286
## x  0.7000000
## x2 0.2142857
Y_hat <- X %*% B_hat # prediction
e <- Y-Y_hat # residual
SSE <- t(e) %*% e # sum of square error
S_hat <- sqrt(SSE/(n-k-1)) # residual standard error (or MSE)


# base R
fit <- lm(y ~ x+x2, data = toy)
summary(fit)
## 
## Call:
## lm(formula = y ~ x + x2, data = toy)
## 
## Residuals:
##        1        2        3        4        5 
## -0.02857 -0.08571  0.42857 -0.48571  0.17143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.5714     0.3332   1.715   0.2285  
## x             0.7000     0.1512   4.630   0.0436 *
## x2            0.2143     0.1278   1.677   0.2355  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4781 on 2 degrees of freedom
## Multiple R-squared:  0.9238, Adjusted R-squared:  0.8476 
## F-statistic: 12.13 on 2 and 2 DF,  p-value: 0.07619

For more information on the equations for the inference of parameters of the MLR model, see Wackerly p615 ~ p634.

Part IV: Summary

Summary example 1 - \(VO_2\)-max Aerobic Fitness Measurements (s27, Hayter p560)

The table below is a data set concerning the aerobic fitness of a sample of 20 male subjects. An exercising individual breathes through an apparatus that measures the amount of oxygen in the inhaled air that is used by the individual. The maximum value per unit time of the utilized oxygen is then scaled by the person’s body weight to come up with a variable \(VO_2\)-max, which is a general indication of the aerobic fitness of the individual. In other words, \(VO_2\)-max quantifies the maximum rate of oxygen consumption measured during incremental exercise. Fit a linear regression model with \(VO_2\)-max as the dependent variable and age as the explanatory variable. Solve the questions below.

VO2.max Age
23 59
45 47
29 44
55 32
48 45
42 61
32 71
33 32
34 28
52 36
40 36
35 51
45 31
47 44
26 73
42 47
35 40
41 43
29 68
38 40
  1. Estimate intercept (\(\hat\beta_0\)) and slope (\(\hat\beta_1\)) parameters.
vo2 <- read.csv("http://bisyn.kaist.ac.kr/bis335/10-Regression-vo2max-aerobic-fitness.csv")

# prep
library(knitr)       # for kable()
library(kableExtra)  # for kable_styling(), used in the tables below
colnames(vo2) <- c("y", "x")
data <- vo2 %>% select(x, y) %>% mutate(xy = x*y, x2 = x^2, y2 = y^2)

# SS
n <- length(data$x)
SS <- colSums(data)
Sxx <- SS[4] - SS[1]^2/n
Syy <- SS[5] - SS[2]^2/n
Sxy <- SS[3] - SS[1]*SS[2]/n
kable(data.frame(Sxx = as.numeric(Sxx), Syy = as.numeric(Syy), Sxy = as.numeric(Sxy)), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
Sxx Syy Sxy
3486.8 1428.95 -1177.4
# Slope & Intercept
b1 <- Sxy/Sxx
b0 <- mean(data$y) - b1*mean(data$x)

kable(data.frame(intercept = as.numeric(round(b0,4)), 
                        slope = as.numeric(round(b1,4))), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
intercept slope
54.2181 -0.3377
  1. Calculate \(r^2\), \(r\), and determine the significance of \(r\) using \(t\)-statistic
df <- n - 2
SSE <- as.numeric(Syy - b1*Sxy)
S2 <- SSE/df # error variance

# R^2 & r
SSR <- b1*Sxy
SST <- Syy
r2 <- SSR/SST
r <- Sxy/sqrt(Sxx*Syy)

# pvalue
t <- (r * sqrt(n-2))/(sqrt(1 - r^2))
t_crit <- qt(1-(a/2), df)
p <- 2 * pt(abs(t), df, lower.tail = FALSE)  # two-sided p-value

library(kableExtra)
kable(data.frame(S2 = as.numeric(round(S2,4)),
                        r2 = as.numeric(round(r2,4)), 
                        r = as.numeric(round(r,4)),
                        sqrt_r2 = as.numeric(round(-sqrt(r2),4)),
                        p = as.numeric(round(p,4))),
             col.names = c("$S^2$", "$r^2$", "r", "$-\\sqrt {r^2}$", "p"), 
             align = 'c', escape = FALSE, format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(S^2\) \(r^2\) r \(-\sqrt {r^2}\) p
57.2985 0.2782 -0.5275 -0.5275 0.0168
  1. Find one-sided 95% confidence interval for \(\hat\beta_1\)
a <- 0.05
t_crit <- qt(1 - a, df)
S <- sqrt(S2)
c11 <- 1/Sxx

upper <- b1 + (t_crit*S*sqrt(c11)); upper
##         xy 
## -0.1153818
kable(data.frame(lower = "-$\\infty$", 
                 upper = as.numeric(round(upper,4))), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
lower upper
-\(\infty\) -0.1154
  1. Find two-sided 95% confidence and prediction interval for 50 years old male
# Confidence interval
t_crit <- qt(1 - (a/2), df)
x <- 50
y_hat <- b0+b1*x
K <- (1/n)+((x - mean(vo2$x))^2/Sxx)

lower <- y_hat - (t_crit*S*sqrt(K))
upper <- y_hat + (t_crit*S*sqrt(K))

kable(data.frame(y_hat = as.numeric(round(y_hat, 4)), 
                 lower = as.numeric(round(lower, 4)),
                 upper = as.numeric(round(upper,4))),
      col.names = c("$\\hat y$", "lower", "upper"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(\hat y\) lower upper
37.3344 33.6485 41.0202
# Prediction interval
lower <- y_hat - (t_crit*S*sqrt(1+K))
upper <- y_hat + (t_crit*S*sqrt(1+K))

kable(data.frame(y_hat = as.numeric(round(y_hat, 4)), 
                 lower = as.numeric(round(lower, 4)),
                 upper = as.numeric(round(upper,4))),
      col.names = c("$\\hat y$", "lower", "upper"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(\hat y\) lower upper
37.3344 21.0097 53.659
  1. Using base R function, perform above analysis
vo2 <- read.csv("http://bisyn.kaist.ac.kr/bis335/10-Regression-vo2max-aerobic-fitness.csv")

# 1. lm
fit <- lm(VO2.max ~ Age, data = vo2)
fit$coefficients # b0, b1
## (Intercept)         Age 
##  54.2180509  -0.3376735
# 2. r^2, r
## r^2
r2 <- summary(fit)
kable(data.frame(r2 = as.numeric(round(r2$r.squared, 4)), 
                 f = as.numeric(round(r2$f[1], 4)),
                 p = as.numeric(round(r2$coefficients[2,4],4)),
                 sigma = as.numeric(round(r2$sigma,4))),
      col.names = c("$r^2$", "F", "p", "$\\hat\\sigma$"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(r^2\) F p \(\hat\sigma\)
0.2782 6.9387 0.0168 7.5696
## r, t, p
r <- cor.test(vo2$VO2.max, vo2$Age)

kable(data.frame(r = as.numeric(round(r$estimate, 4)), 
                 t = as.numeric(round(r$statistic, 4)),
                 p = as.numeric(round(r$p.value,4))),
      col.names = c("$r$", "t", "p"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(r\) t p
-0.5275 -2.6341 0.0168
# 3. conf.int for b1: the one-sided 95% upper bound equals the upper limit of the 90% two-sided interval
confint(fit, "Age", level = 1-2*a, "one-sided")[2]
## [1] -0.1153818
# 4. conf.int and pred.int for y_hat
newdata <- data.frame(Age = 50)
predict(fit, newdata, level=1-a, interval = "confidence")[1,]
##      fit      lwr      upr 
## 37.33438 33.64853 41.02022
predict(fit, newdata, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 37.33438 21.00974 53.65901
# 5. visualize
library(ggplot2)
qplot(vo2$Age, vo2$VO2.max, ylab = "VO2 Max", 
      xlab = "Age", main = "VO2 max vs Age") + stat_smooth(method = "lm")

  1. Questions
  • Does the sign of \(\hat\beta_1\) surprise you? What does your model predict as the change in \(VO2-max\) for an additional five years of age? \[-0.3377 \times 5 = -1.6885\]
  • What does the model tell you about the aerobic fitness of a 15-year-old boy?

If the model is used for a 15-year-old, it would be an extrapolation outside the observed age range, so the prediction may be inaccurate.

Summary example 2 - Initial mass of tumor vs Increased size of tumor

The following table summarizes the initial (\(S_0\)) and increased masses of the tumor (\(S_1\)) over a given time. From the scatter plot of the data set, you may notice that the relationship between \(S_0\) and \(S_1\) seems nonlinear but intrinsically linear. Find the best linear fit of this data by using variable transformation and answer the questions below. You can also use the variable \(\Delta S = S_1 - S_0\) to find the best fit between the initial (possibly transformed) mass and the difference or increased mass.

\(S_0\) \(S_1\)
55 73
55 87
80 182
80 206
105 255
105 245
130 272
155 301
155 269
180 294
180 274

  1. Calculate the \(R^2\) for \((S_0, \Delta S)\), \((S_0, S_1)\)
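The chunks below use a data frame `tumor` with columns `s0` and `s1`; its loading step is not shown in this excerpt, so a minimal construction from the table above would be (an assumption):

# Assumed construction of the `tumor` data frame from the table above
tumor <- data.frame(
  s0 = c(55, 55, 80, 80, 105, 105, 130, 155, 155, 180, 180),
  s1 = c(73, 87, 182, 206, 255, 245, 272, 301, 269, 294, 274)
)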
# prep
tumor$sd <- tumor$s1 - tumor$s0

# fit models
fit1 <- lm(sd ~ s0, data = tumor)
fit2 <- lm(s1 ~ s0, data = tumor)

# r^2
summary(fit1)$r.squared
## [1] 0.2585808
summary(fit2)$r.squared
## [1] 0.7649424
  1. Perform appropriate variable transformation and calculate \(R^2\) for \((S_0^*, \Delta S)\), \((S_0^*, S_1)\)
# prep
tumor$s0_t <- log(tumor$s0) # ln(s_0)

# fit models with the transformed variable
fit3 <- lm(sd ~ s0_t, data = tumor)
fit4 <- lm(s1 ~ s0_t, data = tumor)

# r^2
summary(fit3)$r.squared
## [1] 0.4075046
summary(fit4)$r.squared
## [1] 0.8827907
  1. Test which variable pairs among the four relations above are significantly dependent with \(\alpha\) level of 0.05 and 0.01.
# Extract the statistic
t <- data.frame(fit1 = summary(fit1)$coefficients[2,3],
                fit2 = summary(fit2)$coefficients[2,3],
                fit3 = summary(fit3)$coefficients[2,3],
                fit4 = summary(fit4)$coefficients[2,3])

# Setting parameters
a1 <- 0.05 
a2 <- 0.01
df <- length(tumor$s0)-2

# Calculate critical t (two-sided)
t_crit1 <- qt(1-(a1/2), df)
t_crit2 <- qt(1-(a2/2), df)

# result
t > t_crit1
##       fit1 fit2 fit3 fit4
## [1,] FALSE TRUE TRUE TRUE
t > t_crit2
##       fit1 fit2  fit3 fit4
## [1,] FALSE TRUE FALSE TRUE
# cf) R^2 cutoff
r2_crit1 <- t_crit1^2 / (t_crit1^2+df)
r2_crit2 <- t_crit2^2 / (t_crit2^2+df)

r2_crit <- data.frame(r2_crit1, r2_crit2); r2_crit
##    r2_crit1 r2_crit2
## 1 0.3624868 0.539911

Notice that for testing significance, \(\hat\beta_1\) and r use the same t statistic. (see the theory part)

Recall that \[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{1}{1+\frac{SSE}{SSR}}\]
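The \(R^2\) cutoffs computed in the code above follow from this relation: substituting \(r^2\) into the t statistic for the correlation and solving for \(r^2\) gives (a short derivation, with \(df = n-2\))

\[t^2 = \frac{r^2 (n-2)}{1-r^2} \ \ \Longrightarrow \ \ r^2 = \frac{t^2}{t^2 + (n-2)}, \hspace{1cm} so \hspace{0.5cm} r^2_{crit} = \frac{t_{crit}^2}{t_{crit}^2 + df}\]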

  1. Test if there is the “lack of fit” to the models you created with significance level of 0.05 and 0.01
library(alr3)
f_lof <- data.frame(fit1 = pureErrorAnova(fit1)[3,4],
                    fit2 = pureErrorAnova(fit2)[3,4],
                    fit3 = pureErrorAnova(fit3)[3,4],
                    fit4 = pureErrorAnova(fit4)[3,4])

# setting parameters
df1 <- pureErrorAnova(fit1)[3,1]
df2 <- pureErrorAnova(fit1)[4,1]
a1 <- 0.05
a2 <- 0.01

# critical F statistic
f_crit1 <- qf(1-a1, df1, df2)
f_crit2 <- qf(1-a2, df1, df2)

# result
f_lof > f_crit1
##      fit1 fit2 fit3 fit4
## [1,] TRUE TRUE TRUE TRUE
f_lof > f_crit2
##      fit1 fit2 fit3  fit4
## [1,] TRUE TRUE TRUE FALSE
  1. Calculate prediction interval of the model where \(x_0 = \bar x\) and \(x_0 = \bar x + 2S\) with significance level of 0.05
# prep
S <- data.frame(fit1 = summary(fit1)$sigma,
                fit2 = summary(fit2)$sigma,
                fit3 = summary(fit3)$sigma,
                fit4 = summary(fit4)$sigma)
a <- 0.05

# for x0 = xbar
newdata.fit12 <- data.frame(s0 = mean(tumor$s0))
newdata.fit34 <- data.frame(s0_t = mean(tumor$s0_t))

# prediction interval for x0 = xbar
predict(fit1, newdata.fit12, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 107.09091  11.46674 202.71508
predict(fit2, newdata.fit12, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 223.4545 127.8304 319.0787
predict(fit3, newdata.fit34, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 107.09091  21.60821 192.57361
predict(fit4, newdata.fit34, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 223.4545 155.9300 290.9791
# for x0 = xbar + 2S
newdata.fit1 <- data.frame(s0 = mean(tumor$s0) + 2 * as.numeric(summary(fit1)$sigma))
newdata.fit2 <- data.frame(s0 = mean(tumor$s0) + 2 * as.numeric(summary(fit2)$sigma))
newdata.fit3 <- data.frame(s0_t = mean(tumor$s0_t) + 2 * as.numeric(summary(fit3)$sigma))
newdata.fit4 <- data.frame(s0_t = mean(tumor$s0_t) + 2 * as.numeric(summary(fit4)$sigma))

# prediction interval for x0 = xbar + 2S
predict(fit1, newdata.fit1, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 146.48611  38.43892 254.53331
predict(fit2, newdata.fit2, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 343.7930 235.7458 451.8402
predict(fit3, newdata.fit3, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 4820.8822  534.0701 9107.6942
predict(fit4, newdata.fit4, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
##  9956.777  7281.596 12631.958
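These intervals can be reproduced by hand from \(\hat y_0 \pm t_{1-\alpha/2,\,n-2}\, S \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar x)^2}{S_{xx}}}\). A minimal sketch for fit1 at \(x_0 = \bar x\) (assuming, as newdata.fit12 implies, that fit1 is the simple regression on s0):

n    <- length(tumor$s0)
xbar <- mean(tumor$s0)
x0   <- xbar
Sxx  <- sum((tumor$s0 - xbar)^2)

yhat0   <- as.numeric(coef(fit1)[1] + coef(fit1)[2] * x0)
se_pred <- S$fit1 * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx)
crit    <- qt(1 - a/2, n - 2)

c(fit = yhat0, lwr = yhat0 - crit * se_pred, upr = yhat0 + crit * se_pred)
# should reproduce predict(fit1, newdata.fit12, interval = "prediction")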
  1. Visualize the best regression line in the transformed space and in the original space.
# transformed space
plot(tumor$s0_t, tumor$s1, pch = 16, xlab = "S0_t", ylab = "S1")
lines(tumor$s0_t, fit4$fitted.values)

# original space: back-transform the fitted line, yhat = b0 + b1 * log(x)
slrv <- function(x){ fit4$coefficients[2] * log(x) + fit4$coefficients[1] }
plot(tumor$s0, tumor$s1, pch = 16, xlab = "S0", ylab = "S1")
curve(slrv, add = TRUE)
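Since fit4 is a straight line in the transformed predictor, abline(fit4) draws the same line without depending on the row order of s0_t, and the original-space curve can be written directly from coef(fit4). A minimal alternative sketch (assuming, as slrv() implies, that s0_t = log(s0)):

# transformed space: draw the fitted straight line directly
plot(tumor$s0_t, tumor$s1, pch = 16, xlab = "S0_t", ylab = "S1")
abline(fit4, col = "red")

# original space: back-transformed line, yhat = b0 + b1 * log(x)
plot(tumor$s0, tumor$s1, pch = 16, xlab = "S0", ylab = "S1")
curve(coef(fit4)[1] + coef(fit4)[2] * log(x), add = TRUE, col = "red")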

Supplements

Simple Linear Regression

# Simple Linear Regression (1)
data("faithful")
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
summary(faithful)
##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0
with(faithful, plot(eruptions, waiting))  # Scatterplot of the data

with(faithful, cor(eruptions, waiting))  # correlation
## [1] 0.9008112
# Simple Linear Regression (2)
lm(waiting ~ eruptions, data=faithful)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Coefficients:
## (Intercept)    eruptions  
##       33.47        10.73
# Simple Linear Regression (3)
slr1 <- lm(waiting ~ eruptions, data = faithful)
summary(slr1)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0796  -4.4831   0.2122   3.9246  15.9719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
summary(slr1)$r.squared
## [1] 0.8114608
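In simple linear regression \(R^2\) is just the squared sample correlation, so the value above can be checked directly against cor():

with(faithful, cor(eruptions, waiting))^2   # equals summary(slr1)$r.squared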
# SLR - Confidence Interval and Prediction Interval
newdata <- data.frame(eruptions = 4)
predict(slr1, newdata, interval="confidence")
##        fit     lwr      upr
## 1 76.39296 75.6189 77.16702
predict(slr1, newdata, interval="predict")
##        fit      lwr     upr
## 1 76.39296 64.72382 88.0621
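To see how the two kinds of interval differ across the range of eruptions, here is a minimal sketch that overlays the confidence band (for the mean response) and the wider prediction band (for a new observation) on the scatterplot, reusing slr1:

grid <- data.frame(eruptions = seq(min(faithful$eruptions), max(faithful$eruptions), length.out = 100))
conf_band <- predict(slr1, grid, interval = "confidence")
pred_band <- predict(slr1, grid, interval = "prediction")

plot(waiting ~ eruptions, data = faithful)
lines(grid$eruptions, conf_band[, "fit"], col = "red")
lines(grid$eruptions, conf_band[, "lwr"], col = "blue", lty = 2)       # confidence band
lines(grid$eruptions, conf_band[, "upr"], col = "blue", lty = 2)
lines(grid$eruptions, pred_band[, "lwr"], col = "darkgreen", lty = 3)  # prediction band
lines(grid$eruptions, pred_band[, "upr"], col = "darkgreen", lty = 3)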
# Graph of Simple Linear Regression
plot(faithful$waiting ~ faithful$eruptions)
abline(slr1, col="red")

par(mfrow=c(2,2))
plot(slr1)

Multiple Linear Regression

# Multiple Linear Regression (1)
data("stackloss")
head(stackloss)
##   Air.Flow Water.Temp Acid.Conc. stack.loss
## 1       80         27         89         42
## 2       80         27         88         37
## 3       75         25         90         37
## 4       62         24         87         28
## 5       62         22         87         18
## 6       62         23         87         18
summary(stackloss)
##     Air.Flow       Water.Temp     Acid.Conc.      stack.loss   
##  Min.   :50.00   Min.   :17.0   Min.   :72.00   Min.   : 7.00  
##  1st Qu.:56.00   1st Qu.:18.0   1st Qu.:82.00   1st Qu.:11.00  
##  Median :58.00   Median :20.0   Median :87.00   Median :15.00  
##  Mean   :60.43   Mean   :21.1   Mean   :86.29   Mean   :17.52  
##  3rd Qu.:62.00   3rd Qu.:24.0   3rd Qu.:89.00   3rd Qu.:19.00  
##  Max.   :80.00   Max.   :27.0   Max.   :93.00   Max.   :42.00
cor(stackloss)
##             Air.Flow Water.Temp Acid.Conc. stack.loss
## Air.Flow   1.0000000  0.7818523  0.5001429  0.9196635
## Water.Temp 0.7818523  1.0000000  0.3909395  0.8755044
## Acid.Conc. 0.5001429  0.3909395  1.0000000  0.3998296
## stack.loss 0.9196635  0.8755044  0.3998296  1.0000000
# Multiple Linear Regression (2)
mlr <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
mlr
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
##     data = stackloss)
## 
## Coefficients:
## (Intercept)     Air.Flow   Water.Temp   Acid.Conc.  
##    -39.9197       0.7156       1.2953      -0.1521
summary(mlr)
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
##     data = stackloss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2377 -1.7117 -0.4551  2.3614  5.6978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
## Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
## Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
## Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.8983 
## F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09
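The adjusted \(R^2\) reported above follows from \(\bar R^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}\) with \(n = 21\) observations and \(p = 3\) predictors; a quick check:

r2 <- summary(mlr)$r.squared
n  <- nrow(stackloss)   # 21
p  <- 3                 # Air.Flow, Water.Temp, Acid.Conc.
1 - (1 - r2) * (n - 1) / (n - p - 1)   # should match the Adjusted R-squared above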
# Multiple Linear Regression (3)
pairs(stackloss)  # Scatter plot of each pair of variables

par(mfrow=c(2,2))
plot(mlr)
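Air.Flow and Water.Temp are fairly strongly correlated (about 0.78 in the matrix above), so it is worth checking variance inflation factors before interpreting individual coefficients. A minimal sketch, assuming the car package is available:

library(car)   # provides vif()
vif(mlr)       # values well above 5-10 would signal problematic collinearity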

# MLR - Confidence Interval and Prediction Interval
attach(stackloss)
## The following object is masked _by_ .GlobalEnv:
## 
##     stack.loss
## The following object is masked from package:datasets:
## 
##     stack.loss
mlr2 <- lm(stack.loss~Air.Flow+Water.Temp+Acid.Conc.)
newdata <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predict(mlr2, newdata, interval="confidence")
##        fit      lwr    upr
## 1 24.58173 20.21846 28.945
predict(mlr2, newdata, interval="prediction")
##        fit     lwr      upr
## 1 24.58173 16.4661 32.69736
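These intervals can also be rebuilt from the coefficient covariance matrix, since \(\mathrm{Var}(\hat y_0) = x_0^{\prime}\,\mathrm{Var}(\hat\beta)\,x_0\) for the mean response and the prediction variance adds \(\hat\sigma^2\); a minimal sketch for the same newdata:

x0     <- c(1, 72, 20, 85)   # intercept, Air.Flow, Water.Temp, Acid.Conc. (coefficient order)
yhat0  <- sum(x0 * coef(mlr2))
se_fit <- as.numeric(sqrt(t(x0) %*% vcov(mlr2) %*% x0))   # se of the mean response
s      <- summary(mlr2)$sigma
crit   <- qt(0.975, df.residual(mlr2))

c(fit = yhat0, lwr = yhat0 - crit * se_fit, upr = yhat0 + crit * se_fit)   # confidence
se_new <- sqrt(se_fit^2 + s^2)
c(fit = yhat0, lwr = yhat0 - crit * se_new, upr = yhat0 + crit * se_new)   # prediction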

Logistic Regression

# Logistic Regression (1)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
## The following object is masked from 'package:dplyr':
## 
##     select
data(menarche)
menarche
##      Age Total Menarche
## 1   9.21   376        0
## 2  10.21   200        0
## 3  10.58    93        0
## 4  10.83   120        2
## 5  11.08    90        2
## 6  11.33    88        5
## 7  11.58   105       10
## 8  11.83   111       17
## 9  12.08   100       16
## 10 12.33    93       29
## 11 12.58   100       39
## 12 12.83   108       51
## 13 13.08    99       47
## 14 13.33   106       67
## 15 13.58   105       81
## 16 13.83   117       88
## 17 14.08    98       79
## 18 14.33    97       90
## 19 14.58   120      113
## 20 14.83   102       95
## 21 15.08   122      117
## 22 15.33   111      107
## 23 15.58    94       92
## 24 15.83   114      112
## 25 17.58  1049     1049
str(menarche)
## 'data.frame':    25 obs. of  3 variables:
##  $ Age     : num  9.21 10.21 10.58 10.83 11.08 ...
##  $ Total   : num  376 200 93 120 90 88 105 111 100 93 ...
##  $ Menarche: num  0 0 0 2 2 5 10 17 16 29 ...
summary(menarche)
##       Age            Total           Menarche      
##  Min.   : 9.21   Min.   :  88.0   Min.   :   0.00  
##  1st Qu.:11.58   1st Qu.:  98.0   1st Qu.:  10.00  
##  Median :13.08   Median : 105.0   Median :  51.00  
##  Mean   :13.10   Mean   : 156.7   Mean   :  92.32  
##  3rd Qu.:14.58   3rd Qu.: 117.0   3rd Qu.:  92.00  
##  Max.   :17.58   Max.   :1049.0   Max.   :1049.00
plot(Menarche/Total~Age, data=menarche)

# Logistic Regression (2)
glm1 <- glm(cbind(Menarche, Total-Menarche)~Age, family=binomial(logit), data=menarche)
glm1
## 
## Call:  glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit), 
##     data = menarche)
## 
## Coefficients:
## (Intercept)          Age  
##     -21.226        1.632  
## 
## Degrees of Freedom: 24 Total (i.e. Null);  23 Residual
## Null Deviance:       3694 
## Residual Deviance: 26.7  AIC: 114.8
plot(Menarche/Total~Age, data = menarche)
lines(menarche$Age, glm1$fitted.values, col = "red")
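On the logit scale the fitted model is \(\log\frac{p}{1-p} = -21.23 + 1.63\,\mathrm{Age}\), so the age at which the estimated proportion reaches 0.5 is \(-\hat\beta_0/\hat\beta_1\); a short sketch, which also shows predictions on the probability scale:

# age at which the fitted probability equals 0.5 (logit = 0)
-coef(glm1)[1] / coef(glm1)[2]

# fitted probabilities on a fine age grid (type = "response" applies the inverse logit)
ages <- data.frame(Age = seq(9, 18, by = 0.25))
head(predict(glm1, ages, type = "response"))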