Contents

Part I: Linear Regression Model

  • Regression Model, Simple Linear Regression, Least Square Method
  • Inference of Regression Parameters and the Function

Part II: Fitness of Regression Model

  • Coefficient of Determination \(R^2\)
  • Residuals Analysis
  • Correlation Analysis
  • Nonparametric Spearman’s Ranked Correlation Analysis

Part III: Extension of Linear Model

  • Variable Transformation
  • Curvilinear Regression
  • Multiple Linear Regression

Part IV: Summary

  • Summary example 1 - Age vs \(VO_2\)-max
  • Summary example 2 - Initial mass of tumor vs Increased size of tumor

Supplements

  • Simple Linear Regression - eruption
  • Multiple Linear Regression - stackloss
  • Logistic Linear Regression - Menarche

Tutorial

Part I: Linear Regression Model

The method of least squares (s6)

Suppose that we wish to fit the model

\[E(Y) = \beta_0 + \beta_1 x\]

to the set of data points as shown in the figure below. (Note that the independent variable \(x\) could be \(w^2\) or \(w^{1/2}\) or \(\ln(w)\) and so on, for some other independent variable \(w\).)

 

 

The least-squares procedure for fitting a line through a set of \(n\) data points is similar to the method that we might use if we fit a line by eye. A convenient way to accomplish this, and one that yields estimators with good properties, is to minimize the sum of squares of the vertical deviations from the fitted line. Thus, if

\[\hat{y_i}=\hat\beta_0 + \hat\beta_1x_i\] is the predicted value of the \(i_{th}\) \(y\) value (when \(x = x_i\)), then the deviation (or error) of the observed value \(y_i\) from \(\hat{y_i}=\hat\beta_0 + \hat\beta_1x_i\) is the difference \(y_i - \hat y_i\), and the sum of squares of deviations to be minimized is

\[SSE = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\] To minimize SSE, we take partial derivatives with respect to \(\hat\beta_0\) and \(\hat\beta_1\) and set them equal to zero. Thus,

\[\frac{\partial SSE}{\partial \hat\beta_0} = \frac{\partial \{ \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\}}{\partial \hat\beta_0} = -\sum_{i=1}^n 2[y_i - (\hat\beta_0 + \hat\beta_1x_i)] \\ = -2 \bigg(\sum_{i=1}^n y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^nx_i \bigg) = 0\] and

\[\frac{\partial SSE}{\partial \hat\beta_1} = \frac{\partial \{ \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\}}{\partial \hat\beta_1} = -\sum_{i=1}^n 2[y_i - (\hat\beta_0 + \hat\beta_1x_i)]x_i \\ = -2 \bigg(\sum_{i=1}^n x_iy_i - \hat\beta_0\sum_{i=1}^n x_i - \hat\beta_1\sum_{i=1}^nx_i^2 \bigg) = 0\]

The equations \(\frac{\partial SSE}{\partial \hat\beta_0} = 0\) and \(\frac{\partial SSE}{\partial \hat\beta_1} = 0\) are called the least-squares equations for estimating the parameters of a line. The least-squares equations are linear in \(\hat\beta_0\) and \(\hat\beta_1\) and hence can be solved simultaneously. You can verify that the solutions are

\[\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n}\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \bigg( \sum_{i=1}^n x_i \bigg)^2}\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x\] Note that

\(\sum_{i=1}^n \bar x y_i = \frac{1}{n} \sum_{i=1}^n x_i \sum_{i=1}^n y_i\)

and

\(\sum_{i=1}^n \bar x \bar y - \sum_{i=1}^n \bar y x_i = \bar y \sum_{i=1}^n (\bar x - x_i) = 0\),

because

\(\sum_{i=1}^n (\bar x - x_i) = 0\)

We will denote by \(S_{xy}\) the quantity \(\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\),

by \(S_{xx}\) the quantity \(\sum_{i=1}^n (x_i - \bar x)^2\),

and by \(S_{yy}\) the quantity \(\sum_{i=1}^n (y_i - \bar y)^2\).

Then the above equations become \[\hat \beta_1 = \frac{S_{xy}}{S_{xx}}, \ \ \hat\beta_0 = \bar y - \hat \beta_1 \bar x\] where \[\begin{aligned}S_{xx} &= \sum_{i=1}^n (x_i - \bar x)^2 \\ &= \sum_{i=1}^n x_i^2 - 2n\bar x^2 + n \bar x^2 \\ &= \sum_{i=1}^n x_i^2 - n \bar x^2 \\ &= \sum_{i=1}^n x_i^2 - n \bigg(\frac{\sum_{i=1}^n x_i}{n} \bigg)^2 \\ &= \sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}\end{aligned}\]

\[\begin{aligned} S_{xy} &= \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) \\ &= \sum_{i=1}^n x_iy_i - n \bar x \bar y \\ &= \sum_{i=1}^n x_iy_i - n \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n^2} \\ &= \sum_{i=1}^n x_iy_i - \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n}\end{aligned}\] and \[SSE = S_{yy} - \hat \beta_1 S_{xy}\]
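As a quick illustration of these shortcut formulas, they can be coded directly in R. This is a minimal sketch; the helper name `ls_fit` is introduced here for illustration, and the five toy points are the ones used in the examples that follow.

# Least-squares estimates from the shortcut formulas (illustrative sketch)
ls_fit <- function(x, y) {
  n   <- length(x)
  Sxx <- sum(x^2) - sum(x)^2 / n
  Syy <- sum(y^2) - sum(y)^2 / n
  Sxy <- sum(x * y) - sum(x) * sum(y) / n
  b1  <- Sxy / Sxx
  b0  <- mean(y) - b1 * mean(x)
  SSE <- Syy - b1 * Sxy
  list(b0 = b0, b1 = b1, SSE = SSE)
}

# toy data used later in this tutorial; should give b0 = 1, b1 = 0.7, SSE = 1.1
ls_fit(x = c(-2, -1, 0, 1, 2), y = c(0, 0, 1, 1, 3))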

Maximum Likelihood Estimation (MLE) Approach

Recall the model, \[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\epsilon_i \sim N(0, \sigma^2)\)

Then we can find the mean and variance of each \(Y_i\) \[E[Y_i|X_i = x_i] = \beta_0 + \beta_1x_i\] and \[Var[Y_i|X_i = x_i] = \sigma^2\] Additionally, the \(Y_i\) follow a normal distribution conditioned on the \(x_i\), \[Y_i|X_i \sim N(\beta_0+\beta_1 x_i, \sigma^2)\] Recall that the P.D.F. of a random variable \(X \sim N(\mu, \sigma^2)\) is given by \[f_X(x;\mu, \sigma^2) = \frac{1}{\sqrt{{2\pi\sigma^2}}}\exp\bigg[-\frac{1}{2} \bigg(\frac{x-\mu}{\sigma}\bigg)^2\bigg]\] Then we can write the P.D.F. of each of the \(Y_i\) as \[f_{Y_i}(y_i,x_i;\beta_0, \beta_1,\sigma^2) = \frac{1}{\sqrt{{2\pi\sigma^2}}} \exp\bigg[-\frac{1}{2} \bigg(\frac{y_i-(\beta_0+\beta_1x_i)}{\sigma}\bigg)^2\bigg]\] Given \(n\) data points \((x_i, y_i)\) we can write the likelihood, which is a function of the three parameters \(\beta_0, \beta_1\), and \(\sigma^2\). Since the data have been observed, we use lower case \(y_i\) to denote that these values are no longer random. \[L(\beta_0, \beta_1,\sigma^2) = \prod_{i=1}^n\frac{1}{\sqrt{{2\pi\sigma^2}}} \exp\bigg[-\frac{1}{2} \bigg(\frac{y_i-\beta_0-\beta_1x_i}{\sigma}\bigg)^2\bigg]\] Our goal is to find the values of \(\beta_0, \beta_1\), and \(\sigma^2\) that maximize this function, which is a straightforward multivariate calculus problem.

We’ll start by doing a bit of rearranging to make our task easier. \[L(\beta_0, \beta_1,\sigma^2) = \bigg(\frac{1}{\sqrt{{2\pi\sigma^2}}}\bigg)^n \exp\bigg[-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2\bigg]\] Then, as is often the case when finding MLEs, for mathematical convenience we take the natural logarithm of the likelihood function, since log is a monotonically increasing function. We then proceed to maximize the log-likelihood, and the resulting estimates will be the same as if we had not taken the log. \[\log L(\beta_0, \beta_1,\sigma^2) = -\frac{n}{2} \log{(2\pi)} - \frac{n}{2} \log{(\sigma^2)} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2\] Note that we use log to mean the natural logarithm (or \(\ln\)). We now take a partial derivative with respect to each of the parameters.

\[\begin{aligned} \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \beta_0} &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) \\ \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \beta_1} &= \frac{1}{\sigma^2} \sum_{i=1}^n x_i (y_i-\beta_0-\beta_1x_i) \\ \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2 \end{aligned}\]

We then set each of the partial derivatives equal to zero and solve the resulting system of equations.

\[\begin{aligned} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) &= 0\\ \sum_{i=1}^n x_i (y_i-\beta_0-\beta_1x_i) &= 0\\ -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2 &=0 \end{aligned}\]

You may notice that the first two equations also appear in the least-squares approach. Skipping the issue of actually checking that we have found a maximum, we then arrive at our estimates. We call these estimates the maximum likelihood estimates.

\[\begin{aligned} \hat\beta_1 &= \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n} (\sum_{i=1}^n x_i) (\sum_{i=1}^n y_i)}{\sum_{i=1}^n x_i^2 - \frac{1}{n} (\sum_{i=1}^n x_i)^2} = \frac{S_{xy}}{S_{xx}}\\ \hat\beta_0 &= \bar y - \hat\beta_1 \bar x \\ \hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^n (y_i-\hat y_i)^2 \end{aligned}\]
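As a sanity check on these closed forms, the log-likelihood can also be maximized numerically. The sketch below is illustrative only: the vectors `x` and `y` are the five toy points used in the examples later in this part, and `negloglik` is a name introduced here.

# Numerically maximize the log-likelihood and compare with the closed-form MLEs
x <- c(-2, -1, 0, 1, 2); y <- c(0, 0, 1, 1, 3)
negloglik <- function(par) {
  b0 <- par[1]; b1 <- par[2]; s2 <- exp(par[3])  # sigma^2 kept positive via the log scale
  -sum(dnorm(y, mean = b0 + b1 * x, sd = sqrt(s2), log = TRUE))
}
opt <- optim(c(0, 0, 0), negloglik)
c(b0 = opt$par[1], b1 = opt$par[2], sigma2 = exp(opt$par[3]))
# should be close to b0 = 1, b1 = 0.7, sigma2 = SSE/n = 0.22 for this toy data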

Note that \(\hat\beta_0\) and \(\hat\beta_1\) are the same as the least squares estimates. However, we now have a new estimate of \(\sigma^2\), namely \(\hat\sigma^2\). So we now have two different estimates of \(\sigma^2\).

\[\begin{aligned}S_e^2 &= \frac{1}{n-2} \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n-2} \sum_{i=1}^{n}e_i^2 \hspace{1cm} Least \ \ Squares \\ \hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n} \sum_{i=1}^{n}e_i^2 \hspace{1cm} MLE \end{aligned}\]

The least-squares estimate \(S_e^2\) is unbiased, while the MLE \(\hat\sigma^2\) is biased, since the expectation of \(S_e^2\) equals \(\sigma^2\) but the expectation of \(\hat\sigma^2\) does not (see the supplementary tutorial).

 

Question: Is bias always a bad thing?

Try to answer it yourself and see these posts to get more ideas.

Example of parameter estimation in simple linear regression (s7)

Use the method of least squares to fit a straight line to the \(n=5\) data points given in the table below. (Wackerly p571)

x y
-2 0
-1 0
0 1
1 1
2 3

Solution

We commence computation of the least-squares estimates for the slope and intercept of the fitted line by constructing the table below.

Using the results from this table, we obtain

\[\hat \beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n}\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \bigg( \sum_{i=1}^n x_i \bigg)^2} = \frac {7 - \frac{1}{5} (0)(5)}{10 - \frac{1}{5} (0)^2} = 0.7\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x = \frac{5}{5} - (0.7)(0) = 1\] and the fitted line is

\[\hat y = 1 + 0.7x\]
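The code below (and the later chunks of this example) uses a data frame `data` whose construction is not shown in this excerpt. Based on the computation table given in the next example, a minimal construction would be the following (an assumption, not part of the original material):

# Assumed construction of the `data` object used below:
# the five (x, y) points plus xy, x^2, y^2 columns, with row 6 holding the column sums
data <- data.frame(x = c(-2, -1, 0, 1, 2), y = c(0, 0, 1, 1, 3))
data$xy <- data$x * data$y
data$x2 <- data$x^2
data$y2 <- data$y^2
data[6, ] <- colSums(data)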

slr <- function(x){1+0.7*x}
plot(data$x[1:5], data$y[1:5], xlab = "x", ylab = "y", ylim = c(-1.5, 3))
curve(slr, add = TRUE)

# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

Example of variance estimation in simple linear regression (s9)

Find the variances of the estimators \(\hat\beta_0\) and \(\hat\beta_1\) for the table below. (Wackerly p579)

x y
-2 0
-1 0
0 1
1 1
2 3

Solution From the previous example we found that \[n = 5, \hspace{0.5cm} \sum_{i=1}^n x_i = 0, \hspace{0.5cm} \sum_{i=1}^n x_i^2 = 10, \hspace{0.5cm} S_{xx} = 10, \hspace{0.5cm} and \hspace{0.5cm} \bar x = 0\] Thus, \[V(\hat\beta_0) = \frac{\sigma^2\sum_{i=1}^n x_i^2}{nS_{xx}} = \frac{\sigma^2 (10)}{5(10)} = \bigg(\frac{1}{5}\bigg)\sigma^2\] and \[V(\hat\beta_1) = \bigg(\frac{1}{10}\bigg)\sigma^2\] Notice that \(Cov(\hat\beta_0, \hat\beta_1) = 0\) in this case since \(\sum_{i=1}^n x_i = 0\).

To get the answer, we need to estimate \(\sigma^2\). For these data, \(n = 5\) and we have already determined that \[\sum_{i=1}^n y_i = 5, \hspace{0.5cm} S_{xy} = 7, \hspace{0.5cm} and \hspace{0.5cm} \hat\beta_1 = 0.7\] It is easily determined that \(\sum_{i=1}^n y_i^2 = 11\) and that \[S_{yy} = \sum_{i=1}^n(y_i - \bar y)^2 = \sum_{i=1}^ny_i^2 - n(\bar y)^2 = 11 - 5(1)^2 = 6.0\] Therefore, \[SSE = S_{yy} - \hat\beta_1 S_{xy} = 6.0 - (0.7)(7) = 1.1\] and

\[S^2 = \frac{SSE}{n-2} = \frac{1.1}{5-2} = \frac{1.1}{3} = 0.367\]

i x y xy \(x^2\) \(y^2\)
1 -2 0 0 4 0
2 -1 0 0 1 0
3 0 1 0 0 1
4 1 1 1 1 1
5 2 3 6 4 9
Sum 0 5 7 10 11
n <- 5
Sxx <- data$x2[6] - n * (data$x[6]/n)^2
Syy <- data$y2[6] - n * (data$y[6]/n)^2
Sxy <- data$xy[6] - data$x[6] * data$y[6] / n
b1 <- 0.7
b0 <- 1

# estimation of population variance
SSE <- Syy - b1*Sxy
S2 <- SSE/(n-2)
S <- sqrt(S2)

# cf) MLE
S2_mle <- SSE/n
S_mle <- sqrt(S2_mle)

# Formatting
knitr::kable(data.frame(Sxx, Syy, Sxy, SSE, S2, S, S2_mle, S_mle))
Sxx Syy Sxy SSE S2 S S2_mle S_mle
10 6 7 1.1 0.3666667 0.6055301 0.22 0.4690416
# variance of regression coefficient
V_b1 <- S2/Sxx; V_b1; S2/10
## [1] 0.03666667
## [1] 0.03666667
V_b0 <- (S2*data$x2[6])/(n*Sxx); V_b0; S2/5
## [1] 0.07333333
## [1] 0.07333333
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

Example of inference of parameter (s11)

Do the data of previous example present sufficient evidence to indicate that the slope differs from 0? Test using \(\alpha = 0.05\) and give bounds for the attained significance level. (Wackerly p585)

Solution

From the previous example, we can get

\[C_{11} = \frac{1}{S_{xx}} = \frac{1}{10} = 0.1, \hspace{1cm} S = \sqrt{S^2} = \sqrt{0.367} = 0.606\] and \[t = \frac{\hat\beta_1 - 0}{s\sqrt{C_{11}}} = \frac{0.7-0}{0.606\sqrt{0.1}} = 3.65\] If we take \(\alpha = 0.05\), the value of \(t_{\alpha/2} = t_{0.025}\) for 3 df is 3.182 from the t-table, and the rejection region is \[reject \ \ if \ \ |t| \geq 3.182\] Therefore, we reject the null hypothesis that \(\beta_1 = 0\) at the \(\alpha = 0.05\) level of significance.

df <- n-2
c11 <- 1/Sxx
S <- sqrt(S2)
t <- (b1-0)/(S * sqrt(c11))
t_crit <- qt(0.975, df)
pval <- 2*pt(t, df, lower.tail = FALSE)

knitr::kable(data.frame(df, c11, S, t, t_crit, pval))
df c11 S t t_crit pval
3 0.1 0.6055301 3.655631 3.182446 0.0353528
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
## 
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01  6.000e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     0.2708   3.693   0.0345 *
## x[1:5]        0.7000     0.1915   3.656   0.0354 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
## F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535

Example of inference of interval for \(\beta_i\) (s10)

Calculate a 95% confidence interval for the parameter \(\beta_1\) of the previous example. (Wackerly p586)

Solution From the t-table, we can get the critical t-value, \(t_{0.025}\), based on 3 df: 3.182. Then the 95% confidence interval for \(\beta_1\) is

\[\hat\beta_1 \pm t_{0.025}s\sqrt{C_{11}}\] Substituting, we get \[0.7 \pm (3.182)(0.606)\sqrt{0.1} \hspace{1cm} or \hspace{1cm} 0.7 \pm 0.610\] If we wish to estimate \(\beta_1\) to within 0.15 unit, it is obvious that the confidence interval is too wide and that the sample size must be increased.

b1 <- 0.7
t_crit <- qt(0.975, df)
S <- sqrt(S2)
c11 <- 1/Sxx

lower <- b1 - (t_crit*S*sqrt(c11))
upper <- b1 + (t_crit*S*sqrt(c11))

knitr::kable(data.frame(lower, upper))
lower upper
0.0906079 1.309392
# base R
confint(slr2, "x[1:5]", level=0.95)
##             2.5 %   97.5 %
## x[1:5] 0.09060793 1.309392

Example of inference of \(E(Y)\) (s13)

For the same example, find a 90% confidence interval for \(E(Y)\) when \(x=1\). (Wackerly p592)

Solution For the model of the previous example \[E(Y) = \beta_0 + \beta_1x\] To estimate \(E(Y)\) for any fixed value \(x=x^*\), we use the unbiased estimator \(\hat{E(Y)} = \hat\beta_0 + \hat\beta_1x^*\). Then, \[\hat\beta_0 +\hat\beta_1x^* = 1 + 0.7x^*\] For the case of \(x^* = 1\), and because \(n=5, \ \bar x = 0,\) and \(S_{xx} = 10\), it follows that \[\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} = \frac{1}{5} + \frac{(1-0)^2}{10} = 0.3\] We already know that \(S^2 = 0.367\) or \(S = 0.606\) and \(t_{0.05}\) with \(n-2 = 3\) df is 2.353. Thus, the confidence interval for \(E(Y)\) when \(x=1\) is \[\hat\beta_0 + \hat\beta_1x^* \pm t_{\alpha/2}S\sqrt{\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}\] \[[1 + (0.7)(1)] \pm (2.353)(0.606)\sqrt{0.3} = 1.7 \pm 0.781\] That is, we are 90% confident that, when the independent variable takes on the value \(x=1\), the mean value \(E(Y)\) of the dependent variable is between 0.919 and 2.481. This interval obviously is very wide, but remember that it is based on only 5 data points.

a <- 0.1
n <- 5
b0 <- 1
b1 <- 0.7
df <- n-2
t_crit <- qt(1 - (a/2), df)
x <- 1
y_hat <- b0+b1*x

lower <- y_hat - (t_crit*S*sqrt((1/n)+((x - data$x[6]/n)^2/Sxx)))  # data$x[6]/n is the sample mean of x
upper <- y_hat + (t_crit*S*sqrt((1/n)+((x - data$x[6]/n)^2/Sxx)))

knitr::kable(data.frame(y_hat, lower, upper))
y_hat lower upper
1.7 0.9194776 2.480522
# base R
new.data <- data.frame(x = 1)
predict(slr2, new.data, level=0.90, interval = "confidence")[1,]
## Warning: 'newdata' had 1 row but variables found have 5 rows
##       fit       lwr       upr 
## 1.7000000 0.9194776 2.4805224
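The warning above appears because the model formula indexes its variables (`y[1:5] ~ x[1:5]`), so `predict()` cannot cleanly match the one-row `newdata`; here the first returned row still happens to correspond to the prediction at \(x=1\). One way to avoid the warning, assuming the same `data` frame as above, is to refit on a subset with plain column names (a minimal sketch; `slr3` is a name introduced here):

# Refit with plain column names so that newdata is used as intended
slr3 <- lm(y ~ x, data = data[1:5, ])
predict(slr3, data.frame(x = 1), level = 0.90, interval = "confidence")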

Example of inference of particular value of \(y\) (s14)

Suppose that the experiment that generated the data of previous example is to be run again with \(x=2\). Predict the particular value of \(y\) with \(1-\alpha= 0.90\) (Wackerly p596)

Solution The predicted value of \(y\) with \(x=2\) is \[\hat\beta_0 + \hat\beta_1x^* = 1 + (0.7)(2) = 2.4\] Further, with \(x^*=2\), \[\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} = \frac{1}{5} + \frac{(2-0)^2}{10} = 0.6\] Thus the prediction interval is \[\hat\beta_0+ \hat\beta_1x^* \pm t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}} = 2.4 \pm (2.353)(0.606)\sqrt{1+0.6} = 2.4 \pm 1.804\]

n <- 5
b0 <- 1
b1 <- 0.7
df <- n-2
t_crit <- qt(0.95, df)
x <- 2

y <- b0+b1*x
lower <- b0+b1*x - (t_crit*S*sqrt(1+(1/n)+((x - data$x[6]/n)^2/Sxx)))  # data$x[6]/n is the sample mean of x
upper <- b0+b1*x + (t_crit*S*sqrt(1+(1/n)+((x - data$x[6]/n)^2/Sxx)))

knitr::kable(data.frame(y, lower, upper))
y lower upper
2.4 0.5974608 4.202539
# base R
new.data <- data.frame(x = 2)
predict(slr2, new.data, level=0.90, interval = "prediction")[1,]
## Warning: 'newdata' had 1 row but variables found have 5 rows
##       fit       lwr       upr 
## 2.4000000 0.5974608 4.2025392

As you can see in the figure above, the confidence bands for \(E(Y)\) are narrower than the prediction bands for \(y\).

Part II: Fitness of Regression Model

Example of the analysis of variance table (s17, Hayter p546)

The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon the plant’s production. The data set shown below is compiled and provides the plant’s production and electrical usage for each month of the previous year. The electrical usage is in units of a million kilowatt-hours, and the production is measured as the value in million-dollar units of the cars produced in that month.

Production Usage
Jan 4.51 2.48
Feb 3.58 2.26
Mar 4.31 2.47
Apr 5.06 2.77
May 5.64 2.99
Jun 4.99 3.05
Jul 5.29 3.18
Aug 5.83 3.46
Sep 4.70 3.03
Oct 5.61 3.26
Nov 4.90 2.67
Dec 4.20 2.53

As shown in the scatter plot of the data set, the manager concludes that it is sensible to fit a straight line to the data points. The electricity usage is taken to be the dependent variable \(y\), and the production is taken to be the explanatory variable \(x\). The linear model \(y = \beta_0 + \beta_1x\) will then allow a month’s electrical usage to be estimated as a function of the month’s production. We can perform an ANOVA test for the regression line.

# Fitting linear model
fit <- lm(Usage ~ Production, data = data); summary(fit)
## 
## Call:
## lm(formula = Usage ~ Production, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23245 -0.16704  0.03919  0.13480  0.27645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.40905    0.38599   1.060    0.314    
## Production   0.49883    0.07835   6.367 8.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1729 on 10 degrees of freedom
## Multiple R-squared:  0.8021, Adjusted R-squared:  0.7823 
## F-statistic: 40.53 on 1 and 10 DF,  p-value: 8.176e-05
# ANOVA
res <- summary(aov(fit)); res
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Production   1 1.2124  1.2124   40.53 8.18e-05 ***
## Residuals   10 0.2991  0.0299                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res <- unlist(res)

# R^2 = SSR/SST
R2 <- res[3]/(res[3]+res[4]); names(R2) <- "R2"; R2
##        R2 
## 0.8021094
# visualize
library(ggplot2)
qplot(data$Production, data$Usage, ylab = "Electricity Usage", 
      xlab = "Production", main = "Production vs Electricity Usage") + 
  stat_smooth(method = "lm")

Residual Analysis (s18, Hayter p585)

The residuals are defined to be \[e_i = y_i - \hat y_i, \hspace{1cm} 1\leq i \leq n\] so that they are the differences between the observed values of the dependent variable \(y_i\) and the corresponding fitted values \(\hat y_i\). A property of these residuals is that they sum to 0, \[\sum_{i=1}^n e_i = 0\] as checked numerically after the list below. Residual analysis can be used to:

  • Identify data points that are outliers (residual-leverage plot, Cook’s distance)

  • Check whether the fitted model is appropriate (Scale-location plot)

  • Check whether the error variance is constant (residual plots, standardized residual plots)

  • Check whether the error terms are normally distributed (normal probability plot)
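
A quick numerical check of the zero-sum property, using the `fit` object from the electricity-usage example above:

# The residuals of a fitted lm object sum to (numerically) zero
sum(residuals(fit))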

Here I list some residual plots that can be used as diagnostics for the regression model:

  • Random scatter and no patterns - best fit

  • Random scatter with outliers

  • Nonlinear relationship exists - simple linear regression is not appropriate

  • A funnel-shaped residual plot - non-constant error variance

  • Straight line in a normal scores plot - normally distributed errors

  • Nonlinear line in a normal scores plot - nonlinear patterns

Plot the residuals \(e_i\) against the values of the explanatory variable \(x_i\)

# Residuals vs Fitted
plot(fit$fitted.values, fit$residuals, xlab = "Fitted value", 
     ylab = "Residuals", main = "Residuals vs Fitted")
abline(h=0, col="grey", lty = 2)
abline(h=0.2, col="grey", lty = 2); abline(h=-0.2, col="grey", lty = 2)

# base R - plot.lm
plot(fit, which=1)

Notice that the red line in the residuals vs fitted values plot represents the average value of the residuals at each fitted value (a scatterplot smoother). From this plot, we can identify outliers or nonlinear patterns in the data.

Plot the standardized residual \(\frac{e_i}{\hat\sigma}\) against the values of the explanatory variable \(x_i\) or leverage \(h_i\)

Standardized residual

The standardized residual is simply the ratio of the residual to the estimated error standard deviation, as shown below. \[\frac{e_i}{\hat\sigma}\] We can take the square root of the absolute value of the standardized residual to diminish the skewness of its distribution, as in the scale-location plot.

If we instead scale each residual by the sample estimate \(S\) and take its leverage into account, we get the studentized residual. \[\frac{e_i}{S \sqrt{1-h_i}}\]
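
Both quantities are available directly in R for a fitted `lm` object; note that what R’s `rstandard()` returns is \(e_i/(S\sqrt{1-h_i})\), i.e. what the text above calls the studentized residual. A minimal sketch using the `fit` object from the example above:

# Residuals scaled by the residual standard error, and the leverage-adjusted versions
head(residuals(fit) / sigma(fit))  # e_i / sigma_hat
head(rstandard(fit))               # e_i / (S * sqrt(1 - h_i)), "standardized" in R's terminology
head(rstudent(fit))                # leave-one-out (externally studentized) version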

Leverage (\(h_i\))

An outlier is a data point whose response \(y\) does not follow the general trend of the rest of the data. A data point has high leverage if it has “extreme” predictor \(x\) values. A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results.

The statistic called leverage is given as follows \[h_i = \frac{1}{n} + \frac{(x_i-\bar x)^2}{\sum_{j=1}^n (x_j-\bar x)^2}\] and indicates the proportion of the SST of the \(x\) values contributed by the \(i_{th}\) point. Here I summarize the properties of leverage (a numerical check follows the list below):

  • The leverage \(h_i\) is a measure of the distance between the \(x\) value for the \(i_{th}\) data point and the mean of the \(x\) values for all \(n\) data points.
  • The leverage \(h_i\) is a number between 0 and 1, inclusive.
  • The sum of the \(h_i\) equals \(p\), the number of parameters (regression coefficients including the intercept). Therefore, in simple linear regression case, \(p = 2\).
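
A minimal sketch checking the leverage formula against R’s built-in `hatvalues()`, using the `fit` object from the example above:

# Leverages: built-in versus the formula above
h <- hatvalues(fit)
x <- model.frame(fit)$Production
h_manual <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(h), unname(h_manual))  # TRUE: the formula matches hatvalues()
sum(h)                                  # equals p = 2 for simple linear regression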

 

Cook’s Distance (\(D_i\))

Cook’s distance is a statistic that quantifies the influence of data points. I summarize the definition and properties of Cook’s distance below.

\[D_i = \frac{(y_i - \hat y_i)^2}{2\times MSE}\bigg[\frac{h_{i}}{(1-h_{i})^2}\bigg] = \frac{e_i^2}{2\times MSE}\bigg[\frac{h_{i}}{(1-h_{i})^2}\bigg]\]

  • \(D_i\) directly summarizes how much all of the fitted values change when the \(i_{th}\) observation is deleted

  • A data point having a large \(D_i\) indicates that the data point strongly influences the fitted values
    • If \(D_i\) is greater than 0.5, then the \(i_{th}\) data point is worthy of further investigation as it may be influential
    • If \(D_i\) is greater than 1, then the \(i_{th}\) data point is quite likely to be influential
    • or, if \(D_i\) clearly stands out from the other \(D_i\) values, it is almost certainly influential.
  • For more information on \(D_i\), please see this page (a numerical check against R’s built-in function follows the plots below)
# Standardized residual vs Fitted
fit.stdres <- rstandard(fit)
plot(fit$fitted.values, fit.stdres, xlab = "Fitted value", 
     ylab = "Standardized Residuals", ylim = c(-2, 2), main = "Standardized residuals vs Fitted")
abline(h=0, col="grey", lty = 3)
abline(h=1, col="grey", lty = 2); abline(h=-1, col="grey", lty = 2)
abline(h=2, col="grey", lty = 2); abline(h=-2, col="grey", lty = 2)

# Square root |Standardized residual| vs Fitted
plot(fit$fitted.values, sqrt(abs(fit.stdres)), xlab = "Fitted value", ylim = c(0, 1.3),
     ylab = bquote(sqrt(abs("Standardized residuals"))), main = "Scale-Location")

plot(fit, which=3)

# Cook's distance - Influential outlier detection
plot(fit, which=4)

# Standardized residual vs leverage - Influential outlier detection
plot(fit, which=5)
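
As a numerical check on the definition of \(D_i\) above, Cook’s distances can be recomputed from the residuals, the MSE, and the leverages (a sketch using the same `fit` object; the factor 2 is \(p\), the number of regression parameters):

# Cook's distance from its definition, compared with the built-in cooks.distance()
e <- residuals(fit)
h <- hatvalues(fit)
MSE <- sum(e^2) / df.residual(fit)
D_manual <- (e^2 / (2 * MSE)) * (h / (1 - h)^2)
all.equal(unname(D_manual), unname(cooks.distance(fit)))  # TRUE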

Plot a normal probability plot (also referred to as a normal scores plot, or Q-Q plot)

The normal scores plot is used to check whether the error terms \(\epsilon_i\) appear to be normally distributed. In such a plot the residuals \(e_i\) are plotted against their “normal scores”, which can be thought of as their “expected values” if they follow a normal distribution. The normal score of the \(i_{th}\) smallest residual is \[\Phi^{-1}\Bigg(\frac{i - \frac{3}{8}}{n+\frac{1}{4}}\Bigg)\]
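
These normal scores can be computed directly with `qnorm()` and plotted against the sorted residuals. This is a minimal sketch using the residuals of the `fit` object above; note that `qqnorm()` itself relies on `ppoints()`, whose plotting-position constant differs slightly for larger samples.

# Normal scores from the (i - 3/8)/(n + 1/4) formula versus the sorted residuals
e_sorted <- sort(residuals(fit))
n <- length(e_sorted)
scores <- qnorm((seq_len(n) - 3/8) / (n + 1/4))
plot(scores, e_sorted, xlab = "Normal scores", ylab = "Sorted residuals")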

qqnorm(fit.stdres, ylab = "Standardize residuals", xlab = "Normal scores", 
       main = "Normal Probability Plot", ylim = c(-1.5, 1.9))
qqline(fit.stdres, lty=3, col="grey")

# Q-Q plot
plot(fit, which=2)

Example of lack of fit test (s18)

A perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model would give a zero error sum of squares, i.e. \(SSE = 0\). However, if you conduct the same experiment with the same explanatory variable \(x_i\) and record the response variable \(y_i\) multiple times, the observations from the second trial may deviate from the perfect model. This error originates from “purely” random variation or noise. The sum of squares due to this error, called the sum of squares due to pure error and abbreviated \(SSPE\), quantifies these variations. \(SSPE\) is calculated by taking repeated observations at some or all values of \(x_i\) and adding up the squared deviations at each level of \(x_i\). The remaining error is called the sum of squares due to lack of fit and is abbreviated \(SSLF\). Therefore,

\[SSE = SSPE + SSLF\] As you can see in the table below, in the lack of fit test we analyze the residual error SSE in terms of SSLF and SSPE by taking the mean square of each error separately.

Suppose that we built a simple linear regression model between the size of a cancer mass (\(x\)) and its increase in size (\(y\)) over a given time, despite its curved distribution as shown below. The regression showed that \(\hat\beta_0=50.7\) and \(\hat\beta_1=0.49\). (Calculate \(\beta_0\) and \(\beta_1\), the variance, and the confidence intervals of \(\beta_0\), \(\beta_1\), and \(\hat y\) by yourself too; set \(\alpha=0.05\).)

 

Solution

 

Mass Increase_t1 Increase_t2
75 28 85
75 42 88
100 112 97
100 136 100
125 160 108
125 150 113
150 152 126
175 156 134
175 124 139
200 124 147
200 104 150

cancer_mass <- read.table("http://bisyn.kaist.ac.kr/bis335/10-Regression-tumorsize.txt", 
                          sep = "\t", header = T)
colnames(cancer_mass) <- c("Mass", "Increase_t1", "Increase_t2")

# Find b0, b1, yhat
fit <- lm(Increase_t1 ~ Mass, data=cancer_mass)
summary(fit)
## 
## Call:
## lm(formula = Increase_t1 ~ Mass, data = cancer_mass)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.23 -34.06  12.61  32.44  48.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  50.7225    39.3979   1.287     0.23
## Mass          0.4867     0.2747   1.772     0.11
## 
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared:  0.2586, Adjusted R-squared:  0.1762 
## F-statistic: 3.139 on 1 and 9 DF,  p-value: 0.1102
knitr::kable(data.frame(x = cancer_mass$Mass, y_hat = fit$fitted.values))
x y_hat
75 87.22513
75 87.22513
100 99.39267
100 99.39267
125 111.56021
125 111.56021
150 123.72775
175 135.89529
175 135.89529
200 148.06283
200 148.06283
# Manual calculation of Lack of fit ANOVA table
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- cancer_mass

# Degrees of freedom
df.SSPE <- length(data$Mass) - length(unique(data$Mass))
df.SSLF <- length(unique(data$Mass)) - 2

# calculation of SSPE & SSLF  
data2 <- data %>% group_by(Mass) %>% summarise(Group_mean = mean(Increase_t1))
data <- left_join(data, data2, by="Mass")
data$SSPE <- (data$Increase_t1 - data$Group_mean)^2
data$Pred <- data$Mass * fit$coefficients[2] + fit$coefficients[1]
data$SSLF <- (data$Group_mean - data$Pred)^2
SSPE <- sum(data$SSPE)
SSLF <- sum(data$SSLF)

# calculation of F statistic
MSLF <- SSLF/df.SSLF
MSPE <- SSPE/df.SSPE
a <- 0.05

f <- round(MSLF / MSPE, 4)
f_crit <- qf(1-a, df1 = df.SSLF, df2 = df.SSPE); f_crit
## [1] 5.192168
# pvalue
p <- round(1 - pf(f, df.SSLF, df.SSPE),4)

# summary table
knitr::kable(data.frame(cbind(DF = c(df.SSLF, df.SSPE), 
                              SS = c(SSLF, SSPE), 
                              MS = c(MSLF, MSPE), 
                              F = c(f,NA), 
                              p = c(p, NA)
                              ) # end of cbind
                        ) # end of data.frame
             ) # end of kable
DF SS MS F p
4 13593.57 3398.393 14.8014 0.0056
5 1148.00 229.600 NA NA
# with R package
library(alr3)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
fit <- lm(Increase_t1 ~ Mass, data = cancer_mass)
summary(fit)
## 
## Call:
## lm(formula = Increase_t1 ~ Mass, data = cancer_mass)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.23 -34.06  12.61  32.44  48.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  50.7225    39.3979   1.287     0.23
## Mass          0.4867     0.2747   1.772     0.11
## 
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared:  0.2586, Adjusted R-squared:  0.1762 
## F-statistic: 3.139 on 1 and 9 DF,  p-value: 0.1102
anova(fit)
## Analysis of Variance Table
## 
## Response: Increase_t1
##           Df  Sum Sq Mean Sq F value Pr(>F)
## Mass       1  5141.3  5141.3  3.1389 0.1102
## Residuals  9 14741.6  1638.0
pureErrorAnova(fit)
## Analysis of Variance Table
## 
## Response: Increase_t1
##              Df  Sum Sq Mean Sq F value   Pr(>F)   
## Mass          1  5141.3  5141.3  22.393 0.005186 **
## Residuals     9 14741.6  1638.0                    
##  Lack of fit  4 13593.6  3398.4  14.801 0.005594 **
##  Pure Error   5  1148.0   229.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value is less than the significance level of our test, we conclude that there is a lack of fit in the simple linear regression model.

Example of Correlation analysis (s20, s22, Wackerly p600)

The data in the table below represent a sample of mathematics achievement test scores and calculus grades for ten independently selected college freshmen. From this evidence, would you say that the achievement test scores and calculus grades are independent? Use \(\alpha = 0.05\). Identify the corresponding attained significance level.

Student Math_Test_Score Final_Calculus_Grade
1 39 65
2 43 78
3 21 52
4 64 82
5 57 92
6 47 89
7 28 73
8 75 98
9 34 56
10 52 75

Solution

  • \(H_0: \rho = 0\)

  • \(H_a: \rho \neq 0\)

  • test statistic: \(t = \frac {r\sqrt{n-2}}{\sqrt{1-r^2}}\)

data <- data.frame(student = 1:10, 
                   x = c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52), 
                   y = c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
                   )
data2 <- data
# setting parameters
n <- length(data$student)
df <- n-2
a <- 0.05

# calculation of SS
data2$xy <- data2$x * data$y
data2$x2 <- data2$x^2
data2$y2 <- data2$y^2
data2[length(data2$student)+1,] <- colSums(data2)

SS <- data2[11,-1]; SS
##      x   y    xy    x2    y2
## 11 460 760 36854 23634 59816
Sxx <- n*SS[4] - SS[1]^2  # note: these are n*Sxx, n*Syy, n*Sxy; the common factor n cancels in r
Syy <- n*SS[5] - SS[2]^2
Sxy <- n*SS[3] - SS[1]*SS[2]

knitr::kable(data.frame(Sxx = as.numeric(Sxx), 
                        Syy = as.numeric(Syy), 
                        Sxy = as.numeric(Sxy)))
Sxx Syy Sxy
24740 20560 18940
# Calculation of r and t
r <- as.numeric(Sxy/sqrt(Sxx*Syy))
t <- (r * sqrt(n-2))/(sqrt(1 - r^2))
t_crit <- qt(1-(a/2), df)
p <- 2 * pt(t, df, lower.tail = FALSE)

knitr::kable(data.frame(r = round(r,3), 
                        t = round(t,3), 
                        t_crit = round(t_crit,3), 
                        p = round(p,5)), 
             align='c')
r t t_crit p
0.84 4.375 2.306 0.00236
# base R
cor.test(data$x, data$y)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x and data$y
## t = 4.375, df = 8, p-value = 0.002365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4459632 0.9611846
## sample estimates:
##       cor 
## 0.8397859
# cf) Spearman rank correlation (for more information, see nonparametric test tutorial)
cor.test(data$x, data$y, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  data$x and data$y
## S = 20, p-value = 0.001977
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8787879

Example of Nonparametric Spearman’s Ranked Correlation Test

We discussed Spearman’s ranked correlation in the lecture on nonparametric tests. As you already know, the statistic, abbreviated \(r_S\), is a nonparametric version of Pearson’s correlation statistic. (The formula is the same as Pearson’s formula except that it uses the ranks instead of the actual values of \(x_i\) and \(y_i\).) When there are no ties in either the \(x\) observations or the \(y\) observations, the expression for \(r_S\) algebraically reduces to a simpler expression:

\[r_S = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}, \hspace{1cm} where \ d_i =R(x_i) - R(y_i)\] The statistical significance of Spearman’s correlation statistic can be determined with the same t statistic used for Pearson’s correlation statistic. Now, let’s consider a simple example: the correlation between IQ and the hours of TV watched per week.

IQTV <- data.frame(IQ = c(86, 97, 99, 100, 101, 103, 106, 110, 112, 113), 
                   TV = c(0, 20, 28, 27, 50, 29, 7, 17, 6, 12))

n <- length(IQTV$IQ)
df <- n-2
IQTV$rank_IQ <- rank(IQTV$IQ)
IQTV$rank_TV <-  rank(IQTV$TV)
IQTV$d <- IQTV$rank_IQ - IQTV$rank_TV
IQTV$d2 <- IQTV$d^2

knitr::kable(IQTV)
IQ TV rank_IQ rank_TV d d2
86 0 1 1 0 0
97 20 2 6 -4 16
99 28 3 8 -5 25
100 27 4 7 -3 9
101 50 5 10 -5 25
103 29 6 9 -3 9
106 7 7 3 4 16
110 17 8 5 3 9
112 6 9 2 7 49
113 12 10 4 6 36
rs <- 1-6*sum(IQTV$d2)/(n*(n^2-1))
t <- (rs * sqrt(n-2))/(sqrt(1 - rs^2))
t_crit <- -qt(1-(a/2),df)
p <- 2*pt(t, df)

knitr::kable(data.frame(rs = round(rs,3), 
                        t = round(t,3), 
                        t_crit = round(t_crit,3), 
                        p = round(p,5)), 
             align='c')
rs t t_crit p
-0.176 -0.505 -2.306 0.62719
# base R
cor.test(IQTV$IQ, IQTV$TV, method = "spearman", conf.level = 0.95)
## 
##  Spearman's rank correlation rho
## 
## data:  IQTV$IQ and IQTV$TV
## S = 194, p-value = 0.632
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1757576

The test result implies that there is no significant correlation between IQ and hours of TV watched per week, although the negative sign suggests there could be a weak negative correlation between the two.

Part III: Extension of Linear Model

Example of Variable transformation (s25)

A bioengineer measures the growth rate of a substance by counting the number of cells \(N\) present at various times \(t\), as shown below. Fit the model \(N=\gamma_0e^{\gamma_1t}\) and calculate a two-sided 95% confidence interval for the unknown parameter \(\gamma_1\).

time Num_cell
1 12
2 31
3 42
4 75
5 119
6 221
7 327
8 546

Solution The model \(N=\gamma_0e^{\gamma_1t}\) with two unknown parameters \(\gamma_0\) and \(\gamma_1\) is commonly used for this problem. With a logarithmic transformation, the linear relationship \[y = \beta_0 + \beta_1 x\] is obtained with \(y = \ln(N)\), \(\beta_0 = \ln(\gamma_0)\), \(\beta_1 = \gamma_1\), \(x = t\).

A straight line is fitted to the transformed data points \((x_i, y_i)\) and the fitted regression line is \[y = 0.523x + 2.176\] The equation can be transformed back into a model in terms of the original variables \[N = e^{\ln(\gamma_0) + \gamma_1t}=e^{2.176 + 0.523t}\] \[\beta_1 = \gamma_1 \in (0.472, 0.573)\]

cell_growth <- data.frame(time = 1:8, Num_cell = c(12, 31, 42, 75, 119, 221, 327, 546))

# transform of variables
cell_growth$lnt <- log(cell_growth$time)
cell_growth$lnN <- log(cell_growth$Num_cell)

# fit linear model directly
model1 <- lm(Num_cell ~ time, data = cell_growth)
summary(model1)
## 
## Call:
## lm(formula = Num_cell ~ time, data = cell_growth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -87.14 -56.16 -21.64  44.47 132.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -139.04      64.20  -2.166  0.07350 . 
## time           69.04      12.71   5.430  0.00162 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.39 on 6 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8027 
## F-statistic: 29.49 on 1 and 6 DF,  p-value: 0.001617
plot(Num_cell ~ time, data = cell_growth, pch = 16, 
     ylab = "N", xlab = "t")
lines(cell_growth$time, model1$fitted.values)

# fit linear model with transformed variable
model2 <- lm(lnN ~ time, data = cell_growth)
summary(model2)
## 
## Call:
## lm(formula = lnN ~ time, data = cell_growth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21446 -0.05110 -0.01050  0.05717  0.21144 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.17619    0.10357   21.01 7.57e-07 ***
## time         0.52318    0.02051   25.51 2.39e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1329 on 6 degrees of freedom
## Multiple R-squared:  0.9909, Adjusted R-squared:  0.9893 
## F-statistic: 650.7 on 1 and 6 DF,  p-value: 2.391e-07
plot(lnN ~ time, data = cell_growth, pch = 16, 
     ylab = "ln(N)", xlab = "t")
lines(cell_growth$time, model2$fitted.values)

slr <- function(x) exp(2.176 + 0.523*x)
plot(cell_growth$time, cell_growth$Num_cell, pch = 16, 
     ylab = "N", xlab = "t")
curve(slr, add = TRUE)

# confidence interval
confint(model2, "time", level=0.95)
##         2.5 %    97.5 %
## time 0.472994 0.5733623

Example of Multiple Linear Regression (s25)

For multiple linear regression, we define the following matrices, with \(x_0 = 1\):

\[Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} x_0 & x_{11} & x_{12} & ... & x_{1k} \\ x_0 & x_{21} & x_{22} & ... & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ x_0 & x_{n1} & x_{n2} & ... & x_{nk} \end{bmatrix}, \hspace{1cm} \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_k \end{bmatrix}, \hspace{1cm} \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots\\ \epsilon_n \end{bmatrix}\]

Thus, the \(n\) equations representing \(y_i\) as a function of the \(x\)’s, \(\beta\)’s, and \(\epsilon\)’s can be written simultaneously as

\[Y = X\beta + \epsilon\] As in simple linear regression, the least-squares equations are given by \[(X'X)\hat\beta = X'Y\]

Hence, \[\hat\beta = (X'X)^{-1}X'Y\] The \(n \times 1\) vector of fitted values is \[ \hat Y = \begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{pmatrix} = X\hat\beta\] and the \(n \times 1\) vector of residuals is \[ e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = Y - \hat Y = \begin{pmatrix} y_1 - \hat y_1 \\ y_2 - \hat y_2 \\ \vdots \\ y_n - \hat y_n \end{pmatrix}\] The sum of squares for error can then be calculated as \[SSE = e^{'}e = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\] and \[\hat\sigma^2 = MSE = \frac{SSE}{n-k-1}\] where \(k\) indicates number of input variables.

Example of multiple linear regression - Curvilinear Regression Model (Polynomial) (s25, Wackerly p612)

x y
-2 0
-1 0
0 1
1 1
2 3

Fit a parabola to the data above, using the model \[Y = \beta_0 + \beta_1x + \beta_2x^2 +\epsilon\]

Solution The \(X\) matrix for this example differs from that of the simple linear regression case only by the addition of a third column corresponding to \(x^2\). (Notice that \(x_1 = x\), \(x_2 = x^2\), and \(k = 2\) in the notation of the general linear model.) Thus,

\[Y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & \phantom{-}0 & 0 \\ 1 & \phantom{-}1 & 1 \\ 1 & \phantom{-}2 & 4 \end{bmatrix}\]

Thus, for the first measurement, \(y = 0, x_0 = 1, x = -2\), and \(x^2 = 4\) and for the second measurement, \(y = 0, x_0 = 1, x = -1\), and \(x^2 = 1\). Succeeding rows of the \(Y\) and \(X\) matrices are obtained in a similar manner. The matrix products \(X'X\) and \(X'Y\) are

\[X'X =\begin{bmatrix} \phantom{-}1 & \phantom{-}1 & 1 & 1 & 1 \\ -2 & -1 & 0 & 1 &2 \\ \phantom{-}4 & \phantom{-}1 & 0 & 1 & 4\end{bmatrix} \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & \phantom{-} 0 & 0\\ 1 & \phantom{-} 1 & 1\\ 1 & \phantom{-} 2 & 4\\ \end{bmatrix} =\begin{bmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{bmatrix}\]

\[X'Y =\begin{bmatrix} \phantom{-}1 & \phantom{-}1 & 1 & 1 & 1 \\ -2 & -1 & 0 & 1 &2 \\ \phantom{-}4 & \phantom{-}1 & 0 & 1 & 4\end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix} =\begin{bmatrix} 5 \\ 7 \\ 13 \end{bmatrix}\]

By inverting \(X'X\), we get, \[(X'X)^{-1} =\begin{bmatrix} \phantom{-} 17/35 & 0 & -1/7 \\ \phantom{-} 0 & 1/10 & 0 \\ -1/7 & 0 & \phantom{-} 1/14\end{bmatrix}\]

Finally, \[\begin{aligned} \hat\beta &= (X'X)^{-1} X'Y \\ &=\begin{bmatrix} \phantom{-} 17/35 & 0 & -1/7 \\ \phantom{-} 0 & 1/10 & 0 \\ -1/7 & 0 & \phantom{-} 1/14\end{bmatrix} \begin{bmatrix} 5 \\ 7 \\ 13 \end{bmatrix} =\begin{bmatrix} 4/7 \\ 7/10 \\ 3/14 \end{bmatrix} \approx \begin{bmatrix} 0.571 \\ 0.700 \\ 0.214 \end{bmatrix}\end{aligned}\]

Hence, \(\hat\beta_0 = 0.571, \ \hat\beta_1 = 0.7\), and \(\hat\beta_2 = 0.214\). and the prediction equation is \[\hat y = 0.571 + 0.7x + 0.214 x^2\]

toy <- data.frame(x = c(-2,-1,0,1,2), y = c(0,0,1,1,3))
toy$x2 <- toy$x^2
X <- as.matrix(data.frame(x0 = rep(1,5), x = toy$x, x2 = toy$x2))
Y <- as.matrix(toy$y)
n <- length(Y)
k <- dim(X)[2]-1

# B_hat matrix
B_hat <- solve(t(X) %*% X) %*% t(X) %*% Y; B_hat
##         [,1]
## x0 0.5714286
## x  0.7000000
## x2 0.2142857
Y_hat <- X %*% B_hat # prediction
e <- Y-Y_hat # residual
SSE <- t(e) %*% e # sum of square error
S_hat <- sqrt(SSE/(n-k-1)) # residual standard error (or MSE)


# base R
fit <- lm(y ~ x+x2, data = toy)
summary(fit)
## 
## Call:
## lm(formula = y ~ x + x2, data = toy)
## 
## Residuals:
##        1        2        3        4        5 
## -0.02857 -0.08571  0.42857 -0.48571  0.17143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.5714     0.3332   1.715   0.2285  
## x             0.7000     0.1512   4.630   0.0436 *
## x2            0.2143     0.1278   1.677   0.2355  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4781 on 2 degrees of freedom
## Multiple R-squared:  0.9238, Adjusted R-squared:  0.8476 
## F-statistic: 12.13 on 2 and 2 DF,  p-value: 0.07619

For more information on the equations for the inference of parameters of the MLR model, see Wackerly p615 ~ p634.

Part IV: Summary

Summary example 1 - \(VO_2\)-max Aerobic Fitness Measurements (s27, Hayter p560)

The table below is a data set concerning the aerobic fitness of a sample of 20 male subjects. An exercising individual breathes through an apparatus that measures the amount of oxygen in the inhaled air that is used by the individual. The maximum value per unit time of the utilized oxygen is then scaled by the person’s body weight to come up with a variable \(VO_2\)-max, which is a general indication of the aerobic fitness of the individual. In other words, \(VO_2\)-max quantifies the maximum rate of oxygen consumption measured during incremental exercise. Fit a linear regression model with \(VO_2\)-max as the dependent variable and age as the explanatory variable. Solve the questions below.

VO2.max Age
23 59
45 47
29 44
55 32
48 45
42 61
32 71
33 32
34 28
52 36
40 36
35 51
45 31
47 44
26 73
42 47
35 40
41 43
29 68
38 40
  1. Estimate intercept (\(\hat\beta_0\)) and slope (\(\hat\beta_1\)) parameters.
vo2 <- read.csv("http://bisyn.kaist.ac.kr/bis335/10-Regression-vo2max-aerobic-fitness.csv")

# prep
library(knitr)       # for kable()
library(kableExtra)  # for kable_styling(), used in the tables below
colnames(vo2) <- c("y", "x")
data <- vo2 %>% select(x, y) %>% mutate(xy = x*y, x2 = x^2, y2 = y^2)

# SS
n <- length(data$x)
SS <- colSums(data)
Sxx <- SS[4] - SS[1]^2/n
Syy <- SS[5] - SS[2]^2/n
Sxy <- SS[3] - SS[1]*SS[2]/n
kable(data.frame(Sxx = as.numeric(Sxx), Syy = as.numeric(Syy), Sxy = as.numeric(Sxy)), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
Sxx Syy Sxy
3486.8 1428.95 -1177.4
# Slope & Intercept
b1 <- Sxy/Sxx
b0 <- mean(data$y) - b1*mean(data$x)

kable(data.frame(intercept = as.numeric(round(b0,4)), 
                        slope = as.numeric(round(b1,4))), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
intercept slope
54.2181 -0.3377
  1. Calculate \(r^2\), \(r\), and determine the significance of \(r\) using \(t\)-statistic
df <- n - 2
SSE <- as.numeric(Syy - b1*Sxy)
S2 <- SSE/df # error variance

# R^2 & r
SSR <- b1*Sxy
SST <- Syy
r2 <- SSR/SST
r <- Sxy/sqrt(Sxx*Syy)

# pvalue
t <- (r * sqrt(n-2))/(sqrt(1 - r^2))
t_crit <- qt(1-(a/2), df)
p <- 2 * pt(abs(t), df, lower.tail = FALSE)  # two-sided p-value

library(kableExtra)
kable(data.frame(S2 = as.numeric(round(S2,4)),
                        r2 = as.numeric(round(r2,4)), 
                        r = as.numeric(round(r,4)),
                        sqrt_r2 = as.numeric(round(-sqrt(r2),4)),
                        p = as.numeric(round(p,4))),
             col.names = c("$S^2$", "$r^2$", "r", "$-\\sqrt {r^2}$", "p"), 
             align = 'c', escape = FALSE, format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(S^2\) \(r^2\) r \(-\sqrt {r^2}\) p
57.2985 0.2782 -0.5275 -0.5275 0.0168
  1. Find one-sided 95% confidence interval for \(\hat\beta_1\)
a <- 0.05
t_crit <- qt(1 - a, df)
S <- sqrt(S2)
c11 <- 1/Sxx

upper <- b1 + (t_crit*S*sqrt(c11)); upper
##         xy 
## -0.1153818
kable(data.frame(lower = "-$\\infty$", 
                 upper = as.numeric(round(upper,4))), 
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
lower upper
-\(\infty\) -0.1154
  1. Find two-sided 95% confidence and prediction interval for 50 years old male
# Confidence interval
t_crit <- qt(1 - (a/2), df)
x <- 50
y_hat <- b0+b1*x
K <- (1/n)+((x - mean(vo2$x))^2/Sxx)

lower <- y_hat - (t_crit*S*sqrt(K))
upper <- y_hat + (t_crit*S*sqrt(K))

kable(data.frame(y_hat = as.numeric(round(y_hat, 4)), 
                 lower = as.numeric(round(lower, 4)),
                 upper = as.numeric(round(upper,4))),
      col.names = c("$\\hat y$", "lower", "upper"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(\hat y\) lower upper
37.3344 33.6485 41.0202
# Prediction interval
lower <- y_hat - (t_crit*S*sqrt(1+K))
upper <- y_hat + (t_crit*S*sqrt(1+K))

kable(data.frame(y_hat = as.numeric(round(y_hat, 4)), 
                 lower = as.numeric(round(lower, 4)),
                 upper = as.numeric(round(upper,4))),
      col.names = c("$\\hat y$", "lower", "upper"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(\hat y\) lower upper
37.3344 21.0097 53.659
  1. Using base R function, perform above analysis
vo2 <- read.csv("http://bisyn.kaist.ac.kr/bis335/10-Regression-vo2max-aerobic-fitness.csv")

# 1. lm
fit <- lm(VO2.max ~ Age, data = vo2)
fit$coefficients # b0, b1
## (Intercept)         Age 
##  54.2180509  -0.3376735
# 2. r^2, r
## r^2
r2 <- summary(fit)
kable(data.frame(r2 = as.numeric(round(r2$r.squared, 4)), 
                 f = as.numeric(round(r2$f[1], 4)),
                 p = as.numeric(round(r2$coefficients[2,4],4)),
                 sigma = as.numeric(round(r2$sigma,4))),
      col.names = c("$r^2$", "F", "p", "$\\hat\\sigma$"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(r^2\) F p \(\hat\sigma\)
0.2782 6.9387 0.0168 7.5696
## r, t, p
r <- cor.test(vo2$VO2.max, vo2$Age)

kable(data.frame(r = as.numeric(round(r$estimate, 4)), 
                 t = as.numeric(round(r$statistic, 4)),
                 p = as.numeric(round(r$p.value,4))),
      col.names = c("$r$", "t", "p"),
      escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
  kable_styling(position = "center")
\(r\) t p
-0.5275 -2.6341 0.0168
# 3. conf.int for b1: the one-sided 95% upper bound equals the upper limit of the 90% two-sided interval
confint(fit, "Age", level = 1-2*a, "one-sided")[2]
## [1] -0.1153818
# 4. conf.int and pred.int for y_hat
newdata <- data.frame(Age = 50)
predict(fit, newdata, level=1-a, interval = "confidence")[1,]
##      fit      lwr      upr 
## 37.33438 33.64853 41.02022
predict(fit, newdata, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 37.33438 21.00974 53.65901
# 5. visualize
library(ggplot2)
qplot(vo2$Age, vo2$VO2.max, ylab = "VO2 Max", 
      xlab = "Age", main = "VO2 max vs Age") + stat_smooth(method = "lm")

  1. Questions
  • Does the sign of \(\hat\beta_1\) surprise you? What does your model predict as the change in \(VO2-max\) for an additional five years of age? \[-0.3377 \times 5 = -1.6885\]
  • What does the model tell you about the aerobic fitness of a 15-year-old boy?

If the model is used for a 15-year-old, it would be an extrapolation outside the observed age range, so the prediction may be inaccurate.

Summary example 2 - Initial mass of tumor vs Increased size of tumor

The following table summarizes the initial (\(S_0\)) and increased masses of the tumor (\(S_1\)) over a given time. From the scatter plot of the data set, you may notice that the relationship between \(S_0\) and \(S_1\) seems nonlinear but intrinsically linear. Find the best linear fit of this data by using variable transformation and answer the questions below. You can also use the variable \(\Delta S = S_1 - S_0\) to find the best fit between the initial (possibly transformed) mass and the difference or increased mass.

\(S_0\) \(S_1\)
55 73
55 87
80 182
80 206
105 255
105 245
130 272
155 301
155 269
180 294
180 274

  1. Calculate the \(R^2\) for \((S_0, \Delta S)\), \((S_0, S_1)\)
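The chunks below use a data frame `tumor` with columns `s0` and `s1`; its loading step is not shown in this excerpt, so a minimal construction from the table above would be (an assumption):

# Assumed construction of the `tumor` data frame from the table above
tumor <- data.frame(
  s0 = c(55, 55, 80, 80, 105, 105, 130, 155, 155, 180, 180),
  s1 = c(73, 87, 182, 206, 255, 245, 272, 301, 269, 294, 274)
)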
# prep
tumor$sd <- tumor$s1 - tumor$s0

# fit models
fit1 <- lm(sd ~ s0, data = tumor)
fit2 <- lm(s1 ~ s0, data = tumor)

# r^2
summary(fit1)$r.squared
## [1] 0.2585808
summary(fit2)$r.squared
## [1] 0.7649424
  1. Perform appropriate variable transformation and calculate \(R^2\) for \((S_0^*, \Delta S)\), \((S_0^*, S_1)\)
# prep
tumor$s0_t <- log(tumor$s0) # ln(s_0)

# fit models with the transformed variable
fit3 <- lm(sd ~ s0_t, data = tumor)
fit4 <- lm(s1 ~ s0_t, data = tumor)

# r^2
summary(fit3)$r.squared
## [1] 0.4075046
summary(fit4)$r.squared
## [1] 0.8827907
  1. Test which variable pairs among the four relations above are significantly dependent with \(\alpha\) level of 0.05 and 0.01.
# Extract the statistic
t <- data.frame(fit1 = summary(fit1)$coefficients[2,3],
                fit2 = summary(fit2)$coefficients[2,3],
                fit3 = summary(fit3)$coefficients[2,3],
                fit4 = summary(fit4)$coefficients[2,3])

# Setting parameters
a1 <- 0.05 
a2 <- 0.01
df <- length(tumor$s0)-2

# Calculate critical t (two-sided)
t_crit1 <- qt(1-(a1/2), df)
t_crit2 <- qt(1-(a2/2), df)

# result
t > t_crit1
##       fit1 fit2 fit3 fit4
## [1,] FALSE TRUE TRUE TRUE
t > t_crit2
##       fit1 fit2  fit3 fit4
## [1,] FALSE TRUE FALSE TRUE
# cf) R^2 cutoff
r2_crit1 <- t_crit1^2 / (t_crit1^2+df)
r2_crit2 <- t_crit2^2 / (t_crit2^2+df)

r2_crit <- data.frame(r2_crit1, r2_crit2); r2_crit
##    r2_crit1 r2_crit2
## 1 0.3624868 0.539911

Notice that for testing significance, \(\hat\beta_1\) and r use the same t statistic. (see the theory part)

Recall that \[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{1}{1+\frac{SSE}{SSR}}\]
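The \(R^2\) cutoffs computed in the code above follow from this relation: substituting \(r^2\) into the t statistic for the correlation and solving for \(r^2\) gives (a short derivation, with \(df = n-2\))

\[t^2 = \frac{r^2 (n-2)}{1-r^2} \ \ \Longrightarrow \ \ r^2 = \frac{t^2}{t^2 + (n-2)}, \hspace{1cm} so \hspace{0.5cm} r^2_{crit} = \frac{t_{crit}^2}{t_{crit}^2 + df}\]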

  1. Test if there is the “lack of fit” to the models you created with significance level of 0.05 and 0.01
library(alr3)
f_lof <- data.frame(fit1 = pureErrorAnova(fit1)[3,4],
                    fit2 = pureErrorAnova(fit2)[3,4],
                    fit3 = pureErrorAnova(fit3)[3,4],
                    fit4 = pureErrorAnova(fit4)[3,4])

# setting parameters
df1 <- pureErrorAnova(fit1)[3,1]
df2 <- pureErrorAnova(fit1)[4,1]
a1 <- 0.05
a2 <- 0.01

# critical F statistic
f_crit1 <- qf(1-a1, df1, df2)
f_crit2 <- qf(1-a2, df1, df2)

# result
f_lof > f_crit1
##      fit1 fit2 fit3 fit4
## [1,] TRUE TRUE TRUE TRUE
f_lof > f_crit2
##      fit1 fit2 fit3  fit4
## [1,] TRUE TRUE TRUE FALSE
  1. Calculate prediction interval of the model where \(x_0 = \bar x\) and \(x_0 = \bar x + 2S\) with significance level of 0.05
# prep
S <- data.frame(fit1 = summary(fit1)$sigma,
                fit2 = summary(fit2)$sigma,
                fit3 = summary(fit3)$sigma,
                fit4 = summary(fit4)$sigma)
a <- 0.05

# for x0 = xbar
newdata.fit12 <- data.frame(s0 = mean(tumor$s0))
newdata.fit34 <- data.frame(s0_t = mean(tumor$s0_t))

# prediction interval for x0 = xbar
predict(fit1, newdata.fit12, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 107.09091  11.46674 202.71508
predict(fit2, newdata.fit12, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 223.4545 127.8304 319.0787
predict(fit3, newdata.fit34, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 107.09091  21.60821 192.57361
predict(fit4, newdata.fit34, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 223.4545 155.9300 290.9791
# for x0 = xbar + 2S
newdata.fit1 <- data.frame(s0 = mean(tumor$s0) + 2 * as.numeric(summary(fit1)$sigma))
newdata.fit2 <- data.frame(s0 = mean(tumor$s0) + 2 * as.numeric(summary(fit2)$sigma))
newdata.fit3 <- data.frame(s0_t = mean(tumor$s0_t) + 2 * as.numeric(summary(fit3)$sigma))
newdata.fit4 <- data.frame(s0_t = mean(tumor$s0_t) + 2 * as.numeric(summary(fit4)$sigma))

# prediction interval for x0 = xbar + 2S
predict(fit1, newdata.fit1, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 146.48611  38.43892 254.53331
predict(fit2, newdata.fit2, level=1-a, interval = "prediction")[1,]
##      fit      lwr      upr 
## 343.7930 235.7458 451.8402
predict(fit3, newdata.fit3, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
## 4820.8822  534.0701 9107.6942
predict(fit4, newdata.fit4, level=1-a, interval = "prediction")[1,]
##       fit       lwr       upr 
##  9956.777  7281.596 12631.958
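These intervals can be reproduced by hand from \(\hat y_0 \pm t_{1-\alpha/2,\,n-2}\, S \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar x)^2}{S_{xx}}}\). A minimal sketch for fit1 at \(x_0 = \bar x\) (assuming, as newdata.fit12 implies, that fit1 is the simple regression on s0):

n    <- length(tumor$s0)
xbar <- mean(tumor$s0)
x0   <- xbar
Sxx  <- sum((tumor$s0 - xbar)^2)

yhat0   <- as.numeric(coef(fit1)[1] + coef(fit1)[2] * x0)
se_pred <- S$fit1 * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx)
crit    <- qt(1 - a/2, n - 2)

c(fit = yhat0, lwr = yhat0 - crit * se_pred, upr = yhat0 + crit * se_pred)
# should reproduce predict(fit1, newdata.fit12, interval = "prediction")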
  1. Visualize the best regression line in the transformed space and in the original space.
# transformed space
plot(tumor$s0_t, tumor$s1, pch = 16, xlab = "S0_t", ylab = "S1")
lines(tumor$s0_t, fit4$fitted.values)

# original space: back-transform the fitted line, yhat = b0 + b1 * log(x)
slrv <- function(x){ fit4$coefficients[2] * log(x) + fit4$coefficients[1] }
plot(tumor$s0, tumor$s1, pch = 16, xlab = "S0", ylab = "S1")
curve(slrv, add = TRUE)
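Since fit4 is a straight line in the transformed predictor, abline(fit4) draws the same line without depending on the row order of s0_t, and the original-space curve can be written directly from coef(fit4). A minimal alternative sketch (assuming, as slrv() implies, that s0_t = log(s0)):

# transformed space: draw the fitted straight line directly
plot(tumor$s0_t, tumor$s1, pch = 16, xlab = "S0_t", ylab = "S1")
abline(fit4, col = "red")

# original space: back-transformed line, yhat = b0 + b1 * log(x)
plot(tumor$s0, tumor$s1, pch = 16, xlab = "S0", ylab = "S1")
curve(coef(fit4)[1] + coef(fit4)[2] * log(x), add = TRUE, col = "red")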

Supplements

Simple Linear Regression

# Simple Linear Regression (1)
data("faithful")
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
summary(faithful)
##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0
with(faithful, plot(eruptions, waiting))  # Scatterplot of the data

with(faithful, cor(eruptions, waiting))  # correlation
## [1] 0.9008112
# Simple Linear Regression (2)
lm(waiting ~ eruptions, data=faithful)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Coefficients:
## (Intercept)    eruptions  
##       33.47        10.73
# Simple Linear Regression (3)
slr1 <- lm(waiting ~ eruptions, data = faithful)
summary(slr1)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0796  -4.4831   0.2122   3.9246  15.9719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
summary(slr1)$r.squared
## [1] 0.8114608
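In simple linear regression \(R^2\) is just the squared sample correlation, so the value above can be checked directly against cor():

with(faithful, cor(eruptions, waiting))^2   # equals summary(slr1)$r.squared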
# SLR - Confidence Interval and Prediction Interval
newdata <- data.frame(eruptions = 4)
predict(slr1, newdata, interval="confidence")
##        fit     lwr      upr
## 1 76.39296 75.6189 77.16702
predict(slr1, newdata, interval="predict")
##        fit      lwr     upr
## 1 76.39296 64.72382 88.0621
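To see how the two kinds of interval differ across the range of eruptions, here is a minimal sketch that overlays the confidence band (for the mean response) and the wider prediction band (for a new observation) on the scatterplot, reusing slr1:

grid <- data.frame(eruptions = seq(min(faithful$eruptions), max(faithful$eruptions), length.out = 100))
conf_band <- predict(slr1, grid, interval = "confidence")
pred_band <- predict(slr1, grid, interval = "prediction")

plot(waiting ~ eruptions, data = faithful)
lines(grid$eruptions, conf_band[, "fit"], col = "red")
lines(grid$eruptions, conf_band[, "lwr"], col = "blue", lty = 2)       # confidence band
lines(grid$eruptions, conf_band[, "upr"], col = "blue", lty = 2)
lines(grid$eruptions, pred_band[, "lwr"], col = "darkgreen", lty = 3)  # prediction band
lines(grid$eruptions, pred_band[, "upr"], col = "darkgreen", lty = 3)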
# Graph of Simple Linear Regression
plot(faithful$waiting ~ faithful$eruptions)
abline(slr1, col="red")

par(mfrow=c(2,2))
plot(slr1)

Multiple Linear Regression

# Multiple Linear Regression (1)
data("stackloss")
head(stackloss)
##   Air.Flow Water.Temp Acid.Conc. stack.loss
## 1       80         27         89         42
## 2       80         27         88         37
## 3       75         25         90         37
## 4       62         24         87         28
## 5       62         22         87         18
## 6       62         23         87         18
summary(stackloss)
##     Air.Flow       Water.Temp     Acid.Conc.      stack.loss   
##  Min.   :50.00   Min.   :17.0   Min.   :72.00   Min.   : 7.00  
##  1st Qu.:56.00   1st Qu.:18.0   1st Qu.:82.00   1st Qu.:11.00  
##  Median :58.00   Median :20.0   Median :87.00   Median :15.00  
##  Mean   :60.43   Mean   :21.1   Mean   :86.29   Mean   :17.52  
##  3rd Qu.:62.00   3rd Qu.:24.0   3rd Qu.:89.00   3rd Qu.:19.00  
##  Max.   :80.00   Max.   :27.0   Max.   :93.00   Max.   :42.00
cor(stackloss)
##             Air.Flow Water.Temp Acid.Conc. stack.loss
## Air.Flow   1.0000000  0.7818523  0.5001429  0.9196635
## Water.Temp 0.7818523  1.0000000  0.3909395  0.8755044
## Acid.Conc. 0.5001429  0.3909395  1.0000000  0.3998296
## stack.loss 0.9196635  0.8755044  0.3998296  1.0000000
# Multiple Linear Regression (2)
mlr <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
mlr
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
##     data = stackloss)
## 
## Coefficients:
## (Intercept)     Air.Flow   Water.Temp   Acid.Conc.  
##    -39.9197       0.7156       1.2953      -0.1521
summary(mlr)
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
##     data = stackloss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2377 -1.7117 -0.4551  2.3614  5.6978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
## Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
## Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
## Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.8983 
## F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09
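The adjusted \(R^2\) reported above follows from \(\bar R^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}\) with \(n = 21\) observations and \(p = 3\) predictors; a quick check:

r2 <- summary(mlr)$r.squared
n  <- nrow(stackloss)   # 21
p  <- 3                 # Air.Flow, Water.Temp, Acid.Conc.
1 - (1 - r2) * (n - 1) / (n - p - 1)   # should match the Adjusted R-squared above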
# Multiple Linear Regression (3)
pairs(stackloss)  # Scatter plot of each pair of variables

par(mfrow=c(2,2))
plot(mlr)
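Air.Flow and Water.Temp are fairly strongly correlated (about 0.78 in the matrix above), so it is worth checking variance inflation factors before interpreting individual coefficients. A minimal sketch, assuming the car package is available:

library(car)   # provides vif()
vif(mlr)       # values well above 5-10 would signal problematic collinearity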

# MLR - Confidence Interval and Prediction Interval
attach(stackloss)
## The following object is masked _by_ .GlobalEnv:
## 
##     stack.loss
## The following object is masked from package:datasets:
## 
##     stack.loss
mlr2 <- lm(stack.loss~Air.Flow+Water.Temp+Acid.Conc.)
newdata <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predict(mlr2, newdata, interval="confidence")
##        fit      lwr    upr
## 1 24.58173 20.21846 28.945
predict(mlr2, newdata, interval="prediction")
##        fit     lwr      upr
## 1 24.58173 16.4661 32.69736
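These intervals can also be rebuilt from the coefficient covariance matrix, since \(\mathrm{Var}(\hat y_0) = x_0^{\prime}\,\mathrm{Var}(\hat\beta)\,x_0\) for the mean response and the prediction variance adds \(\hat\sigma^2\); a minimal sketch for the same newdata:

x0     <- c(1, 72, 20, 85)   # intercept, Air.Flow, Water.Temp, Acid.Conc. (coefficient order)
yhat0  <- sum(x0 * coef(mlr2))
se_fit <- as.numeric(sqrt(t(x0) %*% vcov(mlr2) %*% x0))   # se of the mean response
s      <- summary(mlr2)$sigma
crit   <- qt(0.975, df.residual(mlr2))

c(fit = yhat0, lwr = yhat0 - crit * se_fit, upr = yhat0 + crit * se_fit)   # confidence
se_new <- sqrt(se_fit^2 + s^2)
c(fit = yhat0, lwr = yhat0 - crit * se_new, upr = yhat0 + crit * se_new)   # prediction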

Logistic Regression

# Logistic Regression (1)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
## The following object is masked from 'package:dplyr':
## 
##     select
data(menarche)
menarche
##      Age Total Menarche
## 1   9.21   376        0
## 2  10.21   200        0
## 3  10.58    93        0
## 4  10.83   120        2
## 5  11.08    90        2
## 6  11.33    88        5
## 7  11.58   105       10
## 8  11.83   111       17
## 9  12.08   100       16
## 10 12.33    93       29
## 11 12.58   100       39
## 12 12.83   108       51
## 13 13.08    99       47
## 14 13.33   106       67
## 15 13.58   105       81
## 16 13.83   117       88
## 17 14.08    98       79
## 18 14.33    97       90
## 19 14.58   120      113
## 20 14.83   102       95
## 21 15.08   122      117
## 22 15.33   111      107
## 23 15.58    94       92
## 24 15.83   114      112
## 25 17.58  1049     1049
str(menarche)
## 'data.frame':    25 obs. of  3 variables:
##  $ Age     : num  9.21 10.21 10.58 10.83 11.08 ...
##  $ Total   : num  376 200 93 120 90 88 105 111 100 93 ...
##  $ Menarche: num  0 0 0 2 2 5 10 17 16 29 ...
summary(menarche)
##       Age            Total           Menarche      
##  Min.   : 9.21   Min.   :  88.0   Min.   :   0.00  
##  1st Qu.:11.58   1st Qu.:  98.0   1st Qu.:  10.00  
##  Median :13.08   Median : 105.0   Median :  51.00  
##  Mean   :13.10   Mean   : 156.7   Mean   :  92.32  
##  3rd Qu.:14.58   3rd Qu.: 117.0   3rd Qu.:  92.00  
##  Max.   :17.58   Max.   :1049.0   Max.   :1049.00
plot(Menarche/Total~Age, data=menarche)

# Logistic Regression (2)
glm1 <- glm(cbind(Menarche, Total-Menarche)~Age, family=binomial(logit), data=menarche)
glm1
## 
## Call:  glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit), 
##     data = menarche)
## 
## Coefficients:
## (Intercept)          Age  
##     -21.226        1.632  
## 
## Degrees of Freedom: 24 Total (i.e. Null);  23 Residual
## Null Deviance:       3694 
## Residual Deviance: 26.7  AIC: 114.8
plot(Menarche/Total~Age, data = menarche)
lines(menarche$Age, glm1$fitted.values, col = "red")
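On the logit scale the fitted model is \(\log\frac{p}{1-p} = -21.23 + 1.63\,\mathrm{Age}\), so the age at which the estimated proportion reaches 0.5 is \(-\hat\beta_0/\hat\beta_1\); a short sketch, which also shows predictions on the probability scale:

# age at which the fitted probability equals 0.5 (logit = 0)
-coef(glm1)[1] / coef(glm1)[2]

# fitted probabilities on a fine age grid (type = "response" applies the inverse logit)
ages <- data.frame(Age = seq(9, 18, by = 0.25))
head(predict(glm1, ages, type = "response"))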