Suppose that we wish to fit the model
\[E(Y) = \beta_0 + \beta_1 x\]
to the set of data points shown in the figure below. (Note that the independent variable \(x\) could be \(w^2\) or \(w^{\frac{1}{2}}\) or \(\ln(w)\) and so on, for some other independent variable \(w\).)
The least-squares procedure for fitting a line through a set of \(n\) data points is similar to the method that we might use if we fit a line by eye. A convenient way to accomplish this, and one that yields estimators with good properties, is to minimize the sum of squares of the vertical deviations from the fitted line. Thus, if
\[\hat{y_i}=\hat\beta_0 + \hat\beta_1x_i\] is the predicted value of the \(i_{th}\) \(y\) value (when \(x = x_i\)), then the deviation (or error) of the observed value \(y_i\) from \(\hat{y_i}=\hat\beta_0 + \hat\beta_1x_i\) is the difference \(y_i - \hat y_i\), and the sum of squares of deviations to be minimized is
\[SSE = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\] To minimize this quantity, we take the partial derivatives of SSE with respect to \(\hat\beta_0\) and \(\hat\beta_1\) and set them equal to zero. Thus,
\[\frac{\partial SSE}{\partial \hat\beta_0} = \frac{\partial \{ \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\}}{\partial \hat\beta_0} = -\sum_{i=1}^n 2[y_i - (\hat\beta_0 + \hat\beta_1x_i)] \\ = -2 \bigg(\sum_{i=1}^n y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^nx_i \bigg) = 0\] and
\[\frac{\partial SSE}{\partial \hat\beta_1} = \frac{\partial \{ \sum_{i=1}^n [y_i - (\hat\beta_0 + \hat\beta_1x_i)]^2\}}{\partial \hat\beta_1} = -\sum_{i=1}^n 2[y_i - (\hat\beta_0 + \hat\beta_1x_i)]x_i \\ = -2 \bigg(\sum_{i=1}^n x_iy_i - \hat\beta_0\sum_{i=1}^n x_i - \hat\beta_1\sum_{i=1}^nx_i^2 \bigg) = 0\]
The equations \(\frac{\partial SSE}{\partial \hat\beta_0} = 0\) and \(\frac{\partial SSE}{\partial \hat\beta_1} = 0\) are called the least-squares equations for estimating the parameters of a line. The least-squares equations are linear in \(\hat\beta_0\) and \(\hat\beta_1\) and hence can be solved simultaneously. You can verify that the solutions are
\[\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n}\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \bigg( \sum_{i=1}^n x_i \bigg)^2}\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x\] Note that
\(\sum_{i=1}^n \bar x y_i = \frac{1}{n} \sum_{i=1}^n x_i \sum_{i=1}^n y_i\)
and
\(\sum_{i=1}^n \bar x \bar y - \sum_{i=1}^n \bar y x_i = \bar y \sum_{i=1}^n (\bar x - x_i) = 0\),
because
\(\sum_{i=1}^n (\bar x - x_i) = 0\)
We denote the quantity \(\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\) by \(S_{xy}\),
the quantity \(\sum_{i=1}^n (x_i - \bar x)^2\) by \(S_{xx}\),
and the quantity \(\sum_{i=1}^n (y_i - \bar y)^2\) by \(S_{yy}\).
Then the above equations become \[\hat \beta_1 = \frac{S_{xy}}{S_{xx}}, \ \ \hat\beta_0 = \bar y - \hat \beta_1 \bar x\] where \[\begin{aligned}S_{xx} &= \sum_{i=1}^n (x_i - \bar x)^2 \\ &= \sum_{i=1}^n x_i^2 - 2n\bar x^2 + n \bar x^2 \\ &= \sum_{i=1}^n x_i^2 - n \bar x^2 \\ &= \sum_{i=1}^n x_i^2 - n \bigg(\frac{\sum_{i=1}^n x_i}{n} \bigg)^2 \\ &= \sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}\end{aligned}\]
\[\begin{aligned} S_{xy} &= \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) \\ &= \sum_{i=1}^n x_iy_i - n \bar x \bar y \\ &= \sum_{i=1}^n x_iy_i - n \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n^2} \\ &= \sum_{i=1}^n x_iy_i - \frac{(\sum_{i=1}^n x_i)(\sum_{i=1}^n y_i)}{n}\end{aligned}\] and \[SSE = S_{yy} - \hat \beta_1 S_{xy}\]
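These closed-form formulas are easy to compute directly; a minimal R sketch (the helper name ls_fit and the toy vectors are illustrative, not from any package):
# Closed-form least-squares slope and intercept
ls_fit <- function(x, y) {
  Sxy <- sum((x - mean(x)) * (y - mean(y)))
  Sxx <- sum((x - mean(x))^2)
  b1  <- Sxy / Sxx
  b0  <- mean(y) - b1 * mean(x)
  c(b0 = b0, b1 = b1)
}
ls_fit(x = c(-2, -1, 0, 1, 2), y = c(0, 0, 1, 1, 3))  # returns b0 = 1, b1 = 0.7 for the example used later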
Recall the model, \[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\epsilon_i \sim N(0, \sigma^2)\)
Then we can find the mean and variance of each \(Y_i\) \[E[Y_i|X_i = x_i] = \beta_0 + \beta_1x_i\] and \[Var[Y_i|X_i = x_i] = \sigma^2\] Additionally, the \(Y_i\) follow a normal distribution conditioned on the \(x_i\), \[Y_i|X_i \sim N(\beta_0+\beta_1 x_i, \sigma^2)\] Recall that the P.D.F of a random variable \(X \sim N(\mu, \sigma^2)\) is given by \[f_X(x;\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg[-\frac{1}{2} \bigg(\frac{x-\mu}{\sigma}\bigg)^2\bigg]\] Then we can write the P.D.F of each of the \(Y_i\) as \[f_{Y_i}(y_i,x_i;\beta_0, \beta_1,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg[-\frac{1}{2} \bigg(\frac{y_i-(\beta_0+\beta_1x_i)}{\sigma}\bigg)^2\bigg]\] Given \(n\) data points \((x_i, y_i)\) we can write the likelihood, which is a function of the three parameters \(\beta_0, \beta_1\), and \(\sigma^2\). Since the data have been observed, we use lower case \(y_i\) to denote that these values are no longer random. \[L(\beta_0, \beta_1,\sigma^2) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg[-\frac{1}{2} \bigg(\frac{y_i-\beta_0-\beta_1x_i}{\sigma}\bigg)^2\bigg]\] Our goal is to find the values of \(\beta_0, \beta_1\), and \(\sigma^2\) that maximize this function, which is a straightforward multivariate calculus problem.
We’ll start by doing a bit of rearranging to make our task easier. \[L(\beta_0, \beta_1,\sigma^2) = \bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\bigg)^n \exp\bigg[-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2\bigg]\] Then, as is often the case when finding MLEs, for mathematical convenience we take the natural logarithm of the likelihood function, since log is a monotonically increasing function. We then maximize the log-likelihood, and the resulting estimates are the same as if we had not taken the log. \[\log L(\beta_0, \beta_1,\sigma^2) = -\frac{n}{2} \log{(2\pi)} - \frac{n}{2} \log{(\sigma^2)} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2\] Note that we use log to mean the natural logarithm (or \(\ln\)). We now take a partial derivative with respect to each of the parameters.
\[\begin{aligned} \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \beta_0} &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) \\ \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \beta_1} &= \frac{1}{\sigma^2} \sum_{i=1}^n x_i (y_i-\beta_0-\beta_1x_i) \\ \frac{\partial \log L(\beta_0, \beta_1, \sigma^2)}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2 \end{aligned}\]
We then set each of the partial derivatives equal to zero and solve the resulting system of equations.
\[\begin{aligned} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) &= 0\\ \sum_{i=1}^n x_i (y_i-\beta_0-\beta_1x_i) &= 0\\ -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i-\beta_0-\beta_1x_i)^2 &=0 \end{aligned}\]
You may notice that the first two equations also appear in the least squares approach. Then, skipping the issue of actually checking if we have found a maximum, we then arrive at our estimates. We call these estimates the maximum likelihood estimates.
\[\begin{aligned} \hat\beta_1 &= \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n} (\sum_{i=1}^n x_i) (\sum_{i=1}^n y_i)}{\sum_{i=1}^n x_i^2 - \frac{1}{n} (\sum_{i=1}^n x_i)^2} = \frac{S_{xy}}{S_{xx}}\\ \hat\beta_0 &= \bar y - \hat\beta_1 \bar x \\ \hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^n (y_i-\hat y_i)^2 \end{aligned}\]
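As a quick numerical sanity check (not part of the original derivation), the log-likelihood can also be maximized directly with optim(); a minimal sketch, assuming the five toy data points used in the worked example below:
# Numerically maximize the log-likelihood over (beta0, beta1, log(sigma^2))
x_toy <- c(-2, -1, 0, 1, 2)
y_toy <- c(0, 0, 1, 1, 3)
negloglik <- function(par) {
  b0 <- par[1]; b1 <- par[2]; s2 <- exp(par[3])   # log-parameterize sigma^2 so it stays positive
  -sum(dnorm(y_toy, mean = b0 + b1 * x_toy, sd = sqrt(s2), log = TRUE))
}
mle <- optim(c(0, 0, 0), negloglik)
c(beta0 = mle$par[1], beta1 = mle$par[2], sigma2 = exp(mle$par[3]))
# should be close to the closed-form answers: beta0 = 1, beta1 = 0.7, sigma2 = SSE/n = 0.22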
Note that \(\hat\beta_0\) and \(\hat\beta_1\) are the same as the least squares estimates. However we now have a new estimate of \(\sigma^2\), that is \(\hat\sigma^2\). So we now have two different estimates of \(\sigma^2\).
\[\begin{aligned}S_e^2 &= \frac{1}{n-2} \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n-2} \sum_{i=1}^{n}e_i^2 \hspace{1cm} Least \ \ Squares \\ \hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n} \sum_{i=1}^{n}e_i^2 \hspace{1cm} MLE \end{aligned}\]
The least-squares estimate \(S_e^2\) is unbiased, while the MLE \(\hat\sigma^2\) is biased, since \(E(S_e^2) = \sigma^2\) but \(E(\hat\sigma^2) \neq \sigma^2\) (see the supplementary tutorial).
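The bias can also be illustrated with a small simulation (a hedged sketch; the true parameter values, sample size, and number of replications below are arbitrary choices):
# Simulate many samples from a known model and average the two variance estimates
set.seed(1)
xs <- seq(-2, 2, length.out = 10)
n  <- length(xs)
sims <- replicate(5000, {
  y <- 1 + 0.7 * xs + rnorm(n)        # true sigma^2 = 1
  e <- resid(lm(y ~ xs))
  c(Se2 = sum(e^2) / (n - 2),         # least-squares estimate (unbiased)
    mle = sum(e^2) / n)               # MLE (biased downward)
})
rowMeans(sims)  # Se2 averages near 1; the MLE averages near (n - 2)/n = 0.8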
Question: Is bias always a bad thing?
Try to answer this yourself, and see the linked posts for more discussion.
Use the method of least squares to fit a straight line to the \(n=5\) data points given in the table below. (Wackerly, p. 571)
x | y |
---|---|
-2 | 0 |
-1 | 0 |
0 | 1 |
1 | 1 |
2 | 3 |
Solution
We begin the computation of the least-squares estimates for the slope and intercept of the fitted line by constructing the table below.
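The R chunks in this solution assume a data frame named data that holds this table together with a Sum row in position 6 (that is how data$x[6], data$xy[6], and so on are used below); a minimal construction:
data <- data.frame(x = c(-2, -1, 0, 1, 2), y = c(0, 0, 1, 1, 3))
data$xy <- data$x * data$y
data$x2 <- data$x^2
data$y2 <- data$y^2
data[6, ] <- colSums(data)      # row 6 holds the column sums used in later chunks
rownames(data)[6] <- "Sum"
knitr::kable(data)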
Using the results from this table, we obtain
\[\hat \beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n x_iy_i - \frac{1}{n}\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \bigg( \sum_{i=1}^n x_i \bigg)^2} = \frac {7 - \frac{1}{5} (0)(5)}{10 - \frac{1}{5} (0)^2} = 0.7\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x = \frac{5}{5} - (0.7)(0) = 1\] and the fitted line is
\[\hat y = 1 + 0.7x\]
slr <- function(x){1+0.7*x}
plot(data$x[1:5], data$y[1:5], xlab = "x", ylab = "y", ylim = c(-1.5, 3))
curve(slr, add = TRUE)
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
##
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
##
## Residuals:
## 1 2 3 4 5
## 4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01 6.000e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.2708 3.693 0.0345 *
## x[1:5] 0.7000 0.1915 3.656 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared: 0.8167, Adjusted R-squared: 0.7556
## F-statistic: 13.36 on 1 and 3 DF, p-value: 0.03535
Find the variances of the estimators \(\hat\beta_0\) and \(\hat\beta_1\) for the table below. (Wackerly, p. 579)
x | y |
---|---|
-2 | 0 |
-1 | 0 |
0 | 1 |
1 | 1 |
2 | 3 |
Solution: In the previous example we found that \[n = 5, \hspace{0.5cm} \sum_{i=1}^n x_i = 0, \hspace{0.5cm} \sum_{i=1}^n x_i^2 = 10, \hspace{0.5cm} S_{xx} = 10, \hspace{0.5cm} and \hspace{0.5cm} \bar x = 0\] Thus, \[V(\hat\beta_0) = \frac{\sigma^2\sum_{i=1}^n x_i^2}{nS_{xx}} = \frac{\sigma^2 (10)}{5(10)} = \bigg(\frac{1}{5}\bigg)\sigma^2\] and \[V(\hat\beta_1) = \bigg(\frac{1}{10}\bigg)\sigma^2\] Notice that \(Cov(\hat\beta_0, \hat\beta_1) = 0\) in this case since \(\sum_{i=1}^n x_i = 0\)
To get the answer, we need to estimate \(\sigma^2\). For these data, \(n = 5\) and we have already determined that \[\sum_{i=1}^n y_i = 5, \hspace{0.5cm} S_{xy} = 7, \hspace{0.5cm} and \hspace{0.5cm} \hat\beta_1 = 0.7\] It is easily determined that \(\sum_{i=1}^n y_i^2 = 11\) and that \[S_{yy} = \sum_{i=1}^n(y_i - \bar y)^2 = \sum_{i=1}^ny_i^2 - n(\bar y)^2 = 11 - 5(1)^2 = 6.0\] Therefore, \[SSE = S_{yy} - \hat\beta_1 S_{xy} = 6.0 - (0.7)(7) = 1.1\] and
\[S^2 = \frac{SSE}{n-2} = \frac{1.1}{5-2} = \frac{1.1}{3} = 0.367\]
i | x | y | xy | \(x^2\) | \(y^2\) |
---|---|---|---|---|---|
1 | -2 | 0 | 0 | 4 | 0 |
2 | -1 | 0 | 0 | 1 | 0 |
3 | 0 | 1 | 0 | 0 | 1 |
4 | 1 | 1 | 1 | 1 | 1 |
5 | 2 | 3 | 6 | 4 | 9 |
Sum | 0 | 5 | 7 | 10 | 11 |
n <- 5
Sxx <- data$x2[6] - n * (data$x[6]/n)^2
Syy <- data$y2[6] - n * (data$y[6]/n)^2
Sxy <- data$xy[6] - data$x[6] * data$y[6] / n
b1 <- 0.7
b0 <- 1
# estimation of population variance
SSE <- Syy - b1*Sxy
S2 <- SSE/(n-2)
S <- sqrt(S2)
# cf) MLE
S2_mle <- SSE/n
S_mle <- sqrt(S2_mle)
# Formatting
knitr::kable(data.frame(Sxx, Syy, Sxy, SSE, S2, S, S2_mle, S_mle))
Sxx | Syy | Sxy | SSE | S2 | S | S2_mle | S_mle |
---|---|---|---|---|---|---|---|
10 | 6 | 7 | 1.1 | 0.3666667 | 0.6055301 | 0.22 | 0.4690416 |
# variance of regression coefficient
V_b1 <- S2/Sxx; V_b1; S2/10
## [1] 0.03666667
## [1] 0.03666667
V_b0 <- (S2*data$x2[6])/(n*Sxx); V_b0; S2/5
## [1] 0.07333333
## [1] 0.07333333
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
##
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
##
## Residuals:
## 1 2 3 4 5
## 4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01 6.000e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.2708 3.693 0.0345 *
## x[1:5] 0.7000 0.1915 3.656 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared: 0.8167, Adjusted R-squared: 0.7556
## F-statistic: 13.36 on 1 and 3 DF, p-value: 0.03535
Do the data of previous example present sufficient evidence to indicate that the slope differs from 0? Test using \(\alpha = 0.05\) and give bounds for the attained significance level. (Wackerly p585)
Solution
From previous example, we can get
\[C_{11} = \frac{1}{S_{xx}} = \frac{1}{10} = 0.1, \hspace{1cm} S = \sqrt{S^2} = \sqrt{0.367} = 0.606\] and \[t = \frac{\hat\beta_1 - 0}{s\sqrt{C_{11}}} = \frac{0.7-0}{0.606\sqrt{0.1}} = 3.65\] If we take \(\alpha = 0.05\), the value of \(t_{\alpha/2} = t_{0.025}\) for 3 df is 3.182 from the t-table, and the rejection region is \[reject \ \ if \ \ |t| \geq 3.182\] Therefore, we reject the null hypothesis that \(\beta_1 = 0\) at the \(\alpha = 0.05\) level of significance. Since \(t_{0.025} = 3.182 < 3.65 < t_{0.01} = 4.541\) for 3 df, the attained significance level is bounded by \(0.02 < p < 0.05\), in agreement with the computed p-value of 0.035 below.
df <- n-2
c11 <- 1/Sxx
S <- sqrt(S2)
t <- (b1-0)/(S * sqrt(c11))
t_crit <- qt(0.975, df)
pval <- 2*pt(t, df, lower.tail = FALSE)
knitr::kable(data.frame(df, c11, S, t, t_crit, pval))
df | c11 | S | t | t_crit | pval |
---|---|---|---|---|---|
3 | 0.1 | 0.6055301 | 3.655631 | 3.182446 | 0.0353528 |
# base R function
slr2 <- lm(y[1:5] ~ x[1:5], data = data)
summary(slr2)
##
## Call:
## lm(formula = y[1:5] ~ x[1:5], data = data)
##
## Residuals:
## 1 2 3 4 5
## 4.000e-01 -3.000e-01 -2.776e-16 -7.000e-01 6.000e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.2708 3.693 0.0345 *
## x[1:5] 0.7000 0.1915 3.656 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6055 on 3 degrees of freedom
## Multiple R-squared: 0.8167, Adjusted R-squared: 0.7556
## F-statistic: 13.36 on 1 and 3 DF, p-value: 0.03535
Calculate a 95% confidence interval for the parameter \(\beta_1\) of the previous example. (Wackerly, p. 586)
Solution: From the t-table, the critical t-value \(t_{0.025}\) based on 3 df is 3.182. Then the 95% confidence interval for \(\beta_1\) is
\[\hat\beta_1 \pm t_{0.025}s\sqrt{C_{11}}\] Substituting, we get \[0.7 \pm (3.182)(0.606)\sqrt{0.1} \hspace{1cm} or \hspace{1cm} 0.7 \pm 0.610\] If we wish to estimate \(\beta_1\) to within 0.15 unit, it is obvious that the confidence interval is too wide and that the sample size must be increased.
b1 <- 0.7
t_crit <- qt(0.975, df)
S <- sqrt(S2)
c11 <- 1/Sxx
lower <- b1 - (t_crit*S*sqrt(c11))
upper <- b1 + (t_crit*S*sqrt(c11))
knitr::kable(data.frame(lower, upper))
lower | upper |
---|---|
0.0906079 | 1.309392 |
# base R
confint(slr2, "x[1:5]", level=0.95)
## 2.5 % 97.5 %
## x[1:5] 0.09060793 1.309392
For the same example, find a 90% confidence interval for \(E(Y)\) when \(x=1\). (Wackerly, p. 592)
Solution: For the model of the previous example, \[E(Y) = \beta_0 + \beta_1x\] To estimate \(E(Y)\) for any fixed value \(x=x^*\), we use the unbiased estimator \(\hat{E(Y)} = \hat\beta_0 + \hat\beta_1x^*\). Then, \[\hat\beta_0 +\hat\beta_1x^* = 1 + 0.7x^*\] For the case of \(x^* = 1\), and because \(n=5, \ \bar x = 0,\) and \(S_{xx} = 10\), it follows that \[\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} = \frac{1}{5} + \frac{(1-0)^2}{10} = 0.3\] We already know that \(S^2 = 0.367\) or \(S = 0.606\) and \(t_{0.05}\) with \(n-2 = 3\) df is 2.353. Thus, the confidence interval for \(E(Y)\) when \(x=1\) is \[\hat\beta_0 + \hat\beta_1x^* \pm t_{\alpha/2}S\sqrt{\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}\] \[[1 + (0.7)(1)] \pm (2.353)(0.606)\sqrt{0.3} = 1.7 \pm 0.781\] That is, we are 90% confident that, when the independent variable takes on the value \(x=1\), the mean value \(E(Y)\) of the dependent variable is between 0.919 and 2.481. This interval obviously is very wide, but remember that it is based on only 5 data points.
a <- 0.1
n <- 5
b0 <- 1
b1 <- 0.7
df <- n-2
t_crit <- qt(1 - (a/2), df)
x <- 1
y_hat <- b0+b1*x
lower <- y_hat - (t_crit*S*sqrt((1/n)+((x - data$x[6]/n)^2/Sxx)))   # data$x[6]/n is x-bar
upper <- y_hat + (t_crit*S*sqrt((1/n)+((x - data$x[6]/n)^2/Sxx)))
knitr::kable(data.frame(y_hat, lower, upper))
y_hat | lower | upper |
---|---|---|
1.7 | 0.9194776 | 2.480522 |
# base R
new.data <- data.frame(x = 1)
predict(slr2, new.data, level=0.90, interval = "confidence")[1,]
## Warning: 'newdata' had 1 row but variables found have 5 rows
## fit lwr upr
## 1.7000000 0.9194776 2.4805224
Suppose that the experiment that generated the data of previous example is to be run again with \(x=2\). Predict the particular value of \(y\) with \(1-\alpha= 0.90\) (Wackerly p596)
Solution The predicted value of \(y\) with \(x=2\) is \[\hat\beta_0 + \hat\beta_1x^* = 1 + (0.7)(2) = 2.4\] Further, with \(x^*=2\), \[\frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}} = \frac{1}{5} + \frac{(2-0)^2}{10} = 0.6\] Thus the prediction interval is \[\hat\beta_0+ \hat\beta_1x^* \pm t_{\alpha/2}{S \sqrt{1 + \frac{1}{n} + \frac{(x^*-\bar x)^2}{S_{xx}}}} = 2.4 \pm (2.353)(0.606)\sqrt{1+0.6} = 2.4 \pm 1.804\]
n <- 5
b0 <- 1
b1 <- 0.7
df <- n-2
t_crit <- qt(0.95, df)
x <- 2
y <- b0+b1*x
lower <- b0+b1*x - (t_crit*S*sqrt(1+(1/n)+((x - data$x[6]/n)^2/Sxx)))   # data$x[6]/n is x-bar
upper <- b0+b1*x + (t_crit*S*sqrt(1+(1/n)+((x - data$x[6]/n)^2/Sxx)))
knitr::kable(data.frame(y, lower, upper))
y | lower | upper |
---|---|---|
2.4 | 0.5974608 | 4.202539 |
# base R
new.data <- data.frame(x = 2)
predict(slr2, new.data, level=0.90, interval = "prediction")[1,]
## Warning: 'newdata' had 1 row but variables found have 5 rows
## fit lwr upr
## 2.4000000 0.5974608 4.2025392
As you can see in the figure above, the confidence bands for \(E(Y)\) are narrower than the prediction bands for \(y\).
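This can be visualized by plotting both bands over a grid of x values; a minimal sketch that refits the five-point model with a clean formula so that predict() accepts new data (unlike slr2 above, which triggers the newdata warnings):
fit_bands <- lm(y ~ x, data = data[1:5, ])
grid <- data.frame(x = seq(-2, 2, by = 0.1))
conf_band <- predict(fit_bands, grid, interval = "confidence", level = 0.90)
pred_band <- predict(fit_bands, grid, interval = "prediction", level = 0.90)
plot(data$x[1:5], data$y[1:5], xlab = "x", ylab = "y", ylim = range(pred_band))
lines(grid$x, conf_band[, "fit"])
lines(grid$x, conf_band[, "lwr"], lty = 2); lines(grid$x, conf_band[, "upr"], lty = 2)  # confidence band
lines(grid$x, pred_band[, "lwr"], lty = 3); lines(grid$x, pred_band[, "upr"], lty = 3)  # prediction band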
The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon the plant’s production. The data set shown below is compiled and provides the plant’s production and electrical usage for each month of the previous year. The electrical usage is in units of a million kilowatt-hours, and the production is measured as the value in million-dollar units of the cars produced in that month.
Month | Production | Usage |
---|---|---|
Jan | 4.51 | 2.48 |
Feb | 3.58 | 2.26 |
Mar | 4.31 | 2.47 |
Apr | 5.06 | 2.77 |
May | 5.64 | 2.99 |
Jun | 4.99 | 3.05 |
Jul | 5.29 | 3.18 |
Aug | 5.83 | 3.46 |
Sep | 4.70 | 3.03 |
Oct | 5.61 | 3.26 |
Nov | 4.90 | 2.67 |
Dec | 4.20 | 2.53 |
As shown in the scatter plot of the data set, the manager concludes that it is sensible to fit a straight line to the data points. The electricity usage is taken to be the dependent variable \(y\), and the production is taken to be the explanatory variable \(x\). The linear model \(y = \beta_0 + \beta_1x\) will then allow a month’s electrical usage to be estimated as a function of the month’s production. We can perform an ANOVA test for the regression line.
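The chunk below assumes that data now holds the car plant measurements (overwriting the earlier toy data frame); a minimal sketch of building it from the table above (the Month column is only for labeling):
data <- data.frame(
  Month      = month.abb,
  Production = c(4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20),
  Usage      = c(2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53)
)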
# Fitting linear model
fit <- lm(Usage ~ Production, data = data); summary(fit)
##
## Call:
## lm(formula = Usage ~ Production, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23245 -0.16704 0.03919 0.13480 0.27645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40905 0.38599 1.060 0.314
## Production 0.49883 0.07835 6.367 8.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1729 on 10 degrees of freedom
## Multiple R-squared: 0.8021, Adjusted R-squared: 0.7823
## F-statistic: 40.53 on 1 and 10 DF, p-value: 8.176e-05
# ANOVA
res <- summary(aov(fit)); res
## Df Sum Sq Mean Sq F value Pr(>F)
## Production 1 1.2124 1.2124 40.53 8.18e-05 ***
## Residuals 10 0.2991 0.0299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res <- unlist(res)
# R^2 = SSR/SST
R2 <- res[3]/(res[3]+res[4]); names(R2) <- "R2"; R2
## R2
## 0.8021094
# visualize
library(ggplot2)
qplot(data$Production, data$Usage, ylab = "Electricity Usage",
xlab = "Production", main = "Production vs Electricity Usage") +
stat_smooth(method = "lm")
The residuals are defined to be \[e_i = y_i - \hat y_i, \hspace{1cm} 1\leq i \leq n\] so that they are the differences between the observed values of the dependent variable \(y_i\) and the corresponding fitted values \(\hat y_i\). A property of these residuals is that they sum to 0 \[\sum_{i=1}^n e_i = 0\] Residual analysis can be used to:
Identify data points that are outliers (residual-leverage plot, Cook’s distance)
Check whether the fitted model is appropriate (Scale-location plot)
Check whether the error variance is constant (residual plots, standardized residual plots)
Check whether the error terms are normally distributed (normal probability plot)
Here I list some residual plots that can be used for diagnostics of the regression model:
Random scatter and no patterns - best fit
Random scatter with outliers
A nonlinear relationship exists - Simple linear regression is not appropriate
A funnel shape residual plot - Non constant error variance
Straight line in a normal scores plot - Normally distributed
Nonlinear pattern in a normal scores plot - Not normally distributed
Plot the residuals \(e_i\) against the values of the explanatory variable \(x_i\)
# Residuals vs Fitted
plot(fit$fitted.values, fit$residuals, xlab = "Fitted value",
ylab = "Residuals", main = "Residuals vs Fitted")
abline(h=0, col="grey", lty = 2)
abline(h=0.2, col="grey", lty = 2); abline(h=-0.2, col="grey", lty = 2)
# base R - plot.lm
plot(fit, which=1)
Notice that the red line in the residuals vs fitted value plot represents the average value of the residuals at each fitted value (a scatterplot smoother). From this plot, we can identify outliers or a nonlinear pattern in the data.
Plot the standardized residual \(\frac{e_i}{\hat\sigma}\) against the values of the explanatory variable \(x_i\) or leverage \(h_i\)
Standardized residual
The standardized residual is simply the ratio of the residual to the estimate of the error standard deviation, as shown below. \[\frac{e_i}{\hat\sigma}\] We can take the square root of the absolute value of the standardized residual to diminish the skewness of its distribution (this is what the scale-location plot shows).
If we replace the overall estimate of the error standard deviation with one that accounts for the leverage of each point, we get the studentized residual. \[\frac{e_i}{S \sqrt{1-h_i}}\]
Leverage (\(h_i\))
An outlier is a data point whose response \(y\) does not follow the general trend of the rest of the data. A data point has high leverage if it has “extreme” predictor \(x\) values. A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results.
The statistic called leverage is given as follows \[h_i = \frac{1}{n} + \frac{(x_i-\bar x)^2}{\sum_{j=1}^n (x_j-\bar x)^2}\] and its second term indicates the proportion of the total sum of squares of the \(x\) values (\(S_{xx}\)) contributed by the \(i_{th}\) point. Two useful properties of the leverage are that \(\frac{1}{n} \leq h_i \leq 1\) and that \(\sum_{i=1}^n h_i = 2\) in simple linear regression (one unit of leverage per estimated parameter).
Cook’s Distance (\(D_i\))
Cook’s distance is a statistic that quantifies influence of data points. I summarized the definition and properties of Cook’s distance below.
\[D_i = \frac{(y_i - \hat y_i)^2}{2\times MSE}\bigg[\frac{h_{i}}{(1-h_{i})^2}\bigg] = \frac{e_i^2}{2\times MSE}\bigg[\frac{h_{i}}{(1-h_{i})^2}\bigg]\] - \(D_i\) directly summarizes how much all of the fitted values change when the \(i_{th}\) observation is deleted
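A minimal sketch that computes these diagnostics by hand for the electricity-usage fit and checks them against the base R helpers hatvalues(), rstandard(), and cooks.distance() (the object names are illustrative):
xp  <- data$Production
lev <- 1/length(xp) + (xp - mean(xp))^2 / sum((xp - mean(xp))^2)   # leverage h_i
ei  <- resid(fit)
sig <- summary(fit)$sigma                                          # estimate of sigma
std_res  <- ei / sig                     # standardized residual as defined above
stud_res <- ei / (sig * sqrt(1 - lev))   # (internally) studentized residual
cook     <- ei^2 / (2 * sig^2) * lev / (1 - lev)^2                 # Cook's distance (p = 2 parameters)
all.equal(unname(lev), unname(hatvalues(fit)))         # TRUE
all.equal(unname(stud_res), unname(rstandard(fit)))    # TRUE
all.equal(unname(cook), unname(cooks.distance(fit)))   # TRUE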
# Standardized residual vs Fitted
fit.stdres <- rstandard(fit)
plot(fit$fitted.values, fit.stdres, xlab = "Fitted value",
ylab = "Standardized Residuals", ylim = c(-2, 2), main = "Standardized residuals vs Fitted")
abline(h=0, col="grey", lty = 3)
abline(h=1, col="grey", lty = 2); abline(h=-1, col="grey", lty = 2)
abline(h=2, col="grey", lty = 2); abline(h=-2, col="grey", lty = 2)
# Square root |Standardized residual| vs Fitted
plot(fit$fitted.values, sqrt(abs(fit.stdres)), xlab = "Fitted value", ylim = c(0, 1.3),
ylab = bquote(sqrt(abs("Standardized residuals"))), main = "Scale-Location")
plot(fit, which=3)
# Cook's distance - Influential outlier detection
plot(fit, which=4)
# Standardized residual vs leverage - Influential outlier detection
plot(fit, which=5)
Plot a normal probability plot (also referred to as a normal scores plot, or Q-Q plot)
A normal scores plot is used to check whether the error terms \(\epsilon_i\) appear to be normally distributed. In such a plot the residuals \(e_i\) are plotted against their “normal scores”, which can be thought of as being their “expected values” if they have a normal distribution. The normal score of the \(i_{th}\) smallest residual is \[\Phi^{-1}\Bigg(\frac{i - \frac{3}{8}}{n+\frac{1}{4}}\Bigg)\]
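A minimal sketch of computing these normal scores directly for the electricity-usage residuals (qqnorm() uses very similar plotting positions, so the plots are essentially the same):
ei <- resid(fit)
nres <- length(ei)
rk <- rank(ei)                                  # rank of each residual (1 = smallest)
nscore <- qnorm((rk - 3/8) / (nres + 1/4))      # normal score from the formula above
plot(nscore, ei, xlab = "Normal scores", ylab = "Residuals",
     main = "Normal Probability Plot (manual)")
# qqnorm(ei) gives an essentially identical plot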
qqnorm(fit.stdres, ylab = "Standardize residuals", xlab = "Normal scores",
main = "Normal Probability Plot", ylim = c(-1.5, 1.9))
qqline(fit.stdres, lty=3, col="grey")
# Q-Q plot
plot(fit, which=2)
A perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model gives a zero error sum of squares, i.e. \(SSE = 0\). However, if you repeat the experiment with the same explanatory variable values \(x_i\) and record the response variable \(y_i\) multiple times, the observations from the second trial may deviate from the perfect model. This error originates from purely random variation or noise. The sum of squares due to this error, called the sum of squares of pure error and abbreviated \(SSPE\), quantifies these variations. \(SSPE\) is calculated by taking repeated observations at some or all values of \(x_i\) and adding up the squared deviations at each level of \(x_i\). The remaining error is called the sum of squares of lack of fit and is abbreviated \(SSLF\). Therefore,
\[SSE = SSPE + SSLF\] As laid out in the table below, in the lack-of-fit test we analyze the residual error SSE in terms of SSLF and SSPE by computing the mean square of each error separately.
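The layout of this table is sketched below (my summary of the standard lack-of-fit ANOVA, with \(c\) denoting the number of distinct \(x\) levels); the F statistic MSLF/MSPE is compared with an F distribution on \(c-2\) and \(n-c\) degrees of freedom.
Source | df | SS | MS | F |
---|---|---|---|---|
Lack of fit | c - 2 | SSLF | MSLF = SSLF/(c - 2) | F = MSLF/MSPE |
Pure error | n - c | SSPE | MSPE = SSPE/(n - c) | |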
Suppose that we built a simple linear regression model between the size of a cancer mass (\(x\)) and its increase in size (\(y\)) over a given time, despite the curved pattern shown below. The regression gives \(\hat\beta_0=50.7\) and \(\hat\beta_1=0.49\). (Calculate \(\beta_0\), \(\beta_1\), and the variances and confidence intervals of \(\beta_0\), \(\beta_1\), and \(\hat y\) by yourself too; set \(\alpha=0.05\).)
Solution
Mass | Increase_t1 | Increase_t2 |
---|---|---|
75 | 28 | 85 |
75 | 42 | 88 |
100 | 112 | 97 |
100 | 136 | 100 |
125 | 160 | 108 |
125 | 150 | 113 |
150 | 152 | 126 |
175 | 156 | 134 |
175 | 124 | 139 |
200 | 124 | 147 |
200 | 104 | 150 |
cancer_mass <- read.table("http://bisyn.kaist.ac.kr/bis335/10-Regression-tumorsize.txt",
sep = "\t", header = T)
colnames(cancer_mass) <- c("Mass", "Increase_t1", "Increase_t2")
# Find b0, b1, yhat
fit <- lm(Increase_t1 ~ Mass, data=cancer_mass)
summary(fit)
##
## Call:
## lm(formula = Increase_t1 ~ Mass, data = cancer_mass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.23 -34.06 12.61 32.44 48.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.7225 39.3979 1.287 0.23
## Mass 0.4867 0.2747 1.772 0.11
##
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared: 0.2586, Adjusted R-squared: 0.1762
## F-statistic: 3.139 on 1 and 9 DF, p-value: 0.1102
knitr::kable(data.frame(Mass = cancer_mass$Mass, y_hat = fit$fitted.values))
Mass | y_hat |
---|---|
75 | 87.22513 |
75 | 87.22513 |
100 | 99.39267 |
100 | 99.39267 |
125 | 111.56021 |
125 | 111.56021 |
150 | 123.72775 |
175 | 135.89529 |
175 | 135.89529 |
200 | 148.06283 |
200 | 148.06283 |
# Manual calculation of Lack of fit ANOVA table
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- cancer_mass
# Degrees of freedom
df.SSPE <- length(data$Mass) - length(unique(data$Mass))
df.SSLF <- length(unique(data$Mass)) - 2
# calculation of SSPE & SSLF
data2 <- data %>% group_by(Mass) %>% summarise(Group_mean = mean(Increase_t1))
data <- left_join(data, data2, by="Mass")
data$SSPE <- (data$Increase_t1 - data$Group_mean)^2
data$Pred <- data$Mass * fit$coefficients[2] + fit$coefficients[1]
data$SSLF <- (data$Group_mean - data$Pred)^2
SSPE <- sum(data$SSPE)
SSLF <- sum(data$SSLF)
# calculation of F statistic
MSLF <- SSLF/df.SSLF
MSPE <- SSPE/df.SSPE
a <- 0.05
f <- round(MSLF / MSPE, 4)
f_crit <- qf(1-a, df1 = df.SSLF, df2 = df.SSPE); f_crit
## [1] 5.192168
# pvalue
p <- round(1 - pf(f, df.SSLF, df.SSPE),4)
# summary table
knitr::kable(data.frame(cbind(DF = c(df.SSLF, df.SSPE),
SS = c(SSLF, SSPE),
MS = c(MSLF, MSPE),
F = c(f,NA),
p = c(p, NA)
) # end of cbind
) # end of data.frame
) # end of kable
DF | SS | MS | F | p |
---|---|---|---|---|
4 | 13593.57 | 3398.393 | 14.8014 | 0.0056 |
5 | 1148.00 | 229.600 | NA | NA |
# with R package
library(alr3)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
fit <- lm(Increase_t1 ~ Mass, data = cancer_mass)
summary(fit)
##
## Call:
## lm(formula = Increase_t1 ~ Mass, data = cancer_mass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.23 -34.06 12.61 32.44 48.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.7225 39.3979 1.287 0.23
## Mass 0.4867 0.2747 1.772 0.11
##
## Residual standard error: 40.47 on 9 degrees of freedom
## Multiple R-squared: 0.2586, Adjusted R-squared: 0.1762
## F-statistic: 3.139 on 1 and 9 DF, p-value: 0.1102
anova(fit)
## Analysis of Variance Table
##
## Response: Increase_t1
## Df Sum Sq Mean Sq F value Pr(>F)
## Mass 1 5141.3 5141.3 3.1389 0.1102
## Residuals 9 14741.6 1638.0
pureErrorAnova(fit)
## Analysis of Variance Table
##
## Response: Increase_t1
## Df Sum Sq Mean Sq F value Pr(>F)
## Mass 1 5141.3 5141.3 22.393 0.005186 **
## Residuals 9 14741.6 1638.0
## Lack of fit 4 13593.6 3398.4 14.801 0.005594 **
## Pure Error 5 1148.0 229.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the p-value is less than the significance level of our test, we conclude that there is a lack of fit in the simple linear regression model.
The data in the table below represent a sample of mathematics achievement test scores and calculus grades for ten independently selected college freshmen. From this evidence, would you say that the achievement test scores and calculus grades are independent? Use \(\alpha = 0.05\). Identify the corresponding attained significance level.
Student | Math_Test_Score | Final_Calculus_Grade |
---|---|---|
1 | 39 | 65 |
2 | 43 | 78 |
3 | 21 | 52 |
4 | 64 | 82 |
5 | 57 | 92 |
6 | 47 | 89 |
7 | 28 | 73 |
8 | 75 | 98 |
9 | 34 | 56 |
10 | 52 | 75 |
Solution
\(H_0: \rho = 0\)
\(H_a: \rho \neq 0\)
test statistic: \(t = \frac {r\sqrt{n-2}}{\sqrt{1-r^2}}\)
data <- data.frame(student = 1:10,
x = c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52),
y = c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
)
data2 <- data
# setting parameters
n <- length(data$student)
df <- n-2
a <- 0.05
# calculation of SS
data2$xy <- data2$x * data$y
data2$x2 <- data2$x^2
data2$y2 <- data2$y^2
data2[length(data2$student)+1,] <- colSums(data2)
SS <- data2[11,-1]; SS
## x y xy x2 y2
## 11 460 760 36854 23634 59816
Sxx <- SS[4] - SS[1]^2/n
Syy <- SS[5] - SS[2]^2/n
Sxy <- SS[3] - SS[1]*SS[2]/n
knitr::kable(data.frame(Sxx = as.numeric(Sxx),
Syy = as.numeric(Syy),
Sxy = as.numeric(Sxy)))
Sxx | Syy | Sxy |
---|---|---|
2474 | 2056 | 1894 |
# Calculation of r and t
r <- as.numeric(Sxy/sqrt(Sxx*Syy))
t <- (r * sqrt(n-2))/(sqrt(1 - r^2))
t_crit <- qt(1-(a/2), df)
p <- 2 * pt(t, df, lower.tail = FALSE)
knitr::kable(data.frame(r = round(r,3),
t = round(t,3),
t_crit = round(t_crit,3),
p = round(p,5)),
align='c')
r | t | t_crit | p |
---|---|---|---|
0.84 | 4.375 | 2.306 | 0.00236 |
# base R
cor.test(data$x, data$y)
##
## Pearson's product-moment correlation
##
## data: data$x and data$y
## t = 4.375, df = 8, p-value = 0.002365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4459632 0.9611846
## sample estimates:
## cor
## 0.8397859
# cf) Spearman rank correlation (for more information, see nonparametric test tutorial)
cor.test(data$x, data$y, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: data$x and data$y
## S = 20, p-value = 0.001977
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8787879
We discussed Spearman’s rank correlation in the lecture on nonparametric tests. As you already know, the statistic, abbreviated as \(r_S\), is a nonparametric version of Pearson’s correlation statistic. (The formula is the same as Pearson’s formula except that it uses the ranked values instead of the actual values of \(x_i\) and \(y_i\).) When there are no ties in either the \(x\) observations or the \(y\) observations, the expression for \(r_S\) algebraically reduces to a simpler expression:
\[r_S = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}, \hspace{1cm} where \ d_i =R(x_i) - R(y_i)\] The statistical significance of Spearman’s correlation statistic can be assessed with the same t statistic as the Pearson correlation statistic. Now, let’s consider a simple example: the correlation between IQ and the hours of TV watching per week.
IQTV <- data.frame(IQ = c(86, 97, 99, 100, 101, 103, 106, 110, 112, 113),
TV = c(0, 20, 28, 27, 50, 29, 7, 17, 6, 12))
n <- length(IQTV$IQ)
df <- n-2
IQTV$rank_IQ <- rank(IQTV$IQ)
IQTV$rank_TV <- rank(IQTV$TV)
IQTV$d <- IQTV$rank_IQ - IQTV$rank_TV
IQTV$d2 <- IQTV$d^2
knitr::kable(IQTV)
IQ | TV | rank_IQ | rank_TV | d | d2 |
---|---|---|---|---|---|
86 | 0 | 1 | 1 | 0 | 0 |
97 | 20 | 2 | 6 | -4 | 16 |
99 | 28 | 3 | 8 | -5 | 25 |
100 | 27 | 4 | 7 | -3 | 9 |
101 | 50 | 5 | 10 | -5 | 25 |
103 | 29 | 6 | 9 | -3 | 9 |
106 | 7 | 7 | 3 | 4 | 16 |
110 | 17 | 8 | 5 | 3 | 9 |
112 | 6 | 9 | 2 | 7 | 49 |
113 | 12 | 10 | 4 | 6 | 36 |
rs <- 1-6*sum(IQTV$d2)/(n*(n^2-1))
t <- (rs * sqrt(n-2))/(sqrt(1 - rs^2))
t_crit <- -qt(1-(a/2),df)
p <- 2*pt(t, df)
knitr::kable(data.frame(rs = round(rs,3),
t = round(t,3),
t_crit = round(t_crit,3),
p = round(p,5)),
align='c')
rs | t | t_crit | p |
---|---|---|---|
-0.176 | -0.505 | -2.306 | 0.62719 |
# base R
cor.test(IQTV$IQ, IQTV$TV, method = "spearman", conf.level = 0.95)
##
## Spearman's rank correlation rho
##
## data: IQTV$IQ and IQTV$TV
## S = 194, p-value = 0.632
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1757576
The test result implies that there is no significant correlation between IQ and hours of TV watching per week; the negative sign suggests at most a weak negative association between the two.
A bioengineer measures the growth rate of a substance by counting the number of cells \(N\) present at various times \(t\) as shown below. Fit the model \(N=\gamma_0e^{\gamma_1t}\) and calculate a two-sided 95% confidence interval for the unknown parameter \(\gamma_1\).
time | Num_cell |
---|---|
1 | 12 |
2 | 31 |
3 | 42 |
4 | 75 |
5 | 119 |
6 | 221 |
7 | 327 |
8 | 546 |
Solution: The model \(N=\gamma_0e^{\gamma_1t}\) with two unknown parameters \(\gamma_0\) and \(\gamma_1\) is commonly used for this problem. With a logarithmic transformation, the linear relationship \[y = \beta_0 + \beta_1 x\] is obtained with \(y = \ln(N)\), \(\beta_0 = \ln(\gamma_0)\), \(\beta_1 = \gamma_1\), \(x = t\).
A straight line is fitted to the transformed data points \((x_i, y_i)\) and the fitted regression line is \[y = 0.523x + 2.176\] The equation can be transformed back into a model in terms of the original variables \[N = e^{\ln(\gamma_0) + \gamma_1t}=e^{2.176 + 0.523t}\] \[\beta_1 = \gamma_1 \in (0.472, 0.573)\]
cell_growth <- data.frame(time = 1:8, Num_cell = c(12, 31, 42, 75, 119, 221, 327, 546))
# transform of variables
cell_growth$lnt <- log(cell_growth$time)
cell_growth$lnN <- log(cell_growth$Num_cell)
# fit linear model directly
model1 <- lm(Num_cell ~ time, data = cell_growth)
summary(model1)
##
## Call:
## lm(formula = Num_cell ~ time, data = cell_growth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.14 -56.16 -21.64 44.47 132.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -139.04 64.20 -2.166 0.07350 .
## time 69.04 12.71 5.430 0.00162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.39 on 6 degrees of freedom
## Multiple R-squared: 0.8309, Adjusted R-squared: 0.8027
## F-statistic: 29.49 on 1 and 6 DF, p-value: 0.001617
plot(Num_cell ~ time, data = cell_growth, pch = 16,
ylab = "N", xlab = "t")
lines(cell_growth$time, model1$fitted.values)
# fit linear model with transformed variable
model2 <- lm(lnN ~ time, data = cell_growth)
summary(model2)
##
## Call:
## lm(formula = lnN ~ time, data = cell_growth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.21446 -0.05110 -0.01050 0.05717 0.21144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17619 0.10357 21.01 7.57e-07 ***
## time 0.52318 0.02051 25.51 2.39e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1329 on 6 degrees of freedom
## Multiple R-squared: 0.9909, Adjusted R-squared: 0.9893
## F-statistic: 650.7 on 1 and 6 DF, p-value: 2.391e-07
plot(lnN ~ time, data = cell_growth, pch = 16,
ylab = "ln(N)", xlab = "t")
lines(cell_growth$time, model2$fitted.values)
slr <- function(x) exp(2.176 + 0.523*x)
plot(cell_growth$time, cell_growth$Num_cell, pch = 16,
ylab = "N", xlab = "t")
curve(slr, add = TRUE)
# confidence interval
confint(model2, "time", level=0.95)
## 2.5 % 97.5 %
## time 0.472994 0.5733623
For multiple linear regression, we define the following matrices, with \(x_0 = 1\):
\[Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} x_0 & x_{11} & x_{12} & ... & x_{1k} \\ x_0 & x_{21} & x_{22} & ... & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ x_0 & x_{n1} & x_{n2} & ... & x_{nk} \end{bmatrix}, \hspace{1cm} \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_k \end{bmatrix}, \hspace{1cm} \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots\\ \epsilon_n \end{bmatrix}\]
Thus, the n equations representing \(y_i\) as a function of the x’s, \(\beta\)’s, and \(\epsilon\)’s can be simultaneously written as
\[Y = X\beta + \epsilon\] As in simple linear regression, it can be shown that the least-squares equations are given by \[(X'X)\hat\beta = X'Y\]
Hence, \[\hat\beta = (X'X)^{-1}X'Y\] The \(n \times 1\) vector of fitted values is \[ \hat Y = \begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{pmatrix} = X\hat\beta\] and the \(n \times 1\) vector of residuals is \[ e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = Y - \hat Y = \begin{pmatrix} y_1 - \hat y_1 \\ y_2 - \hat y_2 \\ \vdots \\ y_n - \hat y_n \end{pmatrix}\] The sum of squares for error can then be calculated as \[SSE = e^{'}e = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\] and \[\hat\sigma^2 = MSE = \frac{SSE}{n-k-1}\] where \(k\) indicates number of input variables.
x | y |
---|---|
-2 | 0 |
-1 | 0 |
0 | 1 |
1 | 1 |
2 | 3 |
Fit a parabola to the data above, using the model \[Y = \beta_0 + \beta_1x + \beta_2x^2 +\epsilon\]
Solution: The \(X\) matrix for this example differs from that of the simple linear regression case only by the addition of a third column corresponding to \(x^2\). (Notice that \(x_1 = x, x_2 = x^2\), and \(k = 2\) in the notation of the general linear model.) Thus,
\[Y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix}, \hspace{1cm} X = \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & \phantom{-}0 & 0 \\ 1 & \phantom{-}1 & 1 \\ 1 & \phantom{-}2 & 4 \end{bmatrix}\]
Thus, for the first measurement, \(y = 0, x_0 = 1, x = -2\), and \(x^2 = 4\) and for the second measurement, \(y = 0, x_0 = 1, x = -1\), and \(x^2 = 1\). Succeeding rows of the \(Y\) and \(X\) matrices are obtained in a similar manner. The matrix products \(X'X\) and \(X'Y\) are
\[X'X =\begin{bmatrix} \phantom{-}1 & \phantom{-}1 & 1 & 1 & 1 \\ -2 & -1 & 0 & 1 &2 \\ \phantom{-}4 & \phantom{-}1 & 0 & 1 & 4\end{bmatrix} \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & \phantom{-} 0 & 0\\ 1 & \phantom{-} 1 & 1\\ 1 & \phantom{-} 2 & 4\\ \end{bmatrix} =\begin{bmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{bmatrix}\]
\[X'Y =\begin{bmatrix} \phantom{-}1 & \phantom{-}1 & 1 & 1 & 1 \\ -2 & -1 & 0 & 1 &2 \\ \phantom{-}4 & \phantom{-}1 & 0 & 1 & 4\end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 3 \end{bmatrix} =\begin{bmatrix} 5 \\ 7 \\ 13 \end{bmatrix}\]
By inverting \(X'X\), we get, \[(X'X)^{-1} =\begin{bmatrix} \phantom{-} 17/35 & 0 & -1/7 \\ \phantom{-} 0 & 1/10 & 0 \\ -1/7 & 0 & \phantom{-} 1/14\end{bmatrix}\]
Finally, \[\begin{aligned} \hat\beta &= (X'X)^{-1} X'Y \\ &=\begin{bmatrix} \phantom{-} 17/35 & 0 & -1/7 \\ \phantom{-} 0 & 1/10 & 0 \\ -1/7 & 0 & \phantom{-} 1/14\end{bmatrix} \begin{bmatrix} 5 \\ 7 \\ 13 \end{bmatrix} =\begin{bmatrix} 4/7 \\ 7/10 \\ 3/14 \end{bmatrix} \approx \begin{bmatrix} 0.571 \\ 0.700 \\ 0.214 \end{bmatrix}\end{aligned}\]
Hence, \(\hat\beta_0 = 0.571, \ \hat\beta_1 = 0.7\), and \(\hat\beta_2 = 0.214\). and the prediction equation is \[\hat y = 0.571 + 0.7x + 0.214 x^2\]
toy <- data.frame(x = c(-2,-1,0,1,2), y = c(0,0,1,1,3))
toy$x2 <- toy$x^2
X <- as.matrix(data.frame(x0 = rep(1,5), x = toy$x, x2 = toy$x2))
Y <- as.matrix(toy$y)
n <- length(Y)
k <- dim(X)[2]-1
# B_hat matrix
B_hat <- solve(t(X) %*% X) %*% t(X) %*% Y; B_hat
## [,1]
## x0 0.5714286
## x 0.7000000
## x2 0.2142857
Y_hat <- X %*% B_hat # prediction
e <- Y-Y_hat # residual
SSE <- t(e) %*% e # sum of square error
S_hat <- sqrt(SSE/(n-k-1)) # residual standard error (square root of MSE)
# base R
fit <- lm(y ~ x+x2, data = toy)
summary(fit)
##
## Call:
## lm(formula = y ~ x + x2, data = toy)
##
## Residuals:
## 1 2 3 4 5
## -0.02857 -0.08571 0.42857 -0.48571 0.17143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5714 0.3332 1.715 0.2285
## x 0.7000 0.1512 4.630 0.0436 *
## x2 0.2143 0.1278 1.677 0.2355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4781 on 2 degrees of freedom
## Multiple R-squared: 0.9238, Adjusted R-squared: 0.8476
## F-statistic: 12.13 on 2 and 2 DF, p-value: 0.07619
For more information on the equations for inference on the parameters of the MLR model, see Wackerly, pp. 615-634.
The table below is a data set concerning the aerobic fitness of a sample of 20 male subjects. An exercising individual breathes through an apparatus that measures the amount of oxygen in the inhaled air that is used by the individual. The maximum value per unit time of the utilized oxygen is then scaled by the person’s body weight to come up with a variable \(VO_2\)-max, which is a general indication of the aerobic fitness of the individual. In other words, \(VO_2\)-max quantifies the maximum rate of oxygen consumption measured during incremental exercise. Fit a linear regression model with \(VO_2\)-max as the dependent variable and age as the explanatory variable, and answer the questions below.
VO2.max | Age |
---|---|
23 | 59 |
45 | 47 |
29 | 44 |
55 | 32 |
48 | 45 |
42 | 61 |
32 | 71 |
33 | 32 |
34 | 28 |
52 | 36 |
40 | 36 |
35 | 51 |
45 | 31 |
47 | 44 |
26 | 73 |
42 | 47 |
35 | 40 |
41 | 43 |
29 | 68 |
38 | 40 |
vo2 <- read.csv("http://bisyn.kaist.ac.kr/bis335/10-Regression-vo2max-aerobic-fitness.csv")
# prep
library(knitr)        # kable(), used below
library(kableExtra)   # kable_styling(), used below
colnames(vo2) <- c("y", "x")
data <- vo2 %>% select(x, y) %>% mutate(xy = x*y, x2 = x^2, y2 = y^2)
# SS
n <- length(data$x)
SS <- colSums(data)
Sxx <- SS[4] - SS[1]^2/n
Syy <- SS[5] - SS[2]^2/n
Sxy <- SS[3] - SS[1]*SS[2]/n
kable(data.frame(Sxx = as.numeric(Sxx), Syy = as.numeric(Syy), Sxy = as.numeric(Sxy)),
escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
kable_styling(position = "center")
Sxx | Syy | Sxy |
---|---|---|
3486.8 | 1428.95 | -1177.4 |
# Slope & Intercept
b1 <- Sxy/Sxx
b0 <- mean(data$y) - b1*mean(data$x)
kable(data.frame(intercept = as.numeric(round(b0,4)),
slope = as.numeric(round(b1,4))),
escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
kable_styling(position = "center")
intercept | slope |
---|---|
54.2181 | -0.3377 |
df <- n - 2
SSE <- as.numeric(Syy - b1*Sxy)
S2 <- SSE/df # error variance
# R^2 & r
SSR <- b1*Sxy
SST <- Syy
r2 <- SSR/SST
r <- Sxy/sqrt(Sxx*Syy)
# pvalue
t <- (r * sqrt(n-2))/(sqrt(1 - r^2))
t_crit <- qt(1-(a/2), df)
p <- 2 * pt(abs(t), df, lower.tail = FALSE)  # use |t| so the two-sided p-value is valid for negative t
library(kableExtra)
kable(data.frame(S2 = as.numeric(round(S2,4)),
r2 = as.numeric(round(r2,4)),
r = as.numeric(round(r,4)),
sqrt_r2 = as.numeric(round(-sqrt(r2),4)),
p = as.numeric(round(p,4))),
col.names = c("$S^2$", "$r^2$", "r", "$-\\sqrt {r^2}$", "p"),
align = 'c', escape = FALSE, format = 'html', booktabs = TRUE) %>%
kable_styling(position = "center")
\(S^2\) | \(r^2\) | r | \(-\sqrt {r^2}\) | p |
---|---|---|---|---|
57.2985 | 0.2782 | -0.5275 | -0.5275 | 0.0168 |
a <- 0.05
t_crit <- qt(1 - a, df)
S <- sqrt(S2)
c11 <- 1/Sxx
upper <- b1 + (t_crit*S*sqrt(c11)); upper
## xy
## -0.1153818
kable(data.frame(lower = "-$\\infty$",
upper = as.numeric(round(upper,4))),
escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
kable_styling(position = "center")
lower | upper |
---|---|
-\(\infty\) | -0.1154 |
# Confidence interval
t_crit <- qt(1 - (a/2), df)
x <- 50
y_hat <- b0+b1*x
K <- (1/n)+((x - mean(vo2$x))^2/Sxx)
lower <- y_hat - (t_crit*S*sqrt(K))
upper <- y_hat + (t_crit*S*sqrt(K))
kable(data.frame(y_hat = as.numeric(round(y_hat, 4)),
lower = as.numeric(round(lower, 4)),
upper = as.numeric(round(upper,4))),
col.names = c("$\\hat y$", "lower", "upper"),
escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
kable_styling(position = "center")
\(\hat y\) | lower | upper |
---|---|---|
37.3344 | 33.6485 | 41.0202 |
# Prediction interval
lower <- y_hat - (t_crit*S*sqrt(1+K))
upper <- y_hat + (t_crit*S*sqrt(1+K))
kable(data.frame(y_hat = as.numeric(round(y_hat, 4)),
lower = as.numeric(round(lower, 4)),
upper = as.numeric(round(upper,4))),
col.names = c("$\\hat y$", "lower", "upper"),
escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
kable_styling(position = "center")
\(\hat y\) | lower | upper |
---|---|---|
37.3344 | 21.0097 | 53.659 |
vo2 <- read.csv("http://bisyn.kaist.ac.kr/bis335/10-Regression-vo2max-aerobic-fitness.csv")
# 1. lm
fit <- lm(VO2.max ~ Age, data = vo2)
fit$coefficients # b0, b1
## (Intercept) Age
## 54.2180509 -0.3376735
# 2. r^2, r
## r^2
r2 <- summary(fit)
kable(data.frame(r2 = as.numeric(round(r2$r.squared, 4)),
f = as.numeric(round(r2$f[1], 4)),
p = as.numeric(round(r2$coefficients[2,4],4)),
sigma = as.numeric(round(r2$sigma,4))),
col.names = c("$r^2$", "F", "p", "$\\hat\\sigma$"),
escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
kable_styling(position = "center")
\(r^2\) | F | p | \(\hat\sigma\) |
---|---|---|---|
0.2782 | 6.9387 | 0.0168 | 7.5696 |
## r, t, p
r <- cor.test(vo2$VO2.max, vo2$Age)
kable(data.frame(r = as.numeric(round(r$estimate, 4)),
t = as.numeric(round(r$statistic, 4)),
p = as.numeric(round(r$p.value,4))),
col.names = c("$r$", "t", "p"),
escape = FALSE, align = 'c', format = 'html', booktabs = TRUE) %>%
kable_styling(position = "center")
\(r\) | t | p |
---|---|---|
-0.5275 | -2.6341 | 0.0168 |
# 3. conf.int for b1 (one-sided, upper)
confint(fit, "Age", level = 1-2*a, "one-sided")[2]
## [1] -0.1153818
# 4. conf.int and pred.int for y_hat
newdata <- data.frame(Age = 50)
predict(fit, newdata, level=1-a, interval = "confidence")[1,]
## fit lwr upr
## 37.33438 33.64853 41.02022
predict(fit, newdata, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 37.33438 21.00974 53.65901
# 5. visualize
library(ggplot2)
qplot(vo2$Age, vo2$VO2.max, ylab = "VO2 Max",
xlab = "Age", main = "VO2 max vs Age") + stat_smooth(method = "lm")
If the model is used for ages outside the observed range, it would be extrapolation, so the prediction may be inaccurate.
The following table summarizes the initial (\(S_0\)) and increased masses of a tumor (\(S_1\)) over a given time. From the scatter plot of the data set, you may notice that the relationship between \(S_0\) and \(S_1\) seems nonlinear but intrinsically linear. Find the best linear fit of these data by using a variable transformation and answer the questions below. You can also test the variable \(\Delta S = S_1 - S_0\) to find the best fit between the initial (possibly transformed) mass and the difference or increased mass.
\(S_0\) | \(S_1\) |
---|---|
55 | 73 |
55 | 87 |
80 | 182 |
80 | 206 |
105 | 255 |
105 | 245 |
130 | 272 |
155 | 301 |
155 | 269 |
180 | 294 |
180 | 274 |
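The chunks below assume a data frame named tumor with columns s0 and s1; a minimal construction from the table above:
tumor <- data.frame(
  s0 = c(55, 55, 80, 80, 105, 105, 130, 155, 155, 180, 180),
  s1 = c(73, 87, 182, 206, 255, 245, 272, 301, 269, 294, 274)
)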
# prep
tumor$sd <- tumor$s1 - tumor$s0
# reordering
fit1 <- lm(sd ~ s0, data = tumor)
fit2 <- lm(s1 ~ s0, data = tumor)
# r^2
summary(fit1)$r.squared
## [1] 0.2585808
summary(fit2)$r.squared
## [1] 0.7649424
# prep
tumor$s0_t <- log(tumor$s0) # ln(s_0)
# reordering
fit3 <- lm(sd ~ s0_t, data = tumor)
fit4 <- lm(s1 ~ s0_t, data = tumor)
# r^2
summary(fit3)$r.squared
## [1] 0.4075046
summary(fit4)$r.squared
## [1] 0.8827907
# Extract the statistic
t <- data.frame(fit1 = summary(fit1)$coefficients[2,3],
fit2 = summary(fit2)$coefficients[2,3],
fit3 = summary(fit3)$coefficients[2,3],
fit4 = summary(fit4)$coefficients[2,3])
# Setting parameters
a1 <- 0.05
a2 <- 0.01
df <- length(tumor$s0)-2
# Calculate critical t (two-sided)
t_crit1 <- qt(1-(a1/2), df)
t_crit2 <- qt(1-(a2/2), df)
# result
t > t_crit1
## fit1 fit2 fit3 fit4
## [1,] FALSE TRUE TRUE TRUE
t > t_crit2
## fit1 fit2 fit3 fit4
## [1,] FALSE TRUE FALSE TRUE
# cf) R^2 cutoff
r2_crit1 <- t_crit1^2 / (t_crit1^2+df)
r2_crit2 <- t_crit2^2 / (t_crit2^2+df)
r2_crit <- data.frame(r2_crit1, r2_crit2); r2_crit
## r2_crit1 r2_crit2
## 1 0.3624868 0.539911
Notice that for testing significance, \(\hat\beta_1\) and r use the same t statistic. (see the theory part)
Recall that \[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{1}{1+\frac{SSE}{SSR}}\]
library(alr3)
f_lof <- data.frame(fit1 = pureErrorAnova(fit1)[3,4],
fit2 = pureErrorAnova(fit2)[3,4],
fit3 = pureErrorAnova(fit3)[3,4],
fit4 = pureErrorAnova(fit4)[3,4])
# setting parameters
df1 <- pureErrorAnova(fit1)[3,1]
df2 <- pureErrorAnova(fit1)[4,1]
a1 <- 0.05
a2 <- 0.01
# critical F statistic
f_crit1 <- qf(1-a1, df1, df2)
f_crit2 <- qf(1-a2, df1, df2)
# result
f_lof > f_crit1
## fit1 fit2 fit3 fit4
## [1,] TRUE TRUE TRUE TRUE
f_lof > f_crit2
## fit1 fit2 fit3 fit4
## [1,] TRUE TRUE TRUE FALSE
# prep
S <- data.frame(fit1 = summary(fit1)$sigma,
fit2 = summary(fit2)$sigma,
fit3 = summary(fit3)$sigma,
fit4 = summary(fit4)$sigma)
a <- 0.05
# for x0 = xbar
newdata.fit12 <- data.frame(s0 = mean(tumor$s0))
newdata.fit34 <- data.frame(s0_t = mean(tumor$s0_t))
# prediction interval for x0 = xbar
predict(fit1, newdata.fit12, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 107.09091 11.46674 202.71508
predict(fit2, newdata.fit12, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 223.4545 127.8304 319.0787
predict(fit3, newdata.fit34, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 107.09091 21.60821 192.57361
predict(fit4, newdata.fit34, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 223.4545 155.9300 290.9791
# for x0 = xbar + 2S
newdata.fit1 <- data.frame(s0 = mean(tumor$s0) + 2 * as.numeric(summary(fit1)$sigma))
newdata.fit2 <- data.frame(s0 = mean(tumor$s0) + 2 * as.numeric(summary(fit2)$sigma))
newdata.fit3 <- data.frame(s0_t = mean(tumor$s0_t) + 2 * as.numeric(summary(fit3)$sigma))
newdata.fit4 <- data.frame(s0_t = mean(tumor$s0_t) + 2 * as.numeric(summary(fit4)$sigma))
# prediction interval for x0 = xbar + 2S
predict(fit1, newdata.fit1, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 146.48611 38.43892 254.53331
predict(fit2, newdata.fit2, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 343.7930 235.7458 451.8402
predict(fit3, newdata.fit3, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 4820.8822 534.0701 9107.6942
predict(fit4, newdata.fit4, level=1-a, interval = "prediction")[1,]
## fit lwr upr
## 9956.777 7281.596 12631.958
# transformed space
plot(tumor$s0_t, tumor$s1, pch = 16, xlab = "S0_t", ylab = "S1")
lines(tumor$s0_t, fit4$fitted.values)
# original space
slrv <- function(x){fit4$coefficients[2]*log(x)+fit4$coefficients[1]}
plot(tumor$s0, tumor$s1, pch = 16, xlab = "S0", ylab = "S1")
curve(slrv, add = TRUE)
##Simple Linear Regression(1)
data("faithful")
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
summary(faithful)
## eruptions waiting
## Min. :1.600 Min. :43.0
## 1st Qu.:2.163 1st Qu.:58.0
## Median :4.000 Median :76.0
## Mean :3.488 Mean :70.9
## 3rd Qu.:4.454 3rd Qu.:82.0
## Max. :5.100 Max. :96.0
with(faithful, plot(eruptions,waiting)) #Scatterplot of the data
with(faithful, cor(eruptions,waiting)) #correlation
## [1] 0.9008112
##Simple Linear Regression(2)
lm(waiting ~ eruptions, data=faithful)
##
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
##
## Coefficients:
## (Intercept) eruptions
## 33.47 10.73
##Simple Linear Regression(3)
slr1 <- lm(waiting ~ eruptions, data = faithful)
summary(slr1)
##
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0796 -4.4831 0.2122 3.9246 15.9719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.4744 1.1549 28.98 <2e-16 ***
## eruptions 10.7296 0.3148 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
summary(slr1)$r.squared
## [1] 0.8114608
##SLR - Confidence Interval and Prediction Interval
newdata = data.frame(eruptions=4)
predict(slr1, newdata, interval="confidence")
## fit lwr upr
## 1 76.39296 75.6189 77.16702
predict(slr1, newdata, interval="predict")
## fit lwr upr
## 1 76.39296 64.72382 88.0621
##Graph of Simple Linear Regression
plot(faithful$waiting ~ faithful$eruptions)
abline(slr1, col="red")
par(mfrow=c(2,2))
plot(slr1)
##Multiple Linear Regression(1)
data("stackloss")
head(stackloss)
## Air.Flow Water.Temp Acid.Conc. stack.loss
## 1 80 27 89 42
## 2 80 27 88 37
## 3 75 25 90 37
## 4 62 24 87 28
## 5 62 22 87 18
## 6 62 23 87 18
summary(stackloss)
## Air.Flow Water.Temp Acid.Conc. stack.loss
## Min. :50.00 Min. :17.0 Min. :72.00 Min. : 7.00
## 1st Qu.:56.00 1st Qu.:18.0 1st Qu.:82.00 1st Qu.:11.00
## Median :58.00 Median :20.0 Median :87.00 Median :15.00
## Mean :60.43 Mean :21.1 Mean :86.29 Mean :17.52
## 3rd Qu.:62.00 3rd Qu.:24.0 3rd Qu.:89.00 3rd Qu.:19.00
## Max. :80.00 Max. :27.0 Max. :93.00 Max. :42.00
cor(stackloss)
## Air.Flow Water.Temp Acid.Conc. stack.loss
## Air.Flow 1.0000000 0.7818523 0.5001429 0.9196635
## Water.Temp 0.7818523 1.0000000 0.3909395 0.8755044
## Acid.Conc. 0.5001429 0.3909395 1.0000000 0.3998296
## stack.loss 0.9196635 0.8755044 0.3998296 1.0000000
##Multiple Linear Regression(2)
mlr = lm(stack.loss~Air.Flow+Water.Temp+Acid.Conc., data=stackloss)
mlr
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = stackloss)
##
## Coefficients:
## (Intercept) Air.Flow Water.Temp Acid.Conc.
## -39.9197 0.7156 1.2953 -0.1521
summary(mlr)
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
##Multiple Linear Regression(3)
pairs(stackloss) #Scatter plot of each variables
par(mfrow=c(2,2))
plot(mlr)
##MLR - Confidence Interval and Prediction Interval
attach(stackloss)
## The following object is masked _by_ .GlobalEnv:
##
## stack.loss
## The following object is masked from package:datasets:
##
## stack.loss
mlr2 <- lm(stack.loss~Air.Flow+Water.Temp+Acid.Conc.)
newdata = data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)
predict(mlr2, newdata, interval="confidence")
## fit lwr upr
## 1 24.58173 20.21846 28.945
predict(mlr2, newdata, interval="prediction")
## fit lwr upr
## 1 24.58173 16.4661 32.69736
##Logistic Regression(1)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
## The following object is masked from 'package:dplyr':
##
## select
data(menarche)
menarche
## Age Total Menarche
## 1 9.21 376 0
## 2 10.21 200 0
## 3 10.58 93 0
## 4 10.83 120 2
## 5 11.08 90 2
## 6 11.33 88 5
## 7 11.58 105 10
## 8 11.83 111 17
## 9 12.08 100 16
## 10 12.33 93 29
## 11 12.58 100 39
## 12 12.83 108 51
## 13 13.08 99 47
## 14 13.33 106 67
## 15 13.58 105 81
## 16 13.83 117 88
## 17 14.08 98 79
## 18 14.33 97 90
## 19 14.58 120 113
## 20 14.83 102 95
## 21 15.08 122 117
## 22 15.33 111 107
## 23 15.58 94 92
## 24 15.83 114 112
## 25 17.58 1049 1049
str(menarche)
## 'data.frame': 25 obs. of 3 variables:
## $ Age : num 9.21 10.21 10.58 10.83 11.08 ...
## $ Total : num 376 200 93 120 90 88 105 111 100 93 ...
## $ Menarche: num 0 0 0 2 2 5 10 17 16 29 ...
summary(menarche)
## Age Total Menarche
## Min. : 9.21 Min. : 88.0 Min. : 0.00
## 1st Qu.:11.58 1st Qu.: 98.0 1st Qu.: 10.00
## Median :13.08 Median : 105.0 Median : 51.00
## Mean :13.10 Mean : 156.7 Mean : 92.32
## 3rd Qu.:14.58 3rd Qu.: 117.0 3rd Qu.: 92.00
## Max. :17.58 Max. :1049.0 Max. :1049.00
plot(Menarche/Total~Age, data=menarche)
##Logistic Regression(2)
glm1 <- glm(cbind(Menarche, Total-Menarche)~Age, family=binomial(logit), data=menarche)
glm1
##
## Call: glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit),
## data = menarche)
##
## Coefficients:
## (Intercept) Age
## -21.226 1.632
##
## Degrees of Freedom: 24 Total (i.e. Null); 23 Residual
## Null Deviance: 3694
## Residual Deviance: 26.7 AIC: 114.8
plot(Menarche/Total~Age, data = menarche)
lines(menarche$Age, glm1$fitted.values, type='l', col="red")