The primary goal of an experimental design is to compare different treatments. In a completely randomized design (CRD), experimental units drawn from a single homogeneous group are assigned to treatments at random. (Experimental units are the individuals to which treatments are applied.) If all treatment groups have the same number of experimental units, we call the design balanced.
Suppose that we are trying to find the effect of an enzyme on cell yield using four concentration levels of the enzyme. For each design below, make a \(4 \times 4\) table with proper identifying names for the columns, rows, and cells.
You cultured a cell type in 16 culture dishes for your experiment. Now you are designing a perfectly balanced completely randomized design with four replicates. To get the design, first assign random numbers to the 16 dishes, then draw four dishes for each of the four treatment levels.
# Design
set.seed(123)
dish <- 1:16
rn <- sample(dish, 16, replace = FALSE) # random permutation
rep <- cbind(rn[1:4], rn[5:8], rn[9:12], rn[13:16])
colnames(rep) <- c("L1", "L2", "L3", "L4") # levels of the enzyme
rownames(rep) <- c("R1", "R2", "R3", "R4") # replications
knitr::kable(rep)
|    | L1 | L2 | L3 | L4 |
|----|----|----|----|----|
| R1 | 5  | 13 | 16 | 8  |
| R2 | 12 | 1  | 4  | 2  |
| R3 | 6  | 14 | 10 | 11 |
| R4 | 15 | 9  | 3  | 7  |
Now we are going to analyze this design with synthetic data that depends only on the treatment effect. As expected, cell yield differs according to the level of the enzyme but not significantly according to the replication.
# Data
rn <- data.frame(rn, c(rep("L1", 4), rep("L2", 4), rep("L3", 4), rep("L4", 4)),
rep(c("R1","R2","R3","R4"), 4))
colnames(rn)<- c("Dish_Number", "Levels", "Replications")
yield <- c(runif(4, min = 150, max = 250),
runif(4, min = 100, max = 200),
runif(4, min = 200, max = 300),
runif(4, min = 145, max = 155))
data <- data.frame(rn, yield)
colnames(data)<- c("Dish_Number", "Treatments", "Replicate", "Response"); data
## Dish_Number Treatments Replicate Response
## 1 5 L1 R1 174.6088
## 2 12 L1 R2 154.2060
## 3 6 L1 R3 182.7921
## 4 15 L1 R4 245.4504
## 5 13 L2 R1 188.9539
## 6 1 L2 R2 169.2803
## 7 14 L2 R3 164.0507
## 8 9 L2 R4 199.4270
## 9 16 L3 R1 265.5706
## 10 4 L3 R2 270.8530
## 11 10 L3 R3 254.4066
## 12 3 L3 R4 259.4142
## 13 8 L4 R1 147.8916
## 14 2 L4 R2 146.4711
## 15 11 L4 R3 154.6302
## 16 7 L4 R4 154.0230
str(data)
## 'data.frame': 16 obs. of 4 variables:
## $ Dish_Number: int 5 12 6 15 13 1 14 9 16 4 ...
## $ Treatments : Factor w/ 4 levels "L1","L2","L3",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Replicate : Factor w/ 4 levels "R1","R2","R3",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Response : num 175 154 183 245 189 ...
# ANOVA Analysis
fit <- aov(Response ~ Treatments + Replicate, data = data)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatments 3 27061 9020 22.430 0.000164 ***
## Replicate 3 2056 685 1.704 0.235212
## Residuals 9 3619 402
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Response ~ Treatments + Replicate, data = data)
##
## $Treatments
## diff lwr upr p adj
## L2-L1 -8.836308 -53.10384 35.431219 0.9220641
## L3-L1 73.296817 29.02929 117.564344 0.0026915
## L4-L1 -38.510299 -82.77783 5.757228 0.0919670
## L3-L2 82.133125 37.86560 126.400652 0.0012181
## L4-L2 -29.673991 -73.94152 14.593536 0.2262619
## L4-L3 -111.807116 -156.07464 -67.539589 0.0001189
##
## $Replicate
## diff lwr upr p adj
## R2-R1 -9.053601 -53.32113 35.21393 0.9169163
## R3-R1 -5.286321 -49.55385 38.98121 0.9812200
## R4-R1 20.322413 -23.94511 64.58994 0.5114975
## R3-R2 3.767280 -40.50025 48.03481 0.9929717
## R4-R2 29.376014 -14.89151 73.64354 0.2328807
## R4-R3 25.608734 -18.65879 69.87626 0.3307793
boxplot(Response ~ Treatments, data = data,
main = "yield of cells according to the different levels of enzyme")
# cf
anova(lm(Response ~ Treatments + Replicate, data = data))
## Analysis of Variance Table
##
## Response: Response
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatments 3 27060.8 9020.3 22.430 0.0001636 ***
## Replicate 3 2055.9 685.3 1.704 0.2352124
## Residuals 9 3619.4 402.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A blocking variable is a categorical variable that is not the primary variable of interest; observations within each level of the blocking variable ought to be homogeneous apart from the treatment. In a randomized complete block design, treatments are assigned at random to the units within each block, and every treatment is represented in every block (that is what makes the blocks complete).
In a CRD, we have to assume that the cells in each dish are equivalent. In reality, even when the experimenter tries to minimize differences, dishes can differ due to the culture conditions (for example, a temperature gradient in the incubator). A powerful alternative to the CRD is to restrict the randomization process to form blocks; within a block, we want all the treatment levels to be represented.
To assign treatments to experimental units, we work with each block separately. The key element is that every treatment level appears in each block (forming complete blocks) and is assigned at random within each block. Because the cell type is a source of variation in this problem, we set the cell type as the blocking variable and perform randomization within each block.
# Design
set.seed(123)
dish <- 1:16
rn <- sample(dish, 16, replace = FALSE)
rn <- data.frame(rn, c(rep("L1", 4), rep("L2", 4), rep("L3", 4), rep("L4", 4)),
rep(c("C1","C2","C3","C4"), 4))
colnames(rn)<- c("Dish_Number", "Enzyme_Level", "Cell_Type"); rn
## Dish_Number Enzyme_Level Cell_Type
## 1 5 L1 C1
## 2 12 L1 C2
## 3 6 L1 C3
## 4 15 L1 C4
## 5 13 L2 C1
## 6 1 L2 C2
## 7 14 L2 C3
## 8 9 L2 C4
## 9 16 L3 C1
## 10 4 L3 C2
## 11 10 L3 C3
## 12 3 L3 C4
## 13 8 L4 C1
## 14 2 L4 C2
## 15 11 L4 C3
## 16 7 L4 C4
rep <- data.frame(rn[rn$Cell_Type == "C1", "Dish_Number"], # Block 1
rn[rn$Cell_Type == "C2", "Dish_Number"], # Block 2
rn[rn$Cell_Type == "C3", "Dish_Number"], # Block 3
rn[rn$Cell_Type == "C4", "Dish_Number"]) # Block 4
colnames(rep) <- c("L1", "L2", "L3", "L4") # levels of the enzyme
rownames(rep) <- c("B1", "B2", "B3", "B4") # blocks
knitr::kable(rep)
|    | L1 | L2 | L3 | L4 |
|----|----|----|----|----|
| B1 | 5  | 12 | 6  | 15 |
| B2 | 13 | 1  | 14 | 9  |
| B3 | 16 | 4  | 10 | 3  |
| B4 | 8  | 2  | 11 | 7  |
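Note that the table above fixes which dish lands in each block-by-level cell. To make the within-block randomization explicit, you can shuffle the treatment labels separately for each block; a minimal sketch, reusing the block and treatment labels of this example:
# Shuffle the treatment labels independently for each block
set.seed(123)
trt <- c("L1", "L2", "L3", "L4")
blocks <- paste0("B", 1:4)
sapply(blocks, function(b) sample(trt)) # one random treatment order per block (columns)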
The analysis procedure for the RCBD is the same as for the CRD, but here we account for the effect of the blocking variable.
# Data
yield <- c(runif(4, min = 150, max = 250),
runif(4, min = 100, max = 200),
runif(4, min = 200, max = 300),
runif(4, min = 145, max = 155))
data <- data.frame(rn, yield)
colnames(data)<- c("Dish_Number", "Treatments", "Block", "Response"); data
## Dish_Number Treatments Block Response
## 1 5 L1 C1 174.6088
## 2 12 L1 C2 154.2060
## 3 6 L1 C3 182.7921
## 4 15 L1 C4 245.4504
## 5 13 L2 C1 188.9539
## 6 1 L2 C2 169.2803
## 7 14 L2 C3 164.0507
## 8 9 L2 C4 199.4270
## 9 16 L3 C1 265.5706
## 10 4 L3 C2 270.8530
## 11 10 L3 C3 254.4066
## 12 3 L3 C4 259.4142
## 13 8 L4 C1 147.8916
## 14 2 L4 C2 146.4711
## 15 11 L4 C3 154.6302
## 16 7 L4 C4 154.0230
str(data)
## 'data.frame': 16 obs. of 4 variables:
## $ Dish_Number: int 5 12 6 15 13 1 14 9 16 4 ...
## $ Treatments : Factor w/ 4 levels "L1","L2","L3",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Block : Factor w/ 4 levels "C1","C2","C3",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Response : num 175 154 183 245 189 ...
boxplot(Response ~ Treatments, data = data,
main = "yield of cells according to the different levels of enzyme")
# ANOVA Analysis
fit <- aov(Response ~ Block + Treatments, data = data)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Block 3 2056 685 1.704 0.235212
## Treatments 3 27061 9020 22.430 0.000164 ***
## Residuals 9 3619 402
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Response ~ Block + Treatments, data = data)
##
## $Block
## diff lwr upr p adj
## C2-C1 -9.053601 -53.32113 35.21393 0.9169163
## C3-C1 -5.286321 -49.55385 38.98121 0.9812200
## C4-C1 20.322413 -23.94511 64.58994 0.5114975
## C3-C2 3.767280 -40.50025 48.03481 0.9929717
## C4-C2 29.376014 -14.89151 73.64354 0.2328807
## C4-C3 25.608734 -18.65879 69.87626 0.3307793
##
## $Treatments
## diff lwr upr p adj
## L2-L1 -8.836308 -53.10384 35.431219 0.9220641
## L3-L1 73.296817 29.02929 117.564344 0.0026915
## L4-L1 -38.510299 -82.77783 5.757228 0.0919670
## L3-L2 82.133125 37.86560 126.400652 0.0012181
## L4-L2 -29.673991 -73.94152 14.593536 0.2262619
## L4-L3 -111.807116 -156.07464 -67.539589 0.0001189
# cf
anova(lm(Response ~ Block + Treatments, data = data))
## Analysis of Variance Table
##
## Response: Response
## Df Sum Sq Mean Sq F value Pr(>F)
## Block 3 2055.9 685.3 1.704 0.2352124
## Treatments 3 27060.8 9020.3 22.430 0.0001636 ***
## Residuals 9 3619.4 402.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Blocks are usually treated as random effects, because they represent draws from the population of all possible blocks. In other words, we are usually not interested in mean comparisons among specific blocks; we use the blocks to account for block-to-block variation. To understand this concept more clearly, I strongly recommend watching this video.
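If you want to model the block as a random effect explicitly, a linear mixed model does this directly. A minimal sketch using the lme4 package (an assumption: lme4 is not otherwise part of this tutorial and must be installed; `data` is the RCBD data frame above):
# Mixed model: Treatments as a fixed effect, Block as a random intercept
library(lme4)
fit_mixed <- lmer(Response ~ Treatments + (1 | Block), data = data)
summary(fit_mixed)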
A Latin square design is an example of an incomplete block design with a single treatment factor and two blocking variables, each having the same number of levels. In a Latin square design, only a single treatment is applied within each combination of the blocking variables, and each treatment (letter) appears exactly once in each row and in each column. There are many possible Latin squares of a given size: for any Latin square, permuting the rows or the columns yields another Latin square. The Latin square design for our example is shown below.
# Design
set.seed(123)
# Latin Square function
latin <- function (n, nrand = 20)
{
# start from a cyclic Latin square: row i is the alphabet shifted by i - 1
x = matrix(LETTERS[1:n], n, n)
x = t(x)
for (i in 2:n) x[i, ] = x[i, c(i:n, 1:(i - 1))]
# randomize by repeatedly permuting whole rows and whole columns
if (nrand > 0) {
for (i in 1:nrand) {
x = x[sample(n), ]
x = x[, sample(n)]
}
}
x
}
rep <- latin(4)
rep[rep == "A"] <- "L1"
rep[rep == "B"] <- "L2"
rep[rep == "C"] <- "L3"
rep[rep == "D"] <- "L4"
colnames(rep) <- c("C1", "C2", "C3", "C4") # type of cells
rownames(rep) <- c("E1", "E2", "E3", "E4") # type of enzymes
knitr::kable(rep)
|    | C1 | C2 | C3 | C4 |
|----|----|----|----|----|
| E1 | L3 | L1 | L2 | L4 |
| E2 | L2 | L4 | L1 | L3 |
| E3 | L1 | L3 | L4 | L2 |
| E4 | L4 | L2 | L3 | L1 |
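As a quick sanity check, we can verify the Latin square property of the generated design, namely that every enzyme level appears exactly once in each row and each column:
# Every enzyme level should appear exactly once in each row and each column
all(apply(rep, 1, function(r) setequal(r, c("L1", "L2", "L3", "L4"))))
all(apply(rep, 2, function(cl) setequal(cl, c("L1", "L2", "L3", "L4"))))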
In a Latin square design there are two blocking variables, so we need to account for both of them at the same time, as shown below.
data <- rbind(cbind(rep("E1", 4), paste0("C",1:4), c("L3", "L1", "L2", "L4")),
cbind(rep("E2", 4), paste0("C",1:4), c("L2", "L4", "L1", "L3")),
cbind(rep("E3", 4), paste0("C",1:4), c("L1", "L3", "L4", "L2")),
cbind(rep("E4", 4), paste0("C",1:4), c("L4", "L2", "L3", "L1")))
# Data
data <- as.data.frame(data)
data[data[,3] == "L1",4] <- runif(4, min = 150, max = 250)
data[data[,3] == "L2",4] <- runif(4, min = 100, max = 200)
data[data[,3] == "L3",4] <- runif(4, min = 200, max = 300)
data[data[,3] == "L4",4] <- runif(4, min = 145, max = 155)
colnames(data) <- c("Enzyme_type", "Cell_types", "Levels", "yield")
colnames(data) <- c("Block1", "Block2", "Treatments", "Response")
str(data)
## 'data.frame': 16 obs. of 4 variables:
## $ Block1 : Factor w/ 4 levels "E1","E2","E3",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Block2 : Factor w/ 4 levels "C1","C2","C3",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Treatments: Factor w/ 4 levels "L1","L2","L3",..: 3 1 2 4 2 4 1 3 1 3 ...
## $ Response : num 241 200 136 154 153 ...
boxplot(Response ~ Treatments, data = data,
main = "yield of cells according to the different levels of enzyme")
# ANOVA Analysis
fit <- aov(Response ~ Treatments + Block1 + Block2, data = data)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatments 3 21765 7255 80.759 3.06e-05 ***
## Block1 3 2034 678 7.548 0.0185 *
## Block2 3 659 220 2.446 0.1618
## Residuals 6 539 90
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Response ~ Treatments + Block1 + Block2, data = data)
##
## $Treatments
## diff lwr upr p adj
## L2-L1 -50.758316 -73.95883 -27.55781 0.0011366
## L3-L1 40.283277 17.08277 63.48379 0.0038752
## L4-L1 -44.781295 -67.98181 -21.58078 0.0022279
## L3-L2 91.041593 67.84108 114.24210 0.0000415
## L4-L2 5.977021 -17.22349 29.17753 0.8097381
## L4-L3 -85.064572 -108.26508 -61.86406 0.0000618
##
## $Block1
## diff lwr upr p adj
## E2-E1 -3.222987 -26.423497 19.977524 0.9605895
## E3-E1 18.256639 -4.943871 41.457150 0.1198675
## E4-E1 -12.885208 -36.085718 10.315303 0.3115406
## E3-E2 21.479626 -1.720885 44.680136 0.0674617
## E4-E2 -9.662221 -32.862731 13.538290 0.5210716
## E4-E3 -31.141847 -54.342357 -7.941336 0.0138293
##
## $Block2
## diff lwr upr p adj
## C2-C1 -5.7547164 -28.95523 17.445794 0.8255475
## C3-C1 -17.4553525 -40.65586 5.745158 0.1384964
## C4-C1 -4.7780399 -27.97855 18.422471 0.8885388
## C3-C2 -11.7006362 -34.90115 11.499874 0.3798203
## C4-C2 0.9766765 -22.22383 24.177187 0.9987706
## C4-C3 12.6773127 -10.52320 35.877823 0.3227538
# cf
anova(lm(Response ~ Treatments + Block1 + Block2, data = data))
## Analysis of Variance Table
##
## Response: Response
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatments 3 21764.8 7254.9 80.7588 3.059e-05 ***
## Block1 3 2034.3 678.1 7.5482 0.01845 *
## Block2 3 659.2 219.7 2.4460 0.16176
## Residuals 6 539.0 89.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We generated a hypothetical yield for each cell and performed an ANOVA. As you can see in the TukeyHSD result, most pairs of enzyme levels differ significantly in yield; the exception is L4 versus L2, whose confidence interval contains 0. In contrast, most comparisons among the blocking levels are not significant. (You can double-check this with the confidence intervals in the TukeyHSD output: the non-significant comparisons contain 0 between the lower and upper confidence limits.)
A factorial design is a design consisting of two or more factors, each with discrete possible values or “levels”, whose experimental units take on all possible combinations of these levels across all factors. If the design contains all possible combinations of the levels, we call the factors crossed.
The question above illustrates a fully crossed design. The number of experimental units in this design is given by
\[N = n_i \times n_j \times n_k \times n_r\]
where \(N\) is the total number of experimental units, \(n_i\), \(n_j\), and \(n_k\) are the numbers of levels of the first, second, and third factors, and \(n_r\) is the number of replicates. Because we have three factors with 4 levels each and 3 replicates, the number of experimental units is \(4 \times 4 \times 4 \times 3 = 192\). In other words, to run the factorial design in our experimental setting, we need to culture cells in 192 separate dishes.
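One way to enumerate the full crossing in R is expand.grid; a minimal sketch (the factor names here are illustrative):
# All combinations of three 4-level factors and 3 replicates: 4 * 4 * 4 * 3 = 192
design_full <- expand.grid(Enzyme = paste0("E", 1:4),
                           Cell   = paste0("C", 1:4),
                           Level  = paste0("L", 1:4),
                           Rep    = paste0("R", 1:3))
nrow(design_full) # 192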
# Design
set.seed(123)
n <- 4 * 4 * 4 * 3
dish <- 1:n
enzyme_types <- rep(c("E1","E2","E3","E4"), n/4)
levels <- rep(c("L1","L2","L3","L4"), n/4)
cell_types <- rep(c("C1","C2","C3","C4"), n/4)
replicate <- rep(c("R1", "R2", "R3"), n/3)
# randomly shuffle the factor assignments across the dishes
data <- data.frame(dish, sample(enzyme_types),
sample(cell_types), sample(levels), replicate)
In a factorial design, we also need to consider interaction effects among the factors; below, the interaction between the two blocking variables is included in the model.
# data
data[data[,4] == "L1",6] <- runif(n/4, min = 150, max = 250)
data[data[,4] == "L2",6] <- runif(n/4, min = 100, max = 200)
data[data[,4] == "L3",6] <- runif(n/4, min = 200, max = 300)
data[data[,4] == "L4",6] <- runif(n/4, min = 145, max = 155)
colnames(data) <- c("Dishes", "Block1", "Block2",
"Treatments", "Replicate", "Response"); head(data)
## Dishes Block1 Block2 Treatments Replicate Response
## 1 1 E4 C2 L4 R1 149.1867
## 2 2 E3 C3 L3 R2 208.6278
## 3 3 E2 C3 L4 R3 152.1788
## 4 4 E3 C1 L2 R1 104.2108
## 5 5 E1 C1 L2 R2 136.4411
## 6 6 E1 C4 L4 R3 152.4254
boxplot(Response ~ Treatments, data = data,
main = "yield of cells according to the different levels of enzyme")
# ANOVA Analysis
fit <- aov(Response ~ Treatments + Block1 * Block2 + Replicate, data = data)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatments 3 300075 100025 151.327 <2e-16 ***
## Block1 3 1660 553 0.837 0.475
## Block2 3 2646 882 1.334 0.265
## Replicate 2 1655 827 1.252 0.289
## Block1:Block2 9 6771 752 1.138 0.338
## Residuals 171 113029 661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit)$Treatments
## diff lwr upr p adj
## L2-L1 -50.859645 -64.47535 -37.24394 2.298162e-14
## L3-L1 42.720026 29.10432 56.33573 4.958256e-13
## L4-L1 -52.538916 -66.15462 -38.92321 2.364775e-14
## L3-L2 93.579671 79.96397 107.19537 0.000000e+00
## L4-L2 -1.679271 -15.29497 11.93643 9.886316e-01
## L4-L3 -95.258942 -108.87465 -81.64324 0.000000e+00
As you can see, only the treatment effect is significant in this example.
Let’s formalize our observations as follows. Suppose you have several observations, say \(Y_{ij}\), that follow a normal distribution with mean \(\mu_i\) and variance \(\sigma^2\), where we assume equal variance for all observations. We can write \[Y_{ij} \sim i.i.d.\ N(\mu_i, \sigma^2)\] where \(i = 1,...,k\) indexes the treatment groups and \(j=1,...,n\) indexes the samples in each group. We can rewrite the above relationship as \[Y_{ij} = \mu_i + \epsilon_{ij}, \ \ \epsilon_{ij} \sim i.i.d \ N(0, \sigma^2)\]
which means the random errors of all observations follow a normal distribution with mean \(0\) and variance \(\sigma^2\). In other words, each observation is the sum of a deterministic part \(\mu_i\) and a stochastic error \(\epsilon_{ij}\) that fluctuates around zero (i.e., \(E[\epsilon_{ij}] = 0\)). We also say that \(Y\) is the response variable and the grouping information is the predictor variable. This is a linear regression model with a categorical predictor and normally distributed errors.
We can rewrite \(\mu_i\) as \[\mu_i = \mu + \alpha_i\] for \(i = 1,...,k\) to obtain \[Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\]
Here \(\alpha_i\) represents the \(i\)th treatment effect, and we assume that the treatment effects sum to zero, \(\sum_{i=1}^{k} \alpha_i = 0\) (alternatively, you can use \(\alpha_1 = 0\) as the side constraint; this is the default constraint in R. Because we have \(k+1\) parameters but only \(k\) group means, we need one such side constraint to estimate \(\mu_i\)). Think of \(\mu\) as a “grand mean” and \(\alpha_i\) as a “deviation from the grand mean due to the \(i\)th treatment effect”. This interpretation is not always exact, but it is still useful.
We now estimate the parameters using the least squares criterion, which ensures that the model fits the data well in the sense that the squared deviations from the observed data \(y_{ij}\) to the model values \(\mu + \alpha_i\) are minimized, i.e., \[\hat \mu, \hat \alpha_i = argmin_{\mu, \alpha_i} \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \mu - \alpha_i)^2\] or simply \[\hat \mu_i = argmin_{\mu_i}\sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \mu_i)^2\] Because we can estimate the mean of each group independently, we get \(\hat \mu_i = \bar y_{i.}\), hence \[\hat \mu_i = \hat \mu + \hat \alpha_i = \bar y_{i.}\] where \[\bar y_{i.} = \frac{1}{n}\sum_{j=1}^{n} y_{ij} \hspace {1cm} mean \ \ of \ \ group \ \ i\]
Depending on the side constraint that we use, we get different results for \(\hat \alpha_i\).
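We can see this in R directly; a minimal sketch, assuming the CRD data frame `data` (with columns Treatments and Response) from the examples above. With R's default treatment constraint the intercept is the first group mean; with the sum-to-zero constraint it is the average of the group means (the grand mean in a balanced design):
# Default side constraint (contr.treatment): alpha_1 = 0,
# so the intercept is the first group mean
coef(lm(Response ~ Treatments, data = data))
# Sum-to-zero side constraint (contr.sum): sum of alpha_i = 0,
# so the intercept is the average of the group means
coef(lm(Response ~ Treatments, data = data,
        contrasts = list(Treatments = contr.sum)))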
The estimate of the error variance \(\hat \sigma^2\) is also called the mean squared error, MSE. It is given by \[\hat \sigma^2 = MSE = \frac{1}{N-k} SSE\] where \[SSE = \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \hat\mu_i)^2 \hspace {1cm} \ error \ sum \ of \ squares\] With these definitions, we can state the hypothesis test shown below.
\[H_0: \mu_1 = \cdots = \mu_k\] \[H_A: \mu_i \neq \mu_l \ for \ \ at \ \ least \ \ one \ \ pair \ \ i \neq l\]
By constructing a statistical test through decomposing the variation of the data, we can test our hypothesis. Here we test whether total variance of the response around the overall mean originated from the variance “between groups” or the variance “within groups”. Formally,
\[SST = \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar y_{..})^2 \hspace {1cm} \ total \ sum \ of \ squares\] \[SSTr = \sum_{i=1}^{k}\sum_{j=1}^{n} (\bar y_{i.} - \bar y_{..})^2 \hspace {1cm} \ treatment \ sum \ of \ squares \ (between \ groups)\] \[SSE = \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar y_{i.})^2 \hspace {1cm} \ error \ sum \ of \ squares \ (within \ groups)\] \[SST = SSTr + SSE\]
where
\[y_{i.} = \sum_{j=1}^{n} y_{ij} \hspace {1cm} sum \ \ of \ \ group \ \ i\]
\[y_{..} = \sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij} \hspace {1cm} sum \ \ of \ \ all \ \ observations\]
\[\bar y_{i.} = \frac{1}{n}\sum_{j=1}^{n} y_{ij} \hspace {1cm} mean \ \ of \ \ group \ \ i\]
\[\bar y_{..} = \frac{1}{N} \sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij} \hspace {1cm} overall \ \ (or \ \ total) \ \ mean\]
If all the groups share the same (theoretical) mean, we expect the treatment sum of squares to be small: Just due to the random nature of the data we expect small differences between the different (empirical) group means. The idea is now to compare the variation between groups with the variation within groups. If the variation between groups is substantially larger than the variation within groups we have evidence against the null hypothesis.
In fact, it can be shown that under \(H_0\), it holds that
\[F = \frac{MSTr}{MSE} \sim F_{k-1, N-k}\] where \[MSTr = \frac{SSTr}{k-1} \hspace {1cm} Mean \ \ squares \ \ of \ \ treatments\] \[MSE = \frac{SSE}{N-k} \hspace {1cm} Mean \ \ squares \ \ of \ \ errors\]
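As a quick empirical check of this distributional claim, we can simulate the F statistic under \(H_0\) and compare it with the theoretical \(F_{k-1, N-k}\) distribution; a minimal sketch (the group sizes here are illustrative):
# Simulate the F statistic under H0 (all group means equal) and compare
# it with the theoretical F(k - 1, N - k) distribution
set.seed(1)
k <- 4; n <- 4; N <- k * n
fstats <- replicate(2000, {
  y <- rnorm(N) # data generated under H0: one common mean
  g <- gl(k, n) # k groups of n observations each
  summary(aov(y ~ g))[[1]]$`F value`[1]
})
qqplot(qf(ppoints(2000), k - 1, N - k), fstats,
       xlab = "Theoretical F quantiles", ylab = "Simulated F statistics")
abline(0, 1)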
So far we have derived equations to model our observations and to construct the test statistic. Once we compute these statistics from our data, we need to check their validity by checking the assumptions underlying our model. This is called residual analysis. The assumption we made is \(\epsilon_{ij} \sim i.i.d. \ N(0, \sigma^2)\). We cannot test this assumption directly, because we cannot observe the errors. Instead, we estimate them using the residuals \[r_{ij} = y_{ij} - \hat \mu_i\]
In R, there are several tools for examining residuals. A Q-Q plot compares the residuals to a theoretical reference distribution (the normal distribution by default). A TA plot (Tukey-Anscombe plot) shows whether the residuals have constant variance and zero mean. An index plot shows any sequential structure in the residuals. You can produce these plots by simply calling the plot function on the fitted model.
par(mfrow=c(2,2)); plot(fit); par(mfrow=c(1,1))
This is called Analysis of Variance or ANOVA.
Recall that we started from the equation \[Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\] This is exactly the model equation of the completely randomized design, or CRD. If you add a blocking variable to this equation, you get the randomized complete block design, or RCBD; formally,
\[y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}, \ \ \ i = 1, ..., k, \ \ j = 1,...,n\] where
\(\mu\) is an overall mean
\(\alpha_{i}\) is the effect of treatment \(i\), where \(\sum_{i=1}^{k} \alpha_i = 0\)
\(\beta_{j}\) is the effect of block \(j\), where \(\sum_{j=1}^{n} \beta_j = 0\)
and \(e_{ij} \sim i.i.d \ N(0,\sigma_e^2)\) is individual random error
Note that \(\mu_{ij} = \mu + \alpha_i + \beta_j\). We could assume \(\alpha_1 = \beta_1 = 0\)
If you have two blocking variables, each with the same number of levels, you can apply the Latin square design; formally,
\[y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + \epsilon_{ijk}\]
where,
\(\mu\) is a population mean
\(\alpha_i\) is the main effect of treatment \(i\), \(i = 1,2, ..., n\), \(\sum_{i=1}^n \alpha_i = 0\)
\(\beta_j\) is a main effect of the first blocking variable \(j\), \(j=1, 2, ..., n\), \(\sum_{j=1}^n \beta_j = 0\)
\(\gamma_k\) is a main effect of the second blocking variable \(k\), \(k=1, 2, ..., n\), \(\sum_{k=1}^n \gamma_k = 0\)
\(\epsilon_{ijk} \sim i.i.d. \ N(0, \sigma_{\epsilon}^2)\) is the individual error distribution
For the factorial design, we simply add an interaction term, as shown below
\[y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}, \ \ \ i = 1,...,a, \ \ j = 1, ..., b, \ \ k = 1,...,n\] Note that although it looks like a multiplication, the interaction term need not imply multiplicative interaction.
You can extend this model to more than two factors by writing \[y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + e_{ijkl}, \ \ \ i = 1,...,a, \ \ j = 1, ..., b, \ \ k=1,...,c, \ \ l = 1,...,n\]
You can also write the equation \(Y_{ij} = \mu + \alpha_i + \epsilon_{ij}\) in matrix notation as \[Y = X\beta + E\] where
\(Y \in R^N\) contains all the response values,
\(X\) is an \(N \times k\) matrix, the so-called design matrix,
\(\beta \in R^k\) contains all parameters and
\(E \in R^N\) are the error terms.
This is a typical setup for any regression problem.
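We can inspect this setup in R; a minimal sketch, assuming the CRD data frame `data` (columns Treatments and Response) from earlier:
# Design matrix X for the one-way model (default treatment coding)
X <- model.matrix(~ Treatments, data = data)
head(X)
# Normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% data$Response)
cbind(beta_hat, coef(lm(Response ~ Treatments, data = data))) # identical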
In summary, ANOVA, design of experiments, and regression are essentially the same analysis with different representations and purposes. In our example, the regression model uses categorical predictors.
The following example gives ten measurements of a response at each of three temperature settings (250, 300, and 350).

| 250 | 300 | 350 |
|-----|-----|-----|
| 2.4 | 2.6 | 3.2 |
| 2.7 | 2.4 | 3.0 |
| 2.2 | 2.8 | 3.1 |
| 2.5 | 2.5 | 2.8 |
| 2.0 | 2.2 | 2.5 |
| 2.5 | 2.7 | 2.9 |
| 2.8 | 2.3 | 3.1 |
| 2.9 | 3.1 | 3.4 |
| 2.4 | 2.9 | 3.2 |
| 2.1 | 2.2 | 2.6 |
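The calculations below assume the table has been stored in a matrix `dt` with one column per temperature setting; the columns are named low, mid, and high (for 250, 300, and 350) to match the comparison labels that appear later. A minimal construction:
# Response measurements: 10 observations at each of three temperatures
dt <- matrix(c(2.4, 2.6, 3.2,
               2.7, 2.4, 3.0,
               2.2, 2.8, 3.1,
               2.5, 2.5, 2.8,
               2.0, 2.2, 2.5,
               2.5, 2.7, 2.9,
               2.8, 2.3, 3.1,
               2.9, 3.1, 3.4,
               2.4, 2.9, 3.2,
               2.1, 2.2, 2.6), ncol = 3, byrow = TRUE)
colnames(dt) <- c("low", "mid", "high") # 250, 300, 350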
# Calculate Test Statistics
# number of sample & factor
n <- length(dt)
k <- 3
nt <- length(dt)/k
# degrees of freedom
df <- k - 1
df_err <- n - k
df_sum <- n - 1
# Test statistics
CT <- (sum(colSums(dt)))^2 / n # correction term: (grand total)^2 / n
SST <- sum((dt^2)) - CT # total sum of squares
SSTr <- (sum(colSums(dt)^2) / nt) - CT # treatment (between-group) sum of squares
SSE <- SST - SSTr # error (within-group) sum of squares
MSE <- SSE/(n - k)
MSTr <- SSTr / df
f <- MSTr/MSE
Df <- c(df, df_err, df_sum)
SS <- c(SSTr, SSE, SST)
MS <- c(round(MSTr,3), round(MSE,3), "")
# result table (same format as summary(aov))
res <- cbind(Df, SS = round(SS,2), MS, F = c(round(f,3),"",""))
knitr::kable(res)
| Df | SS   | MS    | F     |
|----|------|-------|-------|
| 2  | 1.54 | 0.772 | 8.904 |
| 27 | 2.34 | 0.087 |       |
| 29 | 3.89 |       |       |
# hypothesis testing at the 1% significance level
fa <- qf(0.99, df, df_err) # critical value of F(2, 27)
f > fa
## [1] TRUE
# p-value (upper-tail probability of the F distribution)
pf(f, df, df_err, lower.tail = FALSE) # approx. 0.00107, cf. summary(aov) below
# Therefore, we can reject the null hypothesis
# In other words, the group means are statistically different
boxplot(dt)
# Using R's built-in aov() (with dplyr/tidyr for reshaping)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# tidy data
temp_mat <- as.data.frame(dt) %>% # dt is the matrix constructed above
gather(temperature, response, factor_key = TRUE)
# fitting anova
fit <- aov(response ~ temperature, data = temp_mat)
# result
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## temperature 2 1.545 0.7723 8.904 0.00107 **
## Residuals 27 2.342 0.0867
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# confidence interval
getci <- function(diff, stat, s, tukey=FALSE) {
ci <- matrix(rep(0,6), nrow = 3)
if (tukey == FALSE) {
# t-based CI: margin = t * s * sqrt(2/nt), the SE of a difference of two means
for (i in 1:3) {
ci[i,] <- c(round(diff[i] - (stat * (s*sqrt(2/nt))),2),
round(diff[i] + (stat * (s*sqrt(2/nt))),2))
}
rownames(ci) <- c("mid-low", "high-low", "high-mid")
colnames(ci) <- c("lwr", "upr")
return(ci)
} else {
# Tukey CI: margin = q * s * sqrt(1/nt), with q from the studentized range
for (i in 1:3) {
ci[i,] <- c(round(diff[i] - (stat * (s*sqrt(1/nt))),2),
round(diff[i] + (stat * (s*sqrt(1/nt))),2))
}
rownames(ci) <- c("mid-low", "high-low", "high-mid")
colnames(ci) <- c("lwr", "upr")
return(ci)
}
}
getdiff <- function(x) {
diff <- c(x[2] - x[1], x[3] - x[1], x[3] - x[2])
return(diff)
}
meantemp <- apply(dt, 2, FUN = mean) # or by using dplyr
# meantemp <- temp_mat %>% group_by(temperature) %>% summarise(mean(response))
diff <- getdiff(meantemp)
t <- qt(0.995, df_err)
s <- sqrt(MSE)
# by t-distribution
getci(diff, t, s)
## lwr upr
## mid-low -0.24 0.48
## high-low 0.17 0.89
## high-mid 0.05 0.77
# tukey method
q <- qtukey(0.99, nmeans = k, df = df_err)
getci(diff, q, s,tukey = TRUE)
## lwr upr
## mid-low -0.30 0.54
## high-low 0.11 0.95
## high-mid -0.01 0.83
# Base R package: Tukey Honestly Significant Differences
TukeyHSD(fit, conf.level = 0.99) # where fit comes from aov()
## Tukey multiple comparisons of means
## 99% family-wise confidence level
##
## Fit: aov(formula = response ~ temperature, data = temp_mat)
##
## $temperature
## diff lwr upr p adj
## mid-low 0.12 -0.298625815 0.5386258 0.6381048
## high-low 0.53 0.111374185 0.9486258 0.0011718
## high-mid 0.41 -0.008625815 0.8286258 0.0117189
data <- matrix(c(40,24,31,21, 47,37,41,41, 46,53,48,46, 84,76,77,79), nrow = 8, byrow = TRUE)
data <- data.frame(Temperature = c(rep("Low",4), rep("High", 4)),
Fertilizer1 = data[,1], Fertilizer2 = data[,2])
growthdf <- data %>% gather(Fertilizer, growth, -Temperature, factor_key = TRUE)
# visualize data
library(ggplot2)
ggplot(growthdf, aes(x = Fertilizer, y = growth, col = Temperature)) +
geom_boxplot()
# base R package
fit <- aov(growth ~ Fertilizer * Temperature, data = growthdf)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 1 86 86 0.432 0.52346
## Temperature 1 3221 3221 16.257 0.00166 **
## Fertilizer:Temperature 1 77 77 0.386 0.54579
## Residuals 12 2377 198
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(fit,~.,test = "F") # type III SS and F Tests
## Single term deletions
##
## Model:
## growth ~ Fertilizer * Temperature
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2377.2 88.018
## Fertilizer 1 0.12 2377.4 86.019 0.0006 0.98037
## Temperature 1 1152.00 3529.3 92.340 5.8151 0.03283 *
## Fertilizer:Temperature 1 76.56 2453.8 86.525 0.3865 0.54579
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For confidence intervals, we use the TukeyHSD function again.
\(H_{0}: \mu_{11} = ... = \mu_{jk}\) (all treatment-combination means are equal)
\(H_{A}:\) \(\mu_{jk} \neq \mu_{j'k'}\) for at least one pair of combinations
data <- matrix(c(40,24,31,21, 47,37,41,41, 46,53,48,46, 84,76,77,79), nrow = 8, byrow = TRUE)
data <- data.frame(Temperature = c(rep("Low",4), rep("High", 4)),
Fertilizer1 = data[,1], Fertilizer2 = data[,2])
growthdf <- data %>% gather(Fertilizer, growth, -Temperature, factor_key = TRUE)
fit <- aov(growth ~ Fertilizer * Temperature, data = growthdf)
TukeyHSD(fit, conf.level = 0.99) # where fit comes from aov()
## Tukey multiple comparisons of means
## 99% family-wise confidence level
##
## Fit: aov(formula = growth ~ Fertilizer * Temperature, data = growthdf)
##
## $Fertilizer
## diff lwr upr p adj
## Fertilizer2-Fertilizer1 -4.625 -26.12124 16.87124 0.5234602
##
## $Temperature
## diff lwr upr p adj
## Low-High -28.375 -49.87124 -6.878757 0.0016633
##
## $`Fertilizer:Temperature`
## diff lwr upr p adj
## Fertilizer2:High-Fertilizer1:High -0.25 -38.96755 38.467553 0.9999940
## Fertilizer1:Low-Fertilizer1:High -24.00 -62.71755 14.717553 0.1276054
## Fertilizer2:Low-Fertilizer1:High -33.00 -71.71755 5.717553 0.0273384
## Fertilizer1:Low-Fertilizer2:High -23.75 -62.46755 14.967553 0.1328877
## Fertilizer2:Low-Fertilizer2:High -32.75 -71.46755 5.967553 0.0285670
## Fertilizer2:Low-Fertilizer1:Low -9.00 -47.71755 29.717553 0.8029139
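To see why the interaction is not significant, an interaction plot is a useful companion: roughly parallel profiles indicate little interaction. A minimal sketch with base R:
# Mean growth for each Fertilizer x Temperature cell; roughly parallel
# profiles are consistent with the non-significant interaction above
with(growthdf, interaction.plot(Fertilizer, Temperature, growth))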
| Patient | Low  | Medium | High |
|---------|------|--------|------|
| 1       | 12.2 | 16.5   | 25.4 |
| 2       | 29.3 | 34.7   | 29.4 |
| 3       | 14.1 | 26.1   | 34.0 |
| 4       | 18.5 | 33.3   | 21.7 |
| 5       | 32.9 | 36.9   | 43.0 |
| 6       | 15.3 | 32.6   | 32.1 |
| 7       | 26.1 | 28.9   | 40.9 |
| 8       | 22.8 | 28.6   | 31.1 |
ddl <- matrix(c(1:8,
12.2, 29.3, 14.1, 18.5, 32.9, 15.3, 26.1, 22.8,
16.5, 34.7, 26.1, 33.3, 36.9, 32.6, 28.9, 28.6,
25.4, 29.4, 34.0, 21.7, 43.0, 32.1, 40.9, 31.1), nrow = 8, byrow = FALSE)
rownames(ddl) <- 1:8
colnames(ddl) <- c("Patient", "Low", "Medium", "High")
# tidy data & visualize
ddl_tidy_df <- as.data.frame(ddl) %>% gather(dosage, response, -Patient, factor_key = TRUE)
ggplot(ddl_tidy_df, aes(x = Patient, y = response, col = dosage)) +
geom_line()
ddl_tidy_df$Patient <- as.factor(ddl_tidy_df$Patient)
#### Anova
fit <- aov(response ~ dosage + Patient, data = ddl_tidy_df)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## dosage 2 511.4 255.71 10.983 0.00135 **
## Patient 7 724.7 103.53 4.446 0.00852 **
## Residuals 14 326.0 23.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ddl <- matrix(c(1:8,
12.2, 29.3, 14.1, 18.5, 32.9, 15.3, 26.1, 22.8,
16.5, 34.7, 26.1, 33.3, 36.9, 32.6, 28.9, 28.6,
25.4, 29.4, 34.0, 21.7, 43.0, 32.1, 40.9, 31.1), nrow = 8, byrow = FALSE)
rownames(ddl) <- 1:8
colnames(ddl) <- c("Patient", "Low", "Medium", "High")
# tidy data & visualize
ddl_tidy_df <- as.data.frame(ddl) %>% gather(dosage, response, -Patient, factor_key = TRUE)
fit <- aov(response ~ dosage + Patient, data = ddl_tidy_df)
# parameter setting for calculation of qvalue
a <- 0.05
k <- 3
df <- 2 * (length(unique(ddl_tidy_df$Patient)) - 1)
# qvalue
q <- qtukey(1 - a, k, df)
TukeyHSD(fit, conf.level = 0.95) # fit from aov(); Patient is numeric in this chunk, hence the warnings below
## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## Patient
## Warning in TukeyHSD.aov(fit, conf.level = 0.95): 'which' specified some
## non-factors which will be dropped
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response ~ dosage + Patient, data = ddl_tidy_df)
##
## $dosage
## diff lwr upr p adj
## Medium-Low 8.3 -0.2240681 16.82407 0.0571859
## High-Low 10.8 2.2759319 19.32407 0.0118146
## High-Medium 2.5 -6.0240681 11.02407 0.7418027
Not included in this tutorial:
- See Hayter's book, pp. 729-731
- See Hayter's book, pp. 679-687
- See Hayter's book, pp. 687-693