Contents

Basic Theory of Hypothesis Testing

  • Significance test
  • Types of Error
  • Most powerful test

Parametric Statistical Testing

  • Z-test (one & two sample)
  • T-test (one & two sample)
  • Paired T-test
  • Variance Test (\(\chi^2\), F)
  • ANOVA (One way)

Nonparametric Statistical Testing (Lecture 8)

  • Chi-square(GOF) test
  • Fisher’s exact Test
  • Wilcoxon rank-sum Test
  • Kolmogorov-Smirnov (GOF)
  • Kruskal-Wallis test

Extra Issues (Lecture 8)

  • Multiple testing correction
  • Other ad hoc tests, including meta-analysis and the Cochran-Mantel-Haenszel test

Tutorial

Basics of statistical testing

Significance Test (s5)

# one-sided
qnorm(0.95) # 95 %
## [1] 1.644854
qnorm(0.99) # 99 %
## [1] 2.326348
# two-sided
qnorm(0.975) # 95 %
## [1] 1.959964
qnorm(0.995) # 99 %
## [1] 2.575829
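`qnorm` goes from a tail probability to a critical value; the inverse direction, converting an observed statistic into a p-value with `pnorm`, comes up just as often. A minimal sketch:

```r
# Convert an observed z statistic into a p-value
z <- 2.5

# one-sided (upper tail)
p_one <- pnorm(z, lower.tail = FALSE); p_one
## [1] 0.006209665

# two-sided: double the smaller tail area
p_two <- 2 * pnorm(abs(z), lower.tail = FALSE); p_two
## [1] 0.01241933
```

Comparing `p_two` with \(\alpha\) is equivalent to comparing \(|z|\) with `qnorm(0.975)` above.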

Examples of hypothesis testing with Z-score (s9)

Comparison of two samples

Recall that for two large samples, \[Z = \frac {(\bar X_1 - \bar X_2 - D_0)}{\sqrt {\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0, 1)\] with confidence interval \[\bar X_1 - \bar X_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\] (known variances), or \[\bar X_1 - \bar X_2 \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\] (sample variances).

For two small samples with equal variance,

\[s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 -2}}\]

\[T = \frac {(\bar X_1 - \bar X_2 - D_0)}{s_p \sqrt {\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2) \] \[\bar X_1 - \bar X_2 \pm t_{\alpha/2} (n_1 + n_2 -2) \, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\]

For two small independent samples with unequal variances (Welch's approximation), \[T = \frac {(\bar X_1 - \bar X_2 - D_0)}{\sqrt {\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \sim t(\nu^*)\] where \[\nu^* = \frac{(\frac{s_1^2}{n_1} + \frac {s_2^2}{n_2})^2}{{\frac{({s_1^2})^2}{n_1^2(n_1 - 1)} + \frac{(s_2^2)^2}{n_2^2(n_2-1)}}},\] rounded down to the nearest integer, or conservatively \(\nu = \min(n_1, n_2)-1\) (less powerful).

\[\bar X_1 - \bar X_2 \pm t_{\alpha/2} (\nu^*) \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\] where \(D_0 = \mu_1 - \mu_2\)

Let’s consider the example below.

Psychological study example (Wackerly, p500)

A psychological study was conducted to compare the reaction times of men and women to a stimulus. Independent random samples of 50 men and 50 women were employed in the experiment. The results are shown in the table below. Do the data present sufficient evidence to suggest a difference between the true mean reaction times for men and women? Use \(\alpha = 0.05\).

              Men   Women
n           50.00   50.00
mean (sec)   3.60    3.80
s^2          0.18    0.14
n1 <- 50 # men
n2 <- 50 # women
y1 <- 3.6 # mean (sec)
y2 <- 3.8 # mean (sec)
var1 <- 0.18
var2 <- 0.14
d0 <- 0
alpha <- 0.05

# Using the formula for two large samples (not recommended)
z <- abs((y1 - y2 - d0)/(sqrt((var1/n1) + (var2/n2)))); z
## [1] 2.5
z_critical <- qnorm(1 - alpha/2)
print(paste0("Since |z| = ", z, " > z_critical = ", round(z_critical, 2),
             ", the difference between men and women is statistically significant at the a = 0.05 level"))
## [1] "Since |z| = 2.5 > z_critical = 1.96, the difference between men and women is statistically significant at the a = 0.05 level"
# Using t-distribution (Equal variance)
df <- n1 + n2 - 2
sp <- sqrt(((n1 - 1)*var1 + (n2 - 1)*var2)/df)
t <- abs((y1 - y2 - d0) / (sp*(sqrt(1/n1 + 1/n2)))); t
## [1] 2.5
t_critical <- qt(1 - alpha/2, df); t_critical
## [1] 1.984467
print(paste0("Since |t| = ", t, " > t_critical = ", round(t_critical, 2),
             ", the difference between men and women is statistically significant at the a = 0.05 level"))
## [1] "Since |t| = 2.5 > t_critical = 1.98, the difference between men and women is statistically significant at the a = 0.05 level"
# Using t-distribution (Two independent samples)
nu <- floor(((var1/n1) + (var2/n2))^2 / (((var1/n1)^2/(n1 - 1)) + (var2/n2)^2/(n2 - 1)))
SE <- sqrt((var1/n1) + (var2/n2))
t <- abs((y1 - y2 - d0) / SE); t
## [1] 2.5
t_critical <- qt(1 - alpha/2, nu); t_critical
## [1] 1.984984
print(paste0("Since |t| = ", t, " > t_critical = ", round(t_critical, 2),
             ", the difference between men and women is statistically significant at the a = 0.05 level"))
## [1] "Since |t| = 2.5 > t_critical = 1.98, the difference between men and women is statistically significant at the a = 0.05 level"
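The confidence-interval formulas above can be applied to the same summary numbers; a quick sketch for the large-sample 95% interval:

```r
# 95% CI for the difference in mean reaction times (large-sample form)
n1 <- 50; n2 <- 50; y1 <- 3.6; y2 <- 3.8; var1 <- 0.18; var2 <- 0.14

se <- sqrt(var1/n1 + var2/n2)   # 0.08
(y1 - y2) + c(-1, 1) * qnorm(0.975) * se
## [1] -0.3567971 -0.0432029
```

The interval excludes 0, agreeing with the rejection of \(H_0\) above.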

Relation between Type I and Type II error (Wackerly page 493)

For Jones’s political poll, \(N=15\) voters were sampled. We wish to test \(H_0: p = 0.5\) against the alternative, \(H_A: p < 0.5\). The test statistic is Y, the number of sampled voters favoring Jones.

  1. Calculate \(\alpha\) if we select \(RR = \{y \leq 2\}\) as the rejection region.
  2. Calculate \(\alpha\) if we select \(RR = \{y \leq 5\}\) as the rejection region.
  3. Suppose the result of voting shows us \(\hat p = 0.3\). Calculate \(\beta\) if we select \(RR = \{y \leq 2\}\) as the rejection region.

  1. \(RR = \{y \leq 2\}\), \(\alpha\)?

    By definition,

\[\alpha = P(type \ I \ error) = P(rejecting \ H_0 \ when \ H_0 \ is \ true)\]

\[= P(value \ of \ test \ statistic \ is \ in \ RR \ when \ H_0 \ is \ true)\]

\[= P(Y \leq 2 \ when \ p = 0.5)\]

Since Y is a binomial random variable with \(N = 15\), if \(H_0\) is true, \(p=0.5\) and we obtain

\[P(Incorrect \ rejecting \ H_0) = P(Y\leq2|p=0.5)\]

\[\alpha = \sum_{y=0}^2 {15 \choose y} (0.5)^y(0.5)^{15-y} = {15 \choose 0}(0.5)^{15} + {15 \choose 1}(0.5)^{15} + {15 \choose 2}(0.5)^{15} = 0.004\]

  2. \(RR = \{y \leq 5\}\), \(\alpha\)?

\[P(Incorrect \ rejecting \ H_0) = P(Y\leq5|p=0.5)\]

\[\alpha = \sum_{y=0}^5 {15 \choose y} (0.5)^y(0.5)^{15-y} = {15 \choose 0}(0.5)^{15} + ... + {15 \choose 5}(0.5)^{15} = 0.151\]

  3. \(\hat p = 0.3\), \(RR = \{y \leq 2\}\), \(\beta\)?

\[P(Incorrect \ accepting \ H_0) = P(Y>2|p=0.3)\]

\[\beta = \sum_{y=3}^{15} {15 \choose y} (0.3)^y(0.7)^{15-y}\]

\[= 1 - \sum_{y=0}^{2} {15 \choose y} (0.3)^y(0.7)^{15-y} = 1-0.127 = 0.873\]

# Direct calculation
# 1. RR = {y <= 2}, p = 0.5, alpha
a1 <- (choose(15,0) + choose(15,1) + choose(15,2))*0.5^15; a1
## [1] 0.003692627
# 2. RR = {y <= 5}, p = 0.5, alpha
a2 <- (choose(15,0) + choose(15,1) + choose(15,2) + choose(15,3) + 
        choose(15,4) + choose(15,5))*0.5^15; a2
## [1] 0.1508789
# 3. RR = {y <= 2}, p = 0.3, beta
b1 <- 1 - (choose(15,0)*0.3^0*0.7^15 + choose(15,1)*0.3^1*0.7^14 + 
           choose(15,2)*0.3^2*0.7^13); b1
## [1] 0.8731723
# 4. RR = {y <= 5}, p = 0.3, beta
b2 <- 1 - (choose(15,0)*0.3^0*0.7^15 + choose(15,1)*0.3^1*0.7^14 + 
       choose(15,2)*0.3^2*0.7^13 + choose(15,3)*0.3^3*0.7^12 + 
       choose(15,4)*0.3^4*0.7^11 + choose(15,5)*0.3^5*0.7^10); b2
## [1] 0.2783786
df <- round(matrix(c(a1, a2, b1, b2), nrow = 2, byrow = TRUE), 3)
rownames(df) <- c("a_w_p0.5", "b_w_p0.3")
colnames(df) <- c("RR_le_2", "RR_le_5")
knitr::kable(df)
RR_le_2 RR_le_5
a_w_p0.5 0.004 0.151
b_w_p0.3 0.873 0.278

Relation between Type I and Type II error (s11)

# Using base R function

library(knitr)
# Analyzing the number of voters favoring Jones (Y) to test the true proportion p.
# Rejection regions: RR1 = {y <= 2} and RR2 = {y <= 5}, out of n = 15 voters.
RR1 <- list(); RR2 <- list()

# RR1
RR1$a <- round(pbinom(2, size = 15, prob = 0.5), 3)

# RR2
RR2$a <- round(pbinom(5, size = 15, prob = 0.5), 3)

# Type II error b for RR1 and when the actual result showed p = 0.3.
# P(Y > 2 | p=0.3) = 1 - P(Y <= 2 | p=0.3)

# RR1
RR1$b <- round(1 - pbinom(2, size = 15, prob = 0.3), 3)

# RR2
RR2$b <- round(1 - pbinom(5, size = 15, prob = 0.3), 3)

RR <- cbind(RR1, RR2)
rownames(RR) <- c("a_w_p0.5", "b_w_p0.3")
colnames(RR) <- c("RR_le_2", "RR_le_5")
kable(RR)
RR_le_2 RR_le_5
a_w_p0.5 0.004 0.151
b_w_p0.3 0.873 0.278

We cannot reduce both errors at the same time except by increasing \(n\).
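This trade-off can be checked numerically for the poll example: if we always pick the largest cutoff that keeps \(\alpha \leq 0.05\), then \(\beta\) at \(p = 0.3\) shrinks as \(n\) grows. A sketch (the choice of sample sizes here is arbitrary):

```r
# For each n, take the largest cutoff c with alpha = P(Y <= c | p = 0.5) <= 0.05,
# then compute beta = P(Y > c | p = 0.3) for that rejection region.
for (n in c(15, 30, 60)) {
  c_cut <- max(which(pbinom(0:n, n, 0.5) <= 0.05)) - 1  # which() is 1-indexed
  a <- pbinom(c_cut, n, 0.5)
  b <- 1 - pbinom(c_cut, n, 0.3)
  cat(sprintf("n = %2d, RR = {y <= %2d}: alpha = %.3f, beta = %.3f\n", n, c_cut, a, b))
}
```

For \(n = 15\) the largest valid region is \(\{y \leq 3\}\) with \(\beta \approx 0.70\), and \(\beta\) keeps falling as \(n\) grows while \(\alpha\) stays under 0.05.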

The most powerful test (s12)

Example: Suppose \(y\) is a single observation (one data point) from a population with probability density function given by: \[f(y|\theta) = \begin{cases}{\theta y^{\theta -1}, \ 0<y<1,} \\ {0, \ elsewhere} \end{cases}\] Find the most powerful test (MPT), with significance level \(\alpha = 0.05\), for testing the simple null hypothesis \(H_0: \theta = 2\) against the simple alternative hypothesis \(H_A: \theta = 1\).

Solution) \[\frac{L(\theta_0)}{L(\theta_{a})} = \frac{f(y|\theta_0)}{f(y|\theta_{a})}= \frac{2y^{2-1}}{1y^{1-1}} = 2y \leq k\] For \(\alpha = 0.05\), let \(k^* = \frac{k}{2}\), so we reject when \(y < k^*\).

\[\alpha = P(y < k^* \ when \ \theta = 2) = \int_0^{k^*} 2y \ dy = 0.05\] \[[y^2]_{y=0}^{y=k^*} = (k^*)^2 = 0.05\] \[k^* = (0.05)^{1/2} = 0.2236\]

# a = 0.05, theta_0 = 2, theta_a = 1
a <- 0.05
k <- a^(1/2)
y <- c(0, 1, 0.01)
theta_0 <- 2
theta_a <- 1
f <- function(y) theta_0*(y^(theta_0 - 1))
fa <- function(y) theta_a*(y^(theta_a - 1))
curve(f, col = "red")
curve(fa, add = TRUE, col = "blue")
abline(v = k)
abline(v = 0, lty = 2)
abline(v = 1, lty = 2)
abline(h = 0, lty = 2)
cord.x <- c(0, seq(0, k, 0.001), k)
cord.y <- c(0, f(seq(0, k, 0.001)), 0)
cord.xx <- c(k, seq(k, 1, 0.001), 1)
cord.yy <- c(0, fa(seq(k, 1, 0.001)), 0)
redtrans <- rgb(255, 0, 0, 100, maxColorValue = 255) 
bluetrans <- rgb(0, 0, 255, 100, maxColorValue = 255) 
polygon(cord.x,cord.y, col = redtrans) # alpha
polygon(cord.xx,cord.yy, col = bluetrans) # beta
legend("topleft", legend = c("H_0", "H_a"), col = c("red", "blue"), lty = rep(1,2))

For the case \(H_0:\theta = 1\), \(H_a:\theta=2\), and \(\alpha = 0.1\):

\[\frac{L(\theta_0)}{L(\theta_{a})} = \frac{f(y|\theta_0)}{f(y|\theta_{a})}= \frac{1y^{1-1}}{2y^{2-1}} = \frac{1}{2y} \leq k, \\ y \geq \frac{1}{2k}\] For \(\alpha = 0.1\), let \(k^* = \frac{1}{2k}\), so we reject when \(y > k^*\).

\[\alpha = P(y > k^* \ when \ \theta = 1) = \int_{k^*}^{1} 1 \ dy = 0.1 \\ 1-k^* = 0.1, k^* = 0.9\]

\[\beta = P(y \leq k^* \ when \ \theta = 2) = \int_0^{k^*} 2y \ dy = \int_0^{0.9} 2y \ dy = [y^2]^{0.9}_0 = 0.81\]

# a = 0.1, theta_0 = 1, theta_a = 2
a <- 0.1
k <- 0.9
y <- c(0, 1, 0.01)
theta_0 <- 1
theta_a <- 2
f <- function(y) theta_0*(y^(theta_0 - 1))
fa <- function(y) theta_a*(y^(theta_a - 1))
curve(f, col = "red", ylim = c(0,2))
curve(fa, add = TRUE, col = "blue")
abline(v = k)
abline(v = 0, lty = 2)
abline(v = 1, lty = 2)
abline(h = 0, lty = 2)
cord.x <- c(k, seq(k, 1, 0.001), 1)
cord.y <- c(0, f(seq(k, 1, 0.001)), 0)
redtrans <- rgb(255, 0, 0, 100, maxColorValue = 255) 
bluetrans <- rgb(0, 0, 255, 100, maxColorValue = 255) 
cord.xx <- c(0, seq(0, k, 0.001), k)
cord.yy <- c(0, fa(seq(0, k, 0.001)), 0)
polygon(cord.x,cord.y, col = redtrans) # alpha
polygon(cord.xx,cord.yy, col = bluetrans) # beta
legend("topleft", legend = c("H_0", "H_a"), col = c("red", "blue"), lty = rep(1,2))

Sample size with fixed \(\alpha\) and \(\beta\) (s13)

Calculate appropriate sample size when \(H_0: \mu = 15\), \(H_a: \mu = 16\), \(\alpha = \beta = 0.05\), \(\sigma^2\) is approximately 9.

Since \(z_{\alpha} = z_{\beta} = z_{0.05} = 1.645\), \(n = \frac{(z_{\alpha}+z_{\beta})^2 \sigma^2}{(\mu_a - \mu_0)^2} = \frac{(1.645+1.645)^2(9)}{(16-15)^2} = 97.4\)

a <- 0.05
b <- 0.05
mu_0 <- 15
mu_a <- 16
s <- sqrt(9)
d <- (mu_a - mu_0)/s # effect size
za <- qnorm(a)
zb <- qnorm(b)

n <- ((za + zb)^2 * s^2) / (mu_a - mu_0)^2; n
## [1] 97.39956
# using R function
library(pwr)
pwr.norm.test(d = d, sig.level = a, power = 1 - b, alternative = "greater")
## 
##      Mean power calculation for normal distribution with known variance 
## 
##               d = 0.3333333
##               n = 97.39957
##       sig.level = 0.05
##           power = 0.95
##     alternative = greater

Parametric test (s14)

(Revisit) Examples of two sample Z-test (s17)

              Men   Women
n           50.00   50.00
mean (sec)   3.60    3.80
var          0.18    0.14
n1 <- 50 # man
n2 <- 50 # women
y1 <- 3.6 # mean (sec)
y2 <- 3.8 # mean (sec)
var1 <- 0.18
var2 <- 0.14
d0 <- 0
alpha <- 0.05

# Using the formula for two large samples (not recommended)
z <- abs((y1 - y2 - d0)/(sqrt((var1/n1) + (var2/n2)))); z
## [1] 2.5
z_critical <- qnorm(1 - alpha/2); z_critical
## [1] 1.959964
# plot z-distribution
x <- seq(-5, 5, 0.01)
y <- dnorm(x)
plot(x, y, type = 'l')
abline(v = z_critical, col = "red")
cord.x <- c(z_critical, seq(z_critical, 5, 0.01), 5)
cord.y <- c(0, dnorm(seq(z_critical, 5, 0.01)), 0)
polygon(cord.x,cord.y,col = 'red')
abline(v = z, col = "blue")

# Using t-distribution (Two independent samples)
nu <- floor(((var1/n1) + (var2/n2))^2 / (((var1/n1)^2/(n1 - 1)) + (var2/n2)^2/(n2 - 1)))
SE <- sqrt((var1/n1) + (var2/n2))
t <- abs((y1 - y2 - d0) / SE); t
## [1] 2.5
t_critical <- qt(1 - alpha/2, nu); t_critical
## [1] 1.984984
# plot t-distribution
x <- seq(-5, 5, 0.01)
y <- dt(x, nu)
plot(x, y, type = 'l')
abline(v = t_critical, col = "red")
cord.x <- c(t_critical, seq(t_critical, 5, 0.01), 5)
cord.y <- c(0, dt(seq(t_critical, 5, 0.01), nu), 0)
polygon(cord.x,cord.y,col = 'red')
abline(v = t, col = "blue")

Since the test statistic exceeds the critical value at the \(\alpha = 0.05\) level, sufficient evidence exists to conclude that mean reaction times differ for men and women.

Two sample T-test (s20)

The effect of a new treatment was compared to the standard treatment in two separate patient groups. Can we conclude that the new treatment improves on the standard treatment at \(\alpha=0.01\)? Estimate with both the pooled and unpooled procedures.

x <- c(33, 54, 62, 46, 52, 42, 34, 51, 26, 68, 47, 40, 46, 51, 60)
y <- c(65, 61, 37, 47, 45, 53, 53, 69, 49, 42, 40, 67, 46, 43, 51)
data <- cbind(x,y)
rownames(data) <- 1:15

n1 <- length(x)
n2 <- length(y)
mu1 <- mean(x)
mu2 <- mean(y)
sd1 <- sd(x)
sd2 <- sd(y)
a <- 0.01

# unpooled procedure (one-sided)
df <- ((sd1^2/n1) + (sd2^2/n2))^2/((sd1^4/(n1^2*(n1 - 1))) + (sd2^4/(n2^2*(n2 - 1))))
t <- (mu1 - mu2) / sqrt((sd1^2/n1) + (sd2^2/n2)); t
## [1] -0.9495618
t_critical <- qt(a, df, lower.tail = TRUE); t_critical
## [1] -2.469326
# p-value
pt(t, df)
## [1] 0.1752889
# upper bound of 99 % CI
ci <- mu1 - mu2 + qt(1 - a, df)*sqrt((sd1^2/n1) + (sd2^2/n2)); ci
## [1] 5.975161
# pooled procedure (one-sided)
df <- n1 + n2 - 2
sp2 <- (((n1 - 1) * sd1^2) + ((n2 - 1) * sd2^2))/df
t <- (mu1 - mu2) / (sqrt(sp2) * sqrt(1/n1 + 1/n2))
t_critical <- qt(a, df, lower.tail = TRUE)
pt(t, df)
## [1] 0.1752298
# upper bound of 99 % CI
ci <- mu1 - mu2 + qt(0.990, df) * (sqrt(sp2) * sqrt(1/n1 + 1/n2)); ci
## [1] 5.966569
# base R function
# unpooled
t.test(x, y, alternative = "less", conf.level = 0.99)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -0.94956, df = 27.595, p-value = 0.1753
## alternative hypothesis: true difference in means is less than 0
## 99 percent confidence interval:
##      -Inf 5.975161
## sample estimates:
## mean of x mean of y 
##  47.46667  51.20000
# Pooled
t.test(x, y, alternative = "less", var.equal = TRUE, conf.level = 0.99)
## 
##  Two Sample t-test
## 
## data:  x and y
## t = -0.94956, df = 28, p-value = 0.1752
## alternative hypothesis: true difference in means is less than 0
## 99 percent confidence interval:
##      -Inf 5.966569
## sample estimates:
## mean of x mean of y 
##  47.46667  51.20000

Two sample T-test Example (\(\sigma_1^2 \neq \sigma_2^2\)) (s21)

The nerve conductivity data from healthy subjects and subjects with a nerve disorder are summarized below. Is there sufficient evidence to establish that the nerve conductivity speeds are different for the two groups of subjects?

n1 <- 32
mu1 <- 53.994
sd1 <- 0.974

n2 <- 27
mu2 <- 48.594
sd2 <- 2.490

df <- ((sd1^2/n1) + (sd2^2/n2))^2/((sd1^4/(n1^2*(n1 - 1))) + (sd2^4/(n2^2*(n2 - 1))))
t <- (mu1 - mu2) / sqrt((sd1^2/n1) + (sd2^2/n2)); t
## [1] 10.60498
# 99% ci
upper <- (mu1 - mu2) - (qt(0.005, df) * sqrt((sd1^2/n1) + (sd2^2/n2)))
lower <- (mu1 - mu2) + (qt(0.005, df) * sqrt((sd1^2/n1) + (sd2^2/n2)))
upper; lower
## [1] 6.792575
## [1] 4.007425
# p-value
1 - pt(t, df)
## [1] 2.04281e-12
# base R function (sampling the data from 2 normal distributions; results vary without set.seed)
x <- rnorm(n1, mu1, sd1)
y <- rnorm(n2, mu2, sd2)
t.test(x, y, alternative = "two.sided", conf.level = 0.99)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 11.201, df = 34.183, p-value = 5.501e-13
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
##  3.982317 6.545930
## sample estimates:
## mean of x mean of y 
##  54.01585  48.75172

Paired T-test Example (s23)

The percentage reductions in heart rate for the standard drug \(x_i\) and the new drug \(y_i\), together with the differences \(z_i = x_i - y_i\), for the 40 experimental subjects are listed in the table. Given the average difference \(\bar z = -2.655\) and the sample standard deviation of the differences \(s = 3.730\), can we conclude that there is no difference between the two drugs?

data <- read.csv(file = "http://bisyn.kaist.ac.kr/bis335/example.csv")
data <- data[,-1]
z <- data[,3]

n <- length(z)
mu <- mean(z)
sd <- sd(z)

t <- (sqrt(n)*(mu - 0))/sd; t
## [1] -4.501646
df <- n - 1; df
## [1] 39
pval <- 2*pt(t, df); pval
## [1] 5.948044e-05
# 99% ci
upper <- mu - (qt(0.005, df)*sd/sqrt(n))
lower <- mu + (qt(0.005, df)*sd/sqrt(n))
upper; lower
## [1] -1.057915
## [1] -4.252085
# base R function
t.test(data$x, data$y, paired = TRUE, conf.level = 0.99)
## 
##  Paired t-test
## 
## data:  data$x and data$y
## t = -4.5016, df = 39, p-value = 5.948e-05
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
##  -4.252085 -1.057915
## sample estimates:
## mean of the differences 
##                  -2.655
# plot the sampling distribution of the mean difference (the CI lines refer to the mean)
x <- seq(-15, 15, 0.01)
plot(x, dnorm(x, mu, sd/sqrt(n)), type = 'l')
abline(v = mu)
abline(v = upper, col = "red")
abline(v = lower, col = "red")

Test of variance ratio (s25, Wackerly p537)

Pain thresholds to electrical shocks were measured for males and females. Are the variabilities of the two groups different?

        Males   Females
n        14.0      10.0
ybar     16.2      14.9
s^2      12.7      26.4
n1 <- 14; n2 <- 10; y1 <- 16.2; y2 <- 14.9; var1 <- 12.7; var2 <- 26.4; alpha <- 0.1

f <- var2 / var1; f
## [1] 2.07874
df1 <- n1 - 1; df1
## [1] 13
df2 <- n2 - 1; df2
## [1] 9
# f value for a = 0.1
f_critical <- qf(0.95, df2, df1); f_critical
## [1] 2.714356
# p-value (upper tail); keep the statistic f unchanged
p_value <- 1 - pf(f, df2, df1); p_value
## [1] 0.1118176
# variability is not significantly different at a = 0.1
x <- seq(0,5,0.01)
plot(x, df(x, df2, df1), type = 'l', ylab = "f(x)")
abline(v = f_critical, col = "red")
abline(v = f)

# base R function (applied to simulated samples, so results vary run to run)
x <- rnorm(n1, y1, sqrt(var1))
y <- rnorm(n2, y2, sqrt(var2))
var.test(x, y, alternative = "two.sided", conf.level = 0.90)
## 
##  F test to compare two variances
## 
## data:  x and y
## F = 0.28111, num df = 13, denom df = 9, p-value = 0.03827
## alternative hypothesis: true ratio of variances is not equal to 1
## 90 percent confidence interval:
##  0.09224125 0.76303187
## sample estimates:
## ratio of variances 
##          0.2811097

ANOVA (s27)

In a silicon model of an artery, the flowrates at collapse for the three different stenosis (blocking) levels (L1=0.78, L2=0.71, L3=0.65) were measured. Average collapse flowrates were X1=11.209, X2=15.086, and X3=17.330 for three stenosis models. The sample sizes were n1=11, n2=14, n3=10. Can we conclude that the average collapse flowrates are significantly different?

L1 <- c(10.6, 9.7, 10.2, 14.3, 10.6, 11.0, 13.1, 10.8, 14.3, 10.4, 8.3)
L2 <- c(11.7, 12.7, 12.6, 16.6, 16.3, 14.7, 19.5, 16.8, 15.1, 15.2, 13.2, 14.8, 14.4, 17.6)
L3 <- c(19.6, 15.1, 15.6, 19.2, 18.7, 13.0, 18.9, 18.0, 18.6, 16.6)
L <- c(L1, L2, L3)

k <- 3 # L1, L2, L3
n <- length(L)
n1 <- length(L1)
n2 <- length(L2)
n3 <- length(L3)

mu <- mean(L)
mu1 <- mean(L1)
mu2 <- mean(L2)
mu3 <- mean(L3)

ss <- sum(L^2)
sst <- ss - n*mu^2
sstr <- (n1 * mu1^2) +  (n2 * mu2^2) + (n3 * mu3^2) - (n * mu^2)

sse <- sst - sstr
df <- k - 1 # L1, L2, L3
mstr <- sstr/df
mse <- sse/(n - k)
f <- mstr/mse; f
## [1] 23.5747
pval <- pf(f, k - 1, n - k, lower.tail = FALSE)

stenosis_level <- c(k - 1, sstr, mstr, f, pval)
residual <- c(n - k, sse, mse, NA , NA)
summary_mat <- rbind(stenosis_level, residual)
colnames(summary_mat) <- c("Df", "Sum sq", "Mean sq", "F value", "Pr(>F)")
knitr::kable(summary_mat)
Df Sum sq Mean sq F value Pr(>F)
stenosis_level 2 204.0202 102.010097 23.5747 5e-07
residual 32 138.4672 4.327101 NA NA
# base R function
df <- data.frame(flowrate = c(L1, L2, L3), 
                 stenosis_level = factor(rep(c("L1", "L2", "L3"), 
                                             times = c(length(L1), length(L2), length(L3)))))
fit <- aov(flowrate ~ stenosis_level, data = df)
anova(fit)
## Analysis of Variance Table
## 
## Response: flowrate
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## stenosis_level  2 204.02 102.010  23.575 5.096e-07 ***
## Residuals      32 138.47   4.327                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Extra - Power test example

p <- seq(0.40, 0.8, 0.05)
beta <- c(0.886, 0.758, 0.581, 0.385, 0.213, 0.094, 0.031, 0.007, 0.001)
power <- c(0.114, 0.242, 0.419, 0.615, 0.787, 0.906, 0.969, 0.993, 0.999)
tbl <- cbind(Probability = p, Beta = beta, Power = power)
knitr::kable(tbl)
Probability Beta Power
0.40 0.886 0.114
0.45 0.758 0.242
0.50 0.581 0.419
0.55 0.385 0.615
0.60 0.213 0.787
0.65 0.094 0.906
0.70 0.031 0.969
0.75 0.007 0.993
0.80 0.001 0.999
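The test behind this particular table is not spelled out here, but the same kind of \(\beta\)/power curve can be traced for the Jones poll example above (\(n = 15\), \(H_0: p = 0.5\) vs \(H_A: p < 0.5\), \(RR = \{y \leq 5\}\)), where power is \(P(Y \leq 5 \mid p)\) at each true \(p\):

```r
# Power curve for the earlier Jones poll test: n = 15, RR = {y <= 5}
p_true <- seq(0.10, 0.50, 0.05)
power  <- pbinom(5, size = 15, prob = p_true)  # P(reject H0 | true p)
beta   <- 1 - power
round(cbind(p = p_true, beta = beta, power = power), 3)
plot(p_true, power, type = "b", xlab = "true p", ylab = "power")
```

At \(p = 0.5\) the power equals \(\alpha = 0.151\), and at \(p = 0.3\) it equals \(1 - \beta = 0.722\), matching the values computed earlier.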

Performance Analysis (s29)

We will cover this topic in the classification part

ROC Analysis (s30)

We will cover this topic in the classification part

Multiple-Testing Problem (s31 ~ s32)

Recall the error table, where \(\alpha = P(Type \ \ I \ \ Error)\) and \(\beta = P(Type \ \ II \ \ Error)\)

In other words, if we perform one hypothesis test, the probability of at least one false positive equals \(\alpha\).

What is the probability of at least one false positive if we conduct \(m\) independent hypothesis tests? It is given as below.

\[P(Making \ \ an \ \ error) = \alpha\] \[P(Not \ \ making \ \ an \ \ error) = 1 - \alpha\] \[P(Not \ \ making \ \ an \ \ error \ \ in \ \ m \ \ test) = (1 - \alpha)^m\] \[P(making \ \ at \ \ least \ \ 1 \ \ error \ \ in \ \ m \ \ test) = 1 - (1 - \alpha)^m\]

m <- 0:100
a <- 0.05

plot(m, 1 - (1 - a)^m, type = 'l', ylab = "P(At least 1 false positive)")

Let’s assume we are testing \(H^1, H^2, ..., H^m\), where \(m_0\) is the number of true null hypotheses and \(R\) is the number of rejected hypotheses. Based on this notation, we can summarize the outcomes in the table below.

                    Called significant   Not called significant   Total
Null true           \(V\)                \(m_0-V\)                \(m_0\)
Alternative true    \(S\)                \(m_1-S\)                \(m_1\)
Total               \(R\)                \(m-R\)                  \(m\)

In the table, \(V\) is the number of false positives and \(S\) the number of true positives; that is, \(V\) counts the Type I errors and \(m_1-S\) counts the Type II errors.

When people say they are “adjusting p-values for the number of hypothesis tests performed,” they mean controlling the Type I error rate. There are many ways to do this, and it is an active research area, but here we list some well-known methods.
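A small simulation makes the quantities \(V\), \(S\), and \(R\) concrete (a sketch; the choices of \(m\), \(m_0\), and the shift of 3 under the alternative are arbitrary):

```r
set.seed(1)
m  <- 1000   # total hypotheses tested
m0 <- 900    # true nulls
alpha <- 0.05

# p-values: Uniform(0,1) under the null; upper-tail z-test p-values under a mean shift of 3
p <- c(runif(m0),
       pnorm(rnorm(m - m0, mean = 3), lower.tail = FALSE))
is_null <- rep(c(TRUE, FALSE), times = c(m0, m - m0))

reject <- p <= alpha
V <- sum(reject & is_null)    # false positives (Type I errors)
S <- sum(reject & !is_null)   # true positives
R <- sum(reject)              # total rejections
c(V = V, S = S, R = R)        # V is typically near alpha * m0 = 45
```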

 

Different Approaches to control Type I Errors

  • Per comparison error rate (PCER): the expected value of the number of Type I errors over the number of hypotheses, \(PCER = \frac{E(V)}{m}\)

  • Per-family error rate (PFER): the expected number of Type I errors, \(PFER = E(V)\)

  • Family-wise error rate (FWER): the probability of at least one type I error, \(FWER = P(V\geq1)\)

  • False discovery rate (FDR): the expected proportion of Type I errors among the rejected hypotheses, \(FDR = E(\frac{V}{R}|R>0) P(R>0)\)

  • Positive false discovery rate (pFDR): the rate that discoveries are false, \(pFDR = E(\frac{V}{R}|R>0)\)

 

Controlling FWER

There are two general types of FWER corrections:

  1. Single step: equivalent adjustments made to each p-value (Bonferroni)

  2. Sequential: adaptive adjustment made to each p-value (Holm’s Method)

 

Bonferroni

Bonferroni is a very simple method for ensuring that the overall Type I error rate is kept at \(\alpha\) when performing \(m\) independent hypothesis tests. The procedure is as follows: reject any hypothesis with \(p\text{-value} \leq \frac{\alpha}{m}\), i.e., use the adjusted p-values \(\tilde p_j = \min[m \cdot p_j, 1]\). For example, to keep an experiment-wide Type I error rate of 0.05 when performing 10,000 hypothesis tests, we’d need a p-value of \(\frac {0.05}{10000} = 5 \times 10^{-6}\) to declare significance (equivalently, \(\tilde p_j = \min[10000 \cdot p_j, 1]\)). It is the strictest FWER correction method, and it carries a high probability of Type II errors.

p <- c(0.0008, 0.009, 0.165, 0.205, 0.396, 0.450, 0.641, 0.781, 0.900, 0.993)

# Bonferroni
m <- length(p)
pmin(1, m * p)
##  [1] 0.008 0.090 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
p.adjust(p, method = "bonferroni")
##  [1] 0.008 0.090 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
# cf pmin, pmax, cummin, and cummax
min(1:5, pi)
## [1] 1
pmin(1:5, pi)
## [1] 1.000000 2.000000 3.000000 3.141593 3.141593
max(1:5, pi)
## [1] 5
pmax(1:5, pi)
## [1] 3.141593 3.141593 3.141593 4.000000 5.000000
cummin(1:5)
## [1] 1 1 1 1 1
cummax(1:5)
## [1] 1 2 3 4 5

 

Holm’s Method

Holm’s method is the simplest sequential FWER correction method. The procedure is as follows. Order the unadjusted p-values such that \(p_1 \leq p_2 \leq ... \leq p_m\). For control of the FWER at level \(\alpha\), the step-down Holm adjusted p-values are \(\tilde p_j = \min[(m-j+1) \cdot p_j, 1]\) (taking the running maximum so the adjusted values stay non-decreasing), where \(j\) is the rank of the p-value by degree of significance. The point is that we don’t multiply every \(p_j\) by the same factor \(m\). For example, when \(m = 10000\): \(\tilde p_1 = 10000 \cdot p_1\), \(\tilde p_2 = 9999 \cdot p_2\), …, \(\tilde p_m = 1 \cdot p_m\)

p <- c(0.0008, 0.009, 0.165, 0.205, 0.396, 0.450, 0.641, 0.781, 0.900, 0.993)

# Holm's procedure
m <- length(p)
j <- seq_len(m)
o <- order(p)
ro <- order(o)
pmin(1, cummax((m - j + 1L) * p[o]))[ro]
##  [1] 0.008 0.081 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
# base R
p.adjust(p, method = "holm")
##  [1] 0.008 0.081 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

 

Pitfalls of controlling FWER

FWER control is appropriate when you want to guard against ANY false positives. However, in many cases, especially when the number of tests is large, it becomes too strict to draw any meaningful conclusion. In these cases, the more relevant quantity to control is the false discovery rate (FDR), which is designed to control the proportion of false positives among the set of rejected hypotheses \(R\). In terms of our table, \(FDR = \frac{V}{R}\) while \(FPR = \frac{V}{m_0}\) (FPR: false positive rate).
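A toy numeric contrast between the two rates (the counts below are hypothetical): with many true nulls, a modest number of false positives can dominate the rejection list even while the FPR looks tiny.

```r
V  <- 10     # false positives among the rejections (hypothetical)
R  <- 100    # total rejections (hypothetical)
m0 <- 9000   # true null hypotheses (hypothetical)

FDR <- V / R;  FDR   # 10% of the "discoveries" are false
## [1] 0.1
FPR <- V / m0; FPR   # yet only ~0.11% of the true nulls were rejected
## [1] 0.001111111
```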

 

Benjamini and Hochberg (BH) FDR

To control FDR at level \(\delta\): first, order the unadjusted p-values \(p_1 \leq p_2 \leq ... \leq p_m\). Then find the test with the highest rank \(j\) for which \(p_j \leq \frac{j}{m}\delta\), and declare the tests of rank \(1, 2, ..., j\) significant; that is, reject where \(p_{(j)} \leq \delta \frac{j}{m}\) (equivalently \(p_{(j)}\frac{m}{j} \leq \delta\)).

p <- c(0.0008, 0.009, 0.165, 0.205, 0.396, 0.450, 0.641, 0.781, 0.900, 0.993)

# BH procedure
m <- length(p)
j <- m:1L
o <- order(p, decreasing = TRUE)
ro <- order(o)
pmin(1, cummin(m/j * p[o]))[ro]
##  [1] 0.0080000 0.0450000 0.5125000 0.5125000 0.7500000 0.7500000 0.9157143
##  [8] 0.9762500 0.9930000 0.9930000
p.adjust(p, method = "BH")
##  [1] 0.0080000 0.0450000 0.5125000 0.5125000 0.7500000 0.7500000 0.9157143
##  [8] 0.9762500 0.9930000 0.9930000

 

Benjamini and Yekutieli (BY) FDR

BY correction is very similar to BH correction and is appropriate when one suspects strong correlations among the tests. Simply replace \(\frac{m}{j}\) by \(q\frac{m}{j}\), where \(q = \sum_{i=1}^{m} \frac{1}{i}\).

p <- c(0.0008, 0.009, 0.165, 0.205, 0.396, 0.450, 0.641, 0.781, 0.900, 0.993)

# BY procedure
m <- length(p)
j <- m:1L
o <- order(p, decreasing = TRUE)
ro <- order(o)
q <- sum(1L/(1L:m))
pmin(1, cummin(q * m/j * p[o]))[ro]
##  [1] 0.02343175 0.13180357 1.00000000 1.00000000 1.00000000 1.00000000
##  [7] 1.00000000 1.00000000 1.00000000 1.00000000
p.adjust(p, method = "BY")
##  [1] 0.02343175 0.13180357 1.00000000 1.00000000 1.00000000 1.00000000
##  [7] 1.00000000 1.00000000 1.00000000 1.00000000