Out of \(n = 46\) machine breakdowns, \(x_1 = 9\) are attributable to electrical problems, \(x_2 = 24\) to mechanical problems, and \(x_3 = 13\) to operator misuse. It is suggested that the probabilities of these three kinds of breakdown are respectively \[p_1^* = 0.2, \ p_2^* = 0.5, \ p_3^*=0.3\] Examine the plausibility of this suggestion with a chi-square goodness-of-fit test.
Consider the null hypothesis \(H_0 : p_1 = 0.2, p_2 = 0.5, p_3 = 0.3\)
Under this null hypothesis the expected cell frequencies are
\(e_1 = np_{1}^* = 46 \times 0.2 = 9.2\)
\(e_2 = np_{2}^*= 46 \times 0.5 = 23.0\)
\(e_3 = np_{3}^*= 46 \times 0.3 = 13.8\)
Pearson chi-square statistic: \(\chi^2 = \frac{(9.0 - 9.2)^2}{9.2} + \frac{(24.0 - 23.0)^2}{23.0} + \frac{(13.0 - 13.8)^2}{13.8} = 0.0942\)
Likelihood ratio chi-square statistic: \(G^2 = 2 \times \bigg(9.0 \times \ln \bigg(\frac{9.0}{9.2}\bigg) + 24.0 \times \ln \bigg(\frac{24.0}{23.0}\bigg) + 13.0 \times \ln \bigg(\frac{13.0}{13.8}\bigg) \bigg) = 0.0945\)
which, compared with a chi-square distribution with \(k-1 = 3 - 1 = 2\) degrees of freedom (which is an exponential distribution with mean 2), gives a p-value of \[\text{p-value} \approx P(\chi_2^2 \geq 0.094) = 0.95\]
x1 <- 9; x2 <- 24; x3 <- 13
p1 <- 0.2; p2 <- 0.5; p3 <- 0.3
df <- 2
n <- sum(x1, x2, x3)
e1 <- n * p1
e2 <- n * p2
e3 <- n * p3
chi_square <- ((x1 - e1)^2/e1) + ((x2 - e2)^2/e2) + ((x3 - e3)^2/e3); chi_square
## [1] 0.0942029
G_square <- 2 * (x1 * log(x1/e1) + x2 * log(x2/e2) + x3 * log(x3/e3)); G_square
## [1] 0.09454107
pchisq(chi_square, df, lower.tail = FALSE)
## [1] 0.9539906
pexp(chi_square, rate = 1/2, lower.tail = FALSE) # exponential distribution with mean = 2
## [1] 0.9539906
pchisq(G_square, df, lower.tail = FALSE)
## [1] 0.9538293
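As a cross-check on the manual calculation above, base R can run the same goodness-of-fit test in one call: `chisq.test` takes the hypothesized probabilities through its `p` argument (it reports the Pearson statistic only, not \(G^2\)):

```r
# Pearson goodness-of-fit test in one call; p passes the hypothesized probabilities
obs <- c(9, 24, 13)
res <- chisq.test(obs, p = c(0.2, 0.5, 0.3))
res  # X-squared = 0.0942, df = 2, p-value = 0.954
```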
The numbers of accidents per week were recorded for 50 weeks. How can we construct a test of size \(\alpha = 0.05\) for the hypothesis that the number of accidents follows a Poisson distribution?
y | Frequency |
---|---|
0 | 32 |
1 | 12 |
2 | 6 |
3 or more | 0 |
\(p(y|\lambda) = \frac{\lambda^y e^{-\lambda}}{y!}, \ \ \ \ \ \ \ y = 0, 1, 2, ...\)
Answer
Goodness-of-fit test for \(H_0\): the observed frequencies are consistent with the expected frequencies computed from a Poisson distribution.
Expected values: \(E(n_i) = np_i\). The \(p_i\) require \(\lambda\), so we use its MLE, \(\hat \lambda = \bar Y = \frac{12 \times 1 + 6 \times 2}{50} = 0.48\)
\(p_1 = P(Y=0) = e^{-\lambda}, \ \ \hat p_1 = e^{-0.48} = 0.619\)
\(p_2 = P(Y=1) = \lambda e^{-\lambda}, \ \ \hat p_2 = 0.48 \times e^{-0.48} = 0.297\)
\(p_3 = P(Y \geq 2) = 1 - e^{-\lambda} - \lambda e^{-\lambda}, \ \ \hat p_3 = 1 - \hat p_1 - \hat p_2 = 0.084\)
\(\hat E(n_1) = n \hat p_1 = 50 \times 0.619 = 30.95\)
\(\hat E(n_2) = n \hat p_2 = 50 \times 0.297 = 14.85\)
\(\hat E(n_3) = n \hat p_3 = 50 \times 0.084 = 4.20\)
\(\chi^2 = \sum_{i=1}^3 \frac{[n_i - \hat E(n_i)]^2}{\hat E(n_i)}\)
\(\chi^2 = \frac{(32-30.95)^2}{30.95} + \frac{(12-14.85)^2}{14.85} + \frac{(6-4.20)^2}{4.20} = 1.354\)
\(\chi^2_{0.05} = 3.841\), with 1 df.
The degrees of freedom are 1 rather than 2 because one additional degree of freedom is lost by estimating \(\lambda\) from the data.
alpha <- 0.05
y <- c(0, 1, 2, 3)
freq <- c(32, 12, 6, 0)
n <- sum(freq)
lambda_hat <- sum(y*freq)/n
# assume Poisson distribution
p1 <- exp(-lambda_hat)
p2 <- lambda_hat * exp(-lambda_hat)
p3 <- 1 - p1 - p2 # 2 or more weeks
e1 <- n * p1; e2 <- n * p2; e3 <- n * p3
chi_square <- ((freq[1] - e1)^2/e1) + ((freq[2] - e2)^2/e2) + ((freq[3] - e3)^2/e3)
# df = k - 1 - 1 = 1 (one more df is lost by estimating lambda)
qchisq(0.05, df = 1, lower.tail = FALSE)
## [1] 3.841459
# pvalue
pchisq(chi_square, df = 1, lower.tail = FALSE)
## [1] 0.2462138
Since \(\chi^2_{0.05} = 3.841\) with df = 1 and \(P(\chi^2 > 1.354) \approx 0.245\), we fail to reject the null hypothesis at \(\alpha = 0.05\) because the p-value > 0.05. Therefore, the Poisson distribution can be used as a model for the number of accidents per week.
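As a cross-check (a sketch, not part of the original solution): base R's `chisq.test` can compute the Pearson statistic for the fitted cell probabilities, but it does not know that \(\lambda\) was estimated, so its reported df (= 2) and p-value are wrong here. We reuse its statistic with the corrected df = 1:

```r
# chisq.test does not account for the estimated lambda, so we only take its
# statistic and recompute the p-value with the corrected df = 1
lambda_hat <- 0.48
p_hat <- c(exp(-lambda_hat), lambda_hat * exp(-lambda_hat))
p_hat <- c(p_hat, 1 - sum(p_hat))  # P(Y = 0), P(Y = 1), P(Y >= 2)
# warning suppressed: one expected count (4.2) is below 5
res <- suppressWarnings(chisq.test(c(32, 12, 6), p = p_hat))
pchisq(unname(res$statistic), df = 1, lower.tail = FALSE)  # 0.246, matching the manual result
```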
Three drugs are compared with respect to the types of allergic reaction that they cause to patients. A group of \(n = 300\) patients is randomly split into three groups of 100 patients, each of which is given one of the three drugs. The patients are then categorized as being hyperallergic, allergic, mildly allergic, or as having no allergy.
The contingency table is shown below.
 | Hyperallergic | Allergic | Mildly allergic | No allergy |
---|---|---|---|---|
Drug A | 11 | 30 | 36 | 23 |
Drug B | 8 | 31 | 25 | 36 |
Drug C | 13 | 28 | 28 | 31 |
Test for independence between the type of drug and the type of reaction.
# Manual calculation - calculation of expected value
# data
data <- matrix(c(11, 30, 36, 23, 8, 31, 25, 36, 13, 28, 28, 31), nrow = 3, byrow = TRUE)
rownames(data) <- c("Drug A", "Drug B", "Drug C")
colnames(data) <- c("Hyperallergic", "Allergic", "Mildly allergic", "No allergy")
# contingency table
dt <- cbind(data, rowSums(data))
dt <- rbind(dt, colSums(dt))
# define result dataframe
expDat <- matrix(rep(0,12), nrow = 3)
# calculate expected counts
for (i in 1:4) {
expDat[,i] <- dt[1:3, 5]*dt[4,i]/dt[4,5]
}
# Print expected counts
expDat
## [,1] [,2] [,3] [,4]
## [1,] 10.66667 29.66667 29.66667 30
## [2,] 10.66667 29.66667 29.66667 30
## [3,] 10.66667 29.66667 29.66667 30
# calculate chisq statistic
chisq.res <- sum((data - expDat)^2/expDat); chisq.res
## [1] 6.391245
# p-value
pchisq(chisq.res, df = (3 - 1) * (4 - 1), lower.tail = FALSE) # df = (r-1)(c-1) = 6
## [1] 0.3808181
# using base R package
chisq <- chisq.test(data)
chisq
##
## Pearson's Chi-squared test
##
## data: data
## X-squared = 6.3912, df = 6, p-value = 0.3808
chisq$expected
## Hyperallergic Allergic Mildly allergic No allergy
## Drug A 10.66667 29.66667 29.66667 30
## Drug B 10.66667 29.66667 29.66667 30
## Drug C 10.66667 29.66667 29.66667 30
The null hypothesis of independence between the type of drug and the type of reaction can be interpreted as meaning that the chances of the various kinds of allergic reaction are the same for each of the three drugs. This means that there is a set of probability values \(p_1, p_2, p_3\), and \(p_4\) that represent the probabilities of getting the four types of reaction regardless of which drug is used. If the null hypothesis of independence is rejected, then this implies that there is evidence to conclude that the three drugs have different sets of probability values for the four types of reaction. This means that the three types of drug cannot be considered to be equivalent in terms of the reactions that they cause. (Actually, two of the three drugs could still be equivalent, but all three cannot be the same)
\(p-value = 0.3808 > 0.05\), implies that the null hypothesis of independence is plausible, and so there is no evidence to conclude that the three drugs are any different with respect to the types of allergic reaction that they cause.
Simpson's paradox occurs when a trend that appears in segmented groups of data disappears or reverses when the groups are combined. Using the iris data set, we investigate this phenomenon as an example.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# correlation for all species
cor(iris[-5])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
# correlation for each species
species_all <- unique(iris$Species)
for (species in species_all) {
df <- iris %>% filter(Species == species)
message(species)
print(cor(df[-5]))
}
## setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 0.7425467 0.2671758 0.2780984
## Sepal.Width 0.7425467 1.0000000 0.1777000 0.2327520
## Petal.Length 0.2671758 0.1777000 1.0000000 0.3316300
## Petal.Width 0.2780984 0.2327520 0.3316300 1.0000000
## versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 0.5259107 0.7540490 0.5464611
## Sepal.Width 0.5259107 1.0000000 0.5605221 0.6639987
## Petal.Length 0.7540490 0.5605221 1.0000000 0.7866681
## Petal.Width 0.5464611 0.6639987 0.7866681 1.0000000
## virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 0.4572278 0.8642247 0.2811077
## Sepal.Width 0.4572278 1.0000000 0.4010446 0.5377280
## Petal.Length 0.8642247 0.4010446 1.0000000 0.3221082
## Petal.Width 0.2811077 0.5377280 0.3221082 1.0000000
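The reversal can be seen in a single pair of numbers pulled from the correlation matrices above: the overall correlation between `Sepal.Width` and `Petal.Length` is negative, yet it is positive within every species:

```r
# overall correlation between Sepal.Width and Petal.Length: negative
overall <- with(iris, cor(Sepal.Width, Petal.Length))
overall  # -0.428
# within-species correlations: all positive
by_species <- sapply(split(iris, iris$Species),
                     function(d) cor(d$Sepal.Width, d$Petal.Length))
by_species  # setosa 0.178, versicolor 0.561, virginica 0.401
```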
# all species
iris %>% ggplot(aes(x = Sepal.Width , y = Petal.Length)) +
geom_point(color = 'blue') +
ggtitle('All species') +
theme(axis.title = element_text(size = 14),
plot.title = element_text(size = 16, colour = "purple"),
axis.text = element_text(size = 14)) +
geom_smooth(method = "lm")
# each species
iris %>%
ggplot(aes(x = Sepal.Width , y = Petal.Length, color = Species)) +
geom_point() +
ggtitle('Each species') +
geom_smooth(method = 'lm') +
theme(axis.title = element_text(size = 14),
plot.title = element_text(size = 16,colour = "purple"),
axis.text = element_text(size = 14))
# each species
iris %>%
mutate(group = ifelse(Species %in% c("versicolor", 'virginica'),
'versicolor and virginica','setosa')) %>%
ggplot(aes(x = Sepal.Width , y = Petal.Length, color = group)) +
geom_point() +
ggtitle('grouping 2') +
theme(axis.title = element_text(size = 14),
plot.title = element_text(size = 16,colour = "purple"),
axis.text = element_text(size = 14)) +
geom_smooth(method = "lm")
In petroleum processing, changes in the conditions of crude oil and natural gas may cause dissolved waxes to crystallize and to therefore affect the operation of processing equipment. If the waxes can be made to crystallize in a structure that is less likely to form a deposit, or if the waxes can be prevented from crystallizing altogether, then a problem that may be of serious concern to the petroleum industry can be mitigated. It is therefore important to study the crystallization properties of the dissolved waxes.
An experiment is performed whereby a solution of waxes is expanded rapidly through a nozzle, as shown in Figure below, so that the wax particles can be collected on a plate. Scanning electron microscopy is then used to photograph the samples of wax particles so that their diameters can be measured. A question of importance is how the pre-expansion temperature of the solution affects the particle sizes.
In the first experiment a pre-expansion temperature of \(80^\circ C\) is used, and in a second experiment a pre-expansion temperature of \(170^\circ C\) is used. In either case 36 samples of particles are obtained, and the means of the particle diameters within each sample are calculated as shown below.
## [1] "Pre-expansion temperature of 80"
9.00 | 9.50 | 9.75 | 10.50 | 10.50 | 11.50 | 12.00 | 12.25 | 12.75 |
13.25 | 14.25 | 14.75 | 15.25 | 17.00 | 18.25 | 18.25 | 18.50 | 18.75 |
19.25 | 19.75 | 20.25 | 20.75 | 22.00 | 22.75 | 22.75 | 23.00 | 23.00 |
23.25 | 26.00 | 26.00 | 29.00 | 31.75 | 32.50 | 33.00 | 36.50 | 38.00 |
n | mean | sd | max | upper quartile | median | lower quartile | min |
---|---|---|---|---|---|---|---|
36 | 19.875 | 7.881828 | 38 | 23.0625 | 19 | 13.125 | 9 |
## [1] "Pre-expansion temperature of 170"
4.00 | 4.17 | 4.75 | 4.83 | 5.17 | 5.33 | 5.50 | 5.67 | 5.67 |
5.67 | 6.17 | 6.17 | 6.67 | 6.67 | 7.33 | 7.60 | 7.67 | 7.67 |
7.75 | 8.17 | 8.50 | 8.50 | 8.67 | 8.67 | 8.67 | 9.17 | 9.17 |
9.33 | 9.44 | 9.67 | 10.00 | 10.00 | 10.25 | 10.67 | 10.83 | 12.16 |
n | mean | sd | max | upper quartile | median | lower quartile | min |
---|---|---|---|---|---|---|---|
36 | 7.675833 | 2.091119 | 12.16 | 9.21 | 7.71 | 5.67 | 4 |
How does the pre-expansion temperature affect the particle sizes?
We can test the effect of pre-expansion temperature with the Kolmogorov-Smirnov statistic.
Formally, the value of the Kolmogorov-Smirnov statistic is
\[M = \frac{30}{36} = 0.8333\]
This value is the maximum vertical distance between the two empirical cumulative distribution functions \(\hat F_A(x)\) and \(\hat F_B(x)\), which occurs when \(10.25 < x < 10.50\), in which case
\[\hat F_A(x) = \frac {3}{36} = 0.0833, \quad \hat F_B(x) = \frac{33}{36} = 0.9167, \quad \text{and} \quad M = 0.8333\]
and also when \(10.83 < x < 11.50\), in which case
\[\hat F_A(x) = \frac {5}{36} = 0.1389, \quad \hat F_B(x) = \frac{35}{36} = 0.9722, \quad \text{and} \quad M = 0.8333\]
A size \(\alpha = 0.01\) critical point is
\[d_{0.01}\sqrt{\frac{1}{n} + \frac{1}{m}} = 1.63 \times \sqrt{\frac{1}{36} + \frac{1}{36}} = 0.38\]
The value \(M = 0.83\) is much larger than this critical point, which indicates that there is very strong evidence that the distributions of the particle sizes are different at the two pre-expansion temperatures.
In general, to test for a difference between two samples (A, B) using the maximum distance between their empirical cumulative distribution functions, take the null hypothesis
\[H_0: F_{A}(x) = F_{B}(x)\]
and use the test statistic \(M\), the largest discrepancy between the two empirical CDFs:
\[M = \max|\hat{F}_{A}(x) - \hat{F}_{B}(x)|\]
where \(\hat{F}_{A}(x) = \frac{\#\{x_{i} \leq x\}}{n}\) and \(\hat{F}_{B}(x) = \frac{\#\{y_{i} \leq x\}}{m}\), with \(n\) and \(m\) the sizes of the two samples.
A <- c(9, 9.5, 9.75, 10.5, 10.5, 11.5, 12, 12.25, 12.75,
13.25, 14.25, 14.75, 15.25, 17, 18.25, 18.25, 18.5, 18.75,
19.25, 19.75, 20.25, 20.75, 22, 22.75, 22.75, 23, 23,
23.25, 26, 26, 29, 31.75, 32.5, 33, 36.5, 38)
B <- c(4, 4.17, 4.75, 4.83, 5.17, 5.33, 5.5, 5.67, 5.67,
5.67, 6.17, 6.17, 6.67, 6.67, 7.33, 7.6, 7.67, 7.67,
7.75, 8.17, 8.5, 8.5, 8.67, 8.67, 8.67, 9.17, 9.17,
9.33, 9.44, 9.67, 10, 10, 10.25, 10.67, 10.83, 12.16)
da <- 1.63 # from Hayter p699 (for n > 40, a = 0.01)
cv <- 1.63 * sqrt(1/length(A) + 1/length(B))
plot.ecdf(A, col = "blue")
plot.ecdf(B, add = TRUE, col = "red")
abline(v = mean(c(10.25, 10.5)), col = "purple", lty = 2)
abline(v = mean(c(10.83, 11.5)), col = "purple", lty = 2)
abline(h = ecdf(A)(10.5) + cv, col = "black", lty = 2) # critical value
abline(h = ecdf(A)(10.5), col = "red", lty = 2) # M
abline(h = ecdf(B)(11.5), col = "red", lty = 2)
# using base R
ks.test(A, B)
## Warning in ks.test(A, B): cannot compute exact p-value with ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: A and B
## D = 0.83333, p-value = 2.778e-11
## alternative hypothesis: two-sided
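The statistic can also be computed directly, without `ks.test`, by evaluating both empirical CDFs at every pooled observation (the supremum of the absolute difference is attained at one of these jump points). The vectors `A` and `B` are repeated here so the snippet stands alone:

```r
A <- c(9, 9.5, 9.75, 10.5, 10.5, 11.5, 12, 12.25, 12.75,
       13.25, 14.25, 14.75, 15.25, 17, 18.25, 18.25, 18.5, 18.75,
       19.25, 19.75, 20.25, 20.75, 22, 22.75, 22.75, 23, 23,
       23.25, 26, 26, 29, 31.75, 32.5, 33, 36.5, 38)
B <- c(4, 4.17, 4.75, 4.83, 5.17, 5.33, 5.5, 5.67, 5.67,
       5.67, 6.17, 6.17, 6.67, 6.67, 7.33, 7.6, 7.67, 7.67,
       7.75, 8.17, 8.5, 8.5, 8.67, 8.67, 8.67, 9.17, 9.17,
       9.33, 9.44, 9.67, 10, 10, 10.25, 10.67, 10.83, 12.16)
# M = max |F_A(x) - F_B(x)|, evaluated at every pooled observation
z <- sort(c(A, B))
M <- max(abs(ecdf(A)(z) - ecdf(B)(z)))
M  # 0.8333, matching D from ks.test
```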
A general two-sample shift model
Many times, an experimenter takes observations from two populations with the objective of testing whether the populations have the same distribution. For example, if independent random samples \(X_1, X_2, ..., X_{n_1}\) and \(Y_1, Y_2, ..., Y_{n_2}\) are taken from normal populations with equal variances and respective means \(\mu_X\) and \(\mu_Y\), the experimenter may wish to test \(H_0: \mu_X - \mu_Y = 0\) versus \(H_a: \mu_X - \mu_Y \neq 0\). In this case, if \(H_0\) is true, both populations are normally distributed with the same mean and the same variance; that is, the population distributions are identical. If \(H_a\) is true, then \(\mu_X \neq \mu_Y\) and the distributions of \(X_1\) and \(Y_1\) are the same, except that the location parameter (\(\mu_Y\)) for \(Y_1\) differs from the location parameter (\(\mu_X\)) for \(X_1\). Hence, the distribution of \(Y_1\) is shifted to the right or left of the distribution of \(X_1\).
mu <- 0; sd <- 1; theta <- 5
mu_y1 <- 0 + theta
mu_y2 <- 0 - theta
range <- seq(-10, 10, 0.01)
x1 <- dnorm(range, mu, sd)
y1 <- dnorm(range, mu_y1, sd)
y2 <- dnorm(range, mu_y2, sd)
plot(range, x1, type = 'l', col = "blue", ylab = "f(x)", xlab = "x")
lines(range, y1, col = "red")
lines(range, y2, col = "green")
This is an example of a two-sample parametric shift (or location) model. The model is parametric because the distributions are specified (normal) except for the values of the parameters \(\mu_X, \mu_Y\), and \(\sigma^2\). The amount that the distribution of \(Y_1\) is shifted to the right or left of the distribution of \(X_1\) is \(\mu_Y - \mu_X\).
Now, let \(X_1, X_2, ..., X_{n_1}\) be a random sample from a population with distribution function \(F(x)\) and let \(Y_1, Y_2, ..., Y_{n_2}\) be a random sample from a population with distribution function \(G(y)\). If we wish to test whether the two populations have the same distribution - that is, \(H_0: F(z) = G(z)\) versus \(H_a: F(z) \neq G(z)\), with the actual form of \(F(z)\) and \(G(z)\) unspecified - a nonparametric method is required.
Notice that \(H_a\) is a very broad hypothesis. Many times, an experimenter may wish to consider the more specific alternative hypothesis that \(Y_1\) has the same distribution as \(X_1\) shifted by an (unknown) amount \(\theta\) - that is, that the distributions differ in location. Then \(G(y) = P(Y_1 \leq y) = P(X_1 \leq y-\theta) = F(y-\theta)\) for some unknown parameter value \(\theta\). Notice that the particular form of \(F(x)\) remains unspecified.
In summary, we refer to the two-sample shift (location) model when we assume that \(X_1, X_2, ..., X_{n_1}\) constitute a random sample from distribution function \(F(x)\) and that \(Y_1, Y_2, ..., Y_{n_2}\) constitute a random sample from distribution function \(G(y) = F(y-\theta)\) for some unknown value \(\theta\). For the two-sample shift model, \(H_0:F(z) = G(z)\) is equivalent to \(H_0: \theta =0\). If \(\theta\) is greater (less) than 0, then the distribution of the \(Y\)-values is located to the right (left) of the distribution of the \(X\)-values.
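A quick numerical check of the shift relation \(G(y) = F(y-\theta)\), using a normal \(F\) purely as an illustrative assumption:

```r
# under the shift model, Y = X + theta, so P(Y <= y) = P(X <= y - theta)
theta <- 2
y <- seq(-4, 8, by = 0.5)
G <- pnorm(y, mean = theta, sd = 1)              # distribution function of Y
F_shifted <- pnorm(y - theta, mean = 0, sd = 1)  # F(y - theta)
all.equal(G, F_shifted)  # TRUE
```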
The Sign test for a matched-pairs experiment
Suppose that we have n pairs of observations of the form \((X_i , Y_i)\) and that we wish to test the hypothesis that the distribution of the \(X\)’s is the same as that of the \(Y\)’s versus the alternative that the distributions differ in location. We let \(D_i = X_i - Y_i\). One of the simplest nonparametric tests is based on the signs of these differences and, reasonably enough, is called the sign test. Under the null hypothesis that \(X_i\) and \(Y_i\) come from the same continuous probability distributions, the probability that \(D_i\) is positive is equal to \(\frac{1}{2}\) (as is the probability that \(D_i\) is negative). Let \(S\) denote the total number of positive (or negative) differences. Then if the variables \(X_i\) and \(Y_i\) have the same distribution, \(S\) has a binomial distribution with \(p = \frac{1}{2}\), and the rejection region for a test based on \(S\) can be obtained by using the binomial probability distribution.
mu_x <- 0; sd <- 1; theta <- 1; n <- 100
mu_y <- mu_x + theta
S_equal <- S_nonequal <- vector(length = n)
# sample S n times
for (i in 1:n) {
x1 <- rnorm(n, mu_x, sd)
x2 <- rnorm(n, mu_x, sd)
y1 <- rnorm(n, mu_y, sd)
S_equal[i] <- sum(x1 - x2 > 0) # D = x1 - x2
S_nonequal[i] <- sum(x1 - y1 > 0) # D = x1 - y1
}
# estimate mu of S
mu_S_equal <- mean(S_equal)
mu_S_nonequal <- mean(S_nonequal)
# plot S on the binomial distribution with probability = 1/2
plot(0:n, dbinom(0:n, size = n, prob = 0.5), type = 'l', ylab = "f(x)", xlab = "x")
abline(v = mu_S_equal)
abline(v = mu_S_nonequal, col = "blue")
abline(v = qbinom(0.01, n, prob = 0.5), col = "red") # critical value a = 0.01
# pvalue
pbinom(mu_S_equal, n, prob = 0.5) # drawn from the same distribution
## [1] 0.4602054
pbinom(mu_S_nonequal, n, prob = 0.5) # drawn from the same distribution with different location parameter (mu)
## [1] 2.75679e-08
# cf) z approximation, see later part of this tutorial
z1 <- (2 * mu_S_equal - n) / sqrt(n)
z2 <- (2 * mu_S_nonequal - n) / sqrt(n)
pnorm(z1)
## [1] 0.4832493
pnorm(z2)
## [1] 6.821188e-08
qnorm(0.01) # critical value
## [1] -2.326348
# plot S on the normal distribution with mu_x and sd
plot(seq(-10,10,0.01), dnorm(seq(-10,10,0.01), mu_x, sd), type = 'l', ylab = "f(x)", xlab = "x")
abline(v = z1)
abline(v = z2, col = "blue")
abline(v = qnorm(0.01, mu_x, sd), col = "red") # critical value a = 0.01
The sign test is summarized as follows.
Let \(p = P(X > Y)\).
Null hypothesis: \(H_0: p = \frac{1}{2}\) (equivalently, the median of the \(D_i\) is 0)
Alternative hypothesis: \(H_a: p > \frac{1}{2}\) (or \(p < \frac{1}{2}\), or \(p \neq \frac{1}{2}\)); equivalently, the median of the \(D_i\) is not 0
Test statistic: \(S\) = the number of positive differences, where \(D_i = X_i - Y_i\)
Rejection region:
For \(H_a: p > \frac{1}{2}\), reject \(H_0\) for the largest values of S
For \(H_a: p < \frac{1}{2}\), reject \(H_0\) for the smallest values of S
For \(H_a: p \neq \frac{1}{2}\), reject \(H_0\) for very large or very small values of S
Assumptions: The pairs \((X_i, Y_i)\) are randomly and independently selected.
Example
The number of defective electrical fuses produced by each of two production lines, \(A\) and \(B\),was recorded daily for a period of 10 days, with the results shown in Table below.
Day | A | B |
---|---|---|
1 | 172 | 201 |
2 | 165 | 179 |
3 | 206 | 159 |
4 | 184 | 192 |
5 | 174 | 177 |
6 | 142 | 170 |
7 | 190 | 182 |
8 | 169 | 179 |
9 | 161 | 169 |
10 | 200 | 210 |
Assume that both production lines produced the same daily output. Compare the number of defectives produced by \(A\) and \(B\) each day and let \(S\) equal the number of days when \(A\) exceeded \(B\). Do the data present sufficient evidence to indicate that either production line produces more defectives than the other? State the null hypothesis to be tested and use \(S\) as a test statistic.
Solution \(H_0:\) The two distributions of defectives are identical
Formally, let \(d_i = x_i - y_i\) for \(i = 1, \dots, n\), where the \(d_i\) are assumed to be i.i.d.
\(H_0\): the median of the \(d_i\) is \(0\), i.e., \(P(D_i > 0) = P(X_i > Y_i) = 0.5\)
Test statistic: \(S = \#\{d_i > 0\}\), the number of pairs with \(x_i > y_i\) (pairs with \(x_i = y_i\) are not counted)
Rejection region: extreme values of \(S\) (very large or very small), chosen so that the p-value of the observed \(S\) is below \(\alpha\) (accumulating probability from the most extreme values inward). If there are ties (\(d_i = 0\)), we drop them from the calculation and reduce the sample size \(n\) accordingly.
Let \(S\) be the number of days that the observed number of defectives for production line A exceeds that for line B. From the data, \(S = 2\). The p-value is given as \(2 \times P(S \leq 2)\) (two-sided test). Under the null hypothesis, \(S\) has a binomial distribution with \(n = 10, p = 0.5\), so the p-value is given as \[\text{p-value} = 2 P(S \leq 2) = 2 \times \sum_{x=0}^2 {10 \choose{x}} 0.5^{10} = 0.11\] Therefore, 0.11 is the smallest value of \(\alpha\) for which the null hypothesis can be rejected.
# the test result of A and B
Day <- 1:10
A <- c(172, 165, 206, 184, 174, 142, 190, 169, 161, 200)
B <- c(201, 179, 159, 192, 177, 170, 182, 179, 169, 210)
diff <- A - B
S <- sum(diff > 0); S
## [1] 2
# Direct calculation of rejection region (calculation of pvalue)
RR1 <- choose(10,0) * 0.5^10 + choose(10,10) * 0.5^10 # S = 0
RR2 <- choose(10,0) * 0.5^10 + choose(10,1) * 0.5^10 + # S = 1
choose(10,9) * 0.5^10 + choose(10,10) * 0.5^10
RR3 <- choose(10,0) * 0.5^10 + choose(10,1) * 0.5^10 + choose(10,2) * 0.5^10 + # S = 2
choose(10,8) * 0.5^10 + choose(10,9) * 0.5^10 + choose(10,10) * 0.5^10
RR1; RR2; RR3
## [1] 0.001953125
## [1] 0.02148438
## [1] 0.109375
# Using binomial distribution (calculation of pvalue)
2 * pbinom(2, size = length(A), prob = 0.5)
## [1] 0.109375
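The same two-sided p-value also comes out of base R's exact binomial test, since \(S\) is binomial with \(p = \frac{1}{2}\) under \(H_0\):

```r
# S = 2 positive differences out of n = 10 pairs; exact two-sided p-value
p_val <- binom.test(2, 10, p = 0.5)$p.value
p_val  # 0.109375
```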
# R package
library(BSDA)
## Loading required package: lattice
##
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
##
## Orange
SIGN.test(A,B)
##
## Dependent-samples Sign-Test
##
## data: A and B
## S = 2, p-value = 0.1094
## alternative hypothesis: true median difference is not equal to 0
## 95 percent confidence interval:
## -23.457778 4.431111
## sample estimates:
## median of x-y
## -9
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8906 -14.0000 -3.0000
## Interpolated CI 0.9500 -23.4578 4.4311
## Upper Achieved CI 0.9785 -28.0000 8.0000
One problem that may arise in connection with a sign test is that the observations associated with one or more pairs may be equal and therefore may result in ties. When this situation occurs, delete the tied pairs and reduce \(n\), the total number of pairs. You will also encounter situations where \(n\), the number of pairs, is large. Then, the values of \(\alpha\) associated with the sign test can be approximated by using the normal approximation to the binomial probability distribution. You can verify (by comparing exact probabilities with their approximations) that these approximations will be quite adequate for \(n\) as small as 10 or 15. This result is due to the symmetry of the binomial probability distribution for \(p = 0.5\). For \(n \geq 25\), the \(Z\) test will suffice, where \[Z = \frac{S - np}{\sqrt {npq}} = \frac{S - \frac{n}{2}}{\frac{1}{2}\sqrt n} = \frac{2S-n}{\sqrt n}\]
Day <- 1:10
A <- c(172, 165, 206, 184, 174, 142, 190, 169, 161, 200)
B <- c(201, 179, 159, 192, 177, 170, 182, 179, 169, 210)
diff <- A - B
S <- sum(diff > 0); S
## [1] 2
z <- (2 * S - length(diff)) / sqrt(length(diff))
pnorm(z) # inexact approximation because of too small n
## [1] 0.02888979
Due to oven-to-oven variation, a matched-pairs experiment was used to test for differences in cakes prepared using mix A and mix B. Two cakes, one prepared using each mix, were baked in each of six different ovens (a total of 12 cakes). Test the hypothesis that there is no difference in the population distributions of cake densities using the two mixes. What can be said about the attained significance level?
A | B | Diff | Abs_Diff | Rank_of_Abs_Diff |
---|---|---|---|---|
0.135 | 0.129 | 0.006 | 0.006 | 3.0 |
0.102 | 0.120 | -0.018 | 0.018 | 5.0 |
0.108 | 0.112 | -0.004 | 0.004 | 1.5 |
0.141 | 0.152 | -0.011 | 0.011 | 4.0 |
0.131 | 0.135 | -0.004 | 0.004 | 1.5 |
0.144 | 0.163 | -0.019 | 0.019 | 6.0 |
As with our other nonparametric tests, the null hypothesis to be tested is that the two population frequency distributions of cake densities are identical. The alternative hypothesis is that the distributions differ in location, which implies that a two-tailed test is required.
Because the amount of data is small, we will conduct our test by using \(\alpha = 0.10\). The critical value of \(W\) for a two-tailed test, \(\alpha = 0.10\), is \(W_0 = 2\). Hence, we will reject \(H_0\) if \(W \leq 2\). There is only one positive difference, and that difference has rank 3; therefore, \(W^+ = 3\). Because \(W^+ + W^- = \frac {n(n+1)}{2}\), \(W^- = 21-3 = 18\) and the observed value of \(W\) is \(min(3,18) = 3\). Notice that 3 exceeds the critical value of \(W\), implying that there is insufficient evidence to indicate a difference in the two population distributions of cake densities. Because we cannot reject \(H_0\) for \(\alpha = 0.10\), we can only say that p-value > 0.10.
a <- 0.10
A <- c(0.135, 0.102, 0.108, 0.141, 0.131, 0.144)
B <- c(0.129, 0.120, 0.112, 0.152, 0.135, 0.163)
n <- length(A)
# direct calculation
dt <- data.frame(A, B, diff = A - B, abs_diff = abs(A - B), rank = rank(abs(A - B)))
W_critical <- qsignrank(0.1, length(A))/2; W_critical
## [1] 2
W_total <- n*(n + 1)/2
identical(W_total, sum(dt$rank)) # double check
## [1] TRUE
W_pos <- sum(dt[dt$diff > 0, ]$rank) # sum of ranks of the positive differences
W_neg <- W_total - W_pos
W <- min(W_pos, W_neg); W
## [1] 3
W > W_critical
## [1] TRUE
sum(dt$rank[dt$rank > W_critical])
## [1] 18
message("Since W > W_critical, we cannot reject the null hypothesis with a = 0.10")
## Since W > W_critical, we cannot reject the null hypothesis with a = 0.10
# using base R
wilcox.test(A, B, paired = TRUE, conf.level = 0.90)
## Warning in wilcox.test.default(A, B, paired = TRUE, conf.level = 0.9):
## cannot compute exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: A and B
## V = 3, p-value = 0.1411
## alternative hypothesis: true location shift is not equal to 0
# cf) distribution of the Wilcoxon signed rank statistic
par(mfrow = c(2,2))
for (n in c(4:5,10,40)) {
x <- seq(0, n*(n + 1)/2, length = 501)
plot(x, dsignrank(x, n = n)/2, type = "l",
main = paste0("dsignrank(x, n = ", n, ")"))
}
# visualize A and B
plot.ecdf(A - B) # using base R
ggplot(NULL, aes(x = A - B)) + # using ggplot
geom_step(stat = "ecdf") +
labs(x = "difference in cake densities", y = "F(x)")
For the large-sample case (\(n > 25\)), we can use a \(z\) statistic, as summarized below.
Null hypothesis: \(H_0\): The population relative frequency distributions for the \(X\)’s and \(Y\)’s are identical.
Alternative hypothesis: (1) \(H_a\): the two population relative frequency distributions differ in location (a two-tailed test); or (2) \(H_a\): the population relative frequency distribution for the \(X\)’s is shifted to the right (or left) of the relative frequency distribution of the \(Y\)’s (one-tailed tests).
Test statistic: \(Z = \frac{W^+ - [\frac{n(n+1)}{4}]}{\sqrt {\frac {n(n+1)(2n+1)}{24}}}\)
Rejection region: - Reject \(H_0\) if \(z \geq z_{\alpha/2}\) or \(z \leq -z_{\alpha/2}\) for a two-tailed test. - To detect a shift in the distributions of the \(X\)’s to the right of the \(Y\)’s, reject \(H_0\) when \(z \geq z_{\alpha}\). - To detect a shift in the opposite direction, reject \(H_0\) if \(z \leq -z_{\alpha}\).
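A sketch of this large-sample test on simulated paired data (the sample size, seed, and shift below are arbitrary assumptions for illustration, not from the original example):

```r
set.seed(1)                        # arbitrary seed for reproducibility
n <- 30                            # n > 25, so the z approximation applies
X <- rnorm(n, mean = 0.5, sd = 1)  # X's shifted to the right of the Y's
Y <- rnorm(n, mean = 0, sd = 1)
d <- X - Y
W_pos <- sum(rank(abs(d))[d > 0])  # W+: sum of ranks of the positive differences
z <- (W_pos - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
p_val <- 2 * pnorm(-abs(z))        # two-tailed p-value
```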
A statistical test for comparing two populations based on independent random samples, the rank-sum test, was proposed by Frank Wilcoxon in 1945. Again, we assume that we are interested in testing whether the two populations have the same distribution versus the shift (or location) alternative. Suppose that you were to select independent random samples of \(n_1\) and \(n_2\) observations from populations I and II, respectively. Wilcoxon’s idea was to combine the \(n_1 + n_2 = n\) observations and rank them, in order of magnitude, from 1 (the smallest) to \(n\) (the largest). If two or more observations are tied for the same rank, the average of the ranks that would have been assigned to these observations is assigned to each member of the tied group. If the observations were selected from identical populations, the rank sums for the samples should be more or less proportional to the sample sizes \(n_1\) and \(n_2\). For example, if \(n_1\) and \(n_2\) were equal, you would expect the rank sums to be nearly equal. In contrast, if the observations in one population, say population I, tended to be larger than those in population II, the observations in sample I would tend to receive the highest ranks and sample I would have a larger than expected rank sum. Thus (sample sizes being equal), if one rank sum is very large (and, correspondingly, the other is very small), it may indicate a statistically significant difference between the locations of the two populations.
Example
The bacteria counts per unit volume are shown below for two types of cultures, I and II. Four observations were made for each culture. Let \(n_1\) and \(n_2\) represent the number of observations in samples I and II, respectively. The corresponding ranks are shown below as well.
 | I | II |
---|---|---|
batch 1 | 27 | 32 |
batch 2 | 31 | 29 |
batch 3 | 26 | 35 |
batch 4 | 25 | 28 |
 | I | II |
---|---|---|
batch 1 | 3 | 7 |
batch 2 | 6 | 5 |
batch 3 | 2 | 8 |
batch 4 | 1 | 4 |
Rank Sum | 12 | 24 |
Do these data present sufficient evidence to indicate a difference in the locations of the population distributions for cultures I and II?
Solution
Let \(W\) equal the rank sum for sample I (for this sample, \(W = 12\)). Certainly, very small or very large values of \(W\) provide evidence to indicate a difference between the locations of the two population distributions; hence \(W\), the rank sum, can be employed as a test statistic. The rejection region for a given test is obtained in the same manner as for the sign test. We start by selecting the most contradictory values of \(W\) as the rejection region and add to these until \(\alpha\) is of acceptable size.
The minimum rank sum includes the ranks \(1, 2, 3, 4\), for which \(W = 10\). Similarly, the maximum includes the ranks \(5, 6, 7, 8\), for which \(W = 26\). Therefore, we include these two values of \(W\) in the rejection region. What is the corresponding value of \(\alpha\)?
If the populations are identical, every permutation of the eight ranks represents a sample point and is equally likely. Then, \(\alpha\) is the sum of the probabilities of the sample points (arrangements) that imply \(W = 10\) or \(W = 26\). The total number of permutations of the eight ranks is 8!. The number of different arrangements of the ranks 1, 2, 3, 4 in sample I with the 5, 6, 7, 8 of sample II is 4! × 4!. Similarly, the number of arrangements that place the maximum value of \(W\) in sample I (ranks 5, 6, 7, 8) is 4!×4!. Then, the probability that \(W = 10\) or \(W = 26\) is
\[p(10) + p(26) = 2 \times \frac{4! \times 4!}{8!} = \frac {2}{{8 \choose 4}} = \frac {1}{35} = 0.029\]
If this value of \(\alpha\) is too small, the rejection region can be enlarged to include the next smallest and next largest rank sums, \(W = 11\) and \(W = 25\). The rank sum \(W = 11\) includes the ranks 1, 2, 3, 5, and the rank sum \(W=25\) includes the ranks 4, 6, 7, 8 and
\[p(11) = p(25) = \frac{4! \times 4!}{8!} = \frac{1}{70}\]
Then,
\[\alpha = p(10) + p(11) + p(25) + p(26) = \frac{2}{35} = 0.057\]
Proceeding in the same way, the next larger rejection region adds \(W = 12\) and \(W = 24\), giving
\[\alpha = p(10) + p(11) + p(12) + p(24) + p(25) + p(26) = \frac{4}{35} = 0.114\]
This value of \(\alpha\) might be considered too large for practical purposes. Hence, we are better satisfied with the rejection region \(W = 10, 11, 25\), and \(26\). The rank sum for the sample, \(W = 12\), does not fall in this preferred rejection region, so we do not have sufficient evidence to reject the hypothesis that the population distributions of bacteria counts for the two cultures are identical.
I <- c(27, 31, 26, 25)
II <- c(32, 29, 35, 28)
vec <- c(I, II)
rank_I <- rank(vec)[1:4]
rank_II <- rank(vec)[5:8]
# a1 = P(W = 10, 26)
a1 <- 2 * 1/(choose(8,4))
# a2 = P(W = 10,26,11,25)
a2 <- 4 * 1/(choose(8,4))
# a3 = P(W = 10,26,11,25,12,24)
a3 <- 8 * 1/(choose(8,4))
a1; a2; a3
## [1] 0.02857143
## [1] 0.05714286
## [1] 0.1142857
# base R package
wilcox.test(II, I)
##
## Wilcoxon rank sum test
##
## data: II and I
## W = 14, p-value = 0.1143
## alternative hypothesis: true location shift is not equal to 0
The Mann-Whitney statistic \(U\) is obtained by ordering all \((n_1 + n_2)\) observations according to their magnitude and counting the number of observations in sample I that precede each observation in sample II. The statistic \(U\) is the sum of these counts. We denote the observations in sample I as \(x_1, x_2, . . . , x_{n_1}\) and the observations in sample II as \(y_1, y_2, . . . , y_{n_2}\). For example, the eight ordered observations from the previous example are shown below.
value | label | rank |
---|---|---|
25 | \(x_4\) | 1 |
26 | \(x_3\) | 2 |
27 | \(x_1\) | 3 |
28 | \(y_1\) | 4 |
29 | \(y_2\) | 5 |
31 | \(x_2\) | 6 |
32 | \(y_3\) | 7 |
35 | \(y_4\) | 8 |
The smallest \(y\) observation is \(y_1 = 28\), and \(u_1 = 3\) \(x\)’s precede it. Similarly, \(u_2 = 3\) \(x\)’s precede \(y_2 = 29\), while \(u_3 = 4\) and \(u_4 = 4\) \(x\)’s precede \(y_3 = 32\) and \(y_4 = 35\), respectively. Then, \[U = u_1 + u_2 + u_3 + u_4 = 3 + 3 + 4 + 4 = 14\] Very large or very small values of \(U\) imply a separation of the ordered \(x\)’s and \(y\)’s and thus provide evidence to indicate a difference (a shift of location) between the distributions of populations I and II. The Mann-Whitney U statistic is related to Wilcoxon’s rank sum as shown below. \[U = n_1n_2 + \frac{n_1(n_1 +1)}{2} -W\] where \(n_1 =\) number of observations in sample I, \(n_2 =\) number of observations in sample II, \(W =\) rank sum for sample I.
To see how to locate the rejection region for the Mann-Whitney U test, suppose that \(n_1 = 4\) and \(n_2 = 5\). Then, you would consult the Table in lecture slide 26 (the one corresponding to \(n_2 = 5\)). Notice that the table is constructed assuming that \(n_1 \leq n_2\). That is, you must always identify the smaller sample as sample I. From the table we see, for example, \(P(U \leq 2) = 0.0317\) and \(P(U \leq 3) = 0.0556\). So if you want to conduct a lower-tail Mann-Whitney U test with \(n_1 = 4\) and \(n_2 = 5\) for \(\alpha\) near 0.05, you should reject the null hypothesis of equality of population relative frequency distributions when \(U \leq 3\). The probability of a type I error for the test is \(\alpha = 0.0556\). If the test is two-tailed, the p-value is \(2 \times 0.0556 = 0.1112\).
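When the table is not at hand, these exact null probabilities can be reproduced by brute-force enumeration, since under \(H_0\) every assignment of ranks to sample I is equally likely. A sketch for \(n_1 = 4\), \(n_2 = 5\):

```r
# Enumerate the exact null distribution of U for n1 = 4, n2 = 5
n1 <- 4; n2 <- 5; n <- n1 + n2
combos <- combn(n, n1)                 # each column: one possible set of ranks for sample I
W <- colSums(combos)                   # rank sum of sample I for each assignment
U <- n1 * n2 + n1 * (n1 + 1) / 2 - W   # corresponding Mann-Whitney U values
mean(U <= 2)   # matches the tabled P(U <= 2) = 0.0317
mean(U <= 3)   # matches the tabled P(U <= 3) = 0.0556
```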
Summary of the Mann-Whitney U Test
Population I is the population from which the smaller sample was taken.
Null hypothesis: \(H_0\): The distributions of populations I and II are identical.
Alternative hypothesis: \(H_a\): The distributions of populations I and II differ in location (two-tailed), or the distribution of population I is shifted to the right (or to the left) of the distribution of population II (one-tailed).
Test statistic: \(U = n_1n_2 + \frac{n_1(n_1 +1)}{2} -W\)
Rejection region: For a two-tailed test, reject \(H_0\) if \(U \leq U_0\) or \(U \geq n_1n_2 - U_0\), where \(P(U \leq U_0) = \alpha/2\); for a one-tailed test, reject \(H_0\) if \(U \leq U_0\) (or \(U \geq n_1n_2 - U_0\), depending on the direction), where \(P(U \leq U_0) = \alpha\).
Assumptions: Samples have been randomly and independently selected from their respective populations. Ties in the observations can be handled by averaging the ranks that would have been assigned to the tied observations and assigning this average rank to each. Thus, if three observations are tied and are due to receive ranks 3, 4, and 5, we assign rank 4 to all three.
I <- c(27, 31, 26, 25)
II <- c(32, 29, 35, 28)
n1 <- length(I)
n2 <- length(II)
a <- 0.1
vec <- c(I, II)
rank_I <- rank(vec)[1:4]
rank_II <- rank(vec)[5:8]
W <- sum(rank_I)
# Calculate U by sum(ui)
dt <- data.frame(x = vec, label = c(rep("I", n1), rep("II", n2)), rank = rank(vec))
u <- vector(length = n1)
for (i in 1:n1) {
u[i] <- sum(dt[dt$label == "I", ]$rank < sort(dt[dt$label == "II", ]$rank)[i])
}
u
## [1] 3 3 4 4
U <- sum(u); U
## [1] 14
# Relation to wilcoxon-test
U_by_formula <- (n1 * n2) + (n1 * (n1 + 1)/2) - W
U == U_by_formula
## [1] TRUE
# Rejection region (two-sided): P(U <= 2) = 0.0571 for n1 = n2 = 4, from the table on slide 26
p <- 0.0571
pval <- 2 * p # two-tailed p-value, 0.1142
# using package
library(coin)
## Loading required package: survival
##
## Attaching package: 'coin'
## The following object is masked _by_ '.GlobalEnv':
##
## alpha
wilcox_test(dt$x ~ factor(dt$label), distribution = "exact")
##
## Exact Wilcoxon-Mann-Whitney Test
##
## data: dt$x by factor(dt$label) (I, II)
## Z = -1.7321, p-value = 0.1143
## alternative hypothesis: true mu is not equal to 0
In a previous lecture, we presented an analysis of variance (ANOVA) procedure to compare the means of \(k\) populations. The resultant \(F\)-test was based on the assumption that independent random samples were taken from normal populations with equal variances. That is, we were interested in testing whether all the populations had the same distribution versus the alternative that the populations differed in location. A key element in the development of the procedure was the quantity identified as the sum of squares for treatments, SSTr. The larger the value of SSTr, the greater will be the weight of evidence favoring rejection of the null hypothesis that the means are all equal. Here we present a nonparametric technique to test whether the populations differ in location. Like other nonparametric methods, the Kruskal-Wallis procedure requires no assumptions about the actual form of the probability distributions.
We assume that independent random samples have been drawn from \(k\) populations that differ only in location. However, we need not assume that these populations possess normal distributions. For complete generality, we permit the sample sizes to be unequal, and we let \(n_i\) , for \(i = 1, 2, . . . , k\), represent the size of the sample drawn from the \(i_{th}\) population. Then combine all the \(n_1 + n_2 + ... + n_k = n\) observations and rank them from 1 (the smallest) to \(n\) (the largest). Ties are treated as in previous sections. Let \(R_i\) denote the sum of the ranks of the observations from population \(i\) and let \(\bar R_i = \frac{R_i}{n_i}\) denote the corresponding average of the ranks. Let \(\bar R\) denote the overall average of all of the ranks, and consider the rank analogue of SSTr, which is computed by using the ranks rather than the actual values of the measurements:
\[V = \sum_{i=1}^k n_i(\bar R_i - \bar R)^2\] Here \(V\) denotes the variation of the ranks between the groups (populations). If the null hypothesis is true and the populations do not differ in location, we would expect the \(\bar R_i\) values to be approximately equal and the resulting value of \(V\) to be relatively small. If the alternative hypothesis is true, we would expect this to be exhibited in differences among the \(\bar R_i\) values, leading to a large value for \(V\). Thus,
\[V = \sum_{i=1}^k n_i \bigg(\bar R_i - \frac{n+1}{2}\bigg)^2\]
where
\[\bar R = \frac{sum \ of \ the \ first \ n \ integers}{n} = \frac{\bigg [\frac{n(n+1)}{2} \bigg]}{n} = \frac {n+1}{2}\] Instead of focusing on \(V\), Kruskal and Wallis (1952) considered the statistic \(H = \frac{12V}{n(n + 1)}\), which may be rewritten as
\[H = \frac{12V}{n(n + 1)} = \frac{12}{n(n+1)}\sum_{i=1}^k \frac{R_i^2}{n_i} - 3(n+1)\] As previously noted, the null hypothesis of equal locations is rejected in favor of the alternative that the populations differ in location if the value of \(H\) is large. Thus, the corresponding \(\alpha\)-level test calls for rejection of the null hypothesis in favor of the alternative if \(H > h(\alpha)\), where \(h(\alpha)\) is such that, when \(H_0\) is true, \(P[H > h(\alpha)] = \alpha\).
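As a quick numerical check (with toy simulated data, assumed only for illustration), the computational form of \(H\) agrees with the definition through \(V\):

```r
# Verify numerically that 12V/(n(n+1)) equals 12/(n(n+1)) * sum(Ri^2/ni) - 3(n+1)
set.seed(1)                               # assumed seed, for reproducibility
x <- list(g1 = rnorm(5), g2 = rnorm(6), g3 = rnorm(7))  # toy groups
n <- sum(lengths(x))
r <- rank(unlist(x))                      # ranks of all n observations
idx <- rep(seq_along(x), lengths(x))      # group membership
Ri <- tapply(r, idx, sum)                 # rank sum per group
ni <- lengths(x)
V <- sum(ni * (Ri / ni - (n + 1) / 2)^2)  # definition of V
H1 <- 12 * V / (n * (n + 1))              # H via V
H2 <- 12 / (n * (n + 1)) * sum(Ri^2 / ni) - 3 * (n + 1)  # computational form
isTRUE(all.equal(H1, H2))                 # TRUE
```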
If the underlying distributions are continuous and if there are no ties among the \(n\) observations, the null distribution of \(H\) can (tediously) be found by using the methods of probability. We can find the distribution of \(H\) for any values of \(k\) and \(n_1, n_2, ... , n_k\) by calculating the value of \(H\) for each of the \(n!\) equally likely permutations of the ranks of the \(n\) observations. These calculations have been performed and tables developed for some relatively small values of \(k\) and for \(n_1, n_2, ..., n_k\) [see, for example, Table A.12 of Hollander and Wolfe (1999)]
Kruskal and Wallis showed that if the \(n_i\) values are “large” the null distribution of \(H\) can be approximated by a \(\chi^2\) distribution with \(k-1\) degrees of freedom (df). This approximation is generally accepted to be adequate if each of the \(n_i\) values is greater than or equal to 5. If you wish to use the Kruskal-Wallis analysis for smaller data sets, where this large-sample approximation is not adequate, refer to Hollander and Wolfe (1999) to obtain the appropriate critical values.
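The adequacy of the chi-square approximation can be examined by simulation; a sketch with assumed sample sizes \(n_i = 8\):

```r
# Simulate the null distribution of H (continuous data, so no ties) and
# compare its upper tail with the chi-square approximation
set.seed(3)                               # assumed seed, for reproducibility
k <- 3; ni <- c(8, 8, 8); n <- sum(ni)
H_sim <- replicate(5000, {
  r <- sample(n)                          # a random permutation of the ranks under H0
  Ri <- tapply(r, rep(1:k, ni), sum)
  12 / (n * (n + 1)) * sum(Ri^2 / ni) - 3 * (n + 1)
})
mean(H_sim > qchisq(0.95, k - 1))         # should be close to the nominal 0.05
```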
In summary, the large-sample Kruskal-Wallis procedure is as follows.
Null hypothesis: \(H_0\): The \(k\) population distributions are identical.
Alternative hypothesis: \(H_a\): At least two of the population distributions differ in location.
Test statistic: \(H = \frac{12}{n(n + 1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(n+1)\), where
\(n_i=\) number of measurements in the sample from population \(i\),
\(R_i=\) rank sum for sample \(i\), where the rank of each measurement is computed according to its relative size in the overall set of \(n = n_1 + n_2 + ... + n_k\) observations formed by combining the data from all k samples.
Rejection region: Reject \(H_0\) if \(H > \chi^2_{\alpha}\) with \((k-1)\) df.
Assumptions: The \(k\) samples are randomly and independently drawn. There are five or more measurements in each sample.
Example
A quality control engineer has selected independent samples from the output of three assembly lines in an electronics plant. For each line, the output of ten randomly selected hours of production was examined for defects. Do the data in the table below provide evidence that the probability distributions of the number of defects per hour of output differ in location for at least two of the lines? Use \(\alpha = 0.05\). Also give the p-value associated with the test.
Line 1 | Line 2 | Line 3 |
---|---|---|
6 | 34 | 13 |
38 | 28 | 35 |
3 | 42 | 19 |
17 | 13 | 4 |
11 | 40 | 29 |
30 | 31 | 0 |
15 | 9 | 7 |
16 | 32 | 33 |
25 | 39 | 18 |
5 | 27 | 24 |
Solution
In this case, \(n_1 = n_2 = n_3 = 10\) and \(n=30\). Thus,
\[H = \frac{12}{30(31)} \bigg[\frac{(120)^2}{10} + \frac{(210.5)^2}{10} + \frac{(134.5)^2}{10}\bigg] - 3(31) = 6.097\] Because all the \(n_i\) values are greater than or equal to 5, we may use the approximation for the null distribution of \(H\) and reject the null hypothesis of equal locations if \(H > \chi^2_{\alpha}\) based on \(k - 1 = 2\) df. From the table, \(\chi^2_{0.05} = 5.991\). Thus, we reject the null hypothesis at the \(\alpha = 0.05\) level and conclude that at least one of the three lines tends to produce a greater number of defects than the others.
According to the \(\chi^2\) table, the value of \(H = 6.097\) leads to rejection of the null hypothesis if \(\alpha = 0.05\) but not if \(\alpha = 0.025\). Thus, \(0.025 < \text{p-value} < 0.05\). You can find the approximate p-value \(= P(\chi^2_2 > 6.097) = 0.0474\) in R directly.
L1 <- c(6, 38, 3, 17, 11, 30, 15, 16, 25, 5)
L2 <- c(34, 28, 42, 13, 40, 31, 9, 32, 39, 27)
L3 <- c(13, 35, 19, 4, 29, 0, 7, 33, 18, 24)
L <- c(L1, L2, L3)
R1 <- rank(L)[1:length(L1)]
R2 <- rank(L)[length(L1) + 1:length(L2)]
R3 <- rank(L)[length(L1) + length(L2) + 1:length(L3)]
sum(R1); sum(R2); sum(R3)
## [1] 120
## [1] 210.5
## [1] 134.5
H <- (12/(length(L)*(length(L) + 1))) *
(((sum(R1))^2/length(L1)) + ((sum(R2))^2/length(L2)) + ((sum(R3))^2/length(L3))) -
3 * (length(L) + 1)
# significant H value
qchisq(0.95, df = 2)
## [1] 5.991465
# p-value
pchisq(H, df = 2, lower.tail = FALSE)
## [1] 0.04742007
# base R package
data <- data.frame(L, group = c(rep("L1", length(L1)),
rep("L2", length(L2)), rep("L3", length(L3))))
kruskal.test(L ~ group, data)
##
## Kruskal-Wallis rank sum test
##
## data: L by group
## Kruskal-Wallis chi-squared = 6.0988, df = 2, p-value = 0.04739
The Pearson correlation coefficient (r) measures linear dependence between two variables. It is known as a parametric correlation test because it depends on the distribution of the data; it can be used properly when \(x\) and \(y\) are normally distributed. Spearman’s rho (\(\rho\)) and Kendall’s tau (\(\tau\)) are rank-based correlation coefficients. Because they do not assume any particular distribution for the data, they are regarded as non-parametric correlation tests. Here we briefly review the formula for each correlation coefficient and see how to calculate it with R.
Pearson correlation coefficient (PCC, r)
If we have one dataset \(\{x_1,...,x_n\}\) containing \(n\) values and another dataset \(\{y_1,...,y_n\}\) also containing \(n\) values, the formula for the PCC (\(r\)) is:
\[r = \frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2\sum_{i=1}^n(y_i-\bar y)^2}}\] where \(n\) is the sample size, \(x_i, y_i\) are the individual sample points indexed by \(i\), \(\bar x = \frac{1}{n}\sum_{i=1}^n x_i\) (the sample mean), and analogously for \(\bar y\).
For pairs from an uncorrelated bivariate normal distribution, the sampling distribution of a certain function of Pearson’s correlation coefficient follows Student’s t-distribution with \(n-2\) degrees of freedom. Therefore, the p-value can be determined by using the correlation coefficient table for \(df = n-2\), or by calculating the t value as \[t = r\sqrt\frac{n-2}{1-r^2}\] or, for the critical value of \(r\),
\[r = \frac{t}{\sqrt {n-2+t^2}}\]
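As a quick sanity check (with arbitrary illustrative values), the two formulas are inverses of each other, and the definitional formula for \(r\) agrees with R’s `cor()`:

```r
# Round trip between r and t (illustrative values)
r <- 0.3; n <- 25
t <- r * sqrt((n - 2) / (1 - r^2))
t / sqrt(n - 2 + t^2)                   # recovers r = 0.3

# r from the definition versus cor()
set.seed(42)                            # assumed seed, for reproducibility
x <- rnorm(20); y <- rnorm(20)
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
isTRUE(all.equal(r_manual, cor(x, y)))  # TRUE
```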
x <- rnorm(40,0,1)
y <- rnorm(40,0,1)
n <- length(x)
r <- cor(x,y)
t <- r * sqrt((n - 2)/(1 - r^2))
p <- 2 * pt(abs(t), n - 2, lower.tail = FALSE) # two-sided p-value
r; t; round(p, 4)
## [1] 0.1689353
## [1] 1.056573
## [1] 0.2974
# base R
cor.test(x, y) # default "pearson"
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 1.0566, df = 38, p-value = 0.2974
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1504937 0.4564253
## sample estimates:
## cor
## 0.1689353
Note that the null hypothesis of the correlation test is \(H_0: r = 0\) and the alternative hypothesis is \(H_A: r \neq 0\), \(H_A: r < 0\), or \(H_A: r > 0\).
Spearman rank correlation coefficient (SCC, \(\rho\))
The Spearman rank correlation coefficient is computed in the same way as the PCC, but on the ranks of the variables.
\[\rho = \frac{\sum_{i=1}^n(x^{'}_i-\bar {x^{'}})(y^{'}_i-\bar {y^{'}})}{\sqrt{\sum_{i=1}^n(x^{'}_i-\bar {x^{'}})^2 \sum_{i=1}^n(y^{'}_i-\bar {y^{'}})^2}}\]
where \(x_i^{'} = rank(x_i)\) and \(y_i^{'} = rank(y_i)\). You can use the same method as for the PCC to find the p-value of the SCC.
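This equivalence is easy to verify on simulated data (seed assumed for reproducibility):

```r
# Spearman's rho equals Pearson's r computed on the ranks
set.seed(1)                            # assumed seed, for reproducibility
x <- rnorm(15); y <- rnorm(15)
isTRUE(all.equal(cor(rank(x), rank(y)),
                 cor(x, y, method = "spearman")))  # TRUE
```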
x <- rnorm(40,0,1)
y <- rnorm(40,0,1)
n <- length(x)
rho <- cor(x,y, method = "spearman")
t <- rho * sqrt((n - 2)/(1 - rho^2))
p <- 2 * pt(-abs(t), n - 2) # two-sided p-value
rho; t; p
## [1] -0.08893058
## [1] -0.5503856
## [1] 0.585276
# base R
cor.test(x, y, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 11608, p-value = 0.5841
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.08893058
Kendall rank correlation coefficient (KRCC, \(\tau\))
The Kendall rank correlation coefficient can be calculated from the correspondence between the rankings of \(x\) and \(y\). The total number of possible pairings of \(x\) and \(y\) observations is \(\frac{n(n-1)}{2}\). The definition of the Kendall \(\tau\) coefficient is:
\[\tau = \frac{(number \ \ of \ \ concordant \ \ pairs) - (number \ \ of \ \ discordant \ \ pairs)}{\frac{n(n-1)}{2}}\] or more formally
\[\tau = \frac{2}{n(n-1)}\sum_{i<j}sgn(x_i - x_j)sgn(y_i-y_j)\] For testing the significance of Kendall’s \(\tau\) statistic, based on a large-sample normal approximation, the test statistic \(Z\) is computed as: \[Z = 3 \times \tau \sqrt{\frac{n(n-1)}{2(2n+5)}}\] For more information about Kendall’s rank correlation coefficient, see The Advanced Theory of Statistics by Kendall and Stuart (1979).
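The sign-based definition can be checked against `cor()` on simulated data (with continuous data there are no ties, so the two agree exactly):

```r
# Kendall's tau from the definition: compare every pair (i, j) with i < j
set.seed(2)                            # assumed seed, for reproducibility
x <- rnorm(12); y <- rnorm(12)
n <- length(x)
s <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    s <- s + sign(x[i] - x[j]) * sign(y[i] - y[j])
  }
}
tau_manual <- 2 * s / (n * (n - 1))
isTRUE(all.equal(tau_manual, cor(x, y, method = "kendall")))  # TRUE
```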
x <- rnorm(40,0,1)
y <- rnorm(40,0,1)
n <- length(x)
tau <- cor(x,y, method = "kendall")
z <- 3 * tau * sqrt(n * (n - 1)) / sqrt(2 * (2*n + 5))
p <- 2 * pnorm(-abs(z)) # two-sided p-value
tau; z; p
## [1] -0.1282051
## [1] -1.165103
## [1] 0.2439771
# base R
cor.test(x, y, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: x and y
## T = 340, p-value = 0.2505
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.1282051
References: