Statistical Inference

Statistics of Sampling Distribution

Introduction of statistical inference and random sample
Statistics of sampling distribution from normal & non-normal population
Statistics of sampling distribution of two samples

Point estimation

Desired properties of estimator
Methods of point estimation

Interval estimation

Confidence intervals of typical parameters
The effect of sample size

Tutorial

Practice of Normally Distributed Sample Statistics (s7)

Suppose a sample with n=20 whose underlying distribution is assumed normal. What is the probability that the value of \(\hat \mu = \bar X\) lies within \(\frac{\sigma}{4}\) of the true mean \(\mu\)? How does it change when the sample size becomes n=40? If the sample variance \(S^2\) is 16, what are the probabilities that the value of \(\hat \mu = \bar X\) lies within \(\mu \pm 1\) for those two sample sizes?

The effect of sample size,

\(P(\mu - \frac{\sigma}{4} \leq \bar X \leq \mu + \frac{\sigma}{4}) \sim N(\mu, \frac{\sigma^2}{n})\)

\(Z(\bar X) = \frac{\bar X-\mu}{\sigma/\sqrt {n}}\)

\(n=20, P(\mu - \frac{\sigma}{4} \leq \bar X \leq \mu + \frac{\sigma}{4}) = P(-\frac{\sqrt{20}}{4} \leq Z(\bar X) \leq +\frac{\sqrt{20}}{4}) = \phi(1.118) - \phi(-1.118) = 0.7364\)

from the table, we get \(\phi(1.12) - \phi(-1.12) = 0.8686 - 0.1314 = 0.7372\), which is the value shown in the lecture slide.

\(n=40, P(\mu - \frac{\sigma}{4} \leq \bar X \leq \mu + \frac{\sigma}{4}) = P(-\frac{\sqrt{40}}{4} \leq Z(\bar X) \leq +\frac{\sqrt{40}}{4}) = \phi(1.581) - \phi(-1.581) = 0.8861\)

again, from the table, we get \(\phi(1.58) - \phi(-1.58) = 0.9429 - 0.0571 = 0.8858\)

The probability that \(\bar{X}\) lies within \(\mu +1\),

z-statistic \(P(\mu -1 \leq \bar X \leq \mu +1) = P(-\frac {\sqrt{20}}{\sigma} \leq Z(\bar X) \leq +\frac {\sqrt{20}}{\sigma})\),

we cannot solve without knowing \(\sigma\)

Since we don’t know the absolute value of both \(\mu\) and \(\sigma\) from the sample data set. If the question is for one of these variable, we need to know the value of the others. To solve this we need to approximate the value of \(\frac{\sigma}{\sqrt{n}}\) with \(\frac{s}{\sqrt{n}}\)

with \(s = 4, n = 20\), t-statistic \(t(\bar X) = \frac {\bar X - \mu}{s/\sqrt{20}}\)

\(P(\mu -1 \leq \bar X \leq \mu +1) = P(-\frac {\sqrt{20}}{4} \leq t(\bar X) \leq +\frac {\sqrt{20}}{4}) = P(-1.118 \leq t(\bar X) \leq +1.118) = 0.7225\)

with \(s = 4, n = 40\), t-statistic \(t(\bar X) = \frac {\bar X - \mu}{s/\sqrt{20}}\)

\(P(\mu -1 \leq \bar X \leq \mu +1) = P(-\frac {\sqrt{40}}{4} \leq t(\bar X) \leq +\frac {\sqrt{40}}{4}) = P(-1.581 \leq t(\bar X) \leq +1.581) = 0.8780\)

# n=20
n <- 20
z <- sqrt(n)/4; z

## [1] 1.118034

z <- round(z, 2); z

## [1] 1.12

pnorm(1.118) - pnorm(-1.118)

## [1] 0.736433

pnorm(z) - pnorm(-z)

## [1] 0.7372862

# n=40
n <- 40
z <- sqrt(n)/4; z

## [1] 1.581139

z <- round(z, 2); z

## [1] 1.58

pnorm(1.581) - pnorm(-1.581)

## [1] 0.886122

pnorm(z) - pnorm(-z)

## [1] 0.8858931

# x_bar lies within mu - 1 <= x <= mu + 1, n=20, s=4
n <- 20
s <- 4
t <- sqrt(n)/s; t

## [1] 1.118034

t <- round(t, 2); t

## [1] 1.12

pt(1.118, df = 19) - pt(-1.118, df = 19)

## [1] 0.7224957

pt(t, df = n - 1) - pt(-t, df = n - 1)

## [1] 0.7233276

# x_bar lies within mu - 1 <= x <= mu + 1, n=40, s=4
n <- 40
s <- 4
t <- sqrt(n)/s; t

## [1] 1.581139

t <- round(t, 2); t

## [1] 1.58

pt(1.581, df = 39) - pt(-1.581, df = 39)

## [1] 0.8780447

pt(t, df = n - 1) - pt(-t, df = n - 1)

## [1] 0.8778157

Point Estimation (s19)

Suppose we have a small pond in our backyard, and in the pond there live some fish. We would like to know how many fish live in the pond. How can we estimate this? One procedure developed by researchers is the capture-recapture method. Here is how it works.

Captures fish, Make a tag and release (e.g. 1st tester fishes n1=7).
Revisit & sample & count tagged and untagged fishes (2nd tester n2=4, tagged t=3, untagged n2-t=1).
Estimate most probable total number (N) of fish in a pond.

Suppose we assume that the total number of fish in a fond follows hypergeometric distribution. Since we tagged 7 fishes and when we revisit, we found 3 tagged and 1 untagged fishes from n2=4 samples, the minimum of total number of fishes is 8. Therefore, the range of possible total number of fishes is from 8 to infinity. Let’s plot the likelihood of total number of fishes in a pond.

# Parameter setting
N <- 8:50 # total number of fishes
m <- 7 # tagged fishes in 1st trial (total number of tagged fishes)
n <- 4 # sampling number
x <- 3 # tagged fishes in 2nd trial

# calculate probabilities
n.fish <- dhyper(x, m, N - m, n)
plot(N, n.fish, type = 'h', col = 3, ylim = c(0,0.6), xlab = 'the number of fish in a pond', 
     ylab = 'Likelihood', main = "Estimation of Total Fishes in a pond ")
points(N, n.fish, col = 2)
abline(h = 0, col = 3)

# Maximum likelihood
identical(max(n.fish), dhyper(x, m, 9 - m, n))

## [1] TRUE

print("The most probable total number of fishes in a pond is 9")

## [1] "The most probable total number of fishes in a pond is 9"

max(n.fish)

## [1] 0.5555556

# by formular
ceiling(((m * n) / x) - 1)

## [1] 9

Hardy-Weinberg’s estimation (s21)

In a small population of 450 people, 200 had genotype AA, 150 had genotype Aa, 100 had genotype aa. estimate genotype probability of whole population.

library(knitr)
geno <- c(200, 150, 100, 450)
names(geno) <- c("AA", "Aa", "aa", "Total")
kable(geno)

	x
AA	200
Aa	150
aa	100
Total	450

# Parameter setting 
k1 <- 200 # AA
k2 <- 150 # Aa
k3 <- 100 # aa
N <- k1 + k2 + k3

# Allele frequency
k1/N; k2/N; k3/N

## [1] 0.4444444

## [1] 0.3333333

## [1] 0.2222222

# MLE
theta_hat <- (2*k1 + k2) / (2*(k1 + k2 + k3))
theta_hat

## [1] 0.6111111

AA <- theta_hat^2
Aa <- 2*theta_hat*(1 - theta_hat)
aa <- (1 - theta_hat)^2

# Estimate of genotype probability of whole population
AA; Aa; aa

## [1] 0.3734568

## [1] 0.4753086

## [1] 0.1512346

# Error 
err_AA <- k1/N - AA
err_Aa <- k2/N - Aa
err_aa <- k3/N - aa
err <- c(err_AA, err_Aa, err_aa)
names(err) <- c("AA", "Aa", "aa")
abs(err)

##         AA         Aa         aa 
## 0.07098765 0.14197531 0.07098765

Interval Estimation - Example of Pivotal Method (s25 ~ s26)

Find exact C.I. of population mean of exponential distribution \(Exp(\lambda)\) by using pivotal method when we have random samples whose \(X_i\) assumed i.i.d from exponential distribution \(Exp(\lambda)\) and the obtained sample mean is 10. (\(\nu =2\))

\(f(x) = \lambda e^{-\lambda x}, F(x) = 1-\lambda e^{-\lambda x}, \mu = \frac{1}{\lambda}, \sigma^2 = \frac{1}{\lambda^2}\)

\(Q=\lambda X\) can be a pivotal as its distribution becomes parameter independent \(Exp(1)\)

\(Q = \lambda X, X = \frac{Q}{\lambda}, \frac{dx}{dq} = \frac{1}{\lambda}, f_q(Q) = f_x(\frac{Q}{\lambda})\frac {1}{\lambda} = e^{-Q} \sim Exp(1)\)

for 90 % of confidence interval, \(P(c_1 \leq Q \leq c_2) = 0.9, or P(Q \leq c_1) = P(Q \geq c_2) = 0.05\)

\(P(Q \leq c_1) = \int_0^{c_1} e^{-Q}dq = 1-e^{-c_1} = 0.05\)

\(P(Q \geq c_2) = \int_{c_2}^\infty e^{-Q}dq = e^{-c_2} = 0.05\)

\(c_1 = ln(0.95) = 2.996, \ c_2 = -ln(0.05) = 0.051\)

Therefore \(P(Q \leq c_1) = P(\frac{x}{\mu} \leq c_1) = P(\frac{x}{c_1} \leq \mu) = P(X/2.996 \leq \mu)\)

\(P(Q \geq c_2) = P(\frac{x}{\mu} \geq c_2) = P(\frac{x}{c_2} \geq \mu) = P(X/0.051 \geq \mu)\)

the sample mean is 10, the C.I of \(\mu = \frac{1}{\lambda}\) will be 3.38, 194.96

xbar <- 10
n <- 1
df <- 2 * n
a <- 0.1
crit.chisq <- qchisq(c(1 - a/2, a/2), df)
ci.mu <- (df*xbar)/crit.chisq

# using gamma
crit.gamma <- qgamma(c(1 - a/2, a/2), n, 1/2) 
ci.mu2 <- 2*n*xbar/crit.gamma

t <- seq(0,200,0.01)
plot(t, dexp(t, 1/xbar), type = 'l', ylab = 'f(x)', xlab = 't (sec)')
abline(v = xbar, lwd = 2)
abline(v = ci.mu[1], col = "red", lwd = 2, lty = 2) # lower
abline(v = ci.mu[2], col = "red", lwd = 2, lty = 2) # upper
abline(v = ci.mu2[1], col = "green") # lower
abline(v = ci.mu2[2], col = "green") # upper
legend("topright", legend = c("Mean", "CI w chisq", "CI w gamma"), 
       lty = c(1,2,1,1), col = c("black", "red", "green"))

Interval Estimation - Hayter 17.3 (page 777)

Exponential Distribution If the failure times \(t_1, ..., t_n\) are considered to be observations from an exponential distribution, then the parameter \(\lambda\) can be estimated as

\[\hat\lambda = \frac{n}{t_1 + ... + t_n} = \frac {1}{\bar t}\] mean failure time \(\mu = 1/\lambda\) is then estimated as \[\hat \mu = \frac{1}{\hat \lambda} = \bar t\] which is simply the sample average. A \(1-\alpha\) confidence level confidence interval for the mean failure time can be constructed as \[\mu \in (\frac{2n\bar t}{\chi^2_{2n, \alpha/2}}, \frac{2n\bar t}{\chi^2_{2n, 1-\alpha/2}})\] where \(\chi^2_{2n, \alpha/2}\) is the upper \(\alpha/2\) critical point and \(\chi^2_{2n, 1-\alpha/2}\) is the lower \(\alpha/2\) critical point of a chi-square distribution with 2n degrees of freedom. This confidence interval can be rewritten in terms of the parameter \(\lambda\) as \[\lambda = (\frac{\chi^2_{2n, 1-\alpha/2}}{2n\bar t}, \frac{\chi^2_{2n, \alpha/2}}{2n\bar t})\]

s <- c(0.6, 8.4, 12.8, 4.3, 1.3, 8.3, 18.1, 20.6, 
       21.0, 17.8, 3.3, 6.2, 9.7, 35.2, 13.9, 2.6, 5.2, 5.6, 13.7, 3.8) # time in seconds
df <- 2 * length(s)
a <- 0.1
crit.chisq <- qchisq(c(1 - a/2, a/2), df)
ci.mu <- (df*mean(s))/crit.chisq

t <- seq(0,100,0.01)
plot(t, exp(-1/mean(s)*t), type = 'l', ylab = 'probability of failure', xlab = 't (sec)')
abline(v = mean(s))
abline(v = mean(s) + sd(s), col = "blue")
abline(v = mean(s) - sd(s), col = "blue")
abline(v = ci.mu[1], col = "red") # lower
abline(v = ci.mu[2], col = "red") # upper
legend("topright", legend = c("Mean", "CI", "SD"), 
       lty = rep(1,3), col = c("black", "red", "blue"))

Practice for polled two samples (s30)

Clinical test for men and women showed that there was difference in systolic blood pressures.

	Men	Women
Sample number	6.0	4.0
Sample mean	117.5	126.8
Sample S.E	9.7	12.0

Assuming that they have common population variance obtain the C.I. of the difference in mean of systolic blood pressure between men & women?

Estimate of pooled standard deviation \(S_p = \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}\)

\(S_p = \sqrt{\frac{(6-1)9.7^2+(4-1)12.0^2}{6+4-2}} = \sqrt{112.81} = 10.6\)

\(df = {n_1}+{n_2}-2 = 6+4-2 = 8\). \(CL = 95 \%\), \(t_{\alpha = 0.025} = 2.306\)

\((\bar x_1 - \bar x_2) \pm t S_p\sqrt{\frac{1}{n_1}+ \frac{1}{n_2}} = (117.5 - 126.8) \pm 2.306 (10.6) S_p\sqrt{\frac{1}{6}+ \frac{1}{4}} = -9.3 \pm 2.306(6.84)\)

\(CI = (-25.7, 6.47)\). Since the C.I. of difference include 0 (means no difference), which means the difference is not significant statistically

n1 <- 6
n2 <- 4
df <- n1 + n2 - 2
mu1 <- 117.5
s1 <- 9.7
mu2 <- 126.8
s2 <- 12.0
mu <- mean(c(mu1, mu2))
     
sp <- sqrt((((n1 - 1)*s1^2) + ((n2 - 1)*s2^2)) / (n1 + n2 - 2))
     
# ci calculation
lower <- (mu1 - mu2) - (qt(0.975, df) * sp * sqrt((1/n1) + (1/n2)))
upper <- (mu1 - mu2) + (qt(0.975, df) * sp * sqrt((1/n1) + (1/n2)))
lower; upper

## [1] -25.10961

## [1] 6.509606

range <- seq(-30 + mu, 30 + mu, 0.1)
plot(range, dnorm(range, mu1, s1), type = 'l', xlab = 'x', 
     ylab = 'probability density', col = "blue",
     main = 'Interval estimation for systoling blood pressure between man and woman', cex = 1)
lines(range, dnorm(range, mu2, s2), col = "red", type = 'l')
abline(v = mu1, col = "black")
abline(v = mu2, col = "black")
abline(v = mu1 + lower, col = "blue", lty = 2)
abline(v = mu1 + upper, col = "blue", lty = 2)
abline(v = mu2 + lower, col = "red", lty = 2)
abline(v = mu2 + upper, col = "red", lty = 2)
legend("topright", legend = c("sys.bp.man", "sysbp.woman", "mean", "CI for man", "CI for woman"), 
lty = c(1, 1, 1, 2, 2), col = c("blue", "red", "black", "blue", "red"))

# for BMI (from Quiz)
n1 <- 6
n2 <- 4
df <- n1 + n2 - 2
mu1 <- 28.0
s1 <- 3.6
mu2 <- 26.2
s2 <- 2.0
mu <- mean(c(mu1, mu2))
     
sp <- sqrt((((n1 - 1)*s1^2) + ((n2 - 1)*s2^2)) / (n1 + n2 - 2))
     
# ci calculation
lower <- (mu1 - mu2) - (qt(0.975, df) * sp * sqrt((1/n1) + (1/n2)))
upper <- (mu1 - mu2) + (qt(0.975, df) * sp * sqrt((1/n1) + (1/n2)))
lower; upper

## [1] -2.812008

## [1] 6.412008

range <- seq(-30 + mu, 30 + mu, 0.1)
plot(range, dnorm(range, mu1, s1), type = 'l', xlab = 'x', 
     ylab = 'probability density', col = "blue",
     main = 'Interval estimation for BMI between man and woman', cex = 1, ylim = c(0, 0.2))
lines(range, dnorm(range, mu2, s2), col = "red", type = 'l')
abline(v = mu1, col = "black")
abline(v = mu2, col = "black")
abline(v = mu1 + lower, col = "blue", lty = 2)
abline(v = mu1 + upper, col = "blue", lty = 2)
abline(v = mu2 + lower, col = "red", lty = 2)
abline(v = mu2 + upper, col = "red", lty = 2)
legend("topright", legend = c("BMI.man", "BMI.woman", "mean", "CI for man", "CI for woman"), 
lty = c(1, 1, 1, 2, 2), col = c("blue", "red", "black", "blue", "red"))

Interpretation

We are 95% confident that the difference in mean systolic blood pressures between men and women is between -25.07 and 6.47 units. In this sample, the men have lower mean systolic blood pressures than women by 9.3 units. Based on this interval, we also conclude that there is no statistically significant difference in mean systolic blood pressures between men and women, because the 95% confidence interval includes the null value, zero.

Again, the confidence interval is a range of likely values for the difference in means. Since the interval contains zero (no difference), we do not have sufficient evidence to conclude that there is a difference. (Same as the BMI case)

Example of sample size consideration (s32)

One-year survival rate (%) test in 15 cancer patients for the effect of a cancer drug showed its sample mean 59.81 and a sample standard error 4.94.
Taking two-sided \(t_{0.005} = 2.9769\) with d.f. of 14, obtain a confidence interval for the mean survival rate (%) with a confidence level of 99%.
If we require a 99% confidence interval with a length no larger than 5%, how many patients do we need to test more?

1. C.I. of mean survival rate with 99 % confidence level

\(\bar x - \frac{t_{\alpha/2, n-1}s}{\sqrt{n}}, \bar x+\frac{t_{\alpha/2, n-1}s}{\sqrt{n}}\)

\(=(59.81 - \frac{2.9769 \times 4.94}{\sqrt{15}}, 59.81 + \frac{2.9769 \times 4.94}{\sqrt{15}}) = (56.01, 63.61)\)

Sample number that can reduce that C.I. no larget than 5% with the same confidence level
- \(n \geq 4 \times (\frac{t_{\alpha/2, n-1}s}{L_0})^2 = 4 \times (\frac{2.9769 \times 4.94}{5})^2 = 34.6\)
The patients to test more
- at least 20

xbar <- 59.81
n <- 15
se <- 4.94
a <- 0.01
df <- n - 1
L0 <- 5
t.crit <- qt(a/2, df = 14, lower.tail = FALSE)

# ci
lower <- xbar - (t.crit * se / sqrt(n))
upper <- xbar + (t.crit * se / sqrt(n))
CI <- upper - lower; CI

## [1] 7.593941

# n
desire <- 4 * (qt(a/2, df = 14) * se / L0)^2
require <- desire - n; ceiling(require)

## [1] 20