Suppose a sample with n=20 whose underlying distribution is assumed normal. What is the probability that the value of \(\hat \mu = \bar X\) lies within \(\frac{\sigma}{4}\) of the true mean \(\mu\)? How does it change when the sample size becomes n=40? If the sample variance \(S^2\) is 16, what are the probabilities that the value of \(\hat \mu = \bar X\) lies within \(\mu \pm 1\) for those two sample sizes?
\(P(\mu - \frac{\sigma}{4} \leq \bar X \leq \mu + \frac{\sigma}{4}) \sim N(\mu, \frac{\sigma^2}{n})\)
\(Z(\bar X) = \frac{\bar X-\mu}{\sigma/\sqrt {n}}\)
\(n=20, P(\mu - \frac{\sigma}{4} \leq \bar X \leq \mu + \frac{\sigma}{4}) = P(-\frac{\sqrt{20}}{4} \leq Z(\bar X) \leq +\frac{\sqrt{20}}{4}) = \phi(1.118) - \phi(-1.118) = 0.7364\)
from the table, we get \(\phi(1.12) - \phi(-1.12) = 0.8686 - 0.1314 = 0.7372\), which is the value shown in the lecture slide.
\(n=40, P(\mu - \frac{\sigma}{4} \leq \bar X \leq \mu + \frac{\sigma}{4}) = P(-\frac{\sqrt{40}}{4} \leq Z(\bar X) \leq +\frac{\sqrt{40}}{4}) = \phi(1.581) - \phi(-1.581) = 0.8861\)
again, from the table, we get \(\phi(1.58) - \phi(-1.58) = 0.9429 - 0.0571 = 0.8858\)
The probability that \(\bar{X}\) lies within \(\mu +1\),
z-statistic \(P(\mu -1 \leq \bar X \leq \mu +1) = P(-\frac {\sqrt{20}}{\sigma} \leq Z(\bar X) \leq +\frac {\sqrt{20}}{\sigma})\),
we cannot solve without knowing \(\sigma\)
Since we don’t know the absolute value of both \(\mu\) and \(\sigma\) from the sample data set. If the question is for one of these variable, we need to know the value of the others. To solve this we need to approximate the value of \(\frac{\sigma}{\sqrt{n}}\) with \(\frac{s}{\sqrt{n}}\)
with \(s = 4, n = 20\), t-statistic \(t(\bar X) = \frac {\bar X - \mu}{s/\sqrt{20}}\)
\(P(\mu -1 \leq \bar X \leq \mu +1) = P(-\frac {\sqrt{20}}{4} \leq t(\bar X) \leq +\frac {\sqrt{20}}{4}) = P(-1.118 \leq t(\bar X) \leq +1.118) = 0.7225\)
with \(s = 4, n = 40\), t-statistic \(t(\bar X) = \frac {\bar X - \mu}{s/\sqrt{20}}\)
\(P(\mu -1 \leq \bar X \leq \mu +1) = P(-\frac {\sqrt{40}}{4} \leq t(\bar X) \leq +\frac {\sqrt{40}}{4}) = P(-1.581 \leq t(\bar X) \leq +1.581) = 0.8780\)
# n=20
n <- 20
z <- sqrt(n)/4; z
## [1] 1.118034
z <- round(z, 2); z
## [1] 1.12
pnorm(1.118) - pnorm(-1.118)
## [1] 0.736433
pnorm(z) - pnorm(-z)
## [1] 0.7372862
# n=40
n <- 40
z <- sqrt(n)/4; z
## [1] 1.581139
z <- round(z, 2); z
## [1] 1.58
pnorm(1.581) - pnorm(-1.581)
## [1] 0.886122
pnorm(z) - pnorm(-z)
## [1] 0.8858931
# x_bar lies within mu - 1 <= x <= mu + 1, n=20, s=4
n <- 20
s <- 4
t <- sqrt(n)/s; t
## [1] 1.118034
t <- round(t, 2); t
## [1] 1.12
pt(1.118, df = 19) - pt(-1.118, df = 19)
## [1] 0.7224957
pt(t, df = n - 1) - pt(-t, df = n - 1)
## [1] 0.7233276
# x_bar lies within mu - 1 <= x <= mu + 1, n=40, s=4
n <- 40
s <- 4
t <- sqrt(n)/s; t
## [1] 1.581139
t <- round(t, 2); t
## [1] 1.58
pt(1.581, df = 39) - pt(-1.581, df = 39)
## [1] 0.8780447
pt(t, df = n - 1) - pt(-t, df = n - 1)
## [1] 0.8778157
Suppose we have a small pond in our backyard, and in the pond there live some fish. We would like to know how many fish live in the pond. How can we estimate this? One procedure developed by researchers is the capture-recapture method. Here is how it works.
Suppose we assume that the total number of fish in a fond follows hypergeometric distribution. Since we tagged 7 fishes and when we revisit, we found 3 tagged and 1 untagged fishes from n2=4 samples, the minimum of total number of fishes is 8. Therefore, the range of possible total number of fishes is from 8 to infinity. Let’s plot the likelihood of total number of fishes in a pond.
# Parameter setting
N <- 8:50 # total number of fishes
m <- 7 # tagged fishes in 1st trial (total number of tagged fishes)
n <- 4 # sampling number
x <- 3 # tagged fishes in 2nd trial
# calculate probabilities
n.fish <- dhyper(x, m, N - m, n)
plot(N, n.fish, type = 'h', col = 3, ylim = c(0,0.6), xlab = 'the number of fish in a pond',
ylab = 'Likelihood', main = "Estimation of Total Fishes in a pond ")
points(N, n.fish, col = 2)
abline(h = 0, col = 3)
# Maximum likelihood
identical(max(n.fish), dhyper(x, m, 9 - m, n))
## [1] TRUE
print("The most probable total number of fishes in a pond is 9")
## [1] "The most probable total number of fishes in a pond is 9"
max(n.fish)
## [1] 0.5555556
# by formular
ceiling(((m * n) / x) - 1)
## [1] 9
In a small population of 450 people, 200 had genotype AA, 150 had genotype Aa, 100 had genotype aa. estimate genotype probability of whole population.
library(knitr)
geno <- c(200, 150, 100, 450)
names(geno) <- c("AA", "Aa", "aa", "Total")
kable(geno)
x | |
---|---|
AA | 200 |
Aa | 150 |
aa | 100 |
Total | 450 |
# Parameter setting
k1 <- 200 # AA
k2 <- 150 # Aa
k3 <- 100 # aa
N <- k1 + k2 + k3
# Allele frequency
k1/N; k2/N; k3/N
## [1] 0.4444444
## [1] 0.3333333
## [1] 0.2222222
# MLE
theta_hat <- (2*k1 + k2) / (2*(k1 + k2 + k3))
theta_hat
## [1] 0.6111111
AA <- theta_hat^2
Aa <- 2*theta_hat*(1 - theta_hat)
aa <- (1 - theta_hat)^2
# Estimate of genotype probability of whole population
AA; Aa; aa
## [1] 0.3734568
## [1] 0.4753086
## [1] 0.1512346
# Error
err_AA <- k1/N - AA
err_Aa <- k2/N - Aa
err_aa <- k3/N - aa
err <- c(err_AA, err_Aa, err_aa)
names(err) <- c("AA", "Aa", "aa")
abs(err)
## AA Aa aa
## 0.07098765 0.14197531 0.07098765
Find exact C.I. of population mean of exponential distribution \(Exp(\lambda)\) by using pivotal method when we have random samples whose \(X_i\) assumed i.i.d from exponential distribution \(Exp(\lambda)\) and the obtained sample mean is 10. (\(\nu =2\))
\(Q=\lambda X\) can be a pivotal as its distribution becomes parameter independent \(Exp(1)\)
\(Q = \lambda X, X = \frac{Q}{\lambda}, \frac{dx}{dq} = \frac{1}{\lambda}, f_q(Q) = f_x(\frac{Q}{\lambda})\frac {1}{\lambda} = e^{-Q} \sim Exp(1)\)
for 90 % of confidence interval, \(P(c_1 \leq Q \leq c_2) = 0.9, or P(Q \leq c_1) = P(Q \geq c_2) = 0.05\)
\(P(Q \leq c_1) = \int_0^{c_1} e^{-Q}dq = 1-e^{-c_1} = 0.05\)
\(P(Q \geq c_2) = \int_{c_2}^\infty e^{-Q}dq = e^{-c_2} = 0.05\)
\(c_1 = ln(0.95) = 2.996, \ c_2 = -ln(0.05) = 0.051\)
Therefore \(P(Q \leq c_1) = P(\frac{x}{\mu} \leq c_1) = P(\frac{x}{c_1} \leq \mu) = P(X/2.996 \leq \mu)\)
\(P(Q \geq c_2) = P(\frac{x}{\mu} \geq c_2) = P(\frac{x}{c_2} \geq \mu) = P(X/0.051 \geq \mu)\)
the sample mean is 10, the C.I of \(\mu = \frac{1}{\lambda}\) will be 3.38, 194.96
xbar <- 10
n <- 1
df <- 2 * n
a <- 0.1
crit.chisq <- qchisq(c(1 - a/2, a/2), df)
ci.mu <- (df*xbar)/crit.chisq
# using gamma
crit.gamma <- qgamma(c(1 - a/2, a/2), n, 1/2)
ci.mu2 <- 2*n*xbar/crit.gamma
t <- seq(0,200,0.01)
plot(t, dexp(t, 1/xbar), type = 'l', ylab = 'f(x)', xlab = 't (sec)')
abline(v = xbar, lwd = 2)
abline(v = ci.mu[1], col = "red", lwd = 2, lty = 2) # lower
abline(v = ci.mu[2], col = "red", lwd = 2, lty = 2) # upper
abline(v = ci.mu2[1], col = "green") # lower
abline(v = ci.mu2[2], col = "green") # upper
legend("topright", legend = c("Mean", "CI w chisq", "CI w gamma"),
lty = c(1,2,1,1), col = c("black", "red", "green"))
Exponential Distribution If the failure times \(t_1, ..., t_n\) are considered to be observations from an exponential distribution, then the parameter \(\lambda\) can be estimated as
\[\hat\lambda = \frac{n}{t_1 + ... + t_n} = \frac {1}{\bar t}\] mean failure time \(\mu = 1/\lambda\) is then estimated as \[\hat \mu = \frac{1}{\hat \lambda} = \bar t\] which is simply the sample average. A \(1-\alpha\) confidence level confidence interval for the mean failure time can be constructed as \[\mu \in (\frac{2n\bar t}{\chi^2_{2n, \alpha/2}}, \frac{2n\bar t}{\chi^2_{2n, 1-\alpha/2}})\] where \(\chi^2_{2n, \alpha/2}\) is the upper \(\alpha/2\) critical point and \(\chi^2_{2n, 1-\alpha/2}\) is the lower \(\alpha/2\) critical point of a chi-square distribution with 2n degrees of freedom. This confidence interval can be rewritten in terms of the parameter \(\lambda\) as \[\lambda = (\frac{\chi^2_{2n, 1-\alpha/2}}{2n\bar t}, \frac{\chi^2_{2n, \alpha/2}}{2n\bar t})\]
s <- c(0.6, 8.4, 12.8, 4.3, 1.3, 8.3, 18.1, 20.6,
21.0, 17.8, 3.3, 6.2, 9.7, 35.2, 13.9, 2.6, 5.2, 5.6, 13.7, 3.8) # time in seconds
df <- 2 * length(s)
a <- 0.1
crit.chisq <- qchisq(c(1 - a/2, a/2), df)
ci.mu <- (df*mean(s))/crit.chisq
t <- seq(0,100,0.01)
plot(t, exp(-1/mean(s)*t), type = 'l', ylab = 'probability of failure', xlab = 't (sec)')
abline(v = mean(s))
abline(v = mean(s) + sd(s), col = "blue")
abline(v = mean(s) - sd(s), col = "blue")
abline(v = ci.mu[1], col = "red") # lower
abline(v = ci.mu[2], col = "red") # upper
legend("topright", legend = c("Mean", "CI", "SD"),
lty = rep(1,3), col = c("black", "red", "blue"))
Clinical test for men and women showed that there was difference in systolic blood pressures.
Men | Women | |
---|---|---|
Sample number | 6.0 | 4.0 |
Sample mean | 117.5 | 126.8 |
Sample S.E | 9.7 | 12.0 |
Assuming that they have common population variance obtain the C.I. of the difference in mean of systolic blood pressure between men & women?
\(S_p = \sqrt{\frac{(6-1)9.7^2+(4-1)12.0^2}{6+4-2}} = \sqrt{112.81} = 10.6\)
\(df = {n_1}+{n_2}-2 = 6+4-2 = 8\). \(CL = 95 \%\), \(t_{\alpha = 0.025} = 2.306\)
\((\bar x_1 - \bar x_2) \pm t S_p\sqrt{\frac{1}{n_1}+ \frac{1}{n_2}} = (117.5 - 126.8) \pm 2.306 (10.6) S_p\sqrt{\frac{1}{6}+ \frac{1}{4}} = -9.3 \pm 2.306(6.84)\)
\(CI = (-25.7, 6.47)\). Since the C.I. of difference include 0 (means no difference), which means the difference is not significant statistically
n1 <- 6
n2 <- 4
df <- n1 + n2 - 2
mu1 <- 117.5
s1 <- 9.7
mu2 <- 126.8
s2 <- 12.0
mu <- mean(c(mu1, mu2))
sp <- sqrt((((n1 - 1)*s1^2) + ((n2 - 1)*s2^2)) / (n1 + n2 - 2))
# ci calculation
lower <- (mu1 - mu2) - (qt(0.975, df) * sp * sqrt((1/n1) + (1/n2)))
upper <- (mu1 - mu2) + (qt(0.975, df) * sp * sqrt((1/n1) + (1/n2)))
lower; upper
## [1] -25.10961
## [1] 6.509606
range <- seq(-30 + mu, 30 + mu, 0.1)
plot(range, dnorm(range, mu1, s1), type = 'l', xlab = 'x',
ylab = 'probability density', col = "blue",
main = 'Interval estimation for systoling blood pressure between man and woman', cex = 1)
lines(range, dnorm(range, mu2, s2), col = "red", type = 'l')
abline(v = mu1, col = "black")
abline(v = mu2, col = "black")
abline(v = mu1 + lower, col = "blue", lty = 2)
abline(v = mu1 + upper, col = "blue", lty = 2)
abline(v = mu2 + lower, col = "red", lty = 2)
abline(v = mu2 + upper, col = "red", lty = 2)
legend("topright", legend = c("sys.bp.man", "sysbp.woman", "mean", "CI for man", "CI for woman"),
lty = c(1, 1, 1, 2, 2), col = c("blue", "red", "black", "blue", "red"))
# for BMI (from Quiz)
n1 <- 6
n2 <- 4
df <- n1 + n2 - 2
mu1 <- 28.0
s1 <- 3.6
mu2 <- 26.2
s2 <- 2.0
mu <- mean(c(mu1, mu2))
sp <- sqrt((((n1 - 1)*s1^2) + ((n2 - 1)*s2^2)) / (n1 + n2 - 2))
# ci calculation
lower <- (mu1 - mu2) - (qt(0.975, df) * sp * sqrt((1/n1) + (1/n2)))
upper <- (mu1 - mu2) + (qt(0.975, df) * sp * sqrt((1/n1) + (1/n2)))
lower; upper
## [1] -2.812008
## [1] 6.412008
range <- seq(-30 + mu, 30 + mu, 0.1)
plot(range, dnorm(range, mu1, s1), type = 'l', xlab = 'x',
ylab = 'probability density', col = "blue",
main = 'Interval estimation for BMI between man and woman', cex = 1, ylim = c(0, 0.2))
lines(range, dnorm(range, mu2, s2), col = "red", type = 'l')
abline(v = mu1, col = "black")
abline(v = mu2, col = "black")
abline(v = mu1 + lower, col = "blue", lty = 2)
abline(v = mu1 + upper, col = "blue", lty = 2)
abline(v = mu2 + lower, col = "red", lty = 2)
abline(v = mu2 + upper, col = "red", lty = 2)
legend("topright", legend = c("BMI.man", "BMI.woman", "mean", "CI for man", "CI for woman"),
lty = c(1, 1, 1, 2, 2), col = c("blue", "red", "black", "blue", "red"))
Interpretation
We are 95% confident that the difference in mean systolic blood pressures between men and women is between -25.07 and 6.47 units. In this sample, the men have lower mean systolic blood pressures than women by 9.3 units. Based on this interval, we also conclude that there is no statistically significant difference in mean systolic blood pressures between men and women, because the 95% confidence interval includes the null value, zero.
Again, the confidence interval is a range of likely values for the difference in means. Since the interval contains zero (no difference), we do not have sufficient evidence to conclude that there is a difference. (Same as the BMI case)
\(\bar x - \frac{t_{\alpha/2, n-1}s}{\sqrt{n}}, \bar x+\frac{t_{\alpha/2, n-1}s}{\sqrt{n}}\)
\(=(59.81 - \frac{2.9769 \times 4.94}{\sqrt{15}}, 59.81 + \frac{2.9769 \times 4.94}{\sqrt{15}}) = (56.01, 63.61)\)
xbar <- 59.81
n <- 15
se <- 4.94
a <- 0.01
df <- n - 1
L0 <- 5
t.crit <- qt(a/2, df = 14, lower.tail = FALSE)
# ci
lower <- xbar - (t.crit * se / sqrt(n))
upper <- xbar + (t.crit * se / sqrt(n))
CI <- upper - lower; CI
## [1] 7.593941
# n
desire <- 4 * (qt(a/2, df = 14) * se / L0)^2
require <- desire - n; ceiling(require)
## [1] 20