Estimate of a population statistic from many samples
I added this part since some of you have difficulity in understanding the concept of C.I. We are covering some of basic examples about C.I. to understand concept clearly.
Suppose you want to know the “Mean TV hours per week” in each households in Korea. Since you can not collect the data for all households, you need to collect data for a small subset of the population. Now, if another student want to know the same thing, then the student is facing the same problem too. The student also need to collect a small subset of the population which are the households in Korea to estimate the mean TV hours per week.
Let’s explore this idea. Suppose all of the student in Bis335 want to know the mean of TV hours per week in Korea and they decide to collect TV hours of 50 household per student. (suppose TV hours per week follows beta distribution, Beta(20, 20)) Then we can get the sample mean of TV hours watched for each 50 household from 26 students. The distribution of this sample mean can be visualized by histogram. As you studied from lecture material, the distribution of this sample mean is approaching normal distribution when sample size grows to infinity. (Central Limit Theorem) Note that the histogram does not represent the distribution of mean hours of TV watched for the population, but the distribution of the sample averages.
From these sample means, we can compute the arithmetic mean of the sample means. This statistic is reffered to as the grand mean. We can assess the chance that this estimate(the grand mean) is not representative of the whole population mean by calculating standard error.
The Standard error (SE) is the standard deviation of the sampling distribution of a statitic. It is the likely size of chance error when estimating a statistic. For this case, standard error is computed by taking the standard deviation of the sample means.
It is important note that SE is not an estimate of the population standard deviation but a measure of uncertainty in the population mean estimate.
n <- 23 # number of samples to collect (number of student in our class)
size <- 50 # size of each sample (number of household investigated by a student)
s.mean <- vector(length = n)
g.mean <- vector(length = 3)
SE <- vector(length = 3)
sample <- function() rbeta(size, 20, 20)*60 # 60 hours total
for (i in 1:n) s.mean[i] <- mean(sample())
g.mean[1] <- mean(s.mean)
SE[1] <- sd(s.mean)
hist(s.mean, breaks = 10, xlab = " Mean TV hours per week")
n <- 1000
for (i in 1:n) s.mean[i] <- mean(sample())
g.mean[2] <- mean(s.mean)
SE[2] <- sd(s.mean)
hist(s.mean, breaks = 10, xlab = " Mean TV hours per week")
n <- 10000
for (i in 1:n) s.mean[i] <- mean(sample())
g.mean[3] <- mean(s.mean)
SE[3] <- sd(s.mean)
hist(s.mean, breaks = 10, xlab = " Mean TV hours per week")
g.mean; SE
## [1] 29.99338 29.98544 30.00699
## [1] 0.8302823 0.6845761 0.6542369
Estimate of a population statistic from a single sample
Suppose that your classmate does not agree with the survey above but you would like to estimate the survey result. In this case, you can do it by estimating grand mean and standard error from your sample statistic.
The standard error of the mean can be estmated by SEmean=SD√n where SE is the standard error of the mean from your sample, SD is the standard deviation of whole population (i.e. all households in Korea), and n is your sample size. The approximation applies only - to the uncertainty associated with the estimate of the population mean (not other statistics like the median, count, or percentage) - for a large sample sizes (usually 30 or greater)
Since we do not know the SD of population, we need to estimate SD using sample means’ standard deviation SDsample. Therefore, SEmean=SDsample√n
As I mentioned, standard error is not an estimate of the population standard deviation. It is a measure of how likely the interval, defined by SE, encompasses the true (population) mean. As you can see below, 1,438 sample means out of 10,000 sample means are not included 99% confidence interval range. This represent our confidence that the interval contains the population mean.
x <- sample()
mean.x <- mean(x); mean.x; g.mean
## [1] 30.78272
## [1] 29.99338 29.98544 30.00699
SE.x <- sd(x)/sqrt(length(x)); SE.x; SE
## [1] 0.6732375
## [1] 0.8302823 0.6845761 0.6542369
# 68 % C.I. n = 10000
sum(!(s.mean < mean.x + qnorm(0.84)*SE.x & s.mean > mean.x + qnorm(0.16)*SE.x))
## [1] 5763
# 95 % C.I. n = 10000
sum(!(s.mean < mean.x + qnorm(0.975)*SE.x & s.mean > mean.x + qnorm(0.025)*SE.x))
## [1] 2034
# 99 % C.I. n = 10000
sum(!(s.mean < mean.x + qnorm(0.995)*SE.x & s.mean > mean.x + qnorm(0.005)*SE.x))
## [1] 718
hist(s.mean, breaks = 10, xlab = " Mean TV hours per week")
plot(density(s.mean))
General solution to compute the standard error : the Bootstrap
What if we want to estimate the population parameter in general? The solution is called bootstrap.
median.boot <- numeric()
for (i in 1:10000) {
sample.boot <- base::sample(x, size = length(x), replace = TRUE)
median.boot[i] <- median(sample.boot)
}
SE.boot <- sd(median.boot)
SE.boot
## [1] 0.7942579
# using boot library
library(boot)
med <- function(x, i) median(x[i])
median.boot <- boot(x, R = 10000, statistic = med)
SE.boot <- sd(as.vector(median.boot$t))
SE.boot
## [1] 0.8073408
str(median.boot)
## List of 11
## $ t0 : num 30.5
## $ t : num [1:10000, 1] 28.9 30 30.5 30.1 30.2 ...
## $ R : num 10000
## $ data : num [1:50] 29.6 29 37.1 38.6 33.3 ...
## $ seed : int [1:626] 403 220 -1058924540 -889257474 -516029897 -95774534 -183859689 -358286234 -1879390551 100813375 ...
## $ statistic:function (x, i)
## ..- attr(*, "srcref")= 'srcref' int [1:8] 11 8 11 34 8 34 11 11
## .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x000000001a988988>
## $ sim : chr "ordinary"
## $ call : language boot(data = x, statistic = med, R = 10000)
## $ stype : chr "i"
## $ strata : num [1:50] 1 1 1 1 1 1 1 1 1 1 ...
## $ weights : num [1:50] 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ...
## - attr(*, "class")= chr "boot"
## - attr(*, "boot_type")= chr "boot"
So, what is confidence intervals?
The idea of a confidence interval, CI, is a natural extension of the standard error. It allows us to define a level of confidence in our population parameter estimate gleaned from a sample. For example, if we wanted to be 95 % confident that the range of mean TV hours per week computed from our sample encompasses the true mean value for all households in the population we would compute this interval by adding and subtracting 1.96 SE to/from the sample mean. (the value 1.96 come from quantile value of normal distribution)
mean.int.90 <- mean.x + qt(c(0.05, 0.95), length(x) - 1) * SE.x
mean.int.90
## [1] 29.65400 31.91144
mean.int.95 <- mean.x + qt(c(0.025, 0.975), length(x) - 1) * SE.x
mean.int.95
## [1] 29.42980 32.13564
median.int.90 <- mean(median.boot$t) + qt(c(0.05, 0.95), length(x) - 1) * SE.boot
median.int.90
## [1] 29.08046 31.78756
# bias corrected bootstrap
median.int.90.BCa <- boot.ci(median.boot, conf = .90, type = "bca")
median.int.90.BCa
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = median.boot, conf = 0.9, type = "bca")
##
## Intervals :
## Level BCa
## 90% (29.09, 31.71 )
## Calculations and Intervals on Original Scale