Bayes Estimation (s34 ~ s35, for advanced learners only)

Binomial and Beta Distributions (Bayesian Learning)

Let’s come back to the probability material we studied in week 3, but from a slightly different angle. Suppose \(X_i \sim Ber(\theta)\), i.e. \(P(X_i = 1) = \theta\) (``heads") and \(P(X_i = 0) = 1-\theta\) (``tails"), where \(\theta \in [0,1]\) is the parameter to be estimated. Given a series of N observed coin tosses, the probability that we observe k heads is given by the Binomial distribution: \[Bin(k | N, \theta) = {N \choose k} \theta^k (1-\theta)^{N-k}\] Recall what the binomial distribution looks like (see the sketch below): the x axis represents the number of successes (heads) out of N trials and the y axis represents the probability of that count.
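A minimal sketch in R (with assumed values N = 10 and theta = 0.5, chosen purely for illustration) reproduces this shape:

# Binomial pmf: probability of k heads in N tosses
N <- 10; theta <- 0.5
k <- 0:N
plot(k, dbinom(k, size = N, prob = theta), type = "h", lwd = 2,
     xlab = "number of heads k", ylab = "P(k | N, theta)",
     main = "Binomial(N = 10, theta = 0.5)")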

Thus, the likelihood has the form \[P(D|\theta) \propto \theta^{N_1}(1-\theta)^{N_0}\] where \(N_0\) and \(N_1\) are the numbers of tails and heads seen, respectively. These are called sufficient statistics, since they are all we need to know about the data to estimate \(\theta\). Formally, \(s(D)\) is a set of sufficient statistics for D if \[P(\theta|D) = P(\theta|s(D))\] We can compute \(P(\theta|D)\) using Bayes' rule: \[P(\theta|D)= \frac{P(D|\theta)P(\theta)}{\int_0^1P(D|\theta)P(\theta)d\theta}\] The calculation of this value is a bit daunting because of the integral in the denominator. We can avoid it by choosing a suitable prior: we need a prior with support \([0,1]\) that has the same functional form as the likelihood. For the binomial likelihood, the Beta distribution is the right choice. The Beta distribution is defined as \[Beta(\theta|a,b) = \frac{1}{B(a,b)}\theta^{a-1} (1-\theta)^{b-1}\] with hyperparameters a, b, and where \[B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\] is a normalizing factor. The mean of the Beta distribution is \(\frac{a}{a+b}\) and its mode is \(\frac{a-1}{a+b-2}\). Its variance is \(\frac{ab}{(a+b)^2(a+b+1)}\).

Q: If we believe that the mean is 0.7 and the standard deviation is 0.2, we can set \(a = 2.975\) and \(b = 1.275\) (practice yourself!).
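A sketch of the moment-matching calculation behind this answer (solving the mean and variance formulas above for \(a\) and \(b\)):

# solve mean = a/(a+b) and var = ab/((a+b)^2 (a+b+1)) for a and b
m <- 0.7; s <- 0.2
ab <- m * (1 - m) / s^2 - 1   # a + b, since ab/(a+b)^2 = m(1-m)
a <- m * ab
b <- (1 - m) * ab
c(a, b)   # 2.975 1.275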

Recall what the Beta distribution looks like: the x axis represents \(\theta\), a value in \([0,1]\), and the y axis represents the probability density at that value.

(Please see this post if you are not clear about the Beta distribution.)
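A few lines of R recreate some typical Beta shapes (the \(a, b\) values here are chosen purely for illustration):

# Beta densities for a few (a, b) pairs
x <- seq(0.001, 0.999, length.out = 200)
plot(x, dbeta(x, 1, 1), type = "l", ylim = c(0, 3),
     xlab = expression(theta), ylab = "density", main = "Beta(a, b) densities")
lines(x, dbeta(x, 2, 2), col = "red")
lines(x, dbeta(x, 2, 5), col = "blue")
lines(x, dbeta(x, 0.5, 0.5), col = "darkgreen")
legend("top", legend = c("Beta(1,1)", "Beta(2,2)", "Beta(2,5)", "Beta(0.5,0.5)"),
       col = c("black", "red", "blue", "darkgreen"), lty = 1, cex = 0.8)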

Multiplying the prior and the likelihood gives the posterior: \[P(\theta | D) \propto Bin(N_1 | N_0+N_1, \theta)\,Beta(\theta|a,b) \propto Beta(\theta|N_1+a, N_0+b)\] where \(N_0\) is the number of tails and \(N_1\) the number of heads.

The posterior belongs to the same family of distributions as the prior (only the parameters change): the Beta distribution is said to be a conjugate prior for the Binomial likelihood. The posterior is obtained by simply adding the empirical counts to the prior parameters, hence the hyperparameters are often called pseudo-counts.

The Maximum A Posteriori (MAP) estimate is given by \[\hat{\theta}_{MAP} = \frac{a+N_1-1}{a+b+N-2}\] where N is the total number of trials.

With a uniform prior \(a=b=1\), this reduces to \[\hat{\theta}_{MAP} = \frac{N_1}{N}\] which is just the MLE.

The posterior mean is \[\bar{\theta} = \frac{a+N_1}{a+b+N}\] which is a convex combination of the prior mean and the MLE: \[\bar{\theta} = \lambda\frac{a}{a+b} + (1-\lambda)\hat{\theta}_{MLE}\] with \(\lambda := \frac{a+b}{a+b+N}\). Note that as \(N \rightarrow \infty\), \(\bar{\theta} \rightarrow \hat{\theta}_{MLE}\).
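A quick numerical check of this identity (using the same counts as the example below, i.e. a = 5, b = 2, N1 = 12, N0 = 14):

# posterior mean as a convex combination of prior mean and MLE
a <- 5; b <- 2; N1 <- 12; N0 <- 14; N <- N1 + N0
lambda    <- (a + b)/(a + b + N)
post_mean <- (a + N1)/(a + b + N)
convex    <- lambda * a/(a + b) + (1 - lambda) * N1/N
c(post_mean, convex)   # both equal 17/33 = 0.5151515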

# parameters
a <- 5   # prior pseudo-count of heads
b <- 2   # prior pseudo-count of tails
N1 <- 12 # number of heads observed
N0 <- 14 # number of tails observed
N <- N0 + N1 # total number of trials

theta_hat.MAP <- (a + N1 - 1)/(a + b + N - 2)
theta_hat.MLE <- N1/N

x <- seq(0, 1, 0.01)
plot(x, dbeta(x, a, b), type = "l", col = "red", lwd = 2,
     xlab = "", ylab = "", ylim = c(0, 4.5))                        # prior Be(5, 2)
lines(x, dbeta(x, N1 + 1, N0 + 1), lty = 2, col = "black", lwd = 2) # normalized likelihood Be(13, 15)
lines(x, dbeta(x, N1 + a, N0 + b), lty = 4, col = "blue", lwd = 2)  # posterior Be(17, 16)
abline(v = theta_hat.MLE, col = "black", lwd = 2)
abline(v = theta_hat.MAP, col = "blue", lwd = 2)
legend("topright", legend = c("prior Be(5, 2)", "likelihood Be(13, 15)", 
                              "posterior Be(17, 16)"), 
       col = c("red", "black", "blue"), lty = c(1, 2, 4), lwd = rep(2, 3), cex = 0.8)

The probability of heads in a single new coin toss is: \[P(\hat x = 1|D) = \int_0^1 P(x=1 | \theta) P(\theta | D)\, d\theta = \int_0^1\theta\, Beta(\theta | N_1 +a, N_0 +b)\, d\theta = E_{Beta}[\theta] = \frac {N_1 + a}{N_1+N_0+a+b}\]
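A one-line numerical check of this result (with the same counts as above; integrate is base R's numerical integration):

# numerical check of the predictive probability of heads
a <- 5; b <- 2; N1 <- 12; N0 <- 14
integrate(function(th) th * dbeta(th, N1 + a, N0 + b), 0, 1)$value  # 0.5151515
(N1 + a)/(N1 + N0 + a + b)                                          # same value from the closed form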

The probability of observing \(x\) heads in \(M\) future trials is:

\[P(x|D) = \int_0^1 Bin(x|M, \theta)P(\theta|D) d\theta\] \[= \int_0^1 Bin(x|M, \theta)Beta(\theta|N_1+a, N_0 +b) d\theta\] \[= {M \choose x} \frac{1}{B(N_1 +a, N_0 +b)} \int_0^1 \theta ^{x+N_1+a-1} (1-\theta)^{M-x+N_0+b-1} d\theta\] \[= {M \choose x} \frac {B(x+N_1+a, (M-x)+N_0+b)}{B(N_1+a, N_0+b)}\] This is the compound Beta-Binomial distribution, with mean and variance \(E[x] = M \frac {N_1+a}{N+a+b}\), \(var[x]=M\frac {(N_1+a)(N_0+b)}{(N+a+b)^2} \frac{N+a+b+M}{N+a+b+1}\).

# M = 100 future trials: how many heads do we expect?
M <- 100
ex <- M * (N1 + a)/(N + a + b); ex 
## [1] 51.51515
var <- M * ((N1 + a) * (N0 + b)) / (N + a + b)^2 * ((N + a + b + M)/(N + a + b + 1))
sd <- sqrt(var)

# plug-in binomial approximation to the predictive distribution
x <- seq(0, 100, 1)
plot(x, dbinom(x, size = M, prob = (N1 + a)/(N + a + b)), type = 'l', 
     ylab = 'f(x)', main = "How many heads?")
abline(v = ex)
abline(v = ex - sd, col = 'red')
abline(v = ex + sd, col = 'red')
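The dbinom curve above is only a plug-in approximation that fixes \(\theta\) at its posterior mean; the exact predictive distribution is the Beta-Binomial derived above and is more spread out. A short sketch of the exact pmf (same assumed counts as before):

# exact Beta-Binomial predictive pmf from the closed form above
a <- 5; b <- 2; N1 <- 12; N0 <- 14; M <- 100
x <- 0:M
bb <- choose(M, x) * beta(x + N1 + a, M - x + N0 + b) / beta(N1 + a, N0 + b)
sum(bb)      # sums to 1
sum(x * bb)  # mean = M*(N1 + a)/(N + a + b) = 51.51515
plot(x, bb, type = 'l', xlab = 'heads in M = 100 future tosses', ylab = 'f(x)')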

Dirichlet-Multinomial Distribution (Bayesian Learning)

Tossing a die. Suppose we observe \(N\) die rolls \(D = \{x_1, x_2, ..., x_N\}\), where each \(x_i \in \{1, ..., K\}\). The probability of the data, \(P(D|\theta)\), is given by \[P(D|\theta) = \prod^K_{k=1} \theta_k^{N_k}\]

where \(\theta_k\) is the (unknown) probability of showing face k and \(N_k\) is the observed number of outcomes showing face k.

The probability of observing face \(k\) exactly \(x_k\) times in \(n\) rolls of a die with face probabilities \(\theta := (\theta_k,\ k \in \{1, ..., K\})\) is given by the multinomial distribution: \[Mu(x|n, \theta) = {n \choose {x_1, x_2, ..., x_K}} \prod_{k=1}^K \theta_k^{x_k}\] Since the parameter vector \(\theta\) lives in the K-dimensional simplex \[S_K := \{(x_1, ..., x_K) \mid x_k \geq 0,\ x_1 + ... + x_K = 1\},\] we need a prior with support on this simplex that is ideally also conjugate to the likelihood, i.e. to the multinomial. That prior is the Dirichlet distribution.
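Before moving on to the prior, a small sketch evaluating this pmf with base R's dmultinom (the counts and the fair die are assumed purely for illustration):

# probability of counts (2, 1, 0, 1, 1, 1) in n = 6 rolls of a fair die
x <- c(2, 1, 0, 1, 1, 1)
theta <- rep(1/6, 6)
dmultinom(x, size = sum(x), prob = theta)
# same value from the closed form: n!/prod(x_k!) * prod(theta_k^x_k)
factorial(6) / prod(factorial(x)) * prod(theta^x)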

The Dirichlet distribution is given by \[Dir(x|\alpha) := \frac{1}{B(\alpha)} \prod_{k=1}^K x_k^{\alpha_k-1}\, 1[x \in S_K]\] where \(B(\alpha)\) is the normalization factor \[B(\alpha) := \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\alpha_0)}\] with \(\alpha_0 := \sum_{k=1}^K \alpha_k\).
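As a sanity check of this definition, the density can be evaluated from the closed form and compared with MCMCpack's ddirichlet (also used below); the \(\alpha\) and the evaluation point here are assumed for illustration:

# check the closed-form Dirichlet density against MCMCpack::ddirichlet
library(MCMCpack)
alpha <- c(2, 3, 4)
x <- c(0.2, 0.3, 0.5)                        # a point on the simplex
B <- prod(gamma(alpha)) / gamma(sum(alpha))  # normalizing factor B(alpha)
prod(x^(alpha - 1)) / B                      # density from the definition
ddirichlet(x, alpha)                         # same value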

The Dirichlet distribution’s parameter \(\alpha\) controls the strength and location of the peak. In the three-dimensional case, Dir(1,1,1) is the uniform distribution on the simplex, and as the (common) parameter value increases, the distribution becomes more sharply peaked at the center. When \(\alpha_k<1\), the density instead spikes towards the corresponding corners and edges of the simplex.

# Drawing some Dirichlet distributions
library(Compositional)
library(MCMCpack)
## Loading required package: coda
## Loading required package: MASS
## ##
## ## Markov Chain Monte Carlo Package (MCMCpack)
## ## Copyright (C) 2003-2018 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
## ##
## ## Support provided by the U.S. National Science Foundation
## ## (Grants SES-0350646 and SES-0350613)
## ##
# uniform alpha, a = 1
draws <- rdirichlet(200, c(1,1,1))
bivt.contour(draws)

# uniform alpha, a = 2
draws <- rdirichlet(200, c(2,2,2))
bivt.contour(draws)

# uniform alpha, a = 20
draws <- rdirichlet(200, c(20,20,20))
bivt.contour(draws)

# non-uniform alpha, a1 < 1
draws <- rdirichlet(200, c(0.5,2,2))
bivt.contour(draws)

# Different way of plotting
library("MCMCpack")
DiriPlotData <- function(gridsize = 100, alpha) {
  # needs a grid spacing of at most 1/100, otherwise there will be nasty
  # outliers on the z side that corrupt the plot
  gridaxis <- seq(0, 1, length = gridsize)
  dens <- matrix(nrow = gridsize, ncol = gridsize)
  for (x in 1:gridsize) {
    for (y in 1:gridsize) {
      dens[x, y] <- NA
      if ((gridaxis[x] + gridaxis[y]) < 1)
        dens[x, y] <- ddirichlet(c(gridaxis[x], gridaxis[y], 1 - gridaxis[x] - gridaxis[y]), alpha)
    }
  }
  list(gridline = gridaxis, values = dens)
}


DiriPersp <- function(alph = c(2, 3, 2), ...) {
  diridata <- DiriPlotData(gridsize = 100, alpha = alph)
  persp(diridata$gridline, diridata$gridline, 
        diridata$values, zlim = c(-3, 3), theta = 310, 
        expand = 0.5, col = "lightblue", ticktype = "detailed", 
        xlab = "X", ylab = "Y", zlab = "density", ...)
  title(main = bquote(list(alpha[1] == .(alph[1]), alpha[2] == .(alph[2]), alpha[3] == .(alph[3]))))
}

DiriPersp(c(4, 4, 2), phi = 50)
## Warning in persp.default(diridata$gridline, diridata$gridline,
## diridata$values, : surface extends beyond the box

DiriPersp(c(2, 4, 4), phi = 50)
## Warning in persp.default(diridata$gridline, diridata$gridline,
## diridata$values, : surface extends beyond the box

DiriPersp(c(2, 4, 2), phi = 40)
## Warning in persp.default(diridata$gridline, diridata$gridline,
## diridata$values, : surface extends beyond the box

DiriPersp(c(2, 2, 2), phi = 40)
## Warning in persp.default(diridata$gridline, diridata$gridline,
## diridata$values, : surface extends beyond the box

DiriPersp(c(1, 1, 1), phi = 30)

DiriPersp(c(0.5, 0.5, 0.5), phi = 30)
## Warning in persp.default(diridata$gridline, diridata$gridline,
## diridata$values, : surface extends beyond the box

Prior: the Dirichlet prior \[Dir(\theta | \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^K \theta_k^{\alpha_k-1}\]

Posterior: \[P(\theta|D) \propto P(D|\theta) P(\theta)\] \[\propto \prod_{k=1}^K \theta_k^{N_k} \theta_k^{\alpha_k -1}\] \[= \prod_{k=1}^K \theta_k^{\alpha_k + N_k -1}\] \[\propto Dir(\theta| \alpha_1 + N_1, ..., \alpha_K + N_K)\] The MAP estimate maximizes this posterior subject to the probabilities summing to one: \[\max_{\theta} \prod_{k=1}^K \theta_k^{\alpha_k + N_k -1} \ \ \ subject \ to \ \ \ \theta_1 + \theta_2 + ... + \theta_K = 1\] Using Lagrange multipliers, \[\hat \theta_k = \frac{\alpha_k + N_k -1}{\alpha_0 + N - K}\] With a uniform prior (\(\alpha_k = 1\)) this becomes \(\hat\theta_k = \frac {N_k}{N}\), which is the MLE.
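A small numerical sketch of this MAP formula (the face counts and the prior below are assumed for illustration):

# MAP estimate: theta_k = (alpha_k + N_k - 1)/(alpha_0 + N - K)
N_k   <- c(3, 5, 2, 4, 1, 5)   # assumed observed face counts for a six-sided die
alpha <- rep(2, 6)             # assumed Dirichlet prior
K <- length(N_k); N <- sum(N_k); alpha_0 <- sum(alpha)
theta_map <- (alpha + N_k - 1)/(alpha_0 + N - K)
theta_map; sum(theta_map)      # the estimates sum to 1
# with a uniform prior (alpha_k = 1) this reduces to the MLE N_k/N
(rep(1, 6) + N_k - 1)/(6 + N - 6)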

The posterior predictive probability of face \(k\) is \[P(X = k|D) = \int P(X = k | \theta) P(\theta | D)\, d\theta = \int \theta_k \left[\int P(\theta_{-k}, \theta_{k} | D)\, d\theta_{-k}\right] d\theta_k = \int \theta_k\, P(\theta_{k}| D)\, d\theta_k = E[\theta_k|D] = \frac {\alpha_k + N_k}{\alpha_0 + N}\]

Let’s work through one example on this topic. Suppose we observe the following sentences: "Mary had a little lamb, little lamb, little lamb" and "Mary had a little lamb, it’s fleece as white as snow". Can we predict which word comes next?

  • Vocabulary: mary, lamb, little, big, fleece, white, black, snow, rain, unk (unknown)
  • Representation: each word is mapped to an index from 1 to 10
  • Stop words: a, as, the (remove them)
  • Stemming: i.e. reduce words to their base form by removing plurals etc.; for example, “running” becomes “run”.
vocabulary <- c("mary", "lamb", "little", "big", "fleece", "white", "black", "snow", "rain", "unk")
sequence <- c(1, 10, 3, 2, 3, 2, 3, 2, 3, 2, 1, 10, 3, 2, 10, 5, 10, 6, 8)
occurence <- table(factor(sequence, levels = 1:10))  # word counts N_k, one per vocabulary entry

# using Dir(1, 1, ..., 1) as prior
alpha <- rep(1, 10)
alpha_k <- alpha[1]   # all components are equal for a symmetric prior
alpha_0 <- sum(alpha)
N <- length(sequence)

# posterior predictive probability of each word: (alpha_k + N_k)/(alpha_0 + N)
E_theta <- (alpha_k + occurence)/(alpha_0 + N)
plot(E_theta)

# return the most probable words
vocabulary[E_theta == max(E_theta)]
## [1] "lamb"   "little"
# using Dir(20, ..., 20) as prior
alpha <- rep(20, 10)
alpha_k <- alpha[1]
alpha_0 <- sum(alpha)
N <- length(sequence)

E_theta <- (alpha_k + occurence)/(alpha_0 + N)
plot(E_theta)

# return the most probable words
vocabulary[E_theta == max(E_theta)]
## [1] "lamb"   "little"

Dirichlet Example 2

This example concerns the efficacy of a new antibiotic in patients who are hospitalised in the Pediatric Intensive Care Unit (PICU) and who are severely infected by pneumococci (which are associated with pneumonia, meningitis, and septicaemia, among other conditions). The possible results after the infection are:

  • to survive in good condition
  • to have a sequel
  • or to die

An expert is asked to provide judgements about the proportions of patients who will have each of these possible results. Denoting these proportions by \(\pi_1, \pi_2, \pi_3\) these form a set of proportions that must sum to 1.

The Dirichlet distribution is parameterised by \[f(\pi_1, \pi_2, \pi_3) \propto \prod_{i=1}^{3} \pi_{i}^{d_i-1}\]

with \[n=\sum_{i=1}^{3} d_i\]
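The elicitation workflow below relies on the fact that each marginal of a Dirichlet distribution is a Beta distribution: if \((\pi_1, \pi_2, \pi_3) \sim Dir(d_1, d_2, d_3)\), then \(\pi_i \sim Beta(d_i, n - d_i)\). A quick simulation check (the parameter vector d is assumed for illustration):

# marginal of a Dirichlet is a Beta: pi_i ~ Beta(d_i, n - d_i)
library(MCMCpack)
set.seed(1)
d <- c(4, 2, 1); n <- sum(d)
draws <- rdirichlet(10000, d)
c(mean(draws[, 1]), d[1]/n)                          # empirical vs theoretical mean of pi_1
c(var(draws[, 1]), d[1]*(n - d[1])/(n^2*(n + 1)))    # empirical vs theoretical variance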

The elicited judgements for the three marginal proportions were

               Good outcome   Sequel   Dead
Lower quartile         0.50     0.22   0.11
Median                 0.55     0.30   0.15
Upper quartile         0.60     0.35   0.20

For each marginal proportion \(\pi_i\), the expert has provided a lower quartile, a median and an upper quartile, so we define a single vector of probabilities, specifying which quantiles have been elicited.

(Here you can see how the individually elicited beta distributions are reconciled into a single Dirichlet distribution.)

p1 <- c(0.25, 0.5, 0.75)   # the elicited quantiles: lower quartile, median, upper quartile
# elicited values from the table above (rows: quartiles; columns: outcomes)
dt <- cbind(good = c(0.50, 0.55, 0.60),
            seql = c(0.22, 0.30, 0.35),
            dead = c(0.11, 0.15, 0.20))
v.good <- dt[,1]
v.seql <- dt[,2]
v.dead <- dt[,3]

# fit beta distribution
library(SHELF)
fit.good <- fitdist(vals = v.good, probs = p1, lower = 0, upper = 1)
fit.seql <- fitdist(vals = v.seql, probs = p1, lower = 0, upper = 1)
fit.dead <- fitdist(vals = v.dead, probs = p1, lower = 0, upper = 1)

plotfit(fit.good, ql = 0.1, qu = 0.9, d = "beta")

plotfit(fit.seql, ql = 0.1, qu = 0.9, d = "beta")

plotfit(fit.dead, ql = 0.1, qu = 0.9, d = "beta")

feedback(fit.good, quantiles = c(0.1, 0.9))
## $fitted.quantiles
##     Normal Student-t Gamma Log normal Log Student-t  Beta
## 0.1  0.455     0.443 0.459      0.461         0.451 0.455
## 0.9  0.645     0.657 0.650      0.653         0.667 0.643
## 
## $fitted.probabilities
##      elicited Normal Student-t Gamma Log normal Log Student-t  Beta
## 0.5      0.25   0.25      0.25 0.247      0.246         0.246 0.251
## 0.55     0.50   0.50      0.50 0.505      0.507         0.506 0.499
## 0.6      0.75   0.75      0.75 0.747      0.746         0.745 0.751
feedback(fit.seql, quantiles = c(0.1, 0.9))
## $fitted.quantiles
##     Normal Student-t Gamma Log normal Log Student-t  Beta
## 0.1  0.166     0.151 0.178      0.182         0.173 0.175
## 0.9  0.417     0.434 0.438      0.453         0.481 0.427
## 
## $fitted.probabilities
##      elicited Normal Student-t Gamma Log normal Log Student-t  Beta
## 0.22     0.25  0.233     0.231 0.230      0.228         0.226 0.231
## 0.3      0.50  0.535     0.532 0.545      0.550         0.547 0.541
## 0.35     0.75  0.725     0.724 0.716      0.711         0.711 0.719
feedback(fit.dead, quantiles = c(0.1, 0.9))
## $fitted.quantiles
##     Normal Student-t  Gamma Log normal Log Student-t  Beta
## 0.1  0.067    0.0558 0.0804     0.0845        0.0786 0.079
## 0.9  0.239    0.2490 0.2520     0.2630        0.2830 0.249
## 
## $fitted.probabilities
##      elicited Normal Student-t Gamma Log normal Log Student-t  Beta
## 0.11     0.25  0.261     0.262 0.251      0.247         0.246 0.253
## 0.15     0.50  0.483     0.485 0.498      0.506         0.505 0.496
## 0.2      0.75  0.760     0.760 0.751      0.746         0.746 0.753
# fit dirichlet distribution
d.fit.opt <- fitDirichlet(fit.good, fit.seql, fit.dead,
categories = c("Good outcome","Sequel","Dead"), n.fitted = "opt")

## 
## Directly elicted beta marginal distributions:
##  
##        Good outcome  Sequel    Dead
## shape1      24.9000  6.2000  4.6000
## shape2      20.4000 14.7000 24.4000
## mean         0.5490  0.2960  0.1590
## sd           0.0731  0.0975  0.0667
## sum         45.3000 20.9000 29.0000
## 
## Sum of elicited marginal means: 1.004
##  
## Beta marginal distributions from Dirichlet fit:
##  
##        Good outcome  Sequel    Dead
## shape1      16.6000  8.9600  4.8000
## shape2      13.8000 21.4000 25.6000
## mean         0.5470  0.2950  0.1580
## sd           0.0889  0.0814  0.0651
## sum         30.4000 30.4000 30.4000
feedbackDirichlet(d.fit.opt, quantiles = c(0.1, 0.5, 0.9))
##     Good.outcome Sequel Dead
## 0.1         0.43   0.19 0.08
## 0.5         0.55   0.29 0.15
## 0.9         0.66   0.40 0.25
d.fit.min <- fitDirichlet(fit.good, fit.seql, fit.dead,
categories = c("Good outcome","Sequel","Dead"), n.fitted = "min")

## 
## Directly elicted beta marginal distributions:
##  
##        Good outcome  Sequel    Dead
## shape1      24.9000  6.2000  4.6000
## shape2      20.4000 14.7000 24.4000
## mean         0.5490  0.2960  0.1590
## sd           0.0731  0.0975  0.0667
## sum         45.3000 20.9000 29.0000
## 
## Sum of elicited marginal means: 1.004
##  
## Beta marginal distributions from Dirichlet fit:
##  
##        Good outcome  Sequel    Dead
## shape1       11.500  6.1800  3.3100
## shape2        9.480 14.8000 17.6000
## mean          0.547  0.2950  0.1580
## sd            0.106  0.0974  0.0779
## sum          20.900 20.9000 20.9000
feedbackDirichlet(d.fit.min, quantiles = c(0.1, 0.5, 0.9))
##     Good.outcome Sequel  Dead
## 0.1         0.41   0.17 0.067
## 0.5         0.55   0.29 0.150
## 0.9         0.68   0.43 0.260

For more about the graph above, see this post. In addition, check this page for more graphics.

Cauchy Distribution

The code below plots the Cauchy density for several values of the location parameter x0 and the scale parameter gamma.

# Cauchy distribution
x <- seq(-5,5,0.01)
plot(x, dcauchy(x, 0, 0.5), xlab = 'x', ylab = 'probability density', ylim = c(0, 0.7),
     main = "Cauchy distribution", type = 'l', yaxs = "i", col = "blue")
lines(x, dcauchy(x, 0, 1), col = "red", lty = 2); 
lines(x, dcauchy(x, 0, 2), col = "green", lty = 3); 
lines(x, dcauchy(x, -2, 1), col = "purple", lty = 4)
legend(2.0, 0.65, legend = c("x0=0, gamma=0.5", "x0=0, gamma=1", 
                             "x0=0, gamma=2", "x0=-2, gamma=1"), 
       col = c("blue", "red", "green", "purple"), lty = 1:4, cex = 0.8)