Normal & Extra Distribution

Contents

Selected Model

  • Normal distribution related: most of extra models below

Extra Models (for advanced students, self study)

  • Chi-squared, T, F, Extreme value, Log-normal, Cauchy, Weibull

Multivariate Model

  • Multinomial, Multivariate normal, Dirichlet distribution

Tutorial

Normal Distribution (s9)

Suppose that the weights of blocks are normally distributed with a mean value of \(\mu\) = 11.0 kg and a standard deviation of \(\sigma\) = 0.3 kg. Get the probability that a block weighs less than 10.5kg.

pnorm(10.5, mean = 11, sd = 0.3)
## [1] 0.04779035
x <- seq(9, 13, length = 100)
hx <- dnorm(x, 11, 0.3)
plot(x, hx, type = "l", lty = 1, xlab = "x", ylim = c(0,2),
     ylab = "f(x)", main = "normal distribution")
abline(v = 10.5, col = 'red', lty = 2)

Linear combination of normal RV (s11)

Let’s assume that we have two different experimental results \(X_{A}\) and \(X_{B}\) for a biological activity from two different methods A and B, showing each of result is normally distributed. If both results have similar mean values but different variations which are 2.97 from Method A and 1.62 from Method B, then, what could be the best estimation of “Y” in the form of linear combination of both results \(X_{A}\) and \(X_{B}\) to get the minimized variation, and what is the variation of the Y?

\(Y=pX_A + (1-p)X_B\)

\(\mu_Y = pE(X_A)+(1-p)E(X_B) = pX + (1-p)X = X\)

\(\sigma_Y^2 = p^2 Var(X_A) + (1-p)^2 Var(X_B) = 2.97p^2 + 1.62(1-p)^2\)

The best weight \(p\) can be obtained from the derivatives of \(\sigma_Y^2\) with respect to \(p\) as,

\(\frac{d\sigma_Y^2}{dp} = 5.94p - 3.24(1-p) = 0, p=0.35\)

The best estimate of \(Y = 0.35X_A + 0.65X_B\)

\(\sigma_Y^2 = 2.97 \times 0.35^2 + 1.62 \times 0.65^2 = 1.05\), which is smaller than that of method B, 1.62.

# Estimation
Xa <- 2.97
Xb <- 1.62
p <- 0.35

mu_Y <- p * Xa + (1 - p) * Xb
var_Y <- p^2 * Xa + (1 - p)^2 * Xb
mu_Y; var_Y
## [1] 2.0925
## [1] 1.048275

Normal Approximation (s12 ~ s13)

Find probability of below and approximate it with normal distribution.

  • \(X \sim B(16, 0.5)\)

  • \(P(8 \leq X \leq 11)\)

# X ~ B(16, 0.5), P(8 <= X <= 11)
# Y ~ N(8, 4), P(8 - 0.5 <= Y <= 11 + 0.5) # continuity correction

# same quantities
# direct calculation of P(8 <= X <= 11)
(factorial(16)/(factorial(8)*factorial(8))) * 0.5^16 + 
  (factorial(16)/(factorial(9)*factorial(7))) * 0.5^16 +
    (factorial(16)/(factorial(10)*factorial(6))) * 0.5^16 +
      (factorial(16)/(factorial(11)*factorial(5))) * 0.5^16
## [1] 0.5597839
# by definition of dbinom
dbinom(8, 16, 0.5) + dbinom(9, 16, 0.5) + dbinom(10, 16, 0.5) + dbinom(11, 16, 0.5)
## [1] 0.5597839
# by definition of pbinom
pbinom(11, 16, 0.5) - pbinom(7, 16, 0.5) # including 8
## [1] 0.5597839
# from pnorm
mu <- 16*0.5 # mean
var <- 16*0.5^2 # variance
sd <- sqrt(var)
pnorm(11, mu, sd) - pnorm(8, mu, sd) # wo continuity correction +0.5, -0.5
## [1] 0.4331928
pnorm(11.5, mu, sd) - pnorm(7.5, mu, sd) # w continuity correction +0.5, -0.5
## [1] 0.5586472
# cf - arbitrary distribution (10 <= X <= 12)
pbinom(12, 20, 0.7) - pbinom(9, 20, 0.7) # including 10.
## [1] 0.2105834
pnorm(12.5, 20*0.7, sqrt(20*0.7*0.3)) - pnorm(9.5, 20*0.7, sqrt(20*0.7*0.3))
## [1] 0.2180531

Normal Approximation (s13)

Suppose that a particular vaccine has a probability of 0.0005 adverse reaction. When this vaccine is to be administered to 500,000 people, what is the probability to find people having adverse reaction between 203 to 297? Try to calculate with a Normal approximation. Check if your calculator can estimate the values with binomial model.

# binomial parameters
# n = 500,000 , p = 0.0005,  E(X)=np= 500,000 x 0.0005 = 250,  Var(X) = npq = 249.9, SD(X)=15.808
n <- 500000
p <- 0.0005
Ex <- n*p
Var <- n*p*(1 - p)
SD <- sqrt(Var)
Ex; Var; SD
## [1] 250
## [1] 249.875
## [1] 15.80743
plot(0:500, dbinom(0:500, 500000, 0.0005), type = 'l', lty = 2, 
     xlab = 'x', ylab = 'f(x)', main = 'binomial')

# using base R function
x <- rbinom(10000, n, p) # change sample number
mean(x); var(x); sd(x)
## [1] 250.121
## [1] 246.1566
## [1] 15.68938
# Normal approximation
# Y ~ N(250, 249.9)

plot(0:500, dbinom(0:500, 500000, 0.0005), type = 'l', lty = 2, xlab = 'x', 
     ylab = 'f(x)', main = 'Normal approximation')
lines(0:500, dnorm(0:500, mean = Ex, sd = SD), lty = 2, col = 'cyan')
abline(v = 250, col = 'red') # mean
abline(v = 250 + SD, col = 'blue', lty = 2) # 1 SD
abline(v = 250 - SD, col = 'blue', lty = 2)
abline(v = 250 + (2 * SD), col = 'green', lty = 2) #2 SD
abline(v = 250 - (2 * SD), col = 'green', lty = 2)
abline(v = 250 + (3 * SD), col = 'purple', lty = 2) # 3 SD
abline(v = 250 - (3 * SD), col = 'purple', lty = 2)

legend(0, 0.025, legend = c("Binomial", "Normal"), 
       col = c("black", "cyan"), lty = c(rep(2,2)), cex = 0.8)

legend(430, 0.025, legend = c("mean", "1SD", "2SD", "3SD"), 
       col = c("red", "blue", "green", "purple"), lty = c(1,rep(2,3)), cex = 0.8)

Log normal Distribution (s16)

Suppose that the reaction time of a person (the time elapsing between a stimulus and a consequent action) can be modeled by a lognormal distribution with parameter values

  • \(\mu\) = -0.35 and \(\sigma\) = 0.2.

Then, what is the probability that a reaction time is less than 0.6 seconds? - Notice that the above parameters, \(\mu\) and \(\sigma\) are for Normal distribution of ln(X).

mu <- -0.35
sd <- 0.2

# mean reaction time
Ex <- exp(mu + (sd^2/2))
mode <- exp(mu - (sd^2)) 
median <- exp(mu) 

# variance of reaction time
Var <- exp((2 * mu) + (sd^2))  * (exp(sd^2) - 1)

# variance of reaction time
SD <- sqrt(Var)
Ex; mode; median
## [1] 0.7189237
## [1] 0.6770569
## [1] 0.7046881
Var; SD
## [1] 0.0210931
## [1] 0.1452346
# using base R function
x <- rlnorm(10000, mu, sd)
mean(x) # mean
## [1] 0.7187682
median(x)
## [1] 0.7057261
var(x)
## [1] 0.02061011
sd(x)
## [1] 0.1435622
# P(X <= 0.6)
pnorm((log(0.6) - mu)/sd)
## [1] 0.2106615
plnorm(0.6, mu, sd)
## [1] 0.2106615
plot(seq(0,1.5,0.01),dlnorm(seq(0,1.5,0.01),-0.35,0.2),type = "l",xlab = "x",ylab = "f(x)", 
     ylim = c(0,3), main = 'Log Normal Distribution')
abline(v = Ex, col = 'red') # mean
abline(v = mode, col = 'green', lty = 2)
abline(v = median, col = 'cyan', lty = 2)
abline(v = Ex + SD, col = 'blue', lty = 2) # 1 SD
abline(v = Ex - SD, col = 'blue', lty = 2)

legend(1.1, 3, legend = c("mean", "mode", "median", "1SD from mean"), 
       col = c("red", "green", "cyan", "blue"), lty = c(1,rep(2,3)), cex = 0.8)

Gumbel distribution (s17)

The Gumbel distribution is used to model the distribution of extremes (maximum or minimum) of a number of samples of various distributions. The most well-known biological utility of the Gumbel distribution is the significance test of BLAST, a sequence alignment algorithm. Here we examine how the distribution looks and changes as the parameters adjust. Here we compare pdf and cdf of the Gumbel distribution and other extreme value distribution called Weibull distribution.

# Gumbel 
library(ordinal)
mu <- 0
beta <- 1
x <- seq(-5,10,0.01)
# maximum
# pdf
plot(x, dgumbel(x, mu, beta), type = 'l', ylab = "f(x)", xlab = "x", col = 1)
lines(x, dgumbel(x, 0, 2), lty = 2, col = 2)
lines(x, dgumbel(x, 0, 3), lty = 3, col = 3)
lines(x, dgumbel(x, 0, 20), lty = 4, col = 4)
legend("topright", legend = c("b = 1", "b = 2", "b = 3", "b=20"), 
       lty = c(1,2,3,4), col = 1:4, cex = 0.8)

# cdf
plot(x, pgumbel(x, mu, beta), type = 'l', ylab = "f(x)", xlab = "x", col = 1)
lines(x, pgumbel(x, 0, 2), lty = 2, col = 2)
lines(x, pgumbel(x, 0, 3), lty = 3, col = 3)
lines(x, pgumbel(x, 0, 20), lty = 4, col = 4)
legend("topright", legend = c("b = 1", "b = 2", "b = 3", "b=20"), 
       lty = c(1,2,3,4), col = 1:4, cex = 0.8)

# minimum
x <- seq(-10,5,0.01)
# pdf
plot(x, dgumbel(x, loc = mu, scale = beta, max = FALSE), type = 'l', ylab = "f(x)", xlab = "x")

# cdf
plot(x, pgumbel(x, loc = mu, scale = beta, max = FALSE), type = 'l', ylab = "f(x)", xlab = "x")

# cf) Weibull distribution
x <- seq(0,2,0.01)
# pdf
plot(x, dweibull(x, 0.8, 1), type = 'l', ylab = "f(x)", xlab = "x", ylim = c(0,2), col = 1)
lines(x, dweibull(x, 2, 1), lty = 2, col = 2)
lines(x, dweibull(x, 3.5, 1), lty = 3, col = 3)
legend("topright", legend = c("a = 0.8", "a = 2", "a = 3.5"), 
       lty = c(1,2,3), col = 1:3, cex = 0.8)

# a = 20, ylimit is changed from 2 to 8
plot(x, dweibull(x, 20, 1), type = 'l', ylab = "f(x)", xlab = "x", ylim = c(0,8), col = 4)
legend("topright", legend = "a = 20", lty = 1, cex = 0.8, col = 4)

# cdf
plot(x, pweibull(x, 0.8, 1), type = 'l', ylab = "f(x)", xlab = "x", ylim = c(0,1), col = 1)
lines(x, pweibull(x, 2, 1), lty = 2, col = 2)
lines(x, pweibull(x, 3.5, 1), lty = 3, col = 3)
lines(x, pweibull(x, 20, 1), lty = 4, col = 4)
legend("bottomright", legend = c("a = 0.8", "a = 2", "a = 3.5", "a=20"), 
       lty = c(1,2,3,4), col = c(1:4), cex = 0.8)

Weibull Distribution - survival time of a bacterium (s19)

Suppose that the random variable \(X\) measures the lifetime of a bacterium at a certain high temperature. Calculate the expected survival time of a bacterium and the variance of the bacteria lifetime assuming this random variable follows a Weibull distribution with \(a = 2\) and \(\lambda = 0.1\).

a <- 2 # shape parameter
lambda <- 0.1 # scale parameter (1/b)
Ex <- 1/lambda * gamma(1 + 1/a)
Var <- 1/lambda^2 * (gamma(1 + 2/a) - (gamma(1 + 1/a))^2)
SD <- sqrt(Var)
Ex; Var; SD
## [1] 8.862269
## [1] 21.46018
## [1] 4.632514
plot(function(x) dweibull(x, 2, 1/lambda), 0, 25, xlab = 'x', ylab = 'probability density', 
     ylim = c(0,0.1), main = "Weibull distribution", yaxs = "i", col = "blue")
abline(v = Ex, lty = 1)
abline(v = Ex - SD, col = "red", lty = 2)
abline(v = Ex + SD, col = "red", lty = 2)

# define function for expectation, variance, and standard deviation
# using rweibull(n, a, b = 1/lambda) function
stat_weib <- function(n, a, lambda)
{
  stat1 <- mean(rweibull(n, a, 1/lambda))
  stat2 <- var(rweibull(n, a, 1/lambda)) 
  stat3 <- sd(rweibull(n, a, 1/lambda))
  return(c(stat1, stat2, stat3))
}

# n = 1000
stat_weib(n = 10, a, lambda)
## [1]  8.437083 26.462425  6.926064
stat_weib(n = 100, a, lambda)
## [1]  8.382041 28.303061  4.744195
stat_weib(n = 1000, a, lambda)
## [1]  8.873992 23.695264  4.755758
stat_weib(n = 10000, a, lambda)
## [1]  8.827301 21.045193  4.597417
stat_weib(n = 20000, a, lambda) # to get exactly the same result, using "set.seed" function
## [1]  8.863528 21.365495  4.636942
c(Ex, Var, SD)
## [1]  8.862269 21.460184  4.632514

Weibull Distribution - recovery after treatment (s19)

A group of 82 patients are checked for their recovery after treatment. After 7days, nine of them have recovered and another 14 patients are found to have recovered after 14 days. If the time to recovery is modeled with a Weibull distribution, estimate the median time to recovery.

Solution: The recovery fraction of patients within certain days (X) can be calculated in a CDF of Weibull model. With two set of results, we can calculate two parameters \(a\) and \(\lambda\).

\(F(7) = 1 - e^{-(7 \lambda)^a} = 9/82\)

\(F(14) = 1 - e^{-(14\lambda)^a} = 24/82\)

solve for a, then \(a = log_{2}(ln(\frac{1}{1-F(14)})) - log_{2}(ln(\frac{1}{1-F(7)}))\)

solve for \(\lambda\), then

\(\lambda = ln(\frac{1}{1-F(7)}) \times (\frac{1}{7^a})^{(\frac{1}{a})}\)

At a median time,

\(F(X) = 1 - e^{-(\lambda X)^a} = 0.5\)

# given conditions
x1 <- 7
fx1 <- 9/82
x2 <- 15
fx2 <- 24/82
fmid <- 0.5


# model with Weibull distribution
a <- log2((log(1/(1 - fx2)))) - log2((log(1/(1 - fx1))))
lambda <- 1/7*(log(1/(1 - fx1)))^(1/a)
weibull <- function(x) 1 - exp((-1*(lambda*x)^a))
x_mid_Weibull <-  (log(1/(1 - fmid))*(1/lambda^a))^(1/a)

# 1. median recovery time by Weibull distribution
x_mid_Weibull
## [1] 21.7544
# model with Gumbel distribution
beta <- x1/(log(log(1/fx1)) - log(log(1/fx2)))
mu <- x1 + beta*log(log(1/fx1))
gumbel <- function(x) exp(-exp((-(x - mu)/beta)))
x_mid_Gumbel <- mu - beta*log(log(2))

# 2. median recovery time by Gumbel distribution
x_mid_Gumbel
## [1] 20.82831
# Inspect the CDF
x <- seq(1,100, 0.01)

plot(x, weibull(x), type = 'l', col = "blue", ylab = "Density")
lines(x, gumbel(x), col = "red")
abline(v = x_mid_Weibull, lty = 2, col = "blue")
abline(v = x_mid_Gumbel, lty = 2, col = "red")
abline(h = fx1, lty = 3, col = "black")
abline(h = fx2, lty = 3, col = "cyan")
legend("bottomright", legend = c("Weibull", "Gumbel"), 
       lty = c(1,1), col = c("blue", "red"), cex = 0.8)

# 3. Error analysis
err_Weibull <- (weibull(7) - fx1) + (weibull(14) - fx2)
err_Gumbel <-  (gumbel(7) - fx1) + (gumbel(14) - fx2)
err_Weibull; err_Gumbel
## [1] 4.163336e-17
## [1] -1.110223e-16
d <- err_Weibull < err_Gumbel
diff <- round(abs(err_Gumbel/err_Weibull),2)

# 4. conclusion 
print(paste0("Model with ", ifelse(d,"Weibull", "Gumbel")," distribution is ", diff, " times more accurate than model with ", ifelse(d,"Gumbel", "Weibull")," distribution"))
## [1] "Model with Gumbel distribution is 2.67 times more accurate than model with Weibull distribution"

Chi-squared Distribution (s20)

Chi-square distribution can be described as a special case of Gamma distribution. (\(\beta\) = 2, \(\alpha\) = df/2)

x <- seq(0,30,0.1)
plot(x,dchisq(x, 1),type = "l",xlab = "X^2",ylab = "f(x)", ylim = c(0,0.2),
     main = 'Chi-squared Distribution')
lines(x, dchisq(x, 5), col = "red", lty = 2)
lines(x, dchisq(x, 10), col = "blue", lty = 3)
lines(x, dchisq(x, 20), col = "green", lty = 4)
lines(x, dchisq(x, 40), col = "purple", lty = 5)

legend(40, 0.2, legend = c("df=1", "df=5", "df=10", "df=20", "df=40"), 
       col = c("black", "red", "blue", "green", "purple"), lty = 1:5, cex = 0.8)

# relationship between Gamma and Chi-squared Distribution
df = 10
plot(x,dchisq(x, df),type = "l",xlab = "X^2",ylab = "f(x)", ylim = c(0,0.1), lty = 2,
     main = 'relationship between Gamma and Chi-squared Distribution', lwd = 2)
lines(x, dgamma(x, shape = df/2, scale = 2), col = "blue", lty = 2, lwd = 0.4)
legend("topright", legend = c("chisq", "gamma"), col = c("black", "blue"), 
       lty = c(2,2), lwd = c(2, 0.4), cex = 0.8)

Chi-squared Distribution - Mendel’s experiment (s21)

In the famous Mendel’s independent assortment experiment, the phenotypic ratio of offspring from the dihybrid cross \((RrYy \times RrYy)\) is expected as \(9(RY):3(Ry):3(rY):1(ry)\).
If we get the results shown in the table below, how chi-squared test can determine if the result still follows the Mendel’s raw?

RY Ry rY rY Total
219 81 69 31 400

Chi-squared test estimate how observed values are similar to the expected values (goodness of fit) based on the chi-squared distribution model. We need to choose the model corresponding to the degree of freedom and the threshold to accept or reject the similarity (significance level). The statistic can be calculated by the following equation. \(\chi^2 = \Sigma_{i=1}^{k} \frac {[n_i - E(n_i)]^2}{E(n_i)} = \Sigma_{i=1}^{k} \frac {[n_i - np_i]^2}{np_i}\)

The expected values: \(RY=225,\ Ry=75,\ rY=75,\ ry=25\). (assuming the expected ratio is 9:3:3:1) The chi-square of this observation: \(\chi^2 =2.56\)

As the difference becomes bigger, we can reject that the result follows Mendel’s raw. In the distribution with \(df. = 3\), the probability to find the following \(\chi^2\) values or higher: \(P(\chi^2 \geq 6.251) = 0.1, P(\chi^2 \geq 7.815) = 0.05,P(\chi^2 \geq 11.34) = 0.01\)

obs <- c(219, 81, 69, 31)
exp <- c(9/16, 3/16, 3/16, 1/16)
chisq <- chisq.test(x = obs, p = exp)
chisq$statistic
## X-squared 
##      2.56
chisq$parameter
## df 
##  3
chisq$expected
## [1] 225  75  75  25
qchisq(0.1, 3, lower.tail = FALSE) # alpha = 0.1
## [1] 6.251389
qchisq(0.05, 3, lower.tail = FALSE) # alpha = 0.05
## [1] 7.814728
qchisq(0.01, 3, lower.tail = FALSE) # alpha = 0.01
## [1] 11.34487

t-Distribution - Knee bending angle (s23)

Let’s say that we selected 10 knee arthritis patients and we want to know whether the average knee bending angle of this group is different from that of overall knee arthritis population which is \(120 ^\circ\). From the following test results, can we conclude that this group is different from the typical patients? If it is, how significantly is it different?

Patients 1 2 3 4 5 6 7 8 9 10
Angle 120.6 116.4 117.2 118.1 114.1 116.9 113.3 121.1 116.9 117
ang <- c(120.6, 116.4, 117.2, 118.1, 114.1, 116.9, 113.3, 121.1, 116.9, 117.0)
x_bar <- mean(ang)
mu <- 120
s <- sd(ang)
n <- length(ang)

t <- (x_bar - mu) / (s/sqrt(n)) ; t
## [1] -3.692362
df <- n - 1
  
# cumulative distribution function
pt(t, df)
## [1] 0.002489563
plot(function(x) dt(x, df), -5, 5, ylim = c(0, 0.5), xlab = 't', ylab = 'probability density',
     main = "t - Distribution", yaxs = "i")
abline(v = t, col = 'red', lty = 2)

t-Distribution (additional)

The CEO of light bulbs manufacturing company claims that an average light bulb lasts 300 days. A researcher randomly selects 15 bulbs for testing. The sampled bulbs last an average of 290 days, with a standard deviation of 50 days. If the CEO’s claim were true, what is the probability that 15 randomly selected bulbs would have an average life of no more than 290 days

N <- 15
df <- N - 1
mu <- 300
mu_s <- 290
sd_s <- 50

t <- (mu_s - mu) / (sd_s/sqrt(N))

# cumulative distribution function
pt(t, df)
## [1] 0.2257313
plot(function(x) dt(x, df), -5, 5, ylim = c(0, 0.5), xlab = 't', ylab = 'probability density',
     main = "t-Distribution", yaxs = "i")
abline(v = t, col = 'red', lty = 2)

F-Distribution (s24)

Let X be an F random variable with 4 numerator degrees of freedom and 5 denominator degrees of freedom. What is the first percentile?

r1 <- 4
r2 <- 5
alpha <- 0.01

# 1st percentile
1/qf(1 - alpha, r2, r1)
## [1] 0.06442528
qf(alpha, r1, r2)
## [1] 0.06442528
# plot
x <- seq(0,4,0.01)
plot(x, df(x, 1,2), xlab = 'F', ylab = 'probability density', type = 'l', ylim = c(0, 2.1),
     main = "F - Distribution", yaxs = "i", col = 1)
lines(x, df(x, 2,3), lty = 2, col = 2)
lines(x, df(x, 3,4), lty = 3, col = 3)
lines(x, df(x, 4,5), lty = 4, col = 4)
lines(x, df(x, 100,100), lty = 5, col = 5)
legend("topright", legend = c("r1=1, r2=2", "r1=2, r2=3", "r1=3, r2=4", "r1=4, r2=5", "r1=100, r2=100"), lty = c(1,2,3,4,5), col = 1:5, cex = 0.8)

Multinomial Distribution (s27)

For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability \(p_k\), the multinomial distribution gives the probability of any particular combination of numbers of successes \((x_1, x_2,...,x_k)\) for the various categories.

Its probability mass function is (for \(\Sigma_i X_i=n\))

\(f(x_1, x_2, ..., x_k | n, p_1, ..., p_k) = \frac{n!}{x_1! x_2! ... x_k!} p_1^{x_1}...p_k^{x_k}\)

Expected value for: \(E[X_i] = np_i\)

variance and convariance:

\(Var(X_i) = np_i(1-p_i)\)

\(Cov(X_i, X_j) = -np_ip_j(i\neq j)\)

  • The multinomial distribution is a generalization of binomial distribution
  • for the particular case of n=1 observation, this distribution is called categorical distribution
  • for the particular case of k=2 categories, this distribution is called binomial distribution.

Q: In a recent three-way election for a large country, candidate A received 20% of the votes, candidate B received 30% of the votes, and candidate C received 50% of the votes. If six voters are selected randomly, what is the probability that there will be exactly one supporter for candidate A, two supporters for candidate B and three supporters for candidate C in the sample?

A: \(p(X_1 = 1, X_2 = 2, X_3 = 3 | n=6, p_1 = 0.2, p_2 = 0.3, p_3 = 0.5) = \frac{6!}{1!2!3!}0.2^1, 0.3^2 0.5^3 = 0.135\)

factorial(6)/(factorial(1)*factorial(2)*factorial(3)) * 0.2 * 0.3^2 * 0.5^3
## [1] 0.135
dmultinom(x = c(1,2,3), prob = c(0.2, 0.3, 0.5))
## [1] 0.135

Multinomial Distribution (s27)

Number of patients exhibiting different responses for a drug (not just yes or no). P1(hyper) = 0.12 , P2(middle) = 0.28, P3(weak) = 0.33, P4(No) = 0.27 in general. How can the doctors predict the types of responses for 9 new patients? The expected number of each reactions: np1= 9x1.02=1.08, np2=9x0.28, …. The probability that there are none of hyper, one is middle, four patients are weak & no response

# P(X1=0, X2=1, X3=4, X4=4)
x <- c(0,1,4,4)
p <- c(0.12, 0.28, 0.33, 0.27) # hyper, middle, weak, no
dmultinom(x, prob = p)
## [1] 0.01111756

Multivariate Normal Distribution (s29)

Measurements were taken on n heart-attack patients on their cholesterol levels. For each patient, measurements were taken 0, 2, and 4 days following the attack. 3 three measurements have a multivariate normal distribution. Treatment was given to reduce cholesterol level. The sample mean vector is:

x
0-Day 259.5
2-Day 230.8
4-Day 221.5

The covariance matrix is:

0-Day 2-Day 4-Day
0-Day 2276 1508 813
2-Day 1508 2206 1349
4-Day 813 1349 1865

Then, Calculate Mean and variance of difference between the 0-day and the 2-day measurements, \(X_{1} - X_{2}\) and what kind of distribution \(X_{1} - X_{2}\) will have.

# If we assume the three measurements have a multivariate normal distribution,
# then the distribution of the difference X1 ??? X2 has a univariate normal
# distribution.

X <- c(259.5, 230.8, 221.5)
names(X) <- c("0-Day", "2-Day", "4-Day")
Cov_X <- matrix(c(2276, 1508, 813, 1508, 2206, 1349, 813, 1349, 1865), nrow = 3)
colnames(Cov_X) <- names(X)
rownames(Cov_X) <- names(X)

Diff <- c(1, -1, 0)
mu <- Diff %*% X
var <- Diff %*% Cov_X %*% Diff

# in 2D
library(MASS)
bivn12 <- mvrnorm(1000, mu = c(X[1], X[2]),
                Sigma = Cov_X[1:2, 1:2])
bivn23 <- mvrnorm(1000, mu = c(X[2], X[3]),
                Sigma = Cov_X[2:3, 2:3])
bivn13 <- mvrnorm(1000, mu = c(X[1], X[3]),
                Sigma = Cov_X[c(1,3), c(1,3)])


# now we do a kernel density estimate
bivn.kde12 <- kde2d(bivn12[,1], bivn12[,2], n = 50)
bivn.kde23 <- kde2d(bivn23[,1], bivn23[,2], n = 50)
bivn.kde13 <- kde2d(bivn13[,1], bivn13[,2], n = 50)

##creating contour plots and saving them to files
par(mfrow = c(2,3))
image(bivn.kde12); contour(bivn.kde12, add = T); title("bivn.kde12")
image(bivn.kde23); contour(bivn.kde23, add = T); title("bivn.kde23")
image(bivn.kde13); contour(bivn.kde13, add = T); title("bivn.kde13")

# in 3D
persp(bivn.kde12, phi = 45, theta = 30, shade = .1, border = NA)
persp(bivn.kde23, phi = 45, theta = 30, shade = .1, border = NA)
persp(bivn.kde13, phi = 45, theta = 30, shade = .1, border = NA)

par(mfrow = c(1,1))

# interactive
# install.packages("plotly")
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ordinal':
## 
##     slice
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = bivn.kde12$x, y = bivn.kde12$y, z = bivn.kde12$z) %>% add_surface()
plot_ly(x = bivn.kde23$x, y = bivn.kde23$y, z = bivn.kde23$z) %>% add_surface()
plot_ly(x = bivn.kde13$x, y = bivn.kde13$y, z = bivn.kde13$z) %>% add_surface()
# If we assume the three measurements have a multivariate normal distribution,
# then the distribution of the difference X1 - X2 has a univariate normal
# distribution.