Distributions (PartI)

Discrete Uniform Distribution

Dice

par(mfrow = c(1,3))

dice300 <- sample(6,size = 300,replace = TRUE) 
hist(dice300); abline(h = 50, col = "red")
dice3000 <- sample(6,size = 3000,replace = TRUE) 
hist(dice3000); abline(h = 500, col = "red")
dice30000 <- sample(6,size = 30000,replace = TRUE) 
hist(dice30000); abline(h = 5000, col = "red")

par(mfrow = c(1,1))

table(dice300)

## dice300
##  1  2  3  4  5  6 
## 43 51 64 46 46 50

table(dice30000)

## dice30000
##    1    2    3    4    5    6 
## 5017 5015 5058 4975 5005 4930

# random number
sample(30:70, size = 27, replace = TRUE)     #choose 27 random numbers from 30 to 70

##  [1] 35 39 49 33 50 70 47 42 64 62 61 48 33 70 42 36 57 59 67 31 34 52 46
## [24] 31 39 57 38

# flip coin
sample(c("H", "T"), size = 100, replace = TRUE)               #to flip a fair coin 100 times

##   [1] "T" "T" "H" "H" "H" "H" "T" "H" "T" "H" "H" "H" "H" "H" "H" "H" "H"
##  [18] "T" "H" "T" "T" "H" "T" "H" "H" "H" "H" "H" "T" "H" "T" "T" "H" "T"
##  [35] "H" "T" "T" "H" "H" "H" "H" "T" "H" "H" "H" "T" "H" "T" "H" "H" "H"
##  [52] "H" "H" "T" "H" "H" "H" "H" "H" "T" "H" "H" "H" "H" "H" "T" "T" "T"
##  [69] "H" "H" "T" "H" "T" "T" "H" "T" "T" "T" "H" "H" "H" "T" "H" "T" "H"
##  [86] "H" "H" "T" "H" "H" "H" "T" "H" "H" "T" "T" "T" "T" "T" "H"

Binomial Distribution

x : number of success
size : number of trials (zero or more)
prob : probability of success on each trial
pseudo code
- dbinom(x, size, prob)
  - get \(P(X=x)\) where \(X~b\)(size, prob)
- pbinom(x, size, prob)
  - get \(P(X \leq x)\) where \(X~b\)(size, prob)
- qbinom(quantile, size, prob, lower.tail=TRUE)
  - get \(x\) where \(P(X \leq x)\)=quantile if lower.tail=TRUE, otherwise \(P(X > x)\)=quantile. (upper tail)
- rbinom(n, size, prob)
  - n times observations of X

expmf <- dbinom(0:3, 3, 0.5)
excmf <- pbinom(0:3, 3, 0.5)
par(mfrow = c(1,2))
plot(0:3, expmf, type = "p") #point graph
plot(0:3, excmf, type = "p")

par(mfrow = c(1,1))
rbinom(15, 3, 0.5)

##  [1] 2 2 1 2 2 0 0 1 1 1 2 3 1 1 2

qbinom(0.25, 3, 0.5)

## [1] 1

qbinom(0.25, 3, 0.5, lower.tail = FALSE)

## [1] 2

The DNA sequence of an organism consists of a long sequence of nucleotides adenine, guanine, cytosine, and thymine. It is known that in a very long sequence of 100 nucleotides, 22 nucleotides are guanine. In a sequence of 10 nucleotides, what is the probability that

exactly three nucleotides will be guanine?
at least four nucleotides will be guanine?

# a
dbinom(3, 10, 0.22)

## [1] 0.2244458

# b
pbinom(3, 10, 0.22, lower.tail = FALSE)

## [1] 0.1586739

1 - pbinom(3, 10, 0.22)

## [1] 0.1586739

Negative Binomial Distribution

x : number of failures prior to getting r successes
size : number of successes needed (often denoted by r in negative binomial distribution)
prob : probability of success on each trial
pseudocode
- dnbinom(x, size, prob)
  - get P(X=x) where X~nb(size, prob)
- pnbinom(x, size, prob)
  - get \(P(X \leq x)\) where X~nb(size, prob)
- qnbinom(quantile, size, prob, lower.tail = TRUE)
  - get x where \(P(X \leq x)\)=quantile if lower. tail = TRUE, otherwise \(P(X>x)\)=quantile.
- rnbinom(n, size, prob)
  - n times observations of X

par(mfrow = c(1,2))
plot(0:10, dnbinom(0:10, 3, 0.6), xlab = 'x', ylab = 'f(x)')
plot(0:10, pnbinom(0:10, 3, 0.6), xlab = 'x', ylab = 'F(x)')

par(mfrow = c(1,1))

Geometric Distribution

x : number of failures prior to getting first successes
prob : probability of success on each trial
pseudocode
- dgeom(x, prob)
  - get P(X=x) where X~Geo(size, prob)
- pgeom(x, prob, lower.tail=TRUE)
  - get \(P(X \leq x)\) where X~Geo(size, prob) if lower.tail=TRUE, otherwise P(X>x).
- qgeom(quantile, prob, lower.tail=TRUE)
  - get x where \(P(X \leq x)\)=quantile if lower.tail=TRUE, otherwise P(X>x)=quantile.
- rgeom(n, prob)
  - n times observations of X

par(mfrow = c(1,2))
plot(0:10, dgeom(0:10, 0.6), xlab = 'x', ylab = 'f(x)')
plot(0:10, pgeom(0:10, 0.6), xlab = 'x', ylab = 'F(x)')

par(mfrow = c(1,2))

A scientist is trying to produce a mutation in the genes of Drosophila by treating them with ionizing radiations such as \(\alpha -, \beta -, \gamma -\) or X-ray, which induces small deletions in the DNA of the chromosome. Assume that the probability of producing mutation at each attempt is 0.10.

What is the probability that the scientist will be able to produce a mutation at the third trial?
What is the average number of attempts the scientist is likely to require before getting the first mutation?

# a
p <- 0.10
x <- 2
dgeom(x,p)

## [1] 0.081

# b
ex <- 1/0.10
ex

## [1] 10

Poisson Distribution

x : number of successes
lambda : parameter of the Poisson distribution
pseudocode
- dpois (x, lambda)
  - get P(X=x) where X~Poi(lambda)
- ppois(x, lambda, lower.tail=TRUE)
  - get \(P(X \leq x)\) where X~Poi(lambda) if lower.tail=TRUE, otherwise P(X>x)
- qpois(quantile, x, lambda, lower.tail=TRUE)
  - get x where \(P(X \leq x)\)=quantile if lower.tail=TRUE, otherwise P(X>x)=quantile.
- rpois(n, lambda)
  - n times observations of X

par(mfrow = c(1,3))

plot(0:10, dpois(0:10, 0.5), xlab = "", ylab = "Prob", type = "p", main = "Lambda 0.5", ylim = c(0, 0.6))
plot(0:10, dpois(0:10, 2), xlab = "", ylab = "Prob", type = "p", main = "Lambda 2", ylim = c(0, 0.6))
plot(0:10, dpois(0:10, 5), xlab = "", ylab = "Prob", type = "p", main = "Lambda 5", ylim = c(0, 0.6))

par(mfrow = c(1,1))

An organism has two genes that may or may not express in a certain period of time. The number of genes expressing in the organism in an interval of time is distributed as Poisson variates with mean 1.5. Find the probability that in an interval

Neither gene expressed
Both genes expressed

# a
lambda1 <- 1.5
x <- 0
dpois(x,lambda1)

## [1] 0.2231302

# b
x <- 2
dpois(x,lambda1)

## [1] 0.2510214

Continuous Uniform Distribution

x : value
min : lower limits of the distribution. Must be finite.
max : upper limits of the distribution. Must be finite.
pseudocode
- dunif(x, min, max)
  - get pdf value at x where X~U(min,max)
- punif(x, min, max, lower.tail=TRUE)
  - get \(P(X \leq x)\) where X~U(min,max) if lower.tail=TRUE, otherwise, P(X>x)
- qunif(quantile, min, max, lower.tail=TRUE)
  - get x where \(P(X \leq x)\)=quantile if lower.tail=TRUE, otherwise P(X>x)=quantile.
- runif(n, min, max)
  - n times observations of X

ex1 <- seq(-4, 4, length = 400)
unifex1 <- dunif(ex1, 0, 2)
plot(ex1, unifex1, type = "l")

The hydrogen bonds linking base pairs together are relatively weak During DNA replication, two strands in the DNA separate along this line of weakness. This mode of replication, in which each replicated double helix contains one original (parental) and one newly synthesized strand (daughter), is called a semiconservative replication. Suppose in an organism, semiconservative replication occurs every half hour. A scientist decides to observe the semiconservative replication randomly. What is the probability that the scientist will have to wait at least 20 minutes to observe the semiconservative replication?

punif(30, 0, 30) - punif(20, 0, 30)

## [1] 0.3333333

Gamma Distribution

x: value
shape: shape parameters, often denoted by \(\alpha\)
scale: scale parameters, often denoted by \(\beta\)
pseudo code
- dgamma(x, shape=\(\alpha\), scale=\(\beta\))
  - get pdf value at x where \(X~gamma(\alpha, \beta)\)
- pgamma(x, shape=\(\alpha\), scale=\(\beta\), lower.tail=TRUE)
  - get \(P(X \leq x)\) where \(X~gamma(\alpha, \beta)\) if lower.tail=TRUE, otherwise, P(X>x)
- qgamma(quantile, shape = \(\alpha\)), scale = \(\beta\), lower.tail=TRUE
  - get x where \(P(X \leq x)\) = quantile if lower.tail=TRUE, otherwise(PX>x)=quantile
- rgamma(n, shape=\(\alpha\), scale=\(\beta\))
  - n times observations of X

x <- seq(0, 30, length = 100)
plot(x,dgamma(x,shape = 2, scale = 1), type = 'l',xlab = "x", ylab = "f(x)",main = "Gamma pdf's")
lines(x,dgamma(x,shape = 2,scale = 2),lty = 2)
lines(x,dgamma(x,shape = 2,scale = 4),lty = 3)
lines(x,dgamma(x,shape = 2,scale = 8),lty = 4)
legend(x = 10,y = .3,paste("Scale=",c(1,2,4,8)),lty = 1:4)

Translation is the process in which mRNA codon sequences are converted into an amino acid sequence. Suppose that in an organism, on average 30 mRNA are being converted per hour in accordance with a Poisson process. What is the probability that it will take more than five minutes before first two mRNA translate?

pgamma(5,shape = 2, scale = 2, lower.tail = FALSE)

## [1] 0.2872975

A gene mutation is a permanent change in the DNA sequence that makes up a gene. Suppose that in an organism, gene mutation occurs at a mean rate of \(\lambda = 1 /3\) per unit time according to a Poisson process. What is the probability that it will take more than five units of time until the second gene mutation occurs?

pgamma(5, shape = 2, scale = 3, lower.tail = FALSE)

## [1] 0.5036683

Exponential Distribution

Exponential Distribution : special case of the gamma distribution where \(\alpha=1\)
x : value
rate : exponential rate, defaults as 1.
pseudocode
- dexp(x, rate)
  - get pdf value at x where X~exp(rate)
- pexp(x, rate, lower.tail=TRUE)
  - get \(P(X \leq x)\) where X~exp(rate) if lower.tail=TRUE, otherwise, P(X>x)
- qexp(quantile, rate, lower.tail=TRUE)
  - get x where \(P(X \leq x)\)=quantile if lower.tail=TRUE, otherwise P(X>x)=quantile.
- rexp(n, rate)
  - n times observations of

ex4 <- seq(0, 4, length = 100)

par(mfrow = c(1,2))

plot(ex4, dexp(ex4,1),xlab = "x", ylab = "f(x)", type = "l", main = "Exponential PDF")
plot(ex4, pexp(ex4,1), xlab = "x", ylab = "F(x)", type = "l", main = "Exponential CDF")

par(mfrow = c(1,1))

Cell division is a process by which a cell (parent cell) divides into two cells (daughter cells). Suppose that in an organism, the cell division occurs according to the Poisson process at a mean rate of two divisions per six minutes. What is the probability that it will take at least five minutes for the first division of the cell to take place?

pexp(5,2/6, lower.tail = FALSE)

## [1] 0.1888756

Beta Distribution

x : value
Shape1,2 : shape parameters
pseducode
- dbeta(x, shape1, shape2)
  - get pdf value at x where X~Beta(shape1, shape2)
- pbeta(x, shape1, shape2, lower.tail=TRUE)
  - get \(P(X \leq x)\) where X~Beta(shape1, shape2) if lower.tail=TRUE, otherwise, P(X>x)
- qbeta(quantile, shape1, shape2, lower.tail=TRUE)
  - get x where \(P(X \leq x)\)=quantile if lower.tail=TRUE, otherwise P(X>x)=quantile.
- rbeta(n, shape1, shape2)
  - n times observations of X

data <- c(0.11, 0.10, 0.10, 0.16, 0.20, 0.32, 0.01, 0.02, 0.07, 0.05, 0.25, 0.14, 
          0.11, 0.12, 0.08, 0.13, 0.08, 0.14, 0.09, 0.08)
n <- length(data)

x <- seq(0,1,length = 200)
plot(x, dbeta(x, 2, 10),xlab = "prop. of acidic residues", ylab = "f(x)", 
     main = "Beta PDF for residue data")
points(x = data, y = rep(0,n))

Problem - I

A recent national study showed that approximately 44.7% of college students have used Wikipedia as a source in at least one of their term papers. Let X equal the number of students in a random sample of size n=31 who have used Wikipedia as a source.

1-1. How is X distributed?

1-2. Sketch the p.m.f. using R.

1-3. Sketch the c.m.f. using R.

1-4. Find the probability that X is equal to 17.

1-5. Find the probability that X is at most 13.

1-6. Find the probability that X is bigger than 11.

1-7. Find the probability that X is at least 15.

1-8. Find the probability that X is between 16 and 19, inclusive.

1-9. Give the mean of X, denoted \(E(X)\)

1-10. Give the variance of X.

1-11. Give the standard deviation of X.

1-12. Find \(E(4X + 51.324)\)

#chap 4-1
#1
print("binomial distribution with n = 31, p = 0.447")

## [1] "binomial distribution with n = 31, p = 0.447"

#2
HWpmf <- dbinom(0:31, 31, 0.447)
plot(0:31,HWpmf,type = 'p')

#3
HWcmf <- pbinom(0:31,31,0.447)
plot(0:31,HWcmf,type = 'p')

#4
dbinom(17,31,0.447)

## [1] 0.07532248

#5
pbinom(13,31,0.447)

## [1] 0.451357

#6
pbinom(11,31,0.447,lower.tail = FALSE)

## [1] 0.8020339

#7
pbinom(14,31,0.447,lower.tail = FALSE)

## [1] 0.406024

#8
pb <- dbinom(16:19,31,0.447)
pX <- sum(pb)
pb; pX

## [1] 0.10560875 0.07532248 0.04735464 0.02618995

## [1] 0.2544758

#9
L <- c(0:31)
A <- dbinom(0:31,31,0.447)
Ex <- sum(L*A)
Ex

## [1] 13.857

#10
Varx <- sum(L^2*A) - Ex^2
Varx

## [1] 7.662921

#11
SD <- Varx^(1/2)
SD

## [1] 2.768198

#12
Ex2 <- sum((4*L + 51.32)*A)
Ex2

## [1] 106.748

Problem - II

A DNA sequence of organism A has 100 base pairs(bp), and during DNA replication, they have 50% of mutation probabilities uniformly. And organism B has 10000 bps and have 0.5% of mutation probabilities.

2-1. Draw the probability mass function of mutation probability of A and B. Calculate the probability that less than 40 mutations occur during replication process for each organism.
2-2. Assume that the time required for DNA replication is same for each organisms, approximate the probability distribution of mutation in 1 DNA replication cycle as Poisson distribution for each organisms. And using this model, calculate the probability that less than 40 mutations occur during replication process for each organisms.
2-3. Compare the probability distribution of models in A and B graphically, describe the relationship of binomial distribution and Poisson distribution using those models.

#1
A <- dbinom(0:100,100,0.5)
plot(0:100,A,type = 'p',xlim = range(0:100),ylim = range(seq(0,0.1,0.001)))

#0~100, n=100, p=0.5 binomial distribution plot 
B <- dbinom(0:10000,10000,0.005)
plot(0:10000,B,type = 'p',xlim = range(0:100),ylim = range(seq(0,0.1,0.001)))

#binomial distribution plot where n=10000, p=0.005, from 0~100
A40 <- pbinom(39,100,0.5)
B40 <- pbinom(39,10000,0.005)

#2
Lambda1 <- 100*0.5  #lambda = np
Lambda2 <- 10000*0.005
Po40 <- ppois(39,Lambda1)
Po40

## [1] 0.06457037

#3
Po <- dpois(0:100,Lambda1)
plot(0:100,Po,type = 'p',xlim = range(0:100),ylim = range(seq(0,0.1,0.001)))

Distributions (PartI) - Practice

Bis335