Paper: Advanced R
Module: Monte Carlo Method and EM algorithm
Principal investigator:
Dr. Bhaswati Ganguli,
Professor, Department of Statistics, University of Calcutta
Paper co-ordinator:
Dr. Abhra Sarkar, Duke University
Content writer: Dr. Santu Ghosh,
Lecturer, Department of Environmental Health Engineering, Sri Ramachandra University, Chennai
Content reviewer: Dr. Bhaswati Ganguli,
Professor, Department of Statistics, University of Calcutta
Outline:
- Concepts of Monte Carlo methods
- Importance sampling
- The Expectation Maximization (EM) algorithm
- EM for Gaussian mixture models
Advanced R-GRV 3 /33
Suppose we wish to evaluate the integral
$$I = \int_{\mathcal{X}} g(x)\, f(x)\, dx,$$
where $g(x)$ is a complicated, unknown function and $f(x)$ is a known probability density function.

This integral can be thought of as the expectation $E[g(X)]$, where $X$ is a random variable with probability density $f(x)$.

By the law of large numbers, the expected value $I = E[g(X)]$ can be estimated by a sample mean. Let $x_1, x_2, \ldots, x_n$ be a sequence of random numbers generated according to the known density $f(x)$. Then
$$I \approx \frac{1}{n} \sum_{j=1}^{n} g(x_j).$$
This is Monte Carlo integration.
As an example, take the function
$$h(x) = [\cos(50x) + \sin(20x)]^2$$
and let
$$I = \int_0^1 h(x)\, f(x)\, dx,$$
where $f(x)$ is the $U(0,1)$ density, so that $I = E[h(U)]$ with $U \sim U(0,1)$.
h <- function(x) {
  (cos(50 * x) + sin(20 * x))^2  ## define the function h(x)
}
curve(h, xlab = "Function", ylab = "", lwd = 2)
[Figure: plot of the function h(x) over (0, 1)]
integrate(h, 0, 1)
## 0.9652009 with absolute error < 1.9e-10

x <- h(runif(10^4))
estint <- cumsum(x)/(1:10^4)
esterr <- sqrt(cumsum((x - estint)^2))/(1:10^4)
plot(estint, xlab = "Mean and error range", type = "l", lwd = 2,
     ylim = mean(x) + 20 * c(-esterr[10^4], esterr[10^4]), ylab = "")
lines(estint + 2 * esterr, col = "gold", lwd = 2)
lines(estint - 2 * esterr, col = "gold", lwd = 2)
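As a quick check (a small addition to the original code), the final running mean can be compared directly with the deterministic value returned by integrate() above:

estint[10^4]  # final Monte Carlo estimate of I; should be close to 0.9652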
[Figure: running Monte Carlo estimate of I, with a band of two estimated standard errors around the running mean]
Importance sampling: let $h$ be a probability density on $\mathbb{R}^d$ with $h(x) = 0$ only if $f(x)\varphi(x) = 0$, and such that, for $X \sim f$,
$$\theta = E(\varphi(X)) = \int_{\mathbb{R}^d} \varphi(x)\, f(x)\, dx$$
exists. Consider a sample $Z_1, \ldots, Z_n \sim h$. Then the importance sampling estimator
$$\hat{\theta}_{IS} = \frac{1}{n} \sum_{j=1}^{n} \frac{\varphi(Z_j)\, f(Z_j)}{h(Z_j)}$$
is an unbiased and consistent estimator of $\theta$. If $\mathrm{Var}(\varphi(Z) f(Z)/h(Z))$ exists, then
$$E\{(\hat{\theta}_{IS} - \theta)^2\} = \mathrm{Var}(\hat{\theta}_{IS}) = \frac{1}{n}\,\mathrm{Var}\!\left(\frac{\varphi(Z)\, f(Z)}{h(Z)}\right) = \frac{1}{n} \int_{\mathbb{R}^d} \frac{\varphi^2(x)\, f^2(x)}{h(x)}\, dx - \frac{\theta^2}{n}.$$
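To make the estimator concrete, here is a minimal generic sketch (added for illustration, not from the original slides; the choices phi(x) = x^4, target f = N(0,1), and proposal h = N(0, 2^2) are assumptions of this example, chosen so that the true value is theta = E[X^4] = 3):

set.seed(1)
n <- 1e5
phi <- function(x) x^4
Z <- rnorm(n, 0, 2)                   # sample Z_1,...,Z_n from the proposal h = N(0, 4)
w <- dnorm(Z, 0, 1)/dnorm(Z, 0, 2)    # importance weights f(Z)/h(Z)
theta.hat <- mean(phi(Z) * w)         # importance sampling estimator
se.hat <- sd(phi(Z) * w)/sqrt(n)      # estimated standard error
c(Est = theta.hat, SE = se.hat)       # Est should be close to the true value 3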
Suppose we want to use importance sampling to estimate $\theta = P(X > c) = E(I_{\{X > c\}})$ for $X \sim N(0, \sigma^2)$ and $c > 3\sigma$.

The plain Monte Carlo estimator
$$\hat{\theta} = \frac{1}{n} \sum_{j=1}^{n} I_{\{X_j > c\}} = \frac{1}{n}\, \#\{j \in \{1, 2, \ldots, n\} : X_j > c\}$$
will be quite poor, because the vast majority of sample values are completely irrelevant; only the very few that exceed $c$ matter.

We now use importance sampling to see more values above $c$. Take $h_\mu(x)$ to be the $N(\mu, \sigma^2)$ density; then $f(x) = h_0(x)$ and
$$\frac{f(x)}{h_\mu(x)} = \exp\!\left(\frac{\mu(\mu - 2x)}{2\sigma^2}\right).$$
The importance sampling estimator for $\mu \in \mathbb{R}$ is
$$\hat{\theta}_{IS} = \frac{1}{n} \sum_{j=1}^{n} I_{\{Z_j > c\}} \exp\!\left(\frac{\mu(\mu - 2Z_j)}{2\sigma^2}\right), \qquad Z_1, \ldots, Z_n \sim N(\mu, \sigma^2).$$
# Importance sampling for P(X > c) for N(0,1), based on N(mu,1).
# The function imp returns the entries for one row of our tables.
imp <- function(mu, c, n, X) {
  Z <- (mu + X > c) * exp(mu * (mu - 2 * (mu + X))/2)  # IS terms, with mu + X ~ N(mu, 1)
  IS <- mean(Z)
  standev <- sd(Z)/sqrt(n)
  c(Mu = mu, SE = standev, Est = IS,
    LCL.IS = IS - qnorm(0.975) * standev,
    UCL.IS = IS + qnorm(0.975) * standev)
}
n <- 1e+07  # we fix n and X in advance
X <- rnorm(n)  # and use the same random numbers for all estimates
c <- 3
mu <- c(0, 1, 2, 3, 3.1, 3.15, 3.2, 4)
tab <- NULL
for (i in 1:length(mu)) tab <- rbind(tab, imp(mu[i], c, n, X))
tab ## for c=3
## Mu SE Est LCL.IS UCL.IS
## [1,] 0.00 1.159176e-05 0.001345500 0.001322781 0.001368219
## [2,] 1.00 2.905203e-06 0.001351625 0.001345931 0.001357320
## [3,] 2.00 1.175589e-06 0.001349840 0.001347536 0.001352144
## [4,] 3.00 7.860725e-07 0.001350800 0.001349259 0.001352340
## [5,] 3.10 7.802970e-07 0.001350387 0.001348858 0.001351916
## [6,] 3.15 7.796503e-07 0.001350686 0.001349158 0.001352214
## [7,] 3.20 7.801656e-07 0.001350710 0.001349181 0.001352239
## [8,] 4.00 9.770570e-07 0.001349346 0.001347431 0.001351261
c <- 4.5
mu <- c(0, 4.5, 4.6, 4.7)
tab <- NULL
for (i in 1:length(mu)) tab <- rbind(tab, imp(mu[i], c, n, X))
tab ## for c=4.5
## Mu SE Est LCL.IS UCL.IS
## [1,] 0.0 5.291495e-07 2.800000e-06 1.762886e-06 3.837114e-06
## [2,] 4.5 2.425671e-09 3.400955e-06 3.396201e-06 3.405709e-06
## [3,] 4.6 2.415816e-09 3.399114e-06 3.394379e-06 3.403849e-06
## [4,] 4.7 2.422710e-09 3.400062e-06 3.395313e-06 3.404810e-06
In a simulation of n = 10,000,000 samples for $\sigma = 1$ and $c = 3$, we see that $\mu \approx 3.15$ gives the best accuracy: three significant digits, estimating $P(X > 3)$ by $0.00135$, whereas only the first digit was significant for the plain Monte Carlo estimator ($\mu = 0$).

Going a bit further, for $c = 4.5$ plain Monte Carlo does not find any significant digits, while importance sampling still finds two significant digits:
$$P(X > 4.5) \approx 0.0000034.$$
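Since the target here is a standard normal tail probability, the estimates can be checked against the exact values (a small sanity check added here, not part of the original slides):

pnorm(3, lower.tail = FALSE)    # exact P(X > 3), approximately 0.001349898
pnorm(4.5, lower.tail = FALSE)  # exact P(X > 4.5), approximately 3.397673e-06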
# To plot confidence intervals as mu varies
mu <- (0:120)/20  # choose range from 0 to 6 in steps of 0.05
impsave3 <- matrix(NA, 121, 5)  # initialize impsave3 to hold the rows returned by imp
for (i in 1:121) {
  impsave3[i, ] <- imp(mu[i], c, n, X)  # save imp in impsave3
}
plot(mu, impsave3[, 3], pch = "+")  # estimates
lines(mu, impsave3[, 4])  # lower CI bound
lines(mu, impsave3[, 5])  # upper CI bound
lines(c(3.15, 3.15), c(impsave3[64, 4], impsave3[64, 5]))  # emphasize mu = 3.15
lines(c(3.14, 3.14), c(impsave3[64, 4], impsave3[64, 5]))  # line too thin,
lines(c(3.16, 3.16), c(impsave3[64, 4], impsave3[64, 5]))  # so add a bit more
lines(c(0, 6), c(impsave3[64, 3], impsave3[64, 3]))  # horizontal line at the mu = 3.15 estimate
[Figure: importance sampling estimates (plotted as +) with upper and lower confidence bounds as mu ranges from 0 to 6; the confidence interval at mu = 3.15 and a horizontal line at its estimate are emphasized]
The EM algorithm

The EM (expectation-maximization) algorithm is a deterministic optimization technique (Dempster et al., 1977) that takes advantage of the representation
$$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta)\, dz$$
to build a sequence of easier maximization problems whose limit is the answer to the original problem. Here $Z$ is a latent variable, which can represent a missing component of the data.

We thus assume that we observe $X_1, \ldots, X_n$, jointly distributed from $g(x \mid \theta)$ satisfying the representation above, and that we want to compute
$$\hat{\theta} = \arg\max_\theta L(\theta \mid x) = \arg\max_\theta g(x \mid \theta).$$

This representation occurs in many statistical settings, including censoring models, mixtures, and latent variable models.
Pick a starting value $\hat{\theta}^{(0)}$ and set $m = 0$. Repeat:

1. E-step: compute
$$Q(\theta \mid \hat{\theta}^{(m)}, x) = E_{\hat{\theta}^{(m)}}[\log L_c(\theta \mid x, Z)],$$
where the expectation is with respect to $k(z \mid \hat{\theta}^{(m)}, x)$.

2. M-step: maximize $Q(\theta \mid \hat{\theta}^{(m)}, x)$ in $\theta$ and take
$$\hat{\theta}^{(m+1)} = \arg\max_\theta Q(\theta \mid \hat{\theta}^{(m)}, x),$$
then set $m = m + 1$,

until a fixed point is reached, i.e., $\hat{\theta}^{(m+1)} = \hat{\theta}^{(m)}$.
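In code, the whole procedure is a loop around the two steps. Below is a schematic sketch (an illustrative addition, not from the slides), assuming a model-specific helper EM.update (a hypothetical name) that performs the combined E- and M-step, i.e., returns the maximizer of Q(theta | theta.m, x):

# Generic EM driver (sketch); EM.update(theta, x) must return
# argmax over theta of Q(theta | theta.m, x) for the model at hand.
em <- function(x, theta0, EM.update, tol = 1e-6, maxit = 500) {
  theta <- theta0
  for (m in 1:maxit) {
    theta.new <- EM.update(theta, x)  # E-step and M-step combined
    if (max(abs(theta.new - theta)) < tol) break  # fixed point (approximately) reached
    theta <- theta.new
  }
  theta
}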
Example-I

Let $x = (y, z)$, where $y' = (y_1, y_2, \ldots, y_m)'$ denotes the observed data and $z' = (z_{m+1}, z_{m+2}, \ldots, z_n)'$ denotes the missing data. If the distribution $f(x - \theta)$ corresponds to the $N(\theta, 1)$ distribution, the complete-data likelihood is
$$L_c(\theta \mid y, z) \propto \prod_{i=1}^{m} \exp\{-(y_i - \theta)^2/2\} \prod_{i=m+1}^{n} \exp\{-(z_i - \theta)^2/2\},$$
resulting in the expected complete-data log-likelihood
$$Q(\theta \mid \theta_0, y) = -\frac{1}{2} \sum_{i=1}^{m} (y_i - \theta)^2 - \frac{1}{2} \sum_{i=m+1}^{n} E_{\theta_0}[(Z_i - \theta)^2],$$
where the missing observations $Z_i$ are distributed from a normal $N(\theta, 1)$ distribution truncated at $a$.

Doing the M-step (i.e., differentiating the function $Q(\theta \mid \theta_0, y)$ in $\theta$ and setting the derivative equal to $0$) then leads to the EM update
$$\hat{\theta} = \frac{m\bar{y} + (n - m)\, E_{\theta_0}[Z_1]}{n}.$$
Since $E_\theta[Z_1] = \theta + \frac{\varphi(a - \theta)}{1 - \Phi(a - \theta)}$, where $\varphi$ and $\Phi$ are the normal pdf and cdf respectively, the EM sequence is
$$\hat{\theta}^{(j+1)} = \frac{m}{n}\,\bar{y} + \frac{n - m}{n}\left[\hat{\theta}^{(j)} + \frac{\varphi(a - \hat{\theta}^{(j)})}{1 - \Phi(a - \hat{\theta}^{(j)})}\right].$$
R code
n <- 1000  # number of data points
X <- rnorm(n, 2.5, 1)  # simulated data
X[sample(1:n, n * 0.1)] <- NA  # make 10% of the values missing at random
y <- na.omit(X)  # observed data
m <- length(y)
ybar <- mean(y)
theta <- rnorm(1, mean = ybar, sd = sd(y))  # random starting value
a <- 1.5
nonstop <- iter <- 1
while (nonstop > 10^(-4)) {
  theta <- c(theta, m * ybar/n + (n - m) * (theta[iter] +
    dnorm(a - theta[iter])/pnorm(a - theta[iter]))/n)
  iter <- iter + 1
  nonstop <- abs(theta[iter] - theta[iter - 1])  # change between successive iterates
}
print(theta[length(theta)])
## [1] 2.736344
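Since the code stores the whole EM trajectory in theta, convergence is easy to inspect (a small add-on, not in the original code):

plot(theta, type = "b", xlab = "Iteration", ylab = "EM iterate",
     main = "EM sequence for the censored-normal example")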
Weather Forecasting in Snoqualmie Falls

The data consist of daily records, from the beginning of 1948 to the end of 1983, of precipitation at Snoqualmie Falls, Washington. Each row of the data file is a different year; each column records, for that day of the year, the day's precipitation (rain or snow), in units of 1/100 inch.

Because of leap days, there are 366 columns, with the last column having an NA value for three out of four years.

http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv

Let's look at the distribution of the amount of precipitation on the wet days.
snoqualmie <- read.csv("snoqualmie.csv", header = FALSE)
snoqualmie.vector <- na.omit(unlist(snoqualmie))
snoq <- snoqualmie.vector[snoqualmie.vector>0]
summary(snoq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 19.00 32.28 44.00 463.00
hist(snoq, breaks = 101, col = "grey", border = "grey", freq = FALSE,
     xlab = "Precipitation (1/100 inch)", main = "Precipitation in Snoqualmie Falls")
lines(density(snoq), lty = 2)
[Figure: histogram of precipitation on wet days in Snoqualmie Falls, with a simple kernel density estimate (dashed line)]

The histogram and kernel density estimate suggest that the distribution is rather skewed to the right.
Notice that the mean is larger than the median, and that the distance from the first quartile to the median is much smaller (13/100 of an inch of precipitation) than that from the median to the third quartile (25/100 of an inch).
One way this could arise, of course, is if there are multiple types of wet days, each with a different characteristic distribution of precipitation.
We’ll look at this by trying to fit Gaussian mixture models with varying numbers of components.
Two components
library(mixtools)
snoq.k2 <- normalmixEM(snoq, k = 2, maxit = 100, epsilon = 0.01)
## number of iterations= 61
summary(snoq.k2)
## summary of normalmixEM object:
## comp 1 comp 2
## lambda 0.569091 0.430909
## mu 10.615148 60.916020
## sigma 8.836114 45.246735
## loglik at estimate:  -32681.2

Let's define a function that overlays the scaled density of one mixture component on an existing plot:

plot.normal.components <- function(mixture, component.number, ...) {
  curve(mixture$lambda[component.number] *
          dnorm(x, mean = mixture$mu[component.number],
                sd = mixture$sigma[component.number]),
        add = TRUE, ...)
}
hist(snoq, breaks = 101, col = "grey", border = "grey", freq = FALSE,
     xlab = "Precipitation (1/100 inch)", main = "Precipitation in Snoqualmie Falls")
lines(density(snoq), lty = 2)
sapply(1:2, plot.normal.components, mixture = snoq.k2)
[Figure: histogram of precipitation with the two scaled Gaussian component densities overlaid]
The CDF of the Mixture

The CDF of a two-component mixture is $F(x) = \lambda_1 F_1(x) + \lambda_2 F_2(x)$. Similarly, for $k$ components,
$$F(x) = \sum_{i=1}^{k} \lambda_i F_i(x), \qquad \text{where } \sum_{i=1}^{k} \lambda_i = 1.$$
Let's define the following function, which evaluates the mixture CDF at x:
pnormmix <- function(x, mixture) {
  lambda <- mixture$lambda
  k <- length(lambda)
  pnorm.from.mix <- function(x, component) {
    lambda[component] * pnorm(x, mean = mixture$mu[component],
                              sd = mixture$sigma[component])
  }
  pnorms <- sapply(1:k, pnorm.from.mix, x = x)
  return(rowSums(pnorms))
}
distinct.snoq <- sort(unique(snoq))
tcdfs <- pnormmix(distinct.snoq, mixture = snoq.k2)
ecdfs <- ecdf(snoq)(distinct.snoq)
plot(tcdfs, ecdfs, xlab = "Theoretical CDF", ylab = "Empirical CDF", xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1)
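As a numeric companion to the plot (an added check, not in the original slides), the largest vertical gap between the two CDFs summarizes the miscalibration in a single number:

max(abs(tcdfs - ecdfs))  # largest discrepancy between theoretical and empirical CDFs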
[Figure: P-P plot of the empirical CDF against the theoretical CDF of the two-component mixture, with the line y = x for reference]
The two-component Gaussian mixture fit is not satisfactory, as the P-P plot shows.

One can select the number of components by cross-validation, or one can do a statistical test. For cross-validation we have to calculate the log-likelihood.
Cross Validation method
dnormalmix <- function(x, mixture, log = FALSE) {
  lambda <- mixture$lambda
  k <- length(lambda)
  # Calculate share of likelihood for all data for one
  # component
  like.component <- function(x, component) {
    lambda[component] * dnorm(x, mean = mixture$mu[component],
                              sd = mixture$sigma[component])
  }
  # Create array with likelihood shares from all components
  # over all data
  likes <- sapply(1:k, like.component, x = x)
  # Add up contributions from components
  d <- rowSums(likes)
  if (log) {
    d <- log(d)
  }
  return(d)
}

# Log-likelihood function
loglike.normalmix <- function(x, mixture) {
  loglike <- dnormalmix(x, mixture, log = TRUE)
  return(sum(loglike))
}
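As a quick sanity check (added here), applying this function to the fitted two-component model should reproduce the log-likelihood reported by summary(snoq.k2) above, about -32681.2:

loglike.normalmix(snoq, mixture = snoq.k2)  # should be close to -32681.2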
One can do five-fold or ten-fold cross-validation to select the number of components. But just to illustrate the approach we'll do simple data-set splitting, where a randomly selected half of the data is used to fit the model and the other half to test it.
n <- length(snoq)
data.points <- 1:n
set.seed(154)
data.points <- sample(data.points) # Permute randomly
train <- data.points[1:floor(n/2)] # First random half is training
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing
candidate.component.numbers <- 2:10
loglikes <- vector(length = 1 + length(candidate.component.numbers))
# k=1 needs special handling
mu <- mean(snoq[train]) # MLE of mean
sigma <- sd(snoq[train]) * sqrt((n - 1)/n) # MLE of standard deviation
loglikes[1] <- sum(dnorm(snoq[test], mu, sigma, log = TRUE))
for (k in candidate.component.numbers) {
  mixture <- normalmixEM(snoq[train], k = k, maxit = 400, epsilon = 0.01)
  loglikes[k] <- loglike.normalmix(snoq[test], mixture = mixture)
}
plot(x = 1:10, y = loglikes, xlab = "Number of mixture components",
     cex = 2, pch = 16, ylab = "Log-likelihood on testing data")
[Figure: log-likelihood on the testing data against the number of mixture components, 1 through 10]
The figure shows the log-likelihoods of mixture models of different sizes, each fit to a random half of the data for training and evaluated on the other half for testing. The test log-likelihood is maximized at nine components, so we fit a nine-component Gaussian mixture to the full data.
snoq.k9 <- normalmixEM(snoq, k = 9, maxit = 400, epsilon = 0.01)
## number of iterations= 291
summary(snoq.k9)
## summary of normalmixEM object:
## comp 1 comp 2 comp 3 comp 4 comp 5 comp 6 comp 7
## lambda 0.10807 0.112529 0.166482 0.13387 0.152741 0.126211 0.0504072
## mu 1.42988 3.433292 40.955614 7.48726 14.968790 26.099948 122.6941756
## sigma 0.51102 1.193913 11.785947 2.46536 4.582579 6.586004 40.9237907
## comp 8 comp 9
## lambda 0.140001 9.68975e-03
## mu 72.069995 2.12038e+02
## sigma 22.417844 8.43885e+01
## loglik at estimate: -30502.51
Nine components
hist(snoq, breaks = 101, col = "grey", border = "grey", freq = FALSE, xlab = "Precipitation (1/100 inch)", main = "Precipitation in Snoqualmie Falls")
lines(density(snoq), lty = 2)
sapply(1:9, plot.normal.components, mixture = snoq.k9)
[Figure: histogram of precipitation with the nine scaled Gaussian component densities overlaid]
distinct.snoq <- sort(unique(snoq))
tcdfs <- pnormmix(distinct.snoq, mixture = snoq.k9)
ecdfs <- ecdf(snoq)(distinct.snoq)
plot(tcdfs, ecdfs, xlab = "Theoretical CDF", ylab = "Empirical CDF", xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1)
[Figure: P-P plot of the empirical CDF against the theoretical CDF of the nine-component mixture, with the line y = x for reference]
References

1. Christian P. Robert & George Casella (2010). Introducing Monte Carlo Methods with R. Springer.
2. http://www.stat.cmu.edu/~cshalizi/uADA/12/