(1)

Paper: Advanced R

Module: Monte Carlo Method and EM algorithm


(2)

Principal investigator:

Dr. Bhaswati Ganguli,

Professor, Department of Statistics, University of Calcutta

Paper co-ordinator:

Dr. Abhra Sarkar, Duke University

Content writer: Dr. Santu Ghosh,

Lecturer, Department of Environmental Health Engineering, Sri Ramachandra University, Chennai

Content reviewer: Dr. Bhaswati Ganguli,

Professor, Department of Statistics, University of Calcutta

(3)

Concepts of Monte Carlo methods
Importance Sampling
Expectation Maximization (EM) Algorithm
EM for Gaussian mixture models


(4)

Suppose we wish to evaluate the integral

$$I = \int_{\mathcal{X}} g(x)\, f(x)\, dx,$$

where $g(x)$ is a complicated, possibly unknown function and $f(x)$ is a known probability density function.

This integral can be thought of as the expectation $E[g(X)]$, where $X$ is a random variable with probability density $f(x)$.

By the law of large numbers, the expected value $I = E[g(X)]$ can be estimated by a sample mean.

Let $x_1, x_2, \ldots, x_n$ be a sequence of random numbers generated according to the known density $f(x)$. Then

$$I \approx \frac{1}{n} \sum_{j=1}^{n} g(x_j).$$

This is Monte Carlo integration.

(5)

As an example, take

$$f(x) = [\cos(50x) + \sin(20x)]^2$$

and let

$$I = \int_0^1 f(x)\, g(x)\, dx,$$

where $g(x)$ is the $U(0,1)$ density (note that the roles of $f$ and $g$ are swapped relative to the previous slide: here $f$ is the complicated function).

h <- function(x) {
  (cos(50 * x) + sin(20 * x))^2  # define the function f(x)
}
curve(h, xlab = "Function", ylab = "", lwd = 2)

[Figure: curve of f(x) = (cos(50x) + sin(20x))^2 over (0, 1).]


(6)

integrate(h, 0, 1)

## 0.9652009 with absolute error < 1.9e-10

x <- h(runif(10^4))
estint <- cumsum(x) / (1:10^4)
esterr <- sqrt(cumsum((x - estint)^2)) / (1:10^4)
plot(estint, xlab = "Mean and error range", type = "l", lwd = 2,
     ylim = mean(x) + 20 * c(-esterr[10^4], esterr[10^4]), ylab = "")
lines(estint + 2 * esterr, col = "gold", lwd = 2)
lines(estint - 2 * esterr, col = "gold", lwd = 2)

[Figure: running mean of the Monte Carlo estimate with a band of two estimated standard errors.]

(7)

Given a probability density $h$ on $\mathbb{R}^d$ such that $h(x) = 0$ only if $f(x)\varphi(x) = 0$, and such that for $X \sim f$,

$$\theta = E(\varphi(X)) = \int_{\mathbb{R}^d} \varphi(x)\, f(x)\, dx$$

exists, consider a sample $Z_1, \ldots, Z_n \sim h$. Then the importance sampling estimator

$$\hat{\theta}_{IS} = \frac{1}{n} \sum_{j=1}^{n} \frac{\varphi(Z_j)\, f(Z_j)}{h(Z_j)}$$

is an unbiased and consistent estimator of $\theta$. If $\mathrm{Var}(\varphi(Z) f(Z)/h(Z))$ exists, then

$$E\{(\hat{\theta}_{IS} - \theta)^2\} = \mathrm{Var}(\hat{\theta}_{IS}) = \frac{1}{n} \mathrm{Var}\!\left(\frac{\varphi(Z) f(Z)}{h(Z)}\right) = \frac{1}{n} \left[ \int_{\mathbb{R}^d} \frac{\varphi^2(x) f^2(x)}{h(x)}\, dx - \theta^2 \right].$$
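To make the recipe concrete, here is a minimal generic sketch in R (not from the slides; the function is.estimate and its arguments are hypothetical names): phi is the integrand, f the target density, h the proposal density, and rh draws from h.

## A generic importance sampling sketch (illustrative, not from the slides)
is.estimate <- function(n, phi, f, h, rh) {
  z <- rh(n)                       # sample Z_1, ..., Z_n from h
  w <- f(z) / h(z)                 # importance weights f(Z_j)/h(Z_j)
  est <- mean(phi(z) * w)          # the importance sampling estimator
  se <- sd(phi(z) * w) / sqrt(n)   # estimated standard error
  c(Est = est, SE = se)
}

## Example: estimate E[X^2] = 1 for X ~ N(0,1), sampling from a t(3) proposal
is.estimate(10^5, phi = function(x) x^2, f = dnorm,
            h = function(x) dt(x, df = 3), rh = function(n) rt(n, df = 3))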


(8)

Suppose we want to use importance sampling to estimate $\theta = P(X > c) = E(I_{\{X > c\}})$ for $X \sim N(0, \sigma^2)$ and $c > 3\sigma$.

The plain Monte Carlo estimator

$$\hat{\theta} = \frac{1}{n} \sum_{j=1}^{n} I_{\{X_j > c\}} = \frac{1}{n} \#\{j \in \{1, 2, \ldots, n\} : X_j > c\}$$

will be quite poor, because the vast majority of sample values are completely irrelevant; only the very few that exceed $c$ matter.

We now use importance sampling to see more values above $c$. Take $h_\mu(x)$ to be the $N(\mu, \sigma^2)$ density, so that $f(x) = h_0(x)$ and

$$\frac{f(x)}{h_\mu(x)} = \exp\!\left(\frac{\mu(\mu - 2x)}{2\sigma^2}\right).$$

The importance sampling estimator, for $\mu \in \mathbb{R}$ and $Z_1, \ldots, Z_n \sim h_\mu$, is

$$\hat{\theta}_{IS} = \frac{1}{n} \sum_{j=1}^{n} I_{\{Z_j > c\}} \exp\!\left(\frac{\mu(\mu - 2Z_j)}{2\sigma^2}\right).$$
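For completeness, the density ratio follows in one line, since the two normal densities share the same variance and their normalizing constants cancel:

$$\frac{f(x)}{h_\mu(x)} = \frac{\exp(-x^2/2\sigma^2)}{\exp(-(x - \mu)^2/2\sigma^2)} = \exp\!\left(\frac{(x - \mu)^2 - x^2}{2\sigma^2}\right) = \exp\!\left(\frac{\mu(\mu - 2x)}{2\sigma^2}\right).$$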

(9)

# Importance sampling for P(X > c) for N(0,1), based on N(mu,1) proposals.
# The function imp returns the entries for one row of our tables.
imp <- function(mu, c, n, X) {
  Z <- (mu + X > c) * exp(mu * (mu - 2 * (mu + X))/2)  # mu + X ~ N(mu, 1)
  IS <- mean(Z)
  standev <- sd(Z)/sqrt(n)
  c(Mu = mu, SE = standev, Est = IS,
    LCL.IS = IS - qnorm(0.975) * standev,
    UCL.IS = IS + qnorm(0.975) * standev)
}

n <- 1e+07  # we fix n and X in advance
X <- rnorm(n)  # and use the same random numbers for all estimates
c <- 3
mu <- c(0, 1, 2, 3, 3.1, 3.15, 3.2, 4)
tab <- NULL
for (i in 1:length(mu)) tab <- rbind(tab, imp(mu[i], c, n, X))


(10)

tab ## for c=3

## Mu SE Est LCL.IS UCL.IS

## [1,] 0.00 1.159176e-05 0.001345500 0.001322781 0.001368219

## [2,] 1.00 2.905203e-06 0.001351625 0.001345931 0.001357320

## [3,] 2.00 1.175589e-06 0.001349840 0.001347536 0.001352144

## [4,] 3.00 7.860725e-07 0.001350800 0.001349259 0.001352340

## [5,] 3.10 7.802970e-07 0.001350387 0.001348858 0.001351916

## [6,] 3.15 7.796503e-07 0.001350686 0.001349158 0.001352214

## [7,] 3.20 7.801656e-07 0.001350710 0.001349181 0.001352239

## [8,] 4.00 9.770570e-07 0.001349346 0.001347431 0.001351261

c <- 4.5

mu <- c(0, 4.5, 4.6, 4.7)

tab <- NULL
for (i in 1:length(mu)) tab <- rbind(tab, imp(mu[i], c, n, X))

(11)

tab ## for c=4.5

## Mu SE Est LCL.IS UCL.IS

## [1,] 0.0 5.291495e-07 2.800000e-06 1.762886e-06 3.837114e-06

## [2,] 4.5 2.425671e-09 3.400955e-06 3.396201e-06 3.405709e-06

## [3,] 4.6 2.415816e-09 3.399114e-06 3.394379e-06 3.403849e-06

## [4,] 4.7 2.422710e-09 3.400062e-06 3.395313e-06 3.404810e-06

In a simulation of $n = 10{,}000{,}000$ samples with $\sigma = 1$ and $c = 3$, we see that $\mu \approx 3.15$ gives the best accuracy: three significant digits, estimating $P(X > 3)$ by $0.00135$, whereas only the first digit was significant for the plain Monte Carlo estimator.

Going a bit further, for $c = 4.5$ plain Monte Carlo does not find any significant digits, while importance sampling still finds two:

$$P(X > 4.5) \approx 0.0000034.$$
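As a quick sanity check (not in the original slides), R can compute these standard normal tail probabilities exactly:

pnorm(3, lower.tail = FALSE)    ## 0.001349898
pnorm(4.5, lower.tail = FALSE)  ## 3.397673e-06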


(12)

# To plot confidence intervals as mu varies
mu <- (0:120)/20  # choose range from 0 to 6 in steps of 0.05
impsave3 <- matrix(1:605, 121, 5)  # initialize impsave3, to save imp
for (i in 1:121) {
  impsave3[i, ] <- imp(mu[i], c, n, X)  # save imp in impsave3
}
plot(mu, impsave3[, 3], pch = "+")  # estimates
lines(mu, impsave3[, 4])  # lower CI bound
lines(mu, impsave3[, 5])  # upper CI bound
lines(c(3.15, 3.15), c(impsave3[64, 4], impsave3[64, 5]))  # emphasize mu=3.15
lines(c(3.14, 3.14), c(impsave3[64, 4], impsave3[64, 5]))  # line too thin
lines(c(3.16, 3.16), c(impsave3[64, 4], impsave3[64, 5]))  # add a bit more
lines(c(0, 6), c(impsave3[64, 3], impsave3[64, 3]))  # horizontal line

(13)

[Figure: importance sampling estimates (+) with pointwise 95% confidence bounds, plotted against mu from 0 to 6; the interval at mu = 3.15 and a horizontal line at its estimate are highlighted.]


(14)

The EM algorithm

The EM (which stands for expectation-maximization) algorithm is a deterministic optimization technique (Dempster et al., 1977) that takes advantage of the representation

$$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta)\, dz$$

to build a sequence of easier maximization problems whose limit is the answer to the original problem. Here $Z$ is a latent variable, which may represent a missing component of the data.

We thus assume that we observe $X_1, \ldots, X_n$, jointly distributed from $g(x \mid \theta)$ satisfying the representation above, and that we want to compute

$$\hat{\theta} = \arg\max_\theta L(\theta \mid x) = \arg\max_\theta g(x \mid \theta).$$

This representation occurs in many statistical settings, including censoring models, mixtures, and latent variable models.

(15)

Pick a starting value $\hat{\theta}^{(0)}$ and set $m = 0$. Repeat:

1. Compute (the E-step)

$$Q(\theta \mid \hat{\theta}^{(m)}, x) = E_{\hat{\theta}^{(m)}}[\log L^c(\theta \mid x, Z)],$$

where the expectation is with respect to the conditional density $k(z \mid \hat{\theta}^{(m)}, x)$.

2. Maximize $Q(\theta \mid \hat{\theta}^{(m)}, x)$ in $\theta$ and take (the M-step)

$$\hat{\theta}^{(m+1)} = \arg\max_\theta Q(\theta \mid \hat{\theta}^{(m)}, x),$$

then set $m = m + 1$,

until a fixed point is reached, i.e., $\hat{\theta}^{(m+1)} = \hat{\theta}^{(m)}$.
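Structurally, the iteration is just a loop. Here is a minimal generic sketch in R (not from the slides; em, e_step and m_step are hypothetical names), where e_step builds the expected complete-data log-likelihood for the current $\hat{\theta}^{(m)}$ and m_step maximizes it:

## Generic EM loop: a sketch. e_step() and m_step() are hypothetical,
## model-specific functions supplied by the user.
em <- function(x, theta0, e_step, m_step, tol = 1e-04, maxit = 500) {
  theta <- theta0
  for (m in 1:maxit) {
    Q <- e_step(theta, x)      # E-step: expected complete-data log-likelihood
    theta.new <- m_step(Q, x)  # M-step: maximize Q in theta
    if (max(abs(theta.new - theta)) < tol) break  # approximate fixed point
    theta <- theta.new
  }
  theta
}

The next example instead hard-codes both steps, since for the censored normal model the M-step has a closed form.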


(16)

Example-I

Let $x = \{y, z\}$, where $y' = (y_1, y_2, \ldots, y_m)$ denotes the observed data and $z' = (z_{m+1}, z_{m+2}, \ldots, z_n)$ denotes the missing data. If the distribution $f(x - \theta)$ corresponds to the $N(\theta, 1)$ distribution, the complete-data likelihood is

$$L^c(\theta \mid y, z) \propto \prod_{i=1}^{m} \exp\{-(y_i - \theta)^2/2\} \prod_{i=m+1}^{n} \exp\{-(z_i - \theta)^2/2\},$$

resulting in the expected complete-data log-likelihood

$$Q(\theta \mid \theta_0, y) = -\frac{1}{2} \sum_{i=1}^{m} (y_i - \theta)^2 - \frac{1}{2} \sum_{i=m+1}^{n} E_{\theta_0}[(Z_i - \theta)^2],$$

where the missing observations $Z_i$ are distributed from a normal $N(\theta, 1)$ distribution truncated at $a$.

(17)

Doing the M-step (i.e., differentiating the function $Q(\theta \mid \theta_0, y)$ in $\theta$) and setting the derivative equal to $0$ then leads to the EM update

$$\hat{\theta} = \frac{m \bar{y} + (n - m) E_{\theta_0}[Z_1]}{n}.$$

Since $E_{\theta_0}[Z_1] = \theta_0 + \frac{\varphi(a - \theta_0)}{1 - \Phi(a - \theta_0)}$, where $\varphi$ and $\Phi$ are the standard normal pdf and cdf, respectively, the EM sequence is

$$\hat{\theta}^{(j+1)} = \frac{m}{n}\, \bar{y} + \frac{n - m}{n} \left[ \hat{\theta}^{(j)} + \frac{\varphi(a - \hat{\theta}^{(j)})}{1 - \Phi(a - \hat{\theta}^{(j)})} \right].$$
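As a quick numerical check of the truncated-normal mean formula (not in the slides; the values theta = 2 and a = 1.5 are arbitrary):

## E[Z] for Z ~ N(theta, 1) truncated to (a, Inf): formula vs. simulation
theta <- 2; a <- 1.5
z <- rnorm(10^6, mean = theta, sd = 1)
mean(z[z > a])                                                 # simulated
theta + dnorm(a - theta)/pnorm(a - theta, lower.tail = FALSE)  # formula

Both lines should return roughly 2.509.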


(18)

R code

n <- 1000  # number of data points
X <- rnorm(n, 2.5, 1)  # simulated data
X[sample(1:n, n * 0.1)] <- NA  # make 10% of the values missing at random
y <- na.omit(X)  # observed data
m <- length(y)
ybar <- mean(y)
theta <- rnorm(1, mean = ybar, sd = sd(y))  # random starting value
a <- 1.5
nonstop <- iter <- 1
while (nonstop > 10^(-4)) {
  theta <- c(theta, m * ybar/n + (n - m) * (theta[iter] +
    dnorm(a - theta[iter])/pnorm(a - theta[iter], lower.tail = FALSE))/n)
  iter <- iter + 1
  nonstop <- abs(theta[iter] - theta[iter - 1])  # stop at approximate fixed point
}
print(theta[length(theta)])

## [1] 2.736344

(19)

Weather Forecasting in Snoqualmie Falls

The data consist of daily records, from the beginning of 1948 to the end of 1983, of precipitation at Snoqualmie Falls, Washington. Each row of the data file is a different year; each column records, for that day of the year, the precipitation (rain or snow) in units of 1/100 inch.

Because of leap days, there are 366 columns, with the last column having an NA value for three out of four years.

http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv

Let's look at the distribution of the amount of precipitation on the wet days.


(20)

snoqualmie <- read.csv("snoqualmie.csv", header = FALSE)
snoqualmie.vector <- na.omit(unlist(snoqualmie))
snoq <- snoqualmie.vector[snoqualmie.vector > 0]
summary(snoq)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 19.00 32.28 44.00 463.00

hist(snoq, breaks = 101, col = "grey", border = "grey", freq = FALSE,
     xlab = "Precipitation (1/100 inch)",
     main = "Precipitation in Snoqualmie Falls")
lines(density(snoq), lty = 2)

[Figure: histogram of wet-day precipitation with a kernel density estimate (dashed) overlaid.]

(21)

The histogram is overlaid with a simple kernel density estimate. This suggests that the distribution is rather skewed to the right.

Notice that the mean is larger than the median, and that the distance from the first quartile to the median (13/100 of an inch of precipitation) is much smaller than that from the median to the third quartile (25/100 of an inch).

One way this could arise, of course, is if there are multiple types of wet days, each with a different characteristic distribution of precipitation.

We’ll look at this by trying to fit Gaussian mixture models with varying numbers of components.


(22)

Two components

library(mixtools)
snoq.k2 <- normalmixEM(snoq, k = 2, maxit = 100, epsilon = 0.01)

## number of iterations= 61

summary(snoq.k2)

## summary of normalmixEM object:
##            comp 1    comp 2
## lambda   0.569091  0.430909
## mu      10.615148 60.916020
## sigma    8.836114 45.246735
## loglik at estimate: -32681.2

Let's define a function that overlays the fitted components on the histogram:

plot.normal.components <- function(mixture, component.number, ...) {
  curve(mixture$lambda[component.number] *
          dnorm(x, mean = mixture$mu[component.number],
                sd = mixture$sigma[component.number]),
        add = TRUE, ...)
}

(23)

hist(snoq, breaks = 101, col = "grey", border = "grey", freq = FALSE,
     xlab = "Precipitation (1/100 inch)",
     main = "Precipitation in Snoqualmie Falls")
lines(density(snoq), lty = 2)
sapply(1:2, plot.normal.components, mixture = snoq.k2)

[Figure: histogram of wet-day precipitation with the two fitted Gaussian components overlaid.]


(24)

The CDF of the Mixture

The CDF of a two-component mixture is $F(x) = \lambda_1 F_1(x) + \lambda_2 F_2(x)$.

Similarly, for $k$ components,

$$F(x) = \sum_{i=1}^{k} \lambda_i F_i(x), \qquad \text{where } \sum_{i=1}^{k} \lambda_i = 1.$$

Let's define the following function:

pnormmix <- function(x, mixture) {
  lambda <- mixture$lambda
  k <- length(lambda)
  pnorm.from.mix <- function(x, component) {
    lambda[component] * pnorm(x, mean = mixture$mu[component],
                              sd = mixture$sigma[component])
  }
  pnorms <- sapply(1:k, pnorm.from.mix, x = x)
  return(rowSums(pnorms))
}

(25)

distinct.snoq <- sort(unique(snoq))
tcdfs <- pnormmix(distinct.snoq, mixture = snoq.k2)
ecdfs <- ecdf(snoq)(distinct.snoq)
plot(tcdfs, ecdfs, xlab = "Theoretical CDF", ylab = "Empirical CDF",
     xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1)

[Figure: empirical CDF against theoretical CDF for the two-component fit, with the diagonal for reference.]

The two-component Gaussian mixture fit is not satisfactory, as the plot above shows.

One can select the number of components by cross-validation, or by a statistical test.

For cross-validation we have to be able to calculate the log-likelihood.


(26)

Cross Validation method

dnormalmix <- function(x, mixture, log = FALSE) {
  lambda <- mixture$lambda
  k <- length(lambda)
  # Calculate the share of the likelihood for all data for one component
  like.component <- function(x, component) {
    lambda[component] * dnorm(x, mean = mixture$mu[component],
                              sd = mixture$sigma[component])
  }
  # Create an array with likelihood shares from all components over all data
  likes <- sapply(1:k, like.component, x = x)
  # Add up the contributions from the components
  d <- rowSums(likes)
  if (log) {
    d <- log(d)
  }
  return(d)
}

# Log-likelihood function
loglike.normalmix <- function(x, mixture) {
  loglike <- dnormalmix(x, mixture, log = TRUE)
  return(sum(loglike))
}

(27)

One can do five-fold or ten-fold cross-validation to select the number of components.

But just to illustrate the approach, we'll do simple data-set splitting, where a randomly-selected half of the data is used to fit the model and the other half to test it.

n <- length(snoq)
data.points <- 1:n
set.seed(154)
data.points <- sample(data.points)  # permute randomly
train <- data.points[1:floor(n/2)]  # first random half is training
test <- data.points[-(1:floor(n/2))]  # second random half is testing
candidate.component.numbers <- 2:10
loglikes <- vector(length = 1 + length(candidate.component.numbers))
# k = 1 needs special handling
mu <- mean(snoq[train])  # MLE of the mean
sigma <- sd(snoq[train]) * sqrt((n - 1)/n)  # MLE of the standard deviation


(28)

Cross Validation method

loglikes[1] <- sum(dnorm(snoq[test], mu, sigma, log = TRUE))
for (k in candidate.component.numbers) {
  mixture <- normalmixEM(snoq[train], k = k, maxit = 400, epsilon = 0.01)
  loglikes[k] <- loglike.normalmix(snoq[test], mixture = mixture)
}
plot(x = 1:10, y = loglikes, xlab = "Number of mixture components",
     cex = 2, pch = 16, ylab = "Log-likelihood on testing data")

[Figure: log-likelihood on the testing data against the number of mixture components (1 to 10).]

The figure shows the log-likelihoods of mixture models of different sizes, each fit to a random half of the data for training and evaluated on the other half for testing.

The figure suggests nine components for the Gaussian mixture.

(29)

snoq.k9 <- normalmixEM(snoq, k = 9, maxit = 400, epsilon = 0.01)

## number of iterations= 291

summary(snoq.k9)

## summary of normalmixEM object:

## comp 1 comp 2 comp 3 comp 4 comp 5 comp 6 comp 7

## lambda 0.10807 0.112529 0.166482 0.13387 0.152741 0.126211 0.0504072

## mu 1.42988 3.433292 40.955614 7.48726 14.968790 26.099948 122.6941756

## sigma 0.51102 1.193913 11.785947 2.46536 4.582579 6.586004 40.9237907

## comp 8 comp 9

## lambda 0.140001 9.68975e-03

## mu 72.069995 2.12038e+02

## sigma 22.417844 8.43885e+01

## loglik at estimate: -30502.51


(30)

Nine components

hist(snoq, breaks = 101, col = "grey", border = "grey", freq = FALSE,
     xlab = "Precipitation (1/100 inch)",
     main = "Precipitation in Snoqualmie Falls")
lines(density(snoq), lty = 2)
sapply(1:9, plot.normal.components, mixture = snoq.k9)

[Figure: histogram of wet-day precipitation with the nine fitted Gaussian components overlaid.]

(31)

distinct.snoq <- sort(unique(snoq))
tcdfs <- pnormmix(distinct.snoq, mixture = snoq.k9)
ecdfs <- ecdf(snoq)(distinct.snoq)
plot(tcdfs, ecdfs, xlab = "Theoretical CDF", ylab = "Empirical CDF",
     xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1)

[Figure: empirical CDF against theoretical CDF for the nine-component fit, with the diagonal for reference.]


(32)

References

1. Christian P. Robert & George Casella (2010). Introducing Monte Carlo Methods with R. Springer.

2. A. P. Dempster, N. M. Laird & D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.

3. http://www.stat.cmu.edu/~cshalizi/uADA/12/
