
(1)

Speech, NLP and the Web

Pushpak Bhattacharyya CSE Dept.,

IIT Bombay

Lecture 38: Unsupervised learning in HMM, PCFG; Baum Welch

(lecture 37 was on cognitive NLP by Abhijit Mishra)


(2)

NLP Trinity

[NLP Trinity figure: three axes — Problems (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Languages (Hindi, Marathi, English, French), Algorithms (HMM, CRF, MEMM)]

(3)

Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequences - Viterbi

2. Given the observation sequence, find its probability - forward/backward algorithm

3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm


(4)

Probabilistic FSM

[Figure: a two-state probabilistic FSM, states S1 and S2, with outputs emitted on the arcs:
S1 --a1:0.1--> S1, S1 --a2:0.2--> S1, S1 --a1:0.3--> S2, S1 --a2:0.4--> S2,
S2 --a1:0.3--> S1, S2 --a2:0.3--> S1, S2 --a1:0.2--> S2, S2 --a2:0.2--> S2]

The question here is:

“what is the most likely state sequence given the output sequence seen”
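As a concrete reference for the slides that follow, here is one way this machine might be encoded (a sketch in Python; the arc-to-probability assignment is read off the figure, and emissions sit on the transitions, so probabilities are keyed by source state, output symbol, and destination state):

    # TRANS[(src, symbol, dst)] = P(src --symbol--> dst); the arcs out of
    # each state sum to 1 across symbols and destinations.
    TRANS = {
        ("S1", "a1", "S1"): 0.1, ("S1", "a1", "S2"): 0.3,
        ("S1", "a2", "S1"): 0.2, ("S1", "a2", "S2"): 0.4,
        ("S2", "a1", "S1"): 0.3, ("S2", "a1", "S2"): 0.2,
        ("S2", "a2", "S1"): 0.3, ("S2", "a2", "S2"): 0.2,
    }
    STATES = ["S1", "S2"]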

(5)

Developing the tree

Start: P(S1) = 1.0, P(S2) = 0.0

Reading a1: S1: 1.0 × 0.1 = 0.1; S2: 1.0 × 0.3 = 0.3

Reading a2: to S1: 0.1 × 0.2 = 0.02 or 0.3 × 0.3 = 0.09; to S2: 0.1 × 0.4 = 0.04 or 0.3 × 0.2 = 0.06

Choose the winning sequence per state per iteration: S1 keeps 0.09, S2 keeps 0.06.


(6)

Tree structure contd…

Reading a1 again: to S1: 0.09 × 0.1 = 0.009 or 0.06 × 0.3 = 0.018; to S2: 0.09 × 0.3 = 0.027 or 0.06 × 0.2 = 0.012

Reading a2 again: from 0.027: to S1 (0.3): 0.0081, to S2 (0.2): 0.0054; from 0.012: to S2 (0.4): 0.0048, to S1 (0.2): 0.0024

The problem being addressed by this tree is

S* = argmax_S P(S | a1 a2 a1 a2, µ)

where a1-a2-a1-a2 is the output sequence and µ the model or the machine
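A minimal sketch of this per-state maximization (Viterbi), assuming the TRANS/STATES encoding introduced after the FSM figure and S1 as the start state:

    # Keep, per destination state and per input symbol, only the
    # best-scoring (probability, path) pair: the "winning sequence
    # per state per iteration" of the tree above.
    def viterbi(obs, start="S1"):
        best = {s: ((1.0 if s == start else 0.0), [s]) for s in STATES}
        for sym in obs:
            best = {q: max((p * TRANS.get((s, sym, q), 0.0), path + [q])
                           for s, (p, path) in best.items())
                    for q in STATES}
        return max(best.values())

    print(viterbi(["a1", "a2", "a1", "a2"]))
    # -> (0.0081, ['S1', 'S2', 'S1', 'S2', 'S1']): the path of the next slide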

(7)

Path found (working backward):

S1 --a1--> S2 --a2--> S1 --a1--> S2 --a2--> S1

Problem statement: Find the best possible sequence

S* = argmax_S P(S | O, µ)

where S is the state sequence, O the output sequence, and

µ (machine or model) = {S_0, S, A, T}

with S_0 the start symbol, S the state collection, A the alphabet set, and T the transitions, where T is defined as P(S_i --a_k--> S_j) ∀ i, j, k


(8)

How to compute P(o_0 o_1 o_2 o_3 … o_m)?

P(O) = Σ_S P(O, S)    (marginalization)

Consider the observation sequence o_0 o_1 o_2 … o_m emitted along the states S_0 S_1 S_2 … S_m S_{m+1}, where the S_i represent the states of the sequence.

(9)

Computing P(o_0 o_1 o_2 o_3 … o_m)

P(o_0 o_1 … o_m, S_0 S_1 … S_{m+1})
= P(S_0 S_1 … S_{m+1}) · P(o_0 o_1 … o_m | S_0 S_1 … S_{m+1})
= [P(S_0) · P(S_1 | S_0) · P(S_2 | S_1) … P(S_{m+1} | S_m)] · [P(o_0 | S_0 S_1) · P(o_1 | S_1 S_2) … P(o_m | S_m S_{m+1})]
= [P(o_0 | S_0 S_1) · P(S_1 | S_0)] · [P(o_1 | S_1 S_2) · P(S_2 | S_1)] … [P(o_m | S_m S_{m+1}) · P(S_{m+1} | S_m)]

(using the Markov assumption, with each o_k emitted on the transition from S_k to S_{k+1})


(10)

Forward and Backward

Probability Calculation

(11)

Forward probability F(k,i)

 Define F(k,i) = probability of being in state S_i having seen o_0 o_1 o_2 … o_k:

F(k,i) = P(o_0 o_1 o_2 … o_k, S_i)

 With m as the length of the observed sequence and N states:

P(observed sequence) = P(o_0 o_1 o_2 … o_m)
= Σ_{p=0..N} P(o_0 o_1 o_2 … o_m, S_p)
= Σ_{p=0..N} F(m, p)


(12)

Forward probability (contd.)

F(k, q)
= P(o_0 o_1 o_2 … o_k, S_q)
= P(o_0 o_1 o_2 … o_{k-1}, o_k, S_q)
= Σ_{p=0..N} P(o_0 o_1 o_2 … o_{k-1}, S_p, o_k, S_q)
= Σ_{p=0..N} P(o_0 o_1 o_2 … o_{k-1}, S_p) · P(o_k, S_q | o_0 o_1 o_2 … o_{k-1}, S_p)
= Σ_{p=0..N} F(k-1, p) · P(o_k, S_q | S_p)
= Σ_{p=0..N} F(k-1, p) · P(S_p --o_k--> S_q)

[Trellis: observations O_0 O_1 O_2 O_3 … O_k O_{k+1} … O_{m-1} O_m along states S_0 S_1 S_2 S_3 … S_p S_q … S_m S_final]
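A sketch of this recurrence, under the same TRANS/STATES encoding as before and with S1 carrying probability 1 before anything is seen:

    # F maps each state q to F(k, q) = P(o_0..o_k, S_q); each symbol
    # folds in one layer of F(k-1, p) * P(S_p --o_k--> S_q).
    def forward(obs, start="S1"):
        F = {s: (1.0 if s == start else 0.0) for s in STATES}
        for sym in obs:
            F = {q: sum(F[p] * TRANS.get((p, sym, q), 0.0) for p in STATES)
                 for q in STATES}
        return F

    F = forward(["a1", "a2", "a1", "a2"])
    print(sum(F.values()))   # P(observed sequence) = sum over p of F(m, p)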

(13)

Backward probability B(k,i)

 Define B(k,i) = probability of seeing o_k o_{k+1} o_{k+2} … o_m given that the state was S_i:

B(k,i) = P(o_k o_{k+1} o_{k+2} … o_m | S_i)

 With m as the length of the whole observed sequence:

P(observed sequence) = P(o_0 o_1 o_2 … o_m)
= P(o_0 o_1 o_2 … o_m | S_0)
= B(0,0)


(14)

Backward probability (contd.)

B(k, p)
= P(o_k o_{k+1} o_{k+2} … o_m | S_p)
= P(o_{k+1} o_{k+2} … o_m, o_k | S_p)
= Σ_{q=0..N} P(o_{k+1} o_{k+2} … o_m, o_k, S_q | S_p)
= Σ_{q=0..N} P(o_k, S_q | S_p) · P(o_{k+1} o_{k+2} … o_m | o_k, S_q, S_p)
= Σ_{q=0..N} P(o_{k+1} o_{k+2} … o_m | S_q) · P(o_k, S_q | S_p)
= Σ_{q=0..N} B(k+1, q) · P(S_p --o_k--> S_q)

[Trellis: observations O_0 O_1 O_2 O_3 … O_k O_{k+1} … O_{m-1} O_m along states S_0 S_1 S_2 S_3 … S_p S_q … S_m S_final]
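The mirror-image sketch for the backward recurrence, with B = 1 past the end of the sequence (same TRANS/STATES assumptions as the forward sketch):

    # B maps each state p to B(k, p) = P(o_k..o_m | S_p), built right to left.
    def backward(obs):
        B = {s: 1.0 for s in STATES}
        for sym in reversed(obs):
            B = {p: sum(TRANS.get((p, sym, q), 0.0) * B[q] for q in STATES)
                 for p in STATES}
        return B

    print(backward(["a1", "a2", "a1", "a2"])["S1"])
    # B(0, 0) for start state S1: the same value as the forward total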

(15)

HMM Training

Baum Welch or Forward Backward Algorithm


(16)

Key Intuition

Given: training sequence

Initialization: probability values

Compute: Pr(state seq | training seq) → get expected count of transitions → compute rule probabilities

Approach: initialize the probabilities and recompute them…

[Figure: two-state machine, states q and r, with arcs in both directions and self-loops, each able to emit a or b]

(17)

Baum-Welch algorithm: counts

String = abb aaa bbb aaa

[Figure: two-state machine, states q and r; every arc can emit a or b]

Sequence of states with respect to input symbols (o/p seq on the arcs, state seq at the nodes):

q --a--> r --b--> q --b--> q --a--> r --a--> q --a--> r --b--> q --b--> q --b--> q --a--> r --a--> q --a--> r


(18)

Calculating probabilities from table

Table of counts (T = #states, A = #alphabet symbols):

Src  Dest  O/P  Count
q    r     a    5
q    q     b    3
r    q     a    3
r    q     b    2

P(q --a--> r) = 5/8    P(q --b--> q) = 3/8

In general:

P(s_i --w_k--> s_j) = c(s_i --w_k--> s_j) / Σ_{l=1..T} Σ_{m=1..A} c(s_i --w_m--> s_l)

Now if we have non-deterministic transitions, then multiple state sequences are possible for the given o/p sequence (ref. the previous slide). Our aim is to find the expected counts.

(19)

Interplay Between Two Equations

P(s_i --w_k--> s_j) = C(s_i --w_k--> s_j) / Σ_{l=0..T} Σ_{m=0..A} C(s_i --w_m--> s_l)

C(s_i --w_k--> s_j) = Σ_{S_{0,n}} [ P(S_{0,n} | W_{0,n}) · n(s_i --w_k--> s_j, S_{0,n}, W_{0,n}) ]

where n(s_i --w_k--> s_j, S_{0,n}, W_{0,n}) is the number of times the transition s_i --w_k--> s_j occurs in the string: expected counts are computed from the current transition probabilities, and the probabilities are then re-estimated from the expected counts.
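A minimal sketch of the re-estimation half of this interplay, assuming expected counts arrive keyed by (source, symbol, destination); the E-step that produces them is worked out on the following slides:

    # M-step: renormalize expected counts per source state, so the arc
    # probabilities out of each state sum to 1 (the first equation above).
    def m_step(counts):
        totals = {}
        for (src, sym, dst), c in counts.items():
            totals[src] = totals.get(src, 0.0) + c
        return {arc: c / totals[arc[0]] for arc, c in counts.items()}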

(20)

Illustration

[Figure: two two-state machines over q and r.
Actual (desired) HMM: q --a:0.67--> q, q --b:0.17--> q, q --a:0.16--> r, r --b:1.0--> q.
Initial guess: q --a:0.48--> q, q --b:0.48--> q, q --a:0.04--> r, r --b:1.0--> q]

(21)

One run of Baum-Welch algorithm: string ababb

State sequences     P(path)   C(q --a--> r)  C(r --b--> q)  C(q --a--> q)  C(q --b--> q)
q r q r q q         0.00077   0.00154        0.00154        0              0.00077
q r q q q q         0.00442   0.00442        0.00442        0.00442        0.00884
q q q r q q         0.00442   0.00442        0.00442        0.00442        0.00884
q q q q q q         0.02548   0.0            0.0            0.05096        0.07644
Rounded total →     0.035     0.01           0.01           0.06           0.095
New probabilities →           0.06           1.0            0.36           0.581

e.g., new P(q --a--> r) = 0.01 / (0.01 + 0.06 + 0.095) ≈ 0.06

* ε is considered as the starting and ending symbol of the input sequence string.

Through multiple iterations the probability values will converge.
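This run can be reproduced by brute-force enumeration, as a sketch (GUESS encodes the initial-guess machine of the previous slide; m_step is the normalization sketched earlier; exact arithmetic gives ≈0.575 where the slide's rounded totals give 0.581):

    from itertools import product

    GUESS = {("q", "a", "q"): 0.48, ("q", "b", "q"): 0.48,
             ("q", "a", "r"): 0.04, ("r", "b", "q"): 1.0}
    obs = "ababb"
    counts = {}
    for tail in product("qr", repeat=len(obs)):  # states after each symbol
        path = ("q",) + tail                     # every path starts in q
        p = 1.0
        for arc in zip(path, obs, path[1:]):     # arc = (src, symbol, dst)
            p *= GUESS.get(arc, 0.0)
        if p == 0.0:
            continue                             # impossible state sequence
        for arc in zip(path, obs, path[1:]):     # one share of p per traversal
            counts[arc] = counts.get(arc, 0.0) + p

    print(counts)          # ~0.0104, 0.0104, 0.0598, 0.0949: the totals row
    print(m_step(counts))  # ~0.06, 1.0, 0.36, 0.57: the new probabilities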


(22)

Related example: Word alignment

English (1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French (1) trois lapins

w x

(2) lapins de Grenoble

x y z

(23)

Initial Probabilities:

each cell denotes t(a → w), t(a → x), etc.

a b c d

w 1/4 1/4 1/4 1/4

x 1/4 1/4 1/4 1/4

y 1/4 1/4 1/4 1/4

z 1/4 1/4 1/4 1/4

(24)

“counts”

b c d ↔ x y z:

   a    b    c    d
w  0    0    0    0
x  0    1/3  1/3  1/3
y  0    1/3  1/3  1/3
z  0    1/3  1/3  1/3

a b ↔ w x:

   a    b    c    d
w  1/2  1/2  0    0
x  1/2  1/2  0    0
y  0    0    0    0
z  0    0    0    0

(25)

Revised probabilities table

a b c d

w 1/2 1/4 0 0

x 1/2 5/12 1/3 1/3

y 0 1/6 1/3 1/3

z 0 1/6 1/3 1/3

(26)

“revised counts”

b c d ↔ x y z:

   a    b    c    d
w  0    0    0    0
x  0    5/9  1/3  1/3
y  0    2/9  1/3  1/3
z  0    2/9  1/3  1/3

a b ↔ w x:

   a    b    c    d
w  1/2  3/8  0    0
x  1/2  5/8  0    0
y  0    0    0    0
z  0    0    0    0

(27)

Re-Revised probabilities table

a b c d

w 1/2 3/16 0 0

x 1/2 85/144 1/3 1/3

y 0 1/9 1/3 1/3

z 0 1/9 1/3 1/3

Continue until convergence; notice that (b,x) binding gets progressively stronger;

b=rabbits, x=lapins
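The same E/M interplay drives these tables; below is an IBM-Model-1-style sketch (the slides do not name the model) over the toy corpus of the earlier slide, with fractional counts renormalized per English word, which reproduces the tables exactly:

    from collections import defaultdict

    # corpus: (English words, French words); t[(f, e)] plays t(e -> f).
    corpus = [(["a", "b"], ["w", "x"]), (["b", "c", "d"], ["x", "y", "z"])]
    t = {(f, e): 0.25 for eng, fr in corpus for e in eng for f in fr}

    for _ in range(2):                  # two rounds reproduce slides 24-27
        counts = defaultdict(float)
        for eng, fr in corpus:
            for e in eng:               # E-step: fractional "counts"
                z = sum(t[(f, e)] for f in fr)
                for f in fr:
                    counts[(f, e)] += t[(f, e)] / z
        totals = defaultdict(float)     # M-step: renormalize per English word
        for (f, e), c in counts.items():
            totals[e] += c
        t = {(f, e): c / totals[e] for (f, e), c in counts.items()}

    print(t[("x", "b")])                # 85/144 ~ 0.59: (b, x) strengthens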

(28)

Computational part (1/2)

C(s_i --w_k--> s_j)
= Σ_{S_{0,n}} [ P(S_{0,n} | W_{0,n}) · n(s_i --w_k--> s_j, S_{0,n}, W_{0,n}) ]
= Σ_{S_{0,n}} [ (P(S_{0,n}, W_{0,n}) / P(W_{0,n})) · n(s_i --w_k--> s_j, S_{0,n}, W_{0,n}) ]
= (1 / P(W_{0,n})) · Σ_{S_{0,n}} [ P(S_{0,n}, W_{0,n}) · n(s_i --w_k--> s_j, S_{0,n}, W_{0,n}) ]
= (1 / P(W_{0,n})) · Σ_{t=0..n} P(S_t = s_i, W_t = w_k, S_{t+1} = s_j, W_{0,n})

(29)

Computational part (2/2)

P(S_t = s_i, W_t = w_k, S_{t+1} = s_j, W_{0,n})
= P(W_{0,t-1}, S_t = s_i) · P(W_t = w_k, S_{t+1} = s_j | S_t = s_i) · P(W_{t+1,n} | S_{t+1} = s_j)
= F(t-1, i) · P(s_i --w_k--> s_j) · B(t+1, j)

so that

C(s_i --w_k--> s_j) = (1 / P(W_{0,n})) · Σ_{t=0..n} F(t-1, i) · P(s_i --w_k--> s_j) · B(t+1, j)

[Trellis: w_0 w_1 w_2 … w_k … w_{n-1} w_n along states S_0 S_1 … S_i S_j … S_{n-1} S_n S_{n+1}]
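A sketch of this computation over the toy machine used earlier (TRANS/STATES as before; F_list[t] plays the slide's F(t-1, ·) and B_list[t+1] its B(t+1, ·)):

    def expected_counts(obs, start="S1"):
        F = {s: (1.0 if s == start else 0.0) for s in STATES}
        F_list = [F]                  # F_list[t][i] = P(o_0..o_{t-1}, S_t = i)
        for sym in obs:
            F = {q: sum(F[p] * TRANS.get((p, sym, q), 0.0) for p in STATES)
                 for q in STATES}
            F_list.append(F)
        B = {s: 1.0 for s in STATES}
        B_list = [B]
        for sym in reversed(obs):
            B = {p: sum(TRANS.get((p, sym, q), 0.0) * B[q] for q in STATES)
                 for p in STATES}
            B_list.append(B)
        B_list.reverse()              # B_list[t][j] = P(o_t..o_m | S_t = j)
        total = B_list[0][start]      # P(W_{0,n})
        counts = {}
        for step, sym in enumerate(obs):
            for i in STATES:
                for j in STATES:
                    c = (F_list[step][i] * TRANS.get((i, sym, j), 0.0)
                         * B_list[step + 1][j]) / total
                    counts[(i, sym, j)] = counts.get((i, sym, j), 0.0) + c
        return counts

    print(expected_counts(["a1", "a2", "a1", "a2"]))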

(30)

Discussions

1. Symmetry breaking:
Example: symmetry breaking leads to no change in initial values
2. Stuck in local maxima
3. Label bias problem:
Probabilities have to sum to 1.

[Figure: desired vs. initialized three-state machines. Desired arcs: b:1.0, a:0.5/b:0.5, a:1.0. Symmetric initialization: a:0.5/b:0.5 and a:0.25/b:0.25 spread over the corresponding arcs]

(31)

HMM ↔ PCFG

 O observed sequence ↔ w 1m sentence

 X state sequence ↔ t parse tree

 µ model ↔ G grammar

 Three fundamental questions


(32)

HMM ↔ PCFG

 How likely is a certain observation given the model? ↔ How likely is a sentence given the grammar?

 How to choose a state sequence which best

explains the observations? ↔ How to choose a parse which best supports the sentence?

P(O | µ) ↔ P(w_1m | G)

argmax_X P(X | O, µ) ↔ argmax_t P(t | w_1m, G)

(33)

HMM ↔ PCFG

 How to choose the model parameters that best explain the observed data? ↔ How to choose rule probabilities which maximize the probabilities of the observed sentences?

argmax_µ P(O | µ) ↔ argmax_G P(w_1m | G)


(34)

Interesting Probabilities

The gunman sprayed the building with bullets
 1      2      3      4    5       6    7

β_NP(4,5): What is the probability of having an NP at this position such that it will derive “the building”? - Inside probability

α_NP(4,5): What is the probability of starting from N^1 and deriving “The gunman sprayed”, an NP, and “with bullets”? - Outside probability

(35)

Interesting Probabilities

Random variables to be considered

The non-terminal being expanded.

E.g., NP

The word-span covered by the non-terminal.

E.g., (4,5) refers to words “the building”

While calculating probabilities, consider:

The rule to be used for expansion: e.g., NP → DT NN

The probabilities associated with the RHS non-terminals: e.g., DT subtree’s inside/outside probabilities & NN subtree’s inside/outside probabilities


(36)

Outside Probability

 α_j(p,q): the probability of beginning with N^1 and generating the non-terminal N^j_pq and all words outside w_p … w_q:

α_j(p,q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} | G)

[Figure: w_1 … w_{p-1} [N^j_pq: w_p … w_q] w_{q+1} … w_m under the root N^1]

(37)

Inside Probabilities

 β_j(p,q): the probability of generating the words w_p … w_q starting with the non-terminal N^j_pq:

β_j(p,q) = P(w_{pq} | N^j_{pq}, G)

[Figure: w_1 … w_{p-1} [N^j_pq: w_p … w_q] w_{q+1} … w_m under the root N^1]


(38)

Outside & Inside Probabilities: example

The gunman sprayed the building with bullets

α_NP(4,5) for “the building” = P(The gunman sprayed, NP_{4,5}, with bullets | G)

β_NP(4,5) for “the building” = P(the building | NP_{4,5}, G)
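To make β concrete, here is a minimal bottom-up (CKY-style) inside computation; the toy CNF grammar, its probabilities, and all names below are illustrative, not from the lecture:

    from collections import defaultdict

    LEX = {("DT", "the"): 1.0, ("NN", "building"): 1.0}  # P(N^j -> word)
    RULES = {("NP", "DT", "NN"): 1.0}                    # P(N^j -> N^r N^s)

    def inside(words):
        n = len(words)
        beta = defaultdict(float)     # beta[(j, p, q)] = P(w_p..w_q | N^j_pq)
        for p, w in enumerate(words): # width-1 spans: lexical rules
            for (cat, word), prob in LEX.items():
                if word == w:
                    beta[(cat, p, p)] = prob
        for width in range(2, n + 1): # wider spans: binary rules, all splits
            for p in range(n - width + 1):
                q = p + width - 1
                for (parent, left, right), prob in RULES.items():
                    for d in range(p, q):
                        beta[(parent, p, q)] += (prob * beta[(left, p, d)]
                                                 * beta[(right, d + 1, q)])
        return beta

    print(inside(["the", "building"])[("NP", 0, 1)])     # 1.0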

(39)

PCFG Training


(40)

EM Algorithm for training
