CS460/626: Natural Language Processing/Speech, NLP and the Web
(Lecture 17: Alignment in SMT)
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
14th Feb, 2011
Language Divergence Theory: Lexico-Semantic Divergences
(ref: Dave, Parikh, Bhattacharyya, Journal of MT, 2002)
• Conflational divergence
  – F: vomir; E: to be sick
  – E: stab; H: churaa se maaranaa (knife-with hit)
  – S: Utrymningsplan; E: escape plan
• Structural divergence
  – E: SVO; H: SOV
• Categorial divergence
  – Change is in POS category (many examples discussed)
• Head swapping divergence
  – E: Prime Minister of India; H: bhaarat ke pradhaan mantrii (India-of Prime Minister)
• Lexical divergence
  – E: advise; H: paraamarsh denaa (advice give): Noun Incorporation, a very common Indian-language phenomenon
Language Divergence Theory: Syntactic Divergences
• Constituent Order divergence
  – E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, singh, … (India-of PM, Singh, …)
• Adjunction Divergence
  – E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come)
• Preposition-Stranding divergence
  – E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who-with you go want?)
• Null Subject Divergence
  – E: I will go; H: jaauMgaa (subject dropped)
• Pleonastic Divergence
  – E: It is raining; H: baarish ho rahii hai (rain happening is: no translation of "it")
Alignment
Completely aligned:
  Your answer is right
  Votre réponse est juste
Problematic alignment:
  We first met in Paris
  Nous nous sommes rencontrés pour la première fois à Paris
The Statistical MT model: notation
Source language: F
Target language: E
Source language sentence: f
Target language sentence: e
Source language word: w_f
Target language word: w_e

The Statistical MT model
To translate f:
1. Assume that all sentences in E are translations of f with some probability!
2. Choose the translation with the highest probability:

   ê = argmax_e P(e | f)
SMT Model
• What is a good translation?
  – Faithful to source
  – Fluent in target

   ê = argmax_e P(e) P(f | e)

  where P(e) captures fluency and P(f | e) captures faithfulness.
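This decomposition follows from Bayes' rule; since P(f) does not depend on e, it drops out of the argmax:

   ê = argmax_e P(e | f) = argmax_e [ P(e) P(f | e) / P(f) ] = argmax_e P(e) P(f | e)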
Language Modeling
• Task: to find P(e) (assigning probabilities to sentences)

If e = w_1 w_2 … w_n, then

   P(e) = P(w_1 w_2 … w_n)
        = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) … P(w_n | w_1 w_2 … w_{n-1})

   P(w_n | w_1 w_2 … w_{n-1}) = count(w_1 w_2 … w_{n-1} w_n) / Σ_w count(w_1 w_2 … w_{n-1} w)
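For instance, applying the chain rule to the sentence used in the alignment example above:

   P(your answer is right) = P(your) × P(answer | your) × P(is | your answer) × P(right | your answer is)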
Language Modeling: The N-gram approximation
• Probability of the word given the previous N-1 words
• N=2: bigram approximation
• N=3: trigram approximation
• Bigram approximation:

   P(e) = P(w_1 w_2 … w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) … P(w_n | w_{n-1})

   P(w_n | w_{n-1}) = count(w_{n-1} w_n) / Σ_w count(w_{n-1} w)
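As a concrete illustration of the bigram approximation (not part of the original slides), here is a minimal Python sketch that collects MLE counts and scores a sentence; the tiny two-sentence corpus is made up for the example.

```python
# Minimal sketch: bigram language model with MLE counts.
from collections import defaultdict

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w in tokens:
            unigram[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram[(w1, w2)] += 1
    return unigram, bigram

def bigram_prob(w_prev, w, unigram, bigram):
    """P(w | w_prev) = count(w_prev w) / count(w_prev)."""
    if unigram[w_prev] == 0:
        return 0.0
    return bigram[(w_prev, w)] / unigram[w_prev]

def sentence_prob(sent, unigram, bigram):
    """P(e) under the bigram approximation."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p *= bigram_prob(w1, w2, unigram, bigram)
    return p

unigram, bigram = train_bigram_lm([["your", "answer", "is", "right"],
                                   ["your", "answer", "is", "wrong"]])
print(sentence_prob(["your", "answer", "is", "right"], unigram, bigram))  # 0.5
```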
Translation Modeling
Task: to find P(f | e)
Cannot use the counts of f and e
Approximate P(f | e) using the product of word translation probabilities (IBM Model 1)
Problem: how to calculate word translation probabilities?
Note: we do not have counts – the training corpus is sentence-aligned, not word-aligned
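To make the approximation concrete, here is a minimal sketch (not from the slides) of the standard Model 1 estimate P(f | e) = ε / (l+1)^m · ∏_j Σ_i t(f_j | e_i); it assumes a word-translation table t(f_word | e_word) is already available, for example from the EM procedure described below.

```python
# Minimal sketch (assumed helper, not from the slides) of the IBM Model 1 estimate
#   P(f|e) = epsilon / (l+1)^m * prod_j sum_i t(f_j | e_i),
# where position 0 of the English sentence is the special NULL word.
def model1_prob(f_sent, e_sent, t, epsilon=1.0):
    e_words = ["NULL"] + e_sent
    l, m = len(e_sent), len(f_sent)
    prob = epsilon / (l + 1) ** m
    for f_w in f_sent:
        # each foreign word may be generated by any English word (or NULL)
        prob *= sum(t.get((f_w, e_w), 0.0) for e_w in e_words)
    return prob
```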
Word-alignment example
English: Ram(1) has(2) an(3) apple(4)
Hindi:   राम(1) के(2) पास(3) एक(4) सेब(5) है(6)
Expectation Maximization for the translation model
Expectation-Maximization algorithm:
1. Start with uniform word translation probabilities
2. Use these probabilities to find the (fractional) counts
3. Use these new counts to recompute the word translation probabilities
4. Repeat the above steps till the values converge
Works because of the co-occurrence of words that are actually translations.
It can be proven that EM converges.
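A minimal Python sketch of this loop, in the style of IBM Model 1 (an illustration, not the lecture's exact formulation: it assumes the t(f | e) direction, distributing each foreign word's fractional count over the English words of its sentence pair and normalizing per English word; the lecture's toy examples apply the same idea, sometimes with the two sides swapped).

```python
# Minimal EM sketch for word translation probabilities (IBM Model 1 style).
from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """bitext: list of (english_words, foreign_words) sentence pairs."""
    # Step 1: uniform initialization of t(f | e)
    f_vocab = {f for _, f_sent in bitext for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # fractional counts c(f | e)
        total = defaultdict(float)   # normalizer per English word
        # Step 2 (E-step): collect fractional counts from every sentence pair
        for e_sent, f_sent in bitext:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # Step 3 (M-step): recompute t(f | e) from the counts
        t = defaultdict(float,
                        {pair: count[pair] / total[pair[1]] for pair in count})
    return t

# Toy corpus in the spirit of the example that follows
bitext = [(["three", "rabbits"], ["trois", "lapins"]),
          (["rabbits", "of", "Grenoble"], ["lapins", "de", "Grenoble"])]
t = ibm1_em(bitext, iterations=5)
print(t[("lapins", "rabbits")])   # this pairing strengthens over the iterations
```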
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus. For IBM Model 1, we get the following relationship:

   c(w_f | w_e; f, e) = [ t(w_f | w_e) / ( t(w_f | w_e_0) + … + t(w_f | w_e_l) ) ] × A × B

where
   c(w_f | w_e; f, e) is the fractional count of the alignment of w_f with w_e in f and e,
   t(w_f | w_e) is the probability of w_f being the translation of w_e,
   w_e_0, …, w_e_l are the words of the English sentence e,
   A is the count of w_f in f, and
   B is the count of w_e in e.
The translation probabilities in IBM Model 1

   t′(w_f | w_e) = Σ_{s=1}^{S} c(w_f | w_e; f^(s), e^(s))

To get t(w_f | w_e), normalize such that

   Σ_{w_f} t(w_f | w_e) = 1
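Written out, the normalization divides each unnormalized value by the total for that English word:

   t(w_f | w_e) = t′(w_f | w_e) / Σ_{w_f′} t′(w_f′ | w_e)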
English-French example of alignment
Completely aligned:
  Your(1) answer(2) is(3) right(4)
  Votre(1) réponse(2) est(3) juste(4)
  Alignment: 1→1, 2→2, 3→3, 4→4
Problematic alignment:
  We(1) first(2) met(3) in(4) Paris(5)
  Nous(1) nous(2) sommes(3) rencontrés(4) pour(5) la(6) première(7) fois(8) à(9) Paris(10)
  Alignment: 1→(1,2), 2→(5,6,7,8), 3→4, 4→9, 5→10
Fertility?: yes
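Fertility here is the number of French words a single English word gives rise to: in this pair, "We" has fertility 2 (nous nous) and "first" has fertility 4 (pour la première fois).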
EM for word alignment from sentence alignment: example
English (1): three rabbits
               a      b
French (1): trois  lapins
               w      x
English (2): rabbits  of  Grenoble
                b      c      d
French (2): lapins  de  Grenoble
                x     y      z

Initial probabilities: each cell denotes t(a ↔ w), t(a ↔ x), etc.

        a     b     c     d
   w   1/4   1/4   1/4   1/4
   x   1/4   1/4   1/4   1/4
   y   1/4   1/4   1/4   1/4
   z   1/4   1/4   1/4   1/4
Example of expected count

   c[a↔w; (a b)↔(w x)]
     = [ t(a↔w) / ( t(a↔w) + t(a↔x) ) ] × #(a in "a b") × #(w in "w x")
     = [ (1/4) / (1/4 + 1/4) ] × 1 × 1
     = 1/2

"Counts"

For (a b) ↔ (w x):
        a     b     c     d
   w   1/2   1/2    0     0
   x   1/2   1/2    0     0
   y    0     0     0     0
   z    0     0     0     0

For (b c d) ↔ (x y z):
        a     b     c     d
   w    0     0     0     0
   x    0    1/3   1/3   1/3
   y    0    1/3   1/3   1/3
   z    0    1/3   1/3   1/3
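A small sketch of this expected-count computation, following the slide's convention (for a fixed English word, the denominator sums t over the French words of the sentence pair); the function name and dictionary layout are mine, not the lecture's.

```python
# Expected count of an (English word, French word) pairing in one sentence pair,
# following the slide's toy-example convention.
def expected_count(e_word, f_word, e_sent, f_sent, t):
    denom = sum(t[(e_word, f)] for f in f_sent)
    return (t[(e_word, f_word)] / denom) * e_sent.count(e_word) * f_sent.count(f_word)

t0 = {(e, f): 1.0 / 4 for e in "abcd" for f in "wxyz"}        # uniform initialization
print(expected_count("a", "w", ["a", "b"], ["w", "x"], t0))   # 0.5, as on the slide
```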
Revised probability: example

   t_revised(a↔w)
     = (1/2) / [ (1/2 + 1/2 + 0 + 0)  +  (0 + 0 + 0 + 0) ]
                  from (a b)↔(w x)       from (b c d)↔(x y z)
     = 1/2

Revised probabilities table
        a     b      c     d
   w   1/2   1/4     0     0
   x   1/2   5/12   1/3   1/3
   y    0    1/6    1/3   1/3
   z    0    1/6    1/3   1/3
"Revised counts"

For (a b) ↔ (w x):
        a     b     c     d
   w   1/2   3/8    0     0
   x   1/2   5/8    0     0
   y    0     0     0     0
   z    0     0     0     0

For (b c d) ↔ (x y z):
        a     b     c     d
   w    0     0     0     0
   x    0    5/9   1/3   1/3
   y    0    2/9   1/3   1/3
   z    0    2/9   1/3   1/3
Re-revised probabilities table
        a      b       c     d
   w   1/2   3/16      0     0
   x   1/2   85/144   1/3   1/3
   y    0    1/9      1/3   1/3
   z    0    1/9      1/3   1/3
Continue until convergence; notice that (b,x) binding gets progressively stronger
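Running the EM sketch given earlier on this toy corpus shows the same behaviour: the probability of the (lapins, rabbits) pair, the analogue of the (b, x) cell, keeps growing with each iteration. (The intermediate numbers differ slightly from the tables above because the sketch uses the standard Model 1 E-step direction rather than the slide's convention.)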
Another Example
A four-sentence corpus (two sentence pairs):
   a b ↔ x y   (illustrated book ↔ livre illustré)
   b c ↔ x z   (book shop ↔ livre magasin)
Assuming no null alignments, the possible alignments are:
   a b ↔ x y:   (a–x, b–y) or (a–y, b–x)
   b c ↔ x z:   (b–x, c–z) or (b–z, c–x)
Iteration 1
Initialize: uniform probabilities for all word translations:

   t(a|x) = t(b|x) = t(c|x) = 1/3
   t(a|y) = t(b|y) = t(c|y) = 1/3
   t(a|z) = t(b|z) = t(c|z) = 1/3

Compute the fractional counts:

   c(a|x; ab, xy) = 1/2 ,   c(a|x; bc, xz) = 0
   c(a|y; ab, xy) = 1/2 ,   c(a|y; bc, xz) = 0
   c(b|x; ab, xy) = 1/2 ,   c(b|x; bc, xz) = 1/2
   (and similarly for the remaining word pairs)
Iteration 2
From these counts, recompute the probabilities:

   t′(a|x) = 1/2 + 0 = 1/2 ;   t′(a|y) = 1/2 + 0 = 1/2 ;   t′(a|z) = 0
   t′(b|x) = 1/2 + 1/2 = 1 ;   t′(b|y) = 1/2 + 0 = 1/2 ;   t′(b|z) = 0 + 1/2 = 1/2
   t′(c|x) = 0 + 1/2 = 1/2 ;   t′(c|y) = 0 ;               t′(c|z) = 0 + 1/2 = 1/2

These probabilities are not normalized (indicated by t′).
Normalized probabilities: after iteration 2

   t(a|x) = (1/2) / (1/2 + 1 + 1/2) = 1/4 ;   t(a|y) = (1/2) / (1/2 + 1/2) = 1/2 ;   t(a|z) = 0
   t(b|x) = 1/2 ;   t(b|y) = 1/2 ;   t(b|z) = 1/2
   t(c|x) = 1/4 ;   t(c|y) = 0 ;     t(c|z) = 1/2
Normalized probabilities: after iteration 3

   t(a|x) = 0.15 ;   t(a|y) = 0.64 ;   t(a|z) = 0
   t(b|x) = 0.70 ;   t(b|y) = 0.36 ;   t(b|z) = 0.36
   t(c|x) = 0.15 ;   t(c|y) = 0 ;      t(c|z) = 0.64

After a few iterations the probabilities converge as expected: (a ⇔ y; b ⇔ x; c ⇔ z).
Translation Model: Exact expression

   Pr(f | e) = Σ_a Pr(f, a | e)
             = Σ_a Pr(m | e) ∏_{j=1}^{m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) Pr(f_j | a_1^{j}, f_1^{j-1}, m, e)

   Pr(m | e):      choose the length of the foreign language string, given e
   Pr(a_j | …):    choose the alignment, given e and m
   Pr(f_j | …):    choose the identity of the foreign word, given e, m, a

Five models for estimating the parameters in the expression [2]:
Model‐1, Model‐2, Model‐3, Model‐4, Model‐5
Proof of Translation Model: Exact expression

   Pr(f | e) = Σ_a Pr(f, a | e)                          ; marginalization

   Pr(f | e) = Σ_m Σ_a Pr(f, a, m | e)                   ; marginalization
             = Σ_m Σ_a Pr(m | e) Pr(f, a | m, e)
             = Σ_m Σ_a Pr(m | e) ∏_{j=1}^{m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) Pr(f_j | a_1^{j}, f_1^{j-1}, m, e)

m is fixed for a particular f, hence

   Pr(f, a | e) = Pr(m | e) ∏_{j=1}^{m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) Pr(f_j | a_1^{j}, f_1^{j-1}, m, e)
Model-1
• Simplest model
• Assumptions:
  – Pr(m|e) is independent of m and e and is equal to ε
  – Alignment of foreign language words (FLWs) depends only on the length of the English sentence:
      Pr(a_j | …) = (l+1)^{-1}, where l is the length of the English sentence
• The resulting likelihood function is maximized, constrained so that the translation probabilities are normalized (see below).
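The formulas themselves did not survive extraction; for reference, the standard IBM Model 1 likelihood and constraint consistent with the assumptions above are:

   Pr(f | e) = ε / (l+1)^m  ∏_{j=1}^{m} Σ_{i=0}^{l} t(f_j | e_i)

   subject to   Σ_f t(f | e) = 1   for every English word e.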
Model-1: Parameter estimation
Using a Lagrange multiplier for constrained maximization, the solution for the Model-1 parameters is obtained, where
   λ_e is the normalization constant,
   c(f | e; f, e) is the expected count, and
   δ(f, f_j) is the Kronecker delta (1 if f = f_j, 0 otherwise).
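The solution itself was lost in extraction; stated here for completeness (as an assumption about what the slide showed), the standard Model 1 estimates matching the symbols listed above are:

   t(f | e) = λ_e^{-1} Σ_{s=1}^{S} c(f | e; f^(s), e^(s))

   c(f | e; f, e) = [ t(f | e) / ( t(f | e_0) + … + t(f | e_l) ) ] × Σ_{j=1}^{m} δ(f, f_j) × Σ_{i=0}^{l} δ(e, e_i)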