(1)

A Systematic Comparison of Various Statistical Alignment Models

Och and Ney, Computational Linguistics, 2003

Presented by:
Ankit Ramteke, Ankur Aher, Piyush Dungarwal, Mandar Joshi

(2)

PART I

Ankit Ramteke, Piyush Dungarwal

(3)

Introduction

(4)

Noisy channel model
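Restating the decomposition that slide 41 gives in full:

```latex
\hat{e} \;=\; \arg\max_{e} \Pr(e \mid f)
        \;=\; \arg\max_{e} \underbrace{\Pr(e)}_{\text{language model}} \cdot \underbrace{\Pr(f \mid e)}_{\text{translation model}}
```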

(5)

Calculating Probability

Alignment probability:

Pr(a | f, e) = Pr(f, a | e) / Pr(f | e)

Two birds, one stone:
- compute Pr(f, a | e) using a generative story
- compute Pr(f | e) = Σ_a Pr(f, a | e)

(6)

Motivation

Possible alignments:

[Figure: the four possible alignments of the two-word sentence pair "b c" / "x y", each word of "x y" linking to either word of "b c"]

(7)

Outline of the Paper

- Review of various statistical alignment models and heuristic models, and training of the alignment models:
  - IBM Models 1-5, HMM model, Model 6 (Model 4 + HMM)
  - heuristic-based models
- Some heuristic methods for improving alignment quality
- Evaluation methodology for word alignment methods
- Systematic comparison of the various statistical alignment models

(8)

Topics for Today

Introduction

IBM Model 1

IBM Model 2

IBM Model 3

(9)

IBM Model 1

(10)

Description

(11)

Derivation
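For reference, the standard Model 1 likelihood, with ε = Pr(m|e) and a uniform choice among the l+1 English positions (NULL included) for each target word. The key step of the derivation is that the sum over all (l+1)^m alignments factorizes into a product, which is what keeps Model 1's E-step exact:

```latex
\Pr(f \mid e) \;=\; \frac{\epsilon}{(l+1)^{m}} \sum_{a} \prod_{j=1}^{m} t(f_j \mid e_{a_j})
              \;=\; \frac{\epsilon}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
```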

(12)

Learning Parameters

The EM algorithm consists of two steps:

- Expectation step: apply the model to the data
  - parts of the model are hidden (here: alignments)
  - using the model, assign probabilities to possible values
- Maximization step: estimate the model from the data
  - take the assigned values as fact
  - collect counts (weighted by probabilities)
  - estimate the model from the counts

Iterate these steps until convergence.
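As a concrete illustration of the two steps for Model 1, here is a minimal Python sketch of the EM loop (the corpus format and the explicit 'NULL' token on the English side are assumptions made for this example):

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """corpus: list of (foreign_tokens, english_tokens) pairs; the English
    side is assumed to include an explicit 'NULL' token."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # t(f|e), initialized uniformly
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        # E-step: assign probabilities to the hidden alignments
        for fs, es in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)   # normalize over English words
                for e in es:
                    p = t[(f, e)] / z            # posterior alignment probability
                    count[(f, e)] += p
                    total[e] += p
        # M-step: take the weighted counts as fact and re-estimate t(f|e)
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# toy usage: t(haus|house) converges towards 1 on this two-sentence corpus
corpus = [("das haus".split(), ["NULL"] + "the house".split()),
          ("das buch".split(), ["NULL"] + "the book".split())]
t = train_ibm1(corpus)
```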

(13)

Learning Parameters

Example taken from SMT by Koehn

(14)

Example taken from SMT by Koehn

(15)

Example taken from SMT by Koehn

(16)

Example taken from SMT by Koehn

(17)

Example taken from SMT by Koehn

(18)
(19)

EM Algorithm

(20)

IBM Model 2

(21)

Motivation

Why Model 2 when we have Model 1?

<NULL> राम पाठशाला गया
Ram went to school

<NULL> राम पाठशाला गया
school Ram to went

Model 1 ignores word positions, so both of these alignments receive the same probability.

(22)

IBM Model 2

Focuses on absolute alignment positions.

Assumptions:

1. Pr(m | e) is independent of e and m
   - new parameter: ε = Pr(m | e)
2. The alignment position a_j depends only on j, m, and l, replacing Model 1's uniform distribution over the l+1 positions (NULL included)
   - alignment probability: a(a_j | j, m, l)
3. Pr(f_j | a_1^j, f_1^(j-1), m, e) depends only on f_j and e_aj
   - translation probability: t(f_j | e_aj)

New parameters: the alignment probabilities a(a_j | j, m, l), one distribution for each j = 1, ..., m.
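Putting the three assumptions together gives the Model 2 likelihood; Model 1 is the special case a(i | j, m, l) = 1/(l+1):

```latex
\Pr(f, a \mid e) \;=\; \epsilon \prod_{j=1}^{m} a(a_j \mid j, m, l)\; t(f_j \mid e_{a_j})
```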

Training

(23)

IBM Model 3

(24)

Why Model 3?

Possible alignments:

[Figure: the four possible alignments of the two-word sentence pair "b c" / "x y"]

(25)

Terminology

Model 3 adds a fertility model.

- Fertility probability
  - e.g., n(2 | house) = probability of generating 2 words for the word 'house'
- Translation probability: same as Model 1
  - e.g., t(maison | house) = probability of 'maison' being the translation of 'house'
- Distortion probability
  - e.g., d(5 | 2) = probability that the word at position 2 goes to position 5
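Schematically, the three components multiply per position. The form below omits the NULL-insertion and fertility-ordering combinatorial factors of the full Model 3 (slide 29 gives the final probability):

```latex
\Pr(f, a \mid e) \;\propto\; \prod_{i=1}^{l} n(\phi_i \mid e_i)\; \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \prod_{j:\, a_j \neq 0} d(j \mid a_j, m, l)
```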

(26)

Derivation from Noisy Channel

(27)

Example

This city is famous for its flora.

This city is famous for its flora flora    (fertility step)

This city is famous NULL for its flora flora    (NULL insertion step)

यह शहर है मशहूर के लिए अपने पेड़ पौधे    (lexical translation step)

यह शहर अपने पेड़ पौधों के लिए मशहूर है    (distortion step)

(28)

Modifications Required

(29)

Final Probability

(30)

Generative Process

(31)

Generative Process

(32)

Training

• Training flow diagram from Kevin Knight's SMT tutorial

(33)

Deficiency

• Distortion probabilities do not depend on the earlier words.

• Model 3 wastes some of its probability on generalized strings: strings that have some positions with several words and others with none.

• When a model has this property of not concentrating all of its probability on events of interest, it is said to be deficient.

(34)

Example

<NULL> राम पाठशाला गया

Ram

<NULL> <NULL> went school to

(The distortion model can place several output words at one position and leave other positions empty, producing such ill-formed "generalized strings".)

(35)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes

Comparison of Statistical Models

(36)

Heuristic Models

(37)

PART II

Ankur Aher, Mandar Joshi

(38)

Outline

Recap: IBM Models 1, 2 & 3
HMM-based alignment approach
IBM Model 4

(39)

MT Phenomena

राम पाठशाला गया

- Word-to-word translation: Ram school went
- Fertility: Ram school went to
- Re-ordering: Ram went to school

(40)

SMT

(41)

Noisy channel model

argmax_e Pr(e | f) = argmax_e Pr(e) · Pr(f | e)
(language model · translation model)

Pr(f | e) = Σ_a Pr(f, a | e)
(alignment model)

Alignment example:

<NULL> राम पाठशाला गया
Ram went to school

Positional vector: a = {1→1, 2→3, 3→0, 4→2}

(42)

Translation model

(43)

Model 1

(44)

Model 1

(45)

Model 2

Why Model 2 when we have Model 1?

<NULL> राम पाठशाला गया

Ram went to school

<NULL> राम पाठशाला गया

school Ram to went

(46)

Model 2

(47)

Model 2

(48)

Model 3

Possible alignments:

[Figure: the four possible alignments of the two-word sentence pair "b c" / "x y"]

(49)

Terminology

(50)

Terminology

(51)

Example

Example taken from SMT by Koehn

(52)

Model 3

(53)

Deficiency

<Null> राम पाठशाला गया

Ram

<Null> <Null> went school to

(54)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes

Comparison of Statistical Models

(55)

Hidden Markov Alignment Model

(56)

Motivation

In the translation process, large phrases tend to move together: words that are adjacent in the source language tend to be next to each other in the target language. A strong localization effect is observed in alignments.

(57)

Motivation

Hindi-English alignment example:

[Alignment matrix: the English words "Indian team won cricket world cup three times" (one per row) aligned to the Hindi sentence "भारतीय टीम क्रिकेट विश्व कप तीन बार जीती", showing a strongly localized alignment]

(58)

What is Hidden?

(59)

Capturing Locality

The HMM captures the locality of the English sentence.

(60)

Homogenous HMM

To make the alignment parameters independent of absolute word positions, we assume that the alignment probabilities p(i | i′, l) depend only on the jump width (i − i′).
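Concretely, following Vogel et al. (1996) from the references, the HMM alignment model and its homogeneous jump-width parameterization take the form below; the jump-width counts c(·) are the quantities estimated in training:

```latex
\Pr(f, a \mid e) \;=\; \Pr(m \mid e) \prod_{j=1}^{m} p(a_j \mid a_{j-1}, l)\; t(f_j \mid e_{a_j}),
\qquad
p(i \mid i', l) \;=\; \frac{c(i - i')}{\sum_{i''=1}^{l} c(i'' - i')}
```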

(61)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
HMM       First-order   No          Exact         No

Comparison of Statistical Models

(62)

IBM Model 4

(63)

Motivation

Large phrases tend to move together. This intuition is captured through relative distortion: the placement of the translation of an input word is typically based on the placement of the translation of the preceding input word.

(64)

Example

The Indian team played well for 90 minutes.

भारतीय टीम ९० मिनट अच्छी खेली.

(65)

Terminology

Cept: each input word f_j that is aligned to at least one output word forms a cept.

हिंदी की किताब अलमारी में थी.
The Hindi book was in the cupboard.

The underlined words are cepts (every word here except the unaligned की).

(66)

Terminology

We define the operator [i] to map the cept with index i back to its corresponding foreign input word position.

(67)

Center of a cept

The center of a cept is defined as the ceiling of the average of the output word positions for that cept. We use ⊙i to indicate the center of cept i.

(68)

Table 1

Cept π_i               π1      π2      π3        π4       π5
Foreign position [i]   1       3       4         5        6
Foreign word f[i]      हिंदी    किताब    अलमारी     में       थी
English words {e_j}    Hindi   book    cupboard  in, the  was
English positions {j}  2       3       7         5, 6     4
Center of cept ⊙i      2       3       7         6        4

(69)

Relative Distortion

Three cases:

- words that are generated by the NULL token: these are uniformly distributed
- the first word in a cept
- subsequent words in a cept

(70)

The first word in a cept

For the likelihood of the placement of the first word of a cept, we use the distribution

d1(j − ⊙i−1)

Example: see Table 2.

(71)

Subsequent words in a cept

The distribution d>1(j − πi,k−1)

Example: see Table 2.

(72)

Table 2

j               1      2       3       4       5       6        7
e_j             The    Hindi   book    was     in      the      cupboard
in cept π_i,k   π0,0   π1,0    π2,0    π5,0    π4,0    π4,1     π3,0
⊙_i−1           -      0       2       6       7       -        3
j − ⊙_i−1       -      +2      +1      −2      −2      -        +4
distortion      1      d1(+2)  d1(+1)  d1(−2)  d1(−2)  d>1(+1)  d1(+4)
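A small Python sketch recomputing Table 2's offsets from Table 1's cepts, treating the center before the first cept as 0 (as Table 2 does):

```python
from math import ceil

def distortion_offsets(cepts):
    """cepts: list of (foreign_position, english_positions) in cept order.
    Returns (label, offset) pairs for the d1 / d>1 distributions."""
    offsets = []
    prev_center = 0                      # center before the first cept
    for _, positions in cepts:
        positions = sorted(positions)
        # first word of the cept: relative to the previous cept's center
        offsets.append(("d1", positions[0] - prev_center))
        # subsequent words: relative to the previous word of the same cept
        for k in range(1, len(positions)):
            offsets.append(("d>1", positions[k] - positions[k - 1]))
        prev_center = ceil(sum(positions) / len(positions))  # center of cept
    return offsets

# the cepts of Table 1 for 'हिंदी की किताब अलमारी में थी'
cepts = [(1, [2]), (3, [3]), (4, [7]), (5, [5, 6]), (6, [4])]
print(distortion_offsets(cepts))
# -> d1(+2), d1(+1), d1(+4), d1(-2), d>1(+1), d1(-2), matching Table 2
```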

(73)

Model 4

(74)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
HMM       First-order   No          Exact         No
Model 4   First-order   Yes         Approximate   Yes

Comparison of Statistical Models

(75)

Word Classes

A smarter way to look at re-ordering: some words tend to get reordered while others do not.

Examples:
- good man → भला आदमी (no reordering)
- external affairs → affaires extérieures (adjective reordered)

(76)

Word Classes

Conditioning on individual words will not work due to sparsity. Instead, create word classes by dividing the vocabulary, e.g., by POS tags.

More on this later.

(77)

PART III

Ankit Ramteke, Ankur Aher

(78)

Outline

Topics covered:
- IBM Models 1, 2, 3, 4
- HMM

Today's topics:
- Heuristic models
- IBM Model 5
- Model 6
- Comparison of various methods based on:
  - size of the training corpus
  - EM training methodology

(79)

Heuristic Models

(80)

Dice

(81)

Example

Parallel corpus:

Those boys play cricket.         वह लड़के क्रिकेट खेलते हैं.
That boy plays cricket.          वह लड़का क्रिकेट खेलता है.
I saw that house.                मैंने वह घर देखा.
I slept in that house.           मैं उस घर में सोया.
I played in that house.          मैं उस घर में खेला.
I liked that house.              मुझे वह घर भाया.
That boy went in the house.      वह लड़का घर में गया.

Example: That is my house. वह मेरा घर है.

dice(that, वह) = (2*4)/(6*5) = 0.266
dice(that, घर) = (2*5)/(6*5) = 0.333    (indirect association)
dice(house, घर) = (2*5)/(5*5) = 0.4
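A short Python sketch reproducing the slide's computation. Note that this slide normalizes by the product of the counts, whereas Och and Ney's paper defines the Dice coefficient with the sum C(e) + C(f) in the denominator; the sketch follows the slide. Counts are per-sentence occurrence counts, and tokens are assumed lowercased and stripped of punctuation:

```python
from collections import Counter
from itertools import product

def association_scores(corpus):
    """corpus: list of (english_tokens, foreign_tokens) sentence pairs.
    Returns the slide's score 2*C(e,f) / (C(e)*C(f)) for each word pair,
    where C counts the sentences containing the word(s)."""
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for es, fs in corpus:
        es, fs = set(es), set(fs)        # count each word once per sentence
        c_e.update(es)
        c_f.update(fs)
        c_ef.update(product(es, fs))
    return {(e, f): 2 * c / (c_e[e] * c_f[f]) for (e, f), c in c_ef.items()}

# e.g. scores[('that', 'वह')] == 2*4 / (6*5) == 0.266... on the slide's corpus
```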

(82)

Competitive linking algorithm

Iteratively align the highest-ranking word pair (i, j).

Advantage: indirect associations occur less often.

Link: house → घर

          वह     मेरा    घर     है
That      0.26   0.0    0.33   0.16
is        0.0    0.0    0.0    0.0
my        0.0    0.0    0.0    0.0
house     0.24   0.0    0.4    0.0

The Dice coefficient results in worse alignment quality than the statistical models.

(83)

Competitive linking algorithm

Iteratively align the highest-ranking word pair (i, j).

Advantage: indirect associations occur less often.

Next link: That → वह

          वह     मेरा    घर     है
That      0.26   0.0    0.33   0.16
is        0.0    0.0    0.0    0.0
my        0.0    0.0    0.0    0.0
house     0.24   0.0    0.4    0.0

The Dice coefficient results in worse alignment quality than the statistical models.

(84)

Competitive linking algorithm

Iteratively align the highest-ranking word pair (i, j).

Advantage: indirect associations occur less often.

Remaining links: my → मेरा, is → है

          वह     मेरा    घर     है
That      0.26   0.0    0.33   0.16
is        0.0    0.0    0.0    0.0
my        0.0    0.0    0.0    0.0
house     0.24   0.0    0.4    0.0

The Dice coefficient results in worse alignment quality than the statistical models.
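The greedy procedure the three slides step through, as a Python sketch; it operates on word types rather than positions (i, j), which suffices for this example, and the final my/is links on the slides come from pairing the leftover words:

```python
def competitive_linking(scores):
    """scores: dict mapping (e_word, f_word) -> association score.
    Repeatedly links the best-scoring pair whose words are both still free,
    as on the slides: first house->घर (0.4), then That->वह (0.26)."""
    links, used_e, used_f = [], set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s > 0 and e not in used_e and f not in used_f:
            links.append((e, f))
            used_e.add(e)
            used_f.add(f)
    return links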

(85)

Distortion Probabilities

Model 5 defines its first-word distortion probabilities over vacancies, where v_j is the number of vacancies in the English output interval [1, j].

(86)

हिंदी की किताब अलमारी में थी.
The Hindi book was in the cupboard.

                  Vacancies at English positions 1-7       Parameters of d1
f[i]     π_i,k    v1   v2   v3   v4   v5   v6   v7         j    v_j   v_max   v_⊙i−1
NULL     π0,1     1    2    3    4    5    6    7          1    -     -       -
हिंदी     π1,1     -    1    2    3    4    5    6          2    1     6       0
किताब    π2,1     -    -    1    2    3    4    5          3    1     5       0
अलमारी   π3,1     -    -    -    1    2    3    4          7    4     4       0
में       π4,1     -    -    -    1    2    -    -          5    2     2       3
में       π4,2     -    -    -    1    -    2    -          6    2     -       -
थी       π5,1     -    -    -    1    -    -    -          4    1     1       0

(87)

Model 5

(88)

Model 6

(89)

Motivation

HMM makes use of locality in the source language

Model 4 makes use of locality in the target language

How can we use both?

(90)

Formulation

Combine the HMM and Model 4 in a log-linear way (Och & Ney 2003), where α is the interpolation parameter.

(91)

General Formulation

A log-linear combination of several models pk(f, a | e), k = 1, ..., K, is given by:
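The combination as given in Och and Ney (2003); Model 6 is the special case K = 2 with components p4 and pHMM, the interpolation parameter α acting as a model weight:

```latex
p(f, a \mid e) \;=\; \frac{\prod_{k=1}^{K} p_k(f, a \mid e)^{\alpha_k}}
                          {\sum_{f', a'} \prod_{k=1}^{K} p_k(f', a' \mid e)^{\alpha_k}}
```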

(92)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
HMM       First-order   No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
Model 4   First-order   Yes         Approximate   Yes
Model 5   First-order   Yes         Approximate   No
Model 6   First-order   Yes         Approximate   Yes

Comparison of Statistical Models

(93)

Summary

Model     Alignment     Description
Model 1   Uniform       uniform distribution 1/(l+1) over positions
Model 2   Zero-order    a(aj | j, m, l)
HMM       First-order   p(aj | aj−1, l), a function of the jump width
Model 3   Zero-order    d(j | aj, m, l)
Model 4   First-order   d1(j − ⊙i−1) and d>1(j − πi,k−1)
Model 5   First-order   vacancy-based distortion (non-deficient)
Model 6   First-order   log-linear combination of HMM and Model 4

(94)

Corpus

(95)

Verbmobil task

German-English speech translation task.

                              German    English
Training corpus  Sentences    34,446 (~34K)
                 Words        329,625   343,076
                 Vocabulary   5,936     3,505
                 Singletons   2,600     1,305
Bilingual dict.  Entries      4,404
                 Words        4,758     5,543
Test corpus      Sentences    354
                 Words        3,233     3,109

(96)

Hansards Task

The French-English Hansards task consists of the debates in the Canadian parliament.

                              French     English
Training corpus  Sentences    1,470K
                 Words        24.33M     22.16M
                 Vocabulary   100,269    78,332
                 Singletons   40,199     31,319
Bilingual dict.  Entries      28,701
                 Words        28,702     30,186
Test corpus      Sentences    500
                 Words        8,749      7,946

(97)

Comparison based on various metrics

(98)

Based on training schemes

(Verbmobil)

(99)

Based on training schemes

(Hansards)

(100)

Observations - 1

- Refined models (4, 5, 6) perform better at all corpus sizes.
- As the corpus size increases, the improvement becomes much more significant.
- Statistical models perform better than heuristic models, owing to the well-founded mathematical theory that underlies their parameter estimation.

(101)

Observations - 2

The hidden Markov alignment model achieves significantly better results than Model 2:
- it is a homogeneous first-order alignment model
- it better represents the locality and monotonicity properties of natural languages

Alignment quality depends heavily on the method used to bootstrap the model:
- bootstrapping with the HMM performs better than bootstrapping with Model 2

(102)

EM training

Models 1, 2, and the HMM have an exact E-step; Models 3, 4, 5, and 6 are approximate:
- they restrict the E-step to a smaller subset of alignments

Three methods to select the subset:

(103)

Viterbi alignment (Brown et al. 1993)

(104)

Viterbi + neighborhood (Al-Onaizan et al. 1999)

(105)

Pegged alignments (Brown et al. 1993)

(106)

Effect of more alignments

(Verbmobil)

(107)

Effect of more alignments

(Hansards)

(108)

Observations

- The effect of pegging strongly depends on the quality of the starting point.
- If Model 2 is used as the starting point, there is a significant improvement with the neighborhood alignments and the pegged alignments.
- Using only the Viterbi alignment gives significantly worse results than additionally using the neighborhood of the Viterbi alignment.
- With the HMM as the starting point, the effect is much smaller.
- Using more alignments in training is a way to avoid a poor local optimum.

(109)

Computational times

Using the pegged alignments yields only a moderate improvement in performance, with a considerable increase in computational time.

(110)

PART IV

Piyush Dungarwal, Mandar Joshi

(111)

Effect of Smoothing

Smoothing by linear interpolation.

For the alignment probability (interpolating with a uniform distribution):
p′(aj | aj−1, l) = (1 − α) · p(aj | aj−1, l) + α / l

For the fertility probability (interpolating with a general, word-independent fertility distribution n0):
n′(φ | e) = (1 − β) · n(φ | e) + β · n0(φ)

(112)

Results (Verbmobil task)

(113)

Results (Hansards task)

(114)

Direction

(115)

Direction

(116)

Symmetrization

Intersection: A = A1 ∩ A2
Union: A = A1 ∪ A2

Refined method:
- In a first step, the intersection A = A1 ∩ A2 is determined.
- A is then extended by adding alignments (i, j) occurring only in A1 or only in A2 if:
  - neither fj nor ei has an alignment in A, or
  - both of the following conditions hold:
    - (i, j) has a horizontal or vertical neighbour in A
    - A ∪ {(i, j)} does not contain alignments with both horizontal and vertical neighbours
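A Python sketch of the refined combination heuristic just described, with one simplification flagged in the comments: the "both horizontal and vertical neighbours" condition is checked for the candidate link itself rather than for every link of the extended set:

```python
def refined_symmetrization(a1, a2):
    """a1, a2: sets of links (i, j) from the two translation directions,
    both expressed in the same (English i, foreign j) coordinates."""
    A = set(a1) & set(a2)                      # step 1: the intersection
    aligned_e = {i for i, _ in A}
    aligned_f = {j for _, j in A}

    grew = True
    while grew:                                # step 2: grow A to a fixed point
        grew = False
        for (i, j) in sorted((a1 | a2) - A):
            horiz = (i, j - 1) in A or (i, j + 1) in A
            vert = (i - 1, j) in A or (i + 1, j) in A
            unaligned = i not in aligned_e and j not in aligned_f
            # add the link if both words are still unaligned, or if it has a
            # horizontal or vertical neighbour in A but not both (a simplified
            # reading of the condition on A U {(i, j)})
            if unaligned or ((horiz or vert) and not (horiz and vert)):
                A.add((i, j))
                aligned_e.add(i)
                aligned_f.add(j)
                grew = True
    return A
```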

(117)

[Figure: F→E and E→F alignment matrices over words e1-e4 and f1-f4, together with the resulting Intersection, Union, and Refined alignments]

(118)

Symmetrization

(119)

Symmetrization

(120)

Conclusion

- The paper compares heuristic and statistical alignment methods (IBM Models 1-5, HMM, Model 6).
- Statistical alignment models outperform the simple Dice coefficient.
- Model 6 outperforms all other models.
- Smoothing and symmetrization have a significant effect on the alignment quality achieved by a particular model.

(121)

Conclusion

- A bilingual dictionary and word classes have only a minor effect on alignment quality.
- In general, the important ingredients of a good model seem to be a first-order dependence between word positions and a fertility model.

(122)

References

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press. Chapter 4 (Word-Based Models).

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-Based Word Alignment in Statistical Translation. In COLING, pages 836-841.
