(1)

A Systematic Comparison of Various Statistical Alignment Models

Och and Ney, Computational Linguistics, 2003

Presented by:
Ankit Ramteke, Ankur Aher, Piyush Dungarwal, Mandar Joshi

(2)

PART I

Ankit Ramteke, Piyush Dungarwal

(3)

Introduction

(4)

Noisy channel model
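Restating the decomposition that slide 41 gives in full:

```latex
\hat{e} \;=\; \arg\max_{e} \Pr(e \mid f)
        \;=\; \arg\max_{e} \underbrace{\Pr(e)}_{\text{language model}} \cdot \underbrace{\Pr(f \mid e)}_{\text{translation model}}
```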

(5)

Calculating Probability

Alignment probability:

Pr(a | f, e) = Pr(f, a | e) / Pr(f | e)

Two birds, one stone:
- compute Pr(f, a | e) using a generative story
- compute Pr(f | e) = Σ_a Pr(f, a | e)

(6)

Motivation

Possible alignments:

[Figure: the four possible alignments of the two-word sentence pair "b c" / "x y", each word of "x y" linking to either word of "b c"]

(7)

Outline of the Paper

- Review of various statistical alignment models and heuristic models, and training of the alignment models:
  - IBM Models 1-5, HMM model, Model 6 (Model 4 + HMM)
  - heuristic-based models
- Some heuristic methods for improving alignment quality
- Evaluation methodology for word alignment methods
- Systematic comparison of the various statistical alignment models

(8)

Topics for Today

Introduction

IBM Model 1

IBM Model 2

IBM Model 3

(9)

IBM Model 1

(10)

Description

(11)

Derivation
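For reference, the standard Model 1 likelihood, with ε = Pr(m|e) and a uniform choice among the l+1 English positions (NULL included) for each target word. The key step of the derivation is that the sum over all (l+1)^m alignments factorizes into a product, which is what keeps Model 1's E-step exact:

```latex
\Pr(f \mid e) \;=\; \frac{\epsilon}{(l+1)^{m}} \sum_{a} \prod_{j=1}^{m} t(f_j \mid e_{a_j})
              \;=\; \frac{\epsilon}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
```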

(12)

Learning Parameters

The EM algorithm consists of two steps:

- Expectation step: apply the model to the data
  - parts of the model are hidden (here: alignments)
  - using the model, assign probabilities to possible values
- Maximization step: estimate the model from the data
  - take the assigned values as fact
  - collect counts (weighted by probabilities)
  - estimate the model from the counts

Iterate these steps until convergence.
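As a concrete illustration of the two steps for Model 1, here is a minimal Python sketch of the EM loop (the corpus format and the explicit 'NULL' token on the English side are assumptions made for this example):

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """corpus: list of (foreign_tokens, english_tokens) pairs; the English
    side is assumed to include an explicit 'NULL' token."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # t(f|e), initialized uniformly
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        # E-step: assign probabilities to the hidden alignments
        for fs, es in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)   # normalize over English words
                for e in es:
                    p = t[(f, e)] / z            # posterior alignment probability
                    count[(f, e)] += p
                    total[e] += p
        # M-step: take the weighted counts as fact and re-estimate t(f|e)
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# toy usage: t(haus|house) converges towards 1 on this two-sentence corpus
corpus = [("das haus".split(), ["NULL"] + "the house".split()),
          ("das buch".split(), ["NULL"] + "the book".split())]
t = train_ibm1(corpus)
```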

(13)

Learning Parameters

Example taken from SMT by Koehn

(14)

Example taken from SMT by Koehn

(15)

Example taken from SMT by Koehn

(16)

Example taken from SMT by Koehn

(17)

Example taken from SMT by Koehn

(18)
(19)

EM Algorithm

(20)

IBM Model 2

(21)

Motivation

Why Model 2 when we have Model 1?

<NULL> राम पाठशाला गया
Ram went to school

<NULL> राम पाठशाला गया
school Ram to went

Model 1 ignores word positions, so both of these alignments receive the same probability.

(22)

IBM Model 2

Focuses on absolute alignment positions.

Assumptions:

1. Pr(m | e) is independent of e and m
   - new parameter: ε = Pr(m | e)
2. The alignment position a_j depends only on j, m, and l, replacing Model 1's uniform distribution over the l+1 positions (NULL included)
   - alignment probability: a(a_j | j, m, l)
3. Pr(f_j | a_1^j, f_1^(j-1), m, e) depends only on f_j and e_aj
   - translation probability: t(f_j | e_aj)

New parameters: the alignment probabilities a(a_j | j, m, l), one distribution for each j = 1, ..., m.
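Putting the three assumptions together gives the Model 2 likelihood; Model 1 is the special case a(i | j, m, l) = 1/(l+1):

```latex
\Pr(f, a \mid e) \;=\; \epsilon \prod_{j=1}^{m} a(a_j \mid j, m, l)\; t(f_j \mid e_{a_j})
```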

Training

(23)

IBM Model 3

(24)

Why Model 3?

Possible alignments:

[Figure: the four possible alignments of the two-word sentence pair "b c" / "x y"]

(25)

Terminology

Model 3 adds a fertility model.

- Fertility probability
  - e.g., n(2 | house) = probability of generating 2 words for the word 'house'
- Translation probability: same as Model 1
  - e.g., t(maison | house) = probability of 'maison' being the translation of 'house'
- Distortion probability
  - e.g., d(5 | 2) = probability that the word at position 2 goes to position 5
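Schematically, the three components multiply per position. The form below omits the NULL-insertion and fertility-ordering combinatorial factors of the full Model 3 (slide 29 gives the final probability):

```latex
\Pr(f, a \mid e) \;\propto\; \prod_{i=1}^{l} n(\phi_i \mid e_i)\; \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \prod_{j:\, a_j \neq 0} d(j \mid a_j, m, l)
```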

(26)

Derivation from Noisy Channel

(27)

Example

This city is famous for its flora.

This city is famous for its flora flora    (fertility step)

This city is famous NULL for its flora flora    (NULL insertion step)

यह शहर है मशहूर के लिए अपने पेड़ पौधे    (lexical translation step)

यह शहर अपने पेड़ पौधों के लिए मशहूर है    (distortion step)

(28)

Modifications Required

(29)

Final Probability

(30)

Generative Process

(31)

Generative Process

(32)

Training

• Training flow diagram from Kevin Knight's SMT tutorial

(33)

Deficiency

• Distortion probabilities do not depend on the earlier words.

• Model 3 wastes some of its probability on generalized strings: strings that have some positions with several words and others with none.

• When a model has this property of not concentrating all of its probability on events of interest, it is said to be deficient.

(34)

Example

<NULL> राम पाठशाला गया

Ram

<NULL> <NULL> went school to

(The distortion model can place several output words at one position and leave other positions empty, producing such ill-formed "generalized strings".)

(35)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes

Comparison of Statistical Models

(36)

Heuristic Models

(37)

PART II

Ankur Aher, Mandar Joshi

(38)

Outline

Recap: IBM Models 1, 2 & 3
HMM-based alignment approach
IBM Model 4

(39)

MT Phenomena

राम पाठशाला गया

- Word-to-word translation: Ram school went
- Fertility: Ram school went to
- Re-ordering: Ram went to school

(40)

SMT

(41)

Noisy channel model

argmax_e Pr(e | f) = argmax_e Pr(e) · Pr(f | e)
(language model · translation model)

Pr(f | e) = Σ_a Pr(f, a | e)
(alignment model)

Alignment example:

<NULL> राम पाठशाला गया
Ram went to school

Positional vector: a = {1→1, 2→3, 3→0, 4→2}

(42)

Translation model

(43)

Model 1

(44)

Model 1

(45)

Model 2

Why Model 2 when we have Model 1?

<NULL> राम पाठशाला गया

Ram went to school

<NULL> राम पाठशाला गया

school Ram to went

(46)

Model 2

(47)

Model 2

(48)

Model 3

Possible alignments:

[Figure: the four possible alignments of the two-word sentence pair "b c" / "x y"]

(49)

Terminology

(50)

Terminology

(51)

Example

Example taken from SMT by Koehn

(52)

Model 3

(53)

Deficiency

<Null> राम पाठशाला गया

Ram

<Null> <Null> went school to

(54)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes

Comparison of Statistical Models

(55)

Hidden Markov Alignment Model

(56)

Motivation

In the translation process, large phrases tend to move together: words that are adjacent in the source language tend to be next to each other in the target language. A strong localization effect is observed in alignments.

(57)

Motivation

Hindi-English alignment example:

[Alignment matrix: the English words "Indian team won cricket world cup three times" (one per row) aligned to the Hindi sentence "भारतीय टीम क्रिकेट विश्व कप तीन बार जीती", showing a strongly localized alignment]

(58)

What is Hidden?

(59)

Capturing Locality

The HMM captures the locality of the English sentence.

(60)

Homogenous HMM

To make the alignment parameters independent of absolute word positions, we assume that the alignment probabilities p(i | i′, l) depend only on the jump width (i − i′).
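Concretely, following Vogel et al. (1996) from the references, the HMM alignment model and its homogeneous jump-width parameterization take the form below; the jump-width counts c(·) are the quantities estimated in training:

```latex
\Pr(f, a \mid e) \;=\; \Pr(m \mid e) \prod_{j=1}^{m} p(a_j \mid a_{j-1}, l)\; t(f_j \mid e_{a_j}),
\qquad
p(i \mid i', l) \;=\; \frac{c(i - i')}{\sum_{i''=1}^{l} c(i'' - i')}
```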

(61)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
HMM       First-order   No          Exact         No

Comparison of Statistical Models

(62)

IBM Model 4

(63)

Motivation

Large phrases tend to move together. This intuition is captured through relative distortion: the placement of the translation of an input word is typically based on the placement of the translation of the preceding input word.

(64)

Example

The Indian team played well for 90 minutes.

भारतीय टीम ९० मिनट अच्छी खेली.

(65)

Terminology

Cept: each input word f_j that is aligned to at least one output word forms a cept.

हिंदी की किताब अलमारी में थी.
The Hindi book was in the cupboard.

The underlined words are cepts (every word here except the unaligned की).

(66)

Terminology

We define the operator [i] to map the cept with index i back to its corresponding foreign input word position.

(67)

Center of a cept

The center of a cept is defined as the ceiling of the average of the output word positions for that cept. We use ⊙i to indicate the center of cept i.

(68)

Table 1

Cept π_i               π1      π2      π3        π4       π5
Foreign position [i]   1       3       4         5        6
Foreign word f[i]      हिंदी    किताब    अलमारी     में       थी
English words {e_j}    Hindi   book    cupboard  in, the  was
English positions {j}  2       3       7         5, 6     4
Center of cept ⊙i      2       3       7         6        4

(69)

Relative Distortion

Three cases:

- words that are generated by the NULL token: these are uniformly distributed
- the first word in a cept
- subsequent words in a cept

(70)

The first word in a cept

For the likelihood of the placement of the first word of a cept, we use the distribution

d1(j − ⊙i−1)

Example: see Table 2.

(71)

Subsequent words in a cept

The distribution d>1(j − πi,k−1)

Example: see Table 2.

(72)

Table 2

j               1      2       3       4       5       6        7
e_j             The    Hindi   book    was     in      the      cupboard
in cept π_i,k   π0,0   π1,0    π2,0    π5,0    π4,0    π4,1     π3,0
⊙_i−1           -      0       2       6       7       -        3
j − ⊙_i−1       -      +2      +1      −2      −2      -        +4
distortion      1      d1(+2)  d1(+1)  d1(−2)  d1(−2)  d>1(+1)  d1(+4)
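A small Python sketch recomputing Table 2's offsets from Table 1's cepts, treating the center before the first cept as 0 (as Table 2 does):

```python
from math import ceil

def distortion_offsets(cepts):
    """cepts: list of (foreign_position, english_positions) in cept order.
    Returns (label, offset) pairs for the d1 / d>1 distributions."""
    offsets = []
    prev_center = 0                      # center before the first cept
    for _, positions in cepts:
        positions = sorted(positions)
        # first word of the cept: relative to the previous cept's center
        offsets.append(("d1", positions[0] - prev_center))
        # subsequent words: relative to the previous word of the same cept
        for k in range(1, len(positions)):
            offsets.append(("d>1", positions[k] - positions[k - 1]))
        prev_center = ceil(sum(positions) / len(positions))  # center of cept
    return offsets

# the cepts of Table 1 for 'हिंदी की किताब अलमारी में थी'
cepts = [(1, [2]), (3, [3]), (4, [7]), (5, [5, 6]), (6, [4])]
print(distortion_offsets(cepts))
# -> d1(+2), d1(+1), d1(+4), d1(-2), d>1(+1), d1(-2), matching Table 2
```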

(73)

Model 4

(74)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
HMM       First-order   No          Exact         No
Model 4   First-order   Yes         Approximate   Yes

Comparison of Statistical Models

(75)

Word Classes

A smarter way to look at re-ordering: some words tend to get reordered while others do not.

Examples:
- good man → भला आदमी (no reordering)
- external affairs → affaires extérieures (adjective reordered)

(76)

Word Classes

Conditioning on individual words will not work due to sparsity. Instead, create word classes by dividing the vocabulary, e.g., by POS tags.

More on this later.

(77)

PART III

Ankit Ramteke, Ankur Aher

(78)

Outline

Topics covered:
- IBM Models 1, 2, 3, 4
- HMM

Today's topics:
- Heuristic models
- IBM Model 5
- Model 6
- Comparison of various methods based on:
  - size of the training corpus
  - EM training methodology

(79)

Heuristic Models

(80)

Dice

(81)

Example

Parallel corpus:

Those boys play cricket.         वह लड़के क्रिकेट खेलते हैं.
That boy plays cricket.          वह लड़का क्रिकेट खेलता है.
I saw that house.                मैंने वह घर देखा.
I slept in that house.           मैं उस घर में सोया.
I played in that house.          मैं उस घर में खेला.
I liked that house.              मुझे वह घर भाया.
That boy went in the house.      वह लड़का घर में गया.

Example: That is my house. वह मेरा घर है.

dice(that, वह) = (2*4)/(6*5) = 0.266
dice(that, घर) = (2*5)/(6*5) = 0.333    (indirect association)
dice(house, घर) = (2*5)/(5*5) = 0.4
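A short Python sketch reproducing the slide's computation. Note that this slide normalizes by the product of the counts, whereas Och and Ney's paper defines the Dice coefficient with the sum C(e) + C(f) in the denominator; the sketch follows the slide. Counts are per-sentence occurrence counts, and tokens are assumed lowercased and stripped of punctuation:

```python
from collections import Counter
from itertools import product

def association_scores(corpus):
    """corpus: list of (english_tokens, foreign_tokens) sentence pairs.
    Returns the slide's score 2*C(e,f) / (C(e)*C(f)) for each word pair,
    where C counts the sentences containing the word(s)."""
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for es, fs in corpus:
        es, fs = set(es), set(fs)        # count each word once per sentence
        c_e.update(es)
        c_f.update(fs)
        c_ef.update(product(es, fs))
    return {(e, f): 2 * c / (c_e[e] * c_f[f]) for (e, f), c in c_ef.items()}

# e.g. scores[('that', 'वह')] == 2*4 / (6*5) == 0.266... on the slide's corpus
```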

(82)

Competitive linking algorithm

Iteratively align the highest-ranking word pair (i, j).

Advantage: indirect associations occur less often.

Link: house → घर

          वह     मेरा    घर     है
That      0.26   0.0    0.33   0.16
is        0.0    0.0    0.0    0.0
my        0.0    0.0    0.0    0.0
house     0.24   0.0    0.4    0.0

The Dice coefficient results in worse alignment quality than the statistical models.

(83)

Competitive linking algorithm

Iteratively align the highest-ranking word pair (i, j).

Advantage: indirect associations occur less often.

Next link: That → वह

          वह     मेरा    घर     है
That      0.26   0.0    0.33   0.16
is        0.0    0.0    0.0    0.0
my        0.0    0.0    0.0    0.0
house     0.24   0.0    0.4    0.0

The Dice coefficient results in worse alignment quality than the statistical models.

(84)

Competitive linking algorithm

Iteratively align the highest-ranking word pair (i, j).

Advantage: indirect associations occur less often.

Remaining links: my → मेरा, is → है

          वह     मेरा    घर     है
That      0.26   0.0    0.33   0.16
is        0.0    0.0    0.0    0.0
my        0.0    0.0    0.0    0.0
house     0.24   0.0    0.4    0.0

The Dice coefficient results in worse alignment quality than the statistical models.
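The greedy procedure the three slides step through, as a Python sketch; it operates on word types rather than positions (i, j), which suffices for this example, and the final my/is links on the slides come from pairing the leftover words:

```python
def competitive_linking(scores):
    """scores: dict mapping (e_word, f_word) -> association score.
    Repeatedly links the best-scoring pair whose words are both still free,
    as on the slides: first house->घर (0.4), then That->वह (0.26)."""
    links, used_e, used_f = [], set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s > 0 and e not in used_e and f not in used_f:
            links.append((e, f))
            used_e.add(e)
            used_f.add(f)
    return links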

(85)

Distortion Probabilities

Model 5 defines its first-word distortion probabilities over vacancies, where v_j is the number of vacancies in the English output interval [1, j].

(86)

हिंदी की किताब अलमारी में थी.
The Hindi book was in the cupboard.

                  Vacancies at English positions 1-7       Parameters of d1
f[i]     π_i,k    v1   v2   v3   v4   v5   v6   v7         j    v_j   v_max   v_⊙i−1
NULL     π0,1     1    2    3    4    5    6    7          1    -     -       -
हिंदी     π1,1     -    1    2    3    4    5    6          2    1     6       0
किताब    π2,1     -    -    1    2    3    4    5          3    1     5       0
अलमारी   π3,1     -    -    -    1    2    3    4          7    4     4       0
में       π4,1     -    -    -    1    2    -    -          5    2     2       3
में       π4,2     -    -    -    1    -    2    -          6    2     -       -
थी       π5,1     -    -    -    1    -    -    -          4    1     1       0

(87)

Model 5

(88)

Model 6

(89)

Motivation

HMM makes use of locality in the source language

Model 4 makes use of locality in the target language

How can we use both?

(90)

Formulation

Combine the HMM and Model 4 in a log-linear way (Och & Ney 2003), where α is the interpolation parameter.

(91)

General Formulation

A log-linear combination of several models pk(f, a | e), k = 1, ..., K, is given by:
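The combination as given in Och and Ney (2003); Model 6 is the special case K = 2 with components p4 and pHMM, the interpolation parameter α acting as a model weight:

```latex
p(f, a \mid e) \;=\; \frac{\prod_{k=1}^{K} p_k(f, a \mid e)^{\alpha_k}}
                          {\sum_{f', a'} \prod_{k=1}^{K} p_k(f', a' \mid e)^{\alpha_k}}
```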

(92)

Model     Alignment     Fertility   E-step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
HMM       First-order   No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
Model 4   First-order   Yes         Approximate   Yes
Model 5   First-order   Yes         Approximate   No
Model 6   First-order   Yes         Approximate   Yes

Comparison of Statistical Models

(93)

Summary

Model     Alignment     Description
Model 1   Uniform       uniform distribution 1/(l+1) over positions
Model 2   Zero-order    a(aj | j, m, l)
HMM       First-order   p(aj | aj−1, l), a function of the jump width
Model 3   Zero-order    d(j | aj, m, l)
Model 4   First-order   d1(j − ⊙i−1) and d>1(j − πi,k−1)
Model 5   First-order   vacancy-based distortion (non-deficient)
Model 6   First-order   log-linear combination of HMM and Model 4

(94)

Corpus

(95)

Verbmobil task

German-English speech translation task.

                              German    English
Training corpus  Sentences    34,446 (~34K)
                 Words        329,625   343,076
                 Vocabulary   5,936     3,505
                 Singletons   2,600     1,305
Bilingual dict.  Entries      4,404
                 Words        4,758     5,543
Test corpus      Sentences    354
                 Words        3,233     3,109

(96)

Hansards Task

The French-English Hansards task consists of the debates in the Canadian parliament.

                              French     English
Training corpus  Sentences    1,470K
                 Words        24.33M     22.16M
                 Vocabulary   100,269    78,332
                 Singletons   40,199     31,319
Bilingual dict.  Entries      28,701
                 Words        28,702     30,186
Test corpus      Sentences    500
                 Words        8,749      7,946

(97)

Comparison based on various metrics

(98)

Based on training schemes

(Verbmobil)

(99)

Based on training schemes

(Hansards)

(100)

Observations - 1

- Refined models (4, 5, 6) perform better at all corpus sizes.
- As the corpus size increases, the improvement becomes much more significant.
- Statistical models perform better than heuristic models, owing to the well-founded mathematical theory that underlies their parameter estimation.

(101)

Observations - 2

The hidden Markov alignment model achieves significantly better results than Model 2:
- it is a homogeneous first-order alignment model
- it better represents the locality and monotonicity properties of natural languages

Alignment quality depends heavily on the method used to bootstrap the model:
- bootstrapping with the HMM performs better than bootstrapping with Model 2

(102)

EM training

Models 1, 2, and the HMM have an exact E-step; Models 3, 4, 5, and 6 are approximate:
- they restrict the E-step to a smaller subset of alignments

Three methods to select the subset:

(103)

Viterbi alignment (Brown et al. 1993)

(104)

Viterbi + neighborhood (Al-Onaizan et al. 1999)

(105)

Pegged alignments (Brown et al. 1993)

(106)

Effect of more alignments

(Verbmobil)

(107)

Effect of more alignments

(Hansards)

(108)

Observations

- The effect of pegging strongly depends on the quality of the starting point.
- If Model 2 is used as the starting point, there is a significant improvement with the neighborhood alignments and the pegged alignments.
- Using only the Viterbi alignment gives significantly worse results than additionally using the neighborhood of the Viterbi alignment.
- With the HMM as the starting point, the effect is much smaller.
- Using more alignments in training is a way to avoid a poor local optimum.

(109)

Computational times

Using the pegged alignments yields only a moderate improvement in performance, with a considerable increase in computational time.

(110)

PART IV

Piyush Dungarwal, Mandar Joshi

(111)

Effect of Smoothing

Smoothing by linear interpolation.

For the alignment probability (interpolating with a uniform distribution):
p′(aj | aj−1, l) = (1 − α) · p(aj | aj−1, l) + α / l

For the fertility probability (interpolating with a general, word-independent fertility distribution n0):
n′(φ | e) = (1 − β) · n(φ | e) + β · n0(φ)

(112)

Results (Verbmobil task)

(113)

Results (Hansards task)

(114)

Direction

(115)

Direction

(116)

Symmetrization

Intersection: A = A1 ∩ A2
Union: A = A1 ∪ A2

Refined method:
- In a first step, the intersection A = A1 ∩ A2 is determined.
- A is then extended by adding alignments (i, j) occurring only in A1 or only in A2 if:
  - neither fj nor ei has an alignment in A, or
  - both of the following conditions hold:
    - (i, j) has a horizontal or vertical neighbour in A
    - A ∪ {(i, j)} does not contain alignments with both horizontal and vertical neighbours
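A Python sketch of the refined combination heuristic just described, with one simplification flagged in the comments: the "both horizontal and vertical neighbours" condition is checked for the candidate link itself rather than for every link of the extended set:

```python
def refined_symmetrization(a1, a2):
    """a1, a2: sets of links (i, j) from the two translation directions,
    both expressed in the same (English i, foreign j) coordinates."""
    A = set(a1) & set(a2)                      # step 1: the intersection
    aligned_e = {i for i, _ in A}
    aligned_f = {j for _, j in A}

    grew = True
    while grew:                                # step 2: grow A to a fixed point
        grew = False
        for (i, j) in sorted((a1 | a2) - A):
            horiz = (i, j - 1) in A or (i, j + 1) in A
            vert = (i - 1, j) in A or (i + 1, j) in A
            unaligned = i not in aligned_e and j not in aligned_f
            # add the link if both words are still unaligned, or if it has a
            # horizontal or vertical neighbour in A but not both (a simplified
            # reading of the condition on A U {(i, j)})
            if unaligned or ((horiz or vert) and not (horiz and vert)):
                A.add((i, j))
                aligned_e.add(i)
                aligned_f.add(j)
                grew = True
    return A
```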

(117)

[Figure: F→E and E→F alignment matrices over words e1-e4 and f1-f4, together with the resulting Intersection, Union, and Refined alignments]

(118)

Symmetrization

(119)

Symmetrization

(120)

Conclusion

- The paper compares heuristic and statistical alignment methods (IBM Models 1-5, HMM, Model 6).
- Statistical alignment models outperform the simple Dice coefficient.
- Model 6 outperforms all other models.
- Smoothing and symmetrization have a significant effect on the alignment quality achieved by a particular model.

(121)

Conclusion

- A bilingual dictionary and word classes have only a minor effect on alignment quality.
- In general, the important ingredients of a good model seem to be a first-order dependence between word positions and a fertility model.

(122)

References

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press. Chapter 4 (Word-Based Models).

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-Based Word Alignment in Statistical Translation. In COLING, pages 836-841.
