Decoding
Presented by Bibek Behera, Rucha Kulkarni, Govind Chintapalli
21st March 2013, 2nd April 2013, 4th April 2013, 8th April 2013
Outline
1. Translation Process
2. Decoding using Beam Search
• Hypothesis Expansion
• Hypothesis Recombination
• Stack Decoding
• Pruning
• Reordering Limits
3. Future Cost Estimation
4. Decoding using A*
5. Decoding Complexity
Statistical Machine Translation
Overview
[figure: SMT pipeline — a Hindi-English parallel corpus trains the Translation Model, an English monolingual corpus trains the Language Model; the Decoding Algorithm combines both to map an input source sentence to an output target sentence]
Decoding
● The task of decoding is to find the best scoring translation according to the mathematical formulae of the translation model.
● This is a hard problem.
● Heuristic Search methods are efficient but do not guarantee the best translation.
● Two types of errors may occur:
– Search Error: Failure to find the best translation
– Model Error: Highest probability translation may not be good.
Decoding Process
● Select the word to be translated.
● Get translation of selected word.
● Build output sentence in sequence, so input words may be selected out of sequence.
● Continue till all words of input are translated.
[figure: अब तुम घर जाओ translated word by word as "now you go home" — the output is built left to right while input words are picked out of sequence]
Sentence Translation Probability
● Purpose of decoding: to find the target sentence with maximum probability.
● Several components contribute to the overall score:
– the phrase translation probability φ
– the reordering model d
– the language model p_LM

$$e_{best} = \operatorname*{argmax}_{e} \prod_{i=1}^{I} \varphi(\bar{f}_i \mid \bar{e}_i)\, d(start_i - end_{i-1} - 1)\, p_{LM}(e)$$

where for all i = 1, ..., I the input phrases are $\bar{f}_i$ and the output phrases are $\bar{e}_i$.
Partial Scores
● During the translation process, we can compute a partial score for the partial translations we have constructed.
● So, instead of computing the translation probability for the whole translation only when it is completed, we can incrementally add to the score as we are constructing it.
● The three model components are:
– phrase translation
– Reordering
– language model
● So, whenever we add a new translated phrase to our partially constructed translation, we can already compute its model cost.
● We only need to know the identity of the phrase pair, its location in the input, and some information about the sequence we have constructed so far.
Partial Scores
● Example: अब तुम घर जाओ
[figure: translation lattice for the sentence, with candidate phrases for each word — now / right now; you / you have; go / go to / get / get the / to go; home / house / dwelling / mansion — annotated with phrase probabilities and accumulated partial scores]
Decoding using
Beam Search
Translation Options
● For any given source phrase or word, the system is confronted with a large number of options from the phrase translation table.
Approx. 9K choices given by the Gyan Nidhi phrase translation table.
● वह घर नहीं जाता
[figure: sample options from the phrase table — it / he / that house; house / home / homes / dwelling; no / not / did not go / come back / not visited / return; and larger phrases such as "No home", "Not visited my house"]
Translation Options
● There are many good translations in this collection, but the choices are far too many.
● The decoding problem: to find adequate phrase translations and place them together in a fluent English sentence.
● It is essentially a search problem.
Hypothesis Expansion
[figure: the empty hypothesis (p=1) expanding into the hypotheses "it" and "he"]
● Start with the initial empty hypothesis.
● Hypotheses are expanded by picking translation options (indicated by the lines).
● This results in new hypotheses in the search graph.
● Each hypothesis is pictured as a box containing the most recently added English word, a coverage vector of translated Hindi words, and a back pointer to its parent hypothesis.
Hypothesis Expansion
● This process continues until all hypotheses have been expanded.
[figure: the search graph growing from the empty hypothesis (p=1) through it / he, has / was, not, went home / home]
Hypothesis Expansion
● A hypothesis that covers all input words cannot be expanded any further and forms an end point in the search graph.
● Given all such completed hypotheses, the one with the highest probability score is the best path.
● Tracking back, the best scoring translation is obtained.
[figure: the completed path "... was not home" marked as an end point of the search graph]
Computational Complexity
● Machine translation decoding has been shown to be NP-complete [Kevin Knight, 1999].
● The size of the search space grows roughly exponentially in the length of the input sentence.
● POS tagging also has an exponential search space, yet it can be processed in linear time.
● Why can translation not be processed in polynomial time?
Computational Complexity
● The POS tagging problem is very similar to a translation problem. Consider this tagging problem:

^ He laughs loud $
with candidate tags: He → N, V; laughs → N, V; loud → A, V

The tagging problem can be solved by the Viterbi algorithm in linear time: the input is parsed left to right, and by the Markov assumption the tag of the present word depends only on the previous tag. So Viterbi solves the tagging problem in linear time (see the sketch below).
Computational Complexity
Consider this translation problem:

^ He laughs loud $
^ वह हंसता जोर से $
  यह मुस्कुराता कड़ी मेहनत से

But translation takes exponential time. What is the reason?
Translation cannot be parsed linearly: it faces the problem of reordering. We cannot consider only the next word; the entire sentence has to be considered.
Computational Complexity
^ He laughs loud $
[figure: alternative orderings of the candidate phrases — हंसता जोर से vs. जोर से हंसता, मुस्कुराता कड़ी मेहनत से vs. कड़ी मेहनत से मुस्कुराता — all starting from ^ वह]
Here we don't know how many of the next words we need to consider for translation; in POS tagging this was fixed to 1.
Translation complexity arises from such factors.
Computational Complexity
● Two ways to address the complexity problem.
➢ Reduce the number of hypotheses by hypothesis recombination.
➢ Pruning methods that estimate early which hypotheses to drop, at the risk of failing to find the best path.
Decoding using
Beam Search
Hypothesis Expansion
Hypothesis Recombination
Hypothesis Recombination
[figure: two paths reaching matching hypotheses — "it" → "has" and the single phrase "It has"]
● Two hypothesis paths lead to two matching hypotheses: they have the same number of foreign words translated and the same English words in the output, but different scores.
● In heuristic search they can be recombined, and the worse hypothesis is dropped, because the worse hypothesis can never be part of the best overall path.
Hypothesis Recombination
● It does not matter that the two hypothesis paths differ in their first English word.
● All subsequent phrase translation costs are independent of already generated English words.
● The language model is sensitive to only a few of the latest words in the output; a typical trigram language model uses the two latest words.
● In conclusion, from a subsequent-search point of view the two hypotheses are identical and can be recombined (a sketch of such a recombination key follows).
[figure: hypotheses "it" and "he", each expanded with "was not", merging into a single "was not" node in the search graph]
Some Conclusions
● Recombining hypotheses is very helpful in reducing spurious ambiguity in the search.
● The method avoids having to consider two hypothesis paths that differ only in internal representation, such as different phrase segmentations.
● Leads to a more efficient search.
● It does not solve the problem that machine translation decoding has exponential complexity.
● To address this problem, we have to employ riskier methods for search space reduction.
Decoding using
Beam Search
Hypothesis Expansion
Hypothesis Recombination
Stack Decoding
Stack Decoding
● If it appears early on that a hypothesis is bad, we may want to drop it and ignore all future expansions.
● But the hypothesis may redeem itself later and lead to the best overall translation.
For example:
He drinks strong tea.
strong → मजबूत or तेज
वह तेज may be less probable than वह मजबूत, but वह तेज चाय पीता है is a more appropriate translation than वह मजबूत चाय पीता है.
Stack Decoding
● Dropping hypotheses early leads to the problem of search error.
● In heuristic search, search errors may surface; they are reduced as much as possible but never eliminated.
● These search errors arise due to long-distance agreement and collocation. For example:
Eng: The tea that was served to me at my friend's birthday party was strong.
Hindi: चाय जो मुझे मेरे दोस्त के जन्मदिन की पार्टी पर परोसा गया था मजबूत था.
Stack Decoding
● To find out the best hypothesis, one needs to find its path to completion, but this is too expensive.
● Question: how can one guess which hypotheses are likely to be too bad to lead to the best translation?
● The exponential explosion of the search space forces us to take a set of hypotheses, compare them and drop the bad ones.
● We want to consider hypotheses that are at least comparable.
● So hypotheses are organized into hypothesis stacks based on the number of foreign words translated.
● If the stacks get too large, the worst hypotheses are pruned out.
Stack Decoding
[figure: hypothesis stacks s=0 to s=3, each hypothesis (it, he, has, was not, went home) placed in the stack matching the number of foreign words it covers]
Stack Decoding Heuristic
place empty hypothesis into stack 0
for all stacks 0 ... n − 1 do
  for all hypotheses in stack do
    for all translation options do
      if applicable then
        create new hypothesis
        place in stack
        recombine with existing hypothesis if possible
        prune stack if too big
      end if
    end for
  end for
end for
Stack Decoding Heuristic
● The exponential nature of machine translation decoding leads to very large numbers of hypotheses in the stacks.
● Since each stack contains comparable hypotheses (they cover the same number of foreign input words), we use pruning to drop bad hypotheses based on their partial score.
Histogram Pruning
● Keep a maximum number n of hypotheses in each stack.
● The stack size n has a direct relation to decoding speed.
● Under the assumption that all stacks are filled and all translation options are applicable all the time, the number of hypothesis expansions is equal to the maximum stack size times the number of translation options times the length of the input sentence.
● Computational time complexity:
O(max stack size × translation options × sentence length)
Histogram Pruning
● The number of translation options is linear in sentence length.
● Computational complexity:
O(max stack size × sentence length²)
● The cost of stack decoding is thus quadratic in sentence length, compared to the original exponential cost.
● This improvement comes, however, at the risk of not finding the best translation according to the model.
Threshold Pruning
● The difference between the best and worst hypothesis in a stack may be large or small.
● Histogram pruning is very inconsistent with respect to pruning out bad hypotheses: sometimes relatively bad hypotheses are allowed to survive, sometimes hypotheses that are only slightly worse than the best hypothesis are pruned out.
● Threshold pruning: if the score of a hypothesis is α times worse than the best hypothesis, it is pruned out.
● The threshold is like a beam of light that follows the best hypothesis path but, with a certain width, also illuminates the neighboring hypotheses; hence the name beam search.
● Computational complexity is roughly quadratic here too.
Pruning
● Today's MT decoders use both pruning methods.
● Threshold pruning has some nicer theoretical properties, while histogram pruning is more reliable in terms of computational cost.
● Keeping pruning in mind, we have added two more parameters (a sketch combining both follows):
– Stack size n
– Beam threshold α
Reordering
● Decoding seems similar to problems like POS tagging (input is mapped to output in a sequential word-by-word process).
● But, a distinct property of MT decoding makes it much harder - the input may be processed not linearly, but in any order.
● Reordering : A hard problem of MT from
– Computational point of view
– Modelling point of view
● Language pair French-English: Limited reordering suffices.
Language pair Hindi-English: Sentence undergoes a major surgery (Hindi is SOV, English is SVO).
Reordering
● Monotonic translation
– No reordering allowed
– Output is not correct
● All reorderings allowed
– Increases complexity
● So, we need to find a middle ground between the two.
Reordering Limits
● Reordering limit: when taking phrases out of sequence, a maximum of d words may be skipped.
● In practice, beyond a certain reordering limit d (typically 5–8 words), machine translation performance does not improve and may even deteriorate.
● Deterioration may reflect
– more search errors
– a weak reordering model that does not properly score bad large-scale reordering
(A sketch of the limit check follows.)
Complexity of Reordering
● With limited reordering, only a limited number of translation options is available at any step.
● The number of translation options available no longer grows with sentence length.
● So the complexity is now
O(max stack size × sentence length)
● With decoding speed linear in sentence length, we can comfortably translate even very long sentences.
Complete Example
● Input: सीता ने उसको चाकू से मारा
● s=0 holds the empty hypothesis; s=1 holds "Sita" (pruning removes the weaker alternatives).
● s=2: "Sita has", "Sita had". s=3: "Sita killed".
● s=4: "Sita him", "Sita knife" — recombination takes place: the paths through "has" and "had" merge at "him".
● s=5: "Sita to", "Sita from", "Sita from the", and further hypotheses such as "Sita knives", "Knives at the", "Kukris at the".
Entire graph
[figure: the search graph built up stack by stack over the input, with hypothesis nodes Sita, has, had, killed, him, to, from, from the, knives, Knives at, Knives at the, Kukris at, Kukris at the]
Future Cost Estimation
Other Decoding Algorithms
Beam Search using Coverage Stacks
A* Search
A* Search
● A*'s pruning of the search space is risk free.
● It places a constraint on the heuristic used to estimate the future cost.
● Admissible heuristic: the estimated cost is never an overestimate.
● Safe pruning of hypotheses: if partial score + estimated future cost for hypothesis A is worse than the score of the cheapest completed hypothesis path, then hypothesis A can be safely pruned.
Decoding using A*
● First, establish an early candidate for cheapest actual completed cost using a DFS.
– Expand a hypothesis all the way to completion by performing all possible hypothesis expansions of that hypothesis.
– Then proceed with the path of this hypothesis that has the cheapest cost.
● Then, apply A* search for decoding.
– For A* cost estimation: g = partial score of hypothesis, h = future cost estimate.
– Now, explore alternative hypothesis expansions using A* (a sketch follows).
– During the search, we discard those hypotheses whose cost estimate is worse than the score of the cheapest completed hypothesis.
Search Algorithm
[figure: search progress plotted as number of words covered against probability + heuristic estimate — (1) a depth-first expansion reaches a completed path, setting the cheapest score; (2) an alternative path leads to a hypothesis beyond the threshold and is pruned; newer, cheaper hypotheses tighten the cheapest-score threshold]
Example
[figure: DFS translating "Strong tea available" — options बलवान / मज़बूत / कड़क for "strong", चाय for "tea", उपलब्ध / प्राप्त for "available", with example costs −5, −7, −1, −0.75 along the explored paths]
Other Decoding Algorithms
Beam Search using Coverage Stacks
A* Search
Greedy Hill-Climbing Decoding
Decoding Complexity
Complexity Classes
● P: solvable in polynomial time with respect to input length.
● NP (Nondeterministic Polynomial time): solvable in polynomial time by a nondeterministic Turing machine; equivalently, a known solution can be verified in polynomial time.
● NP-Hard: a problem to which every problem in NP can be reduced in polynomial time; it is at least as hard as every problem in NP.
● NP-Complete: a problem is NP-complete if it is both in NP and NP-hard.
Two Problems
M1 Optimize
M1 Decide
M1 Optimize
● Given:
– sentence f to be translated
– the parameter tables for the language model (b), length probability (ε), and translation probability (s)
– len(f) = m
● Problem:
– Find a sentence e of length l ≤ 2m that maximizes P(e) · P(f|e).
M1 Decide
● Given:
– sentence f to be translated
– the parameter tables for the language model (b), length probability (ε), and translation probability (s)
– len(f) = m
– a real number k (a threshold probability value)
● Problem:
– Decide whether there exists a sentence e such that
● the length of e is l ≤ 2m, and
● P(e) · P(f|e) > k.
Relation Between the Two Problems
● Every optimization problem can be reformulated as a decision problem by introducing a bound on the score to be maximized or minimized.
● If a decision problem is hard, the corresponding optimization problem is also hard.
● To prove: decoding is an NP-complete problem.
● Method of proof:
– We give separate polynomial-time reductions of two known NP-complete problems to the translation problem.
– Each reduction highlights a different source of complexity.
Reduction 1
● Reducing the Hamilton Circuit Problem to the decoding problem.
● Assumptions:
– Lexical translation determinism: each source word has only a single translation option on the target side.
– The sentence to be translated and its translation in the target language have equal length.
– The bigram language model has a uniform probability distribution of 1/n over valid bigrams.
Example
● Selecting a good word order is like solving the Hamilton Circuit Problem, which is NP-complete.
● Under the assumptions of determinism and equal source and target sentence length, only the bigram source model is responsible for word ordering; hence the NP-completeness.
[figure: directed graph over the words Boundary, Weekend, is, just, around, the, corner, !]
Reduction 1: Hamilton Circuit Problem
● Hamilton Circuit Problem: given a directed graph G, does G have a path that visits each vertex exactly once and returns to the starting point?
● Proof: transform the Hamilton Circuit problem to M1 Decide.
– Step 1. Let f_1, f_2, ..., f_n be the foreign (source) word vocabulary; each word f_i is a vertex i in the graph G. Let e_0, e_1, e_2, ..., e_n be the English (target) word vocabulary; e_0 will be used as the boundary word for scoring.
– Step 2. Create the channel model: by the determinism assumption, s(f_i|e_i) = 1 for each i.
Reduction 1: Hamilton Circuit Problem
– ε(m|l) = 1 if l = m, 0 otherwise.
– This indicates that input and output must be of the same length.
● Step 3. Create the source model:
– b(e_j|e_i) = 1/n if graph G has an edge from i to j, 0 otherwise.
● Step 4. Set k = 0, indicating that we are looking for translations with non-zero probability.
Reduction 1: Hamilton Circuit Problem
● To solve this Hamilton Circuit problem, invoke M1 Decide.
● Solution: if M1 Decide returns YES:
– There exists a sentence e such that P(e) · P(f|e) > 0.
– So both P(f|e) and P(e) are non-zero.
– e = e_1, e_2, ..., e_n, and P(e) being non-zero implies that each bigram in the sequence has non-zero probability, including those with the boundary word.
– Each word of e corresponds to a vertex in G, so the ordering of words in e produces an ordering of vertices in G.
– This order of vertices in G is exactly the order in which source words are picked for translation.
– We append vertex 0 to the beginning and end of the ordering to complete the circuit.
Reduction 1: Hamilton Circuit Problem
● Solution: if M1 Decide returns NO:
– That implies that P(e) · P(f|e) = 0 for every e.
– So there is at least one zero value in the computation of P(e) or P(f|e).
– According to the channel model, P(f|e) = 1, so P(e) = 0.
– This can happen only if the circuit of the ordering is broken.
– So no Hamilton circuit exists.
Implications
● The results imply that:
– If the answer is YES: there exists a Hamilton circuit, and hence a good word order that can be applied to the source-side words.
– If the answer is NO: there is no Hamilton circuit, and hence no good word order that can be applied to the source-side words.
Example
● Selecting a good word order is like solving the Hamilton Circuit Problem, which is NP-complete.
● Under the assumptions of determinism and equal source and target sentence length, only the bigram source model is responsible for word ordering; hence the NP-completeness.
● The bigram model makes the graph directed, and a legal word order is like completing the Hamilton circuit.
[figure: directed graph over the words Boundary, Weekend, is, just, around, the, corner, !]
Corresponding M1 Optimize
● The M1 Optimize problem corresponding to the Hamilton Circuit problem is the Travelling Salesperson Problem.
● It introduces edge costs: we can view edge costs as negative log probabilities and cast the TSP as an M1 Optimize problem to find the least-cost path.
● The problem is then to find the best source word order by optimizing P(e).
Reduction 2
● Reducing the Minimum Set Cover Problem to the decoding problem.
● Assumptions:
– Neutral bigram language model: all permutations of a given length are possible; the bigram language model has a uniform probability distribution of 1/vocab_size.
– The sentence to be translated has length m, and its translation in the target language can have length at most n.
Minimum Set Cover Problem
● Let the Hindi sentence be: "सप्ताह का अंत आने ही वाला है।"
● We have to pick the minimum number of English words that cover the entire Hindi sentence. This is the Minimum Set Cover problem.
● How do we transform a set cover problem into a translation problem?
● Consider the sets as English words:
C = {weekend, week, of, end, around, coming, the, is, corner, approaching}
● We keep a neutral language (source) model, i.e. equal probability for all possible combinations:
b(e_i|e_j) = 1/(vocab size) = 1/9
Minimum Set Cover Problem
● For the channel model, we give equal probability to any Hindi word that falls in the set of an English word:
s(f_j|e_i) = 1/g_i, where g_i is the size of the set for e_i. Therefore s(सप्ताह | weekend) = 1/3.
● Another constraint we set is that the length l of the output translation must be at most n, the length of the Hindi sentence:
ε(m|l) = 1 if l ≤ n, 0 otherwise.
● This completes the reduction.
[figure: conversion of the set cover problem to a translation problem — the English words weekend, week, of, end, around, just, is, coming, approaching, corner linked to the Hindi words सप्ताह का अंत आने ही वाला है that they cover]
Conversion to the M1 Decide Problem
● Case 1:
– If M1 Decide returns YES, then some translation has P(e) · P(f|e) > 0.
– This implies the translation has n or fewer words, from the length constraint.
– Secondly, every word in the Hindi sentence is covered by at least one word in that translation.
– Therefore we conclude that there is a set cover of size ≤ n for S.
● Case 2:
– On the contrary, if M1 Decide returns NO, it means that all translations have P(e) · P(f|e) = 0.
– Either a translation does not meet the length criterion, or it does not cover some Hindi word.
– In both cases, we conclude that there is no set cover of size ≤ n for S.
(A sketch of this reduction follows.)
Points to note
● The two proofs point at two separate factors in decoding:
– word order selection → tied to the source model (language model)
– picking a set of target words to cover the source words → tied to the channel model (lexical translation probability)
● More complexity arises from the interaction of the two.
● The proofs are constructed on IBM Model 1.
Conclusion
● Decoding is complex because of
– the word order selection problem
– picking a set of target words to cover the source words
● In practical scenarios, it is solved using heuristic approaches like beam search and A* with future cost estimation.
References
● Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2010.
● Kevin Knight. Decoding Complexity in Word-Replacement Translation Models. Computational Linguistics, 25(4), 1999.