Decoding
Presented by Bibek Behera, Rucha Kulkarni, Govind Chintapalli
21st March 2013, 2nd April 2013, 4th April 2013, 8th April 2013
Outline
1. Translation Process
2. Decoding using Beam Search
• Hypothesis Expansion
• Hypothesis Recombination
• Stack Decoding
• Pruning
• Reordering Limits
3. Future Cost Estimation
4. Decoding using A*
5. Decoding Complexity
Statistical Machine Translation
Overview
[figure: SMT pipeline — a Hindi-English parallel corpus trains the Translation Model, an English monolingual corpus trains the Language Model; the Decoding Algorithm combines both to map an input source sentence to an output target sentence]
Decoding
● The task of decoding is to find the best scoring translation according to the mathematical formulae of the translation model.
● This is a hard problem.
● Heuristic Search methods are efficient but do not guarantee the best translation.
● Two types of errors may occur:
– Search Error: Failure to find the best translation
– Model Error: Highest probability translation may not be good.
Decoding Process
● Select the word to be translated.
● Get translation of selected word.
● Build output sentence in sequence, so input words may be selected out of sequence.
● Continue till all words of input are translated.
[figure: अब तुम घर जाओ translated word by word as "now you go home" — the output is built left to right while input words are picked out of sequence]
Sentence Translation Probability
● Purpose of decoding: to find the target sentence with maximum probability.
● Several components contribute to the overall score:
– the phrase translation probability φ
– the reordering model d
– the language model p_LM

$$e_{best} = \operatorname*{argmax}_{e} \prod_{i=1}^{I} \varphi(\bar{f}_i \mid \bar{e}_i)\, d(start_i - end_{i-1} - 1)\, p_{LM}(e)$$

where for all i = 1, ..., I the input phrases are $\bar{f}_i$ and the output phrases are $\bar{e}_i$.
Partial Scores
● During the translation process, we can compute a partial score for the partial translations we have constructed.
● So, instead of computing the translation probability for the whole translation only when it is completed, we can incrementally add to the score as we are constructing it.
● The three model components are:
– phrase translation
– Reordering
– language model
● So, whenever we add a new translated phrase to our partially constructed translation, we can already compute its model cost.
● We only need to know the identity of the phrase pair, its location in the input, and some information about the sequence we have constructed so far.
Partial Scores
● Example: अब तुम घर जाओ
[figure: translation lattice for the sentence, with candidate phrases for each word — now / right now; you / you have; go / go to / get / get the / to go; home / house / dwelling / mansion — annotated with phrase probabilities and accumulated partial scores]
Decoding using
Beam Search
Translation Options
● For any given source phrase or word, the system is confronted with a large number of options from the phrase translation table.
Approx. 9K choices given by the Gyan Nidhi phrase translation table.
● वह घर नहीं जाता
[figure: sample options from the phrase table — it / he / that house; house / home / homes / dwelling; no / not / did not go / come back / not visited / return; and larger phrases such as "No home", "Not visited my house"]
Translation Options
● There are many good translations in this collection, but the choices are far too many.
● The decoding problem: to find adequate phrase translations and place them together in a fluent English sentence.
● It is essentially a search problem.
Hypothesis Expansion
[figure: the empty hypothesis (p=1) expanding into the hypotheses "it" and "he"]
● Start with the initial empty hypothesis.
● Hypotheses are expanded by picking translation options (indicated by the lines).
● This results in new hypotheses in the search graph.
● Each hypothesis is pictured as a box containing the most recently added English word, a coverage vector of translated Hindi words, and a back pointer to its parent hypothesis.
Hypothesis Expansion
● This process continues until all hypotheses have been expanded.
[figure: the search graph growing from the empty hypothesis (p=1) through it / he, has / was, not, went home / home]
Hypothesis Expansion
● A hypothesis that covers all input words cannot be expanded any further and forms an end point in the search graph.
● Given all such completed hypotheses, the one with the highest probability score is the best path.
● Tracking back, the best scoring translation is obtained.
[figure: the completed path "... was not home" marked as an end point of the search graph]
Computational Complexity
● Machine translation decoding has been shown to be NP-complete [Kevin Knight, 1999].
● The size of the search space grows roughly exponentially in the length of the input sentence.
● POS tagging also has an exponential search space, yet it can be processed in linear time.
● Why can translation not be processed in polynomial time?
Computational Complexity
● The POS tagging problem is very similar to a translation problem. Consider this tagging problem:

^ He laughs loud $
with candidate tags: He → N, V; laughs → N, V; loud → A, V

The tagging problem can be solved by the Viterbi algorithm in linear time: the input is parsed left to right, and by the Markov assumption the tag of the present word depends only on the previous tag. So Viterbi solves the tagging problem in linear time (see the sketch below).
Computational Complexity
Consider this translation problem:

^ He laughs loud $
^ वह हंसता जोर से $
  यह मुस्कुराता कड़ी मेहनत से

But translation takes exponential time. What is the reason?
Translation cannot be parsed linearly: it faces the problem of reordering. We cannot consider only the next word; the entire sentence has to be considered.
Computational Complexity
^ He laughs loud $
[figure: alternative orderings of the candidate phrases — हंसता जोर से vs. जोर से हंसता, मुस्कुराता कड़ी मेहनत से vs. कड़ी मेहनत से मुस्कुराता — all starting from ^ वह]
Here we don't know how many of the next words we need to consider for translation; in POS tagging this was fixed to 1.
Translation complexity arises from such factors.
Computational Complexity
● Two ways to address the complexity problem.
➢ Reduce the number of hypotheses by hypothesis recombination.
➢ Pruning methods that estimate early which hypotheses to drop, at the risk of failing to find the best path.
Decoding using
Beam Search
Hypothesis Expansion
Hypothesis Recombination
Hypothesis Recombination
[figure: two paths reaching matching hypotheses — "it" → "has" and the single phrase "It has"]
● Two hypothesis paths lead to two matching hypotheses: they have the same number of foreign words translated and the same English words in the output, but different scores.
● In heuristic search they can be recombined, and the worse hypothesis is dropped, because the worse hypothesis can never be part of the best overall path.
Hypothesis Recombination
● It does not matter that the two hypothesis paths differ in their first English word.
● All subsequent phrase translation costs are independent of already generated English words.
● The language model is sensitive to only a few of the latest words in the output; a typical trigram language model uses the two latest words.
● In conclusion, from a subsequent-search point of view the two hypotheses are identical and can be recombined (a sketch of such a recombination key follows).
[figure: hypotheses "it" and "he", each expanded with "was not", merging into a single "was not" node in the search graph]
Some Conclusions
● Recombining hypotheses is very helpful in reducing spurious ambiguity in the search.
● The method avoids having to consider two hypothesis paths that differ only in internal representation, such as different phrase segmentations.
● Leads to a more efficient search.
● It does not solve the problem that machine translation decoding has exponential complexity.
● To address this problem, we have to employ riskier methods for search space reduction.
Decoding using
Beam Search
Hypothesis Expansion
Hypothesis Recombination
Stack Decoding
Stack Decoding
● If it appears early on that a hypothesis is bad, we may want to drop it and ignore all future expansions.
● But the hypothesis may redeem itself later and lead to the best overall translation.
For example:
He drinks strong tea.
strong → मजबूत or तेज
वह तेज may be less probable than वह मजबूत, but वह तेज चाय पीता है is a more appropriate translation than वह मजबूत चाय पीता है.
Stack Decoding
● Dropping hypotheses early leads to the problem of search error.
● In heuristic search, search errors may surface; they are reduced as much as possible but never eliminated.
● These search errors arise due to long-distance agreement and collocation. For example:
Eng: The tea that was served to me at my friend's birthday party was strong.
Hindi: चाय जो मुझे मेरे दोस्त के जन्मदिन की पार्टी पर परोसा गया था मजबूत था.
Stack Decoding
● To find out the best hypothesis, one needs to find its path to completion, but this is too expensive.
● Question: how can one guess which hypotheses are likely to be too bad to lead to the best translation?
● The exponential explosion of the search space forces us to take a set of hypotheses, compare them and drop the bad ones.
● We want to consider hypotheses that are at least comparable.
● So hypotheses are organized into hypothesis stacks based on the number of foreign words translated.
● If the stacks get too large, the worst hypotheses are pruned out.
Stack Decoding
[figure: hypothesis stacks s=0 to s=3, each hypothesis (it, he, has, was not, went home) placed in the stack matching the number of foreign words it covers]
Stack Decoding Heuristic
place empty hypothesis into stack 0
for all stacks 0 ... n − 1 do
  for all hypotheses in stack do
    for all translation options do
      if applicable then
        create new hypothesis
        place in stack
        recombine with existing hypothesis if possible
        prune stack if too big
      end if
    end for
  end for
end for
Stack Decoding Heuristic
● The exponential nature of machine translation decoding leads to very large numbers of hypotheses in the stacks.
● Since each stack contains comparable hypotheses (they cover the same number of foreign input words), we use pruning to drop bad hypotheses based on their partial score.
Histogram Pruning
● Keep a maximum number n of hypotheses in each stack.
● The stack size n has a direct relation to decoding speed.
● Under the assumption that all stacks are filled and all translation options are applicable all the time, the number of hypothesis expansions is equal to the maximum stack size times the number of translation options times the length of the input sentence.
● Computational time complexity:
O(max stack size × translation options × sentence length)
Histogram Pruning
● The number of translation options is linear in sentence length.
● Computational complexity:
O(max stack size × sentence length²)
● The cost of stack decoding is thus quadratic in sentence length, compared to the original exponential cost.
● This improvement comes, however, at the risk of not finding the best translation according to the model.
Threshold Pruning
● The difference between the best and worst hypothesis in a stack may be large or small.
● Histogram pruning is very inconsistent with respect to pruning out bad hypotheses: sometimes relatively bad hypotheses are allowed to survive, sometimes hypotheses that are only slightly worse than the best hypothesis are pruned out.
● Threshold pruning: if the score of a hypothesis is α times worse than the best hypothesis, it is pruned out.
● The threshold is like a beam of light that follows the best hypothesis path but, with a certain width, also illuminates the neighboring hypotheses; hence the name beam search.
● Computational complexity is roughly quadratic here too.
Pruning
● Today's MT decoders use both pruning methods.
● Threshold pruning has some nicer theoretical properties, while histogram pruning is more reliable in terms of computational cost.
● Keeping pruning in mind, we have added two more parameters (a sketch combining both follows):
– Stack size n
– Beam threshold α
Reordering
● Decoding seems similar to problems like POS tagging (input is mapped to output in a sequential word-by-word process).
● But, a distinct property of MT decoding makes it much harder - the input may be processed not linearly, but in any order.
● Reordering : A hard problem of MT from
– Computational point of view
– Modelling point of view
● Language pair French-English: Limited reordering suffices.
Language pair Hindi-English: Sentence undergoes a major surgery (Hindi is SOV, English is SVO).
Reordering
● Monotonic translation
– No reordering allowed
– Output is not correct
● All reorderings allowed
– Increases complexity
● So, we need to find a middle ground between the two.
Reordering Limits
● Reordering limit: when taking phrases out of sequence, a maximum of d words may be skipped.
● In practice, beyond a certain reordering limit d (typically 5–8 words), machine translation performance does not improve and may even deteriorate.
● Deterioration may reflect
– more search errors
– a weak reordering model that does not properly score bad large-scale reordering
(A sketch of the limit check follows.)
Complexity of Reordering
● With limited reordering, only a limited number of translation options is available at any step.
● The number of translation options available no longer grows with sentence length.
● So the complexity is now
O(max stack size × sentence length)
● With decoding speed linear in sentence length, we can comfortably translate even very long sentences.
Complete Example
● Input: सीता ने उसको चाकू से मारा
● s=0 holds the empty hypothesis; s=1 holds "Sita" (pruning removes the weaker alternatives).
● s=2: "Sita has", "Sita had". s=3: "Sita killed".
● s=4: "Sita him", "Sita knife" — recombination takes place: the paths through "has" and "had" merge at "him".
● s=5: "Sita to", "Sita from", "Sita from the", and further hypotheses such as "Sita knives", "Knives at the", "Kukris at the".
Entire graph
[figure: the search graph built up stack by stack over the input, with hypothesis nodes Sita, has, had, killed, him, to, from, from the, knives, Knives at, Knives at the, Kukris at, Kukris at the]
Future Cost Estimation
Other Decoding Algorithms
Beam Search using Coverage Stacks
A* Search
A* Search
● A*'s pruning of the search space is risk free.
● It places a constraint on the heuristic used to estimate the future cost.
● Admissible heuristic: the estimated cost is never an overestimate.
● Safe pruning of hypotheses: if partial score + estimated future cost for hypothesis A is worse than the score of the cheapest completed hypothesis path, then hypothesis A can be safely pruned.
Decoding using A*
● First, establish an early candidate for cheapest actual completed cost using a DFS.
– Expand a hypothesis all the way to completion by performing all possible hypothesis expansions of that hypothesis.
– Then proceed with the path of this hypothesis that has the cheapest cost.
● Then, apply A* search for decoding.
– For A* cost estimation: g = partial score of hypothesis, h = future cost estimate.
– Now, explore alternative hypothesis expansions using A* (a sketch follows).
– During the search, we discard those hypotheses whose cost estimate is worse than the score of the cheapest completed hypothesis.
Search Algorithm
[figure: search progress plotted as number of words covered against probability + heuristic estimate — (1) a depth-first expansion reaches a completed path, setting the cheapest score; (2) an alternative path leads to a hypothesis beyond the threshold and is pruned; newer, cheaper hypotheses tighten the cheapest-score threshold]
Example
[figure: DFS translating "Strong tea available" — options बलवान / मज़बूत / कड़क for "strong", चाय for "tea", उपलब्ध / प्राप्त for "available", with example costs −5, −7, −1, −0.75 along the explored paths]
Other Decoding Algorithms
Beam Search using Coverage Stacks
A* Search
Greedy Hill-Climbing Decoding
Decoding Complexity
Complexity Classes
● P: solvable in polynomial time with respect to input length.
● NP (Nondeterministic Polynomial time): solvable in polynomial time by a nondeterministic Turing machine; equivalently, a known solution can be verified in polynomial time.
● NP-Hard: a problem to which every problem in NP can be reduced in polynomial time; it is at least as hard as every problem in NP.
● NP-Complete: a problem is NP-complete if it is both in NP and NP-hard.
Two Problems
M1 Optimize
M1 Decide
M1 Optimize
● Given:
– sentence f to be translated
– the parameter tables for the language model (b), length probability (ε), and translation probability (s)
– len(f) = m
● Problem:
– Find a sentence e of length l ≤ 2m that maximizes P(e) · P(f|e).
M1 Decide
● Given:
– sentence f to be translated
– the parameter tables for the language model (b), length probability (ε), and translation probability (s)
– len(f) = m
– a real number k (a threshold probability value)
● Problem:
– Decide whether there exists a sentence e such that
● the length of e is l ≤ 2m, and
● P(e) · P(f|e) > k.
Relation Between the Two Problems
● Every optimization problem can be reformulated as a decision problem by introducing a bound on the score to be maximized or minimized.
● If a decision problem is hard, the corresponding optimization problem is also hard.
● To prove: decoding is an NP-complete problem.
● Method of proof:
– We give separate polynomial-time reductions of two known NP-complete problems to the translation problem.
– Each reduction highlights a different source of complexity.
Reduction 1
● Reducing the Hamilton Circuit Problem to the decoding problem.
● Assumptions:
– Lexical translation determinism: each source word has only a single translation option on the target side.
– The sentence to be translated and its translation in the target language have equal length.
– The bigram language model has a uniform probability distribution of 1/n over valid bigrams.
Example
● Selecting a good word order is like solving the Hamilton Circuit Problem, which is NP-complete.
● Under the assumptions of determinism and equal source and target sentence length, only the bigram source model is responsible for word ordering; hence the NP-completeness.
[figure: directed graph over the words Boundary, Weekend, is, just, around, the, corner, !]
Reduction 1: Hamilton Circuit Problem
● Hamilton Circuit Problem: given a directed graph G, does G have a path that visits each vertex exactly once and returns to the starting point?
● Proof: transform the Hamilton Circuit problem to M1 Decide.
– Step 1. Let f_1, f_2, ..., f_n be the foreign (source) word vocabulary; each word f_i is a vertex i in the graph G. Let e_0, e_1, e_2, ..., e_n be the English (target) word vocabulary; e_0 will be used as the boundary word for scoring.
– Step 2. Create the channel model: by the determinism assumption, s(f_i|e_i) = 1 for each i.
Reduction 1: Hamilton Circuit Problem
– ε(m|l) = 1 if l = m, 0 otherwise.
– This indicates that input and output must be of the same length.
● Step 3. Create the source model:
– b(e_j|e_i) = 1/n if graph G has an edge from i to j, 0 otherwise.
● Step 4. Set k = 0, indicating that we are looking for translations with non-zero probability.
Reduction 1: Hamilton Circuit Problem
● To solve this Hamilton Circuit problem, invoke M1 Decide.
● Solution: if M1 Decide returns YES:
– There exists a sentence e such that P(e) · P(f|e) > 0.
– So both P(f|e) and P(e) are non-zero.
– e = e_1, e_2, ..., e_n, and P(e) being non-zero implies that each bigram in the sequence has non-zero probability, including those with the boundary word.
– Each word of e corresponds to a vertex in G, so the ordering of words in e produces an ordering of vertices in G.
– This order of vertices in G is exactly the order in which source words are picked for translation.
– We append vertex 0 to the beginning and end of the ordering to complete the circuit.
Reduction 1: Hamilton Circuit Problem
● Solution: if M1 Decide returns NO:
– That implies that P(e) · P(f|e) = 0 for every e.
– So there is at least one zero value in the computation of P(e) or P(f|e).
– According to the channel model, P(f|e) = 1, so P(e) = 0.
– This can happen only if the circuit of the ordering is broken.
– So no Hamilton circuit exists.
Implications
● The results imply that:
– If the answer is YES: there exists a Hamilton circuit, and hence a good word order that can be applied to the source-side words.
– If the answer is NO: there is no Hamilton circuit, and hence no good word order that can be applied to the source-side words.
Example
● Selecting a good word order is like solving the Hamilton Circuit Problem, which is NP-complete.
● Under the assumptions of determinism and equal source and target sentence length, only the bigram source model is responsible for word ordering; hence the NP-completeness.
● The bigram model makes the graph directed, and a legal word order is like completing the Hamilton circuit.
[figure: directed graph over the words Boundary, Weekend, is, just, around, the, corner, !]
Corresponding M1 Optimize
● The M1 Optimize problem corresponding to the Hamilton Circuit problem is the Travelling Salesperson Problem.
● It introduces edge costs: we can view edge costs as negative log probabilities and cast the TSP as an M1 Optimize problem to find the least-cost path.
● The problem is then to find the best source word order by optimizing P(e).
Reduction 2
● Reducing the Minimum Set Cover Problem to the decoding problem.
● Assumptions:
– Neutral bigram language model: all permutations of a given length are possible; the bigram language model has a uniform probability distribution of 1/vocab_size.
– The sentence to be translated has length m, and its translation in the target language can have length at most n.
Minimum Set Cover Problem
● Let the Hindi sentence be: "सप्ताह का अंत आने ही वाला है।"
● We have to pick the minimum number of English words that cover the entire Hindi sentence. This is the Minimum Set Cover problem.
● How do we transform a set cover problem into a translation problem?
● Consider the sets as English words:
C = {weekend, week, of, end, around, coming, the, is, corner, approaching}
● We keep a neutral language (source) model, i.e. equal probability for all possible combinations:
b(e_i|e_j) = 1/(vocab size) = 1/9
Minimum Set Cover Problem
● For the channel model, we give equal probability to any Hindi word that falls in the set of an English word:
s(f_j|e_i) = 1/g_i, where g_i is the size of the set for e_i. Therefore s(सप्ताह | weekend) = 1/3.
● Another constraint we set is that the length l of the output translation must be at most n, the length of the Hindi sentence:
ε(m|l) = 1 if l ≤ n, 0 otherwise.
● This completes the reduction.
[figure: conversion of the set cover problem to a translation problem — the English words weekend, week, of, end, around, just, is, coming, approaching, corner linked to the Hindi words सप्ताह का अंत आने ही वाला है that they cover]
Conversion to the M1 Decide Problem
● Case 1:
– If M1 Decide returns YES, then some translation has P(e) · P(f|e) > 0.
– This implies the translation has n or fewer words, from the length constraint.
– Secondly, every word in the Hindi sentence is covered by at least one word in that translation.
– Therefore we conclude that there is a set cover of size ≤ n for S.
● Case 2:
– On the contrary, if M1 Decide returns NO, it means that all translations have P(e) · P(f|e) = 0.
– Either a translation does not meet the length criterion, or it does not cover some Hindi word.
– In both cases, we conclude that there is no set cover of size ≤ n for S.
(A sketch of this reduction follows.)
Points to note
● The two proofs point at two separate factors in decoding:
– word order selection → tied to the source model (language model)
– picking a set of target words to cover the source words → tied to the channel model (lexical translation probability)
● More complexity arises from the interaction of the two.
● The proofs are constructed on IBM Model 1.
Conclusion
● Decoding is complex because of
– the word order selection problem
– picking a set of target words to cover the source words
● In practical scenarios, it is solved using heuristic approaches like beam search and A* with future cost estimation.
References
● Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2010.
● Kevin Knight. Decoding Complexity in Word-Replacement Translation Models. Computational Linguistics, 25(4), 1999.