Machine Translation
Om Damani
(Ack: Material taken from Jurafsky & Martin, 2nd Ed., and Brown et al. 1993)
2
The spirit is willing but the flesh is weak
English-Russian Translation System
Дух охотно готов но плоть слаба
Russian-English Translation System
The vodka is good, but the meat is rotten
State of the Art
Babelfish: Spirit is willingly ready but flesh it is weak
Google: The spirit is willing but the flesh is week
3
The spirit is willing but the flesh is weak
Google English-Hindi Translation System
आत्मा पर शरीर दुर्बल है
Google Hindi-English Translation System
Spirit on the flesh is weak
State of the Art (English-Hindi), March 19, 2009
4
Is state of the art so bad
Google English-Hindi Translation System
कला की हालत इतनी खराब है
Google Hindi-English Translation System
The state of the art is so bad
Is State of the Art (English-Hindi) so bad?
5
State of the english hindi translation is not so bad
Google English-Hindi Translation System
राज्य के अंग्रेज़ी हिंदी अनुवाद का इतना बुरा नहीं है
Google Hindi-English Translation System
State of the English translation of English is not so bad
State of the english-hindi translation is not so bad
OK. Maybe it is __ bad.
6
State of the English Hindi translation is not so bad
Google English-Hindi Translation System
राज्य में अंग्रेज़ी से हिंदी अनुवाद का इतना बुरा नहीं है
Google Hindi-English Translation System
English to Hindi translation in the state is not so bad
State of the English-Hindi translation is not so bad
OK. Maybe it is __ __ bad.
राज्य के अंग्रेज़ी हिंदी अनुवाद का इतना बुरा नहीं है
7
Your Approach to Machine Translation
8
Translation Approaches
9
Direct Transfer – What Novices do
10
Direct Transfer: Limitations
कई बंगाली कवियों ने इस भूमि के गीत गाए हैं
Kai Bangali kaviyon ne is bhoomi ke geet gaaye hain
Morph:
कई बंगाली कवि-PL,OBL ने इस भूमि के गीत {गाए है}-PrPer,Pl
Kai Bangali kavi-PL,OBL ne is bhoomi ke geet {gaaye hai}-PrPer,Pl
Lexical Transfer: Many Bengali poet-PL,OBL this land of songs {sing has}-PrPer,Pl
Local Reordering: Many Bengali poet-PL,OBL of this land songs {has sing}-PrPer,Pl
Final: Many Bengali poets of this land songs have sung
Correct translation: Many Bengali poets have sung songs of this land
11
Syntax Transfer
(Analysis-Transfer-Generation)
Here phrases NP, VP etc. can be arbitrarily large
12
Syntax Transfer Limitations
He went to Patna -> Vah Patna gaya
He went to Patil -> Vah Patil ke pas gaya
Translation of "went" depends on the semantics of the object of "went"
Fatima eats salad with spoon: what happens if you change "spoon"?
Semantic properties need to be included in transfer rules: Semantic Transfer
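A minimal sketch of what a semantic transfer rule for the "went to X" example could look like. The semantic-class lexicon and the function name are hypothetical, introduced only for illustration; a real system would draw these properties from a much richer lexicon.

```python
# Hypothetical semantic lexicon: maps a noun to a coarse semantic class.
SEMANTIC_CLASS = {"Patna": "place", "Patil": "person"}


def transfer_went_to(subject_hi, obj):
    """Translate 'X went to OBJ', choosing the Hindi form by OBJ's semantic class."""
    semantic_class = SEMANTIC_CLASS.get(obj)
    if semantic_class == "place":
        # Going to a place: no postposition is needed.
        return f"{subject_hi} {obj} gaya"
    if semantic_class == "person":
        # Going to a person: use 'ke pas' (near / to the person).
        return f"{subject_hi} {obj} ke pas gaya"
    raise ValueError(f"no semantic class known for {obj!r}")


print(transfer_went_to("Vah", "Patna"))
print(transfer_went_to("Vah", "Patil"))
```

A purely syntactic rule sees the same parse for both sentences, so it cannot choose between the two outputs; the lookup on the object's class is what makes this a semantic rule.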
13
Interlingua Based Transfer
[Figure: interlingua (semantic graph) for the sentence below. Nodes: contact, you, this, farmer, region, taluka, manchar, khatav, or. Relation labels: agt, obj, pur, plc, nam.]
For this, you contact the farmers of Manchar region or of Khatav taluka.
In theory: N analysis and N transfer modules instead of N^2
In practice: Amazingly complex system to tackle N^2 language pairs
14
Difficulties in Translation – Language Divergence
(Concepts from Dorr 1993, Text/Figures from Dave, Parikh and Bhattacharyya 2002)
Constituent Order
Prepositional Stranding
Null Subject
Conflational Divergence
Categorical Divergence
15
Lost in Translation: We are talking mostly about syntax, not semantics, or pragmatics
You: Could you give me a glass of water
Robot: Yes.
….wait..wait..nothing happens..wait…
…Aha, I see…
You: Will you give me a glass of water
…wait…wait..wait..
Image from http://inicia.es/de/rogeribars/blog/lost_in_translation.gif
16
CheckPoint
State of the Art
Different Approaches
Translation Difficulty
Need for a novel approach
17
Statistical Machine Translation: Most ridiculous idea ever
Consider all possible partitions of a sentence.
For a given partition,
Consider all possible translations of each part.
Consider all possible combinations of all possible translations
Consider all possible permutations of each combination
And somehow select the best partition/translation/permutation
कई बंगाली कवियों ने इस भूमि के गीत गाए हैं
Kai Bangali kaviyon ne is bhoomi ke geet gaaye hain
[Figure: candidate translations for each source phrase]
कई बंगाली कवियों: Many Bengali Poets | Several Bengali | Many poets from Bangal | Poets from Bangladesh
ने इस: this | to this | in this
भूमि: land | place | space | farm
के: of | 's
गीत गाए हैं: have sung poem | sing songs | song sung | have sung songs
To this space have sung songs of many poets from Bangal
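The first step, "consider all possible partitions of a sentence", can be sketched as follows. This is only an illustration and assumes that phrases are contiguous, non-empty word spans; the function name is made up for this example.

```python
from itertools import combinations


def partitions(words):
    """Yield every split of `words` into contiguous, non-empty phrases."""
    n = len(words)
    for num_cuts in range(n):
        # Choose where to cut among the n-1 gaps between adjacent words.
        for cuts in combinations(range(1, n), num_cuts):
            bounds = (0, *cuts, n)
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


sentence = "Kai Bangali kaviyon ne".split()
for phrases in partitions(sentence):
    print(phrases)
```

Each of the n-1 gaps is either cut or not, so an n-word sentence has 2^(n-1) such partitions, before any translation or reordering choices are made.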
18
How many combinations are we talking about
Number of choices for an N-word sentence
N=20 ??
Number of possible chess games
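A rough way to count the choices, under assumptions that are not from the slides: phrases are contiguous, every phrase has the same number of candidate translations (the parameter `translations_per_phrase` is hypothetical), and the translated phrases may appear in any order.

```python
from math import comb, factorial


def count_choices(n_words, translations_per_phrase):
    """Count partition/translation/permutation choices for an n-word sentence."""
    total = 0
    for k in range(1, n_words + 1):
        # Ways to split n words into k contiguous phrases.
        num_partitions = comb(n_words - 1, k - 1)
        # Each of the k phrases picks a translation; the k results are then ordered.
        total += num_partitions * translations_per_phrase ** k * factorial(k)
    return total


print(count_choices(20, 5))
```

The factorial term dominates quickly, which is why exhaustive enumeration is hopeless even for short sentences.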
19
How do we get the Phrase Table
Collect large amount of bi-lingual parallel text.
For each sentence pair,
Consider all possible partitions of both sentences
For a given partition pair,
Consider all possible mappings between parts (phrases) on the two sides
Somehow assign a probability to each phrase pair
इसके लिए आप मंचर क्षेत्र के किसानों से संपर्क कीजिए
For this you contact the farmers of Manchar region
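The slides leave the "somehow" open. One simple possibility, assumed here purely for illustration, is relative frequency: count how often each phrase pair is observed and normalize by the count of the source phrase. The observed pairs below are invented toy data in transliteration, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def build_phrase_table(phrase_pairs):
    """Estimate P(f | e) by relative frequency over observed (e, f) phrase pairs."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(e for e, _ in phrase_pairs)
    table = defaultdict(dict)
    for (e, f), count in pair_counts.items():
        table[e][f] = count / source_counts[e]
    return table


# Toy observations (invented): English phrase paired with a Hindi phrase.
observed = [
    ("the farmers", "kisanon"),
    ("the farmers", "kisanon"),
    ("the farmers", "kisanon"),
    ("the farmers", "kisan"),
    ("for this", "iske liye"),
]
table = build_phrase_table(observed)
print(table["the farmers"])
```

In this sketch every observed pairing counts equally; it says nothing about how the pairings were found in the first place.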
20
Formulating the Problem
A language model to compute P(E)
A translation model to compute P(F|E)
A decoder, which is given F and produces the most probable E
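These three components fit together through Bayes' rule. The derivation below is the standard noisy-channel formulation; the hat notation for the chosen translation is an assumption of this note, not something shown on the slide.

```latex
\hat{E} = \arg\max_{E} P(E \mid F)
        = \arg\max_{E} \frac{P(F \mid E)\, P(E)}{P(F)}
        = \arg\max_{E} P(F \mid E)\, P(E)
```

P(F) can be dropped because F is fixed while the decoder searches over E.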
21
P(F|E) vs. P(E|F)
P(F|E) is the translation probability: we need to look at the generation process by which the <F,E> pair is obtained.
Parts of F correspond to parts of E. With suitable independence assumptions, P(F|E) measures whether all parts of E are covered by F.
E can be quite ill-formed.
It is OK if P(F|E) for an ill-formed E is greater than P(F|E) for a well-formed E. Multiplication by P(E) should take care of it.
We do not have that luxury when estimating P(E|F) directly: we would need to ensure that well-formed E score higher.
Summary: For computing P(F|E), we may make several independence assumptions that are not valid. P(E) compensates for that.
P(बारिश हो रही है | It is raining) = .02
P(बरसात आ रही है | It is raining) = .03
P(बारिश हो रही है | rain is happening) = .420
We need to estimate P(It is raining | बारिश हो रही है) vs. P(rain is happening | बारिश हो रही है)
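A small sketch of how P(E) rescues the well-formed candidate. The translation-model values are the ones on the slide; the language-model values are invented for illustration and only need to reflect that "It is raining" is far more fluent than "rain is happening".

```python
# P(F | E) for F = "बारिश हो रही है", taken from the slide.
translation_model = {
    "It is raining": 0.02,
    "rain is happening": 0.420,
}

# P(E): assumed values, not from the slides.
language_model = {
    "It is raining": 1e-4,
    "rain is happening": 1e-7,
}


def score(e):
    """Noisy-channel score P(F | E) * P(E) for a candidate E."""
    return translation_model[e] * language_model[e]


best_by_tm_only = max(translation_model, key=translation_model.get)
best = max(translation_model, key=score)
print("translation model alone prefers:", best_by_tm_only)
print("translation model x language model prefers:", best)
```

The translation model alone prefers the ill-formed candidate; multiplying by P(E) reverses the ranking.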
22
23
CheckPoint
From a parallel corpus, generate a probabilistic phrase table
Given a sentence, generate various candidate translations using the phrase table
Evaluate the candidates using Translation and Language Models
24
What is the meaning of Probability of Translation
What is the meaning of P(F|E)
By Magic: you simply know P(F|E) for every (E,F) pair, by counting in a parallel corpus
Or, we need a ‘random process’ to generate F from E
A semantic graph G is generated from E and F is generated from G
We are no better off. We now have to estimate P(G|E) and P(F|G) for various G and then combine them – How?
We may have a deterministic procedure to convert E to G, in which case we still need to estimate P(F|G)
A parse tree T_E is generated from E; T_E is transformed to T_F; finally T_F is converted into F
Can you write the mathematical expression
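One possible answer, offered as a sketch: treat the intermediate structure as a hidden variable and sum it out. The conditional-independence assumptions (F depends on E only through G, or only through the chain of trees) are assumptions made here, not statements from the slides.

```latex
% Semantic graph G as the intermediate representation
P(F \mid E) = \sum_{G} P(G \mid E)\, P(F \mid G)

% Deterministic analysis G = g(E): the sum collapses to a single term
P(F \mid E) = P\bigl(F \mid g(E)\bigr)

% Parse trees T_E and T_F as intermediate representations
P(F \mid E) = \sum_{T_E} \sum_{T_F} P(T_E \mid E)\, P(T_F \mid T_E)\, P(F \mid T_F)
```

Each factor still has to be estimated from data, which is the difficulty the slide points out.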
25
The Generation Process
Partition: Think of all possible partitions of the source language
Lexicalization: For a given partition, translate each phrase into the foreign language
Spurious insertion: add foreign words that are not attributable to any source phrase
Reordering: permute the set of all foreign words - words possibly moving across phrase boundaries
Try writing the probability expression for the generation process
We need the notion of alignment
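With alignments, the expression can at least be written down in outline, following Brown et al. 1993: the alignment A is a hidden variable that records which source words produced which foreign words, and it is summed out. How P(F, A | E) itself factors into partition, lexicalization, insertion and reordering terms depends on the model and is not spelled out here.

```latex
P(F \mid E) = \sum_{A} P(F, A \mid E)
```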
26
Generation Example: Alignment
27
Simplify Generation: Only 1->Many Alignments allowed
28
Alignment: Key Concept
A function from target position to source position:
The alignment sequence is: 2,3,4,5,6,6,6
Alignment function A: A(1) = 2, A(2) = 3, ...
A different alignment function will give the sequence: 1,2,1,2,3,4,3,4 for A(1), A(2), ...
To allow spurious insertion, allow alignment with word 0 (NULL)
No. of possible alignments: 2^((I+1)*J)
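A small sketch of the alignment function using the first sequence above. It assumes I is the number of source words and J the number of target words, and it reads the slide's count as 2^((I+1)*J), the number of arbitrary link sets between positions. When every target position must map to exactly one source position (or NULL), the count is (I+1)^J instead. The source length I = 6 is assumed for the example.

```python
from collections import Counter

# alignment[j - 1] = A(j): source position for target position j (0 would mean NULL).
alignment = [2, 3, 4, 5, 6, 6, 6]


def A(j):
    """Alignment function from a 1-based target position to a source position."""
    return alignment[j - 1]


def num_function_alignments(I, J):
    """Each of J target positions picks one of I source positions or NULL."""
    return (I + 1) ** J


def num_general_alignments(I, J):
    """Every (source-or-NULL, target) link is independently present or absent."""
    return 2 ** ((I + 1) * J)


# How many target words each source position generates (1 -> many).
fertility = Counter(alignment)

print(A(1), A(2), fertility[6])
print(num_function_alignments(6, 7), num_general_alignments(6, 7))
```

Here source word 6 generates three target words, which is exactly the 1->Many case that the simplified generation process allows.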
29