Automatic evaluation of Machine Translation (MT): BLEU, its
shortcomings and other evaluation metrics
Presented by: Aditya Joshi, Kashyap Popat, Shubham Gautam (IIT Bombay)
(As a part of CS 712)
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay
Part I: Introduction and formulation of BLEU
"Poetry is what gets lost in translation."
– Robert Frost, Poet (1874–1963)
Wrote the famous poem 'Stopping by Woods on a Snowy Evening', better known by its line 'Miles to go before I sleep'.
Motivation
How do we judge a good translation?
Can a machine do this?
Why should a machine do this? Because humans take time!
Outline
• Evaluation (Part I)
• Formulating BLEU Metric (Part I)
• Understanding BLEU formula (Part I)
• Shortcomings of BLEU (Part II)
• Shortcomings in context of English-Hindi MT (Part II)
Doug Arnold, Louisa Sadler, and R. Lee Humphreys, Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah, Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU, ICON 2007, Hyderabad, India, Jan 2007.
BLEU - I
The BLEU Score rises
Evaluation [1]
• Of NLP systems
• Of MT systems
Evaluation in NLP: Precision/Recall
• Precision: How many of the returned results were correct?
• Recall: What portion of the correct results were returned?
Adapting precision/recall to NLP tasks:
Evaluation in NLP: Precision/Recall
• Document Retrieval
  Precision = |Documents relevant and retrieved| / |Documents retrieved|
  Recall = |Documents relevant and retrieved| / |Documents relevant|
• Classification
  Precision = |True Positives| / |True Positives + False Positives|
  Recall = |True Positives| / |True Positives + False Negatives|
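As a minimal illustration of these two ratios (the counts below are hypothetical, not from any experiment in this lecture):

```python
# Minimal sketch: precision and recall from classification counts.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
print(precision(40, 10))  # 0.8
print(recall(40, 20))     # 0.666...
```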
Evaluation in MT [1]
• Operational evaluation
  – "Is MT system A operationally better than MT system B? Does MT system A cost less?"
• Typological evaluation
  – "Which linguistic phenomena does the MT system cover, and has that coverage been verified?"
• Declarative evaluation
  – "How does the quality of output of system A fare with respect to that of system B?"
Operational evaluation
• Cost-benefit is the focus
• To establish cost-per-unit figures and use this as a basis for comparison
• Essentially ‘black box’
• Realism requirement
Example: operational comparison of word-based and sense-based sentiment analysis (SA)
• Word-based SA: cost of pre-processing (stemming, etc.), cost of learning a classifier
• Sense-based SA: cost of pre-processing (stemming, etc.), cost of learning a classifier, cost of sense annotation
Typological evaluation
• Use a test suite of examples
• Ensure all relevant phenomena to be tested are covered
• Language-specific phenomena
For example: हर आँख वाला लड़का मु कुराया
hari aankhon-wala ladka muskuraya (green eyes-with boy smiled)
The boy with green eyes smiled.
Declarative evaluation
• Assign scores to specific qualities of output
  – Intelligibility: How good the output is as a well-formed target language entity
  – Accuracy: How good the output is in terms of preserving the content of the source text
For example, "I am attending a lecture":
म एक या यान बैठा हू ँ
Main ek vyaakhyan baitha hoon (I a lecture sit, present-first person)
"I sit a lecture": accurate but not intelligible
म या यान हू ँ
Main vyakhyan hoon (I lecture am)
"I am lecture": intelligible but not accurate.
Evaluation bottleneck
• Typological evaluation is time-consuming
• Operational evaluation needs accurate modeling of cost-benefit
• Automatic MT evaluation: Declarative
BLEU: Bilingual Evaluation Understudy
Deriving BLEU [2]
• Incorporating Precision
• Incorporating Recall
How is translation performance measured?
The closer a machine translation is to a
professional human translation, the better it is.
• A corpus of good quality human reference translations
• A numerical “translation closeness” metric
Preliminaries
• Candidate Translation(s): Translation returned by an MT system
• Reference Translation(s): ‘Perfect’ translation by humans
Goal of BLEU: To correlate with human
judgment
Formulating BLEU (Step 1): Precision

Source: I had lunch now.

Reference 1: मैने अभी खाना खाया
maine abhi khana khaya (I now food ate)
I ate food now.

Reference 2: मैने अभी भोजन कया
maine abhi bhojan kiyaa (I now meal did)
I did meal now.

Candidate 1: मैने अब खाना खाया
maine ab khana khaya (I now food ate)
I ate food now
Matching unigrams: 3, matching bigrams: 1

Candidate 2: मैने अभी लंच एट
maine abhi lunch ate (I now lunch ate)
I ate lunch(OOV) now(OOV)
Matching unigrams: 2, matching bigrams: 1

Unigram precision: Candidate 1: 3/4 = 0.75, Candidate 2: 2/4 = 0.5
Similarly, bigram precision: Candidate 1: 0.33, Candidate 2: 0.33
Precision: Not good enough

Reference: मुझपर तेरा सु र छाया
mujh-par tera suroor chhaaya (me-on your spell cast)
Your spell was cast on me

Candidate 1: मेरे तेरा सु र छाया
mere tera suroor chhaaya (my your spell cast)
Your spell cast my
Matching unigrams: 3

Candidate 2: तेरा तेरा तेरा सु र
tera tera tera suroor (your your your spell)
Matching unigrams: 4

Unigram precision: Candidate 1: 3/4 = 0.75, Candidate 2: 4/4 = 1
Formulating BLEU (Step 2): Modified Precision
• Clip the total count of each candidate word with its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)

Reference: मुझपर तेरा सु र छाया
mujh-par tera suroor chhaaya (me-on your spell cast)
Your spell was cast on me

Candidate 2: तेरा तेरा तेरा सु र
tera tera tera suroor (your your your spell)

Matching unigrams: तेरा: min(3, 1) = 1; सु र: min(1, 1) = 1
Modified unigram precision: 2/4 = 0.5
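A minimal Python sketch of this clipping step, run on the transliterated tokens of the example above (an illustration only, not the reference implementation of [2]):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate against one or more references."""
    cand = ngram_counts(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, cnt in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)   # max_ref_count
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
    return clipped / sum(cand.values())

reference = "mujh-par tera suroor chhaaya".split()
candidate = "tera tera tera suroor".split()
print(modified_precision(candidate, [reference]))  # 0.5
```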
Modified n-gram precision

For the entire test corpus, for a given n (formula from [2]):

p_n = [ Σ over candidates C of Σ over n-grams in C of Count_clip(n-gram) ] / [ Σ over candidates C' of Σ over n-grams' in C' of Count(n-gram') ]

Numerator: matching (clipped) n-grams in candidate C; denominator: all n-grams in candidate C'; both summed over all candidates of the test corpus.
Calculating modified n-gram precision (1/2)
• 127 source sentences were translated by two human translators and three MT systems
• The translated sentences were evaluated against professional reference translations using modified n-gram precision

Calculating modified n-gram precision (2/2)
• Precision decays with increasing n
• The comparative ranking of the five systems is preserved at each n
• How do we combine the precisions for different values of n?
(Graph from [2])
Formulation of BLEU: Recap
• Precision cannot be used as is
• Modified precision considers ‘clipped word
count’
Recall for MT (1/2)
• Candidates shorter than references
• Reference: या लू लंबे वा य क गुणव ता को समझ पाएगा?
kya blue lambe vaakya ki guNvatta ko samajh paaega?
will blue long sentence-of quality (case-marker) understand able(III-person- male-singular)?
Will blue be able to understand quality of long sentence?
Candidate: लंबे वा य
lambe vaakya long sentence long sentence
modified unigram precision: 2/2 = 1
modified bigram precision: 1/1 = 1
Recall for MT (2/2)
• Candidates longer than references

Reference 1: मैने खाना खाया
maine khaana khaaya (I food ate)
I ate food

Reference 2: मैने भोजन कया
maine bhojan kiyaa (I meal did)
I had a meal

Candidate 1: मैने खाना भोजन कया
maine khaana bhojan kiya (I food meal did)
I had food meal
Modified unigram precision: 1

Candidate 2: मैने खाना खाया
maine khaana khaaya (I food ate)
I ate food
Modified unigram precision: 1
Formulating BLEU (Step 3):
Incorporating recall
• Sentence length indicates ‘best match’
• Brevity penalty (BP):
– Multiplicative factor
– Candidate translations that match reference translations in length must be ranked higher
Candidate 1: लंबे वा य
Candidate 2: या लू लंबे वा य क गुणव तासमझ
पाएगा?
Formulating BLEU (Step 3): Brevity Penalty

BP = 1                  if c > r
BP = e^(1 - r/c)        if c <= r

r: reference sentence length, c: candidate sentence length (formula from [2])
(The BP curve is e^(1 - x) plotted against x = r/c; graph drawn using www.fooplot.com)

BP = 1 for c > r. Why?
BP does not penalize longer translations. Why?
Translations longer than the reference are already penalized by modified precision.
Validating the claim: the full formula from [2]

BLEU score
Precision -> modified n-gram precision p_n
Recall -> brevity penalty BP

BLEU = BP * exp( Σ_{n=1..N} w_n * log p_n ), with BP as defined above and uniform weights w_n = 1/N.
Understanding BLEU
• Dissecting the formula
• Using it: a news headline example

Decay in precision
Why log p_n? To accommodate the decay in precision values with increasing n (averaging log p_n amounts to taking a geometric mean of the p_n).
(Graph and formula from [2])
Dissecting the Formula
Claim: BLEU should lie between 0 and 1
Reason: To intuitively satisfy "1 implies perfect translation"
Understanding the constituents of the formula to validate the claim:
• Brevity penalty BP
• Modified precision p_n
• Weights w_n, set to 1/N
(Formula from [2])
Validation of range of BLEU
• p_n: between 0 and 1
• log p_n: between -infinity and 0
• A = Σ w_n * log p_n: between -infinity and 0
• e^A: between 0 and 1
• BP: between 0 and 1 (it is e^(1 - r/c) plotted against r/c, capped at 1)
Hence BLEU = BP * e^A lies between 0 and 1.
(Graph drawn using www.fooplot.com)
Calculating BLEU: An actual example

Ref: भावना मक एक करण लाने के लए एक योहार
bhaavanaatmak ekikaran laane ke liye ek tyohaar (emotional unity bring-to a festival)
A festival to bring emotional unity

C1: लाने के लए एक योहार भावना मक एक करण
laane ke liye ek tyohaar bhaavanaatmak ekikaran (bring-to a festival emotional unity)
(This is invalid Hindi ordering)

C2: को एक उ सव के बारे म लाना भावना मक एक करण
ko ek utsav ke baare mein laana bhaavanaatmak ekikaran (for a festival-about to-bring emotional unity)
(This is invalid Hindi ordering)
Modified n-gram precision

Ref: भावना मक एक करण लाने के लए एक योहार
C1: लाने के लए एक योहार भावना मक एक करण
C2: को एक उ सव के बारे म लाना भावना मक एक करण

C1 (r = 7, c = 7):
n = 1: matching n-grams = 7, total n-grams = 7, modified precision = 1
n = 2: matching n-grams = 5, total n-grams = 6, modified precision = 0.83

C2 (r = 7, c = 9):
n = 1: matching n-grams = 4, total n-grams = 9, modified precision = 0.44
n = 2: matching n-grams = 1, total n-grams = 8, modified precision = 0.125
Calculating BLEU score (w_n = 1/N = 1/2, r = 7)

C1 (c = 7, BP = exp(1 - 7/7) = 1):
n = 1: w_n = 1/2, p_n = 1, log p_n = 0, w_n * log p_n = 0
n = 2: w_n = 1/2, p_n = 0.83, log p_n = -0.186, w_n * log p_n = -0.093
Total: -0.093  =>  BLEU = 1 * exp(-0.093) = 0.911

C2 (c = 9, BP = 1):
n = 1: w_n = 1/2, p_n = 0.44, log p_n = -0.821, w_n * log p_n = -0.4105
n = 2: w_n = 1/2, p_n = 0.125, log p_n = -2.07, w_n * log p_n = -1.035
Total: -1.445  =>  BLEU = 1 * exp(-1.445) = 0.235
Hence, the BLEU scores...

Ref: भावना मक एक करण लाने के लए एक योहार
C1: लाने के लए एक योहार भावना मक एक करण  ->  0.911
C2: को एक उ सव के बारे म लाना भावना मक एक करण  ->  0.235
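A short sentence-level sketch (N = 2, uniform weights, single reference, no smoothing) that reproduces this calculation on the transliterated tokens; it prints roughly 0.913 and 0.236, matching the slide's 0.911 and 0.235 up to the rounding of p_n:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU; assumes at least one n-gram match at every order."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        log_sum += (1.0 / max_n) * math.log(clipped / sum(cand.values()))
    r, c = len(reference), len(candidate)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(log_sum)

ref = "bhaavanaatmak ekikaran laane ke liye ek tyohaar".split()
c1 = "laane ke liye ek tyohaar bhaavanaatmak ekikaran".split()
c2 = "ko ek utsav ke baare mein laana bhaavanaatmak ekikaran".split()
print(round(bleu(c1, ref), 3), round(bleu(c2, ref), 3))  # 0.913 0.236
```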
BLEU v/s human judgment [2]

Setup
Source language: Chinese; target language: English
Five systems perform translation: 3 automatic MT systems and 2 human translators
BLEU scores are obtained for each system
Human judgment (on a scale of 5) is obtained for each system from:
• Group 1: Ten monolingual speakers of the target language (English)
• Group 2: Ten bilingual speakers of Chinese and English
BLEU v/s human judgment
• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96

BLEU and human evaluation for S2 and S3: comparison of normalized values (graphs from [2])
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between {S1-S3} and {H1-H2} is captured by BLEU
Conclusion
• Introduced different evaluation methods
• Formulated BLEU score
• Analyzed the BLEU score by:
– Considering constituents of formula
– Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU
(continued from 7th March, 2013)
BLEU score revisited
BLEU = BP * exp( Σ_{n=1..N} w_n * log p_n )
Precision -> modified n-gram precision p_n
Recall -> brevity penalty BP
(Formula from [2])
Outline
• Evaluation (Part I)
• Formulating BLEU Metric (Part I)
• Understanding BLEU formula (Part I)
• Shortcomings of BLEU in general (Part II)
• Shortcomings in context of English-Hindi MT (Part II)
Chris Callison-Burch, Miles Osborne, Philipp Koehn, Re-evaluating the role of Bleu in Machine Translation Research, European ACL (EACL) 2006.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah, Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU, ICON 2007, Hyderabad, India, Jan 2007.
[A domesticated bird wants to learn to fly]
"I can do this. I just have to work out the physics.
I have quadrated my vector angles.
I have adjusted for wind shear.
This is it! Let's fly! Just keep it simple.
Thrust, lift, drag and wait. Thrust, lift, drag, wait."
[speeds up to get to the edge of the table]
"Thrust, lift, drag, wai...wai...wait!"
[gets scared and tries to stop himself, but instead falls from the table]

Lines from Rio (2011) - Blu, the Macaw
Image source: www.fanpop.com
Use of BLEU metric [3]
• Evaluating incremental system changes
– Neglecting actual examples
• Systems ranked on basis of BLEU
• Does minimizing error rate with respect to BLEU
indeed guarantee translation improvements?
Criticisms of BLEU [3][4]
• Admits too much variation [3]
• Admits too little variation [4]
• Poor correlation with human judgment [3]
Admits too much variation [3]
• Permuting phrases
• Drawing different items from the reference set
Admits too much variation
• BLEU relies on n-gram matching only
• Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
(Recall the formula: the brevity penalty incorporates recall, p_n is the modified precision, and the weights w_n are set to 1/N)
Modified n-gram precision in BLEU [2]
• Clip the total count of each candidate word with its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)

Reference: मुझपर तेरा सु र छाया
mujhpar teraa suroor chhaaya (me-on your spell has-been-cast)
Your spell has been cast on me

Candidate 2: तेरा तेरा तेरा सु र
tera tera tera suroor (your your your spell)
Your your your spell

Matching unigrams: तेरा: min(3, 1) = 1; सु र: min(1, 1) = 1
Modified unigram precision: 2/4 = 0.5
Permuting phrases [3]
• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
• It is possible to randomly produce other hypothesis translations that have the same BLEU score

For example, if B1 B2 B3 B4 are the chunks delimited by bigram mismatch sites, the reordering B4 B2 B1 B3 receives the same score.
Issues with precision (1/2)

E: The king and the queen went to the jungle to hunt.

Reference 1: राजा और रानी जंगल को शकार के लए गये
raaja aur raani jangal ko shikaar ke liye gaye (King and queen to-jungle for-hunting went)

Reference 2: राजा और उनक बीवी शकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye (king and his wife to-do-hunting jungle went)

Candidate: राजा और रानी शकार करने जंगल म गये
raaja aur raani shikaar karne jungal mein chale gaye (King and queen to-do-hunting to-jungle went)
Matching bi-grams = 4 / 8

Reordered candidate: राजा और रानी शकार करने जंगल गये म
raaja aur raani shikaar karne gaye jungle mein (King and queen to-do-hunting went jungle to) (grammatically incorrect)
Matching bi-grams = 4 / 8
Issues with precision (2/2)

E: The king and the queen went to the jungle to hunt.

Reference 1: राजा और रानी जंगल को शकार के लए गये
raaja aur raani jangal ko shikaar ke liye gaye (King and queen to-jungle for-hunting went)

Reference 2: राजा और उनक बीवी शकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye (king and his wife to-do-hunting jungle went)

Candidate: राजा और रानी शकार करने जंगल म गये
raaja aur raani shikaar karne jungal mein chale gaye (King and queen to-do-hunting to-jungle went)
Matching bi-grams = 4 / 8

Permuted candidate: शकार करने जंगल राजा और रानी म गये
shikaar karne jungle raaja aur raani mein gaye (to-do hunting jungle raja and rani in went) (grammatically incorrect)
Matching bi-grams = 4 / 8
Permuting phrases, in general
• For 'b' bi-gram matches in a candidate translation of length 'k', there are (k - b)! possible ways to generate similarly scored items using only the words in this translation
• In our example, (8 - 4)! = 24 candidate translations

Why (k - b)!? In a sentence of length k:
total bigrams = k - 1
matched bigrams = b
number of mismatched bigrams = k - 1 - b
number of chunks delimited by these mismatch sites = (k - 1 - b) + 1 = k - b
These (k - b) chunks can be reordered in (k - b)! ways.
Permuting phrases: Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translations
Drawing different items from the reference set
• If two systems ‘recall’ x words each from a reference sentence,
• Precision may remain the same
• Brevity penalty remains the same
• Translation quality need not remain the same
Drawing different items: Example
You may omit lexical items altogether!

Reference: घर क मुग मतलब दाल होती है
Ghar ki murgi matlab daal jaisi hoti hai (House-of chicken means daal-like is)
Chicken at home is like daal.

Candidate 1: मुग मतलब होती है
Ki matlab jaisi hai (Of means like is)
Unigram precision: 1 / 1, Bigram precision: 2 / 3

Candidate 2: घर क मुग दाल
Ghar murgi daal hoti (House chicken daal is)
Unigram precision: 1 / 1, Bigram precision: 2 / 3
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidates C1 and C2:
  – The BLEU score remains the same
  – Which words are actually matched is not accounted for
  – Synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment
  – In the 2005 exercise: the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with a high human judgment score (but a lower BLEU score) acts as an outlier

Pearson correlation (R²) between BLEU and human judgment (graphs from [3]):
• First graph: outlier included 0.14, outlier excluded 0.87
• Second graph: outlier included 0.002, outlier excluded 0.742
Failures in practice: Evaluation 2 (1/2)
[3]
• French-English
• System 1: Rule-based system using Systran
• System 2/3: SMT system using two sets of
Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment
(Graph from [3])
Admits too much and too little variation [4]
Indicative translations
Linguistic divergences & evaluation
Indicative translation
• Draft quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems
today produce indicative translations
Basic steps to obtain indicative translation: English-Hindi
• Structural transfer
– S V O to S O V
• Lexical transfer
– Looking up corresponding word
– Adding gender, aspect, tense, etc. information
Indicative translation: Example

Rohan goes to school by bus

Structural transfer: SVO to SOV
(Rohan) (to school) (by bus) (goes)

Lexical transfer: word translation
(Rohan) (paathshaala) (bus) (jaana)

Lexical transfer: adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai
Natural Hindi order: Rohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category
(In these examples, E is the English source, H the human reference translation, and I the indicative MT output.)

E: I am feeling hungry
H: मुझे भूख लग रह है
mujhe bhookh lag rahi hai (to-me hunger feeling is)
I: म भूखा महसूस कर रहा हू ँ
main bhookha mehsoos kar raha hoon (I hungry feel doing am)
Unigram precision: 0, Bigram precision: 0, BLEU score: 0

Another example:
E: I am sweating
H: मुझे पसीना आ रहा है
Mujhe paseena aa raha hai (To-me sweat coming is)
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances (Aamir khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances (Ten best Aamir Khan performances)

Unigram precision: 5/5, Bigram precision: 2/4
Cultural differences

E: Food, clothing and shelter are a man's basic needs
H: रोट , कपड़ा और मकान एक मनु य क बु नयाद ज़ रत ह
roti, kapda aur makaan ek manushya ki buniyaadi zarooratein hain (bread, clothing and house a man of basic needs are)
I: खाना, कपड़ा और आ य एक मनु य क बु नयाद ज़ रत ह
khaana, kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain (food, clothing and shelter a man of basic needs are)

Unigram precision: 8/10, Bigram precision: 6/9
• Another example: 'maamaa'/'taau' (kinship terms without a single English equivalent)
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह है
baarish ho rahi hai (rain happening is)
I: यह बा रश हो रह है
yeh baarish ho rahi hai (it rain happening is)
Unigram precision: 4/5, Bigram precision: 3/4

Another example:
E: One should not trouble the weak.
H: दुबल को परेशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye (to-weak trouble not do should)
Should not trouble the weak
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain (Lok Sabha in 545 members are)
I: लोक सभा के पास 545 सद य ह
lok sabha ke paas 545 sadasya hain (Lok Sabha has/near 545 members are)

Unigram precision: 5/7, Bigram precision: 3/6
WSD errors
• Synonyms v/s incorrect senses

E: I purchased a bat
H: मैने एक ब ला खर दा (reference)
maine ek ballaa kharidaa (I a cricket-bat bought) - I bought a cricket bat
I: मैने एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa (I a bat (mammal) bought) - I bought a bat (mammal)

E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa (thieves arrest done) - The thieves were arrested
I1: चोर को पकड़ा
choron ko pakdaa (thieves caught) - The thieves were caught
I2: चोर को आयोिजत कया
choron ko aayojit kiya (thieves organized done) - The thieves were organized
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences
  – Limiting the use of BLEU [3][4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU. Use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU’s aim of correlating with human
judgment does not go well with goals of
indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS 712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March, 2013
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Source: www.jaishreeganesha.com
Lord Karthikeya:
“I won!
I went around the earth, the world once!”
Lord Ganesha:
“I won!
I went around my parents. They are my world!”
Who among them performed better?
Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM
• Entailment-based MT evaluation
Preliminaries
• Candidate Translation(s): Translation returned by an MT system
– Also called hypothesis translation
• Reference Translation(s): ‘Perfect’ translation by humans
• Comparing metrics with BLEU on three axes:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate: undesirable

Adequacy: Is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: Is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बैठा हू ँ
Main ek vyaakhyan baitha hoon (I a lecture sit, present-first person)
I sit a lecture: adequate but not fluent

म या यान हू ँ
Main vyakhyan hoon (I lecture am)
I am lecture: fluent but not adequate
BLEU [2]
• Proposed by IBM. A popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation edit rate [5] (TER)
• Introduced in GALE MT task
Central idea: Edits required to change a hypothesis translation into a reference translation
Example: Prime Minister of India will address the nation today

Reference translation: भारत के धान-मं ी आज रा को संबो धत करगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge

Candidate translation: धान-मं ी भारत के आज को अ ेस करगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj

Edit types marked in the example: shift, deletion, substitution, insertion
Formula for TER

TER = # edits / average # reference words

• Cost of shift 'distance' is not incorporated
• Mis-capitalization is also considered an error
For the example above: TER = 4 / 8 (one shift, one deletion, one substitution, one insertion)
HTER: Human TER
• Semi-supervised technique
• Human annotators make new reference for each translation based on system output
– Must consider fluency and adequacy while
generating the closest target translation
TER v/s BLEU
• Handling incorrect words: TER counts a substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER captures this error via a shift (or delete + insert); BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
(TER = # edits / average # reference words)
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• Minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: position-independent word error rate (PER)
WER: Example
• Order of words is important
• A dynamic programming-based implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence.
Candidate translation: This seems the right sentence.
# errors: 3 (looks -> seems, 'like' deleted, correct -> right)
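A minimal dynamic-programming sketch of the word-level Levenshtein distance; on the example above it finds 3 errors, so WER = 3/6 = 0.5:

```python
def word_errors(ref: str, hyp: str) -> int:
    """Minimum substitutions, deletions and insertions to turn hyp into ref."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between the first i ref words and the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(r)][len(h)]

errors = word_errors("this looks like the correct sentence",
                     "this seems the right sentence")
print(errors)  # 3  ->  WER = 3 / 6 = 0.5
```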
WER v/s BLEU
• Handling incorrect words: WER counts a substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER counts a delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER v/s TER
• Shift: WER does not take shifts into account; TER incorporates shifts, though they are not weighted by distance
• Intuition: WER is based on word-level (unigram) translation errors; TER is based on the edits a human would require
(TER = # edits / average # reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N,
ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall

Will BLEU be able to understand the quality of long sentences?

Reference translation: या लू लंबे वा य क गुणव ता को समझ पाएगा?
Kya bloo lambe waakya ki guNvatta ko samajh paaega?

Candidate translation: लंबे वा य
Lambe vaakya

ROUGE-2 (bigram recall): 1 / 8; modified bigram precision: 1 / 1 = 1
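A minimal ROUGE-N sketch (n-gram recall with respect to the reference), run on the transliterated tokens of this example (the spelling 'waakya' is normalized so the candidate can match):

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """ROUGE-N: n-gram recall of the candidate with respect to the reference."""
    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum(min(cnt, cand[gram]) for gram, cnt in ref.items())
    return overlap / sum(ref.values())

ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, 1))  # 2/9 = 0.222...
print(rouge_n(cand, ref, 2))  # 1/8 = 0.125, the value on the slide
```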
Other ROUGEs
• ROUGE-L
– Considers longest common subsequence
• ROUGE-W
– Weighted ROUGE-L: All common subsequences are considered with weight based on length
• ROUGE-S
– Precision/Recall by matching skip bigrams
ROUGE (suite of metrics) v/s BLEU
• Handling incorrect words: ROUGE uses skip bigrams and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE-L uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
Aims to do better than BLEU
Central idea: Have a good unigram matching
strategy
METEOR: Criticisms of BLEU
• Brevity penalty is punitive
• Higher order n-grams may not indicate grammatical correctness of a sentence
• BLEU is often zero. Should a score be zero?
METEOR: Process

Phase I: Find an alignment between words
• List all possible mappings based on matches
Phase II: Select the best possible alignment
• Select the alignment with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Example:
Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined via a harmonic mean
• Penalty: find 'as many chunks' as possible that match, e.g., aligning "The bright boy sits on the black bench" with "The intelligent guy sat on the dark bench"
• More accurate -> fewer chunks, lower penalty; less accurate -> more chunks, higher penalty
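For reference, the combination defined in [7], where P and R are unigram precision and recall over the chosen alignment, ch is the number of chunks, and m is the number of matched unigrams:

$$F_{mean} = \frac{10PR}{R + 9P}, \qquad Penalty = 0.5\left(\frac{ch}{m}\right)^{3}, \qquad Score = F_{mean}\,(1 - Penalty)$$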
METEOR v/s BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done using different techniques (adaptable); BLEU sees an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: METEOR's notion of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
• Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
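The weighting in [8] is derived from n-gram counts over the reference data; roughly, the information assigned to a matched n-gram is

$$Info(w_1 \ldots w_n) = \log_2\left(\frac{count(w_1 \ldots w_{n-1})}{count(w_1 \ldots w_n)}\right)$$

so rare n-grams (those occurring much less often than their (n-1)-gram prefix) contribute more to the score than frequent function-word sequences.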
NIST v/s BLEU
• Handling incorrect words: NIST adds an information-gain-based level of 'matching'; BLEU sees an n-gram mismatch
• Handling incorrect word order: no additional provision; an n-gram mismatch (weighted by information gain) in NIST, a plain n-gram mismatch in BLEU
• Handling recall: NIST uses a brevity penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier
GTM Scores: Precision and Recall
• MMS: Maximum Match Size
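A simplified sketch of how MMS enters the score, assuming the basic precision/recall/F-measure combination described in [10] (longer contiguous runs of matching words increase the match size):

$$P = \frac{MMS(C, R)}{|C|}, \qquad Rec = \frac{MMS(C, R)}{|R|}, \qquad GTM = \frac{2 \cdot P \cdot Rec}{P + Rec}$$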
GTM v/s BLEU
• Handling incorrect words: GTM uses precision based on the maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM uses recall based on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?

E: I am feeling hungry
H (Ref): मुझे भूख लग रह है
mujhe bhookh lag rahi hai
Candidate: म भूखा महसूस कर रहा हू ँ
main bhookha mehsoos kar raha hoon
BLEU score: 0
Clearly, the candidate is entailed by the reference translation.
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation and vice versa.

The Stanford RTE system (pipeline):
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation
The same pipeline (analysis, alignment, entailment prediction) is applied, but in both directions.

Directionality
• The standard entailment recognition task is asymmetric
• For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate
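A tiny sketch of this bidirectional check; entails() below is a hypothetical stand-in for a full RTE pipeline (such as the one outlined above), not an API provided by [12]:

```python
# Illustrative only: 'entails' stands in for an RTE system's decision function.
def adequate(candidate: str, reference: str, entails) -> bool:
    """A translation is judged fully adequate only if entailment holds both ways."""
    return entails(candidate, reference) and entails(reference, candidate)
```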
Experimental Setup
• Single reference translation
• Two MT dataset families with English as the target language

NIST MT datasets:
• NIST06: The NIST MT-06 Arabic-to-English dataset
• NIST08A: The NIST MT-08 Arabic-to-English dataset
• NIST08C: The NIST MT-08 Chinese-to-English dataset
• NIST08U: The NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
• SMT06E: The NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: The NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: The ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four systems are compared:
• BLEU-4
• MT
• RTE
• RTE+MT

Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated BLEU score
• Analyzed the BLEU score by:
– Considering constituents of formula
– Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU’s aim of correlating with human
judgment does not go well with goals of
indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; weights content and function words differently
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also Human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys, "Evaluation: an assessment", Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation", IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, Philipp Koehn, "Re-evaluating the role of Bleu in Machine Translation Research", European ACL (EACL) 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah, "Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU", ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul, "A study of translation edit rate with targeted human annotation", Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin, "ROUGE: a package for automatic evaluation of summaries", 2004.
[7] Satanjeev Banerjee and Alon Lavie, "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments", Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington, "Automatic evaluation of machine translation quality using n-gram co-occurrence statistics", Proceedings of the Second International Conference on Human Language Technology Research, HLT 2002.
[9] Maja Popović and Hermann Ney, "Word error rates: decomposition over POS classes and applications for error analysis", Proceedings of the Second Workshop on Statistical Machine Translation, StatMT 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed, "Evaluation of Machine Translation and its Evaluation", Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder, "(Meta-) Evaluation of Machine Translation", ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, Chris Manning, "Evaluating MT output with entailment technology", 2008.