Automatic evaluation of Machine Translation (MT): BLEU, its
shortcomings and other evaluation metrics
Presented by: Aditya Joshi, Kashyap Popat, Shubham Gautam (IIT Bombay)
(As a part of CS 712)
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay
Part I: Introduction and formulation of BLEU
"Poetry is what gets lost in translation."
– Robert Frost, Poet (1874–1963)
Wrote the famous poem 'Stopping by Woods on a Snowy Evening', better known by its line 'Miles to go before I sleep'.
Motivation
How do we judge a good translation?
Can a machine do this?
Why should a machine do this? Because humans take time!
Outline
• Evaluation (Part I)
• Formulating BLEU Metric (Part I)
• Understanding BLEU formula (Part I)
• Shortcomings of BLEU (Part II)
• Shortcomings in context of English-Hindi MT (Part II)
Doug Arnold, Louisa Sadler, and R. Lee Humphreys, Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah, Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU, ICON 2007, Hyderabad, India, Jan 2007.
BLEU - I
The BLEU Score rises
Evaluation [1]
• Of NLP systems
• Of MT systems
Evaluation in NLP: Precision/Recall
• Precision: How many of the returned results were correct?
• Recall: What portion of the correct results were returned?
Adapting precision/recall to NLP tasks:
Evaluation in NLP: Precision/Recall
• Document Retrieval
  Precision = |Documents relevant and retrieved| / |Documents retrieved|
  Recall = |Documents relevant and retrieved| / |Documents relevant|
• Classification
  Precision = |True Positives| / |True Positives + False Positives|
  Recall = |True Positives| / |True Positives + False Negatives|
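As a minimal illustration of these two ratios (the counts below are hypothetical, not from any experiment in this lecture):

```python
# Minimal sketch: precision and recall from classification counts.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
print(precision(40, 10))  # 0.8
print(recall(40, 20))     # 0.666...
```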
Evaluation in MT [1]
• Operational evaluation
  – "Is MT system A operationally better than MT system B? Does MT system A cost less?"
• Typological evaluation
  – "Which linguistic phenomena does the MT system cover, and has that coverage been verified?"
• Declarative evaluation
  – "How does the quality of output of system A fare with respect to that of system B?"
Operational evaluation
• Cost-benefit is the focus
• To establish cost-per-unit figures and use this as a basis for comparison
• Essentially ‘black box’
• Realism requirement
Example: operational comparison of word-based and sense-based sentiment analysis (SA)
• Word-based SA: cost of pre-processing (stemming, etc.), cost of learning a classifier
• Sense-based SA: cost of pre-processing (stemming, etc.), cost of learning a classifier, cost of sense annotation
Typological evaluation
• Use a test suite of examples
• Ensure all relevant phenomena to be tested are covered
• Language-specific phenomena
For example: हर आँख वाला लड़का मु कुराया
hari aankhon-wala ladka muskuraya (green eyes-with boy smiled)
The boy with green eyes smiled.
Declarative evaluation
• Assign scores to specific qualities of output
  – Intelligibility: How good the output is as a well-formed target language entity
  – Accuracy: How good the output is in terms of preserving the content of the source text
For example, "I am attending a lecture":
म एक या यान बैठा हू ँ
Main ek vyaakhyan baitha hoon (I a lecture sit, present-first person)
"I sit a lecture": accurate but not intelligible
म या यान हू ँ
Main vyakhyan hoon (I lecture am)
"I am lecture": intelligible but not accurate.
Evaluation bottleneck
• Typological evaluation is time-consuming
• Operational evaluation needs accurate modeling of cost-benefit
• Automatic MT evaluation: Declarative
BLEU: Bilingual Evaluation Understudy
Deriving BLEU [2]
• Incorporating Precision
• Incorporating Recall
How is translation performance measured?
The closer a machine translation is to a
professional human translation, the better it is.
• A corpus of good quality human reference translations
• A numerical “translation closeness” metric
Preliminaries
• Candidate Translation(s): Translation returned by an MT system
• Reference Translation(s): ‘Perfect’ translation by humans
Goal of BLEU: To correlate with human
judgment
Formulating BLEU (Step 1): Precision

Source: I had lunch now.

Reference 1: मैने अभी खाना खाया
maine abhi khana khaya (I now food ate)
I ate food now.

Reference 2: मैने अभी भोजन कया
maine abhi bhojan kiyaa (I now meal did)
I did meal now.

Candidate 1: मैने अब खाना खाया
maine ab khana khaya (I now food ate)
I ate food now
Matching unigrams: 3, matching bigrams: 1

Candidate 2: मैने अभी लंच एट
maine abhi lunch ate (I now lunch ate)
I ate lunch(OOV) now(OOV)
Matching unigrams: 2, matching bigrams: 1

Unigram precision: Candidate 1: 3/4 = 0.75, Candidate 2: 2/4 = 0.5
Similarly, bigram precision: Candidate 1: 0.33, Candidate 2: 0.33
Precision: Not good enough

Reference: मुझपर तेरा सु र छाया
mujh-par tera suroor chhaaya (me-on your spell cast)
Your spell was cast on me

Candidate 1: मेरे तेरा सु र छाया
mere tera suroor chhaaya (my your spell cast)
Your spell cast my
Matching unigrams: 3

Candidate 2: तेरा तेरा तेरा सु र
tera tera tera suroor (your your your spell)
Matching unigrams: 4

Unigram precision: Candidate 1: 3/4 = 0.75, Candidate 2: 4/4 = 1
Formulating BLEU (Step 2): Modified Precision
• Clip the total count of each candidate word with its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)

Reference: मुझपर तेरा सु र छाया
mujh-par tera suroor chhaaya (me-on your spell cast)
Your spell was cast on me

Candidate 2: तेरा तेरा तेरा सु र
tera tera tera suroor (your your your spell)

Matching unigrams: तेरा: min(3, 1) = 1; सु र: min(1, 1) = 1
Modified unigram precision: 2/4 = 0.5
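A minimal Python sketch of this clipping step, run on the transliterated tokens of the example above (an illustration only, not the reference implementation of [2]):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate against one or more references."""
    cand = ngram_counts(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, cnt in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)   # max_ref_count
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
    return clipped / sum(cand.values())

reference = "mujh-par tera suroor chhaaya".split()
candidate = "tera tera tera suroor".split()
print(modified_precision(candidate, [reference]))  # 0.5
```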
Modified n-gram precision

For the entire test corpus, for a given n (formula from [2]):

p_n = [ Σ over candidates C of Σ over n-grams in C of Count_clip(n-gram) ] / [ Σ over candidates C' of Σ over n-grams' in C' of Count(n-gram') ]

Numerator: matching (clipped) n-grams in candidate C; denominator: all n-grams in candidate C'; both summed over all candidates of the test corpus.
Calculating modified n-gram precision (1/2)
• 127 source sentences were translated by two human translators and three MT systems
• The translated sentences were evaluated against professional reference translations using modified n-gram precision

Calculating modified n-gram precision (2/2)
• Precision decays with increasing n
• The comparative ranking of the five systems is preserved at each n
• How do we combine the precisions for different values of n?
(Graph from [2])
Formulation of BLEU: Recap
• Precision cannot be used as is
• Modified precision considers ‘clipped word
count’
Recall for MT (1/2)
• Candidates shorter than references
• Reference: या लू लंबे वा य क गुणव ता को समझ पाएगा?
kya blue lambe vaakya ki guNvatta ko samajh paaega?
will blue long sentence-of quality (case-marker) understand able(III-person- male-singular)?
Will blue be able to understand quality of long sentence?
Candidate: लंबे वा य
lambe vaakya long sentence long sentence
modified unigram precision: 2/2 = 1
modified bigram precision: 1/1 = 1
Recall for MT (2/2)
• Candidates longer than references

Reference 1: मैने खाना खाया
maine khaana khaaya (I food ate)
I ate food

Reference 2: मैने भोजन कया
maine bhojan kiyaa (I meal did)
I had a meal

Candidate 1: मैने खाना भोजन कया
maine khaana bhojan kiya (I food meal did)
I had food meal
Modified unigram precision: 1

Candidate 2: मैने खाना खाया
maine khaana khaaya (I food ate)
I ate food
Modified unigram precision: 1
Formulating BLEU (Step 3):
Incorporating recall
• Sentence length indicates ‘best match’
• Brevity penalty (BP):
– Multiplicative factor
– Candidate translations that match reference translations in length must be ranked higher
Candidate 1: लंबे वा य
Candidate 2: या लू लंबे वा य क गुणव तासमझ
पाएगा?
Formulating BLEU (Step 3): Brevity Penalty

BP = 1                  if c > r
BP = e^(1 - r/c)        if c <= r

r: reference sentence length, c: candidate sentence length (formula from [2])
(The BP curve is e^(1 - x) plotted against x = r/c; graph drawn using www.fooplot.com)

BP = 1 for c > r. Why?
BP does not penalize longer translations. Why?
Translations longer than the reference are already penalized by modified precision.
Validating the claim: the full formula from [2]

BLEU score
Precision -> modified n-gram precision p_n
Recall -> brevity penalty BP

BLEU = BP * exp( Σ_{n=1..N} w_n * log p_n ), with BP as defined above and uniform weights w_n = 1/N.
Understanding BLEU
• Dissecting the formula
• Using it: a news headline example

Decay in precision
Why log p_n? To accommodate the decay in precision values with increasing n (averaging log p_n amounts to taking a geometric mean of the p_n).
(Graph and formula from [2])
Dissecting the Formula
Claim: BLEU should lie between 0 and 1
Reason: To intuitively satisfy "1 implies perfect translation"
Understanding the constituents of the formula to validate the claim:
• Brevity penalty BP
• Modified precision p_n
• Weights w_n, set to 1/N
(Formula from [2])
Validation of range of BLEU
• p_n: between 0 and 1
• log p_n: between -infinity and 0
• A = Σ w_n * log p_n: between -infinity and 0
• e^A: between 0 and 1
• BP: between 0 and 1 (it is e^(1 - r/c) plotted against r/c, capped at 1)
Hence BLEU = BP * e^A lies between 0 and 1.
(Graph drawn using www.fooplot.com)
Calculating BLEU: An actual example

Ref: भावना मक एक करण लाने के लए एक योहार
bhaavanaatmak ekikaran laane ke liye ek tyohaar (emotional unity bring-to a festival)
A festival to bring emotional unity

C1: लाने के लए एक योहार भावना मक एक करण
laane ke liye ek tyohaar bhaavanaatmak ekikaran (bring-to a festival emotional unity)
(This is invalid Hindi ordering)

C2: को एक उ सव के बारे म लाना भावना मक एक करण
ko ek utsav ke baare mein laana bhaavanaatmak ekikaran (for a festival-about to-bring emotional unity)
(This is invalid Hindi ordering)
Modified n-gram precision

Ref: भावना मक एक करण लाने के लए एक योहार
C1: लाने के लए एक योहार भावना मक एक करण
C2: को एक उ सव के बारे म लाना भावना मक एक करण

C1 (r = 7, c = 7):
n = 1: matching n-grams = 7, total n-grams = 7, modified precision = 1
n = 2: matching n-grams = 5, total n-grams = 6, modified precision = 0.83

C2 (r = 7, c = 9):
n = 1: matching n-grams = 4, total n-grams = 9, modified precision = 0.44
n = 2: matching n-grams = 1, total n-grams = 8, modified precision = 0.125
Calculating BLEU score (w_n = 1/N = 1/2, r = 7)

C1 (c = 7, BP = exp(1 - 7/7) = 1):
n = 1: w_n = 1/2, p_n = 1, log p_n = 0, w_n * log p_n = 0
n = 2: w_n = 1/2, p_n = 0.83, log p_n = -0.186, w_n * log p_n = -0.093
Total: -0.093  =>  BLEU = 1 * exp(-0.093) = 0.911

C2 (c = 9, BP = 1):
n = 1: w_n = 1/2, p_n = 0.44, log p_n = -0.821, w_n * log p_n = -0.4105
n = 2: w_n = 1/2, p_n = 0.125, log p_n = -2.07, w_n * log p_n = -1.035
Total: -1.445  =>  BLEU = 1 * exp(-1.445) = 0.235
Hence, the BLEU scores...

Ref: भावना मक एक करण लाने के लए एक योहार
C1: लाने के लए एक योहार भावना मक एक करण  ->  0.911
C2: को एक उ सव के बारे म लाना भावना मक एक करण  ->  0.235
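A short sentence-level sketch (N = 2, uniform weights, single reference, no smoothing) that reproduces this calculation on the transliterated tokens; it prints roughly 0.913 and 0.236, matching the slide's 0.911 and 0.235 up to the rounding of p_n:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU; assumes at least one n-gram match at every order."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        log_sum += (1.0 / max_n) * math.log(clipped / sum(cand.values()))
    r, c = len(reference), len(candidate)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(log_sum)

ref = "bhaavanaatmak ekikaran laane ke liye ek tyohaar".split()
c1 = "laane ke liye ek tyohaar bhaavanaatmak ekikaran".split()
c2 = "ko ek utsav ke baare mein laana bhaavanaatmak ekikaran".split()
print(round(bleu(c1, ref), 3), round(bleu(c2, ref), 3))  # 0.913 0.236
```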
BLEU v/s human judgment [2]

Setup
Source language: Chinese; target language: English
Five systems perform translation: 3 automatic MT systems and 2 human translators
BLEU scores are obtained for each system
Human judgment (on a scale of 5) is obtained for each system from:
• Group 1: Ten monolingual speakers of the target language (English)
• Group 2: Ten bilingual speakers of Chinese and English
BLEU v/s human judgment
• Monolingual speakers: correlation coefficient 0.99
• Bilingual speakers: correlation coefficient 0.96

BLEU and human evaluation for S2 and S3: comparison of normalized values (graphs from [2])
• High correlation between the monolingual group and the BLEU score
• The bilingual group was lenient on 'fluency' for H1
• The demarcation between {S1-S3} and {H1-H2} is captured by BLEU
Conclusion
• Introduced different evaluation methods
• Formulated BLEU score
• Analyzed the BLEU score by:
– Considering constituents of formula
– Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Part II: Shortcomings of BLEU
(continued from 7th March, 2013)
BLEU score revisited
BLEU = BP * exp( Σ_{n=1..N} w_n * log p_n )
Precision -> modified n-gram precision p_n
Recall -> brevity penalty BP
(Formula from [2])
Outline
• Evaluation (Part I)
• Formulating BLEU Metric (Part I)
• Understanding BLEU formula (Part I)
• Shortcomings of BLEU in general (Part II)
• Shortcomings in context of English-Hindi MT (Part II)
Chris Callison-Burch, Miles Osborne, Philipp Koehn, Re-evaluating the role of Bleu in Machine Translation Research, European ACL (EACL) 2006.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah, Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU, ICON 2007, Hyderabad, India, Jan 2007.
[A domesticated bird wants to learn to fly]
"I can do this. I just have to work out the physics.
I have quadrated my vector angles.
I have adjusted for wind shear.
This is it! Let's fly! Just keep it simple.
Thrust, lift, drag and wait. Thrust, lift, drag, wait."
[speeds up to get to the edge of the table]
"Thrust, lift, drag, wai...wai...wait!"
[gets scared and tries to stop himself, but instead falls from the table]

Lines from Rio (2011) - Blu, the Macaw
Image source: www.fanpop.com
Use of BLEU metric [3]
• Evaluating incremental system changes
– Neglecting actual examples
• Systems ranked on basis of BLEU
• Does minimizing error rate with respect to BLEU
indeed guarantee translation improvements?
Criticisms of BLEU [3][4]
• Admits too much variation [3]
• Admits too little variation [4]
• Poor correlation with human judgment [3]
Admits too much variation [3]
• Permuting phrases
• Drawing different items from the reference set
Admits too much variation
• BLEU relies on n-gram matching only
• Puts very few constraints on how n-gram matches can be drawn from multiple reference translations
(Recall the formula: the brevity penalty incorporates recall, p_n is the modified precision, and the weights w_n are set to 1/N)
Modified n-gram precision in BLEU [2]
• Clip the total count of each candidate word with its maximum reference count
• Count_clip(n-gram) = min(count, max_ref_count)

Reference: मुझपर तेरा सु र छाया
mujhpar teraa suroor chhaaya (me-on your spell has-been-cast)
Your spell has been cast on me

Candidate 2: तेरा तेरा तेरा सु र
tera tera tera suroor (your your your spell)
Your your your spell

Matching unigrams: तेरा: min(3, 1) = 1; सु र: min(1, 1) = 1
Modified unigram precision: 2/4 = 0.5
Permuting phrases [3]
• Reordering of unmatched phrases does not affect precision
• Bigram mismatch sites can be freely permuted
• It is possible to randomly produce other hypothesis translations that have the same BLEU score

For example, if B1 B2 B3 B4 are the chunks delimited by bigram mismatch sites, the reordering B4 B2 B1 B3 receives the same score.
Issues with precision (1/2)

E: The king and the queen went to the jungle to hunt.

Reference 1: राजा और रानी जंगल को शकार के लए गये
raaja aur raani jangal ko shikaar ke liye gaye (King and queen to-jungle for-hunting went)

Reference 2: राजा और उनक बीवी शकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye (king and his wife to-do-hunting jungle went)

Candidate: राजा और रानी शकार करने जंगल म गये
raaja aur raani shikaar karne jungal mein chale gaye (King and queen to-do-hunting to-jungle went)
Matching bi-grams = 4 / 8

Reordered candidate: राजा और रानी शकार करने जंगल गये म
raaja aur raani shikaar karne gaye jungle mein (King and queen to-do-hunting went jungle to) (grammatically incorrect)
Matching bi-grams = 4 / 8
Issues with precision (2/2)

E: The king and the queen went to the jungle to hunt.

Reference 1: राजा और रानी जंगल को शकार के लए गये
raaja aur raani jangal ko shikaar ke liye gaye (King and queen to-jungle for-hunting went)

Reference 2: राजा और उनक बीवी शकार करने जंगल गये
raaja aur unki biwi shikaar karne jangal gaye (king and his wife to-do-hunting jungle went)

Candidate: राजा और रानी शकार करने जंगल म गये
raaja aur raani shikaar karne jungal mein chale gaye (King and queen to-do-hunting to-jungle went)
Matching bi-grams = 4 / 8

Permuted candidate: शकार करने जंगल राजा और रानी म गये
shikaar karne jungle raaja aur raani mein gaye (to-do hunting jungle raja and rani in went) (grammatically incorrect)
Matching bi-grams = 4 / 8
Permuting phrases, in general
• For 'b' bi-gram matches in a candidate translation of length 'k', there are (k - b)! possible ways to generate similarly scored items using only the words in this translation
• In our example, (8 - 4)! = 24 candidate translations

Why (k - b)!? In a sentence of length k:
total bigrams = k - 1
matched bigrams = b
number of mismatched bigrams = k - 1 - b
number of chunks delimited by these mismatch sites = (k - 1 - b) + 1 = k - b
These (k - b) chunks can be reordered in (k - b)! ways.
Permuting phrases: Evaluation
• 2005 NIST evaluation data
• An MT system (2nd best in the task) produced the candidate translations
Drawing different items from the reference set
• If two systems ‘recall’ x words each from a reference sentence,
• Precision may remain the same
• Brevity penalty remains the same
• Translation quality need not remain the same
Drawing different items: Example
You may omit lexical items altogether!

Reference: घर क मुग मतलब दाल होती है
Ghar ki murgi matlab daal jaisi hoti hai (House-of chicken means daal-like is)
Chicken at home is like daal.

Candidate 1: मुग मतलब होती है
Ki matlab jaisi hai (Of means like is)
Unigram precision: 1 / 1, Bigram precision: 2 / 3

Candidate 2: घर क मुग दाल
Ghar murgi daal hoti (House chicken daal is)
Unigram precision: 1 / 1, Bigram precision: 2 / 3
Drawing different items: In general
• If there are x unigram matches and y bigram matches in both candidates C1 and C2:
  – The BLEU score remains the same
  – Which words are actually matched is not accounted for
  – Synonyms also count as a 'mismatch'
Failures in practice: Evaluation 1 (1/2) [3]
• NIST MT Evaluation exercise
• Failure to correlate with human judgment
  – In the 2005 exercise: the system ranked 1st by human judgment was ranked 6th by BLEU score
• Seven systems compared for adequacy and fluency
Failures in practice: Evaluation 1 (2/2) [3]
• The system with a high human judgment score (but a lower BLEU score) acts as an outlier

Pearson correlation (R²) between BLEU and human judgment (graphs from [3]):
• First graph: outlier included 0.14, outlier excluded 0.87
• Second graph: outlier included 0.002, outlier excluded 0.742
Failures in practice: Evaluation 2 (1/2)
[3]
• French-English
• System 1: Rule-based system using Systran
• System 2/3: SMT system using two sets of
Europarl data (28 million words per language)
• 300 test sentences
Failures in practice: Evaluation 2 (2/2) [3]
• BLEU underestimates human judgment
(Graph from [3])
Admits too much and too little variation [4]
Indicative translations
Linguistic divergences & evaluation
Indicative translation
• Draft quality translations
• For assimilation rather than dissemination
• Virtually all general-purpose MT systems
today produce indicative translations
Basic steps to obtain indicative translation: English-Hindi
• Structural transfer
– S V O to S O V
• Lexical transfer
– Looking up corresponding word
– Adding gender, aspect, tense, etc. information
Indicative translation: Example

Rohan goes to school by bus

Structural transfer: SVO to SOV
(Rohan) (to school) (by bus) (goes)

Lexical transfer: word translation
(Rohan) (paathshaala) (bus) (jaana)

Lexical transfer: adding GNPTAM information
(Rohan) (paathshaala) (bus se) (jaata hai)

Indicative output: Rohan paathshaala bus se jaata hai
Natural Hindi order: Rohan bus se paathshaala jaata hai
Questions in context of indicative translations [4]
• Can systems be compared with each other using human reference translations?
• Is it wise to track the progress of a system by comparing its output with human translations, in the case of indicative translations?
• Is "failure of MT" (defined using any measure) simply "failure in relation to inappropriate goals" (translating like a human)?
Linguistic Divergences
• Categorical divergence
• Noun-noun compounds
• Cultural differences
• Pleonastic divergence
• Stylistic differences
• WSD errors
Categorical divergence
• Change in lexical category
(In these examples, E is the English source, H the human reference translation, and I the indicative MT output.)

E: I am feeling hungry
H: मुझे भूख लग रह है
mujhe bhookh lag rahi hai (to-me hunger feeling is)
I: म भूखा महसूस कर रहा हू ँ
main bhookha mehsoos kar raha hoon (I hungry feel doing am)
Unigram precision: 0, Bigram precision: 0, BLEU score: 0

Another example:
E: I am sweating
H: मुझे पसीना आ रहा है
Mujhe paseena aa raha hai (To-me sweat coming is)
Noun-noun compounds

E: The ten best Aamir Khan performances
H: आ मर ख़ान क दस सव तम पफॉम सस
Aamir khaan ki dus sarvottam performances (Aamir khan-of ten best performances)
I: दस सव तम आ मर ख़ान पफॉम सस
Dus sarvottam aamir khaan performances (Ten best Aamir Khan performances)

Unigram precision: 5/5, Bigram precision: 2/4
Cultural differences

E: Food, clothing and shelter are a man's basic needs
H: रोट , कपड़ा और मकान एक मनु य क बु नयाद ज़ रत ह
roti, kapda aur makaan ek manushya ki buniyaadi zarooratein hain (bread, clothing and house a man of basic needs are)
I: खाना, कपड़ा और आ य एक मनु य क बु नयाद ज़ रत ह
khaana, kapdaa aur aashray ek manushya ki buniyaadi zarooratein hain (food, clothing and shelter a man of basic needs are)

Unigram precision: 8/10, Bigram precision: 6/9
• Another example: 'maamaa'/'taau' (kinship terms without a single English equivalent)
Pleonastic divergence
• Words with no semantic content in the target language

E: It is raining
H: बा रश हो रह है
baarish ho rahi hai (rain happening is)
I: यह बा रश हो रह है
yeh baarish ho rahi hai (it rain happening is)
Unigram precision: 4/5, Bigram precision: 3/4

Another example:
E: One should not trouble the weak.
H: दुबल को परेशान नह करना चा हए
durbalon ko pareshaan nahi karna chahiye (to-weak trouble not do should)
Should not trouble the weak
Stylistic differences
• Typical styles in a language

E: The Lok Sabha has 545 members
H: लोक सभा म 545 सद य ह
lok sabha mein 545 sadasya hain (Lok Sabha in 545 members are)
I: लोक सभा के पास 545 सद य ह
lok sabha ke paas 545 sadasya hain (Lok Sabha has/near 545 members are)

Unigram precision: 5/7, Bigram precision: 3/6
WSD errors
• Synonyms v/s incorrect senses

E: I purchased a bat
H: मैने एक ब ला खर दा (reference)
maine ek ballaa kharidaa (I a cricket-bat bought) - I bought a cricket bat
I: मैने एक चमगादड़ खर दा
maine ek chamgaadaD kharidaa (I a bat (mammal) bought) - I bought a bat (mammal)

E: The thieves were held
H: चोर को गर तार कया
choron ko giraftaar kiyaa (thieves arrest done) - The thieves were arrested
I1: चोर को पकड़ा
choron ko pakdaa (thieves caught) - The thieves were caught
I2: चोर को आयोिजत कया
choron ko aayojit kiya (thieves organized done) - The thieves were organized
Evaluation: Precision
• Acceptable translations are rejected by BLEU
• Unacceptable translations are accepted by BLEU
Future of BLEU
• Failed correlation with human judgment [3]
• Suggested changes to the BLEU score:
  – Re-defining a match [4]: allowing synonyms, allowing root forms of words, incorporating specific language divergences
  – Limiting the use of BLEU [3][4]: do not use BLEU to compare radically different systems; evaluate on the basis of the nature of indicative translation
• Do not be overly reliant on BLEU. Use it only as an 'evaluation understudy'
Conclusion
• Permutation of phrases possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU’s aim of correlating with human
judgment does not go well with goals of
indicative translation [4]
Part III: Overview of MT Evaluation Metrics
Presented as a part of CS 712 by:
Aditya Joshi, Kashyap Popat, Shubham Gautam
21st March, 2013
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay
Lord Ganesha and Lord Karthikeya set out on a race to go around the world.
Source: www.jaishreeganesha.com
Lord Karthikeya:
“I won!
I went around the earth, the world once!”
Lord Ganesha:
“I won!
I went around my parents. They are my world!”
Who among them performed better?
Who won the race?
Outline
• Manual Evaluation
• Automatic Evaluation: BLEU, TER, WER, ROUGE, METEOR, NIST, GTM
• Entailment-based MT evaluation
Preliminaries
• Candidate Translation(s): Translation returned by an MT system
– Also called hypothesis translation
• Reference Translation(s): ‘Perfect’ translation by humans
• Comparing metrics with BLEU on three axes:
  – Handling incorrect words
  – Handling incorrect word order
  – Handling recall
Manual evaluation [11]
Common techniques:
1. Assigning fluency and adequacy scores on a scale of five (absolute)
2. Ranking translated sentences relative to each other (relative)
3. Ranking translations of syntactic constituents drawn from the source sentence (relative)
Manual evaluation: Assigning adequacy and fluency
• Evaluators use their own perception to rate
• Adequacy and fluency scores often correlate: undesirable

Adequacy: Is the meaning translated correctly?
5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

Fluency: Is the sentence grammatically valid?
5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

म एक या यान बैठा हू ँ
Main ek vyaakhyan baitha hoon (I a lecture sit, present-first person)
I sit a lecture: adequate but not fluent

म या यान हू ँ
Main vyakhyan hoon (I lecture am)
I am lecture: fluent but not adequate
BLEU [2]
• Proposed by IBM. A popular metric
• Uses modified n-gram precision and a brevity penalty (for recall)
Translation edit rate [5] (TER)
• Introduced in GALE MT task
Central idea: Edits required to change a hypothesis translation into a reference translation
Example: Prime Minister of India will address the nation today

Reference translation: भारत के धान-मं ी आज रा को संबो धत करगे
Bhaarat ke pradhaan-mantri aaj raashtra ko sambodhit karenge

Candidate translation: धान-मं ी भारत के आज को अ ेस करगे आज
Pradhaan-mantri bhaarat ke aaj ko address karenge aaj

Edit types marked in the example: shift, deletion, substitution, insertion
Formula for TER

TER = # edits / average # reference words

• Cost of shift 'distance' is not incorporated
• Mis-capitalization is also considered an error
For the example above: TER = 4 / 8 (one shift, one deletion, one substitution, one insertion)
HTER: Human TER
• Semi-supervised technique
• Human annotators make new reference for each translation based on system output
– Must consider fluency and adequacy while
generating the closest target translation
TER v/s BLEU
• Handling incorrect words: TER counts a substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: TER captures this error via a shift (or delete + insert); BLEU sees an n-gram mismatch
• Handling recall: in TER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
(TER = # edits / average # reference words)
Word Error Rate (WER) [9]
• Based on the Levenshtein distance (Levenshtein, 1966)
• Minimum number of substitutions, deletions and insertions that have to be performed to convert the generated text (hyp) into the reference text (ref)
• A position-independent variant also exists: position-independent word error rate (PER)
WER: Example
• Order of words is important
• A dynamic programming-based implementation finds the 'minimum' number of errors

Reference translation: This looks like the correct sentence.
Candidate translation: This seems the right sentence.
# errors: 3 (looks -> seems, 'like' deleted, correct -> right)
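A minimal dynamic-programming sketch of the word-level Levenshtein distance; on the example above it finds 3 errors, so WER = 3/6 = 0.5:

```python
def word_errors(ref: str, hyp: str) -> int:
    """Minimum substitutions, deletions and insertions to turn hyp into ref."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between the first i ref words and the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(r)][len(h)]

errors = word_errors("this looks like the correct sentence",
                     "this seems the right sentence")
print(errors)  # 3  ->  WER = 3 / 6 = 0.5
```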
WER v/s BLEU
• Handling incorrect words: WER counts a substitution; BLEU sees an n-gram mismatch
• Handling incorrect word order: WER counts a delete + insert; BLEU sees an n-gram mismatch
• Handling recall: in WER, missed words become deleted words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
WER v/s TER
• Shift: WER does not take shifts into account; TER incorporates shifts, though they are not weighted by distance
• Intuition: WER is based on word-level (unigram) translation errors; TER is based on the edits a human would require
(TER = # edits / average # reference words)
ROUGE [6]
• Recall-Oriented Understudy for Gisting Evaluation
• ROUGE is a package of metrics: ROUGE-N,
ROUGE-L, ROUGE-W and ROUGE-S
ROUGE-N
ROUGE-N incorporates recall

Will BLEU be able to understand the quality of long sentences?

Reference translation: या लू लंबे वा य क गुणव ता को समझ पाएगा?
Kya bloo lambe waakya ki guNvatta ko samajh paaega?

Candidate translation: लंबे वा य
Lambe vaakya

ROUGE-2 (bigram recall): 1 / 8; modified bigram precision: 1 / 1 = 1
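A minimal ROUGE-N sketch (n-gram recall with respect to the reference), run on the transliterated tokens of this example (the spelling 'waakya' is normalized so the candidate can match):

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """ROUGE-N: n-gram recall of the candidate with respect to the reference."""
    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum(min(cnt, cand[gram]) for gram, cnt in ref.items())
    return overlap / sum(ref.values())

ref = "kya bloo lambe waakya ki guNvatta ko samajh paaega".split()
cand = "lambe waakya".split()
print(rouge_n(cand, ref, 1))  # 2/9 = 0.222...
print(rouge_n(cand, ref, 2))  # 1/8 = 0.125, the value on the slide
```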
Other ROUGEs
• ROUGE-L
– Considers longest common subsequence
• ROUGE-W
– Weighted ROUGE-L: All common subsequences are considered with weight based on length
• ROUGE-S
– Precision/Recall by matching skip bigrams
ROUGE (suite of metrics) v/s BLEU
• Handling incorrect words: ROUGE uses skip bigrams and ROUGE-N; BLEU sees an n-gram mismatch
• Handling incorrect word order: ROUGE-L uses the longest common subsequence; BLEU sees an n-gram mismatch
• Handling recall: ROUGE-N incorporates missing words; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
METEOR [7]
Aims to do better than BLEU
Central idea: Have a good unigram matching
strategy
METEOR: Criticisms of BLEU
• Brevity penalty is punitive
• Higher order n-grams may not indicate grammatical correctness of a sentence
• BLEU is often zero. Should a score be zero?
METEOR: Process

Phase I: Find an alignment between words
• List all possible mappings based on matches
Phase II: Select the best possible alignment
• Select the alignment with the least number of 'alignment crosses/overlaps'
Repeat with different matching strategies: incorporate stemmers, consider synonyms, etc.

Example:
Reference: The intelligent and excited boy jumped
Candidates: The excited and intelligent guy pounced / The intelligent and excited dude jumped
METEOR: The score
• Using the unigram mappings, precision and recall are calculated and combined via a harmonic mean
• Penalty: find 'as many chunks' as possible that match, e.g., aligning "The bright boy sits on the black bench" with "The intelligent guy sat on the dark bench"
• More accurate -> fewer chunks, lower penalty; less accurate -> more chunks, higher penalty
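For reference, the combination defined in [7], where P and R are unigram precision and recall over the chosen alignment, ch is the number of chunks, and m is the number of matched unigrams:

$$F_{mean} = \frac{10PR}{R + 9P}, \qquad Penalty = 0.5\left(\frac{ch}{m}\right)^{3}, \qquad Score = F_{mean}\,(1 - Penalty)$$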
METEOR v/s BLEU
• Handling incorrect words: METEOR uses alignment chunks, and matching can be done using different techniques (adaptable); BLEU sees an n-gram mismatch
• Handling incorrect word order: chunks may be ordered in any manner, and METEOR does not capture this; BLEU sees an n-gram mismatch
• Handling recall: METEOR's notion of alignment incorporates missing-word handling; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
NIST Metric: Introduction [8]
• MT evaluation metric proposed by the National Institute of Standards and Technology
• Are all n-gram matches the same?
• Weights more heavily those n-grams that are more informative (i.e., rarer ones)
• Matching 'Amendment Act' is better than matching 'of the'
• When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
NIST Metric
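The weighting in [8] is derived from n-gram counts over the reference data; roughly, the information assigned to a matched n-gram is

$$Info(w_1 \ldots w_n) = \log_2\left(\frac{count(w_1 \ldots w_{n-1})}{count(w_1 \ldots w_n)}\right)$$

so rare n-grams (those occurring much less often than their (n-1)-gram prefix) contribute more to the score than frequent function-word sequences.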
NIST v/s BLEU
• Handling incorrect words: NIST adds an information-gain-based level of 'matching'; BLEU sees an n-gram mismatch
• Handling incorrect word order: no additional provision; an n-gram mismatch (weighted by information gain) in NIST, a plain n-gram mismatch in BLEU
• Handling recall: NIST uses a brevity penalty factor based on Lsys and Lref; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
GTM [10]
• General Text Matcher
• F-score: uses precision and recall
• Does not rely on correlation with 'human judgment'
  – What does a BLEU score of 0.006 mean?
• Comparison across systems is easier
GTM Scores: Precision and Recall
• MMS: Maximum Match Size
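A simplified sketch of how MMS enters the score, assuming the basic precision/recall/F-measure combination described in [10] (longer contiguous runs of matching words increase the match size):

$$P = \frac{MMS(C, R)}{|C|}, \qquad Rec = \frac{MMS(C, R)}{|R|}, \qquad GTM = \frac{2 \cdot P \cdot Rec}{P + Rec}$$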
GTM v/s BLEU
• Handling incorrect words: GTM uses precision based on the maximum match size; BLEU sees an n-gram mismatch
• Handling incorrect word order: GTM considers maximum runs; BLEU sees an n-gram mismatch
• Handling recall: GTM uses recall based on the maximum match size; BLEU's precision cannot detect 'missing' words, hence the brevity penalty
Entailment-based evaluation [12]
Why entailment?

E: I am feeling hungry
H (Ref): मुझे भूख लग रह है
mujhe bhookh lag rahi hai
Candidate: म भूखा महसूस कर रहा हू ँ
main bhookha mehsoos kar raha hoon
BLEU score: 0
Clearly, the candidate is entailed by the reference translation.
Entailment-based evaluation [12]
Evaluation of MT output for adequacy is an entailment task:
a candidate translation (i.e., the MT system output) should entail the reference translation and vice versa.

The Stanford RTE system (pipeline):
1. Analysis: POS tagging, dependency parsing, NER, etc.
2. Alignment: alignment of words and phrases
3. Entailment: predicting whether entailment holds or not
RTE for MT evaluation
The same pipeline (analysis, alignment, entailment prediction) is applied, but in both directions.

Directionality
• The standard entailment recognition task is asymmetric
• For MT evaluation, entailment must hold in both directions for the system translation to be fully adequate
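A tiny sketch of this bidirectional check; entails() below is a hypothetical stand-in for a full RTE pipeline (such as the one outlined above), not an API provided by [12]:

```python
# Illustrative only: 'entails' stands in for an RTE system's decision function.
def adequate(candidate: str, reference: str, entails) -> bool:
    """A translation is judged fully adequate only if entailment holds both ways."""
    return entails(candidate, reference) and entails(reference, candidate)
```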
Experimental Setup
• Single reference translation
• Two MT dataset families with English as the target language

NIST MT datasets:
• NIST06: The NIST MT-06 Arabic-to-English dataset
• NIST08A: The NIST MT-08 Arabic-to-English dataset
• NIST08C: The NIST MT-08 Chinese-to-English dataset
• NIST08U: The NIST MT-08 Urdu-to-English dataset

ACL SMT datasets:
• SMT06E: The NAACL 2006 SMT workshop EUROPARL dataset
• SMT06C: The NAACL 2006 SMT workshop Czech-English dataset
• SMT07E: The ACL 2007 SMT workshop EUROPARL dataset
• SMT07C: The ACL 2007 SMT workshop Czech-English dataset
Experimental Setup
Four systems are compared:
• BLEU-4
• MT
• RTE
• RTE+MT

Experiment 1: individual corpora
Experiment 2: combined corpora
Conclusion: Formulation of BLEU (1/3)
• Introduced different evaluation methods
• Formulated BLEU score
• Analyzed the BLEU score by:
– Considering constituents of formula
– Calculating BLEU for a dummy example
• Compared BLEU with human judgment
Conclusion: Shortcomings of BLEU (2/3)
• Permutation of phrases possible [3]
• Different items may be drawn from reference sets [3]
• Linguistic divergences in English-Hindi MT [4]
• BLEU’s aim of correlating with human
judgment does not go well with goals of
indicative translation [4]
Conclusion (3/3)
• GTM: modifies precision and recall based on maximum match size
• NIST: information gain-based; weights content and function words differently
• METEOR: considers alignments; allows addition of different types of matching
• ROUGE: suite of metrics incorporating n-gram recall, skip bigrams, etc.
• WER: errors in words based on word order; also PER
• TER: based on the edits required to obtain the reference translation; also Human TER (HTER)
• Manual evaluation: basic techniques and bottlenecks
• Entailment-based evaluation: the reference translation must entail the candidate translation and vice versa
References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys, "Evaluation: an assessment", Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation", IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] Chris Callison-Burch, Miles Osborne, Philipp Koehn, "Re-evaluating the role of Bleu in Machine Translation Research", European ACL (EACL) 2006.
[4] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah, "Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU", ICON 2007, Hyderabad, India, Jan 2007.
[5] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul, "A study of translation edit rate with targeted human annotation", Proceedings of the Association for Machine Translation in the Americas, 2006.
[6] Chin-Yew Lin, "ROUGE: a package for automatic evaluation of summaries", 2004.
[7] Satanjeev Banerjee and Alon Lavie, "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments", Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005.
[8] George Doddington, "Automatic evaluation of machine translation quality using n-gram co-occurrence statistics", Proceedings of the Second International Conference on Human Language Technology Research, HLT 2002.
[9] Maja Popović and Hermann Ney, "Word error rates: decomposition over POS classes and applications for error analysis", Proceedings of the Second Workshop on Statistical Machine Translation, StatMT 2007.
[10] Joseph Turian, Luke Shen and I. Dan Melamed, "Evaluation of Machine Translation and its Evaluation", Proceedings of MT Summit IX, pages 386–393, 2003.
[11] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder, "(Meta-) Evaluation of Machine Translation", ACL Workshop on Statistical Machine Translation, 2007.
[12] Michel Galley, Dan Jurafsky, Chris Manning, "Evaluating MT output with entailment technology", 2008.