April 7, 2006
Natural Language Processing/Language Technology for the Web
Cross-Language Information Retrieval (CLIR)
Ananthakrishnan R
Computer Science & Engg., IIT Bombay
(anand@cse)
Cross Language Information Retrieval (CLIR)
“A subfield of information retrieval dealing with retrieving information written in a language different from the
language of the user's query.”
E.g., Using Hindi queries to retrieve English documents
Also called multi-lingual, cross-lingual, or trans-lingual IR.
Why CLIR?
E.g., On the web, we have:
Documents in different languages
Multilingual documents
Images with captions in different languages
A single query should retrieve all such resources.
Approaches to CLIR
Query Translation (most efficient; commonly used)
  Knowledge-based: Dictionary/Thesaurus-based
  Corpus-based: Pseudo-Relevance Feedback (PRF)
Document Translation (infeasible for large collections)
  Knowledge-based: MT (rule-based)
  Corpus-based: MT (EBMT/StatMT)
Intermediate Representation
  Knowledge-based: UNL (AgroExplorer)
  Corpus-based: Latent Semantic Indexing
Most effective approaches are hybrid – a combination of knowledge- and corpus-based methods.
Dictionary-based Query Translation
आयरलैंड शांति वार्ता → (Hindi-English dictionaries) → Ireland peace talks → collection search
• phrase identification
• words to be transliterated
The problem with dictionary-based CLIR: ambiguity
अंतरिक्षीय घटना
  अंतरिक्षीय → cosmic, outer-space
  घटना → incident, event, occurrence / lessen, subside, decrease, lower, diminish, ebb, decline, reduce
जाली धन
  जाली → lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve
  धन → money, riches, wealth, appositive, property
आयरलैंड शांति वार्ता
  आयरलैंड → Ireland
  शांति → peace, calm, tranquility, silence, quietude
  वार्ता → conversation, talk, negotiation, tale
… filtering/disambiguation is required after query translation.
Disambiguation using co-occurrence statistics
Hypothesis: correct translations of query terms will co-occur, and incorrect translations will tend not to co-occur.
Problem with counting co-occurrences: data sparsity.
freq(Marathi Shallow Parsing CRFs),
freq(Marathi Shallow Structuring CRFs),
freq(Marathi Shallow Analyzing CRFs)
… are all zero. How do we choose between parsing, structuring, and analyzing?
Pair-wise co-occurrence
अंतरिक्षीय घटना
  अंतरिक्षीय → cosmic, outer-space
  घटना → incident, event, occurrence / lessen, subside, decrease, lower, diminish, ebb, decline, reduce
freq(cosmic incident)      → 70800
freq(cosmic event)         → 269000
freq(cosmic lessen)        → 7130
freq(cosmic subside)       → 3120
freq(outer-space incident) → 26100
freq(outer-space event)    → 104000
freq(outer-space lessen)   → 2600
freq(outer-space subside)  → 980
Shallow Parsing, Structuring or Analyzing?
shallow parsing     → 166000
shallow structuring → 180000
shallow analyzing   → 1230000
CRFs parsing        → 540
CRFs structuring    → 125
CRFs analyzing      → 765
Marathi parsing     → 17100
Marathi structuring → 511
Marathi analyzing   → 12200
Quoted phrases (collocation?):
"shallow parsing"     → 40700
"shallow structuring" → 11
"shallow analyzing"   → 2
But the single-word counts alone are misleading:
analyzing   → 74100000
parsing     → 40400000
structuring → 17400000
shallow     → 33300000
Ranking senses using co-occurrence statistics
Use co-occurrence scores to calculate the similarity between two words, sim(x, y). Options include:
• Point-wise mutual information (PMI)
• Dice coefficient
• PMI-IR:
  PMI-IR(x, y) = log( hits(x AND y) / (hits(x) × hits(y)) )
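As a minimal sketch, these similarity scores can be computed directly from hit counts; in practice the counts would come from a search engine, but here they are plain function arguments:

```python
import math

def pmi_ir(hits_xy, hits_x, hits_y):
    # PMI-IR(x, y) = log( hits(x AND y) / (hits(x) * hits(y)) )
    if hits_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log(hits_xy / (hits_x * hits_y))

def dice(hits_xy, hits_x, hits_y):
    # Dice coefficient: 2 * hits(x AND y) / (hits(x) + hits(y))
    return 2 * hits_xy / (hits_x + hits_y)
```

With equal single-word counts, a larger co-occurrence count yields a larger score under both measures, which is what the disambiguation step relies on.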
Disambiguation algorithm
User's query: q^s = {q_1^s, q_2^s, ..., q_m^s}
For each q_i^s, let S_i = {w_i,j^t} be its set of candidate translations.

1. For each candidate w_i,j^t and each other term's translation set S_i' (i' ≠ i):
   sim(w_i,j^t, S_i') = Σ over w_i',l^t ∈ S_i' of sim(w_i,j^t, w_i',l^t)

2. score(w_i,j^t) = Σ over i' ≠ i of sim(w_i,j^t, S_i')

3. q_i^t = argmax over w_i,j^t of score(w_i,j^t)

Translated query: q^t = {q_1^t, q_2^t, ..., q_m^t}
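A sketch of the disambiguation algorithm above; sim can be any pairwise similarity (e.g. PMI-IR), and the toy similarity in the test is purely illustrative:

```python
def disambiguate(translation_sets, sim):
    """Pick one translation per source term.

    translation_sets: one list of candidate translations per query term.
    sim(x, y): similarity score between two candidate words.
    """
    translated = []
    for i, candidates in enumerate(translation_sets):
        def score(w):
            # sum similarity of w against every candidate in every OTHER set
            return sum(sim(w, v)
                       for l, other in enumerate(translation_sets) if l != i
                       for v in other)
        translated.append(max(candidates, key=score))
    return translated
```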
Example
अंतरिक्षीय घटना
  अंतरिक्षीय → cosmic, outer-space
  घटना → incident, event / lessen, subside, decrease, lower, diminish, ebb, decline, reduce
score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside) + …
Disambiguation algorithm: sample outputs
आयरलैंड शांति वार्ता → Ireland peace talks
अंतरिक्षीय घटना → cosmic events
जाली धन → net money (?)
Results on TREC8 (disks 4 and 5)
English topics (401-450) manually translated to Hindi
Assumption: relevance judgments for English topics hold for the translated queries
Results (all TF-IDF):
Technique                  MAP
Monolingual                23
All translations           16
PMI-based disambiguation   20.5
Manual filtering           21.5
Pseudo-Relevance Feedback for CLIR
(User) Relevance Feedback (mono-lingual)
1. Retrieve documents using the user's query
2. The user marks relevant documents
3. Choose the top N terms from these documents (IDF is one option for scoring terms)
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents
Pseudo-Relevance Feedback (PRF) (mono-lingual)
1. Retrieve documents using the user's query
2. Assume that the top M documents retrieved are relevant
3. Choose the top N terms from these M documents
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents
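The mono-lingual PRF loop (steps 1-4 above) can be sketched as follows; the `retrieve` function is a stand-in for any ranked retrieval system, and terms are scored here by raw frequency rather than the IDF-based option the slides mention:

```python
from collections import Counter

def prf_expand(query_terms, retrieve, top_m=10, top_n=5):
    """Pseudo-relevance feedback query expansion.

    retrieve(terms) -> ranked list of documents, each a list of tokens.
    """
    top_docs = retrieve(query_terms)[:top_m]           # step 2: assume relevant
    counts = Counter(t for doc in top_docs for t in doc)
    for t in query_terms:                              # don't re-add query terms
        counts.pop(t, None)
    expansion = [t for t, _ in counts.most_common(top_n)]  # step 3
    return list(query_terms) + expansion               # step 4
```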
PRF for CLIR
Corpus-based Query Translation
Uses a parallel corpus of documents:
H1 ↔ E1
H2 ↔ E2
. . .
Hm ↔ Em
(Hindi collection H, English collection E)
PRF for CLIR
1. Retrieve documents in H using the user's query
2. Assume that the top M documents retrieved are relevant
3. Select the M documents in E that are aligned to the top M retrieved documents
4. Choose the top N terms from these documents
5. These N terms are the translated query
6. Use this query to retrieve from the target collection (which is in the same language as E)
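A sketch of PRF-based query translation over a parallel corpus, assuming aligned documents are looked up by a shared document id (the alignment representation is an assumption for illustration):

```python
from collections import Counter

def clir_prf_translate(query_terms, retrieve_source, aligned, top_m=10, top_n=5):
    """Translate a query via PRF over a parallel corpus.

    retrieve_source(terms) -> ranked list of source-side (H) doc ids.
    aligned: maps each H doc id to its parallel E document (token list).
    Returns the top-N target-language terms as the translated query.
    """
    top_ids = retrieve_source(query_terms)[:top_m]           # steps 1-2
    target_docs = [aligned[d] for d in top_ids]              # step 3
    counts = Counter(t for doc in target_docs for t in doc)  # step 4
    return [t for t, _ in counts.most_common(top_n)]         # step 5
```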
Cross-Lingual Relevance Models
- Estimate relevance models using a parallel corpus
Ranking with Relevance Models
Relevance model or query model Θ_R (a distribution that encodes the information need):
  P(w|R): probability of word occurrence in a relevant document
  P(w|D): probability of word occurrence in the candidate document
Ranking function (relative entropy, or KL divergence; lower is better):
  KL(R || D) = Σ_w P(w|R) log( P(w|R) / P(w|D) )
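The KL-divergence ranking function is a one-liner over word distributions; a minimal sketch, assuming the document model is already smoothed so it is nonzero wherever the relevance model is:

```python
import math

def kl_divergence(relevance_model, doc_model):
    """KL(R || D) = sum over w of P(w|R) * log(P(w|R) / P(w|D)).

    Both arguments are word -> probability dicts.
    Documents are ranked by *increasing* divergence.
    """
    return sum(p_r * math.log(p_r / doc_model[w])
               for w, p_r in relevance_model.items() if p_r > 0)
```

A document whose model matches the relevance model exactly gets divergence 0 and ranks first.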
Estimating Mono-Lingual Relevance Models
For a Hindi query h_1 h_2 ... h_m:

  P(w|Θ_R) ≈ P(w|Q) = P(w | h_1 h_2 ... h_m) = P(w, h_1 h_2 ... h_m) / P(h_1 h_2 ... h_m)

The joint probability is estimated by summing over the document models M in the collection:

  P(w, h_1 h_2 ... h_m) = Σ_M P(M) P(w|M) Π_{i=1..m} P(h_i|M)
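The mono-lingual estimate above can be sketched directly; document models are dicts, the prior P(M) is taken as uniform, and the tiny floor for unseen query terms is a stand-in for proper smoothing:

```python
def relevance_model(query_terms, doc_models):
    """P(w|Q) proportional to sum over M of P(M) P(w|M) prod_i P(h_i|M).

    doc_models: list of dicts, each a unigram model P(.|M).
    Returns a normalized distribution over the words in the models.
    """
    n = len(doc_models)
    joint = {}
    for M in doc_models:
        p_m = 1.0 / n                    # uniform prior P(M)
        query_lik = 1.0
        for h in query_terms:
            query_lik *= M.get(h, 1e-9)  # P(h_i|M); floor for unseen terms
        for w, p_w in M.items():
            joint[w] = joint.get(w, 0.0) + p_m * p_w * query_lik
    z = sum(joint.values())
    return {w: p / z for w, p in joint.items()}
```

Documents whose models explain the query dominate the sum, so their words dominate the estimated relevance model.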
Estimating Cross-Lingual Relevance Models
With a parallel corpus, sum over aligned document pairs (M_H, M_E): the query likelihood is computed on the Hindi side and the word probability on the English side:

  P(w, h_1 h_2 ... h_m) = Σ_{(M_H, M_E)} P(M_H, M_E) P(w|M_E) Π_{i=1..m} P(h_i|M_H)

Each document model is smoothed with the collection model:

  P(w|X) = λ ( freq(w, X) / Σ_v freq(v, X) ) + (1 − λ) P(w)
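The smoothing formula above (linear interpolation with the collection model) is a few lines; `collection_model` here is any precomputed background distribution P(w):

```python
def smoothed_prob(w, freq_X, collection_model, lam=0.7):
    """P(w|X) = lam * freq(w,X)/sum_v freq(v,X) + (1 - lam) * P(w)."""
    total = sum(freq_X.values())
    ml = freq_X.get(w, 0) / total if total else 0.0  # maximum-likelihood estimate
    return lam * ml + (1 - lam) * collection_model.get(w, 0.0)
```

Smoothing keeps P(w|X) nonzero for words absent from X, which the KL-divergence ranking requires.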
CLIR Evaluation – TREC (Text REtrieval Conference)
TREC CLIR track (2001 and 2002)
Retrieval of Arabic language newswire documents from topics in English
383,872 Arabic documents (896 MB) with SGML markup
50 topics
Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability
http://trec.nist.gov/
CLIR Evaluation – CLEF
(Cross Language Evaluation Forum)
Major CLIR evaluation forum
Tracks include
Multilingual retrieval on news collections
Topics provided in many languages, including Hindi
Multiple language Question Answering
ImageCLEF
Cross Language Speech Retrieval
WebCLEF
http://www.clef-campaign.org/
Summary
CLIR techniques
Query Translation-based
Document Translation-based
Intermediate Representation-based
Query translation using dictionaries, followed by
disambiguation, is a simple and effective technique for CLIR
PRF over a parallel corpus can be used for query translation
Parallel corpora can also be used to estimate cross-lingual relevance models
CLEF and TREC: important CLIR evaluation
conferences
References (1)
1. Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1995.
2. Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1998.
3. A Maximum Coherence Model for Dictionary-Based Cross-Language Information Retrieval, Yi Liu, Rong Jin, and Joyce Y. Chai, ACM SIGIR, 2005.
4. A Comparative Study of Knowledge-Based Approaches for Cross-Language Information Retrieval, Douglas W. Oard, Bonnie J. Dorr, Paul G. Hackett, and Maria Katsova, Technical Report CS-TR-3897, University of Maryland, 1998.
References (2)
5. Translingual Information Retrieval: A Comparative Evaluation, Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee, International Joint Conference on Artificial Intelligence, 1997.
6. A Multistage Search Strategy for Cross Lingual Information Retrieval, Satish Kagathara, Manish Deodalkar, and Pushpak Bhattacharyya, Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February 2005.
7. Relevance-Based Language Models, Victor Lavrenko and W. Bruce Croft, Research and Development in Information Retrieval, 2001.