(1)

April 7, 2006

Natural Language Processing/Language Technology for the Web

Cross-Language Information Retrieval (CLIR)

Ananthakrishnan R

Computer Science & Engg., IIT Bombay

(anand@cse)

(2)

Cross Language Information Retrieval (CLIR)

“A subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query.”

E.g., using Hindi queries to retrieve English documents.

Also called multi-lingual, cross-lingual, or trans-lingual IR.

(3)

Why CLIR?

E.g., on the web, we have:

• Documents in different languages
• Multilingual documents
• Images with captions in different languages

A single query should retrieve all such resources.

(4)

Approaches to CLIR

| Approach | Knowledge-based | Corpus-based |
|---|---|---|
| Query Translation | Dictionary/Thesaurus-based | Pseudo-Relevance Feedback (PRF) |
| Document Translation | MT (rule-based) | MT (EBMT/StatMT) |
| Intermediate Representation | UNL (AgroExplorer) | Latent Semantic Indexing |

• Query translation: most efficient; commonly used
• Document translation: infeasible for large collections

Most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.

(5)

Dictionary-based Query Translation

आयरलैंड शांति वार्ता → Hindi-English dictionaries → Ireland peace talks → Collection search

• phrase identification
• words to be transliterated

(6)

The problem with dictionary-based CLIR: ambiguity

अंतरिक्षीय घटना
• अंतरिक्षीय: cosmic, outer-space
• घटना: incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce

जाली धन
• जाली: lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve
• धन: money, riches, wealth, appositive, property

आयरलैंड शांति वार्ता
• आयरलैंड: Ireland
• शांति: peace, calm, tranquility, silence, quietude
• वार्ता: conversation, talk, negotiation, tale

(7)

… filtering/disambiguation is required after query translation.

(8)

Disambiguation using co-occurrence statistics

Hypothesis: correct translations of query terms will co-occur, and incorrect translations will tend not to co-occur.

(9)

Problem with counting co-occurrences: data sparsity

freq(Marathi Shallow Parsing CRFs)
freq(Marathi Shallow Structuring CRFs)
freq(Marathi Shallow Analyzing CRFs)

… are all zero.

How do we choose between parsing, structuring, and analyzing?

(10)

Pair-wise co-occurrence

अंतरिक्षीय घटना
• अंतरिक्षीय: cosmic, outer-space
• घटना: incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce

freq(cosmic incident) → 70800
freq(cosmic event) → 269000
freq(cosmic lessen) → 7130
freq(cosmic subside) → 3120

freq(outer-space incident) → 26100
freq(outer-space event) → 104000
freq(outer-space lessen) → 2600
freq(outer-space subside) → 980

(11)

Shallow Parsing, Structuring or Analyzing?

shallow parsing → 166000
shallow structuring → 180000
shallow analyzing → 1230000

CRFs parsing → 540
CRFs structuring → 125
CRFs analyzing → 765

Marathi parsing → 17100
Marathi structuring → 511
Marathi analyzing → 12200

“shallow parsing” → 40700
“shallow structuring” → 11
“shallow analyzing” → 2

collocation?

But,

analyzing → 74100000
parsing → 40400000
structuring → 17400000
shallow → 33300000

(12)

Ranking senses using co-occurrence statistics

• Use co-occurrence scores to calculate similarity between two words: sim(x, y)
  - Point-wise mutual information (PMI)
  - Dice coefficient
  - PMI-IR

$$\text{PMI-IR}(x, y) = \log \frac{hits(x \text{ AND } y)}{hits(x) \times hits(y)}$$
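As a rough illustration of the PMI-IR score, the sketch below plugs in the hit counts quoted on the "Shallow Parsing, Structuring or Analyzing?" slide. Two things are assumptions rather than part of the slides: the quoted-phrase counts are used as hits(x AND y), and the logarithm is taken to be natural. The names `HITS`, `PAIR_HITS`, `pmi_ir` and `rank_candidates` are hypothetical.

```python
import math

# Single-word hit counts quoted on the earlier slide.
HITS = {
    "shallow": 33_300_000,
    "parsing": 40_400_000,
    "structuring": 17_400_000,
    "analyzing": 74_100_000,
}

# Assumption: the quoted-phrase counts stand in for hits(x AND y).
PAIR_HITS = {
    ("shallow", "parsing"): 40_700,
    ("shallow", "structuring"): 11,
    ("shallow", "analyzing"): 2,
}


def pmi_ir(x, y, hits=HITS, pair_hits=PAIR_HITS):
    """Return log(hits(x AND y) / (hits(x) * hits(y))), natural log assumed."""
    joint = pair_hits.get((x, y)) or pair_hits.get((y, x)) or 0
    if joint == 0:
        return float("-inf")  # the pair never co-occurs in the table
    return math.log(joint / (hits[x] * hits[y]))


def rank_candidates(context_word, candidates):
    """Sort candidate translations by PMI-IR with a context word, best first."""
    return sorted(candidates, key=lambda c: pmi_ir(context_word, c), reverse=True)


ranking = rank_candidates("shallow", ["analyzing", "structuring", "parsing"])
```

Dividing by hits(x) × hits(y) is what stops a very frequent word such as "analyzing" from winning on raw counts alone; with these phrase counts, "parsing" ranks first.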

(13)

Disambiguation algorithm

User's query: $q^s = \{q^s_1, q^s_2, \ldots, q^s_m\}$

For each $q^s_i$, the set of translations: $S_i = \{w^t_{i,j}\}$

(14)

1. $sim(w^t_{i,j}, S_{i'}) = \sum_{w^t_{i',l} \in S_{i'}} sim(w^t_{i,j}, w^t_{i',l})$

2. $score(w^t_{i,j}) = \sum_{i' \neq i} sim(w^t_{i,j}, S_{i'})$

3. $q^t_i = \arg\max_{w^t_{i,j}} score(w^t_{i,j})$

Translated query: $q^t = \{q^t_1, q^t_2, \ldots, q^t_m\}$
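The three steps of the disambiguation algorithm can be sketched as follows. This is a minimal illustration, not the author's implementation: the function `disambiguate`, the table `TOY_SIM` and its values are invented for the example, and any pairwise measure (such as PMI-IR) could be passed in as `sim`.

```python
def disambiguate(translation_sets, sim):
    """Pick one translation per source term by summed similarity.

    translation_sets: list of candidate lists, one list S_i per query term.
    sim: function (word, word) -> float, for example PMI-IR.
    """
    translated_query = []
    for i, candidates in enumerate(translation_sets):
        def score(word):
            # Steps 1 and 2: sum sim over every candidate of every other term.
            return sum(
                sim(word, other)
                for i_prime, others in enumerate(translation_sets)
                if i_prime != i
                for other in others
            )
        # Step 3: keep the highest-scoring candidate for term i.
        translated_query.append(max(candidates, key=score))
    return translated_query


# Hypothetical similarity values, invented for illustration only.
TOY_SIM = {
    frozenset(["cosmic", "event"]): 3.0,
    frozenset(["cosmic", "incident"]): 2.0,
    frozenset(["outer-space", "event"]): 1.5,
    frozenset(["outer-space", "incident"]): 1.0,
    frozenset(["cosmic", "lessen"]): 0.2,
    frozenset(["outer-space", "lessen"]): 0.1,
}


def toy_sim(x, y):
    """Symmetric lookup; unseen pairs get similarity 0."""
    return TOY_SIM.get(frozenset([x, y]), 0.0)


result = disambiguate(
    [["cosmic", "outer-space"], ["incident", "event", "lessen"]], toy_sim
)
```

Each term is scored only against the candidates of the other terms, so a translation is rewarded for fitting the rest of the query and never for resembling its own alternatives.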

(15)

Example

अंतरिक्षीय घटना
• अंतरिक्षीय: cosmic, outer-space
• घटना: incident, event, lessen, subside, decrease, lower, diminish, ebb, decline, reduce

score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside) + …

(16)

Disambiguation algorithm: sample outputs

आयरलैंड शांति वार्ता → Ireland peace talks
अंतरिक्षीय घटना → cosmic events
जाली धन → net money (?)

(17)

Results on TREC8 (disks 4 and 5)

• English topics (401-450) manually translated to Hindi
• Assumption: relevance judgments for English topics hold for the translated queries
• Results (all TF-IDF):

| Technique | MAP |
|---|---|
| Monolingual | 23 |
| All-translations | 16 |
| PMI-based disambiguation | 20.5 |
| Manual filtering | 21.5 |

(18)

Pseudo-Relevance Feedback for CLIR

(19)

(User) Relevance Feedback (mono-lingual)

1. Retrieve documents using the user’s query
2. The user marks relevant documents
3. Choose the top N terms from these documents
   - Top terms: IDF is one option for scoring
4. Add these N terms to the user’s query to form a new query
5. Use this new query to retrieve a new set of documents

(20)

Pseudo-Relevance Feedback (PRF) (mono-lingual)

1. Retrieve documents using the user’s query
2. Assume that the top M documents retrieved are relevant
3. Choose the top N terms from these M documents
4. Add these N terms to the user’s query to form a new query
5. Use this new query to retrieve a new set of documents
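The five PRF steps can be sketched on a toy collection as shown below. Everything here is an assumption made for illustration: the documents, the simple TF-IDF overlap used for retrieval, the TF×IDF weighting used to pick expansion terms, and the names `idf_table`, `retrieve` and `expand_query`.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency for every term in a tokenised collection."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def retrieve(query, docs, idf):
    """Rank document indices by a simple TF-IDF overlap score, best first."""
    def score(doc):
        tf = Counter(doc)
        return sum(tf[t] * idf.get(t, 0.0) for t in query)
    return sorted(range(len(docs)), key=lambda i: score(docs[i]), reverse=True)


def expand_query(query, docs, idf, top_m=2, top_n=2):
    """Steps 1-4: treat the top M documents as relevant, add their top N terms."""
    feedback_docs = retrieve(query, docs, idf)[:top_m]
    weights = Counter()
    for i in feedback_docs:
        for term, tf in Counter(docs[i]).items():
            if term not in query:
                weights[term] += tf * idf[term]
    # Sort by weight, breaking ties alphabetically so the result is stable.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query) + [term for term, _ in ranked[:top_n]]


# Toy collection, invented for illustration.
docs = [
    ["ireland", "peace", "talks", "belfast"],
    ["peace", "talks", "belfast", "agreement"],
    ["cricket", "match", "score"],
    ["stock", "market", "score"],
    ["weather", "report", "rain"],
]
idf = idf_table(docs)
expanded = expand_query(["peace", "talks"], docs, idf)
reranked = retrieve(expanded, docs, idf)  # step 5
```

No user judgments are involved: the only difference from user relevance feedback is that the top M documents are assumed to be relevant.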

(21)

PRF for CLIR

Corpus-based Query Translation

• Uses a parallel corpus of documents:

Hindi collection H ↔ English collection E
H1 ↔ E1
H2 ↔ E2
. . .
Hm ↔ Em

(22)

PRF for CLIR

1. Retrieve documents in H using the user’s query
2. Assume that the top M documents retrieved are relevant
3. Select the M documents in E that are aligned to the top M retrieved documents
4. Choose the top N terms from these documents
5. These N terms are the translated query
6. Use this query to retrieve from the target collection (which is in the same language as E)
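A minimal sketch of steps 1 to 5, assuming a document-aligned toy parallel corpus in which `hindi_docs[i]` is aligned with `english_docs[i]`. The romanised Hindi tokens, the TF-IDF scoring, and the names `tfidf_rank` and `translate_query_prf` are all hypothetical choices for this illustration.

```python
import math
from collections import Counter


def tfidf_rank(query, docs):
    """Return document indices ranked by a simple TF-IDF overlap score."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))

    def score(doc):
        tf = Counter(doc)
        return sum(tf[t] * math.log(n / df[t]) for t in query if t in df)

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)


def translate_query_prf(query_h, hindi_docs, english_docs, top_m=2, top_n=3):
    """Retrieve in H, follow the alignments into E, keep the top N terms."""
    top_ids = tfidf_rank(query_h, hindi_docs)[:top_m]  # steps 1-2
    aligned = [english_docs[i] for i in top_ids]       # step 3: H_i <-> E_i
    n = len(english_docs)
    df = Counter(t for d in english_docs for t in set(d))
    weights = Counter()
    for doc in aligned:                                # step 4
        for term, tf in Counter(doc).items():
            weights[term] += tf * math.log(n / df[term])
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]        # step 5


# Toy parallel corpus; Hindi is romanised here purely for readability.
hindi_docs = [
    ["ayarlaind", "shanti", "varta"],
    ["shanti", "varta", "samjhauta"],
    ["kriket", "match"],
    ["bazaar", "share"],
    ["mausam", "barish"],
]
english_docs = [
    ["ireland", "peace", "talks"],
    ["peace", "talks", "agreement"],
    ["cricket", "match"],
    ["stock", "market"],
    ["weather", "rain"],
]
translated = translate_query_prf(["shanti", "varta"], hindi_docs, english_docs)
```

For step 6, the returned terms would be issued as an ordinary monolingual query against the target collection, for example with `tfidf_rank(translated, target_docs)` where `target_docs` is whatever English collection is being searched.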

(23)

Cross-Lingual Relevance Models

- Estimate relevance models using a parallel corpus

(24)

Ranking with Relevance Models

• Relevance model or query model $\Theta_R$ (the distribution encodes the information need)
• $P(w \mid \Theta_R)$: probability of word occurrence in a relevant document
• $P(w \mid D)$: probability of word occurrence in the candidate document
• Ranking function (relative entropy or KL divergence):

$$KL(\Theta_R \,\|\, D) = \sum_{w} P(w \mid \Theta_R) \log \frac{P(w \mid \Theta_R)}{P(w \mid D)}$$
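To make the ranking function concrete, the sketch below ranks two toy documents by their KL divergence from a relevance model. The probabilities are invented, the function names are hypothetical, and the document models are assumed to be smoothed so that P(w | D) is never zero.

```python
import math


def kl_divergence(relevance_model, doc_model):
    """KL(R || D) = sum_w P(w|R) * log(P(w|R) / P(w|D))."""
    total = 0.0
    for word, p_r in relevance_model.items():
        if p_r > 0.0:
            # Assumes doc_model is smoothed, so P(w|D) > 0 for every such word.
            total += p_r * math.log(p_r / doc_model[word])
    return total


def rank_by_kl(relevance_model, doc_models):
    """Order document ids by increasing divergence from the relevance model."""
    return sorted(
        doc_models, key=lambda d: kl_divergence(relevance_model, doc_models[d])
    )


# Invented distributions over a three-word vocabulary.
relevance = {"peace": 0.5, "talks": 0.4, "cricket": 0.1}
doc_models = {
    "d1": {"peace": 0.45, "talks": 0.45, "cricket": 0.10},
    "d2": {"peace": 0.10, "talks": 0.10, "cricket": 0.80},
}
ranking = rank_by_kl(relevance, doc_models)
```

A smaller divergence means the document's word distribution is closer to the relevance model, so documents are ranked in increasing order of KL.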

(25)

Estimating Mono-Lingual Relevance Models

$$P(w \mid \Theta_R) \approx P(w \mid Q) = P(w \mid h_1 h_2 \ldots h_m) = \frac{P(w, h_1 h_2 \ldots h_m)}{P(h_1 h_2 \ldots h_m)}$$

$$P(w, h_1 h_2 \ldots h_m) = \sum_{M \in \mathcal{M}} P(M) \left( P(w \mid M) \prod_{i=1}^{m} P(h_i \mid M) \right)$$

(26)

Estimating Cross-Lingual Relevance Models

$$P(w, h_1 h_2 \ldots h_m) = \sum_{\{M_H, M_E\} \in \mathcal{M}} P(\{M_H, M_E\}) \left( P(w \mid M_E) \prod_{i=1}^{m} P(h_i \mid M_H) \right)$$

$$P(w \mid M_X) = \lambda \left( \frac{freq_{w,X}}{\sum_{v} freq_{v,X}} \right) + (1 - \lambda) P(w)$$
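The cross-lingual estimate can be sketched on a toy aligned corpus as below. Several choices are assumptions made only for this illustration: a uniform prior over document pairs, λ = 0.5, background probabilities P(w) taken from the same toy corpus, romanised Hindi tokens, and all function names.

```python
from collections import Counter


def background_model(docs):
    """Collection-wide unigram probabilities P(w)."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def smoothed(word, doc, background, lam=0.5):
    """P(w | M_X) = lam * freq(w, X) / sum_v freq(v, X) + (1 - lam) * P(w)."""
    tf = Counter(doc)
    return lam * tf[word] / len(doc) + (1 - lam) * background.get(word, 0.0)


def cross_lingual_relevance_model(query_h, pairs, lam=0.5):
    """Estimate P(w | Theta_R) over English words from aligned (M_H, M_E) pairs."""
    bg_h = background_model([h for h, _ in pairs])
    bg_e = background_model([e for _, e in pairs])
    prior = 1.0 / len(pairs)  # assumption: uniform P({M_H, M_E})
    joint = {}
    for w in bg_e:
        total = 0.0
        for m_h, m_e in pairs:
            query_likelihood = 1.0
            for h in query_h:
                query_likelihood *= smoothed(h, m_h, bg_h, lam)
            total += prior * smoothed(w, m_e, bg_e, lam) * query_likelihood
        joint[w] = total
    norm = sum(joint.values())  # equals P(h_1 ... h_m)
    return {w: p / norm for w, p in joint.items()}


# Toy aligned pairs; Hindi is romanised here purely for readability.
pairs = [
    (["shanti", "varta", "ayarlaind"], ["peace", "talks", "ireland"]),
    (["shanti", "varta", "samjhauta"], ["peace", "talks", "agreement"]),
    (["kriket", "match"], ["cricket", "match"]),
]
model = cross_lingual_relevance_model(["shanti", "varta"], pairs)
```

The Hindi query words are only ever scored against the Hindi side of each pair, while the English word w is scored against the English side; the alignment is what carries the information need across languages.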

(27)

CLIR Evaluation – TREC (Text REtrieval Conference)

• TREC CLIR track (2001 and 2002)
• Retrieval of Arabic-language newswire documents from topics in English
• 383,872 Arabic documents (896 MB) with SGML markup
• 50 topics
• Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability

http://trec.nist.gov/

(28)

CLIR Evaluation – CLEF (Cross Language Evaluation Forum)

• Major CLIR evaluation forum
• Tracks include
  - Multilingual retrieval on news collections (topics will be provided in many languages including Hindi)
  - Multiple language Question Answering
  - ImageCLEF
  - Cross Language Speech Retrieval
  - WebCLEF

http://www.clef-campaign.org/

(29)

Summary

• CLIR techniques
  - Query Translation-based
  - Document Translation-based
  - Intermediate Representation-based
• Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR
• PRF uses a parallel corpus for query translation
• Parallel corpora can also be used to estimate cross-lingual relevance models
• CLEF and TREC: important CLIR evaluation conferences

(32)

Thank You
