• No results found

IIT B b IIT Bombay

N/A
N/A
Protected

Academic year: 2022

Share "IIT B b IIT Bombay"

Copied!
36
0
0

Loading.... (view fulltext now)

Full text

(1)

CS344: Introduction to Artificial Intelligence

Pushpak Bhattacharyya CSE Dept.,

IIT B b IIT Bombay

Lecture 32: Information Retrieval: Basic t d M d l

concepts and Model

(2)

The elusive user satisfaction

The elusive user satisfaction

Ranking Ranking

Correctness Correctness

of

Query Processing Coverage

I d i

NER

Stemming MWE Crawling Indexing

MWE

(3)

Q I di T ib i L ti

Query: Indian Tribes in Latin America

America

(4)

Google

„

Indians of Latin America : an exhibition of materials in the Lilly ...

„

Lilly Library: Latin American mss. Brazil. A large map in colors, this locates the course of rivers, towns, mountain ranges, and Indian tribes . ...

www.indiana.edu/~liblilly/etexts/ila/ - 241k - Cached - Similar pages - Note this

„

Indigenous peoples of the Americas - Wikipedia, the free encyclopedia

American Indian creation legends tell of a variety of originations of that it had confirmed the presence

„

American Indian creation legends tell of a variety of originations of ... that it had confirmed the presence of 67 different uncontacted tribes in Brazil, ...

en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas - 178k - Cached - Similar pages - Note this

„

Cognition :: Giving Technologies New Meaning

„

The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru ... motor vehicles that are lemons · Indian tribes of Latin America ...

wikipedia cognition com/?num=10&from val Indian%20tribes%20of%20Latin%20Ame - 54k - wikipedia.cognition.com/?num=10&from_val...Indian%20tribes%20of%20Latin%20Ame... - 54k - Cached - Similar pages - Note this

„

Top 25 American Indian Tribes for the United

„

Top 25 American Indian Tribes for the UnitedStates: 1990 and 1980--Con. ... 16028 73.0 Canadian and Latin American ... 19375 248.3 Chickasaw. ...

www.census.gov/population/socdemo/race/indian/ailang1.txt - 6k - Cached - Similar pages - Note this Ten Largest American Indian Tribes 2000 Infoplease com

„

Ten Largest American Indian Tribes , 2000 — Infoplease.com

„

Latin American Indian , 180940. Choctaw, 158774. Sioux, 153360 ... American Indian and Alaska Native Population by Selected Tribes , Census 2000 ...

www.infoplease.com/ipa/A0767349.html - 29k - Cached - Similar pages - Note this

„

The Indian Tribes of North America by John R. Swanton at Questia ...

„

Read the complete book The Indian Tribes of North America by becoming a ... Sao Paulo recently elected its must cope with demands by Latin America for

its...must cope with demands by Latin America for ...

www.questia.com/library/book/the-indian-tribes-of-north-america-by-john-r-swanton.jsp - Similar pages

- Note this

(5)

Yahoo

„ different indian tribes of latin america,

„ More...

„ WEB RESULTS

„ South America Daily

„ Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...

„ www.wn.com/LatinAmerica -192k

„ www.wn.com/LatinAmerica 192k

„ Native American Indian Cultures - Mexico, South America

„ Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...

„ indian-cultures.com -Cached

„ Native American Indian Cultures - links

„ North American Tribes. rednation.org - RedNation of the Cherokee. Meso and Latin American Indians ... Human Rights in Latin America ...

i di lt /C lt /Li k ht l C h d

„ www.indian-cultures.com/Cultures/Links.html -Cached

„ Indigenous peoples of the Americas - Wikipedia, the free encyclopedia

„ ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...

„ en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas -179k-Cached

„ Native American Images - American Indian North America Tribe Map

„ American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ...|

„ www.nativeamericans.com/NativeAmericanImages6.htm -Cached

„ Resources for

„ Numbers of Native Americans or Indians in Latin America: 39,442,000 million ... Indian Tribes in Latin America - Latin American Indian Population - Up date ...

„ www.xmission.com/~amauta/population.htm -Cached

„ Indian tribe found in Brazil's Amazon - Boston.com

„ Latin America/Caribbean. Indian tribe found in Brazil's Amazon ... Uncontacted tribes are usually discovered when loggers and ranchers encroach on

encroach on ...

„ boston.com/news/world/latinamerica/articles/2007/06/01/.../ News

(6)

AltaVista

„ Latin America

Compare airfare prices from over 120 top websites and save up to 70%.

Flights.SideStep.com

Regional Telecom Statistics & Forecasts

Fixed, mobile, Internet, broadband telecom statistics and forecasts.

www.hottelecom.com

AltaVista found 4,520,000 results

„ South America Daily

Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...www.wn.com/LatinAmerica

More pages from wn.com

Native American Indian Cultures - Mexico, South America Native American Indian Cultures Mexico, South America

Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...

indian-cultures.com

More pages from indian-cultures.com

Indian tribes in Suriname cross borders - Boston.com

Days of rain near Suriname's southern border have deluged Amerindian farmland, ... Latin America/Caribbean. Indian tribes in Suriname cross borders ...

www.boston.com/news/world/latinamerica/articles/2006/05/12...in_suriname_cross_borders More pages from boston.com

Indigenous peoples of the Americas - Wikipedia, the free encyclopedia

... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...

en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas More pages from en.wikipedia.org

Native American Images - American Indian North America Tribe Map

American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ...www.nativeamericans.com/NativeAmericanImages6.htm

More pages from nativeamericans.com

(7)

MSN

„

Native American Images - American Indian North America Tribe Map

„

Native American Images American Indian North America Tribe Map Click here to view more images ... History Hotline | Iraqi War | Korean War | Latin Americans ...

„ www.nativeamericans.com/NativeAmericanImages6.htm

„ · Cached page

„

Resources for

152t ) P (126t ) P (67t ) S i (10t ) d V l (331t ) (t th d)

I di T ib

i L ti

„

... 152t.), Panama (126t.), Paraguay (67t.), Surinam (10t.), and Venezuela (331t.) (t.=thousand). - Indian Tribes in Latin

America

„ www.xmission.com/~amauta/tribes.htm

„ · Cached page

„

Latin America Community Assistance Foundation - LACA

„

The Tarahumara Indians are the most primitive of all Indian tribes in North America, and are the least touched by modern society.

www lacafoundation org/?page id 58

„ www.lacafoundation.org/?page_id=58

„ · Cached page

„

Latin America Tour Set for Curtis Photos of North America Tribes

„

28 September 2005. Latin America Tour Set for Curtis Photos of North America Tribes. Famed photographer recorded

Indian tribal life in 19th, early 20th century

„ www.america.gov/st/washfile-english/2005/September/20050928134700GLnesnoM0.2225763.html

„

Latin America // Current

C t TV L ti A

i

t di l L ti A

i

t i d f th A j l l d fli t

„

Current TV Latin America category, discover popular Latin America stories, news and ... of the Amazon jungle, a land conflict between rice farmers and a handful of Indian tribes ...

„ current.com/topics/75844112_latin_america

„ · Cached page

„

Bloomberg.com: Latin America

„

May 30 (Bloomberg) -- Brazil's National Indian Foundation has discovered an Indian tribe in the Amazon that hasn't had contact with civilization in a rare sighting of the few ...

bl b / / ? id 20601086& id S j5 fHW CQ& f l ti i

„ www.bloomberg.com/apps/news?pid=20601086&sid=aSrj5wfHW.CQ&refer=latin_america

(8)

Personalized focused search (wikipedia.cognition)

„ Indian Latin-America tribe: 249 files —

„ William Curtis Farabee

„ The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru based on his first trip in 1906-1908 (Obituary, 1925).

„ Direct link (no highlighting)

„ Mexican Texas

„ Settlers were empowered to create their own militias to help control hostile Indian tribes. Texas faced raids from both the Apache and

„ Settlers were empowered to create their own militias to help control hostile Indian tribes. Texas faced raids from both the Apache and Comanche tribes, [...]

„ Direct link (no highlighting)

„ Temecula, California

„ The Luiseño and Cahuilla tribes were involved, rather bloodily, in the local battles of the Mexican-American War during the following years.

„ Direct link (no highlighting)

„ Kaweah Indian Nation

„ Recently, scam artists have sold purported citizenships in the non-recognized tribe, particularly to Mexican nationals who have entered the US ill ll 1 [ ]

illegally.1 [...]

„ Direct link (no highlighting)

„ Flag of Puerto Rico

„ The tribal nation flag of the Jatibonicu Taino Indians of Borikén, represents the Jatibonicu Taino tribe's original pre-Columbian territories of [...]

„ Direct link (no highlighting)

„ Maina Indians

„ The Maina Indians are a group of tribes constituting a distinct linguistic stock, the [...] along the north bank of the Marañón River in South America

America

„ Direct link (no highlighting)

„ Erie (tribe)

„ ^ Ebooks by Google: "Handbook of American Indians North of Mexico" By Frederick Webb Hodge http://books.google.com/books?

„ Direct link (no highlighting)

„ Miccosukee

„ [1] Other members went on to form the Miccosukee Tribe of Indians of Florida, which was not recognized by Fidel Castro's Cuban government in 1959. The [...]

„ Direct link (no highlighting)

„ New Tribes Mission

„ In Paraguay in 1979 and 1986, New Tribes Mission was accused of assisting in the forcible contact of nomadic Ayoreo Indians.

„ Direct link (no highlighting)

(9)

Example: Semantically precise search for relations/events

Query: afghans destroying opium poppies

Q y fg y g p p pp

(10)

India Wide Cross Lingual

I f ti A (CLIA)

Information Access (CLIA) Endeavour

Endeavour

(11)

Motivation

„ English still the most dominant language on the web

™ Contributes 72% of the content

„ Number of non-English users g steadily rising all over the world

„ English penetration in India

„ English penetration in India

™ Estimated to be around 3-4%

™ Mostly the urban educated class

class

„ Need to enable access to

above information through

local languages

(12)

Cross Language Information Retrieval (CLIR)

Crawled and Target Language Index Crawled and

Indexed Web Pages Target Language Index

in English

ित पित याऽा

Hindi Query

CLIR Engine

T I f i

CLIR Engine

ित पित आने के िलए रेल

ित पित

याऽा Target Information

in English Language

Resources साधन

ित पित पु य नगर पहुँचने के िलए बहुत रेल उपल ध ह | अगर मुंबई से

याऽा कर रहे है तो मुंबई - चे नई ए सूेस गाड़ से ूवास कर सकते है

Ranked List of Results

ए सूस गाड़ स ूवास कर सकत ह

| Result Snippets

in Hindi

(13)

Challenges involved in CLIA Challenges involved in CLIA

„ Indexing, retrieval and ranking of multilingual documents

„ Web data is not clean and regular

™ Different font encodings – some of them proprietary Spelling variations very common

™ Spelling variations very common

™ Different document encodings

„ Language identification needed to invoke appropriate

l l

language analyzers

„ Involves a number of fundamental NLP research problems like query disambiguation, machine

p q y g

transliteration, named-entity recognition, multi-word

recognition

(14)

Cross Language Information Access (CLIA) Consortia Project

„ Indian Language CLIR Engine under development g g g p

™ Input – Six Indian Languages (Hindi, Bengali, Telugu, Tamil, Marathi and Punjabi)

™ Output – p Hindi, English and Input Language , g p g g of Query Q y

™ Domains – Tourism (Current Release)

„ Involves 10 academic institutes all over the country: IITs, Indian Statistical Institute CDAC Anna University

Indian Statistical Institute, CDAC, Anna University, Jadavpur University

™ IIT Bombay – Overall co-ordinator

Responsible for Hindi Marathi language verticals

™ Responsible for Hindi, Marathi language verticals

„ Includes full-fledged search features

™ Snippet translation

™ Summary generation

™ Information Extraction

(15)

Portal

„ Public portal released at

http://www clia iitb ac in/clia-beta-ext/ in http://www.clia.iitb.ac.in/clia-beta-ext/ in September 2009. (Outside IITB)

„ Public portal released at Public portal released at

http://www.clia.iitb.ac.in:8080/clia-beta-ext/ in

September 2009. (Inside IITB)

(16)
(17)

Recent Press Coverage

Recent Press Coverage

(18)

Hindustan Times

(19)

IR Basics IR Basics

(mainly from R. Baeza-Yates and B. Ribeiro-Neto. Modern Information RetrievalAddison-Wesley, Wokingham, UK, 1999.

Christopher D. Manning, Prabhakar Raghavan and Hinrich p g, and g Schütze, Introduction to Information Retrieval , Cambridge

University Press. 2008. )

(20)

Definition of IR Model

An IR model is a quadrupul [D, Q, F, R(q i , d j )]

[ , Q, , (q i , j )]

Where,

D: documents D: documents Q : Queries

F : Framework for modeling document query F : Framework for modeling document, query and their relationships

R(.,.) : Ranking function returning a real no.

R(.,.) : Ranking function returning a real no.

expressing the relevance of d j with q i

(21)

Index Terms

„ Keywords representing a document

„ Semantics of the word helps remember

„ Semantics of the word helps remember the main theme of the document

Generally nouns

„ Generally nouns

„ Assign numerical weights to index

d h

terms to indicate their importance

(22)

Introduction

Docs Index Terms Index Terms

doc

Information Need Ranking

match Information Need

query

(23)

Classic IR Models - Basic Concepts

• The importance of the index terms is represented by weights associated to them

• Let

– t be the number of index terms in the system – K= {k 1 , k 2 , k 3 ,... k t } set of all index terms

– k i be an index term d be a document – d j be a document

– w ij is a weight associated with (k i ,d j )

– w w ij ij = 0 0 indicates that term does not belong to doc indicates that term does not belong to doc – vec(d j ) = (w 1j , w 2j , …, w tj ) is a weighted vector

associated with the document dj

– g i (vec(dj)) = w ij is a function which returns the weight

associated with pair (k i ,d j )

(24)

The Boolean Model

• Simple model based on set theory

• Only y AND, OR , and NOT are used

• Queries specified as boolean expressions – precise semantics

– neat formalism

– q = k a ∧ (k b ∨ ¬ k c )

T ith t b t Th {0 1}

• Terms are either present or absent. Thus, w ij ε {0,1}

• Consider

– q = k ∧ (k ∨ k ) – q = k a ∧ (k b ∨ ¬ k c )

– vec(q dnf ) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

– vec(q vec(q cc cc ) = (1,1,0) ) (1,1,0) is a conjunctive component is a conjunctive component

(25)

The Boolean Model

k (k k ) (1 1 0)

K a K b

• q = k a ∧ (k b ∨ ¬ k c )

(1,1,1) (1,0,0) (1,1,0)

• sim(q,d j ) = 1 if ∃ vec(q cc ) |

K c

j (vec(q cc ) ε vec(q dnf )) ∧

(∀k i , g i (vec(d j )) = g i (vec(q cc ))) 0 otherwise

0 otherwise

(26)

Drawbacks of the Boolean Model

• Retrieval based on binary decision criteria with no notion of partial matching

• No ranking of the documents is provided (absence of a grading scale)

Information need has to be translated into a Boolean

• Information need has to be translated into a Boolean expression which most users find awkward

• The Boolean queries formulated by the users are most often q y too simplistic

• As a consequence, the Boolean model frequently returns

ith t f t d t i t

either too few or too many documents in response to a user

query

(27)

The Vector Model

• Use of binary weights is too limiting

Non binary weights provide consideration for

• Non-binary weights provide consideration for partial matches

• These term weights are used to compute a

degree of similarity between a query and each document

• Ranked set of documents provides for better

matching

(28)

The Vector Model

• Define:

• Define:

– w ij > 0 whenever k i ∈ d j

w >= 0 associated with the pair (k q) – w iq >= 0 associated with the pair (k i ,q) – vec(d j ) = (w 1j , w 2j , ..., w tj )

vec(q) = (w w w )

vec(q) = (w 1q , w 2q , ..., w tq )

• In this space queries and documents are

• In this space, queries and documents are

represented as weighted vectors

(29)

The Vector Model

j

dj

i q

Θ

• Sim(q,d j ) = cos(Θ)

= [vec(d j ) • vec(q)] / |d j | * |q|= [Σ w ij * w iq ] / |d j | * |q|

Si 0 d 0 0 i ( d ) 1

i

• Since w ij > 0 and w iq > 0, 0 <= sim(q,d j ) <=1

• A document is retrieved even if it matches the query terms only partially

(30)

The Vector Model

• Sim(q,dj) = [Σ w ij * w iq ] / |d j | * |q|

• How to compute the weights w ij and w i ?

• How to compute the weights w ij and w iq ?

• A good weight must take into account two effects:

effects:

– quantification of intra-document contents (similarity)

(similarity)

• tf factor, the term frequency within a document

– quantification of inter-documents separation (dissi- – quantification of inter-documents separation (dissi-

milarity)

• idf d factor, the acto , t e inverse document frequency e se docu e t eque cy

– wij = tf(i,j) * idf(i)

(31)

The Vector Model

• Let,

– N be the total number of docs in the collection – n i be the number of docs which contain k i

– freq(i,j) raw frequency of k i within d j

• A normalized tf factor is given by f(i j) = freq(i j) / max (freq(l j)) – f(i,j) = freq(i,j) / max l (freq(l,j))

– where the maximum is computed over all terms which occur within the document djj

• The idf factor is computed as – idf(i) = log (N/n i )

– the log is used to make the values of tf and idf

comparable. It can also be interpreted as the amount of information associated with the term k i

information associated with the term k i .

(32)

The Vector Model

• The best term-weighting schemes use weights which are give by

w = f(i j) * log(N/n ) – w ij = f(i,j) * log(N/n i )

– the strategy is called a tf-idf weighting scheme

• For the query term weights, a suggestion is For the query term weights, a suggestion is

– wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q)]) * log(N/ni)

• The vector model with tf-idf weights is a good ranking strategy with general collections

• The vector model is usually as good as the known ranking alternatives It is also simple and fast to compute

alternatives. It is also simple and fast to compute.

(33)

The Vector Model

• Advantages:

– term-weighting improves quality of the answer set g g – partial matching allows retrieval of docs that

approximate the query conditions

– cosine ranking formula sorts documents according to degree of similarity to the query

• Disadvantages:

– assumes independence of index terms (??); not p ( );

clear that this is bad though

(34)

The Vector Model:

Example I

d7

k1 k2

d1 d2

d4 d5 d3

d6 d7

d1

k3

k1 k2 k3 q • dj

d1 1 0 1 2

d2 1 0 0 1

d3 0 1 1 2

d4 1 0 0 1

d5 1 1 1 3

d6 1 1 0 2

d7 0 1 0 1

q 1 1 1

(35)

The Vector Model:

Example II

d7

k1 k2

d1 d2

d4 d5 d3

d6 d7

d1

k3

k1 k2 k3 q • dj

d1 1 0 1 4

d2 1 0 0 1

d2 1 0 0 1

d3 0 1 1 5

d4 1 0 0 1

d5 1 1 1 6

d5 1 1 1 6

d6 1 1 0 3

d7 0 1 0 2

q 1 2 3

(36)

The Vector Model:

Example III

d7

k1 k2

d1 d2

d4 d5 d3

d6 d7

d1

k3

k1 k2 k3 q • dj

d1 2 0 1 5

d2 1 0 0 1

d2 1 0 0 1

d3 0 1 3 11

d4 2 0 0 2

d5 1 2 4 17

d5 1 2 4 17

d6 1 2 0 5

d7 0 5 0 10

q 1 2 3

References

Related documents

● Instead of the entire routing table, a router only provides the state (cost and status) of its directly connected links3. ● Routers use flooding to disseminate

• Combine results of sub-problems to obtain solution of larger problem.?.

● allows data frames to contain arbitrary number of bits. ● allows character codes with arbitrary number of bits

– condition to continue execution of the iterative block (to check if count is within the prescribed limit)!. – Increment the count after

● students are able to productively abstract the relevant concepts from the analogy to solve a previously unseen technical problem. ● students who are successful in solving

 Long tail Phenomenon: Probability is very low but not zero over a large observed sequence. 

Dill and Mattis, as well as other researchers in psychology, social work, epidemiology, public health, and other fields, are building up a body of published evidence showing that

For both long-distance migrants, and residents and short-distant migrants, status has improved considerably since then, and in 2005, only 12.5% of North American residents