• No results found

CS344: Introduction to Artificial Intelligence

N/A
N/A
Protected

Academic year: 2022

Share "CS344: Introduction to Artificial Intelligence"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

CS344: Introduction to Artificial Intelligence

Pushpak Bhattacharyya CSE Dept.,

IIT B b IIT Bombay

Lecture 32-33: Information Retrieval:

B i t d M d l

Basic concepts and Model

(2)

The elusive user satisfaction

The elusive user satisfaction

Ranking Ranking

Correctness Correctness

of

Query Processing Coverage

I d i

NER

Stemming MWE Crawling Indexing

MWE

(3)

What happens in IR What happens in IR

// S h B I

Index Table q1 q2 … qn // Search Box, qi are query

terms I1

I2 Documents

.. D1

Documents

. . Ik D2

. Ranked

. List . . Dm

List

Note: High ranked relevant document Note: High ranked relevant document

= user information need getting satisfied !

(4)

Search Box

User Index Table /

Documents Relevance/Feedbac

k

(5)

How to check quality of retrieval (P, How to check quality of retrieval (P, R, F)

„ Three parameters

„ Precision P = |A ^ O|/|O|

Actual(A) Obtained(O)

A ^ O

„ Recall R = |A ^ O| / |A|

„ F-score = 2PR/(P+R)

„ Harmonic mean

All the above formula are very general. We haven’t considered that the

documents retrieved are ranked and thus the above expressions need to be modified.

(6)

P, R, F (contd.)

„

Precision is easy to calculate, Recall is not.

not.

„

Given a known set of pair of <q, D>

Relevance judgement <q D>

„

Relevance judgement <q,D>

(Human evaluation)

(7)

Relation between P & R

P i i l l t d P is inversely related to R (unless additional knowledge is given) Precision

P

Recall R

(8)

Precision at rank k

„

Choose the top k documents, see how many of them are relevant out of them.

Documents

„

P

k

= (# of relevant documents)/k

D1

D2

Documents

„

Mean Average Precision (MAP)

=

. ..

=

DD

k

. .. Dm

(9)

Sample Exercise

„

D

1

: Delhi is the capital of India. It is a large city.

large city.

„

D

2

: Mumbai, however is the

commercial capital with million dollars commercial capital with million dollars inflow & outflow.

D : There is rivalry for supremacy

The words in red constitute the useful words from each sentence.

„

D

3

: There is rivalry for supremacy between the two cities.

The words in red constitute the useful words from each sentence.

The other words (those in black) are very common and thus do not add to the information content of the sentence.

Vocabulary: unique red words, 11 in number; each doc will be represented by a 11-tuple vector: each component 1 or 0

(10)

IR Basics IR Basics

(mainly from R. Baeza-Yates and B. Ribeiro-Neto. Modern Information RetrievalAddison-Wesley, Wokingham, UK, 1999.

Christopher D. Manning, Prabhakar Raghavan and Hinrich p g, and g Schütze, Introduction to Information Retrieval, Cambridge

University Press. 2008. )

(11)

Definition of IR Model

An IR model is a quadrupul [D, Q, F, R(q

i

, d

j

)]

[ , Q, , (q

i

,

j

)]

Where,

D: documents D: documents Q : Queries

F : Framework for modeling document query F : Framework for modeling document, query and their relationships

R(.,.) : Ranking function returning a real no.

R(.,.) : Ranking function returning a real no.

expressing the relevance of d

j

with q

i

(12)

Index Terms

„

Keywords representing a document

„

Semantics of the word helps remember

„

Semantics of the word helps remember the main theme of the document

Generally nouns

„

Generally nouns

„

Assign numerical weights to index

d h

terms to indicate their importance

(13)

Introduction

Docs Index TermsIndex Terms

doc

Information Need Ranking

match Information Need

query

(14)

Classic IR Models - Basic Concepts

• The

importance

of the index terms is represented by weights associated to them

• Let

t

be the number of index terms in the system –

K= {k

1

, k

2

, k

3

,... k

t

}

set of all index terms

k

i be an index term

d

be a document –

d

j be a document

w

ij is a weight associated with

(k

i

,d

j

)

w w

ijij

= 0 0

indicates that term does not belong to docindicates that term does not belong to doc –

vec(d

j

) = (w

1j

, w

2j

, …, w

tj

)

is a weighted vector

associated with the document

dj

g

i

(vec(dj)) = w

ij is a function which returns the weight associated with pair

(k

i

,d

j

)

(15)

The Boolean Model

• Simple model based on set theory

• Only y

AND, OR ,

and

NOT

are used

• Queries specified as boolean expressions – precise semantics

– neat formalism

q = k

a

(k

b ∨ ¬

k

c

)

T ith t b t Th

{0 1}

• Terms are either present or absent. Thus,

w

ij ε

{0,1}

• Consider

q = k

(k

k )

q = k

a

(k

b ∨ ¬

k

c

)

vec(q

dnf

) = (1,1,1)

(1,1,0)

(1,0,0)

vec(q vec(q

cccc

) = (1,1,0) ) (1,1,0)

is a conjunctive componentis a conjunctive component

(16)

The Boolean Model

k (k k )

(1 1 0)

Ka Kb

• q = k

a

∧ (k

b

∨ ¬ k

c

)

(1,1,1) (1,0,0) (1,1,0)

• sim(q,d

j

) = 1 if ∃ vec(q

cc

) |

Kc

j

(vec(q

cc

) ε vec(q

dnf

)) ∧

(∀k

i

, g

i

(vec(d

j

)) = g

i

(vec(q

cc

))) 0 otherwise

0 otherwise

(17)

Drawbacks of the Boolean Model

• Retrieval based on binary decision criteria with no notion of partial matching

• No ranking of the documents is provided (absence of a grading scale)

Information need has to be translated into a Boolean

• Information need has to be translated into a Boolean expression which most users find awkward

• The Boolean queries formulated by the users are most often q y too simplistic

• As a consequence, the Boolean model frequently returns

ith t f t d t i t

either too few or too many documents in response to a user query

(18)

The Vector Model

• Use of binary weights is too limiting

Non binary weights provide consideration for

• Non-binary weights provide consideration for partial matches

• These term weights are used to compute a

degree of similarity between a query and each document

• Ranked set of documents provides for better

matching

(19)

The Vector Model

• Define:

• Define:

– w

ij

> 0 whenever k

i

∈ d

j

w >= 0 associated with the pair (k q) – w

iq

>= 0 associated with the pair (k

i

,q) – vec(d

j

) = (w

1j

, w

2j

, ..., w

tj

)

vec(q) = (w w w )

vec(q) = (w

1q

, w

2q

, ..., w

tq

)

• In this space queries and documents are

• In this space, queries and documents are

represented as weighted vectors

(20)

The Vector Model

j

dj

i q

Θ

Sim(q,dj) = cos(Θ)

= [vec(dj) • vec(q)] / |dj| * |q|= [Σ wij * wiq] / |dj| * |q|

Si 0 d 0 0 i ( d ) 1

i

• Since wij > 0 and wiq > 0, 0 <= sim(q,dj) <=1

• A document is retrieved even if it matches the query terms only partially

(21)

The Vector Model

• Sim(q,dj) = [Σ w

ij

* w

iq

] / |d

j

| * |q|

• How to compute the weights w

ij

and w

i

?

• How to compute the weights w

ij

and w

iq

?

• A good weight must take into account two effects:

effects:

– quantification of intra-document contents (similarity)

(similarity)

tf

factor, the

term frequency

within a document

– quantification of inter-documents separation (dissi- – quantification of inter-documents separation (dissi-

milarity)

idf d

factor, the acto , t e

inverse document frequency e se docu e t eque cy

– wij = tf(i,j) * idf(i)

(22)

The Vector Model

• Let,

N

be the total number of docs in the collection –

n

i be the number of docs which contain

k

i

freq(i,j)

raw frequency of

k

i within

d

j

• A normalized

tf

factor is given by

f(i j) = freq(i j) / max (freq(l j))

f(i,j) = freq(i,j) / max

l

(freq(l,j))

– where the maximum is computed over all terms which occur within the document

djj

• The

idf

factor is computed as –

idf(i) = log (N/n

i

)

– the

log

is used to make the values of

tf

and

idf

comparable. It can also be interpreted as the

amount of information

associated with the term

k

i

information

associated with the term

k

i

.

(23)

The Vector Model

• The best term-weighting schemes use weights which are give by

w = f(i j) * log(N/n )

w

ij

= f(i,j) * log(N/n

i

)

– the strategy is called a

tf-idf

weighting scheme

• For the query term weights, a suggestion isFor the query term weights, a suggestion is

w

iq

= (0.5 + [0.5 * freq(i,q) / max

l

(freq(l,q)]) * log(N/ni)

• The vector model with

tf-idf

weights is a good ranking strategy with general collections

• The vector model is usually as good as the known ranking alternatives It is also simple and fast to compute

alternatives. It is also simple and fast to compute.

(24)

The Vector Model

• Advantages:

– term-weighting improves quality of the answer set g g – partial matching allows retrieval of docs that

approximate the query conditions

– cosine ranking formula sorts documents according to degree of similarity to the query

• Disadvantages:

– assumes independence of index terms; not clear p ;

that this is bad though

(25)

The Vector Model:

Example I

d7

k1 k2

d1 d2

d4 d5 d3

d6 d7

d1

k3

k1 k2 k3 q • dj

d1 1 0 1 2

d2 1 0 0 1

d3 0 1 1 2

d4 1 0 0 1

d5 1 1 1 3

d6 1 1 0 2

d7 0 1 0 1

q 1 1 1

(26)

The Vector Model:

Example II

d7

k1 k2

d1 d2

d4 d5 d3

d6 d7

d1

k3

k1 k2 k3 q • dj

d1 1 0 1 4

d2 1 0 0 1

d2 1 0 0 1

d3 0 1 1 5

d4 1 0 0 1

d5 1 1 1 6

d5 1 1 1 6

d6 1 1 0 3

d7 0 1 0 2

q 1 2 3

(27)

The Vector Model:

Example III

d7

k1 k2

d1 d2

d4 d5 d3

d6 d7

d1

k3

k1 k2 k3 q • dj

d1 2 0 1 5

d2 1 0 0 1

d2 1 0 0 1

d3 0 1 3 11

d4 2 0 0 2

d5 1 2 4 17

d5 1 2 4 17

d6 1 2 0 5

d7 0 5 0 10

q 1 2 3

References

Related documents

 The agent has complete information of the domain (perception is perfect), actions are instantaneous and their effects are deterministic..  The agent knows the world completely,

• Execution of a plan: achieved through a data structure called Triangular Table....

Report on the Internal Financial Controls under Clause (i) of Sub-section 3 of Section 143 of the C ompanies Act, 2013 (“the Act”) We have audited the internal fi nancial controls

While Class AB operation still uses two complementary transistors in its output stage a very small biasing voltage is applied to the Base of the transistor to bias it close to

✓ Microprocessor is a Programmable, Clock driven, Register based, Electronic device that reads instruction from a storage device, takes the data from input unit and process the

Advanced technologies, such as Artificial Intelligence (AI) systems, have beenpushing nowadays societies toward new ethical and legal challenges, including copyright law

The garments are presented in the fashion shows by fashion designers in the presence of the media, which plays the role of giving coverage to the styles exhibited, thus

28.1 The Contract shall be deemed to be a Contract made under, governed by and construed in accordance with the laws of India for the time being in force and shall be subject to