• No results found

Information Retrieval

N/A
N/A
Protected

Academic year: 2022

Share "Information Retrieval"

Copied!
26
0
0

Loading.... (view fulltext now)

Full text

(1)

Information Retrieval

Yogesh Kakde

29th January, 2011

(2)

Outline

1 What is Information Retrieval

2 Basic Components in an Web-IR system

3 Theoretical Models Of IR Boolean Model

Vector Model Probabilistic Model

(3)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR

What is Information Retrieval

Information Retrieval (IR) means searching for relevant documents and information within the contents of a specific data set such as the World Wide Web.

ss

(4)

Basic Components in an Web-IR system

ss

Crawling: The system browses the document collection and fetches documents.

Indexing: The system builds an index of the documents fetched during crawling.

Ranking: The system retrieves documents that are relevant to the query from the index and displays to the user.

Relevance Feedback: The initial results returned from a given query may be used to refine the query itself.

(5)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR

Terminologies Used

Inverted Index: An index that maps back from terms to the documents where they occur.

D1 :The GDP increased 2percent this quarter.

D2 :The Spring economic slowdown continued to spring downwards this quarter.

ss

(6)

Terminologies Used

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

Example: The GDP increased 2 percent this quarter.

Tokens: The,GDP,increased,2,percent,this,quarter

A type is the class of all tokens containing the same character sequence.

Example: To be or not to be.

Types: to,be,or,not

A term is a type that is included in the IR system’s dictionary.

Example: The GDP increased 2 percent this quarter.

Terms: GDP, increased,2,percent,quarter

Stop words: some extremely common words which are of little value in select documents matching a user need.

Example: The GDP increased 2 percent this quarter.

Stop words: The,this

(7)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR

Terminologies Used

Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers can be catagorised as

Rule-based stemmers

Co-occurrence based stemmers

Lemma: the canonical form, dictionary form, or citation form of a set of words.

Lemmatization is the algorithmic process of determining the lemma for a given word with the use of a vocabulary and morphological analysis.

(8)

Logical View Of Document

ss

(9)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR

Measures Of Accuracy

ss

(10)

A Formal Characterization of IR Models

An information retrieval model is a quadruple {D,Q,F,R(qi,dj)}

where

D is a set composed of logical views for the documents in the collection.

Q is a set composed of logical views for the user information needs.

F is a framework for modeling document representations, queries, and their relationships.

R(qi,dj) is a ranking function which associates a real number with a queryqi belong to Q and a document representation dj belong to D. Such ranking defines an ordering among the documents with regard to the query qi.

(11)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR Boolean Model

Boolean Model

The model is based on Boolean Logic and classical Set theory Documents and the query are conceived as sets of terms.

Retrieval is based on whether or not the documents contain the query terms

The queries are specified as Boolean expressions which have precise semantics.

Advantages:

Formalism and Simplicity Very precise

Disadvantages

No notion of partial match

Boolean expressions have precise semantics(not simple from users point-of-view)

Not Ranked

The model does not use term weights and frequency Given a large set of documents, the Boolean model either retrieves too many documents or very few documents.

(12)

Vector Model

The use of binary weights is too limiting so vector model proposes non-binary weights to index terms in queries and documents.

Document vector d¯j = {w1j,w2j, ...,wtj} Query vector q¯i = {w1i,w2i, ...,wyi} Wheret: total number of index terms in the collection of documents

Term Frequency(tf): The term-frequency is simply the count of the termi in document j. The term-frequency of a term is normalized by the maximum term frequency of any term in the document to bring the range of the term-frequency 0 to 1.

tfi,j = freqi,j

maxl(freql,j)

(13)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR Vector Model

Inverse Document Frequency: This factor is motivated from the fact that if a term appears in all documents in a set, then it loses its distinguishing power. The inverse document frequency (idf) of a termi is given by:

idfi =logN ni

where

ni is the number of documents in which the termi occurs N is the total number of documents

The term weights can be approximated by the product of term frequency(tf) and inverse document frequency(idf). This is called thetf −idf measure.

(14)

Similarity measure between two vectors

Cosine Similarity The cosine similarity between two vectors ¯dj (the document vector) and ¯qi (query vector) is given by:

similarity( ¯dj,q¯i) =cosθ= d¯j.q¯i

|d¯j|,|q¯i|

ss

(15)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR Vector Model

Advantage

The cosine similarity measure returns value in the range 0 to 1. Hence, partial matching is possible.

Ranking of the retrieved results according to the cosine similarity score is possible.

Disadvantage

Index terms are considered to be mutually independent. Thus, this model does not capture the semantics of the query or the document.

It cannot denote the “clear logic view” like Boolean model

(16)

Prbabilistic Model

The document is retrieved according to the probability of the document being relevant to the query. Mathematically, the scoring function is given by

P(R= 1|d,q)

The document is termed relevant if its probability of being relevant is greater than its probability of being non relevant

P(R = 1|d¯,q)¯ >P(R= 0|d¯,q)¯

(17)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR Probabilistic Model

Binary Independence Model

Assumptions

Each index term is Boolean in nature (1 if present, 0 if absent).

There is no association between index terms.

The relevance of each document is calculated independent of the relevance of other documents.

Documents and queries are both represented as binary term incidence vectors.

dj =x1j,x2j, ...,xMj

[xi is 1 if xi is present in document. Else, this is zero.]

qj =q1,q2, ...,qM

(18)

Odds of relevance is used as ranking function as

it is monotonic with respect to probability of relevance it reduces the computation

odds of relevance= P(dj relevant to q) P(dj non−relevant to q) Ranking Functionrf()

rf() = P(R|d¯j) P( ¯R|d¯j) where

P(R|d¯j) is the probability of documentdj relevant toq.

P( ¯R|d¯j) is the probability of documentdj not relevant toq.

(19)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR Probabilistic Model

By using Baye’s Rule we get

rf() = P( ¯dj|R).P(R) P( ¯dj|R).P¯ ( ¯R)

SinceP(R) andP( ¯R) are same for each document eliminating it from the above equation wont change the ranking of the

documents.

rf() = P( ¯dj|R) P( ¯dj|R)¯

= P(x1|R).P(x2|x1,R)...P(xM|x1...xM−1,R) P(x1|R).P¯ (x2|x1,R)...P¯ (xM|x1...xM−1,R)¯

=

M

Y

t=1

(P(xt|x1...xt−1,R P(xt|x1...xt−1,R¯)

(1)

(20)

Applying Naive-Bayes assumption which states that all index term xt are independent we get,

rf() =

M

Y

t=1

P(xt|R) P(xt|R)¯

Since eachxt is either 0 or 1 under Binary Independence Model, we can separate the terms to give

rf() =

M

Y

t:xt=1

P(xt|R) P(xt|R)¯ .

M

Y

t:xt=0

P(xt|R) P(xt|R)¯

Now let,pt =P(xt|R) br the probability of a term xt appearing in a relevant document andut=P(xt|R) be the probability that a¯ termxt appears in a non-relevant document.

rf() =

M

Y

t:xt=1

pt ut.

M

Y

t:xt=0

1−pt 1−ut

(21)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR Probabilistic Model

Multiply the above equation by

M

Y

t:xt=1

1−pt

1−ut and its reciprocal

M

Y

t:xt=1

1−ut

1−pt

Now the term

M

Y

t:xt=1∨0

1−pt 1−ut

as this term is independent of index termsxt. Hence it can be ignored and we get,

rf() =

M

Y

t:xt=1

pt.(1−ut) ut.(1−pt)

(22)

Taking the logarithm of equation we get

rf() =

M

X

t:xt=1

logpt.(1−ut) ut.(1−pt)

=

M

X

t:xt=1

log pt

(1−pt)+log(1−ut) ut

=

M

X

t:xt=1

ct

(2)

ct is the log odds ratio for the terms in query.

ct =log odds of term present in a relevant document odds of term being present in a non−relevant document

(23)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR Probabilistic Model

Equation (2) gives the formal scoring function of probabilistic information retrieval model. Nowct needs to be approximated.

Advantages

Documents are ranked in decreasing order of their probability if being relevant

Disadvantages

The need to guess the initial seperation of documents into relevant and non-relevant sets.

All wights are binary

Index terms are assumed to be independent

(24)

Estimation of ct

When there are no retrieved documents we can assume pt is constant for all index terms xi.

The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection.

Thus,

pt = 0.5 ut = ni

N

where,ni is the number of documents which contain the index termxi and N is the total number of documents in the collection.

Putting these value we get

ct =log 0.5

1−0.5+ 1− nNt

nt

N

=logN−nt nt

(3)

(25)

What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR Probabilistic Model

This gives the initial ranking. Improved ranking can be obtained by following procedure We assume

We can approximate pt by the distribution of the index term xi among the documents initially retrieved.

We can approximate ut by considering that the non-retrieved documents are not relevant.

We can write,

pt =Vi

V ut =ni−Vi

N−V

(4)

where,

V is subset of the documents initially retrieved and ranked by the probabilistic model.

Vi be the subset ofV containing index termxi.

(26)

For small values ofVi and V it creates problem, so we add an adjustment factor.

pt =Vi +nNi V + 1 ut =ni−Vi+nNi

N−V + 1

(5)

References

Related documents

The assumptions, assessments, statements and information provided in this RFP/ Bidding Documents (as defined hereinafter) may not be complete, accurate, adequate or

The assumptions, assessments, statements and information provided in this RFP/ Bidding Documents (as defined hereinafter) may not be complete, accurate, adequate or

S.O.494. Where the party adopts higher market value than that given in the Basic Register the value given by the party shall be taken for levy of stamp duty and fees... If a

These gains in crop production are unprecedented which is why 5 million small farmers in India in 2008 elected to plant 7.6 million hectares of Bt cotton which

The Vice Chairman and Managing Director, Telangana State Mineral Development Corporation, Khairtabad invites bids for Excavation of Sand 7,34,655 CBM from Block

Having carefully examined all the tender documents consisting of Invitation of Tender, Instructions to Bidders, General Conditions of Contract, Special Conditions of the

We have used passages from the standard Malayalam online newspaper articles and also rephrased them. Now only documents in the .txt format is considered. N-grams for both

I/we hereby declare that I/we have not received any other documents or information showing a different price, value, quantity, or description of the said goods and that if at any