Information Retrieval
Yogesh Kakde
29th January, 2011
Outline
1 What is Information Retrieval
2 Basic Components in a Web-IR System
3 Theoretical Models of IR
  Boolean Model
  Vector Model
  Probabilistic Model
What is Information Retrieval
Information Retrieval (IR) means searching for documents and information relevant to an information need within a specific data set, such as the World Wide Web.
Basic Components in a Web-IR System
Crawling: The system browses the document collection and fetches documents.
Indexing: The system builds an index of the documents fetched during crawling.
Ranking: The system retrieves documents that are relevant to the query from the index and displays them to the user.
Relevance Feedback: The initial results returned from a given query may be used to refine the query itself.
Terminologies Used
Inverted Index: An index that maps back from terms to the documents where they occur.
D1: The GDP increased 2 percent this quarter.
D2: The Spring economic slowdown continued to spring downwards this quarter.
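Building the inverted index for D1 and D2 can be sketched in a few lines. The lowercase-and-split tokenizer and the `build_inverted_index` helper below are illustrative assumptions, not part of the original slides:

```python
# Building an inverted index for the two example documents above;
# the lowercase-and-split tokenizer is deliberately naive.
from collections import defaultdict

docs = {
    "D1": "The GDP increased 2 percent this quarter.",
    "D2": "The Spring economic slowdown continued to spring downwards this quarter.",
}

def build_inverted_index(docs):
    index = defaultdict(set)   # term -> set of documents containing it
    for doc_id, text in docs.items():
        for token in text.lower().rstrip(".").split():
            index[token].add(doc_id)
    return index

index = build_inverted_index(docs)
print(sorted(index["quarter"]))   # both documents contain "quarter"
print(sorted(index["spring"]))    # only D2
```

Looking up a term then returns exactly the documents in which it occurs, which is the "maps back from terms to documents" behaviour described above.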
Terminologies Used
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
Example: The GDP increased 2 percent this quarter.
Tokens: The, GDP, increased, 2, percent, this, quarter
A type is the class of all tokens containing the same character sequence.
Example: To be or not to be.
Types: to, be, or, not
A term is a type that is included in the IR system’s dictionary.
Example: The GDP increased 2 percent this quarter.
Terms: GDP, increased, 2, percent, quarter
Stop words: some extremely common words which are of little value in selecting documents matching a user need.
Example: The GDP increased 2 percent this quarter.
Stop words: The, this
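The token / type / term / stop-word distinctions above can be sketched on the slide's example sentence. Treating a term as simply a type that survives stop-word removal is a simplification (a real system consults a dictionary), and the variable names are assumptions:

```python
# Sketch of the token -> type -> term pipeline from the definitions above.
sentence = "The GDP increased 2 percent this quarter."
stop_words = {"the", "this"}

tokens = sentence.rstrip(".").split()   # tokens: each instance, in order
types = []                              # types: distinct character sequences
for tok in tokens:
    if tok not in types:
        types.append(tok)
# simplification: terms = types minus stop words
terms = [t for t in types if t.lower() not in stop_words]

print(tokens)   # 7 tokens, matching the slide
print(terms)    # 5 terms, matching the slide
```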
Terminologies Used
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers can be categorised as
Rule-based stemmers
Co-occurrence based stemmers
Lemma: the canonical form, dictionary form, or citation form of a set of words.
Lemmatization is the algorithmic process of determining the lemma for a given word with the use of a vocabulary and morphological analysis.
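A toy rule-based stemmer of the kind mentioned above might look like this. The three suffix rules are purely illustrative (a real system would use a full stemmer such as Porter's), and note that, unlike a lemma, the stem need not be a dictionary word:

```python
# Toy rule-based (suffix-stripping) stemmer; the rules are illustrative only.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        # only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("increased"))   # 'increas' -- a stem, not a dictionary word
print(toy_stem("slowdowns"))   # 'slowdown'
```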
Logical View Of Document
Measures Of Accuracy
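The figure for this slide did not survive extraction; the standard IR accuracy measures it presumably showed are precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). A minimal sketch under that assumption, with sets of document ids:

```python
# Precision and recall over sets of document ids (illustrative values).
def precision(retrieved, relevant):
    # fraction of retrieved documents that are relevant
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # fraction of relevant documents that were retrieved
    return len(retrieved & relevant) / len(relevant)

retrieved = {"D1", "D2", "D3", "D4"}
relevant = {"D2", "D4", "D5"}
print(precision(retrieved, relevant))         # 2/4 = 0.5
print(round(recall(retrieved, relevant), 3))  # 2/3 ≈ 0.667
```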
A Formal Characterization of IR Models
An information retrieval model is a quadruple {D, Q, F, R(q_i, d_j)}
where
D is a set composed of logical views for the documents in the collection.
Q is a set composed of logical views for the user information needs.
F is a framework for modeling document representations, queries, and their relationships.
R(q_i, d_j) is a ranking function which associates a real number with a query q_i belonging to Q and a document representation d_j belonging to D. Such a ranking defines an ordering among the documents with regard to the query q_i.
Boolean Model
The model is based on Boolean logic and classical set theory. Documents and the query are conceived as sets of terms.
Retrieval is based on whether or not the documents contain the query terms
The queries are specified as Boolean expressions which have precise semantics.
Advantages:
Formalism and simplicity
Very precise
Disadvantages
No notion of partial match
Boolean expressions have precise semantics (not simple from the user's point of view)
Not Ranked
The model does not use term weights or term frequency.
Given a large set of documents, the Boolean model tends to retrieve either too many documents or very few.
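Boolean retrieval can be sketched as ordinary set operations over a term-incidence index. The two-document corpus and the `AND`/`OR` helpers below are illustrative assumptions:

```python
# Boolean retrieval as set operations over a term-incidence index.
docs = {
    "D1": {"gdp", "increased", "percent", "quarter"},
    "D2": {"spring", "economic", "slowdown", "quarter"},
}

def posting(term):
    # set of documents containing the term
    return {d for d, terms in docs.items() if term in terms}

def AND(a, b):
    return posting(a) & posting(b)

def OR(a, b):
    return posting(a) | posting(b)

print(sorted(AND("gdp", "quarter")))   # only D1 contains both terms
print(sorted(OR("gdp", "spring")))     # either term matches both documents
```

Note the all-or-nothing behaviour: a document either satisfies the expression or it does not, with no ranking among the matches.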
Vector Model
The use of binary weights is too limiting, so the vector model assigns non-binary weights to index terms in queries and documents.

Document vector: d̄_j = (w_{1j}, w_{2j}, ..., w_{tj})
Query vector: q̄_i = (w_{1i}, w_{2i}, ..., w_{ti})
where t is the total number of index terms in the collection of documents.

Term Frequency (tf): the term frequency is simply the count of term i in document j, normalized by the maximum term frequency of any term in the document to bring it into the range 0 to 1:

tf_{i,j} = freq_{i,j} / max_l(freq_{l,j})
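The max-normalized term frequency above can be computed directly; the token list below is an illustrative assumption:

```python
# Max-normalized term frequency: tf_{i,j} = freq_{i,j} / max_l(freq_{l,j}).
from collections import Counter

def normalized_tf(tokens):
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf = normalized_tf(["spring", "economic", "slowdown", "spring", "quarter"])
print(tf["spring"])    # most frequent term gets 1.0
print(tf["quarter"])   # 1/2 = 0.5
```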
Inverse Document Frequency: this factor is motivated by the fact that if a term appears in all documents in a set, it loses its distinguishing power. The inverse document frequency (idf) of a term i is given by:

idf_i = log(N / n_i)

where
n_i is the number of documents in which term i occurs, and N is the total number of documents.
The term weights can be approximated by the product of term frequency (tf) and inverse document frequency (idf). This is called the tf-idf measure.
Similarity measure between two vectors
Cosine Similarity: the cosine similarity between a document vector d̄_j and a query vector q̄_i is given by:

similarity(d̄_j, q̄_i) = cos θ = (d̄_j · q̄_i) / (|d̄_j| · |q̄_i|)
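Putting tf, idf, and cosine similarity together, a minimal vector-model ranker might look like the following. The toy corpus, the tokenization, and the helper names are assumptions; weights are max-normalized tf times idf = log(N / n_i), as defined above:

```python
# Minimal vector-model ranking: tf-idf weights scored by cosine similarity.
import math
from collections import Counter

docs = {
    "D1": ["gdp", "increased", "percent", "quarter"],
    "D2": ["spring", "economic", "slowdown", "spring", "quarter"],
}
N = len(docs)
vocab = sorted({t for toks in docs.values() for t in toks})
# idf_i = log(N / n_i), with n_i the number of documents containing the term
idf = {t: math.log(N / sum(t in toks for toks in docs.values())) for t in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    max_count = max(counts.values())
    return [(counts[t] / max_count) * idf[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = ["spring", "quarter"]
scores = {d: cosine(tfidf_vector(toks), tfidf_vector(query))
          for d, toks in docs.items()}
print(max(scores, key=scores.get))   # D2 shares the rare term "spring"
```

Because "quarter" occurs in every document its idf is zero, so only the rarer term "spring" contributes to the match, which is exactly the distinguishing-power argument behind idf.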
Advantage
The cosine similarity measure returns a value in the range 0 to 1. Hence, partial matching is possible.
Ranking of the retrieved results according to the cosine similarity score is possible.
Disadvantage
Index terms are considered to be mutually independent. Thus, this model does not capture the semantics of the query or the document.
It cannot express the precise, formal query semantics that the Boolean model offers.
Probabilistic Model
The document is retrieved according to the probability of the document being relevant to the query. Mathematically, the scoring function is given by

P(R = 1 | d̄, q̄)

A document is termed relevant if its probability of being relevant is greater than its probability of being non-relevant:

P(R = 1 | d̄, q̄) > P(R = 0 | d̄, q̄)
Binary Independence Model
Assumptions
Each index term is Boolean in nature (1 if present, 0 if absent).
There is no association between index terms.
The relevance of each document is calculated independent of the relevance of other documents.
Documents and queries are both represented as binary term incidence vectors.
d̄_j = (x_{1j}, x_{2j}, ..., x_{Mj})
[x_{ij} is 1 if term i is present in document j; otherwise it is 0.]
q̄ = (q_1, q_2, ..., q_M)
The odds of relevance is used as the ranking function because
it is monotonic with respect to the probability of relevance
it reduces the computation

odds of relevance = P(d_j relevant to q) / P(d_j non-relevant to q)

Ranking function rf():

rf() = P(R | d̄_j) / P(R̄ | d̄_j)

where
P(R | d̄_j) is the probability that document d_j is relevant to q.
P(R̄ | d̄_j) is the probability that document d_j is not relevant to q.
By using Bayes' rule we get

rf() = [P(d̄_j | R) · P(R)] / [P(d̄_j | R̄) · P(R̄)]

Since P(R) and P(R̄) are the same for every document, eliminating them from the above equation won't change the ranking of the documents:

rf() = P(d̄_j | R) / P(d̄_j | R̄)
     = [P(x_1 | R) · P(x_2 | x_1, R) ··· P(x_M | x_1, ..., x_{M−1}, R)] / [P(x_1 | R̄) · P(x_2 | x_1, R̄) ··· P(x_M | x_1, ..., x_{M−1}, R̄)]
     = ∏_{t=1..M} P(x_t | x_1, ..., x_{t−1}, R) / P(x_t | x_1, ..., x_{t−1}, R̄)    (1)
Applying the Naive Bayes assumption, which states that all index terms x_t are independent, we get

rf() = ∏_{t=1..M} P(x_t | R) / P(x_t | R̄)
Since each x_t is either 0 or 1 under the Binary Independence Model, we can separate the terms to give

rf() = ∏_{t: x_t=1} P(x_t | R) / P(x_t | R̄) · ∏_{t: x_t=0} P(x_t | R) / P(x_t | R̄)

Now let p_t = P(x_t = 1 | R) be the probability of a term x_t appearing in a relevant document, and u_t = P(x_t = 1 | R̄) be the probability that a term x_t appears in a non-relevant document. Then

rf() = ∏_{t: x_t=1} p_t / u_t · ∏_{t: x_t=0} (1 − p_t) / (1 − u_t)
Multiply the above equation by

∏_{t: x_t=1} (1 − p_t) / (1 − u_t)

and its reciprocal

∏_{t: x_t=1} (1 − u_t) / (1 − p_t)

The first factor combines with the product over t: x_t = 0 to give

∏_{t} (1 − p_t) / (1 − u_t)

taken over all index terms. This term no longer depends on the particular document vector, so it can be ignored for ranking, and we get

rf() = ∏_{t: x_t=1} [p_t · (1 − u_t)] / [u_t · (1 − p_t)]
Taking the logarithm of the equation we get

rf() = Σ_{t: x_t=1} log { [p_t · (1 − u_t)] / [u_t · (1 − p_t)] }
     = Σ_{t: x_t=1} [ log p_t / (1 − p_t) + log (1 − u_t) / u_t ]
     = Σ_{t: x_t=1} c_t    (2)

c_t is the log odds ratio for the terms in the query:

c_t = log [ odds of the term being present in a relevant document / odds of the term being present in a non-relevant document ]
Equation (2) gives the formal scoring function of the probabilistic information retrieval model. Now c_t needs to be approximated.
Advantages
Documents are ranked in decreasing order of their probability of being relevant.
Disadvantages
The need to guess the initial separation of documents into relevant and non-relevant sets.
All weights are binary.
Index terms are assumed to be independent.
Estimation of c_t
When there are no retrieved documents, we can assume p_t is constant for all index terms x_t.
The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection.
Thus,

p_t = 0.5
u_t = n_t / N

where n_t is the number of documents which contain the index term x_t and N is the total number of documents in the collection.
Putting in these values we get

c_t = log [ 0.5 / (1 − 0.5) ] + log [ (1 − n_t/N) / (n_t/N) ]
    = log [ (N − n_t) / n_t ]    (3)
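The initial BIM ranking can then be sketched as follows: each document's score is the sum of c_t = log((N − n_t)/n_t) from equation (3) over the query terms it contains, per equation (2). The five-document corpus below is an illustrative assumption:

```python
# Initial BIM scoring with no relevance information: equations (2) and (3).
import math

docs = {
    "D1": {"gdp", "increased", "percent", "quarter"},
    "D2": {"spring", "economic", "slowdown", "quarter"},
    "D3": {"gdp", "spring", "forecast"},
    "D4": {"weather", "rain", "spring"},
    "D5": {"holiday", "travel", "spring"},
}
N = len(docs)

def n_t(term):
    # number of documents containing the term
    return sum(term in terms for terms in docs.values())

def c_t(term):
    # equation (3): initial log odds ratio
    return math.log((N - n_t(term)) / n_t(term))

def rsv(doc_terms, query):
    # equation (2): sum c_t over query terms present in the document
    return sum(c_t(t) for t in query if t in doc_terms)

query = {"gdp", "quarter"}
ranking = sorted(docs, key=lambda d: rsv(docs[d], query), reverse=True)
print(ranking[0])   # D1 contains both query terms and ranks first
```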
This gives the initial ranking. An improved ranking can be obtained by the following procedure. We assume:
p_t can be approximated by the distribution of the index term x_t among the documents initially retrieved.
u_t can be approximated by considering the non-retrieved documents to be non-relevant.
We can then write

p_t = V_t / V
u_t = (n_t − V_t) / (N − V)    (4)

where
V is the number of documents initially retrieved and ranked by the probabilistic model, and
V_t is the number of those documents containing the index term x_t.
For small values of V_t and V this creates problems, so we add an adjustment factor:

p_t = (V_t + n_t/N) / (V + 1)
u_t = (n_t − V_t + n_t/N) / (N − V + 1)    (5)
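Equation (5) can be applied directly once the feedback counts are known. The numbers below are illustrative: 10 of 1000 documents contain the term, 5 documents were retrieved, and 3 of those 5 contain the term:

```python
# Relevance-feedback re-estimation of p_t and u_t per equation (5);
# the n_t/N term is the smoothing adjustment for small V and V_t.
def estimate(V, V_t, n_t, N):
    p_t = (V_t + n_t / N) / (V + 1)
    u_t = (n_t - V_t + n_t / N) / (N - V + 1)
    return p_t, u_t

p_t, u_t = estimate(V=5, V_t=3, n_t=10, N=1000)
print(round(p_t, 3))   # (3 + 0.01) / 6 ≈ 0.502
print(round(u_t, 4))   # (10 - 3 + 0.01) / 996 ≈ 0.007
```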