Crawling and Indexing in IR

Academic year: 2022

(1)

CS344: Introduction to Artificial Intelligence

Vishal Vachhani, M.Tech CSE

Lecture 34-35: CLIR and Ranking, Crawling and Indexing in IR

(2)

Road Map

- Cross Lingual IR
  - Motivation
  - CLIA architecture
  - CLIA demo
- Ranking
  - Various ranking methods
  - Nutch/Lucene ranking
  - Learning a ranking function
  - Experiments and results

(3)

Cross Lingual IR

- Motivation
  - Information unavailability in some languages
  - Language barrier
- Definition:
  - Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
- Example:
  - A user may pose a query in Hindi but retrieve relevant documents written in English.

(4)

Why CLIR?

[Figure: a user issues a query in Tamil; the system searches English and Marathi documents, generates English snippets, and performs document and snippet translation into the query language.]

(5)

Cross Lingual Information Access

- Cross Lingual Information Access (CLIA)
  - A web portal supporting monolingual and cross-lingual IR in 6 Indian languages and English
  - Domain: Tourism
- It supports:
  - Summarization of web documents
  - Snippet translation into the query language
  - Template-based information extraction
- The CLIA system is publicly available at
  - http://www.clia.iitb.ac.in/clia-beta-ext

(6)
(7)

CLIA Demo

(8)

Various Ranking Methods

- Vector space model
  - Lucene, Nutch, Lemur, etc.
- Probabilistic ranking model
  - Classical Spärck Jones ranking (log odds ratio)
- Language model
- Ranking using machine learning algorithms
  - SVM, Learning to Rank, SVM-MAP, etc.
- Link-analysis-based ranking
  - PageRank, Hubs and Authorities, OPIC, etc.

(9)

Nutch Ranking

- CLIA is built on top of Nutch, an open-source web search engine.
- It is based on the vector space model; the Lucene scoring formula is

  score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t∈q} ( tf(t,d) · idf(t)² · boost(t) · norm(t,d) )

  where tf(t,d) is the term frequency of t in d, idf(t) the inverse document frequency of t, boost(t) a query-time boost, and norm(t,d) the field and length normalization.
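As a sketch only, the formula can be exercised in Python on a toy corpus; boost and queryNorm are fixed at 1 for clarity, and the tf and idf shapes follow Lucene's classic similarity (sqrt for tf, smoothed log for idf):

```python
import math
from collections import Counter

def lucene_score(query_terms, doc_terms, docs):
    """Simplified Lucene/Nutch vector-space score of one document.

    Computes coord(q,d) * sum_t sqrt(tf(t,d)) * idf(t)^2 * norm(d),
    with boost and queryNorm fixed at 1 for clarity.
    """
    tf = Counter(doc_terms)
    matched = [t for t in query_terms if tf[t] > 0]
    coord = len(matched) / len(query_terms)       # fraction of query terms present
    norm = 1.0 / math.sqrt(len(doc_terms))        # length normalization
    score = 0.0
    for t in matched:
        df = sum(1 for d in docs if t in d)       # document frequency of t
        idf = 1.0 + math.log(len(docs) / (df + 1))
        score += math.sqrt(tf[t]) * idf ** 2 * norm
    return coord * score

docs = [["it", "is", "what", "it", "is"],
        ["what", "is", "it"],
        ["it", "is", "a", "banana"]]
print(lucene_score(["what", "is"], docs[0], docs))
```

Documents matching more of the query get a higher coord factor, and rarer terms contribute quadratically through idf².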

(10)

Link Analysis

- Calculates the importance of pages using the web graph
  - Nodes: pages
  - Edges: hyperlinks between pages
- Motivation: a link-analysis-based score is hard to manipulate using spamming techniques
- Plays an important role in web IR scoring functions
  - PageRank
  - Hubs and Authorities
  - Online Page Importance Computation (OPIC)
- The link analysis score is used along with the tf-idf based score
- We use the OPIC score as a factor in CLIA.
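PageRank, the first of these link-analysis scores, can be sketched in a few lines of Python; the graph below is an invented toy example, not CLIA data:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank over a {node: [outlinked nodes]} adjacency dict."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u, outlinks in graph.items():
            if not outlinks:                       # dangling page: spread evenly
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
            else:
                for v in outlinks:
                    new_rank[v] += damping * rank[u] / len(outlinks)
        rank = new_rank
    return rank

# toy web graph: B and C both link to A, A links out to both
web = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
ranks = pagerank(web)
```

Page A, which receives two in-links, ends up with a higher score than B or C, and the scores always sum to 1.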

(11)
(12)

Learning a Ranking Function

- How much weight should be given to the different parts of a web document while ranking?
- A ranking function can be learned as follows:
  - Machine learning algorithms: SVM, max-entropy
- Training
  - A set of queries, each with some relevant and non-relevant documents
  - A set of features to capture the similarity of documents and query
  - In short, learn the optimal weight of each feature
- Ranking
  - Use the trained model to generate a score for each document in which the query words appear, by combining the different feature scores
  - Sort the documents by score and display them to the user
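The train-then-rank loop above can be sketched pointwise with plain logistic regression learned by gradient descent (a stand-in for the SVM/max-entropy learners named in the slide); the feature values below are invented, where a real system would use the 70-odd features described next:

```python
import math

def train_linear_ranker(examples, epochs=200, lr=0.1):
    """Learn feature weights pointwise with logistic regression.

    examples: list of (feature_vector, label) pairs, label 1 = relevant.
    The last entry of each vector is a constant 1.0 bias feature.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            s = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-s))        # predicted relevance
            for i in range(dim):
                w[i] += lr * (y - p) * x[i]       # gradient step on log-loss
    return w

def rank(doc_features, w):
    """Sort documents (as feature vectors) by descending learned score."""
    return sorted(doc_features,
                  key=lambda x: sum(wi * xi for wi, xi in zip(w, x)),
                  reverse=True)

# toy features: [title tf, body tf, bias]; label 1 marks relevant docs
train = [([3.0, 2.0, 1.0], 1), ([2.0, 3.0, 1.0], 1),
         ([1.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 0)]
w = train_linear_ranker(train)
```

At query time, only the sort step runs: each candidate document's feature scores are combined with the learned weights and the list is sorted.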

(13)

Extended Features for Web IR

1. Content-based features
   - tf, IDF, length, coord, etc.
2. Link-analysis-based features
   - OPIC score
   - Domain-based OPIC score
3. Standard IR algorithm-based features
   - BM25 score
   - Lucene score
   - LM-based score
4. Language-category-based features
   - Named entity
   - Phrase-based features

(14)

Content-based Features

Notation: c(t,d) = count of term t in document d, |d| = length of d, |C| = number of documents in the collection, df(t) = document frequency of t, c(t,C) = count of t in the whole collection. Each feature sums over the query terms t that occur in d.

Feature  Formulation                                   Description
C1       Σ c(t,d)                                      Term frequency (tf)
C2       Σ log(c(t,d) + 1)                             SIGIR feature
C3       Σ c(t,d) / |d|                                Normalized tf
C4       Σ log(c(t,d)/|d| + 1)                         SIGIR feature
C5       Σ log(|C| / df(t))                            Inverse doc frequency (IDF)
C6       Σ log(log(|C| / df(t)))                       SIGIR feature
C7       Σ log(|C| / c(t,C) + 1)                       SIGIR feature
C8       Σ log(c(t,d)/|d| · log(|C|/df(t)) + 1)        tf·IDF
C9       Σ log(c(t,d)/|d| · log(|C|/c(t,C)) + 1)       SIGIR feature
C10      Σ log(c(t,d) · |C| / (|d| · c(t,C)) + 1)      SIGIR feature
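A few of these features (C1, C3 and C5, in the notation above) can be computed directly in Python; the tourism mini-corpus below is invented for illustration:

```python
import math
from collections import Counter

def content_features(query, doc, docs):
    """C1 (term frequency), C3 (normalized tf) and C5 (IDF),
    each summed over the query terms that occur in the document."""
    tf = Counter(doc)
    c1 = c3 = c5 = 0.0
    for t in query:
        if tf[t] == 0:
            continue
        df = sum(1 for d in docs if t in d)   # document frequency of t
        c1 += tf[t]
        c3 += tf[t] / len(doc)
        c5 += math.log(len(docs) / df)
    return c1, c3, c5

docs = [["goa", "beach", "tourism"],
        ["temple", "tourism", "guide"],
        ["beach", "resort", "goa", "goa"]]
c1, c3, c5 = content_features(["goa", "beach"], docs[2], docs)
```

In the full system these sums are computed separately per field (title, body, URL, anchor), which is how features 5-44 in the next table arise.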

(15)

Details of Features

Feature No  Description
1           Length of body
2           Length of title
3           Length of URL
4           Length of anchor
5-14        C1-C10 for the title of the page
15-24       C1-C10 for the body of the page
25-34       C1-C10 for the URL of the page
35-44       C1-C10 for the anchor of the page
45          OPIC score
46          Domain-based classification score

(16)

Details of Features (contd.)

Feature No  Description
48          BM25 score
49          Lucene score
50          Language modeling score
51-54       Named entity weight for title, body, anchor, URL
55-58       Multi-word weight for title, body, anchor, URL
59-62       Phrasal score for title, body, anchor, URL
63-66       Coord factor for title, body, anchor, URL
71          Coord factor for the H1 tag of the web document
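Of the standard-algorithm features, the BM25 score (feature 48) can be sketched as follows, using the customary defaults k1 = 1.2 and b = 0.75 over the same invented toy corpus:

```python
import math
from collections import Counter

def bm25(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of `doc` for `query` over the corpus `docs`."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    tf = Counter(doc)
    score = 0.0
    for t in set(query):
        if tf[t] == 0:
            continue
        df = sum(1 for d in docs if t in d)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

docs = [["goa", "beach", "tourism"],
        ["temple", "tourism", "guide"],
        ["beach", "resort", "goa", "goa"]]
```

Unlike raw tf, the saturating denominator keeps any single repeated term from dominating the score, and b trades off document-length normalization.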

(17)

Experiments and Results

MAP scores:

Nutch ranking                                    0.2267  0.2267  0.2667  0.2137
DIR with title + content                         0.6933  0.64    0.5911  0.3444
DIR with URL + content                           0.72    0.62    0.5333  0.3449
DIR with title + URL + content                   0.72    0.6533  0.56    0.36
DIR with title + URL + content + anchor          0.73    0.66    0.58    0.3734
DIR with title + URL + content + anchor + NE     0.76    0.63    0.6     0.4

(18)

Crawling, Indexing

(19)

Outline

- Nutch overview
- Crawler in the CLIA system
  - Data structures
  - Crawler in CLIA
- Indexing
  - Types of index and indexing tools
- Searching
  - Command-line API
  - Searching through the GUI
- Demo

(20)

Crawler

The crawler system is driven by the Nutch crawl tool and a family of related tools that build and maintain several types of data structures, including the web database, a set of segments, and the index.

(21)

Crawler Data Structures

- Web database (webdb)
  - Persistent data structure for the web graph being crawled
  - Stores pages and links
- Segment
  - A collection of pages fetched and indexed by the crawler in a single run
- Index
  - Inverted index of all of the pages the system has retrieved

(22)

Crawler

[Figure: the Injector seeds the CrawlDB with the initial URLs; the Generator creates fetch lists from the CrawlDB; the Fetcher retrieves web pages and files into a Segment, which the Parser reads and writes; the CrawlDB update tool feeds the results back into the CrawlDB.]

(23)

Crawl Command

- Aimed at intranet-scale crawling
- A front end to other, lower-level tools
- Performs crawling and indexing
- Create a URLs directory and put the URL list in it.
- Command
  - $NUTCH_HOME/bin/nutch crawl urlDir [options]
  - Options
    - -dir: the directory to put the crawl in
    - -depth: the link depth from the root page that should be crawled
    - -threads: the number of threads that will fetch in parallel
    - -topN: number of total pages to be crawled
  - Example
    bin/nutch crawl urls -dir crawldir -depth 3 -topN 10

(24)

Inject Command

- Injects root URLs into the WebDB
- Command
  - $NUTCH_HOME/bin/nutch inject <crawldb> <urldir>
  - <crawldb>: path to the crawl database directory
  - <urldir>: path to the directory containing flat text URL files

(25)

Generate Command

- Generates a new fetcher segment from the crawl database
- Command:
  - $NUTCH_HOME/bin/nutch generate <crawldb> <segments_dir> [-topN <num>] [-numFetchers <fetchers>]
  - <crawldb>: path to the crawldb directory
  - <segments_dir>: path to the directory where the fetcher segments are created
  - [-topN <num>]: selects the top <num> ranking URLs for this segment
  - [-numFetchers <fetchers>]: the number of fetch partitions

(26)

Fetch Command

- Runs the fetcher on a segment
- Command:
  - $NUTCH_HOME/bin/nutch fetch <segment> [-threads <n>] [-noParsing]
  - <segment>: path to the segment to fetch
  - [-threads <n>]: the number of fetcher threads to run
  - [-noParsing]: disables automatic parsing of the segment's data

(27)

Parse Command

- Runs ParseSegment on a segment
- Command:
  - $NUTCH_HOME/bin/nutch parse <segment>
  - <segment>: path to the segment to parse

(28)

Updatedb Command

- Updates the crawl database with information obtained by the fetcher
- Command:
  - $NUTCH_HOME/bin/nutch updatedb <crawldb> <segment>
  - <crawldb>: path to the crawl database
  - <segment>: path to the segment that has been fetched

(29)

Index and Indexing

- Sequential search is bad (not scalable)
- Indexing: the creation of a data structure that facilitates fast, random access to the information stored in it
- Types of index
  - Forward index
  - Inverted index
  - Full inverted index

(30)

Forward Index

- Stores a list of words for each document
- Example
  D1 = "it is what it is."
  D2 = "what is it."
  D3 = "it is a banana"

  Document  Words
  1         it, is, what
  2         what, is, it
  3         it, is, a, banana

(31)

Inverted Index

- Stores a list of documents for each word

  Word    Documents
  a       3
  banana  3
  is      1, 2, 3
  it      1, 2, 3
  what    1, 2

(32)

Full Inverted Index

- Stores (document, position) pairs for each word, which is what supports phrase search.
- Query: "what is it"

  Word    (document, position) pairs
  a       {(3,2)}
  banana  {(3,3)}
  is      {(1,1),(1,4),(2,1),(3,1)}
  it      {(1,0),(1,3),(2,2),(3,0)}
  what    {(1,2),(2,0)}
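The phrase-search idea follows directly from the table above: store (document, position) postings and require consecutive positions. A minimal sketch (real indexes compress their postings, which is omitted here):

```python
from collections import defaultdict

def build_full_index(docs):
    """Full inverted index: word -> set of (doc_id, position) pairs."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for pos, w in enumerate(words):
            index[w].add((doc_id, pos))
    return index

def phrase_search(index, phrase):
    """Return ids of docs containing the phrase's words at consecutive positions."""
    words = phrase.split()
    results = set()
    # candidate start positions come from the first word's postings
    for doc_id, start in index.get(words[0], ()):
        if all((doc_id, start + i) in index.get(w, ())
               for i, w in enumerate(words[1:], 1)):
            results.add(doc_id)
    return results

docs = {1: "it is what it is".split(),
        2: "what is it".split(),
        3: "it is a banana".split()}
index = build_full_index(docs)
```

For the query "what is it", only document 2 has the three words at consecutive positions, even though all three documents contain every word of the query.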

(33)

Invertlinks Command

- Updates the link database with linking information from a segment
- Command:
  - $NUTCH_HOME/bin/nutch invertlinks <linkdb> (-dir segmentsDir | segment1 segment2 ...)
  - <linkdb>: path to the link database
  - <segment>: path to the segment that has been fetched; a directory or more than one segment may be specified

(34)

Index Command

- Creates an index of a segment, using information from the crawldb and the linkdb to score pages in the index
- Command:
  - $NUTCH_HOME/bin/nutch index <index> <crawldb> <linkdb> <segment> ...
  - <index>: path to the directory where the index will be created
  - <crawldb>: path to the crawl database directory
  - <linkdb>: path to the link database directory
  - <segment>: path to the segment that has been fetched; more than one segment may be specified

(35)

Dedup Command

- Removes duplicate pages from a set of segment indexes
- Command:
  - $NUTCH_HOME/bin/nutch dedup <indexes>
  - <indexes>: path to the directories containing the indexes

(36)

Merge Command

- Merges several segment indexes
- Command:
  - $NUTCH_HOME/bin/nutch merge <outputIndex> <indexesDir> ...
  - <outputIndex>: path to a directory where the merged index will be created
  - <indexesDir>: path to a directory containing indexes to merge; more than one directory may be specified

(37)

Configuring the CLIA Crawler

Configure file: $NUTCH/conf/nutch-site.xml

- Required user parameters
  - http.agent.name
  - http.agent.description
  - http.agent.url
  - http.agent.email
- Optional user parameters
  - http.proxy.host
  - http.proxy.port
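These properties are set in nutch-site.xml as standard Hadoop-style <property> entries; the values below are illustrative placeholders, not the real CLIA settings:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Required: identify the crawler politely to web servers -->
  <property>
    <name>http.agent.name</name>
    <value>clia-crawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>CLIA tourism-domain crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.clia.iitb.ac.in/</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler-admin@example.org</value>
  </property>
  <!-- Optional: only needed when crawling from behind a proxy -->
  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.org</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>
</configuration>
```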

(38)

Configuring the CLIA Crawler (contd.)

Configure file: $NUTCH/conf/crawl-urlfilter.txt

- Regular expressions to filter URLs during crawling
- E.g.
  - To ignore files with certain suffixes:
    -\.(gif|exe|zip|ico)$
  - To accept hosts in a certain domain:
    +^http://([a-z0-9]*\.)*apache.org/
- Change the following line (line 26 of crawl-urlfilter.txt):
  # skip everything else
  +.

(39)

Searching and Indexing

[Figure: the segments, CrawlDB and LinkDB feed the Lucene-based Indexer, which produces the Index; the Lucene-based Searcher queries the index and serves results through the Tomcat GUI.]

(40)

Crawl Directory Structure

- crawldb
  - Contains information about every URL known to Nutch
- linkdb
  - Contains the list of known links to each URL
- segments
  - crawl_generate: names a set of URLs to be fetched
  - crawl_fetch: contains the status of fetching each URL
  - content: contains the content of each URL
  - parse_text: contains the parsed text of each URL
  - parse_data: contains outlinks and metadata parsed from each URL
  - crawl_parse: contains the outlink URLs, used to update the crawldb
- index
  - Contains Lucene-format indexes

(41)

Searching

Configure file: $NUTCH/conf/nutch-default.xml

- Change the following property:
  - searcher.dir: complete path to your crawl folder
- Command-line searching API:
  - $NUTCH_HOME/bin/nutch org.apache.nutch.searcher.NutchBean queryString

(42)

Searching (contd.)

- Create the clia-alpha-test.war file using "ant war"
- Deploy the clia-alpha-test.war file in the Tomcat webapps directory
- Open http://localhost:8080/clia-alpha-test/

(43)

Thanks
