(1)

Content Based Search

Rajesh Kumar Jain

Roll No: 07405402

(rkjain@cse.iitb.ac.in)

(2)

Agenda-e-Day

Motivation

What Do People Want from Search Engine?

Types of Search Engines

Existing Search Engines (Google, Yahoo, Ask, AppliedSemantics)

INIS – International Nuclear Information System

AgroExplorer

Our approach – Functional Architecture with example

Conclusion and Future Work

(3)

Motivation

The Web is a major source of information.

Need for search engines

Efficient and time saving.

Language barrier.

Most relevant documents.

Meaning Based Search

Used to retrieve most relevant documents

Multilingual Search

Used to eliminate language barrier.

(4)

What Do People Want from Search Engine?

Integrated Solutions

Distributed Solutions

Efficient, Flexible Indexing and Retrieval

Interfaces and Browsing

Effective Retrieval

Multimedia Retrieval

Information Extraction

Relevance Feedback

(5)

Types of Search Engines

Individual Search engines

Compile their own databases.

Further classified as

Keyword based search engines.

Search on the keywords. e.g. Google.

Meaning based search engines.

Search on the meaning or semantics. e.g. AgroExplorer

Meta Search engines

Do not compile their own databases.

Search the databases of different search engines, e.g. Dogpile.

Subject Directories

Created and maintained by human editors, e.g. LIBRARIANS' INDEX (http://lii.org), INFOMINE (http://infomine.ucr.edu), ACADEMIC INFO (http://www.academicinfo.us).

(6)

Existing Search Engines - Google

Keyword Based Search

Page Rank

Relative importance of the web page.

Anchor Text
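The idea of PageRank as a measure of a page's relative importance can be illustrated with a small power-iteration sketch. This is only a toy version under stated assumptions: the four-page link graph is invented for illustration, and the damping factor of 0.85 is the commonly quoted value rather than anything given on the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank equally among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked to by A, B and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links (C) ends up with the highest rank, and the page nobody links to (D) with the lowest.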

(7)

Existing Search Engines – Yahoo!

Yahoo! search

http://search.yahoo.com

• Huge (15 or more billion web pages)

• Relevancy ranking (word proximity and placement), not popularity ranking

• Capitalize OR, AND, or AND NOT. Put parentheses around words joined by OR.

• No search-size word limit (Google limits you to 32 terms)

• Services and tools similar to Google's

(8)

Existing Search Engines – Yahoo!

Differences between searching Google and Yahoo! Search

• Parentheses around ORed terms (sometimes works without parentheses):

("global warming" OR "greenhouse effect") rise "sea level" (california OR "los angeles" OR "san diego" OR "san francisco")

• Supports intitle:, site:, inurl: and hostname: (for an entire site name, e.g. hostname:google.com)

• Shortcuts available at

http://tools.search.yahoo.com/shortcuts
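The query conventions above (capitalized OR, parentheses around ORed terms, quotes around phrases) can be sketched as a small string builder. The helper name `build_query` and its input shape are assumptions made for this illustration; it only assembles a query string and does not call any search service.

```python
def quote(term):
    """Wrap multi-word phrases in double quotes; leave single words alone."""
    return f'"{term}"' if " " in term else term


def build_query(parts):
    """Build a query string from a list of parts.

    A plain string is a required term; a list of strings is a group of
    alternatives that is joined with OR and wrapped in parentheses.
    """
    pieces = []
    for part in parts:
        if isinstance(part, str):
            pieces.append(quote(part))
        else:
            pieces.append("(" + " OR ".join(quote(t) for t in part) + ")")
    return " ".join(pieces)


query = build_query([
    ["global warming", "greenhouse effect"],
    "rise",
    "sea level",
    ["california", "los angeles", "san diego", "san francisco"],
])
print(query)
```

With this input the builder reproduces the example query shown on the slide.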

(9)

Existing Search Engines – Ask.com

Ask.com http://ask.com

Subject-Specific Popularity ranking (links from pages on same subject as your search)

Search results analyzed to provide:

BROADER & NARROWER TERMS suggestions

Smaller database than Google or Yahoo! - about 2 billion

No differences between basic searching in Google and searching Ask.com.

(10)

Existing Search Engines – AppliedSemantics

•Internet’s first meaning based search engine.

•Used in Google Adsense (Advertising solutions).

•CIRCA technology used (Conceptual Information Retrieval and Communication Architecture).

CIRCA has

•a scalable, language independent ontology.

Ontology has

Millions of words with their meanings

•Conceptual relationships to other meanings.

(11)

CIRCA

•Identifies concepts related to specific words and phrases.

•Finds how close “phrase A” is to “concept B”.

•For a given query

•Finds the distance between the query and various concepts in the database.

•E.g. Query – “Colorado Bicycle trips”.

•Possible concepts – region, bicycling, travel, etc.
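One way to picture "distance between the query and various concepts" is as a shortest-path length in an ontology graph. The sketch below is not CIRCA itself, whose internals are not described here; the tiny ontology, its edges and the function names are all invented for illustration.

```python
from collections import deque

# Hypothetical ontology edges (undirected) linking words and concepts.
EDGES = [
    ("colorado", "region"),
    ("bicycle", "bicycling"),
    ("trips", "travel"),
    ("bicycling", "sport"),
    ("bicycling", "travel"),
    ("sport", "recreation"),
    ("bank", "finance"),
]


def build_graph(edges):
    """Turn an edge list into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distances_from(graph, start):
    """Breadth-first search: number of hops from `start` to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def concept_distances(graph, query, concepts):
    """Distance of each concept to the closest query word (None if unreachable)."""
    per_word = [distances_from(graph, word) for word in query.lower().split()]
    result = {}
    for concept in concepts:
        reachable = [dist[concept] for dist in per_word if concept in dist]
        result[concept] = min(reachable) if reachable else None
    return result


GRAPH = build_graph(EDGES)
scores = concept_distances(
    GRAPH,
    "Colorado Bicycle trips",
    ["region", "bicycling", "travel", "recreation", "finance"],
)
print(scores)
```

In this toy graph, region, bicycling and travel sit one hop from the query words, recreation is three hops away, and finance is unreachable, which mirrors the slide's list of likely concepts.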

(13)

INIS

There are three major INIS products:

The INIS Database, which today contains 2.9 million bibliographic records; it is accessible by subscription only and currently has 1.3 million authorized users.

A unique collection of over 850 000 full-text documents (non-conventional "grey" literature – NCL) in 63 languages, including many documents that cannot easily be found anywhere else.

The INIS Multilingual Thesaurus – a major tool for describing nuclear information and knowledge in a structured form, which assists in multilingual and semantic searches.

(14)

INIS-Features and Benefits

IAEA official design

Direct access to NCL documents in pdf format

Extended and configurable hyper-linking of external web addresses and emails, facilitating easier access to NCL documents on external systems or contacting authors

Weekly email notifications

Improved usability:

Allows users to see the query and its results at the same time

Allows users to preserve previously run queries for comparison purposes.

Displays records in reverse chronological order, giving users quick access to the latest records.

Better documentation:

Tool-tips assist users in performing tasks

Static help pages with "how-to" documents, manuals and glossary of terms can be opened in separate window for consultation.

(15)

INIS-Features and Benefits

Improved configurability:

Allows users to fully customize the search mask and search results pages

The interface can be used in English, German and Spanish, with Portuguese to be added soon. More languages can be added upon demand

Anonymous users can register their own profiles and enjoy personalized features

Improved Index/Authority Navigator with search-composing assistant (CTRL-CLICK)

Increased data export capabilities: new formats (XML, Excel, formatted text, delimited text, HTML), sorting of exports

The type-ahead, search-ahead functionality "INIS Suggest" assists users when entering search terms and shows the hit count before the search is executed; this provides additional useful information when composing queries

Searches are much faster, now enabling queries that used to time out in the old system. Most queries are estimated to be between 5 and 20 times faster

(16)

INIS-Features and Benefits

Support for concurrent users: a round-robin load balancer distributes the load among different databases

Improved maintenance: all update procedures are automated, require no human intervention and notify administrators in case of problems

Zero downtime per week: updates are transparent to users, who can use the system 24/7 without performance detriments.
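The round-robin load balancing mentioned above can be sketched in a few lines. This is a minimal illustration, not the INIS implementation: the class name and the backend names "db-1" to "db-3" are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in a fixed rotating order, one per request."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = cycle(backends)

    def next_backend(self):
        """Return the backend that should serve the next request."""
        return next(self._backends)


# Hypothetical database replicas.
balancer = RoundRobinBalancer(["db-1", "db-2", "db-3"])
assigned = [balancer.next_backend() for _ in range(7)]
print(assigned)
```

Each backend receives every third request, so the load is spread evenly as long as requests cost roughly the same.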

(17)

AgroExplorer

A meaning based multilingual search engine.

Agriculture domain.

UNL is used as interlingua.

Supports English, Hindi and Marathi.

Methodology

User phrases the query in native language.

System translates it to Universal Networking Language (UNL).

UNL corpus is searched.

Related documents in UNL are fetched.

Fetched documents are converted to native language.

(18)

AgroExplorer

(19)

Query Output

Complete Expression Matching.

Retrieves completely relevant documents where query UNL graph is a subgraph of any sentence UNL graph.

Partial Expression Matching

Retrieves relevant documents where query UNL graph is a part of any sentence UNL graph.

Universal Word Matching

Search on Universal words which are concepts, not just keywords.

Keyword Based Matching.

Traditional search. Lucene search engine used.
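The first three matching levels can be sketched by treating a UNL graph as a set of (relation, word, word) triples. This is a simplified reading, not the AgroExplorer code: universal words are reduced to bare strings, "partial" is interpreted here as sharing at least one relation triple, and the example sentence and queries are invented.

```python
def universal_words(graph):
    """Collect every universal word that appears in a set of UNL triples."""
    return {uw for _, head, tail in graph for uw in (head, tail)}


def match_level(query, sentence):
    """Classify how well a query graph matches a sentence graph."""
    if query <= sentence:
        return "complete"        # query graph is a subgraph of the sentence graph
    if query & sentence:
        return "partial"         # at least one relation triple is shared
    if universal_words(query) & universal_words(sentence):
        return "universal-word"  # only concepts are shared, no relations
    return None


# Hypothetical sentence graph: "farmers grow rice in the field".
sentence = {
    ("agt", "grow", "farmer"),
    ("obj", "grow", "rice"),
    ("plc", "grow", "field"),
}

levels = [
    match_level({("obj", "grow", "rice")}, sentence),
    match_level({("obj", "grow", "rice"), ("tim", "grow", "summer")}, sentence),
    match_level({("obj", "eat", "rice")}, sentence),
    match_level({("obj", "buy", "tractor")}, sentence),
]
print(levels)
```

A search front end could try the levels in this order and fall back to keyword search only when all three fail.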

(20)

Multilingual Information Retrieval

Need

Document collection contains documents in many languages.

The user may not be fluent enough to express the query in the document language.

Approaches

Machine translation for text translation

Thesaurus/Dictionary Based

Corpus Based (Sub word clusters)

(21)

Our Approach – Functional Architecture

(22)

Example…

Commercial Description:

1. Automobile Radio and Stereo Retail Store;

2. Automobile Engine Rebuilding, Repair, and Exchange Workshop;

3. Car Repair and Retail Shop;

4. Jeep Repair and Retail Shop; and

5. Motor Mending and Replacement Workshop.

(23)

Example…

For our search, we shall compare these encoding and retrieval techniques:

a flat list of words,

a structured list of words,

a flat list of word senses plus the linguistic ontology,

a structured list of word senses, using WordNet's ontology.

(24)

Method – Flat list of Words

Both recall and precision of this method are very poor.

NO. | QUERY | DESCRIPTIONS FOUND
1 | Automobile | 1, 2
2 | Automobile Retail | 1
3 | Car Repair | 3
4 | Motor Repair | -
5 | Engine Repair | 2
6 | Motor Exchange | -
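The flat-list method can be sketched as plain word matching over the five commercial descriptions. The function names are made up for this illustration, and the rule that every query word must occur in the description is an assumption, chosen because it is consistent with the results listed above.

```python
import re

DESCRIPTIONS = {
    1: "Automobile Radio and Stereo Retail Store",
    2: "Automobile Engine Rebuilding, Repair, and Exchange Workshop",
    3: "Car Repair and Retail Shop",
    4: "Jeep Repair and Retail Shop",
    5: "Motor Mending and Replacement Workshop",
}


def tokens(text):
    """Lower-case a text and split it into a set of plain words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def flat_search(query):
    """Return the descriptions that contain every word of the query."""
    wanted = tokens(query)
    return [no for no, text in DESCRIPTIONS.items() if wanted <= tokens(text)]


QUERIES = [
    "Automobile",
    "Automobile Retail",
    "Car Repair",
    "Motor Repair",
    "Engine Repair",
    "Motor Exchange",
]
results = {query: flat_search(query) for query in QUERIES}
for query, found in results.items():
    print(query, "->", found)
```

The weaknesses are visible directly: "Car Repair" misses descriptions 2 and 4 because they say "Automobile" and "Jeep", and "Motor Repair" finds nothing because description 5 says "Mending".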

(25)

Method – Structured list of Words

NO. | BUSINESS TYPE | ACTIVITY | OBJECT | MARKET AREA
1 | Store | Retail | Radio | Automobile
  | Store | Retail | Stereo | Automobile
2 | Workshop | Rebuilding | Engine | Automobile
  | Workshop | Repair | Engine | Automobile
  | Workshop | Exchange | Engine | Automobile
3 | Shop | Retail | Car |
  | Shop | Repair | Car |
4 | Shop | Retail | Jeep |
  | Shop | Repair | Jeep |
5 | Workshop | Replacement | Motor |
  | Workshop | Mending | Motor |

(26)

Method – Structured list of Words

Recall remains the same because we have not eliminated the semantic-match problems.
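A structured list can be sketched as records with named slots, queried slot by slot. The slot values follow the structured table for the five descriptions; placing Car, Jeep and Motor in the object slot with an empty market area is one reading of that table, and the function name is invented for this illustration.

```python
FIELDS = ("business_type", "activity", "object", "market_area")

# (description number, business type, activity, object, market area)
RECORDS = [
    (1, "store", "retail", "radio", "automobile"),
    (1, "store", "retail", "stereo", "automobile"),
    (2, "workshop", "rebuilding", "engine", "automobile"),
    (2, "workshop", "repair", "engine", "automobile"),
    (2, "workshop", "exchange", "engine", "automobile"),
    (3, "shop", "retail", "car", None),
    (3, "shop", "repair", "car", None),
    (4, "shop", "retail", "jeep", None),
    (4, "shop", "repair", "jeep", None),
    (5, "workshop", "replacement", "motor", None),
    (5, "workshop", "mending", "motor", None),
]


def structured_search(**slots):
    """Return description numbers with a record matching every given slot."""
    found = []
    for number, *values in RECORDS:
        record = dict(zip(FIELDS, values))
        if all(record[name] == value for name, value in slots.items()):
            if number not in found:
                found.append(number)
    return found


# Slots separate "sells things for automobiles" from "sells automobiles".
accessories = structured_search(activity="retail", market_area="automobile")
car_dealers = structured_search(activity="retail", object="automobile")
# Synonyms are still missed: description 5 says "mending", not "repair".
car_repair = structured_search(activity="repair", object="car")
motor_repair = structured_search(activity="repair", object="motor")
print(accessories, car_dealers, car_repair, motor_repair)
```

The slots remove false matches such as treating an automobile radio store as an automobile retailer, but the query for motor repair still comes back empty, which is the unchanged recall noted above.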

(27)

Method – WordNet Synset and Linguistic Ontology

NO. | DISAMBIGUATED DESCRIPTION
1 | [car, auto, automobile, machine, motorcar], [radio receiver, receiving set, radio set, radio, tuner, wireless], [stereo, stereo system, stereophonic system], [retail, sell retail], [shop, store]
2 | [car, auto, automobile, machine, motorcar], [engine], [rebuilding], [repair, fix, fixing, mending, reparation], [substitution, exchange], [workshop, shop]
3 | [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
4 | [jeep, landrover], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
5 | [motor], [repair, fix, fixing, mending, reparation], [replacement, replacing], [workshop, shop]

(28)

Method – Flat list of Word senses and Linguistic ontology

NO. | DISAMBIGUATED QUERY | DESCRIPTIONS FOUND
1 | [car, auto, automobile, machine, motorcar] | 1, 2, 3, 4
2 | [car, auto, automobile, machine, motorcar], [retail, sell retail] | 1, 3, 4
3 | [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation] | 2, 3, 4
4 | [motor], [repair, fix, fixing, mending, reparation] | 2, 5
5 | [locomotive, engine, locomotive engine, railway locomotive], [repair, fix, fixing, mending, reparation] | (none listed)
6 | [motor], [substitution, exchange] | 2, 5
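Matching on word senses plus a hierarchy can be sketched as follows. Each synset is represented by one label, and the three is-a links (jeep under car, engine under motor, replacement under exchange) are assumptions picked to be consistent with the results in the table; a real system would look them up in WordNet rather than hard-code them.

```python
# One label per synset, e.g. "repair" stands for
# [repair, fix, fixing, mending, reparation].
DESCRIPTIONS = {
    1: {"car", "radio", "stereo", "retail", "shop"},
    2: {"car", "engine", "rebuilding", "repair", "exchange", "workshop"},
    3: {"car", "repair", "retail", "shop"},
    4: {"jeep", "repair", "retail", "shop"},
    5: {"motor", "repair", "replacement", "workshop"},
}

# Assumed is-a links (child sense -> parent sense), for illustration only.
HYPERNYM = {"jeep": "car", "engine": "motor", "replacement": "exchange"}


def is_a(sense, ancestor):
    """True if `sense` equals `ancestor` or lies below it in the hierarchy."""
    while sense is not None:
        if sense == ancestor:
            return True
        sense = HYPERNYM.get(sense)
    return False


def sense_search(query_senses):
    """Descriptions in which every query sense is matched by some sense."""
    return [
        no
        for no, senses in DESCRIPTIONS.items()
        if all(any(is_a(s, q) for s in senses) for q in query_senses)
    ]


QUERIES = {
    1: ["car"],
    2: ["car", "retail"],
    3: ["car", "repair"],
    4: ["motor", "repair"],
    5: ["locomotive", "repair"],  # "engine" read in its railway sense
    6: ["motor", "exchange"],
}
results = {no: sense_search(senses) for no, senses in QUERIES.items()}
for no, found in results.items():
    print(no, found)
```

Synonyms now match because they share a synset label, generic queries reach more specific descriptions through the is-a links, and the railway sense of "engine" in query 5 matches nothing in this sketch.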

(29)

Method – Flat list of Word senses and Linguistic ontology

Decouple the user vocabulary from the data vocabulary, by covering the most common English words;

Increase recall, by exploiting the hierarchy to make generic queries and recognizing synonyms;

Increase precision, through the disambiguation mechanism and the ability to navigate the hierarchy to select specific queries.

(30)

Conclusion and Future action…

Meaning based search engines can take into account the concept or idea expressed by the user in the query and can thus provide more accurate results than traditional keyword search engines.

Universal Networking Language (UNL) can be used as an effective interlingua to represent information in documents written in natural languages.

Multilingual search engines can help users access documents written in languages other than the query language.

Future Work

The lack of a large scored, multilingual corpus and the adverse effects of polysemous words are found to be the cause of most of the limitations of MLIR systems.

Research efforts are being directed towards these fields and approaches to use interlingua like UNL, subword clusters, etc. effectively for MLIR.
