Content Based Search
Rajesh Kumar Jain
Roll No: 07405402
(rkjain@cse.iitb.ac.in)
Agenda for the Day
Motivation
What Do People Want from Search Engine?
Types of Search Engines
Existing Search Engines (Google, Yahoo, Ask, AppliedSemantics)
INIS – International Nuclear Information System
AgroExplorer
Our Approach – Functional Architecture with Example
Conclusion and Future Work
Motivation
The Web is a major source of information.
Need for search engines
Efficient and time saving.
Language barrier.
Most relevant documents.
Meaning Based Search
Used to retrieve most relevant documents
Multilingual Search
Used to eliminate language barrier.
What Do People Want from Search Engine?
Integrated Solutions
Distributed Solutions
Efficient, Flexible Indexing and Retrieval
Interfaces and Browsing
Effective Retrieval
Multimedia Retrieval
Information Extraction
Relevance Feedback
Types of Search Engines
Individual Search engines
Compile their own databases.
Further classified as
Keyword based search engines.
Search on the keywords. e.g. Google.
Meaning based search engines.
Search on the meaning or semantics. e.g. AgroExplorer
Meta Search engines
Do not compile their own databases.
Search databases of different search engines, e.g. Dogpile.
Subject Directories
Created and maintained by human editors, e.g. LIBRARIANS' INDEX http://lii.org, INFOMINE http://infomine.ucr.edu, ACADEMIC INFO http://www.academicinfo.us
Existing Search Engines - Google
Keyword Based Search
Page Rank
Relative importance of the web page.
Anchor Text
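The idea of ranking a page by its relative importance can be illustrated with a small power-iteration sketch. This is a simplified textbook version of PageRank, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page toy web are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every
    link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its rank evenly among its out-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: C is linked to by A, B and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy web, page C collects links from three pages and ends up with a higher rank than B or D, while D, which nobody links to, keeps only the baseline share.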
Existing Search Engines – Yahoo!
Yahoo! Search
http://search.yahoo.com
Huge (15 billion or more web pages)
Relevancy ranking (word proximity and placement), not popularity ranking
Capitalize OR, AND, or AND NOT.
Put parentheses around words joined by OR.
No search-size word limit (Google limits you to 32 terms)
Services and tools similar to Google's
Existing Search Engines – Yahoo!
Differences between searching Google and Yahoo! Search
Parentheses around ORed terms – sometimes works without parentheses
("global warming" OR "greenhouse effect") rise "sea level" (california OR "los angeles" OR "san diego" OR "san francisco")
Supports intitle:, site:, inurl: and hostname: (for an entire site name, e.g. hostname:google.com)
Shortcuts available at
http://tools.search.yahoo.com/shortcuts
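The parenthesized query above can be read as an AND over groups, where each group is an OR over phrases. The sketch below evaluates that reading against a piece of text. It is only an illustration of the Boolean logic, not how Yahoo! Search works internally; the nested-list representation and the plain substring matching are assumptions.

```python
# Each inner list is one OR group; the groups are ANDed together.
QUERY = [
    ["global warming", "greenhouse effect"],
    ["rise"],
    ["sea level"],
    ["california", "los angeles", "san diego", "san francisco"],
]


def matches(text, query=QUERY):
    """Return True if every OR group has at least one phrase in the text.

    Uses case-insensitive substring matching, which is a simplification:
    "rise" would also match inside "sunrise".
    """
    lowered = text.lower()
    return all(any(phrase in lowered for phrase in group) for group in query)


hit = matches("Global warming may cause a rise in sea level near San Diego.")
miss = matches("Global warming may cause a rise in sea level in Florida.")
```

The first sentence satisfies all four groups; the second fails only the location group, so it is rejected.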
Existing Search Engines – Ask.com
Ask.com http://ask.com
Subject-Specific Popularity ranking (links from pages on same subject as your search)
Search results analyzed to provide:
BROADER & NARROWER TERMS suggestions
Smaller database than Google or Yahoo! - about 2 billion
No differences between basic searching in Google and searching Ask.com.
Existing Search Engines – AppliedSemantics
•Internet’s first meaning based search engine.
•Used in Google Adsense (Advertising solutions).
•CIRCA technology used (Conceptual Information Retrieval and Communication Architecture).
• CIRCA has
•a scalable, language independent ontology.
• Ontology has
• Millions of words with their meanings
•Conceptual relationships to other meanings.
CIRCA
•Identifies concepts related to specific words and phrases.
•Finds how close “phrase A” is to “concept B”.
•For a given query
•Finds the distance between the query and various concepts in the database.
•E.g. Query – “Colorado Bicycle trips”.
•Possible concepts – region, bicycling, travel, etc.
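One generic way to picture "distance between a query and a concept" is as a shortest path in an ontology graph. The sketch below is not CIRCA's actual method, which these slides do not describe; the tiny hand-made ontology, the breadth-first search, and the choice of taking the minimum over query words are all assumptions for illustration.

```python
from collections import deque

# Hypothetical toy ontology: undirected links between words and concepts.
ONTOLOGY = {
    "colorado": ["region"],
    "region": ["colorado", "travel"],
    "bicycle": ["bicycling"],
    "bicycling": ["bicycle", "travel", "sport"],
    "trips": ["travel"],
    "travel": ["trips", "region", "bicycling"],
    "sport": ["bicycling"],
}


def distance(start, goal, graph=ONTOLOGY):
    """Number of links on the shortest path, or None if unreachable."""
    if start not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    return None


def query_to_concept_distance(query, concept):
    """Distance from the closest query word to the concept."""
    found = [distance(word, concept) for word in query.lower().split()]
    found = [d for d in found if d is not None]
    return min(found) if found else None


to_travel = query_to_concept_distance("Colorado Bicycle trips", "travel")
to_sport = query_to_concept_distance("Colorado Bicycle trips", "sport")
```

With this toy graph, "travel" is one link away from the query (through "trips") and "sport" is two links away (through "bicycle" and "bicycling"), so "travel" would be judged the closer concept.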
Existing Search Engines – INIS
There are three major INIS products:
The INIS Database, which today contains 2.9 million bibliographic records; it is accessible by subscription only and currently has 1.3 million authorized users.
A unique collection of over 850 000 full-text documents (non-conventional "grey" literature – NCL) in 63 languages, including many documents that cannot easily be found anywhere else.
The INIS Multilingual Thesaurus – a major tool for describing nuclear information and knowledge in a structured form, which assists in multilingual and semantic searches.
INIS-Features and Benefits
IAEA official design
Direct access to NCL documents in pdf format
Extended and configurable hyper-linking of external web addresses and emails, facilitating easier access to NCL documents on external systems or contacting authors
Weekly email notifications
Improved usability:
Allows users to see the query and its results at the same time
Allows users to preserve previously run queries for comparison purposes.
Displays records in reverse chronological order, giving users quick access to the latest records.
Better documentation:
Tool-tips assist users in performing tasks
Static help pages with "how-to" documents, manuals and glossary of terms can be opened in separate window for consultation.
INIS-Features and Benefits
Improved configurability:
Allows users to fully customize the search mask and search results pages
The interface can be used in English, German and Spanish, with Portuguese to be added soon. More languages can be added upon demand
Anonymous users can register their own profiles and enjoy personalized features
Improved Index/Authority Navigator with search-composing assistant (CTRL-CLICK)
Increased data export capabilities: new formats (XML, Excel, formatted text, delimited text, HTML), sorting of exports
The type-ahead, search-ahead functionality "INIS Suggest" assists users when entering search terms and shows the hit count before the search is executed; this provides additional useful information when composing queries
Searches are much faster, now enabling queries that used to time out in the old system. Most queries are estimated to be between 5 and 20 times faster
INIS-Features and Benefits
Support for concurrent users: a round-robin load balancer distributes the load among different databases
Improved maintenance: all update procedures are automated, require no human intervention and notify administrators in case of problems
Zero downtime per week: updates are transparent to users, who can use the system 24/7 without performance degradation.
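Round-robin load balancing, as mentioned for concurrent users, simply hands each new request to the next database in a fixed rotation. The sketch below shows the idea only; the class name and the backend names "db-1" to "db-3" are hypothetical and do not describe the actual INIS deployment.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed, repeating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(backends)

    def next_backend(self):
        """Return the backend that should serve the next request."""
        return next(self._rotation)


# Hypothetical database replicas.
balancer = RoundRobinBalancer(["db-1", "db-2", "db-3"])
assigned = [balancer.next_backend() for _ in range(5)]
```

Five consecutive requests go to db-1, db-2, db-3, db-1, db-2, so each replica receives an equal share over time regardless of how heavy individual queries are.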
AgroExplorer
A meaning based multilingual search engine.
Agriculture domain.
UNL is used as interlingua.
Supports English, Hindi and Marathi.
Methodology
User phrases the query in native language.
System translates it to Universal Networking Language (UNL).
UNL corpus is searched.
Related documents in UNL are fetched.
Fetched documents are converted to native language.
AgroExplorer
Query Output
Complete Expression Matching.
Retrieves completely relevant documents where query UNL graph is a subgraph of any sentence UNL graph.
Partial Expression Matching
Retrieves relevant documents where query UNL graph is a part of any sentence UNL graph.
Universal Word Matching
Search on Universal words which are concepts, not just keywords.
Keyword Based Matching.
Traditional search. Lucene search engine used.
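The first three matching levels can be sketched by treating a UNL graph as a set of (head, relation, tail) triples. This is a simplified illustration, not AgroExplorer's implementation: the triples below are hand-written rather than produced by a UNL converter, and reading "partial" as "at least one shared relation" is an assumption.

```python
def universal_words(graph):
    """All node labels that appear in a set of (head, relation, tail) triples."""
    return {word for head, _, tail in graph for word in (head, tail)}


def match_level(query, sentence):
    """Classify how well a query graph matches a sentence graph."""
    if query <= sentence:
        return "complete"        # query graph is a subgraph of the sentence
    if query & sentence:
        return "partial"         # at least one relation is shared
    if universal_words(query) & universal_words(sentence):
        return "universal word"  # only concepts are shared
    return "none"


# Hand-written triples for a sentence like "The farmer grows rice in the field".
sentence = {
    ("grow", "agt", "farmer"),
    ("grow", "obj", "rice"),
    ("grow", "plc", "field"),
}

complete = match_level({("grow", "obj", "rice")}, sentence)
partial = match_level({("grow", "obj", "rice"), ("grow", "agt", "company")}, sentence)
concept_only = match_level({("sell", "obj", "rice")}, sentence)
```

A query about growing rice matches completely, a query about a company growing rice matches partially, and a query about selling rice shares only the concept "rice".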
Multilingual Information Retrieval
Need
Document collection contains documents in many languages.
Users may not be fluent enough to express the query in the document language.
Approaches
Machine translation for text translation
Thesaurus/Dictionary Based
Corpus Based (Sub word clusters)
Our Approach – Functional Architecture
Example…
Commercial Description:
1. Automobile Radio and Stereo Retail Store;
2. Automobile Engine Rebuilding, Repair, and Exchange Workshop;
3. Car Repair and Retail Shop;
4. Jeep Repair and Retail Shop; and
5. Motor Mending and Replacement Workshop.
Example…
For our search, we shall compare these encoding and retrieval techniques:
a flat list of words,
a structured list of words,
a flat list of word senses plus the linguistic ontology, and
a structured list of word senses, using WordNet’s ontology.
Method – Flat list of Words
Both the recall and the precision of this method are very poor.
NO.  QUERY              DESCRIPTIONS FOUND
1    Automobile         1, 2
2    Automobile Retail  1
3    Car Repair         3
4    Motor Repair       -
5    Engine Repair      2
6    Motor Exchange     -
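The flat-list method can be sketched as plain word matching: a description is returned only if it contains every word of the query. The tokenizer and the function names are assumptions; the five descriptions are the ones from the example.

```python
import re

DESCRIPTIONS = {
    1: "Automobile Radio and Stereo Retail Store",
    2: "Automobile Engine Rebuilding, Repair, and Exchange Workshop",
    3: "Car Repair and Retail Shop",
    4: "Jeep Repair and Retail Shop",
    5: "Motor Mending and Replacement Workshop",
}


def words(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def flat_search(query):
    """Numbers of the descriptions that contain every query word."""
    wanted = words(query)
    return [number for number, text in DESCRIPTIONS.items()
            if wanted <= words(text)]


results = {query: flat_search(query) for query in [
    "Automobile", "Automobile Retail", "Car Repair",
    "Motor Repair", "Engine Repair", "Motor Exchange",
]}
```

"Motor Repair" finds nothing because description 5 says "Mending" rather than "Repair", and "Automobile" misses the car and jeep shops. Both are vocabulary mismatches that literal word matching cannot bridge.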
Method – Structured list of Words
NO.  BUSINESS TYPE  ACTIVITY     OBJECT  MARKET AREA
1    Store          Retail       Radio   Automobile
     Store          Retail       Stereo  Automobile
2    Workshop       Rebuilding   Engine  Automobile
     Workshop       Repair       Engine  Automobile
     Workshop       Exchange     Engine  Automobile
3    Shop           Retail       Car
     Shop           Repair       Car
4    Shop           Retail       Jeep
     Shop           Repair       Jeep
5    Workshop       Replacement  Motor
     Workshop       Mending      Motor
Method – Structured list of Words
Recall remains the same because we have not eliminated the semantic-match problems.
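A structured list can be sketched as records with named fields that are queried field by field. The field names, the `Record` type, and the decision to store Car, Jeep and Motor in the object field with no market area are assumptions based on the structured table; the values still come from the five example descriptions.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    number: int
    business_type: str
    activity: str
    obj: str
    market_area: Optional[str] = None


RECORDS = [
    Record(1, "Store", "Retail", "Radio", "Automobile"),
    Record(1, "Store", "Retail", "Stereo", "Automobile"),
    Record(2, "Workshop", "Rebuilding", "Engine", "Automobile"),
    Record(2, "Workshop", "Repair", "Engine", "Automobile"),
    Record(2, "Workshop", "Exchange", "Engine", "Automobile"),
    Record(3, "Shop", "Retail", "Car"),
    Record(3, "Shop", "Repair", "Car"),
    Record(4, "Shop", "Retail", "Jeep"),
    Record(4, "Shop", "Repair", "Jeep"),
    Record(5, "Workshop", "Replacement", "Motor"),
    Record(5, "Workshop", "Mending", "Motor"),
]


def structured_search(**criteria):
    """Description numbers with a record matching every given field exactly."""
    hits = {record.number for record in RECORDS
            if all(getattr(record, field) == value
                   for field, value in criteria.items())}
    return sorted(hits)


engine_repair = structured_search(activity="Repair", obj="Engine")
motor_repair = structured_search(activity="Repair", obj="Motor")
```

Searching for activity "Repair" on object "Motor" still returns nothing, because the stored activity is "Mending". Structuring the fields does not by itself resolve synonyms.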
Method –WordNet Synset and Linguistic ontology
NO.  DISAMBIGUATED DESCRIPTION
1    [car, auto, automobile, machine, motorcar], [radio receiver, receiving set, radio set, radio, tuner, wireless], [stereo, stereo system, stereophonic system], [retail, sell retail], [shop, store]
2    [car, auto, automobile, machine, motorcar], [engine], [rebuilding], [repair, fix, fixing, mending, reparation], [substitution, exchange], [workshop, shop]
3    [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
4    [jeep, landrover], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
5    [motor], [repair, fix, fixing, mending, reparation], [replacement, replacing], [workshop, shop]
Method – Flat list of Word senses and Linguistic ontology
NO. | DISAMBIGUATED QUERY | DESCRIPTIONS FOUND
1 | [car, auto, automobile, machine, motorcar] | 1, 2, 3, 4
2 | [car, auto, automobile, machine, motorcar], [retail, sell retail] | 1, 3, 4
3 | [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation] | 2, 3, 4
4 | [motor], [repair, fix, fixing, mending, reparation] | 2, 5
5 | [locomotive, engine, locomotive engine, railway locomotive], [repair, fix, fixing, mending, reparation] | -
6 | [motor], [substitution, exchange] | 2, 5
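Matching on word senses with an ontology can be sketched as follows: each description is stored as a set of sense identifiers, and a query sense matches a description if the description contains that sense or a more specific one. The sense identifiers and the three is-a links (jeep is a car, engine is a motor, replacement is an exchange) are hand-picked assumptions chosen to mirror the results table; they are not read from WordNet.

```python
# One identifier per disambiguated synset in the five descriptions.
DESCRIPTIONS = {
    1: {"car", "radio", "stereo", "retail", "store"},
    2: {"car", "engine", "rebuilding", "repair", "exchange", "workshop"},
    3: {"car", "repair", "retail", "store"},
    4: {"jeep", "repair", "retail", "store"},
    5: {"motor", "repair", "replacement", "workshop"},
}

# Hypothetical is-a links: specific sense -> more general sense.
PARENT = {"jeep": "car", "engine": "motor", "replacement": "exchange"}


def with_ancestors(sense):
    """The sense itself plus every more general sense above it."""
    chain = {sense}
    while sense in PARENT:
        sense = PARENT[sense]
        chain.add(sense)
    return chain


def sense_search(query_senses):
    """Descriptions whose senses, or their generalizations, cover the query."""
    hits = []
    for number, senses in DESCRIPTIONS.items():
        covered = set().union(*(with_ancestors(s) for s in senses))
        if set(query_senses) <= covered:
            hits.append(number)
    return hits


car_hits = sense_search(["car"])
motor_repair_hits = sense_search(["motor", "repair"])
locomotive_hits = sense_search(["locomotive", "repair"])
```

The query for the "motor" sense now also finds the engine workshop, and the "car" query finds the jeep shop, while the locomotive sense of "engine" correctly finds nothing.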
Method – Flat list of Word senses and Linguistic ontology
Decouple the user vocabulary from the data vocabulary, by covering the most common English words;
Increase recall, by exploiting the hierarchy to make generic queries and recognizing synonyms;
Increase precision, through the disambiguation mechanism and the ability to navigate the hierarchy to select specific queries.
Conclusion
Meaning based search engines can capture the concept or idea expressed by the user in the query and can thus provide more accurate results than traditional keyword search engines.
Universal Networking Language (UNL) can be used as an effective interlingua to represent information in documents written in natural languages.
Multilingual search engines can help users access documents written in languages other than the query language.
Future Work
Most limitations of MLIR systems are caused by the lack of a large scored, multilingual corpus and by the adverse effects of polysemous words.
Research efforts are being directed towards these problems and towards approaches that use interlinguas such as UNL and subword clusters effectively for MLIR.
References
W. Bruce Croft, “What Do People Want from Information Retrieval?”, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst.
Joe Barker (jbarker@library.berkeley.edu) and John Kupersmith (jkupersm@library.berkeley.edu), “Beyond Google”, a “Know Your Library” Workshop, Teaching Library, University of California, Berkeley, Fall 2006.
D.W. Oard and B.J. Dorr, A survey of multilingual text retrieval, Institute of Advanced Computer Studies and Computer Science Department, University of Maryland, 1996.
Mrugank Surve, Sarvjeet Singh, Satish Kagathara, AgroExplorer Group, and Pushpak Bhattacharyya, AgroExplorer: a Meaning Based Multilingual Search Engine, International Conference on Digital Libraries, Delhi, India, February 2004.
The UNL Center, The Universal Networking Language (UNL) Specifications, UNDL Foundation, 3rd edition, December 2004.
S. Singh, A Multilingual Meaning Based Search Engine, B.Tech Project Report, Indian Institute of Technology Bombay, 2003.
U. Hahn, K. Marko, S. Schulz, Subword Clusters as Light Weight Interlingua for Multilingual Document Retrieval, Proceedings of the 10th Machine Translation Summit of the International Association for Machine Translation (MT-Summit X), Phuket, Thailand, 2005.
References (cont)
K. Marko, U. Hahn, S. Schulz, P. Daumke, and P. Nohama, Interlingual indexing across different languages, In RIAO 2004 – Conference Proceedings, Avignon, France, 26-28 April 2004.
Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, The pagerank citation ranking: Bringing order to the web, Technical report, Stanford Digital Library Technologies Project, 1998.
K. Marko, S. Schulz, A. Medelyan and U. Hahn, Bootstrapping Dictionaries for Cross Language Information Retrieval, In SIGIR 2005, Proceedings of the 28th Annual International ACM SIGIR Conference, Salvador, Brazil, August 15-19, 2005.