COMAD Tutorial on
Multi-lingual Database Systems
Jayant Haritsa
Indian Institute of Science Bangalore, India
Database Systems Laboratory
Tutorial Contents
• Motivation
• Multilingual Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
Multilingual Portal
• Tracks History ( User History )
• Generates Recommendations ( User Pref. & Mined Data )
• Generates Incentives ( User Pref. & Mined Data )
• Product Categories ( Meta-Data )
In Database-speak…
• Data: 1 TB live; 25 TB warehouse
• DBMS: 26
• Systems: 1500 servers
• Deployment: a set of monolingual DBMSs
• Sub-second response time
What is needed in a DBMS to make such a portal multilingual?
Example Multilingual Database
• Books.com
[Table: a Books relation with columns Price, Title, Author_FN, Author, Category, Language; sample rows span English, Tamil, Hindi, French, Italian, Arabic, Greek and Japanese. For example: $49.95 | History of Civilization | Will/Ariel Durant | History | English; €19.95 | L'Histoire De La France | François Lebrun | Histoire | French; £35.00 | History and Historians | Mark T. Gilderhus | Historiography | English; £15.00 | Letters to My Daughter | Autobiography | English; €99.95 | Les Méditations Métaphysiques | René Descartes | Philosophie | French; €75.00 | Il Coronation del Virgin | Bicci Nero | Arti Fini | Italian; ¥7500 | 無門關 | 無門 慧開 | 禅 | Japanese; plus Tamil, Hindi, Arabic and Greek entries in their native scripts.]
Granularity of Multi-lingualism
• Uni-lingual rows, multi-lingual columns
• Uni-lingual columns, multi-lingual rows
• Multi-lingual rows, multi-lingual columns
• Multi-lingual attribute values
Why Worry about Multilingual Data?
• Growing Multilingual On-line Users and Data
– By 2010, most web pages will be multilingual
• Today, English is down to 35%, from 90% in 1995
– The non-native-English-speaking online population has grown rapidly,
from about one-third in the mid-90’s to about two-thirds in the mid-00’s
• E-commerce Implications
– Opens up enormous new markets
– Users are four times more likely to buy a product or a service if the information is presented in their native language [Aberdeen]
• E-governance Implications
– Opens up communication to native communities
Sample Applications
• Vidyanidhi
– Portal for all Indian research theses, hosted at the Univ. of Mysore, Karnataka, India
– Contains close to 100,000 records in English, Hindi and Kannada
• Bhoomi
– Computerized land-record system in the State of Karnataka, India
• stores 20 million records with composite information in Kannada and English
– to be followed in all other states as well
MLDB Research Questions (1)
• Are today’s database systems equally (a) functional and (b) efficient across all human languages?
– i.e. is the DBMS “natural-language-neutral” ?
– Specifically, is there a preference for Latin-script based languages (English, French, German, …) as compared to those based on other scripts
(Arabic, Cyrillic, CJK, Indic, …) ?
MLDB Research Questions (2)
• Are new functionalities desired from a multilingual database system?
– i.e. multi-lingual SQL operators ?
Organization
• Motivation
• Multilingual Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
Multilingual Functionality
To assess the support offered by current database
systems and standards for multilingual data
Background – Character Encoding
• A character is the smallest component of a written language that has semantic value. The set of all the characters in a language is called a repertoire.
• A character encoding assigns a unique numerical value to each of the characters in a repertoire.
– Several encodings are available
Multilingual Character Encoding
• ASCII [&ISO:8859] Encoding
– 7-bit [8-bit] for English [Western European]
• ISCII Encoding
– 8-bit (proprietary) encoding for Indic Languages
• ISO:10646 – Universal Character Set (UCS-2)
– Uniform 2-Byte encoding for all languages
• Unicode Encoding
– 2-byte encoding along the lines of UCS-2 (UTF-16)
• The default standard for Multilingual Data Storage in DBMS
– Has a variable-byte encoding (UTF-8) that favors
ASCII (Western European Languages)
Unicode
• Unicode is a uniform 2-byte encoding standard that allows storage of characters from any known alphabet or ideographic system, irrespective of platform or programming environment.
• Unicode code points are arranged in Character Blocks, which contiguously encode the characters of a given script (usually a single language).
• Unicode has 3 different byte encodings – UTF-8, UTF-16 and UTF-32 – which store the same character in one-byte, two-byte or four-byte code units.
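As a quick sanity check of these encoding forms, the snippet below (plain Python, built-in codecs only) encodes a Latin character and a Tamil character under all three; only the byte lengths are the point here:

```python
# The same characters under the three Unicode encoding forms
# (big-endian variants, without byte-order marks).
for ch in ("A", "\u0BA8"):            # Latin 'A' and Tamil letter NA
    print(ch,
          len(ch.encode("utf-8")),      # 1 byte for 'A', 3 for Tamil
          len(ch.encode("utf-16-be")),  # 2 bytes each
          len(ch.encode("utf-32-be")))  # 4 bytes each
```

This is exactly why UTF-8 favors ASCII-centric text: the Latin letter costs one byte, while the Indic letter costs three.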
Sample Encodings

| String   | Lang.   | Encoding         | Representation (Hexadecimal)                          |
| Narayan  | English | ASCII            | 4E.61.72.61.79.61.6E                                  |
| Narayan  | English | Unicode (UTF-16) | 00.4E.00.61.00.72.00.61.00.79.00.61.00.6E             |
| Narayan  | English | Unicode (UTF-8)  | 4E.61.72.61.79.61.6E                                  |
| நாராயன்  | Tamil   | ISCII            | A8.BE.B0.BE.AF.A9.CD                                  |
| நாராயன்  | Tamil   | Unicode (UTF-16) | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD             |
| நாராயன்  | Tamil   | Unicode (UTF-8)  | E0.AE.A8.E0.AE.BE.E0.AE.B0.E0.AE.BE.E0.AE.AF.E0.AE.A9.E0.AF.8D |
| ನಾರಾಯಣ್  | Kannada | ISCII            | A8.BE.B0.BE.AF.A3.CD                                  |
| ನಾರಾಯಣ್  | Kannada | Unicode (UTF-16) | 0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD             |
| 寺井正博 | Kanji   | Unicode (UTF-16) | 5B.FA.4E.95.6B.63.53.5A                               |
| 寺井正博 | Kanji   | Unicode (UTF-8)  | E5.AF.BA.E4.BA.95.E6.AD.A3.E5.8D.9A                   |
Database Systems
Commercial systems: Oracle, Microsoft SQL Server, IBM DB2
Public-domain systems: MySQL, PostgreSQL
For legal reasons, we will randomly refer to them as Systems A, B, C, D, E
Standards and Systems
[Comparison table across the SQL:1999 standard and Systems A–E; dimensions: Storage Format, Support Level, Locales, Collations, Indexing, Query Processing, Cross-Language Query Support.]
• SQL:1999 Standard: National Character (NCHAR) types; system-defined and user-definable collations; support at the data + meta-data level; no specification for locales or for cross-language querying
• Storage formats across the systems range over Unicode UTF-8, UTF-16, UCS-2, OS-specified and plain binary; locale support ranges from ~25 to ~50 locales
• Collations are pre-defined in the commercial systems, and user-definable (via source changes) in the public-domain ones; indexing and query processing follow the character collations
• Cross-language query support: none, in any system
Remarks
Database systems are generally equivalent in their storage and querying of multilingual data, and offer similar SQL querying power …
However:
– uniformly, no cross-language support
– unknown differential performance
10,000’ View …
[Diagram: three representation levels for TEXT DATA (proper names, documents, string values, category attributes):
• Text strings (visual): grapheme image; encodings (ASCII, Unicode, …); scripts (English, Hindi, …); examples: Nehru, Amazon, and their Tamil and Hindi renderings
• Phoneme strings (aural), via phonemic transformation: encodings (ITrans, Unicode, …); normalized representations (IPA, Arpabet, …); example: /N/ /ae/ /R/ /oo/
• Concepts (semantics), via semantic transformation: abstracted synsets (WordNet taxonomies); multilingual synsets (WordNet)
At the text level: unicode ⇔ cuniform.]
Organization
• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
Multilingual Performance
Are the DBMS’s natural-language neutral,
wrt performance?
Database Setup
• Generated a 1 GB TPC-H Database
• Tables modified to hold both original CHAR [ASCII] Data and equivalent NCHAR [Unicode] Data
• Experiments conducted with
– separate CHAR and NCHAR tables
– common tables (to eliminate the impact of disk I/O)
• Example: PartSuppCommon table with equivalent Tamil data
[Table: Part ID | PartName (ASCII) | PartName_NChar (Unicode, Tamil) | Supplier ID | SuppName (ASCII) | SuppName_NChar (Unicode, Tamil); e.g., 18 | Part Name #000018 | (Tamil equivalent) | 2503 | Supplier #2503 | (Tamil equivalent).]
System and DBMS Setup
• Stand-alone Pentium-IV running Windows 2000
– Cold-start ensured before each experiment
• DBMS
– Oracle, DB2, SQL Server, Postgres
– Installed with default configurations
• Display time nullified through aggregate functions in the select clause
Database Operators Measured
• Table-Scan
– Time for scanning for a specific key
• Index-Create and Index-Scan
– Time for creating index
– Time for retrieving 20% of search keys in index
• Sort
– Time for sorting the attribute
• Join [Nested-Loop, Hash, Sort-Merge]
– Join types forced by Optimizer Hints, Setting Optimization Levels, etc.
– Plan pictures verified the use of appropriate join type in a query
Sample Queries
Join Operator:

select count(*)
from PartSuppCommon PS1, PartSuppCommon PS2
where PS1.SuppName = PS2.SuppName
      { PS1.SuppName_NChar = PS2.SuppName_NChar }
  and PS1.PartName <> PS2.PartName
      { PS1.PartName_NChar <> PS2.PartName_NChar }

Table-Scan Operator:

select count(*)
from PartSuppCommon PS1
where PS1.SuppName = 'Supplier #2503'
      { PS1.SuppName_NChar = '(Supplier #2503 in Tamil)' }
Performance Metrics
• Operator Performance (MRO_Oper)
– Measures the operator differential performance between Char and NChar:
MRO_Oper = T_NChar / T_Char   (ideal value: 1)
• Database Relative Efficiency (ME_DBMS)
– Measures the database differential performance between Char and NChar:
ME_DBMS = GM_Char / GM_NChar   (ideal value: 1)
where GM is the geometric mean of the operator times
• Optimizer Prediction Equity (MPE_Oper)
– Measures the optimizer prediction [in]equity between Char and NChar:
MPE_Oper = (O_NChar / O_Char) / (T_NChar / T_Char)   (ideal value: 1)
where O is the optimizer-estimated cost and T the measured time
Operator Performance (Common Table)
• MRO_Oper metric (ideal value = 1)

| Operator           | System A | System B | System C | System D |
| Table-Scan         |   2.72   |   1.33   |   1.06   |   1.12   |
| Index-Create       |   1.21   |   1.25   |   1.39   |   1.32   |
| Index-Scan         |   2.75   |   1.35   |   1.97   |   1.01   |
| Sort               |   1.81   |   1.48   |   1.23   |   1.24   |
| Join (Nested-Loop) |   1.03   |   1.03   |   1.59   |   1.36   |
| Join (Hash)        |   2.60   |   1.35   |   1.29   |   1.79   |
| Join (Sort-Merge)  |   1.92   |   1.55   |   1.34   |   2.70   |

• In a nutshell:
– Wide variation in system performance
– Slowdown can be as much as 200%
– Generally, 30–100% slower for NChar
All operators are slow on multilingual data.
Overall DBMS Performance
• ME_DBMS metric (ideal value = 1)

| Database System   | Efficiency |
| Database System A |    0.57    |
| Database System B |    0.80    |
| Database System C |    0.70    |
| Database System D |    0.69    |

• Databases are about 50% to 100% inefficient in the multilingual world
– Note: this is a conservative estimate, since only in-memory differentials are considered (common table)
– With separate tables, the inefficiency jumps to several hundred percent (e.g., slowdown was up to 475% for Scans and up to 275% for Joins)
Query Optimizer Performance
• MPE_Oper metric (ideal value = 1)

| Operator           | System A | System B | System C | System D |
| Table-Scan         |   0.37   |   0.75   |   0.94   |   0.89   |
| Index-Scan         |   0.38   |   1.55   |   0.31   |   0.37   |
| Join (Nested-Loop) |   0.97   |   0.97   |   1.16   |   0.99   |
| Join (Hash)        |   1.26   |   0.75   |   1.22   |   0.55   |
| Join (Sort-Merge)  |   0.89   |   1.20   |   0.95   |   0.74   |

• In a nutshell:
– Generally, 5–100% mis-prediction
– Could be due to erroneous cost models between Char and NChar
Could lead to grossly inefficient plans.
Analysis of Performance
• Experiments on DBMS system A
– System A exhibited worst differential performance
• Our Objective:
– What are the components of the slowdown?
– How can these be addressed in improving
performance?
Slowdown vs String Size
• How does the slowdown vary with (logical) String Length?
• High differential in Scan at small string sizes indicates:
– high fixed-cost overheads (such as call overhead)
• Increasing differential cost with string size indicates:
– higher variable-cost overheads (such as string handling in function calls)
Slowdown w.r.t. Data Type
• We created Common Table with:
– Char attributes of size 110; NChar attributes of size 55
– The attributes have the same physical size, but are of different types
• We ran the same queries
– Call overheads are the same
– Only difference is in Datatype specific code-segments in common operator implementation
• The observed differential performance is ~10-15% of corresponding Operator Slowdowns
– Small, but not insignificant.
Slowdown wrt. String Processing
• Created Common Table (as before)
– All NChar attributes replaced with Char attributes of twice the size
• Ran the same queries
– Disk I/O & Call Overheads are the same, except the data being passed as parameter to the operator functions is different in size, same in type
– Measures any differences in in-memory handling of different sized strings in a common operator implementation
• The observed differential performance is ~80-90%
of the corresponding Operator Slowdowns
– This component contributes primarily to the operator slowdown
Overall Performance Analysis
• The slowdown of NChar over Char (including disk I/O) is very large (several hundred percent)
• The slowdown of NChar over Char (considering only in-memory processing) is still large (50 to 100%):
– primarily, due to the size of the NChar strings (~85%)
– secondarily, due to the type-specific implementation (~10%)
• Hence, to improve performance, the size of the NChar storage must be tackled …
Cuniform Storage Format
Two Observations…
#1: Unicode = Character Block + Offset
– about half the bits encode the character block
[Example: Tamil நாராயன் is A8.BE.B0.BE.AF.A9.CD in ISCII and 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD in Unicode (UTF-16); every character carries the same block byte, 0B.]
#2: An attribute value is likely to be in ONE script.
[Example: a Books relation (ISBN, Title, Price, Author) mixing English rows, e.g., 992 | Discovery of India | $12.95 | Nehru, with Hindi rows; each individual attribute value is in a single script.]
Cuniform – Compressed UNIcode FORMat
• Store each string data item as an ordered pair:
– the common Character Block
– the Offset of each character, in the common character block
• Original Unicode (UTF-16) strings …
– Narayan (English): 00.4E.00.61.00.72.00.61.00.79.00.61.00.6E
– நாராயன் (Tamil): 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD
– ನಾರಾಯಣ್ (Kannada): 0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
• … after skinning into Cuniform pairs:
– Narayan: ( 00 ; 4E.61.72.61.79.61.6E )
– நாராயன்: ( 0B ; A8.BE.B0.BE.AF.A9.CD )
– ನಾರಾಯಣ್: ( 0C ; A8.BE.B0.BE.AF.A3.CD )
– a mixed-script value such as "RK ನಾರಾಯಣ್": ( NULL ; the full Unicode string, stored as-is )
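The pairing idea can be sketched as below. This is a minimal illustration under the simplifying assumption that the character block is just the high byte of the code point (real Unicode blocks are variable-sized ranges), with hypothetical helper names:

```python
def to_cuniform(s: str):
    """Skin a string into a (block, offsets) pair: block = the high
    byte shared by all code points, offsets = the low bytes. A
    mixed-script string falls back to (None, full Unicode string)."""
    blocks = {ord(c) >> 8 for c in s}
    if len(blocks) == 1:
        return blocks.pop(), bytes(ord(c) & 0xFF for c in s)
    return None, s

def from_cuniform(pair):
    """Trivial reverse conversion, for presenting results."""
    block, payload = pair
    if block is None:
        return payload
    return "".join(chr((block << 8) | b) for b in payload)

tamil = "\u0BA8\u0BBE\u0BB0\u0BBE\u0BAF\u0BA9\u0BCD"  # நாராயன்
block, offsets = to_cuniform(tamil)
print(hex(block))         # 0xb: the Tamil block's high byte
print(offsets.hex("."))   # a8.be.b0.be.af.a9.cd
assert from_cuniform((block, offsets)) == tamil
```

Note how the offset string is half the size of the UTF-16 form, which is exactly where the space (and in-memory processing) savings come from.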
Implementation of Cuniform
• Transparently remap all the SQL queries to work on the Cuniform Pairs
• For presentation of results, conversion from
Cuniform to Unicode is trivial
Operator Performance on Cuniform
• Outside-the-engine implementation
• Space occupancy
– only 2% larger than Char; compare this with NChar’s 100% overhead

| Operator           | Unicode Slowdown | Cuniform Slowdown |
| Table-Scan         |       2.56       |       1.05        |
| Index-Scan         |       1.88       |       1.99        |
| Join (Nested-Loop) |       1.04       |       1.03        |
| Join (Hash)        |       2.74       |       1.22        |
| Join (Sort-Merge)  |       1.99       |       1.15        |

Largely, the differential performance is eliminated (except for Index-Scan).
Cuniform Performance Summary
• Overall, generally better performance
– Exception: the index tree is built on a pair of attributes, resulting in worse Index-Scan performance
– Multilingual Efficiency is up to 0.81, from 0.57
• With an inside-engine implementation, it can be made even better
– the performance becomes almost natural-language-neutral …
Remarks
There is a performance barrier separating languages in Latin script (e.g., English)
from those in other scripts (e.g., Indic
languages), but this barrier can be largely
broken down with the Cuniform format …
Organization
• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research
New Multilingual Operators
Objective:
To assess the new operators desired from a
multilingual database system
MLNameJoin Operator
† Referred to as LexEQUAL in published work
Multilingual Books.com
(with Language Information)
[Table: the Books relation as before (Price, Title, Author_FN, Author, Category, Language), now showing the Language column; e.g., £15.00 | Letters to My Daughter | Jawaharlal | Nehru | Autobiography | English; $49.95 | History of Civilization | Will/Ariel | Durant | History | English; €99.95 | Les Méditations Métaphysiques | René | Descartes | Philosophie | French; alongside Tamil, Hindi, Italian, Arabic, Greek and Japanese rows.]
Multilingual Selection
Suppose a user wants the books of “Nehru” in English, Tamil and Hindi …
[Table: the matching rows; £15.00 | Letters to My Daughter | Jawaharlal Nehru | English, plus a Tamil row with author நேரு and a Hindi row with author नेहरू.]
Current SQL Approach

Select Author, Title, ... From Books
Where Author = 'Nehru'
   or Author = 'நேரு'
   or Author = 'नेहरू' ...

• Problems with this approach
– The user needs to be fluent in all the target languages
– Specialized lexical resources (fonts, keyboard mappings, etc.) are needed for input
– Input is error-prone, due to the lack of directory support
• Further, it CANNOT BE USED TO EXPRESS A JOIN ACROSS MULTILINGUAL COLUMNS
Proposed MLNameJoin Query

Select Author, Title, ... From Books
where Author MLNameJoin “Nehru” Threshold 0.2
      InLanguages { English, Tamil, Hindi }

• Input in a convenient language, with multilingual output
• Equivalence based on intuitive phonetic correspondence
– restricted to proper names (which form about 20 percent of text corpora)
• Customizable fuzzy matching
– Threshold parameter
• Most importantly, extensible …
– “Retrieve in All Languages” ( InLanguages { * } )
– Join ( Author MLNameJoin Faculty )
Matching Strategy
• Store the multilexical strings in the database
– in Unicode (or Cuniform)
• To match, transform them to equivalent phonemic strings in the IPA alphabet, using standard Text-to-Phoneme (TTP) converters …
• … and compare the transformed strings using approximate matching techniques
– needed because of incompatibilities between the phonemes of different languages
MLNameJoin transforms matching from textual space to phonemic space.
Example MLNameJoin Operation
Books table (the last column, Author (Phonemes), is generated):
[Table: Author | Title | Language | Author (Phonemes); rows include Nehru | Discovery of India | English | næhru; நேரு | (Tamil title) | Tamil | næru; नेहरू | (Hindi title) | Hindi | næhru; Nero | The Coronation of the Virgin | English | nerou; Franklin | Ein Amerikanischer Autobiographie | German | frAŋklın.]

The query:

Select Author ... From Books
where Author MLNameJoin “Nehru” Threshold 0.2 InLanguages { English, Tamil, Hindi }

will be executed as:
1. Transform “Nehru” to a phonemic string (via the English TTP): “næhru”
2. Retrieve all records whose Language is one of English, Tamil or Hindi, and whose phoneme string is within edit distance 1 of “næhru”
MLNameJoin Implementation Goals
• Accurate & Efficient Matching across languages
• Minimum changes to the DB Server
State of the Art
• No support for multilexical matching in commercial DBMS
– Soundex approximations exist for Latin-based scripts
• Approximate matching algorithms
– No approximate matching is supported in DBMS
• However, UDFs can partially solve this problem
• Phonetic matching in the IR & speech-processing communities
– [Zobel-SIGIR96] in English, using Soundex-type algorithms
– Speech-processing research in online speech processing
• Proprietary solutions
– LASA (look-alike/sound-alike) system for the FDA
MLNameJoin Function
Steps:
1. Convert the input strings to phonemes, and find the edit distance between the phonemic equivalents
2. If (edit-distance < Threshold × size of the query string), return TRUE
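The steps above can be sketched as follows; this is a minimal sketch that takes already-converted phonemic strings (the TTP step is external, as in the tutorial's architecture):

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def ml_name_join(query_phonemes: str, stored_phonemes: str,
                 threshold: float) -> bool:
    # Accept iff the phonemic edit distance is within the user-set
    # fraction of the query string's length (step 2 on the slide).
    dist = edit_distance(query_phonemes, stored_phonemes)
    return dist < threshold * len(query_phonemes)

# "Nehru" -> "næhru" (English TTP), Tamil நேரு -> "næru":
# distance 1, accepted at threshold 0.25 since 1 < 0.25 * 5 = 1.25.
print(ml_name_join("næhru", "næru", 0.25))  # True
```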
MLNameJoin Parameters
• Match Threshold
– Specifies the level of tolerance for mismatch between the phonemic strings
– Tunable for Matching
• User-Settable (per Query), or Global (for an application)
– Threshold varied in [0, 1]
• 0 => only perfect matches are accepted, 1 => anything can be matched
• Set of Output Languages
– Those languages of interest to the multilingual user
EditDistance Function
Steps:
1. Basic dynamic-programming algorithm to find the edit distance
2. InsCost, DelCost and SubCost (the cost for each operation) are parameterized
3. The substitution cost is driven by the clusters of phonemes and the intra-cluster substitution cost
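A sketch of the parameterized edit distance; the two-cluster table here is an invented toy, not the linguist-built clustering the tutorial assumes:

```python
# Toy phoneme clusters (invented): substituting within a cluster is
# cheaper than across clusters.
CLUSTERS = {"a": 0, "æ": 0, "e": 0,    # vowel-like cluster
            "r": 1, "l": 1}            # liquid cluster

def sub_cost(x: str, y: str, intra: float = 0.25) -> float:
    if x == y:
        return 0.0
    cx, cy = CLUSTERS.get(x), CLUSTERS.get(y)
    return intra if cx is not None and cx == cy else 1.0

def param_edit_distance(a: str, b: str,
                        ins: float = 1.0, dele: float = 1.0,
                        intra: float = 0.25) -> float:
    """Standard DP, but with parameterized InsCost, DelCost and a
    cluster-aware SubCost."""
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * dele]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + dele,
                           cur[j - 1] + ins,
                           prev[j - 1] + sub_cost(ca, cb, intra)))
        prev = cur
    return prev[-1]

# 'æ' and 'e' sit in the same cluster, so "næru" vs "neru" costs
# only the intra-cluster substitution:
print(param_edit_distance("næru", "neru"))  # 0.25
```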
EditDistance Parameters
• Cost Functions
– InsCost, DelCost and SubCost may be set
– This simulates different types of edit distances
• Intra-Cluster Substitution Cost
– Phonemes may be clustered based on their like-ness
• Clusters to be formed with linguist’s input
– Matching phonemes within a cluster may be more acceptable, than from outside a cluster
– Cost varied in [0, 1]
• 0 implies all phonemes are equivalent within a cluster
Implementation Architecture
[Diagram: the client sends the query string and match threshold to the database server; the query-processing engine invokes an approximate-matching module, which uses per-language text-to-phoneme converters (TTP_q for the query string, TTP_n for the stored names), the Unicode cost function and the phoneme clusters; the matched strings are returned via the server manager.]
Performance Experiments
Data Sets
Three data sets:

| Set | Description                                            | # of Strings |
|  1  | Indian names (Bangalore telephone directory)           |     300      |
|  2  | Occidental names (San Francisco physicians directory)  |     300      |
|  3  | Generic names (places/chemicals/objects)               |     198      |
| All | Combined set                                           |     798      |

• Equivalent phonemic strings were stored in the IPA alphabet, in Unicode format
• Hand-tagged each set of matchable strings with a Group ID

Data Sets ( Continued …)
Generated data sets:
• Concatenated each string with every other string in the same language
• Each set of matchable strings has the same Group ID
• Generated about 200,000 strings
Phonetic Transformation
• Used standard text-to-phoneme conversion to the IPA alphabet
• For English:
– a web-based TTP converter (http://www.foreignword.com)
– dictionary: Oxford English Dictionary
• For Indic:
– Dhvani TTP (http://dhvani.sourceforge.com), after source modifications
– All Indic languages are phonetic; hence an almost 1-to-1 mapping exists

[Table: Name | Language | Phonemes (IPA); e.g., நாராயன் | Tamil | nArAIQn; Hydrogen | English | haIdr´dZ´n; Amazon | English | Qm´zAn; University | English | jun´vŒrsIti; Christopher (in Tamil script) | Tamil | krIst´p´r; Computer | English | k´mpjut´r.]
Metrics Measured
• Parameters varied
– user Match Threshold
– Intra-Cluster Substitution Cost
• Metrics measured
– M1: # of correctly reported matches
– M2: # of reported matches
– n: number of groups of equivalent names; n_i: # of elements in the i-th group
• Precision: the fraction of the returned results that are correct
Precision = M1 / M2
• Recall: the fraction of the correct results that are returned
Recall = M1 / Σ_{i=1}^{n} C(n_i, 2)
where C(n_i, 2) is the number of matchable pairs within group i.
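A worked instance of these formulas (the group sizes and match counts below are illustrative, not from the experiments):

```python
from math import comb

def precision_recall(m1: int, m2: int, group_sizes):
    """Precision = M1 / M2; Recall = M1 over the total number of
    matchable pairs, i.e. the sum over groups of C(n_i, 2)."""
    total_pairs = sum(comb(n, 2) for n in group_sizes)
    return m1 / m2, m1 / total_pairs

# 3 groups of sizes 3, 2, 2 give 3 + 1 + 1 = 5 matchable pairs;
# reporting 5 matches of which 4 are correct:
print(precision_recall(4, 5, [3, 2, 2]))  # (0.8, 0.8)
```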
Correctness Experiments
Recall & Precision Metrics
• Desired quality of match: (Recall ≈ 1, Precision ≈ 1)
• Analysis of the observed results:
– Recall is reasonable (≥ 0.90) only with threshold e ≥ 0.20
– Precision is reasonable (≥ 0.90) only with e ≤ 0.30
Correctness Experiments
Tuning the Matching
• Best matching point (found empirically, with precision-recall curves):
– Match Threshold ∈ [0.25, 0.3] and Intra-Cluster Subs Cost ∈ [0.25, 0.5]
– For this dataset, Recall is 95% and Precision is 85%
Performance Experiments
Baseline
• The edit-distance function was implemented as a UDF

| Query | Matching Methodology    | Time (Sec) |
| Scan  | Approximate (using UDF) |    1418    |
| Join  | Approximate (using UDF) |    4004    |

Reasons:
– The UDF incurs large costs and runs in an interpreted environment
– Edit distance uses an O(mn) algorithm
– All the strings in the database were compared: O(N)

How to improve the performance?
– Q-Gram technique
– Approximate phonetic indexing
Performance Experiments
Q-Grams
• Key idea:
– Generate and store all q-grams
– Use q-gram filter properties to generate a candidate set (cheap), and prune false positives using the UDF (expensive)
• Three different filters are used
– Length filter: matching strings cannot differ in length by more than k
– Count filter: # of matching q-grams ≥ max(|a|, |b|) − 1 − (k − 1)·q
– Position filter: matching q-grams cannot be more than k positions apart
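The count filter, for instance, can be sketched as below; this is a hedged sketch using '#'-padded q-grams (one common convention, since the slides do not pin the padding down):

```python
def qgrams(s: str, q: int = 2):
    # '#'-padding so that every character takes part in q q-grams.
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def count_filter_passes(a: str, b: str, k: int = 1, q: int = 2) -> bool:
    """Count filter: strings within edit distance k must share at
    least max(|a|, |b|) - 1 - (k - 1) * q q-grams (multiset overlap).
    Survivors form the candidate set; the UDF then prunes the
    false positives."""
    shared, rest = 0, qgrams(b, q)
    for g in qgrams(a, q):
        if g in rest:
            rest.remove(g)
            shared += 1
    return shared >= max(len(a), len(b)) - 1 - (k - 1) * q

print(count_filter_passes("næhru", "næru"))    # True: kept as candidate
print(count_filter_passes("næhru", "banana"))  # False: pruned cheaply
```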
Performance Experiments
Q-Grams (continued … )

| Query | Matching Methodology        | Time (Sec) |
| Scan  | Approximate (using q-grams) |    13.5    |
| Join  | Approximate (using q-grams) |    856     |

• From the baseline: improved by two orders of magnitude in Scan, and five times in Join
– the # of calls to the UDF is reduced tremendously
• Caveats:
– No false dismissals, only false positives
– At the cost of 15X storage space, for storing the q-grams
Performance Experiments
Phonetic Index
• Key idea:
– Use the phoneme clusters to generate a string of integers (like Soundex)
– Convert it to a single number
– Index the phonemic strings in a standard B+-tree index, on that number
• For searching:
– Transform the search string to its index number, and search the index
– The returned set is a candidate set of approximately matching strings
– Use the UDF to weed out false positives
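A toy sketch of the key generation; the cluster map here is invented for illustration, whereas the real clusters come from the linguist-built phoneme clustering:

```python
# Hypothetical cluster map: each phoneme symbol gets a cluster digit,
# Soundex-style, so same-cluster phonemes yield the same digit.
CLUSTER_ID = {"n": 1, "m": 1, "r": 2, "l": 2,
              "æ": 3, "a": 3, "e": 3, "h": 4, "u": 5, "o": 5}

def phonetic_key(phonemes: str) -> int:
    """Map a phonemic string to a single number: concatenate the
    cluster digits and read them as a base-10 integer. Equal keys
    land together in the B+-tree, forming the candidate set."""
    digits = "".join(str(CLUSTER_ID[p]) for p in phonemes if p in CLUSTER_ID)
    return int(digits)

print(phonetic_key("næru"))                          # 1325
print(phonetic_key("nalu") == phonetic_key("næru"))  # True: r/l and æ/a share clusters
```

Because near-matches such as "næru" and "nalu" collapse to the same key, an equality look-up on the B+-tree retrieves the candidates; phoneme strings whose key differs are never examined, which is why false dismissals become possible.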
Performance Experiments
Phonetic Index (Continued…)

| Query | Matching Methodology               | Time (Sec) |
| Scan  | Approximate (using phonetic index) |    0.71    |
| Join  | Approximate (using phonetic index) |    15.2    |

• From the baseline: three orders of magnitude improvement
– the index speeds up the look-up of “like strings” tremendously
• Caveats:
– False dismissals are now possible
– Indexing cost is added
Performance Improvements
(Three Alternative Techniques)
• Technique #1: Metric Distance Index
– Pre-compute the edit distances of the phonetic strings from a given key string
– Use the properties of edit distances to reduce calls to the UDF
• Technique #2: Q-Grams
– Generate and store all q-grams of the phonetic strings
– Use the filter properties of q-grams to reduce calls to the UDF
• Technique #3: Phonemic Indexes
– Convert phoneme strings to numbers (corresponding to phoneme clusters)
– Index the resulting numbers, using B+-trees

| Matching Methodology                  | Scan-Time (Sec) | Join-Time (Sec) | Remarks                                |
| Matching using Metric Distance Index  |      356        |      1728       | not much improvement on the baseline   |
| Matching using Q-Grams                |      13.5       |      856        | 15X storage overhead for the q-grams   |
| Matching using Phonemic Indexes       |      0.71       |      15.2       | ~5-8% false dismissals                 |
Remarks
The MLNameJoin operator, employing phonetic matching, can complement the standard lexicographic operators for cross-language name searches …
… further, it may be implemented efficiently on existing systems.
MLSemJoin Operator
† Referred to as SemEQUAL in published work
Multilingual Books.com
(with Category Information)
[Table: the Books relation as before (Price, Title, Author_FN, Author, Category, Language), highlighting the Category column; e.g., History (English), Histoire (French), Historiography (English), Philosophie (French), Autobiography (English), plus categories in Tamil, Hindi, Arabic, Greek and Japanese scripts.]
Multilingual Semantic Selection
Suppose a user wants to retrieve all “History” books in English, Tamil and French …
[Table: the matching rows; $49.95 | History of Civilization | Will/Ariel Durant | History; €19.95 | L'Histoire De La France | François Lebrun | Histoire; and a Tamil row (INR 250) with category சரித்திரம்.]

Currently, an equivalent SQL expression is as follows …

Select Author, Title, Category, ... From Books
where Category = 'History'
   or Category = 'சரித்திரம்'
   or Category = 'Histoire' ...

We propose a simpler syntax, as follows …

Select Author, Title, Category, ... From Books
where Category MLSemJoin “History” InLanguages { English, Tamil, French }
Multilingual Semantic Selection
(Adding Expressive Power)
Suppose a user wants to retrieve all “History-type” books in English, Tamil and French:

Select Author, Title, ... From Books
where Category MLSemJoin All “History” InLanguages { English, Tamil, Hindi }

Currently, no equivalent SQL expression is available for this query …

[Table: the result now also includes History-subtype rows, e.g., £15.00 | Letters to My Daughter | Jawaharlal Nehru | Autobiography, and £35.00 | History and Historians | Mark T. Gilderhus | Historiography, in addition to the direct “History” matches.]
MLSemJoin Features
• Simpler query specification
– Input in any convenient language, with multilingual output
– Useful when appropriate linguistic resources are not available
– Specially suited for PDA and cell-phone interfaces
• Robustness of query processing
– The query string is more robust with respect to meaning, spelling, etc., as the matching does not rely solely on lexicographic matching
– Equivalence based on intuitive semantic correspondence
• Semantic ⇒ based on a specified ontology
• Restrictions
– Restricted to specific types of attributes (categorical)
Is MLSemJoin just Syntactic Sugar?
• Extends SQL expressive power
– “Retrieve in all languages”
• to retrieve books irrespective of the language of publication ( InLanguages { * } )
– Join functionality based on semantics
• to retrieve “books published by publishers in their specialty”
( Book.Category MLSemJoin Publisher.Specialty )
• Query processing with domain-specific ontologies
– The same mechanism may be extended to any ontological query processing
• Use domain-specific ontologies in specific domains
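The “MLSemJoin All” semantics can be sketched with a toy category taxonomy standing in for a WordNet-style ontology (all names here are illustrative):

```python
# Toy ontology: category -> list of sub-categories (hyponyms).
TAXONOMY = {
    "History": ["Historiography", "Autobiography"],
    "Historiography": [],
    "Autobiography": [],
}

def hyponym_closure(concept: str):
    """All categories at or below `concept`: the set that
    `MLSemJoin All` would match against the Category column."""
    seen, stack = set(), [concept]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(TAXONOMY.get(c, []))
    return seen

print(sorted(hyponym_closure("History")))
# ['Autobiography', 'Historiography', 'History']
```

A plain `MLSemJoin` would match only the multilingual synonyms of “History”, whereas the `All` form also accepts rows whose category falls anywhere in this closure, which is why the Autobiography and Historiography books enter the result.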
Background
WordNet Linguistic Resource
WordNet Basics: Words vs. Meaning
• WordNet is a psycholinguistic dictionary
– It organizes concepts, similarly to the human mind
– In CS-speak: a semantic network
• A word is an association between a concept and a string
– given by a lexical matrix, as follows:
[Figure: lexical matrix with word forms (e.g., Aeroplane, Automobile, Car, Flight, Leo) as columns and concepts as rows; a row with several word forms illustrates synonymy, a column linked to several concepts illustrates polysemy.]
• English has ~110K noun words and ~75K noun synsets
– ~150K associations between them
– needs about 5 MB to store on-line