Multi-lingual Database Systems


(1)

COMAD Tutorial on

Multi-lingual Database Systems

Jayant Haritsa

Indian Institute of Science Bangalore, India

Database

Systems

Laboratory

(2)

Tutorial Contents

• Motivation

• Multilingual Standards and Systems

• Multilingual Performance

• New Multilingual Query Operators

• Multilingual Relational Query Algebra

• Multilingual Implementation

• Conclusions & Future Research

(3)

Organization

• Motivation

• Standards and Systems

• Multilingual Performance

• New Multilingual Query Operators

• Multilingual Relational Query Algebra

• Multilingual Implementation

• Conclusions & Future Research

(4)

Multilingual Portal

• Tracks History (User History)
• Generates Recommendations (User Pref. & Mined Data)
• Generates Incentives (User Pref. & Mined Data)
• Product Categories (Meta-Data)

In Database-speak…
• Data: 1TB Live; 25TB Warehouse
• DBMS: 26
• Systems: 1500 Servers
• Deployment: a set of Monolingual DBMSs; sub-second response time

What is needed in the DBMS to make such a Portal Multilingual?

(5)

Example Multilingual Database

• Books.com

Price | Title | Author_FN | Author | Category | Language
$ 49.95 | History of Civilization | Will/Ariel | Durant | History | English
£ 35.00 | History and Historians | Mark T. | Gilderhus | Historiography | English
£ 15.00 | Letters to My Daughter | | | Autobiography | English
€ 19.95 | L'Histoire De La France | François | Lebrun | Histoire | French
€ 99.95 | Les Méditations Metaphysiques | René | Descartes | Philosophie | French
€ 75.00 | Il Coronation del Virgin | Bicci | Nero | Arti Fini | Italian
€ 12.00 | Παιχνι′δια στο Πια′νο | Κατερινα | Σαρρη | Μουσικη′ | Greek
INR 250 | ÝCò «ü£F | üõý˜ô£™ | «ï¼ | êKˆFó‹ | Tamil
€ 99.95 | êˆFò «ê£î¬ù | «ñ£è¡î£v | 裉F | ²òêKî‹ | Tamil
INR 175 | ªddT£d HI¶ šddy¡d | ¡d®ddUµT¬dd¬d | ¦dyUµè | Be£d²d±d | Hindi
SAR 95 | | | | | Arabic
¥ 7500 | 無門關 | | 無門 慧開 | | Japanese

(6)

Granularity of Multi-lingualism

• Uni-lingual rows, multi-lingual columns

• Uni-lingual columns, multi-lingual rows

• Multi-lingual rows, multi-lingual columns

• Multi-lingual attribute values

(7)

Why Worry about Multilingual Data?

• Growing Multilingual On-line Users and Data

– By 2010, most of the web-pages will be multilingual

• Today English down to 35% from 90% in 1995

– The non-native English-speaking population has grown rapidly, from about one-third in the mid-90s to about two-thirds in the mid-00s

• E-commerce Implications

– Opens up enormous new markets

– Users are four times more likely to buy a product or service if the information is presented in their native language [Aberdeen]

• E-governance Implications

– Opens up communication to native communities

(8)

Sample Applications

• Vidyanidhi

– Portal for all Indian research theses, hosted at Univ. of Mysore, Karnataka, India

– Contains close to 100000 records in English, Hindi, Kannada

• Bhoomi

– Computerized Land Record System in State of Karnataka, India

• storing 20 million records with composite information in Kannada and English

– to be followed in all other states as well

(9)

MLDB Research Questions (1)

• Are today’s database systems equally (a) functional

(b) efficient

across all human languages?

– i.e. is the DBMS “natural-language-neutral” ?

– Specifically, is there a preference for Latin-script based languages (English, French, German, …) as compared to those based on other scripts

(Arabic, Cyrillic, CJK, Indic, …) ?

(10)

MLDB Research Questions (2)

• Are new functionalities desired from a multilingual database system?

– i.e. multi-lingual SQL operators ?

(11)

Organization

• Motivation

• Multilingual Standards and Systems

• Multilingual Performance

• New Multilingual Query Operators

• Multilingual Relational Query Algebra

• Multilingual Implementation

• Conclusions & Future Research

(12)

Multilingual Functionality

To assess the support offered by current database

systems and standards for multilingual data

(13)

Background – Character Encoding

• A character is the smallest component of a written language that has semantic value. The set of all the characters in a language is called a repertoire.

• A Character Encoding assigns a unique numerical value to each of the characters in a repertoire.

– Several encodings available

(14)

Multilingual Character Encoding

• ASCII [&ISO:8859] Encoding

– 7-bit [8-bit] for English [Western European]

• ISCII Encoding

– 8-bit (proprietary) encoding for Indic Languages

• ISO:10646 – Universal Character Set (UCS-2)

– Uniform 2-Byte encoding for all languages

• Unicode Encoding

– 2-byte encoding along the lines of UCS-2 (UTF-16)

• The default standard for Multilingual Data Storage in DBMS

– Has a variable-byte encoding (UTF-8) that favors ASCII (Western European languages)

(15)

Unicode

• Unicode is a uniform 2-byte encoding standard that allows storage of characters from any known alphabet or ideographic system, irrespective of platform or programming environment.

• Unicode codes are arranged in Character Blocks, which contiguously encode the characters of a given Script (usually a single language).

• Unicode has 3 byte encodings, UTF-8, UTF-16 and UTF-32, which store the same character in byte, half-word or double-word formats.
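The UTF-8 bias toward ASCII noted above is easy to see by encoding the same 7-character name in both scripts; a minimal sketch (the Tamil code points are taken from the sample-encodings slide):

```python
# Compare storage sizes of a 7-character name in different encodings.
# The Tamil string is "Narayan": U+0BA8 U+0BBE U+0BB0 U+0BBE U+0BAF U+0BA9 U+0BCD.
english = "Narayan"
tamil = "\u0ba8\u0bbe\u0bb0\u0bbe\u0baf\u0ba9\u0bcd"

for name in (english, tamil):
    utf8 = name.encode("utf-8")
    utf16 = name.encode("utf-16-be")   # big-endian, no BOM
    print(len(name), len(utf8), len(utf16))
# English: 7 chars -> 7 UTF-8 bytes, 14 UTF-16 bytes
# Tamil:   7 chars -> 21 UTF-8 bytes, 14 UTF-16 bytes
```

UTF-16 treats both languages alike, while UTF-8 triples the storage for the Indic string: exactly the asymmetry the tutorial measures.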

(16)

Sample Encodings

String | Encoding | Language | Representation (Hexadecimal)
Narayan | ASCII | English | E4.16.27.16.97.16.E6
Narayan | Unicode (UTF-16) | English | 00.E4.00.16.00.27.00.16.00.97.00.16.00.E6
Narayan | Unicode (UTF-8) | English | E4.16.27.16.97.16.E6
ï£ó£ò¡ | ISCII | Tamil | A8.BE.B0.BE.AF.A9.CD
ï£ó£ò¡ | Unicode (UTF-16) | Tamil | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD
ï£ó£ò¡ | Unicode (UTF-8) | Tamil | E0.AE.A8.E0.AE.BE.E0.AE.B0.E0.AE.BE.E0.AE.AF.E0.AE.A9.E0.AF.CD
£ÁgÁAiÀÄ£ï | ISCII | Kannada | A8.BE.B0.BE.AF.A3.CD
£ÁgÁAiÀÄ£ï | Unicode (UTF-16) | Kannada | 0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
(Kanji string) | Unicode (UTF-16) | Kanji | 5B.FA.4E.95.6B.63.53.5A
(Kanji string) | Unicode (UTF-8) | Kanji | BF.BA.AF.BA.E4.E6.95.A3.AD.8D.E5.9A

(17)

Database Systems

• Commercial systems: Oracle, Microsoft SQL Server, IBM DB2
• Public-domain systems: MySQL, PostgreSQL

For legal reasons, we will randomly refer to them as Systems A, B, C, D, E.

(18)

Standards and Systems

[Flattened comparison table across the SQL:1999 standard and Database Systems A–E; the recoverable entries are:]

• Storage Format: SQL:1999 specifies a National Char type (UCS-2); the systems use Unicode UTF-8/16, Unicode UTF-8, or plain Binary storage
• Support Level: mostly Data only; one system covers Data + Meta-data (OS defined); some systems offer None
• Locales: roughly 25 to 50 per system; one system defers to OS-specified locales
• Collations: similar to the character system collations, or Binary; either pre-defined only, or user-definable (requiring source changes)
• Cross-Language Query Support: None in any system; No Spec in the standard

(19)

Remarks

Database systems are generally

equivalent in their storage and querying of multilingual data, and offer similar

SQL querying power ...

However,

- Uniformly, no cross-language support

- Unknown differential performance

(20)

10,000’ View …

[Diagram: TEXT DATA (proper names, documents, string values, other and category attributes) viewed at three levels]

• Visual representation: grapheme images in Scripts (English, Hindi, …), encoded (ASCII, Unicode, …) as Text Strings; e.g. Nehru, «ï¼, ¦dyUµè, Amazon
• Aural representation: a Phonemic Transformation to Phoneme Strings in a normalized representation (IPA, Arpabet, …), with encodings such as ITrans, Unicode; e.g. /N/ /ae/ /R/ /oo/
• Semantics: a Semantic Transformation to Concepts, abstracted as Multilingual Synsets (WordNet taxonomies)

(21)

Organization

• Motivation

• Standards and Systems

• Multilingual Performance

• New Multilingual Query Operators

• Multilingual Relational Query Algebra

• Multilingual Implementation

• Conclusions & Future Research

(22)

Multilingual Performance

Are the DBMSs natural-language-neutral with respect to performance?

(23)

Database Setup

• Generated a 1 GB TPC-H Database

• Tables modified to hold both original CHAR [ASCII] Data and equivalent NCHAR [Unicode] Data

• Experiments conducted with

– separate CHAR and NCHAR tables

– common tables (to eliminate the impact of disk I/O)

• Example: PartSuppCommon table with equivalent Tamil data

Part ID | PartName (ASCII) | PartName_NChar (Unicode) | Supplier ID | SuppName (ASCII) | SuppName_NChar (Unicode)
18 | Part Name #000018 | ð£è‹ ªðò˜ #000018 | 2503 | Supplier #2503 | îò£KŠðõ˜ #2503

(24)

System and DBMS Setup

• Stand-alone Pentium-IV running Windows 2000

– Cold-start ensured before each experiment

• DBMS

– Oracle, DB2, SQL Server, Postgres
– Installed with default configurations

• Display time nullified through aggregate functions in the select clause

(25)

Database Operators Measured

• Table-Scan

– Time for scanning for a specific key

• Index-Create and Index-Scan

– Time for creating index

– Time for retrieving 20% of search keys in index

• Sort

– Time for sorting the attribute

• Join [Nested-Loop, Hash, Sort-Merge]

– Join types forced by Optimizer Hints, Setting Optimization Levels, etc.

– Plan pictures verified the use of appropriate join type in a query

(26)

Sample Queries

• Join Operator

select count(*)
from PartSuppCommon PS1, PartSuppCommon PS2
where PS1.SuppName = PS2.SuppName
  { PS1.SuppName_NChar = PS2.SuppName_NChar }
and PS1.PartName <> PS2.PartName
  { PS1.PartName_NChar <> PS2.PartName_NChar }

• Table-Scan Operator

select count(*)
from PartSuppCommon PS1
where PS1.SuppName = ‘Supplier #2503’
  { PS1.SuppName_NChar = ‘îò£KŠðõ˜ #2503’ }

(27)

Performance Metrics

• Operator Performance (MRO_Oper)
  – Measures the operator differential performance between Char and NChar
  – MRO_Oper = T_NChar / T_Char (Ideal Value: 1)

• Database Relative Efficiency (ME_DBMS)
  – Measures the database differential performance between Char and NChar
  – ME_DBMS = GM_Char / GM_NChar over the operator times (Ideal Value: 1)

• Optimizer Prediction Equity (MPE_Oper)
  – Measures the optimizer prediction [in]equity between Char and NChar
  – MPE_Oper = (O_NChar / O_Char) / (T_NChar / T_Char) (Ideal Value: 1)
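The three metrics can be stated compactly in code; a sketch (function names are mine; the efficiency is read as the geometric-mean ratio that falls below 1 when NChar is slower, matching the reported values):

```python
from math import prod

def mro(t_nchar, t_char):
    """Operator slowdown: T_NChar / T_Char (ideal value 1)."""
    return t_nchar / t_char

def geometric_mean(times):
    return prod(times) ** (1.0 / len(times))

def me(char_times, nchar_times):
    """Database relative efficiency over a set of operator timings
    (ideal 1; below 1 means the NChar database is slower)."""
    return geometric_mean(char_times) / geometric_mean(nchar_times)

def mpe(o_nchar, o_char, t_nchar, t_char):
    """Optimizer prediction equity: predicted cost ratio over
    observed time ratio (ideal value 1)."""
    return (o_nchar / o_char) / (t_nchar / t_char)
```

For example, an NChar table scan taking twice as long as its Char twin gives MRO = 2.0, and a database whose NChar timings are uniformly double gives ME = 0.5.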

(28)

Operator Performance (Common Table)

• MRO_Oper metric (Ideal Value = 1)

Operator | System A | System B | System C | System D
Table-Scan | 2.72 | 1.33 | 1.06 | 1.12
Index-Create | 1.21 | 1.25 | 1.39 | 1.32
Index-Scan | 2.75 | 1.35 | 1.97 | 1.01
Sort | 1.81 | 1.48 | 1.23 | 1.24
Join (Nested-Loop) | 1.03 | 1.03 | 1.59 | 1.36
Join (Hash) | 2.60 | 1.35 | 1.29 | 1.79
Join (Sort-Merge) | 1.92 | 1.55 | 1.34 | 2.70

• In a nutshell:
  – Wide variation in system performance
  – Slowdown can be as much as 200%
  – Generally, 30–100% slower for NChar
  – All operators are slow on multilingual data

(29)

Overall DBMS Performance

• ME_DBMS metric (Ideal Value = 1)

Database | Efficiency
System A | 0.57
System B | 0.80
System C | 0.70
System D | 0.69

• Databases are about 50% to 100% inefficient in the multilingual world
  – Note: a conservative estimate, since the common table captures only in-memory differentials
  – With separate tables, the inefficiency jumps to several hundred percent (e.g., slowdown was up to 475% for Scans and up to 275% for Joins)

(30)

Query Optimizer Performance

• MPE_Oper metric (Ideal Value = 1)

Operator | System A | System B | System C | System D
Table-Scan | 0.37 | 0.75 | 0.94 | 0.89
Index-Scan | 0.38 | 1.55 | 0.31 | 0.37
Join (Nested-Loop) | 0.97 | 0.97 | 1.16 | 0.99
Join (Hash) | 1.26 | 0.75 | 1.22 | 0.55
Join (Sort-Merge) | 0.89 | 1.20 | 0.95 | 0.74

• In a nutshell:
  – Generally, 5–100% mis-prediction
  – Could be due to erroneous cost models between Char and NChar
  – Could lead to grossly inefficient plans

(31)

Analysis of Performance

• Experiments on DBMS system A

– System A exhibited the worst differential performance

• Our Objective:

– What are the components of the slowdown?

– How can these components be addressed to improve performance?

(32)

Slowdown vs String Size

• How does the slowdown vary with (logical) String Length?

• A high differential in Scan at small string sizes indicates high fixed-cost overheads (such as call overhead)

• An increasing differential with string size indicates higher variable-cost overheads (such as string handling for function calls)

(33)

Slowdown w.r.t. Data Type

• We created Common Table with:

– Char attributes of size 110; NChar attributes of size 55
– The attributes have the same physical size, but are of different types

• We ran the same queries

– Call overheads are the same

– Only difference is in Datatype specific code-segments in common operator implementation

• The observed differential performance is ~10-15% of corresponding Operator Slowdowns

– Small, but not insignificant.

(34)

Slowdown wrt. String Processing

• Created Common Table (as before)

– All NChar attributes replaced with Char attributes of twice the size

• Ran the same queries

– Disk I/O & Call Overheads are the same, except the data being passed as parameter to the operator functions is different in size, same in type

– Measures any differences in in-memory handling of different sized strings in a common operator implementation

• The observed differential performance is ~80-90%

of the corresponding Operator Slowdowns

– This component contributes primarily to the operator slowdown

(35)

Overall Performance Analysis

• The slowdown of NChar over Char (including Disk I/O) is very large (several hundred percent)

• The slowdown of NChar over Char (considering only in-memory processing) is still large (50 to 100%):

– Primarily, due to the size of the NChar strings (~85%)
– Secondarily, due to the type-specific implementation (~10%)

• Hence, to improve performance the size of the

NChar storage must be tackled …

(36)

Cuniform Storage Format

(37)

Two Observations…

• #1: Unicode = Character Block + Offset
  – about half the bits only identify the character block

String | Encoding | Language | Representation (Hexadecimal)
ï£ó£ò¡ | ISCII | Tamil | A8.BE.B0.BE.AF.A9.CD
ï£ó£ò¡ | Unicode | Tamil | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD

• #2: An attribute value is likely to be in ONE script.

ISBN | Title | Price | Author
992 | Discovery of India | $12.95 | Nehru
127 | ªddT£d HI¶ šddy¡d | Rs 975 | ¦dyUµè
590 | ®d¦Qî «dd £d Ta (Vol 1) | Rs 175 | ¦dyUµè

(38)

Cuniform – Compressed UNIcode FORMat

• Store each string data item as an ordered pair:
  – the common Character Block
  – the Offset of each character within that common block

• Original Unicode (UTF-16) strings:

Narayan | 00.E4.00.16.00.27.00.16.00.97.00.16.00.E6
ï£ó£ò¡ | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD
£ÁgÁAiÀÄ£ï | 0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
RK £ÁgÁAiÀÄ£ï | 00.27.00.EF.0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD

• … after skinning into Cuniform pairs:

Narayan | (00, E4.16.27.16.97.16.E6)
ï£ó£ò¡ | (0B, A8.BE.B0.BE.AF.A9.CD)
£ÁgÁAiÀÄ£ï | (0C, A8.BE.B0.BE.AF.A3.CD)
RK £ÁgÁAiÀÄ£ï | (NULL, 00.27.00.EF.0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD)  ← a mixed-block string retains its full codes
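The skinning step follows directly from observation #1 (high byte = character block, low byte = offset); a simplified sketch that assumes BMP code points and a one-byte block id:

```python
def to_cuniform(s):
    """Skin a Unicode string into a (block, offsets) Cuniform pair.

    A single-script string shares one character block (the high byte of
    each code point); a mixed-block string falls back to its full code
    points, with block = None standing in for the NULL block marker.
    """
    blocks = {ord(c) >> 8 for c in s}
    if len(blocks) != 1:
        return (None, [ord(c) for c in s])        # keep full Unicode codes
    return (blocks.pop(), [ord(c) & 0xFF for c in s])

# Tamil "Narayan" -> block 0x0B, offsets A8 BE B0 BE AF A9 CD (as on the slide)
tamil = "\u0ba8\u0bbe\u0bb0\u0bbe\u0baf\u0ba9\u0bcd"
print(to_cuniform(tamil))
```

The offsets are single bytes, so a skinned string occupies roughly the same space as the Char original, matching the ~2% space-overhead figure reported later.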

(39)

Implementation of Cuniform

• Transparently remap all the SQL queries to work on the Cuniform Pairs

• For presentation of results, conversion from

Cuniform to Unicode is trivial

(40)

Operator Performance on Cuniform

• Outside-the-engine implementation

• Space occupancy
  – Only 2% larger than Char; compare this with NChar’s 100% overhead

Operator | Unicode Slow-down | Cuniform Slow-down
Table-Scan | 2.56 | 1.05
Index-Scan | 1.88 | 1.99
Join (Nested-Loop) | 1.04 | 1.03
Join (Hash) | 2.74 | 1.22
Join (Sort-Merge) | 1.99 | 1.15

• Largely, the differential performance is eliminated (except Index-Scan)

(41)

Cuniform Performance Summary

• Overall,

– Generally, Better Performance

• Index Tree is built on a pair of attributes, resulting in worse performance

– Multilingual Efficiency up to 0.81 from 0.57

• With inside-engine implementation, can be made even better

– The performance is made almost natural-language-neutral …

(42)

Remarks

There is a performance barrier separating languages in Latin script (e.g., English)

from those in other scripts (e.g., Indic

languages), but this barrier can be largely

broken down with the Cuniform format …

(43)

Organization

• Motivation

• Standards and Systems

• Multilingual Performance

• New Multilingual Query Operators

• Multilingual Relational Query Algebra

• Multilingual Implementation

• Conclusions & Future Research

(44)

New Multilingual Operators

Objective:

To assess the new operators desired from a

multilingual database system

(45)

MLNameJoin Operator

Referred to as LexEQUAL in published work

(46)

Multilingual Books.com

(with Language Information)

(Same multilingual Books.com table as earlier, repeated with the Language column highlighted: English, Tamil, Hindi, French, Italian, Greek, Arabic, Japanese.)

(47)

Multilingual Selection

• Suppose a user wants the books of “Nehru” in English, Tamil and Hindi …

Price | Title | Author_FN | Author | Language
£ 15.00 | Letters to My Daughter | Jawaharlal | Nehru | English
INR 250 | ÝCò «ü£F | üõý˜ô£™ | «ï¼ | Tamil
INR 175 | ªddT£d HI¶ šddy¡d | ¡d®dUµT¬dd¬d | ¦dyUµè | Hindi

(48)

Current SQL Approach

Select Author, Title, ...
From Books
Where Author = ‘Nehru’
   or Author = ‘«ï¼’
   or Author = ‘¦dyUµè’ ...

• Problems with this approach

– User needs to be fluent in all the target languages

– Need specialized lexical resources (fonts, keyboard mappings, etc.) for input

– Input prone to Error, due to the lack of Directory support

• Further, CANNOT BE USED TO EXPRESS A JOIN ACROSS MULTILINGUAL COLUMNS

(49)

Proposed MLNameJoin Query

Select Author, Title, ...
From Books
Where Author MLNameJoin “Nehru” Threshold 0.2
      InLanguages { English, Tamil, Hindi }

• Input in a convenient language, with multilingual output
• Equivalence based on intuitive phonetic correspondence
  – restricted to proper names (which form about 20 percent of text corpora)
• Customizable fuzzy matching
  – Threshold parameter
• Most importantly, extensible …
  – “Retrieve in All Languages” ( InLanguages { * } )
  – Join across columns ( Author MLNameJoin Faculty )

(50)

Matching Strategy

• Store Multilexical Strings in Database

– In Unicode (or Cuniform)

• To match, transform to equivalent phonemic strings in IPA alphabet using standard Text-to- Phoneme(TTP) converters …

• … and compare transformed strings using Approximate Matching Techniques

– to absorb incompatibilities in the phonemes of different languages

MLNameJoin transforms matching from

textual space to phonemic space

(51)

Example MLNameJoin Operation

• Books table (the last column is generated):

Author | Author (Phonemes) | Title | Language
«ï¼ | næhru | ÝCò «ü£F | Tamil
Nehru | næru | Discovery of India | English
¦dyUµè | næhru | ªddT£d HI¶ šddy¡d | Hindi
Nero | nerou | The Coronation of the Virgin | English
Franklin | frAŋklın | Ein Amerikanischer Autobiographie | German

• The query:

Select Author ...
From Books
Where Author MLNameJoin “Nehru” Threshold 0.2
      InLanguages { English, Tamil, Hindi }

• Will be executed as:
  – Transform “Nehru” to its phonemic string (using the English TTP) as “næhru”
  – Retrieve all records whose Language is one of (English, Tamil or Hindi) and whose phoneme strings are within edit distance 1 of “næhru”

(52)

MLNameJoin Implementation Goals

• Accurate & Efficient Matching across languages

• Minimum changes to the DB Server

(53)

State of the Art

• No Support for Multilexical Matching in Commercial DBMS

– Soundex approximations for Latin-based scripts

• Approx. Matching Algos

– No Approximate Matching supported in DBMS

• However, UDFs can partially solve this problem

• Phonetic Matching in IR & Speech Processing Community

– [Zobel-SIGIR96] In English, using Soundex-type algorithms
– Speech Processing research in on-line speech processing

• Proprietary Solutions

– LASA (look-alike-Sound-alike) system for FDA

(54)

MLNameJoin Function

Steps:

Convert the input strings to phonemes, and find the edit distance between the phonemic equivalents

If (edit distance < Threshold × size of query string) return TRUE
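A minimal sketch of these two steps, with a plain dynamic-programming edit distance standing in for the parameterized version (the phonemic strings are taken from the example slide; `<=` is used in the comparison so that Threshold 0.2 on a 5-phoneme query admits edit distance 1, as the example requires):

```python
def edit_distance(a, b):
    """Classic O(mn) dynamic-programming edit distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ml_name_join(query_phonemes, data_phonemes, threshold):
    """MLNameJoin predicate on two phonemic strings."""
    bound = threshold * len(query_phonemes)
    return edit_distance(query_phonemes, data_phonemes) <= bound

print(ml_name_join("n\u00e6hru", "n\u00e6ru", 0.2))   # True: distance 1, bound 1.0
print(ml_name_join("n\u00e6hru", "nerou", 0.2))       # False: distance 3
```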

(55)

MLNameJoin Parameters

• Match Threshold

– Specifies the level of tolerance for mismatch between the phonemic strings

– Tunable for Matching

• User-Settable (per Query), or Global (for an application)

– Threshold varied in [0, 1]

• 0 => only perfect matches are accepted, 1 => anything can be matched

• Set of Output Languages

– Those languages of interest to the multilingual user

(56)

EditDistance Function

Steps:

Basic dynamic-programming algorithm to find the edit distance
– InsCost, DelCost and SubCost are parameterized, per-operation costs
– Phonemes are grouped into clusters, with an intra-cluster substitution cost

(57)

EditDistance Parameters

• Cost Functions

– InsCost, DelCost and SubCost may be set individually
– This simulates different types of edit distances

• Intra-Cluster Substitution Cost

– Phonemes may be clustered based on their like-ness

• Clusters to be formed with linguist’s input

– Matching phonemes within a cluster may be more acceptable, than from outside a cluster

– Cost varied in [0, 1]

0 implies all phonemes are equivalent within a cluster
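The cluster-aware substitution cost can be sketched as follows; the cluster assignments here are invented purely for illustration (real clusters would come from a linguist, as noted above):

```python
# Hypothetical phoneme clusters: 0 = nasals, 1 = front vowels, 2 = liquids.
CLUSTERS = {"n": 0, "m": 0, "\u00e6": 1, "e": 1, "r": 2, "l": 2}

def sub_cost(p, q, intra_cluster_cost=0.5):
    """Substitution cost: 0 for identical phonemes, the tunable
    intra-cluster cost for phonemes sharing a cluster, 1 otherwise."""
    if p == q:
        return 0.0
    if p in CLUSTERS and q in CLUSTERS and CLUSTERS[p] == CLUSTERS[q]:
        return intra_cluster_cost
    return 1.0

print(sub_cost("n", "m"))   # same cluster -> 0.5
print(sub_cost("n", "r"))   # different clusters -> 1.0
print(sub_cost("e", "e"))   # identical -> 0.0
```

Setting the intra-cluster cost to 0 makes all phonemes within a cluster equivalent, exactly as the slide states.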

(58)

Implementation Architecture

[Architecture diagram: the client submits a Query String and Match Threshold to the Server; the Query Processing Engine invokes an Approximate Matching module configured with a Cost Function and Phoneme Clusters; per-language TTP converters (one for the query, one per data language) produce the phonemic strings; the Database stores the Unicode strings; the Matched Strings are returned to the client.]

(59)

Performance Experiments

(60)

Data Sets

Three data sets:

Set | Description | # of Strings
1 | Indian Names (Bangalore Telephone Directory) | 300
2 | Occidental Names (San Francisco Physicians Directory) | 300
3 | Generic Names (Places/Chemicals/Objects) | 198
All | Combined Set | 798

• Equivalent phonemic strings were stored in the IPA alphabet, in Unicode format
• Hand-tagged each set of matchable strings with a Group ID

(61)

Data Sets ( Continued …)

Generated data sets:
• Concatenated each string with every other string in the same language
• Each set of matchable strings has the same Group ID
• Generated about 200,000 strings

(62)

Phonetic Transformation

• Used standard Text to Phoneme Conversion to IPA alphabet

• For English:

– A web-based TTP converter (http://www.foreignword.com)

• Dictionary: Oxford English Dictionary

• For Indic:

– Dhvani TTP (http://dhvani.sourceforge.net), after source modifications
  • All Indic languages are phonetic; hence an almost 1-to-1 mapping exists

Name | Language | Phonemes (IPA)
ï£ó£ò¡ | Tamil | nArAIQn
Hydrogen | English | haIdr´dZ´n
Amazon | English | Qm´zAn
University | English | jun´vŒrsIti
‚Kvìð˜ | Tamil | krIst´p´r
Computer | English | k´mpjut´r

(63)

Metrics Measured

• Parameters varied
  – User Match Threshold
  – Intra-Cluster Substitution Cost

• Metrics measured
  – M1: # of correctly reported matches
  – M2: # of reported matches
  – n: number of groups of equivalent names; n_i: # of elements in the i-th group

• Precision: fraction of the returned results that are correct
  Precision = M1 / M2

• Recall: fraction of the correct results that are returned
  Recall = M1 / Σ_{i=1}^{n} C(n_i, 2)
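These definitions translate directly into code; a toy sketch (the pair sets and group sizes below are invented for illustration):

```python
from math import comb

def precision_recall(reported, correct, group_sizes):
    """Precision = M1/M2; Recall = M1 / sum_i C(n_i, 2).

    `reported` and `correct` are sets of unordered name pairs;
    M1 counts reported pairs that are truly matching.
    """
    m1 = len(reported & correct)
    m2 = len(reported)
    total_correct = sum(comb(n, 2) for n in group_sizes)
    return m1 / m2, m1 / total_correct

# One group of 3 equivalent names -> C(3,2) = 3 correct pairs.
correct = {("nehru", "naehru"), ("nehru", "naeru"), ("naehru", "naeru")}
reported = {("nehru", "naehru"), ("nehru", "naeru"), ("nehru", "nero")}
print(precision_recall(reported, correct, [3]))   # (2/3, 2/3)
```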

(64)

Correctness Experiments

Recall & Precision Metrics

• Desired quality of match: Recall → 1, Precision → 1

Analysis of observed results:
• Recall is reasonable (≥ 0.90) only with a Match Threshold of at least 0.20
• Precision is reasonable (≥ 0.90) only with a Match Threshold of at most 0.30

(65)

Correctness Experiments

Tuning the Matching

• Best Matching Point (Empirically, with Precision-Recall curves)

The best possible point of matching is reached with:

Match Threshold [0.25, 0.3] and Intra-Cluster Subs Cost [0.25, 0.5]

For this dataset Recall is 95% and Precision is 85%

(66)

Performance Experiments

Baseline

• The Edit-Distance function was implemented as a UDF

Query | Matching Methodology | Time (Sec)
Scan | Approximate (using UDF) | 1418
Join | Approximate (using UDF) | 4004

• Reasons for the high cost:
  – The UDF incurs large per-call costs and runs in an interpreted environment
  – Edit distance uses an O(mn) algorithm
  – All the strings in the database were compared: O(N) UDF invocations

• How to improve the performance?
  – Q-Gram technique
  – Approximate phonetic indexing

(67)

Performance Experiments

Q-Grams

• Key Idea:

– Generate and store all q-grams

– Use q-grams filter properties to generate a candidate set (cheap) and prune false positives using UDF (expensive)

• Three different filters are used

– Length Filter: the lengths of matching strings cannot differ by more than k
– Count Filter: # of matching q-grams ≥ max(|a|, |b|) − 1 − (k − 1)·q
– Position Filter: matching q-grams cannot be more than k positions apart
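The three filters can be sketched over padded positional q-grams; this is illustrative (the padding characters are my choice, the count-filter bound is the one stated above), not the tutorial's actual implementation:

```python
def qgrams(s, q=2):
    """Positional q-grams of s, padded with q-1 '#' / '$' characters."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]

def may_match(a, b, k, q=2):
    """Cheap filters: True means (a, b) survives into the candidate set,
    to be verified by the expensive edit-distance UDF."""
    if abs(len(a) - len(b)) > k:                      # length filter
        return False
    gb = qgrams(b, q)
    used, matches = set(), 0
    for i, g in qgrams(a, q):
        for idx, (j, h) in enumerate(gb):
            if idx not in used and g == h and abs(i - j) <= k:  # position filter
                used.add(idx)
                matches += 1
                break
    # count filter: matching strings must share enough q-grams
    return matches >= max(len(a), len(b)) - 1 - (k - 1) * q

print(may_match("n\u00e6hru", "n\u00e6ru", k=1))    # True: candidate, verify with UDF
print(may_match("n\u00e6hru", "frAnklIn", k=1))     # False: pruned by length filter
```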

(68)

Performance Experiments

Q-Grams (continued )

Query | Matching Methodology | Time (Sec)
Scan | Approximate (using q-grams) | 13.5
Join | Approximate (using q-grams) | 856

• From baseline, improved two orders of magnitude in Scan and five times in Join

– # of Calls to UDF reduced tremendously

• Caveats:

– No false-dismissals , only false-positives

– At the cost of 15X storage space, for storing the Q-Grams

(69)

Performance Experiments

Phonetic Index

• Key Idea:

– Use the “Phoneme Clusters” to generate a string of integers (like Soundex)
– Convert it to a single number
– Index the phonemic strings in a standard B+ tree, on that number

• For searching:

– Transform search string to its Index String, and search in index

• Returned set is a Candidate Set for approximate matching strings

• Use UDF to weed out false positives
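A Soundex-like sketch of this idea: phonemes map to cluster digits (vowels and 'h' are dropped, the way Soundex does), the digit string becomes the indexed number, and a dict stands in for the B+ tree. The cluster table is invented for illustration:

```python
# Hypothetical cluster digits; vowels and 'h' are skipped (Soundex-style).
CLUSTER_DIGIT = {"n": 5, "m": 5, "r": 6, "l": 6, "f": 1, "k": 2}
SKIP = set("a\u00e6eiouh")

def phonetic_key(phonemes):
    digits = [str(CLUSTER_DIGIT[p]) for p in phonemes
              if p not in SKIP and p in CLUSTER_DIGIT]
    return int("".join(digits)) if digits else 0

# Build the "index" (a dict standing in for a B+ tree keyed on the number).
index = {}
for s in ["n\u00e6hru", "n\u00e6ru", "nerou", "kl\u00e6rk"]:
    index.setdefault(phonetic_key(s), []).append(s)

# Probe: the returned candidate set still needs the edit-distance UDF to
# weed out false positives ("nerou" shares key 56 with the Nehru variants).
print(index.get(phonetic_key("n\u00e6hru"), []))
```

Because the key collapses whole clusters, near matches that cross cluster boundaries can miss the bucket entirely, which is where the false dismissals reported next come from.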

(70)

Performance Experiments

Phonetic Index (Continued…)

Query | Matching Methodology | Time (Sec)
Scan | Approximate (using Phonetic Index) | 0.71
Join | Approximate (using Phonetic Index) | 15.2

• From the baseline, three orders of magnitude improvement
  – The index speeds up the look-up of “like strings” tremendously

• Caveats:
  – False dismissals are now possible
  – Indexing cost is added

(71)

Performance Improvements

(Three Alternative Techniques)

• Technique #1: Metric Distance Index
  – Pre-compute the edit distances of the phonetic strings from a given Key String
  – Use properties of edit distances to reduce calls to the UDF

• Technique #2: Q-Grams
  – Generate and store all q-grams of the phonetic strings
  – Use properties of q-grams to reduce calls to the UDF

• Technique #3: Phonemic Indexes
  – Convert phoneme strings to numbers (corresponding to phoneme clusters)
  – Index the resulting numbers, using B+ trees

Matching Methodology | Scan-Time (Sec) | Join-Time (Sec) | Remarks
Metric Distance Index | 356 | 1728 | Not much improvement on baseline
Q-Grams | 13.5 | 856 | 15X storage overheads for q-grams
Phonemic Indexes | 0.71 | 15.2 | ~5-8% false dismissals

(72)

Remarks

The MLNameJoin operator employing phonetic matching can complement the standard lexicographic operators, for cross-language name searches…

…Further, it may be implemented

efficiently on existing systems.

(73)

MLSemJoin Operator

Referred to as SemEQUAL in published work

(74)

Multilingual Books.com

(with Category Information)

(Same multilingual Books.com table as earlier, repeated with the Category column highlighted: History, Historiography, Autobiography, Histoire, Philosophie, Arti Fini, Μουσικη′, êKˆFó‹, ²òêKî‹, Be£d²d±d, …)

(75)

Multilingual Semantic Selection

• Suppose a user wants to retrieve all “History” books in English, Tamil and French …

Price | Title | Author_FN | Author | Category
$ 49.95 | History of Civilization | Will/Ariel | Durant | History
€ 19.95 | L'Histoire De La France | François | Lebrun | Histoire
INR 250 | ÝCò «ü£F | üõý˜ô£™ | «ï¼ | êKˆFó‹

Currently, an equivalent SQL expression is as follows:

Select Author, Title, Category, ...
From Books
Where Category = “History”
   or Category = “êKˆFó‹”
   or Category = “Histoire” ...

We propose a simpler syntax, as follows:

Select Author, Title, Category, ...
From Books
Where Category MLSemJoin “History” InLanguages { English, Tamil, French }

(76)

Multilingual Semantic Selection

(Adding Expressive Power)

• Suppose a user wants to retrieve all “History-type” books in English, Tamil and French

Price | Title | Author_FN | Author | Category
$ 49.95 | History of Civilization | Will/Ariel | Durant | History
€ 19.95 | L'Histoire De La France | François | Lebrun | Histoire
INR 250 | ÝCò «ü£F | üõý˜ô£™ | «ï¼ | êKˆFó‹
£ 35.00 | History and Historians | Mark T. | Gilderhus | Historiography
£ 15.00 | Letters to My Daughter | Jawaharlal | Nehru | Autobiography

Currently, no equivalent SQL expression is available for this query …

Select Author, Title, ...
From Books
Where Category MLSemJoin All “History” InLanguages { English, Tamil, Hindi }

(77)

MLSemJoin Features

• Simpler Query Specification

– Input in any convenient language, with Multilingual output

• When Appropriate Linguistic Resources are not available

• Specially Suited for PDA, Cell Phone Interfaces

• Robustness of Query Processing

– The query string is more robust with respect to meaning, spelling, etc., as the matching does not rely on lexicographic matching alone
– Equivalence based on intuitive semantic correspondence
  • Semantic ⇒ based on a specified ontology

• Restrictions

– Restricted to Specific types of Attributes (Categorical)

(78)

Is MLSemJoin just Syntactic Sugar?

• Extends SQL Expressive Power

– “Retrieve in All Languages”

• To Retrieve books irrespective of language of Publication ( InLanguages { * } )

– Join Functionality based on Semantics

• To Retrieve “Books published by Publishers in their Specialty”

( Book.Category MLSemJoin Publisher.Specialty )

• Query Processing with Domain Specific Ontologies

– The same Mechanism may be extended to any Ontological Query Processing

• Use domain-specific ontologies in specific domains

(79)

Background

WordNet Linguistic Resource

(80)

WordNet Basics:

Words vs. Meaning

• WordNet is a psycholinguistic dictionary
  – It organizes concepts similarly to the human mind
  – CS-speak: a Semantic Network

• A word is an association between a concept and a string
  – Given by a Lexical Matrix: rows are synsets (Sy:1 … Sy:4, each with its Gloss), columns are word strings (Aeroplane, Automobile, Car, Flight, Leo)
  – One row associated with several columns: Synonymy; one column associated with several rows: Polysemy

• English has ~110K noun words and ~75K noun synsets
  – ~150K associations between them
  – Needs about 5MB to store on-line

(81)

WordNet Basics: Noun Hierarchy

• Nouns are Grouped into 25 “Semantic Primes”

• Under each, concepts are arranged in a Taxonomic Hierarchy

– Can Specialize / Generalize a “Synset”

• Example: the word “Mouse” denotes two synsets in different hierarchies
  – Fauna → Mammal → Mouse(1) (with siblings such as Bird, Whale, Dolphin, and Land-/Water-Based animals)
  – Artifact → Computer Peripherals → Pointing Devices → Mouse(2)

• Example taxonomy:
  – Knowledge → { Philosophy, History }
  – History → Personal History → Biography → Autobiography
  – History → Subject History → Historiography

• Synsets correspond to synonyms

(82)

WordNet Basics: Interlinked Synsets

• With Multilingual WordNets, the English WN Hierarchy is taken as the base, and modified for Target Languages

– Interlinking provided between the Synsets

• English hierarchy: Knowledge → { Philosophy, History }; History → { Personal History → Biography → Autobiography, Subject History → Historiography }

• German hierarchy: Wissen → { Philosophie, Geschichte }; Geschichte → Persönliche Geschichte → Biographie → Autobiographie

• Inter-language Semantic Links connect the equivalent synsets; Intra-language Is-A Links form each taxonomy

(83)

Multilingual WordNet Initiatives

• There are several WordNet initiatives around the globe, coordinated by Global WordNet Organization

– EuroWordNet: covers all major European languages
– IndoWordNet: covers ~15 official Indic languages
– CJK WordNet: between the CJK languages

– …

• Most of them take English WordNet as the Base

– Maintain a structural similarity with English WordNet and specialize for their specific languages

– Provide Inter-lingual Index between “Equivalent” Synsets

(84)

Implementation

Derived Operator Approach

(85)

Semantic Matching Strategy

• Integrate WordNet Linguistic Ontological Resources into the DBMS

• Map [Multilingual] Words to Canonical Semantic Primitives

– WordNet provides rich Ontological Hierarchies for nouns

– Inter-linked WordNets between languages provide cross-lingual mappings

• Match on Semantic Primitives

– Directly, as a non-null intersection of Primitives

– Or as a non-null intersection of Transitive Closures, for matching on Specializations

(86)

MLSemJoin Example

• Database:

– All books are tagged with Category

– The following Hierarchy is given as a “Resource”

[Figure: the Knowledge/History hierarchy shown in three interlinked languages: English (Knowledge → Philosophy, History; History → Personal History, Subject History; Personal History → Biography → Autobiography; Subject History → Historiography), German (Wissen → Philosophie, Geschichte; Persönliche Geschichte → Biographie → Autobiographie), and its Tamil counterpart]

Query: Retrieve all History-Type Books in English, Tamil & German

(87)

MLSemJoin Algorithm

Steps:

1. Convert the Query String to a Synset

2. Find the Transitive Closure (TC_Q) of the Query Synset in the Interlinked WordNet Hierarchies

3. If (Synset of Data String ∈ TC_Q) then return TRUE, else FALSE
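The steps above can be sketched in Python; the toy hierarchy, the equivalence links, and the synset names are illustrative, not actual WordNet data:

```python
from collections import deque

# Interlinked WordNet hierarchies as a synset -> children adjacency map
# (intra-language Is-A links) plus an inter-lingual index (cross-language
# equivalence links). All entries below are illustrative.
hierarchy = {
    "History":         ["Biography", "Subject History"],
    "Biography":       ["Autobiography"],
    "Subject History": ["Historiography"],
    "Geschichte":      [],
}
equivalents = {"History": ["Geschichte"]}  # inter-lingual index

def transitive_closure(query_synset):
    """TC_Q: every synset reachable from the query synset."""
    closure, frontier = set(), deque([query_synset])
    while frontier:
        s = frontier.popleft()
        if s in closure:
            continue
        closure.add(s)
        frontier.extend(hierarchy.get(s, []))
        frontier.extend(equivalents.get(s, []))
    return closure

def ml_sem_join(query_synset, data_synset):
    """TRUE iff the data synset lies in the query synset's closure."""
    return data_synset in transitive_closure(query_synset)
```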

(88)

Implementation Details

• In MLSemJoin Algorithm

– Computing the Recursive Closure (Step 3) in Relational Systems is expensive

• It takes about 98% of the Query time

– A UDF can be implemented, but it is also very expensive

• We took a “Derived-Operator” approach, in an unmodified RDBMS using Recursive SQL feature

– Transparently re-write the MLSemJoin query into one that uses WITH and IN clauses of SQL:1999

• Caveat: System should support SQL:1999

– The query may be optimized using the standard Relational Query Optimizer

(89)

Derived Operator Implementation of MLSemJoin

• The query is transformed from MLSemJoin query to a standard SQL:1999 query as follows:

Original query:

Select Author, Title, ... From Books
where Category MLSemJoin ALL “History”
InLanguages { English, Tamil, German }

Rewritten query:

Select Author, Title, ... From Books
where Category in { ‘History’, ‘Biography’, ‘Autobiography’, ‘ êKˆFó‹ ’, ‘ ²òêKî‹ ’, …, ‘Geschichte’, ... }

The data for the IN Clause is the Recursive Closure of “History” across the target Languages
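As a runnable sketch of this rewrite (SQLite standing in for an SQL:1999-capable engine; table contents and names are illustrative), the closure feeding the IN clause can be computed inside the unmodified engine with a recursive WITH query:

```python
import sqlite3

# Toy taxonomy and book tables, stored as plain <parent, child> rows.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE taxonomy (parent TEXT, child TEXT);
    INSERT INTO taxonomy VALUES
        ('History', 'Biography'), ('Biography', 'Autobiography'),
        ('History', 'Historiography');
    CREATE TABLE books (title TEXT, category TEXT);
    INSERT INTO books VALUES
        ('Letters to My Daughter', 'Autobiography'),
        ('History of Civilization', 'History'),
        ('Les Meditations', 'Philosophy');
""")

# Recursive WITH computes the closure of 'History'; the IN clause then
# filters the books, exactly as in the derived-operator rewrite.
rows = db.execute("""
    WITH RECURSIVE closure(node) AS (
        SELECT 'History'
        UNION
        SELECT t.child FROM taxonomy t JOIN closure c ON t.parent = c.node
    )
    SELECT title FROM books WHERE category IN (SELECT node FROM closure)
""").fetchall()
```

The whole rewritten statement remains ordinary SQL, so the relational optimizer can plan it like any other query.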

(90)

MLSemJoin Performance

(91)

Data

• WordNet (Ver 1.5) Stored in DBMS

– ~110,000 Word-Forms, ~80,000 Word-Senses, and ~140,000 Relationships between them

• Stored in DBMS in Plain Taxonomy Tables

– Plain vanilla <Parent,Child> Relationships

– Occupies 4 MB for ASCII and ~8 MB for Unicode

• Multilingual WordNets were simulated by copies of English WordNet in Unicode

– Inter-Language-Links created between all pairs (p:0.95)

• For Performance experiments, this approach gives a good approximation

(92)

WordNet Profiles

Structural Characteristics of WordNets

Different WordNets are at different stages of development, yet they are highly correlated; more so in the Euro-WordNets than in the Indo-WordNets. This confirms their design goal to “keep the basic taxonomies as much as possible …”

Characteristics                 English   French   German   Spanish   Hindi
Word Form (Words)               114,648   32,809   20,453   50,526   22,522
Word Sense (Synsets)             80,000   22,745   15,132   23,378    7,868
Avg. (Synset / Word Form)         2.236    2.176    2.301    2.360    3.889
Avg. (Word Form / Synset)         1.985    1.442    1.352    2.162    2.286
Equivalence Links to English      1.000    0.999    1.080    0.908       NA

(93)

Queries Run

• Three Commercial Database Systems were studied

– Identified only as Systems A, B and C to protect their identities

– WordNet stored in ASCII (English) and in Unicode (others)

• MLSemJoin queries that compute closures of various sizes

– Measured the Wall-Clock time for queries

– In MLSemJoin queries, the TC Computation takes ~98% of the time

• A Typical Query requires a Closure size of ~2,000

– Average over the Top-100 Query Nouns from a popular Web-Search Engine, on the English WordNet

– Assuming user is interested in 3 languages

(94)

Performance: Baseline MLSemJoin

Highlights:

Runtime is proportional to the Closure Cardinality

Runtime for a typical query (TC of size ~2,000) is in the tens of seconds (no index), and close to a second (with index)!

(95)

Optimization #1:

Pre-Computed Closures

• Pre-Compute the Closures for all nodes in WordNet W, and store them in a W_TC Table

– The Closure of x in W can then be computed by a scan of W_TC

– Index W_TC for performance

• Positives:

– Expected to have much better performance

– More importantly, linear scale-up wrt Closure Size

• Negatives:

– Space overheads are substantial

– For WordNet, the size goes from 4 MB/Language to ~120 MB/Language
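A minimal sketch of this optimization, over an illustrative toy taxonomy: every node's closure is materialized once into a W_TC table of <ancestor, descendant> pairs, so a closure query becomes a simple (indexable) scan with no recursion at query time.

```python
# Toy taxonomy W as a parent -> children map (illustrative data).
W = {
    "Knowledge": ["Philosophy", "History"],
    "History":   ["Biography", "Subject History"],
    "Biography": ["Autobiography"],
}

def precompute_w_tc(taxonomy):
    """Build W_TC: all <ancestor, descendant> pairs, one row per pair."""
    nodes = set(taxonomy) | {c for cs in taxonomy.values() for c in cs}
    w_tc = set()
    def descend(root, node):
        for child in taxonomy.get(node, []):
            if (root, child) not in w_tc:
                w_tc.add((root, child))
                descend(root, child)
    for node in nodes:
        w_tc.add((node, node))   # every node is in its own closure
        descend(node, node)
    return w_tc

W_TC = precompute_w_tc(W)

def closure(x):
    """Closure of x by a plain scan of W_TC -- no recursion at query time."""
    return {d for (a, d) in W_TC if a == x}
```

The space cost is visible here too: W_TC holds one row per ancestor/descendant pair, which is what inflates WordNet from megabytes to roughly a hundred megabytes per language.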

(96)

Performance:

Pre-Computed Closures

Highlights:

Runtime is near-Constant, irrespective of the magnitude of the Closure Cardinality

Runtime for all queries is sub-second (~700 mSec) with Index

(97)

Optimization #2:

Reversed Traversals

• Traverse the Taxonomic Hierarchy in Reverse

– Use the Same Taxonomic Table W

– Instead of checking if (Data ∈ Descendants of Query), check if (Query ∈ Ancestors of Data).

• Positives:

– Expected Closure Cardinalities are smaller

• Though WordNet is a DAG, its Average In-degree << Average Out-degree

• Negatives:

– The Closure must be computed for every Data String
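The reversed traversal can be sketched as follows; the toy parent map is illustrative, and the upward walk handles multiple parents since WordNet is a DAG:

```python
# Toy reverse taxonomy: child -> parents (illustrative data).
parent_of = {
    "Autobiography":   ["Biography"],
    "Biography":       ["History"],
    "Subject History": ["History"],
    "History":         ["Knowledge"],
}

def ancestors(data_synset):
    """Upward closure of a data synset; typically small, since the
    average in-degree is much smaller than the average out-degree."""
    seen, stack = set(), [data_synset]
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        stack.extend(parent_of.get(s, []))
    return seen

def matches(query_synset, data_synset):
    """Check (Query in Ancestors of Data) instead of
    (Data in Descendants of Query)."""
    return query_synset in ancestors(data_synset)
```

The drawback named on the slide is visible in `matches`: each data string triggers its own (small) closure computation.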

(98)

Performance:

Reversed Traversals

Highlights:

Clearly much better performance for a single TC computation

But the overall query is very expensive, since a Closure needs to be computed for every record

(99)

Optimization #3:

Reorganized Schema

• Leveraging the Structural Characteristics of the WordNet

– A large number of nodes have only a few children, obeying a Power Law

– Inlining up to 16 Children per row covers ~90% of the nodes
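One way this reorganization can be sketched (the slot count of 16 follows the slide; the splitting scheme itself is an illustrative assumption): children are inlined into a fixed number of columns in the node's own row, and only the rare wide nodes spill into an overflow table, so most child lookups read a single row.

```python
INLINE_SLOTS = 16  # per the slide: covers ~90% of nodes

def reorganize(taxonomy):
    """Split a parent -> children map into an inlined-row table (up to
    INLINE_SLOTS child columns) plus <parent, child> overflow rows."""
    inlined, overflow = {}, []
    for parent, children in taxonomy.items():
        head = children[:INLINE_SLOTS]
        inlined[parent] = head + [None] * (INLINE_SLOTS - len(head))
        overflow.extend((parent, c) for c in children[INLINE_SLOTS:])
    return inlined, overflow

def children_of(parent, inlined, overflow):
    """One inlined-row read for most nodes; overflow scan only for wide ones."""
    kids = [c for c in inlined.get(parent, []) if c is not None]
    kids += [c for (p, c) in overflow if p == parent]
    return kids
```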

(100)

Performance:

Re-Organized Schema

Highlights:

Runtime is ~3 orders of magnitude better than the Baseline, and ~1 order of magnitude better than Pre-Computed Closures, with No Space Overhead!

Runtime for the typical query (TC of size ~2,000) is ~25 mSec

(101)

Scaling up wrt Languages

Highlights:

Increased number of [simulated] languages up to 8

The performance of the Optimized Versions remains efficient

(102)

Implementation Architecture

[Architecture figure: the client submits a Query String and Match Parameters to the Server Manager and receives a Result Set; within the Database, the Query Processing Engine invokes the Semantic Equality Function through Recursive SQL, over a Unicode Ontology]

(103)

MLSemJoin: Take Away

The MLSemJoin operator, employing WordNet-based matching, can complement the standard lexicographic operators for multilingual semantic searches …

… further, its performance may be tuned to a level acceptable for online user interaction

(104)

Organization

• Motivation

• Multilingual Standards and Systems

• Multilingual Performance

• New Multilingual Query Operators

• Multilingual Relational Query Algebra

• Multilingual Implementation

• Conclusions & Future Research

(105)

Multilingual Relational Algebra

(Mural)

(106)

Why a Query Algebra?

• For expressing complex queries declaratively

• For evaluating alternative query execution plans

– Critical for leveraging the Query Optimizer

– For a core implementation of the multilingual operators

• What is Needed?

– Functionality defined as Operators

– Composition Rules, Cost Models and Selectivity Estimates

(107)

Multilingual Datatype and Operators

• Uniform ( Unicode Format ) Datatype

– A Representation that is tagged with language

– E.g.: <“Sample String”, English>, <“ àõñ£ù êóñ¢ ”, Tamil> or <“Corde Témoin”, French>

• Operators on Uniform datatype

– Composing: Z: <Text, ID> → Uniform

– Decomposing: Y: <Uniform> → <Text, ID>

– Uniform Equality: Ξ: <Uniform, Uniform> → <Boolean>

– MLNameJoin: Ψ: <Uniform, Uniform> → <Uniform, Uniform, Integer>

• Edit-Distance between the phonemic-equivalents of input Uniform strings

– MLSemJoin: Φ: <Uniform, Uniform> → <Uniform, Uniform, Boolean>

• Boolean indicates if LHS is a sub-class of RHS
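The Uniform datatype and its basic operators can be sketched as a language-tagged Unicode string; the strict equality semantics chosen here (equal iff both text and language match) is an illustrative assumption, not the tutorial's formal definition:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Uniform:
    """A Unicode string tagged with its language ID."""
    text: str
    lang: str

def compose(text, lang):            # Z: <Text, ID> -> Uniform
    return Uniform(text, lang)

def decompose(u):                   # Y: <Uniform> -> <Text, ID>
    return (u.text, u.lang)

def uniform_eq(u1, u2):             # Xi: <Uniform, Uniform> -> Boolean
    return u1.text == u2.text and u1.lang == u2.lang
```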

(108)

Composition Rules

(109)

MLNameJoin Operator

• Simplified version of Earlier Definition:

Ψ is Commutative and Associative with all relational operators

Cost of a Ψ Scan of a Table:
O(R_L · l_L · k / √Σ) and P_L Disk I/O (Without Index)
O(R_L · l_L · k² / √Σ) and A_L Disk I/O (With Index)

Cost of a Ψ Join of a pair of Tables:
O(R_L · R_R · l_L · k / √Σ) and (P_L + P_R) Disk I/O (Without Index)
O(R_L · R_R · l_L · k² / √Σ) and (A_L + A_R) Disk I/O (With Index)

Selectivity estimates are based on End-Serial histograms …

(110)

MLSemJoin Operator

• Simplified version of Earlier Definition:

Φ is not a Commutative operator, but it Associates with the others

Cost of a Φ Scan of a Table:
O(R_L + R_H(h+1)) and P_H(h+1) Disk I/O (Without Index)
O(R_L · l_L · k² / √Σ) and A_L Disk I/O (With Index)

Cost of a Φ Join of a pair of Tables:
O(R_L + R_R + U_R · R_H(h+1)) and (P_L + P_R) Disk I/O (Without Index)
O(R_L + R_R + U_R · log E_H(h+1)) and 3(P_L + P_R) · E_H Disk I/O (With Index)

Selectivity estimates are based on the structural characteristics of the hierarchy
