
"Parallel Star Join + DataIndexes": Efficient Query Processing in Data Warehouses and OLAP

Anindya Datta*, Debra VanderMeer*, Krithi Ramamritham†

Abstract

On-Line Analytical Processing (OLAP) refers to the technologies that allow users to efficiently retrieve data from the data warehouse for decision-support purposes. Data warehouses tend to be extremely large; it is quite possible for a data warehouse to be hundreds of gigabytes to terabytes in size [3]. Queries tend to be complex and ad-hoc, often requiring computationally expensive operations such as joins and aggregation. Given this, we are interested in developing strategies for improving query processing in data warehouses by exploring the applicability of parallel processing techniques. In particular, we exploit the natural partitionability of a star schema and render it even more efficient by applying DataIndexes, a storage structure that serves both as an index as well as data and lends itself naturally to vertical partitioning of the data. DataIndexes are derived from the various special-purpose access mechanisms currently supported in commercial OLAP products. Specifically, we propose a declustering strategy which incorporates both task and data partitioning, and present the Parallel Star Join (PSJ) Algorithm, which provides a means to perform a star join in parallel using efficient operations involving only rowsets and projection columns.

We compare the performance of the PSJ Algorithm with two parallel query processing strategies. The first is a parallel join strategy utilizing the Bitmap Join Index (BJI), arguably the state-of-the-art OLAP join structure in use today. For the second strategy we choose a well-known parallel join algorithm, namely the pipelined hash algorithm. To assist in the performance comparison, we first develop a cost model of the disk access and transmission costs for all three approaches.

Performance comparisons show that the DataIndex-based approach leads to dramatically lower disk access costs than both the BJI and the hybrid hash approaches, in both speedup and scaleup experiments, while the hash-based approach outperforms the BJI in disk access costs. With regard to transmission overhead, our performance results show that PSJ and BJI outperform the hash-based approach. Overall, our parallel star join algorithm and DataIndexes form a winning combination.

Keywords: parallel star join, OLAP, query processing, dataindexes

Contact: Anindya Datta, adatta@cc.gatech.edu, Phone: 404-442-9911, Fax: 404-442-9088

* Georgia Institute of Technology, Atlanta, GA 30332
† IIT Bombay

1 Introduction

On-Line Analytical Processing (OLAP) refers to the technologies that allow users to efficiently retrieve data from the data warehouse for decision-support purposes. A data warehouse can be defined as an on-line repository of historical enterprise data that is used to support decision making [15]. Data warehouses tend to be extremely large; it is quite possible for a data warehouse to be hundreds of gigabytes to terabytes in size [3]. The information in a warehouse is usually multidimensional in nature, requiring the capability to view the data from a variety of perspectives. In this environment, aggregated and summarized data are much more important than detailed records. Queries tend to be complex and ad-hoc, often requiring computationally expensive operations such as joins and aggregation. Further complicating this situation is the fact that such queries must be performed on tables having potentially millions of records. Moreover, the results have to be delivered interactively to the business analyst using the system.

Given these characteristics, it is clear that the emphasis in OLAP systems is on query processing and response times. OLAP scenarios in data warehousing differ from standard OLTP environments in two important ways: (1) the size of the data store, and (2) the underlying data model of the warehouse. In terms of size, a data warehouse is typically orders of magnitude larger than standard operational databases (i.e., hundreds of GBs, even TBs). These databases store historical data, not operational data, and are used primarily for decision support. Decision support requires complex queries, e.g., multi-way joins. In terms of the data model, most warehouses are modeled with a star schema, i.e., a fact table and a set of data dimensions. Star schemas have an important property in terms of join processing: all dimensions join only with the fact table (i.e., the fact table contains foreign keys for each dimension). As a result, all join paths lead through the fact table, which is typically the largest table by far, usually several times the sum of the sizes of the dimensions.

Given the above, we note that joins in data warehouses are particularly expensive: the fact table (the largest table in the warehouse by far) participates in every join, and multiple dimensions are likely to participate in each join. Clearly, applying parallel processing to the join operation in this case would be beneficial.

Conventional parallel query processing techniques (including others noted in Section 9) will clearly function in an OLAP environment. The question of interest is this: "Can more efficient techniques be developed, given the particular characteristics of the read-mostly OLAP environment?" For instance, the current state of the art in parallel join techniques, as exemplified by the Pipelined Hash Join using right-deep trees [4, 5], requires either (a) that all hash tables for participating build tables be co-resident in memory, or (b) that temporary results be spooled to disk, allowing reclamation of memory for the hash tables used to build that intermediate result. Given the large data sizes in a data warehousing environment, it is unlikely that sufficient memory will be available, particularly in the case of multi-way joins. In this paper, we propose a novel parallel processing technique to specifically address the large data sizes inherent in OLAP query processing, and provide efficient query processing.

Performance in a parallel system is typically measured using these two key properties:

Property 1: In a system with linear scale-up, an increase in hardware can perform a proportionately larger task in the same amount of time. Data warehouses tend to grow quite rapidly. For example, AT&T has a data warehouse containing call detail information that grows at a rate of approximately 18 GB per day [20]. Thus, a scalable architecture is crucial in a warehouse environment.

Property 2: In a system with linear speedup, an increase in hardware results in a proportional decrease in processing time. As we shall show, by partitioning data among a set of processors, and by developing query processing strategies that exploit this partitioning, OLAP queries can potentially achieve good speedup, significantly improving query response times.

The first property is obvious, while the latter point is best illustrated using an example. Recall that in a ROLAP environment, the data is stored in a relational database using a star schema. A star schema usually consists of a single fact table and a set of dimension tables. Consider the star schema presented in Figure 1A, which was derived from the TPC-D benchmark database [27] (with a scale factor of 1). The schema models the activities of a world-wide wholesale supplier over a period of seven years. The fact table is the SALES table, and the dimension tables are the PART, SUPPLIER, CUSTOMER, and TIME tables. The fact table contains foreign keys to each of the dimension tables. This schema suggests an efficient data partitioning, as we will soon show.

A common type of query in OLAP systems is the star-join query. In a star-join, one or more dimension tables are joined with the fact table.

[Figure 1: A Sample Warehouse Star Schema and Projection Index. Panel A shows the TPC-D-derived star schema: a 6,000,000-row SALES fact table with foreign keys (PartKey, SuppKey, CustKey, and the date keys ShipDate, CommitDate, ReceiptDate) into the PART (200,000 rows), SUPPLIER (10,000 rows), CUSTOMER (150,000 rows), and TIME (2,557 rows) dimension tables. Panel B shows the decomposition of SALES into projection indexes, with ordinal mappings between key attributes, non-key attributes, and the dimension tables.]

For example, the following query is a three-dimensional star-join that identifies the volumes sold locally by suppliers in the United States for the period between 1996 and 1998 [27]:

Query 1

SELECT U.Name, SUM(S.ExtPrice)

FROM SALES S, TIME T, CUSTOMER C, SUPPLIER U

WHERE T.Year BETWEEN 1996 AND 1998

AND U.Nation='United States' AND C.Nation='United States'

AND S.ShipDate = T.TimeKey AND S.CustKey = C.CustKey

AND S.SuppKey = U.SuppKey

GROUP BY U.Name

A set of attributes that is frequently used in join predicates can be readily identified in the structure of a star schema. In the example star schema, ShipDate, CustKey, SuppKey, and PartKey of the SALES table can be identified as attributes that will often participate in joins with the corresponding dimension tables. We can thus use this information to apply a vertical partitioning method on these attributes to achieve the benefits of parallelism. This paper shows, in fact, that one can use a combination of vertical and horizontal partitioning techniques to extract the parallelism inherent in star schemas.

Specifically, we propose a declustering strategy which incorporates both task and data partitioning, and present the Parallel Star Join (PSJ) algorithm, which provides a means to perform a star join in parallel using efficient operations involving only rowsets and projection columns.

We compare the performance of the PSJ algorithm with two parallel query processing strategies. The first is a parallel join strategy utilizing the Bitmap Join Index (BJI), arguably the state-of-the-art OLAP join structure in use today. For the second strategy we choose the Pipelined Hash strategy [4] (HASH), one of the best-performing parallel query processing strategies from the traditional OLTP literature.

Our performance results indicate that the PSJ approach leads to dramatically better performance than the pipelined hash approach, with regard to disk access costs and transmission costs, in both speedup and scaleup experiments. The pipelined hash approach, in turn, outperforms the BJI approach in terms of disk access costs (although not in terms of transmission costs). A full discussion of our results can be found in Section 8.

A large body of work exists in applying parallel processing techniques to relational database systems (e.g., [8, 26, 28, 25]). From this work has emerged the notion that highly-parallel, shared-nothing architectures can yield much better performance than equivalent closely-coupled systems [24, 17, 9]. Shared-nothing architectures have been shown to achieve near-linear speedups and scale-ups in OLTP environments as well as on complex relational queries [10].

The primary contribution of this paper is in its basic theme, i.e., the exploration of parallel processing with regard to OLAP. To the best of our knowledge, this is one of the initial endeavors in this direction (we have not come across many such reports in the published literature). Specifically, the contribution of this paper is manifold: (1) It proposes a parallel physical design for data warehousing. (2) It proposes a parallel star join strategy based on this physical design and evaluates its performance. (3) It demonstrates the applicability of parallel OLTP strategies in the OLAP context. Note that some of the major DBMS vendors offer products that support various levels of parallel processing. We describe this work in more detail in Section 9 and contrast it with our work. Note also that integration with existing systems is a separate issue, and outside the scope of this paper.

The remainder of the paper is organized as follows. In Section 2 we introduce an approach to structure data warehouses by exploiting our proposed indexing strategies. The associated physical design and star join processing strategies are discussed in Section 3. This is followed by the parallel BJI join in Section 4 and the pipelined hash approach in Section 5. We then present a cost model of the disk access and transmission costs of the three approaches in Section 6, and a system model for performance comparison in Section 7. We compare the performance of these approaches in Section 8. Finally, in Section 9 we discuss related work, and in Section 10 we conclude the paper.

2 A Physical Design Principle to Exploit Parallelism

In this section we show how, by judiciously using many of the indexing schemes proposed in the literature, we can structure a data warehouse to make it amenable to parallel query processing. Four index types are shown in [22] to be particularly appropriate for OLAP systems: B+ trees, indexes based on bitmaps [22, 21], projection indexes, and bit-sliced indexes [22]. Consider the division of the SALES table in Figure 1A into seven smaller tables, as shown in Figure 1B. This scheme is composed of 7 vertical partitions: one for each of the dimensional attributes and one for the remaining columns from the original SALES table. With this division, a record in the original SALES table is now partitioned into 7 records, one in each of the resulting tables. Each of the 7 new tables is akin to a projection index. A projection index contains the copy of a particular column, namely, the column being indexed. In this sort of partitioning, the columns being indexed are removed from the original table and stored separately, with each entry being in the same position as its corresponding base record. The isolated columns can then be used for fast access to data in the table. When indexing columns of the fact table, storing both the index and the corresponding column in the fact table results in a duplication of data. In such situations, it is advisable to only store the index if original table records can be reconstructed easily from the index itself. This is how Sybase IQ stores data [12, 22].

projection index to contain multiple columns. A graphical representation of this structure is

shown in Figure2. In this gure, we show the actual storage congurations of the two cases: a

base table (Figure2a) and the correspondingpartitioned structure (Figure2b). The base table

consists of the attributes TimeStamp, Tax, Discount, Status and two projection indices are

(7)

Record Base SALES Table

Projection Index on SALES.TimeStamp

Projection Index on (SALES.Tax, SALES.Discount, SALES.Status)

π π

TimeStamp Tax Discount Status TimeStamp Tax Discount Status TimeStamp Tax Discount Status

Status π π TimeStamp

TimeStamp TimeStamp TimeStamp TimeStamp TimeStamp TimeStamp TimeStamp TimeStamp TimeStamp TimeStamp

Tax Discount Status Tax Discount Status Tax Discount Status Tax Discount Status Tax

Conventional Relational Representation

(b) (a)

π π

Projection Index Representation

Status

Figure2: Projection Index

constructed, one on the TimeStamp column, and another on the Tax, Discount and Status

As indicated by the dotted lines joining records from the two indices, the order of the records in the base table is conserved in both indices. This allows for an efficient mapping between the entries in the two projection indexes. This mapping is accomplished through the use of positional indexing, which refers to accessing tuples based on their ordinal position. This ordinal mapping is key to the idea of positional indexing. For example, in the schema in Figure 1B, if we need to determine the ShipDate for the third SALES record, we would do this by accessing the third entry of the projection index for SALES.ShipDate. Positional indexing is made possible by row identifiers (RIDs), a feature provided by most commercial DBMS products [21, 6].
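To make the positional-indexing idea concrete, here is a minimal Python sketch of projection indexes stored as parallel arrays in base-table order; the column values are illustrative examples, not taken from the TPC-D data.

    # Projection indexes modeled as parallel arrays: the i-th entry of every
    # index belongs to the i-th record of the base SALES table.
    ship_date_pi = ["1996-01-03", "1997-06-12", "1998-02-27"]  # PI on SALES.ShipDate
    ext_price_pi = [1203.50, 99.99, 4500.00]                   # PI on SALES.ExtPrice

    def positional_lookup(pi, ordinal):
        # Fetch the value for the record at a given ordinal position (0-based).
        return pi[ordinal]

    # The ShipDate of the third SALES record, as in the example in the text:
    print(positional_lookup(ship_date_pi, 2))  # -> 1998-02-27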

In decision support databases, a large portion of the workload consists of queries that operate on multiple tables. Many queries on the star schema of Figure 1A would access one or more dimension tables and the central SALES table. Access methods that efficiently support join operations thus become crucial in decision support environments [21]. The idea of a projection index presented above can very easily be extended to support such operations. Consider, for instance, an analyst who is interested in possible trends or seasonalities in discounts offered to customers. This analysis would be based on the following query:

Query 2

SELECT TIME.Year, TIME.Month, AVG(SALES.Discount)

FROM TIME, SALES

WHERE TIME.TimeKey = SALES.ShipDate

GROUP BY TIME.Year, TIME.Month

O'Neil and Graefe [21] introduced the idea of a bitmapped join index (BJI) for efficiently supporting multi-table joins. A BJI associates related rows from two tables [21], as follows. Suppose tables $T_1$ and $T_2$ are related by a one-to-many relationship (i.e., one record of $T_1$ is referenced by many records of $T_2$). A bitmapped join index from $T_1$ to $T_2$ can be seen as a bitmapped index that uses RIDs of $T_1$ to index the records of $T_2$. (Further details of BJIs are presented in Section 4.) In fact, we can further reduce the number of data blocks to be accessed while processing a join by storing the RIDs of the matching dimension table records, instead of the corresponding key values, in a projection index for a foreign key column. Such an index from $T_2$ to $T_1$ is called a Join Index (JI) in the sequel. For instance, the JI on SALES.ShipDate would consist of a list of RIDs on the TIME table. (One can also achieve an equivalent, and sometimes more efficient, representation by storing actual ordinal positions corresponding to the TIME table, rather than the RIDs.) Such a JI is shown in Figure 3. As before, we show both the conventional relational and the JI representations. In the conventional approach, we show referential integrity links between the SALES and TIME tables as dashed arrows. For the JI approach, we use solid arrows to show the rows to which different RIDs point and dotted lines to show that the order of the records in the JI and the SALES projection index is preserved from the base table.¹

[Figure 3: The Join Index. The conventional relational representation shows the base TIME and SALES tables with referential-integrity links; the join index representation shows the JI on SALES.TimeStamp (a column of RowIDs into TIME) alongside the projection index on (SALES.Tax, SALES.Discount, SALES.Status).]

¹ Throughout the paper, the descriptions for the storage structures and the algorithms assume that referential integrity is maintained. With simple extensions, our approach can be made to tolerate acceptable violations of referential integrity.

Because it stores, for each fact table row, a direct reference into the dimension for the indexed foreign key column, the JI provides a direct mapping between individual tuples of the SALES and TIME tables. Because of this, the join required to answer Query 2 can thus be performed in a single scan of the JI. This property of JIs is indeed attractive, since the size of this index is, of course, proportional to the number of tuples in the table from which it was derived.
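As an illustration, here is a small Python sketch of a single-scan JI join answering Query 2, using the ordinal-position variant of the JI described above; all table contents are made-up examples.

    from collections import defaultdict

    # TIME dimension PIs, in TIME-table order (index = ordinal position).
    time_year  = [1996, 1996, 1997]
    time_month = [1, 2, 1]

    # JI on SALES.ShipDate: for each SALES row, the ordinal position of the
    # matching TIME row (the ordinal variant of the RID-based JI in the text).
    ji_shipdate = [0, 2, 1, 0]

    # PI on SALES.Discount, in SALES-table order.
    sales_discount = [0.05, 0.10, 0.00, 0.07]

    # One scan of the JI suffices to join SALES with TIME and group the results.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sales_row, time_row in enumerate(ji_shipdate):
        key = (time_year[time_row], time_month[time_row])
        sums[key] += sales_discount[sales_row]
        counts[key] += 1

    for (year, month) in sorted(sums):
        print(year, month, sums[(year, month)] / counts[(year, month)])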

The application of the above ideas to a data warehouse results in a physical design that exploits parallelism. This design principle requires the storage of each foreign key column in the fact table as a JI, and the rest of the columns in the star schema (for both dimension as well as fact tables) as projection indexes.

In summary, the data warehouse structure discussed in this section takes the best aspects of vertical partitioning, projection indexes, and join indexes and integrates them such that, as shown in the next section, an effective parallel join algorithm can be developed. Performance studies, discussed in Section 8, show that this join algorithm enhances the performance of star-join queries in parallel environments compared to traditionally used approaches.

3 The Parallel Star Join

In this section, we describe a data placement strategy based on the physical design strategy outlined in the previous section, assuming a shared-nothing architecture with N processors. Subsequently, we present an efficient Parallel Star Join Algorithm that exploits this placement strategy. In Table 1, we show the various notations used throughout the paper.

3.1 Data Placement Strategy

Assume a d-dimensional data warehouse physically designed according to the strategy outlined in the previous section. The basic approach to data placement is as follows: partition the N processors into d+1 (potentially mutually non-exclusive) processor groups. Assign, to processor group j, the dimension table j, i.e., $D_j$, and $J_j$, the fact table JI corresponding to the key attribute of $D_j$. Inside processor group j, a hybrid strategy is used to allocate records to individual processors.

N: Number of processors
d: Number of dimensions
$D_j$: Dimension j
F: Fact table
$J_j$: Join index on $D_j$
$P_{cname}$: Projection index on attribute cname
$S_{md}$: Aggregate size of metric data
$S_j$: Aggregate size of $D_j$
m: Memory per processor (bytes)
$G_j$: Processor group for $D_j$
$P_{ij}$: Processor i of $G_j$
$A_P^m$: Set of projection predicates on metric data
$A_P^d$: Set of projection predicates on dimensions
$P_\sigma$: Set of restriction predicates
$P_{\bowtie}$: Set of join predicates
$R_{global}$: Global join rowset
$R_{dim,i}$: Dimension restriction rowset for $D_i$
w(D): Width of a tuple in a dimension D
B: Size of a disk block (bytes)
R: Size of a RID
$a_{ij}$: Attribute j of dimension i
$A_R$: Set of restriction attributes
$a_{Fj}$: Metric attribute j of the fact table
$A_J$: Set of join attributes
$W_{a_{ij}}$: Width of attribute $a_{ij}$, in bytes
$A_P$: Set of projection attributes
$W_{JI}$: Width of JI, in bytes
$D_J$: Number of dimensions participating in join
$N_i$: Number of processors in group i
$D_P$: Number of dimensions participating in output
$N_F$: Number of processors in the metric group
$S_m$: Size of available memory
SF: Scale factor (affects size of database)
P: Order of BI or BJI
V: Number of distinct values indexed in BI or BJI
f: Bitmap compression factor
K: Number of search key values per index node
$\sigma$: Selectivity of dimension
$T_B$: Tuples per block

Table 1: Table of Notation for Cost Model

The metric PIs (that is, PIs of columns not associated with foreign keys) are allocated to group d+1.

There are three fundamental motivations behind this approach. (1) The task of data placement can be hinted by the structure of the star schema. For example, the primary key of a dimension table and its associated foreign key in a fact table can be the most appropriate candidates for the partitioning attributes, because they are expected to be used as join attributes frequently. (2) The use of JIs makes it possible to co-locate the fact table with multiple dimension tables at the same time by grouping each dimension table with its associated JI and partitioning them by the same strategy. (In general, with a traditional horizontal partitioning method, a relation can be co-located with only one other relation.) Therefore, the join of a dimension table and a fact table can be computed efficiently without data redistribution, and completely independent of other join computations that involve different dimension tables and the same fact table. (3) It is generally the case that the size of a dimension table is much smaller than that of a fact table, and often small enough to fit in main memory. Thus, given the number of available processors and the aggregate main memory capacity of a particular processor group, the relative sizes of dimension tables can be used to determine an ideal degree of parallelism for each dimension, that is, for a dimension table and its associated JI.

Now we describe our strategy in detail. Essentially there are two phases: (a) a processor group partitioning phase, where we determine the composition of each processor group, and (b) a physical data placement phase, where we allocate data fragments to individual processors.

3.1.1 Processor Group Partitioning

The first phase computes, for each dimension j, the composition of the processor group (i.e., the physical processors assigned to each group) where the j-th dimension table and its associated JI are stored. Since every JI has the same cardinality as the fact table and consequently identical data volumes, the size ($S_j$) of the j-th dimension table (in bytes) is used to determine the size of its corresponding processor group. This is not just to balance the data distribution across available processors but also to minimize the data accesses required to process a join computation between a dimension table and its JI and thereby improve response times.

Group d+1, i.e., the group that houses the metric PIs, uses a different criterion than the dimensional groups for determining its composition. There are two main reasons for this. First, the metric attributes do not appear in any join predicates. (However, they may appear in restrictions.) Second, the volume of the metric data is largely independent of those of the dimension tables. Thus, we choose to use the metric data volume, $S_{md}$, relative to the volume of the entire data warehouse, to determine the composition of the metric group.

Before delving into the precise details of our approach, we first enumerate a number of issues that need to be considered in tackling this problem. Note that the optimization strategy described below considers the size of dimensions in forming processor groups. Clearly, this could easily be extended to include constraints based on knowledge of the query workloads on each dimension in addition to dimension size, further improving the grouping.

(1) We first make a remark regarding the computation of the sizes of (i.e., the number of processors in) the different dimensional groups, denoted by $N_1, \ldots, N_d$. The fundamental goal here is to have group sizes large enough such that the entire dimension table fits into the aggregate memory of the processors comprising this group. Intuitively then, if the dimension table can be loaded in the aggregate memory of its processor group, the join computation can be done with only a single scan of the JI. It is, of course, possible that this goal cannot be achieved, given the available resources. The size of the j-th processor group (i.e., $G_j$), $1 \le j \le d$, is given by $N_j = \min(N, \lceil S_j / m \rceil)$, where m is the size of the main memory attached to each processor.² This assumes that all the processors have an equal amount of memory.

(2) Next we comment on the minimum size of the metric group, for which we use a different logic, as it does not participate in joins, unlike the dimensional groups. We choose to use the metric data volume relative to the entire data warehouse in order to determine the minimum size of $G_{d+1}$. In other words,

$N_{d+1} = N \cdot \frac{\text{Aggregate Size of Metric PIs}}{\text{Total Size of the Data Warehouse}}$

(3) Note that a processor may participate in more than one group. There may be many reasons for this. A trivial case would be when there are more dimensions than processors. A more likely case would be when the data sizes (dimensional, metric, or both) are significant enough that the sum of the sizes of different groups (given the criteria outlined in items (1) and (2) above) may exceed the number of processors available, which would mandate the assignment of the same processor to different groups. This phenomenon adds the following requirement to the problem: the overlap of the processor groups must be minimized. We have developed an optimal solution to the processor group partitioning problem by formulating it as a constrained optimization problem solvable as a linear integer mathematical program. Due to space limitations, we are unable to include it in this paper; however, readers are referred to [7] for full details.
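As a quick illustration of the two sizing rules above, the following Python sketch computes group sizes from table sizes; the function names and example numbers are ours, and the "+1 block for the JI" refinement mentioned in footnote 2 is omitted.

    import math

    def dimension_group_size(n_procs, dim_bytes, mem_per_proc_bytes):
        # N_j = min(N, ceil(S_j / m)): just enough processors that the
        # dimension table fits in the group's aggregate memory.
        return min(n_procs, math.ceil(dim_bytes / mem_per_proc_bytes))

    def metric_group_size(n_procs, metric_bytes, warehouse_bytes):
        # N_{d+1} = N * (aggregate size of metric PIs / total warehouse size),
        # rounded to a whole number of processors (at least one).
        return max(1, round(n_procs * metric_bytes / warehouse_bytes))

    # Example: 8 processors, a 1.2 GB dimension, 256 MB of memory per processor.
    print(dimension_group_size(8, 1.2e9, 256e6))  # -> 5
    # Metric PIs making up 60% of the warehouse claim most of the processors.
    print(metric_group_size(8, 6e9, 10e9))        # -> 5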

In the context of a data warehousing environment, where star schemas are common, the above-mentioned optimization strategy typically results in the assignment of each processor to two groups: (a) a single dimension $D_i$, and (b) the fact table. This scenario is depicted graphically in Figure 4, and occurs because of the relative difference in the sizes of dimensions versus the size of the fact table.

[Figure 4: Example Processor Grouping for a Star Schema. Eight processors P1 through P8 all belong to G1 (the fact table group), while G2 (CUSTOMER), G3 (PART), G4 (TIME), and G5 (SUPPLIER) each consist of a subset of the processors, sized according to the relative sizes of the dimensions.]

² To be precise, the memory should have space for loading (at least) one block of the JI. This "+1" factor is not included here so as to avoid clutter.

The allocation shown in Figure 4 allocates processors to dimensions based on the relative size of the dimension; however, if the query workload is known, processor allocation for dimensions could take this into account. Note, however, that even in this case, the fact table would still be allocated a full set of processors, since it is not only the largest table (by far), but it also participates in every join (all join paths lead through it).

3.1.2 Physical Data Placement

In this phase, the actual data are placed on the individual processors in the groups. To state our approach for data placement, we will simply discuss the approach in the context of a single group. Consider processor group j, denoted by $G_j$, consisting of $N_j$ processors, $P_1, P_2, \ldots, P_{N_j}$. The exact processor-to-group assignment is done by solving the optimization problem described previously. Clearly, the contents of $G_j$ include the PIs corresponding to the j-th dimension table and the associated JI, denoted by $J_j$, from the fact table.

We first horizontally partition the JI in round-robin fashion among the $N_j$ processors, and replicate the dimensional PIs across all processors of the group. Our rationale for adopting the above-mentioned strategy has two salient features: (a) the JI partitioning strategy and (b) the dimensional replication strategy. The dimensional strategy is easy to understand. Replicating PIs across all processors in a group ensures that all JI records and their associated primary keys are co-located. Thus, all matching records satisfying any join predicate will be found on the same processor, ensuring that joins can be performed and outputs created efficiently in parallel. Note that this allocation scheme is not based on fixed sizes, but rather on relative sizes. Even though table sizes may change, tables in data warehouses tend to retain similar sizes relative to one another, thus allowing grouping on this basis.

An alternative dimensional strategy, partitioning the PIs across the processors of a group, was considered. However, this strategy presented two problems. First, since many (not necessarily co-located) JI records may point to the same PI record, some replication of PI data across processors would be required to ensure co-location of matching PI and JI records. Second, the RIDs in the PI records would change with partitioning, which would require updating the JI entries accordingly.

The objective of the JI partitioning strategy is to preserve the ordinal position mapping property, which is threatened by partitioning JIs across different processors. It is easy to see that when the records are partitioned, it is important to be able to regenerate the original ordinal position of a record, i.e., given a partitioned JI record i in processor j for group k, we want to be able to say that this record occupied the o-th ordinal position in the original, unpartitioned JI. This is important for several reasons, e.g., to form the final output of a join by putting together the output from various processor groups. Of the well-known horizontal partitioning mechanisms (such as hash, range, and round-robin), only the round-robin policy is capable of naturally ensuring this mapping.
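The arithmetic behind this mapping is simple enough to state directly; the following Python sketch (with our own function names) shows the round-robin transformations between global ordinal positions and per-processor local positions.

    def local_position(global_ordinal, n_procs):
        # Round-robin partitioning sends global JI record o to processor
        # o mod n_procs, where it becomes local record o div n_procs.
        return global_ordinal % n_procs, global_ordinal // n_procs

    def global_ordinal(proc, local_index, n_procs):
        # Inverse mapping: regenerate the record's position in the
        # original, unpartitioned JI.
        return local_index * n_procs + proc

    # Round-trip check for a 4-processor group:
    for o in range(12):
        p, i = local_position(o, 4)
        assert global_ordinal(p, i, 4) == o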

3.2 The Parallel Star Join Algorithm

In this section we present our algorithm to perform star joins in parallel. We assume a physical design strategy as described in Section 2 and a partitioning strategy as described in Section 3.1. We represent a general k-dimensional star-join query as follows.

Query 3
SELECT $A_P^d$, $A_P^m$
FROM F, $D_1$, ..., $D_k$
WHERE $P_{\bowtie}$ AND $P_\sigma$

Here $D_1, \ldots, D_k$ are the k dimensional tables participating in the join. $P_\sigma$ and $P_{\bowtie}$ denote a set of restriction and join predicates, respectively. We assume that each individual restriction predicate in $P_\sigma$ concerns only one table and is of the form ($a_j$ ⟨op⟩ constant), where $a_j$ is any attribute in the warehouse schema and ⟨op⟩ denotes a comparison operator (e.g., $=, \le, \ge$). We assume each join predicate in $P_{\bowtie}$ is of the form $a_l = a_t$, where $a_t$ is any dimensional key attribute and $a_l$ is the foreign key referenced by $a_t$ in the fact table.

Based on the partitioning strategy described earlier, a join query such as the one above will reduce to a number of one-dimensional joins in each processor group. These can be performed in parallel. These smaller joins will produce local groupwise rowsets that will be processed to generate a global rowset, which will be used to produce the final output. Accordingly, to describe our Parallel Star Join (PSJ) algorithm we will subdivide it into three phases: (a) the Local Rowset Generation (LRG) phase, (b) the Global Rowset Synthesis (GRS) phase, and (c) the Output Preparation (OP) phase.

3.2.1 Local Rowset Generation

In the LRG phase, each dimensional processor group generates a rowset (with fact table cardinality) representing the fact table rows that qualify based on the restrictions and join relevant to that group. This proceeds as follows. Consider dimensional processor group i, which consists of c processors and houses the PIs corresponding to $D_i$ and the associated JI from the fact table, $J_i$. The restriction and join predicates that apply to dimension i will be shipped to this group for evaluation. Let us assume, for simplicity, that group i receives, in addition to the i-th dimensional join predicate, a restriction predicate for a dimensional PI (note that more than one restriction predicate may be received in reality).

The first step of the LRG phase, Load PI Fragments, performed at each participating group i, generates a dimensional rowset, $R_{dim,i}$, based on the restriction(s). This rowset is a bitvector of cardinality of $D_i$ in which bits are set corresponding to rows in $D_i$ that meet the restriction criterion. This rowset is developed in the following manner. First, each processor is allocated a range of the PI amounting to $\frac{1}{c}$-th of the dimensional PI. For example, the first processor in the group loads records 1 to $\frac{|D_i|}{c}$, the second loads records $\frac{|D_i|}{c}+1$ to $\frac{2|D_i|}{c}$, and processor c loads records $\frac{(c-1)|D_i|}{c}+1$ to $|D_i|$. Then, each processor scans the fragment allotted to it, setting the corresponding bit(s) in the rowset for rows meeting the restriction. This process can easily be expanded to handle more than one restriction predicate by considering all the restrictions during the PI scan, and setting the corresponding rowset bit only for those records meeting all the restriction conditions.

The second step of the local rowset generation process, Merge Dimension Rowset Fragments, involves the merging of the restriction rowsets generated on each processor.

The restriction rowsets are merged in parallel via transmission through a binary tree structure of c leaf nodes, one for each processor of the group. The restriction rowsets are first merged pairwise, then the pairwise results are merged, and so on, until a final merged rowset is generated.

We now describe the actual merging operation. This operation takes as input two rowsets and combines them. Since each processor is allocated a disjoint range of PI records to examine for the restriction condition(s), each processor is responsible for a non-overlapping set of bits in the final restriction rowset. Thus, merging two rowsets involves simply OR-ing them together.
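The sketch below models this step in Python, with rowsets represented as integers used as bitvectors; the fragment boundaries and restriction predicate are illustrative.

    def rowset_fragment(pi_fragment, start, predicate):
        # Scan a PI fragment covering rows [start, start + len(fragment)) and
        # set the bit for every row that satisfies the restriction.
        bits = 0
        for offset, value in enumerate(pi_fragment):
            if predicate(value):
                bits |= 1 << (start + offset)
        return bits

    def merge_rowsets(fragments):
        # Fragments cover disjoint bit ranges, so the (pairwise, tree-structured)
        # merge reduces to a plain bitwise OR.
        result = 0
        for frag in fragments:
            result |= frag
        return result

    # Two processors each scan half of a 6-row dimension PI (Year values):
    years = [1995, 1996, 1997, 1998, 1999, 1996]
    f1 = rowset_fragment(years[:3], 0, lambda y: 1996 <= y <= 1998)
    f2 = rowset_fragment(years[3:], 3, lambda y: 1996 <= y <= 1998)
    print(bin(merge_rowsets([f1, f2])))  # -> 0b101110 (bits 1, 2, 3, 5 set)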

When the final restriction rowset has been generated, the next step, Distribute Dimension Rowset, takes place. Here, the final dimension rowset is transmitted back to the individual processors of the group through the same binary tree transmission mechanism through which the merging process took place.

Once the dimensional restriction rowset has been constructed and distributed, the next step, Load JI Fragments, loads the JI fragments allocated to group i in preparation for the creation of the local fact rowsets, $R_{fact,i}$. This is a bitvector of cardinality of the fact table, where a bit is set if the corresponding row of the fact table satisfies the join condition for this dimension. The precise logic to set the bits in $R_{fact,i}$ is given below. This procedure assumes that the rowset structure is already defined and initialized (i.e., the insertion pointer points to the first bit of $R_{fact,i}$). The above discussion, for expository ease, assumes a centralized join in group i.

Algorithm 1 Load JI Fragments Algorithm
1: start scanning $J_i$ from the top
2: for each row j in $J_i$ ($1 \le j \le |F|$) do
3:   read the value of the current element, which yields a RID
4:   map this RID to an ordinal position in $D_i$, say k
5:   if the k-th bit in $R_{dim,i}$ is set then
6:     set the j-th bit of $R_{fact,i}$

In reality, a segment of $R_{fact,i}$ is generated at each processor of group i and then merged. Note that in order to perform this merging, the system needs to map ordinal positions of fragments at each physical processor into ordinal positions in the unpartitioned tables. This is done by simple arithmetic transformations; the details of this are given later in this section.

Finally, a note regarding group d+1, i.e., the metric group. If there exist one (or more) metric restriction(s) in the submitted query, then these are evaluated at this group and a rowset, i.e., $R_{fact,(d+1)}$, is constructed. Clearly, no join takes place here.

In the final step of the first phase, Merge Partial Fact Rowsets, the partial rowsets created at the individual processors are merged. This completes the Local Rowset Generation phase of the algorithm, i.e., the processing local to each group.

3.2.2 Global Rowset Synthesis

In the first step of the GRS phase, Merge Local Fact Rowsets, a global rowset, denoted by $R_{global}$, is constructed by combining the rowsets $R_{fact,i}$, for all i, generated in the LRG phase by each group. We remind the reader that each such rowset is simply a bitvector of fact table cardinality in which bits are set corresponding to records meeting the local restriction and join conditions. For a record to participate in the output of the query, it must meet all the restriction and join conditions, i.e., the corresponding bit must be set in all the rowsets $R_{fact,i}$. Thus, the global rowset is simply the bitwise AND of all the local rowsets.

We generate the global rowset in a manner similar to the generation of the local restriction rowsets. The local rowsets are transmitted and merged through a binary tree construct in which the number of leaf nodes is equal to the number of dimensions participating in the join. The transmission portion of this operation is virtually the same as that of the local rowset generation operation, but the merge operation consists of a bitwise AND operation. The final rowset contains bits set only for records that meet all the join and restriction conditions, and should thus participate in the output of the query.
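In the bitvector-as-integer model used in the earlier sketch, global rowset synthesis is a one-line fold; a minimal Python illustration:

    from functools import reduce

    def global_rowset(local_fact_rowsets):
        # A fact row participates in the output only if its bit is set in
        # every group's local rowset, so the merge is a bitwise AND.
        return reduce(lambda a, b: a & b, local_fact_rowsets)

    # Two 6-row local fact rowsets (bit i set = fact row i qualifies):
    print(bin(global_rowset([0b101101, 0b100111])))  # -> 0b100101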

Once the global rowset has been generated, it is transmitted to those processor groups that participate in the output phase of the query in the second and final step of the GRS phase, Distribute Global Rowset to Groups. Each such group houses a dimension which contributes to the final output. For example, if Customer.name is an output column (identified by its presence in the SELECT clause of the query), then the group housing the customer dimension will participate in the OP phase.

3.2.3 Output Preparation

The OP phase is performed by each participating processor group (to be simply referred to as a participating group henceforth) contributing a column of output, followed by the eventual "racking" together of these columns into the final result. Each participating group uses the global rowset computed in the GRS phase and, in conjunction with the dimensional rowset already computed in the LRG phase and the PI(s) that contribute to the output, constructs the final output column. For instance, consider the previous example where Customer.name is an output column. The corresponding participating group houses the PIs for the Customer dimension as well as the Customer JI from the fact table, denoted by $J_{customer}$. In this group, there will exist a PI on the Customer.name column, denoted by $P_{cname}$. Furthermore, assume there exists a dimensional rowset, denoted by $R_{Customer}$, that was computed in the LRG phase.³

The first step in the OP phase, Distribute Global Rowset to Processors, involves the transmission of the global rowset to all processors of a participating group. In the next step, Load PIs, the PI columns necessary for output are loaded.

When the global rowset, $R_{global}$, has been shipped to the customer dimensional group and all necessary data loaded, the following procedure, Merge Output, is executed to construct the final output column. In describing this procedure we assume that the final output column will be encapsulated in a structure called cust_name. The procedure also assumes that the cust_name structure is already defined and initialized (i.e., the insertion pointer points to the first slot (or row) in the structure).

Algorithm 2 Merge Output Algorithm
1: start scanning $R_{global}$ from the top, i.e., the first bit
2: for each bit in $R_{global}$ do
3:   let the ordinal position of the current bit in $R_{global}$ be denoted by i
4:   if the i-th bit (i.e., the current bit) in $R_{global}$ is set then
5:     read the i-th element of $J_{customer}$, which yields a RID of the primary key PI of the customer dimension
6:     map this RID to an ordinal position, say j
7:     read the element in the j-th position in $P_{cname}$
8:     insert this element into the cust_name structure
9:     move the insertion pointer of cust_name to the next insertion position

Again, note that the above description assumes (for ease of explanation) a "centralized" structure. In reality, however, each physical processor in a participating group would execute the above procedure and produce a fragment of the output column, which would then be merged to produce the complete column.

³ This is a valid assumption. If there were no restriction clauses in the submitted query based on the customer dimension, one can assume that all bits in $R_{Customer}$ are set.

In terms of costs (detailed in Section 6), one can think of the following phases: (a) distributing the global rowset to all processors, (b) loading PIs and JIs at each processor in every participating group to produce the output (as indicated in the procedure above), (c) merging the output fragments produced by each individual processor in a participating group to produce a local output, i.e., one column of final output, and (d) merging local outputs to produce the final output.

In a centralized system, a query is executed as a series of disk accesses, which load the relevant portions of the database into memory, interleaved with bursts of CPU activity, when the loaded data is operated upon. Mapping functions are required to determine the specific disk block that needs to be accessed, and these depend on the index structure used. With this strategy, in most cases, the delays associated with the mapping computations will be negligible compared to the much slower storage access times [6]. This expectation is corroborated by other studies [22], which have shown that I/O related costs (disk access plus I/O related CPU costs) are several orders of magnitude more than other CPU costs relating to query processing. Based on these findings, in a centralized system one can focus on analyzing the query performance with respect to disk access delays.

In a parallel system, while the focus is still on the delays in obtaining needed data blocks, the difference (from centralized systems) arises from the fact that the required data can come from other nodes/processors as well as from the disk. Hence, in this paper, we are interested in response time (as measured in disk I/O delays as well as delays in obtaining the data from other nodes) and the volume of data transmitted between processors.

To aid the reader in understanding how the response time is computed, we provide a pictorial example of the PSJ algorithm at work. Consider a 2-dimensional join query that will be executed across two processor groups, $G_1$ and $G_2$, consisting of 2 processors each, as shown in Figure 5 below. Essentially, this figure shows the various stages that occur in PSJ and the associated operations and the time instants when each operation starts and ends. Note that we assume all processors start execution at the same time (time $t_0$ in the figure). Note further that we only consider those operations which result in data blocks arriving at a processor (either from disk or from other processors) or leaving a processor.

[Figure 5: Response Time Computation Example for the PSJ Strategy. Two processor groups, G1 and G2, each with two processors, proceed through the PSJ stages, with operations starting and ending at time instants $t_0$ through $t_{19}$. Legend: LPF = Load PI Fragments; MDRF = Merge Dimension Rowset Fragments; DDR = Distribute Dimension Rowset; LJF = Load JI Fragments; MPFR = Merge Partial Fact Rowsets; MLFR = Merge Local Fact Rowsets; DGRG = Distribute Global Rowset to Groups; DGRP = Distribute Global Rowset to Processors; LP = Load PIs; MOF = Merge Output Fragments; MLO = Merge Local Output.]

Let us examine this process in detail by considering a specific processor in the figure, say $P_3$. $P_3$ first loads its PI fragment (denoted by $LPF_3$); this activity ends at time $t_1$. Then it performs some CPU activity, as described before in the algorithm, to produce a dimensional rowset fragment based on the PI fragment fetched in the LPF step. This CPU activity does not show up in the figure, for reasons already explained. The next cost phase consists of the merging of the dimensional rowset fragments (MDRF) and subsequently distributing the full dimensional rowset (DDR) to all processors in $G_2$. This ends at $t_5$. Upon receipt of the full dimensional rowset, $P_3$ loads the JI fragment allocated to it ($LJF_3$). This step, finishing at $t_7$, is used to produce a partial fact table rowset (cost ignored, as this is CPU activity), which is then merged with the other partial fact table rowset produced by the other member of $G_2$, namely $P_4$. This shows up in the figure as the transmission cost MPFR. Note that until this time, $P_3$ was continuously busy, either fetching data from disk or receiving/sending transmissions. At this point, though, according to the scheme of Figure 5, it must wait until the other group, namely $G_1$, finishes producing its local fact rowset. This occurs at time $t_{10}$, which signals the end of the LRG phase.

Next comes the Global Rowset Synthesis phase, which requires the merging of the different local fact table rowsets (MLFR) into a global rowset and the subsequent distribution of this global rowset (DGRG) to the different groups. The DGRG phase ends at $t_{12}$, and indicates the end of the GRS phase. The Output Preparation (OP) phase starts now, where $P_3$ goes through the specific steps outlined in the algorithm. This consists of loading the dimensional PI ($LP_3$) and producing an output fragment, which is then merged with the output fragments produced by other group members, producing a complete output column. The only non-CPU cost for this step is a transmission cost for merging the output fragments (MOF). Finally, all the individual output columns produced by the different participating groups are racked together to produce the final output, which requires a transmission step (MLO). The query finishes at $t_{19}$, which is the response time for the query.

We simulate this exact process in the performance experiments reported later (for varying numbers of dimensions and processors, of course). In order to extract the total I/O cost of the query, we need to compute the costs for the various steps outlined (e.g., $LPF_i$, $LJF_i$, etc.). We have developed cost models for these steps, which are detailed in Section 6.

4 The Parallel BJI Join

We now consider a warehouse environment utilizing Bitmapped Join Indexes (BJIs). Join processing with BJIs is described in [22], and an overview is provided in [7]. Due to space limitations, we must refer the reader to this paper for a description of BJI join processing. Of interest in this area, however, is the memory requirement for BJI join processing. Here, all relevant columns and rows from the dimension tables (including the primary key column) are extracted from the dimension tables and pinned in memory. In [6], we have shown that the memory requirement for BJI, in blocks, is

$M_{JOIN}(BJI) = 1 + |\mathcal{D}| + \sum_{D \in \mathcal{D}} \left\lceil \frac{|D| \cdot w(D)}{B} \right\rceil$

where $\mathcal{D}$ is the set of dimension tables participating in the join, $w(D)$ refers to the width of a tuple in a dimension table, and B is the size of a disk block. The first term in this expression corresponds to a block of memory for the fact table, the second term corresponds to a block of memory for each dimension table, and the third term corresponds to the memory required for pinning the relevant dimension tables in memory. This requirement can be substantial relative to the memory available in a typical warehouse, leading to the potential for losses in join efficiency (see [7] for details).
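A small Python rendering of this memory bound, with our own function and parameter names:

    import math

    def bji_join_memory_blocks(dims, block_bytes):
        # dims: list of (row_count, tuple_width_bytes) for each dimension
        # participating in the join.
        # M_JOIN(BJI) = 1 + |D| + sum over D of ceil(|D| * w(D) / B):
        # one block for the fact table, one block per dimension table,
        # plus the pinned dimension tables themselves.
        pinned = sum(math.ceil(rows * width / block_bytes) for rows, width in dims)
        return 1 + len(dims) + pinned

    # Example: TIME (2,557 rows, 28 bytes/tuple) and CUSTOMER (150,000 rows,
    # 117 bytes/tuple) from Figure 1, with 8 KB disk blocks:
    print(bji_join_memory_blocks([(2557, 28), (150000, 117)], 8192))  # -> 2155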

In terms of processor grouping, the scheme is exactly the same as the PSJ scheme. This results in d+1 processor groups, one for each dimension, plus one for the fact table. In terms of data placement, the following are loaded on each disk in processor group $G_j$. (1) For dimension $D_j$: $\frac{1}{N_j}$-th of $D_j$ (a horizontal partition), a BI on the resident fragment for each non-primary-key attribute in $D_j$, the dimension-fragment-to-fact-table BJI, and a B+Tree index for the primary key attribute of $D_j$. (2) For the fact table: the entire fact table (needed for the OP phase) and a BI for each attribute in the fact table.

The Parallel BJI Join algorithm has the same general structure as the PSJ algorithm; only the Local Rowset Generation phase, and the Generate Partial Output Fragments step of the Output Preparation phase, differ. In the Local Rowset Generation phase, to generate the restriction rowset fragments, the BJI algorithm uses BIs to perform restrictions, rather than PIs, as in the PSJ algorithm. For each (dimension or fact) table on which there is a restriction predicate, the following is done on each processor of each processor group. For each restriction predicate, the BI for the attribute referenced in the predicate is traversed, and a partial restriction rowset constructed. For each distinct attribute value meeting the restriction, the bitmap pointed to by the leaf node that represents that attribute value is loaded. All loaded bitmaps are bitwise ORed. The result is a bitmap of size $|D_j|$ (for dimensions) or $|F|$ (for the fact table), where each set bit represents an attribute value that meets the restriction for attribute values in the resident fragment. Subsequently, $D_j$ is joined to the fact table as follows. For each bit set in the dimension restriction rowset of $D_j$, the corresponding value in the BJI is found, and the bitmap pointed to by that leaf node is loaded into memory. All loaded bitmaps are merged into a single partial fact rowset using bitwise OR. This results in a bitmap of size $|F|$, where each set bit represents a tuple in the fact table that meets the restriction predicates on attributes in $D_j$ for the resident fragment.

Within the Output Preparation phase, the only step that is different for the BJI approach is the Generate Partial Output Fragments step. Recall that each bit set in the global rowset translates to a row of output. Output for participating dimensions is produced using the projected attribute values from $D_j$. The blocks containing the fact table tuples corresponding to the set bits in the global rowset are loaded⁴ and the needed dimensional key values are projected from the loaded tuples. For each key value, the appropriate disk block is accessed next using the index. Here, we assume a B+tree index, since it performs better for dense indexes. Once this rowset is produced, processing proceeds in the same fashion as in PSJ.

⁴ To obtain a simpler cost model, but one that affords the BJI join a significant advantage when we evaluate its performance, we make a simplifying assumption for the dimension output generation phase: we assume that the bits set in the fact table rowset are clustered, i.e., the tuples are co-located in sequential disk blocks.

Effectively, using the approach outlined above, a one-dimensional join is performed at each processor. In this context, the BJI memory requirement equation above may be restated as follows: at a processor housing dimension i ($D_i$), the amount of memory required to perform the BJI-based join as outlined previously is given by

$M_{JOIN}(BJI) = 1 + 1 + \left\lceil \frac{|D_i| \cdot w(D_i)}{B} \right\rceil$.

5 The Parallel Hash Join

Here, we describe a parallel hash join strategy (based on work in [4, 5]), a well-known join strategy based on the conventional relational data model using segmented right-deep trees and pipelining, and apply the technique in a data warehousing environment. In the remainder of the paper, we will refer to this algorithm as the HASH algorithm.

The HASH algorithm provides a means of performing multi-way hash joins, using pipelining techniques to improve performance. This technique assumes a shared-disk architecture, whereas we assume a shared-nothing architecture for the other algorithms in this paper. Rather than embarking on a discussion of optimal data placement policies for the HASH algorithm (the authors of [4] note the difficulty of this problem; it is well beyond the scope of this paper), we retain the shared-disk assumption for the HASH algorithm in our discussions and performance comparisons. Note that this confers a significant advantage to the HASH algorithm, by removing the need for additional data transfer phases. It also removes the need for an a priori data placement and processor allocation phase; since each processor can access data on any disk,

any processor can be assigned to any stage of the query plan. We note that we assume an ideal hash function for each stage, such that the load is distributed evenly across all processors allocated to a stage. Due to space considerations, we refer the reader to [4] for the full details of the HASH algorithm (an overview is also provided in [7]), and move on to discuss the cost models for each algorithm.

6 Cost Model

In this section, we develop cost models for the PSJ, BJI, and HASH algorithms. Throughout this discussion, multiplication and division by the constant value 8 denote conversion from bits (the unit of transmission measurement) to bytes (the unit of disk access measurement) and from bytes to bits, respectively. The notation used in this paper is summarized in Table 1.

COST MODEL FOR PSJ

(1) Local Rowset Generation (LRG)

(1a) Load PI Fragments (LPF): Disk access cost to load $\frac{1}{N_i}$-th of the PIs needed for restriction in a single processor group (in blocks) is $\frac{|D_i|}{N_i} \cdot \frac{\sum W_{a_{ij}}}{B}$, where $a_{ij} \in A_R$. Disk access cost to load $\frac{1}{N_F}$-th of the PIs needed for restriction in the metric data group (in blocks) is $\frac{|F|}{N_F} \cdot \frac{\sum W_{a_{Fj}}}{B}$, where $a_{Fj} \in A_R$.

(1b) Merge Dimension Rowset Fragments (MDRF): Transmission cost to merge dimension restriction rowset fragments into a single dimension restriction rowset in a single processor group is $\log_2 N_i \cdot |D_i|$. Transmission cost to merge metric restriction fragments into a single metric group restriction rowset is $\log_2 N_F \cdot |F|$.

(1c) Distribute Dimension Rowset (DDR): Transmission cost to distribute the dimension rowset to all members of a processor group is $\log_2 N_i \cdot |D_i|$. Transmission cost to distribute the metric restriction rowset to all members of a processor group is $\log_2 N_F \cdot |F|$.

(1d) Load JI Fragments (LJF): Disk access cost to load $\frac{1}{N_i}$-th of the JI in a single processor group (in blocks) is $\frac{|F|}{N_i} \cdot \frac{W_{JI}}{B}$.

(1e) Merge Partial Fact Rowsets (MPFR): Transmission cost to merge partial fact rowsets in a single processor group is $\log_2 N_i \cdot |F|$.

(2) Global Rowset Synthesis (GRS)

(2a) Merge Local Fact Rowsets (MLFR): Transmission cost to merge all local fact rowsets is $\log_2 D_J \cdot |F|$.

(2b) Distribute Global Rowset to Groups (DGRG): Transmission cost to distribute the global rowset to all groups participating in the output phase is $\log_2 D_P \cdot |F|$.

(3) Output Preparation (OP)

(3a) Distribute Global Rowset to Processors (DGRP): Transmission cost to distribute the global rowset to all processors in a group (in bits) is $\log_2 N_i \cdot |F|$. Transmission cost to distribute the global rowset to all processors in the metric group (in bits) is $\log_2 N_F \cdot |F|$.

(3b) Load PIs (LP): Disk access cost to load all PIs involved in output for a single dimension. Let $\mathcal{B}$ represent the PI cost $\left\lceil \frac{|D_i| \sum W_{a_{ij}}}{B} \right\rceil$, where $a_{ij} \in A_P$; $\mathcal{J}$ represent the JI cost $\frac{|F|}{N_F} \cdot \frac{\sum W_{JI}}{B}$; $\mathcal{R}$ represent the rowset cost $\frac{|F|}{N_F \cdot 8} \cdot \frac{\sum W_{JI}}{B}$; and $\mathcal{OU}$ represent the output cost $\left\lceil \frac{|O| \sum W_{a_{ij}}}{B} \right\rceil$, where $a_{ij} \in A_P$. If either $\mathcal{B} < \frac{S_m}{N_i}$ or $\mathcal{J} < \frac{S_m}{N_i}$, i.e., if either the PI or the JI fragment fits in memory, then the cost is $\mathcal{B} + \mathcal{J} + \mathcal{R} + \mathcal{OU}$. Otherwise, either the JI or the PI must be loaded multiple times to perform the join. If $\mathcal{B} > \mathcal{J}$, the cost is $(\mathcal{B} \cdot \mathcal{J}) + \mathcal{J} + \mathcal{R} + \mathcal{OU}$. Otherwise, when $\mathcal{B} < \mathcal{J}$, the cost is $\mathcal{B} + (\mathcal{J} \cdot \mathcal{B}) + \mathcal{R} + \mathcal{OU}$. Disk access cost to load all PIs involved in output for the metric table (in blocks) is $\frac{|F|}{N_F} \cdot \frac{\sum W_{a_{Fj}}}{B} + \frac{|F|}{N_F \cdot 8 \cdot B}$, where $a_{Fj} \in A_P$.

(3c) Merge Output Fragments (MOF): Let O represent the output relation, and |O| be its cardinality. Transmission cost to merge output from single processors into a local group output on a single processor group is $\frac{|O|}{N_i} \sum (W_{a_{ij}} \cdot 8)$, where $a_{ij} \in A_P$. Transmission cost to merge output within the metric processor group is $\frac{|O|}{N_F} \sum (W_{a_{Fj}} \cdot 8)$, where $a_{Fj} \in A_P$.

(3d) Merge Local Output (MLO): Transmission cost to merge output from local groups into a single final output. Like the costs for merging the outputs within a processor group, the costs here assume serial transmission to a single target: $|O| \left( \sum (W_{a_{ij}} \cdot 8) + \sum (W_{a_{Fj}} \cdot 8) \right)$, where $a_{ij}, a_{Fj} \in A_P$.
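As a sanity check on these formulas, the sketch below codes a few of the PSJ cost terms in Python; the ceiling placements and function names are our own reading of the formulas above, not the authors' implementation.

    import math

    def lpf_blocks(dim_rows, n_procs, restr_attr_widths, block_bytes):
        # (1a) Load PI Fragments: (|D_i| / N_i) * (sum of W_a / B) blocks.
        return math.ceil(dim_rows / n_procs * sum(restr_attr_widths) / block_bytes)

    def tree_merge_bits(n_procs, cardinality_bits):
        # (1b)/(1e)/(2a)-style binary-tree merges: log2(N) * cardinality bits.
        return math.ceil(math.log2(n_procs)) * cardinality_bits

    def ljf_blocks(fact_rows, n_procs, ji_width_bytes, block_bytes):
        # (1d) Load JI Fragments: (|F| / N_i) * (W_JI / B) blocks.
        return math.ceil(fact_rows / n_procs * ji_width_bytes / block_bytes)

    # TPC-D-scale example: 6M-row fact table, 4-byte JI entries, 8 KB blocks,
    # a 4-processor dimension group restricting on one 4-byte attribute of TIME.
    print(lpf_blocks(2557, 4, [4], 8192))     # PI fragment blocks per processor
    print(tree_merge_bits(4, 6_000_000))      # bits moved in a fact-rowset merge
    print(ljf_blocks(6_000_000, 4, 4, 8192))  # JI fragment blocks per processor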

COST MODEL FOR BJI

(1) Local Rowset Generation (LRG)

(1a) Load BI Fragments (LBF): Disk access cost to load the index and bitmaps for $\frac{1}{N_i}$-th of a dimension needed for processing a single restriction predicate in a single processor group (in blocks) is $\lceil \log_{P_i} V_i - 1 \rceil + \frac{V_i}{K_i N_i} + f \cdot \frac{\sigma |V_i|}{8 B N_i}$. Disk access cost to load the index and bitmaps for $\frac{1}{N_F}$-th of the fact table needed for processing a single restriction predicate in a single processor group (in blocks) is $\lceil \log_{P_F} V_F - 1 \rceil + \frac{V_F}{K_F N_F} + f \cdot \frac{\sigma |F|}{8 B N_F}$. Here, the costs for index access are taken from [6].

(1b) Merge Dimension Rowset Fragments (MDRF): Transmission cost to merge dimension restriction rowset fragments into a single dimension restriction rowset in a single processor group is $\log_2 N_i \cdot |D_i|$. Transmission cost to merge metric restriction fragments into a single metric group restriction rowset is $\log_2 N_F \cdot |F|$.

(1c) Distribute Dimension Rowset (DDR): Transmission cost to distribute the dimension rowset to all members of a processor group is $\log_2 N_i \cdot |D_i|$. Transmission cost to distribute the metric restriction rowset to all members of a processor group is $\log_2 N_F \cdot |F|$.

(1d) Generate Partial Fact Rowsets (GPFR): Disk access cost to traverse the BJI and load the appropriate RIDs for $\frac{1}{N_i}$-th of a dimension in a single processor group (in blocks) is $\frac{\sigma |D_i|}{N_i} \lceil \log_{P_i} V_i - 1 \rceil + \frac{R \cdot \sigma |D_i|}{N_i}$. Here, the costs for index access are taken from [6].

(1e) Merge Partial Fact Rowsets (MPFR): Transmission cost to merge partial fact rowsets in a single processor group is $\log_2 N_i \cdot |F|$.

(2) Global Rowset Synthesis (GRS)

(2a) Merge Local Fact Rowsets (MLFR): Transmission cost to merge all local fact rowsets is $\log_2 D_J \cdot |F|$.

(2b) Distribute Global Rowset to Groups (DGRG): Transmission cost to distribute the global rowset to all groups participating in the output phase is $\log_2 D_P \cdot |F|$.

(3) Output Preparation (OP)

(3a) Distribute Global Rowset to Processors (DGRP): Transmission cost to distribute the global rowset to all processors in a group (in bits) is $\log_2 N_i \cdot |F|$. Transmission cost to distribute the global rowset to all processors in the metric group (in bits) is $\log_2 N_F \cdot |F|$.

(3b) Generate Partial Output Fragments (GPOF): Let O represent the output relation, and |O| its cardinality. Disk access cost to produce dimension output fragments (loading the qualifying fact table blocks, projecting the dimension keys, traversing the B+tree for each key value, and loading the blocks containing the dimension tuples) is $\min\!\left(|O|, \frac{|F|}{T_{B_F}}\right) + |O| \left\lceil \log_{P_i} \frac{V_i}{N_i} - 1 \right\rceil + \min\!\left(\frac{|O|}{N_i}, \frac{|D_i|}{T_{B_i} N_i}\right)$. Disk access cost to load metric data for metric output fragments is $\min\!\left(\frac{|O|}{N_F}, \frac{|F|}{T_{B_F} N_F}\right)$. Here, the costs for index access are taken from [6].

(3c) Merge Output Fragments (MOF): Transmission cost to merge output from single processors into a local group output on a single processor group is $\frac{|O|}{N_i} \sum (W_{a_{ij}} \cdot 8)$, where $a_{ij} \in A_P$. Transmission cost to merge output within the metric processor group is $\frac{|O|}{N_F} \sum (W_{a_{Fj}} \cdot 8)$, where $a_{Fj} \in A_P$.

(3d) Merge Local Output (MLO): Transmission cost to merge output from local groups into a single final output. Like the costs for merging the outputs within a processor group, the costs here assume serial transmission to a single target: $|O| \left( \sum (W_{a_{ij}} \cdot 8) + \sum (W_{a_{Fj}} \cdot 8) \right)$, where $a_{ij}, a_{Fj} \in A_P$.

COST MODEL FOR HASH⁵

(1) Table Building

(1a) Load dimension data for segment (LDD): Disk access cost to load dimension data for a segment is the maximum across all dimensions (where data is loaded in parallel), given the processor allocation for the segment, and assuming a B+ tree index on restriction attributes: $\max_i \left( \lceil \log_{N_i} V_i - 1 \rceil + \left\lceil \frac{|D_i| \sum W_{a_i}}{N_i B} \right\rceil \right)$, where $D_i$ is in the segment.

(1b) Transmit dimensional data (TDD): Transmit dimensional data to the appropriate processor, according to the partitioning function. This assumes a uniform distribution of data across disks⁶, where $\frac{1}{N_i}$-th of a group's data is loaded on the appropriate processor in the LDD phase: $\sum_{j=1}^{N_i} \left\lceil \sigma |D_i| \left( \sum W_{a_{RP}} \cdot 8 \right) - \frac{\sigma |D_i| \left( \sum W_{a_{RP}} \cdot 8 \right)}{N_i} \right\rceil$, where $a_{RP} \in A_P$ or $a_{RP} \in A_R$.

(2) Tuple Probing

(2a) Load probing table (LPT): Let Q be the probing table, i.e., the fact table for the first

⁵ We begin our discussion of the costs with the table building phase, and ignore the costs of the preliminary phase, where the segments and processor allocation are determined. Since the HASH algorithm repeats over segments, we model the cost of executing a segment in phases (1) and (2), while phase (3) considers the cost of generating the final output.

⁶ This represents a best-case scenario, where data loading in parallel minimizes I/O costs.
