• No results found

We will take some sample data and will consider some of the many ways it could be represented in storage (At the level of Stored Record Interface).

N/A
N/A
Protected

Academic year: 2023

Share "We will take some sample data and will consider some of the many ways it could be represented in storage (At the level of Stored Record Interface)."

Copied!
36
0
0

Loading.... (view fulltext now)

Full text

(1)

UNIT – II

Storage Structures & File Organizations

The purpose of this Unit is to provide an introduction to the way in which data may be organized in secondary storage.

By secondary storage we mean directly accessible media such as disks, drums etc.

Sequential Media such as tape are not being considered here, since such media imposes a lot of restrictions to be useful in a database system. For example with a tape it is generally not possible to perform “Update in Place” i.e. to update a particular record all the previous records need to be accessed.

(2)

• The Storage considered here are general in nature and do not relate to specific systems.

• Clearly, this Unit is concerned with the internal level of the system and to some extent what lies below that level.

• Remember that Internal Level is still slightly above the level of Physical Storage.

• As we already know that user operations are expressed (via DML) in terms of external records, and must be converted by the DBMS into corresponding operations on Internal or Stored Records.

(3)

• These latter operations must be converted in turn to operations at the actual hardware level, i.e. to the operations on physical records or blocks.

• The component responsible for this internal/

physical conversion is called an Access Method (Fig. 1)

• The access method consists of a set of routines whose function is to hide all device dependent details from the DBMS and to present the DBMS with a Stored Record Interface.

(4)
(5)

• The Stored Record Interface thus corresponds to the internal level, just as the user interface corresponds to the external level.

• The Physical Record Interface corresponds to the actual hardware level.

• Thus the stored record interface allows the DBMS to view the storage structure as a collection of stored files, each one consisting of all occurrences of one type of stored record.

• Specifically the DBMS knows the following:

What stored files exist, and, for each one,

(6)

The structure of the corresponding stored record.

The stored filed(s), if any, on which it is sequenced.

The stored filed(s), if any, that can be used as search arguments for direct access.

• The above information will be specified as part of the storage structure definition. The last two points clearly indicate that the DBMS knows what access statement it can issue against a stored file.

• It is also clear that the unit that crosses the stored record interface is one stored record occurrence.

(7)

• Also, the DBMS doesn’t know the following:

Anything about Physical Records (Blocks).

How stored fields are associated to form stored records.

How sequencing is performed (It may be by means of physical contiguity, an index or a pointer chain)

How direct access is performed?

• This information is specified to the access method, not to the DBMS.

• As an illustration consider that the storage structure consists of a stored file STUDENT_FILE.

(8)

• Each stored record occurrence consists of an Enrollment Number (ENo), Student Name (SNAME), Student Height (SHEIGHT) and Student Weight (SWEIGHT).

• Part of the storage structure definition then go as follows:

STORED FILE STUDENT_FILE

STUDENT LENGTH=……., SEQUENCE= ASCENDING(Eno) Eno TYPE=……, INDEX Eno_INDEX

SNAME TYPE=……

SHEIGHT TYPE=……

(9)

• The above definition is intended to convey the following information:

The Stored file of STUDENTS exists.

The corresponding stored record has a particular structure.

The stored record occurrences are sequenced on ascending Eno values.

The Eno. Stored filed is indexed, so that direct access may be performed by supplying a value for this field.

• When a stored record occurrence is first created and entered into the database, the access method is responsible for assigning it a unique Stored Record Address.

(10)

• This value distinguishes that stored record occurrence from all other in the database.

• It may be simply the physical address of the occurrence within the storage volume or may be some other thing.

• The SRA for a particular record is returned to the DBMS for subsequent direct access to the occurrence concerned.

• An assumption we make here is that the SRA for a given occurrence does not change until the occurrence is physically moved as part of the database is reorganized.

(11)

• The SRA concept permits the DBMS to build its own access mechanism (indexes, pointer chains etc.) over and above those maintained by the access method.

• This is illustrated by many of the following examples.

(12)

Possible Representations for Some Sample Data:

We will take some sample data and will consider some of the many ways it could be represented in storage (At the level of Stored Record Interface).

Consider the SUPPLIER file.

We assume that each supplier has a unique Supplier Number, and also each Supplier has one Name, one Status value, and one location.

We also assume that each stored file is

sequenced by the access method on its primary

key.

(13)

• The first (and simplest) representation consists of a single stored file containing five stored

record occurrences one for each supplier.

• The representation is advantageous in terms of simplicity but inadequate in practice

S# SNAME STATUS CITY

S1 Smith 20 London

S2 Jones 20 Paris

S3 Blake 30 Paris

S4 Clark 20 London

S5 Adams 30 Athens

(14)

For example say we had 10,000 suppliers instead of just five, but they were located in only ten different cities, then the representation would lead to wastage of valuable storage space.

If we assume that the amount of storage required for a pointer is less than that required for a city name, the representation illustrated in following figure will clearly save some storage space in such situation.

S# SNAME STATUS CITY Athens

S1 Smith 20

S2 Jones 10 London

S3 Blake 30

S4 Clark 20 Paris

Supplier

File City

File

(15)

Here we have two stored files – a Supplier file and a City file, with pointers out from the former into the later. These pointers are SRAs.

The only advantage of this representation over the previous one is the saving in space.

In this case a query to find all the properties of a given Supplier will take at least one more access than before (First Pointer is read and then the corresponding City value).

A query to find all suppliers in a given City will

require several more accesses. (How???)

(16)

• Note that it is the DBMS not the access method that maintains the pointers.

• The Access Method is not aware of the connection between the two files.

• The Connections must be stated in the definition of the mapping from the conceptual schema to storage.

An Important Note

(17)

If the query “Find all suppliers in a given city” is an important one then the DBA may choose the following representation:

CITY SUPPLIER

pointers S# SNAME STATUS

Athens S1 Smith 20

London S2 Jones 10

S3 Blake 30

Paris S4 Clark 20

S5 Adams 30

CITY File SUPPLIER File

(18)

• In this case we again have two files – A Supplier file and a City file, but this time pointers out of the later into former.

• It is obvious that this is better for queries asking for all suppliers in a given city but worst for queries asking for all attributes of a given supplier.

• The City file in the above figure servers as an index to the Supplier file. (Or the Supplier file is Indexed by the City file).

• Clearly, the purpose of an Index file is to provide an access path to the file it is indexing i.e. a way of getting records in the indexed file.

(19)

A given file may have many associated access paths.

So, a file have many index files.

An index file is a file in which each record consists of a data value together with one or more pointers. The data value is a value for some field of the indexed file (the indexed field) and the pointers identify records in the indexed file having that value for that field.

In general then the advantage of indexing is that it usually speeds up retrieval. The disadvantage is that it may slow down update. For example, suppose supplier S1 moves from Paris to London and consider what must be done to support this change in the above figure compared with what must be done in the figure previous to the above figure.

(20)

The City file shown in the above figure is a DENSE, SECONDAY index.

Dense means that it contains an entry for every stored record occurrence in the indexed file. It means that the indexed file need not contain the indexed field. In the example the Supplier file no longer includes a city field.

Secondary means that it is an index on a field other than the primary key.

Following figure shows a representation of data that combines the previous two representations and thus possesses the advantages of each. Of course it also has the Update disadvantage discussed earlier

(21)
(22)

• Another disadvantage of secondary indexes in general is that each stored record occurrence in the index must contain an unpredictable number of pointers. This fact complicates the job of DBMS in applying changes to the database. An alternate to the above representation, that avoids this problem is shown in the figure on the next page.

• In this representation each stored record occurrence (Supplier or City) contains just one pointer.

• Each city points to the first supplier in that city.

(23)
(24)

• That supplier then points to the second supplier in the same city, who points to the third, and so on up to the last, who points back to the city.

Thus for each city we have a chain of all suppliers located in that city.

• The advantage of this representation is that it is easier to apply the changes. The disadvantage is that, for a given city the only way to access the nth Supplier is to follow the chain and access the 1st, 2nd,… (n-1)th suppliers, too. If the list of Suppliers in a city is long the time taken to access the nth supplier may be considerable.

(25)

• The representation shown in the above figure is an example of multilist organization. In this figure we have chained together all suppliers in the same city. In other words for each city we had a list of suppliers located in that city.

• In exactly the same way (by means of additional pointers) we could also have a list of suppliers for each distinct status value. (Sketch this representation for this sample data).

• In general a multilist organization can clearly contain any number of such lists.

(26)

• Just as it is possible to provide any number of lists in the multilist organization, it is also possible to provide any number of secondary indexes in an indexing organization.

• In the extreme case we have the situation illustrated in the figure on the next page, i.e. an index on every secondary field.

• This is known as inverted organization.

• Inverted organization will perform well for a request for all suppliers with a given property (say city or status or..), a request for all properties of a given supplier will take a long

(27)
(28)

Which Storage Structure We Should Use?

In the above discussion we represented the sample data into number of ways and discussed the pros and cons of each storage representation.

Which representation we should use in practice is still a big question.

Generally in practice we frequently compromise by providing a regular organization (as shown in the first figure) together with secondary indexes on certain selected fields. Note that though this involves redundant storage of the indexed field values but is one of the most common storage structures in practice.

(29)

HASHING

• Hash addressing or simply hashing is another example of an access path.

• The basic idea of hash addressing is that each stored record occurrence is placed in the database at a location whose address (SRA) may be computed as some function (Hash Function) of a value that appears in that occurrence – usually the primary key value.

• Thus, to store a record occurrence initially, the DBMS computes the SRA and instructs the access method to place the occurrence at that position.

(30)

To retrieve the occurrence subsequently, the DBMS performs the same calculation as before and then requests the access method to fetch the occurrence at the computed position. The advantage of this organization is that it provides very fast direct access on the basis of values of the hashed field.

As an example of hash addressing consider that the S# values are S100, S200, S300, S400 and S500 (instead of S1, S2, S3, S4 and S5) and let us assume that the hash function is defined as:

(31)

The SRAs for the five suppliers are then 9,5,1,10 and 6 respectively, giving us the representation shown in the figure on the next page.

It would be theoretically possible to use the numeric part of primary key value for any given occurrence directly as SRA for that occurrence. However this is usually inadequate in practice, because the range of primary key values is generally much wider than the range of available SRAs. For example, suppose that supplier numbers are in fact three digits wide, as specified above. This allows a maximum of 1000 different suppliers, whereas in practice there may be a maximum of only 10. So, to avoide a colossal waste of storage space we ideally require a hash function that will reduce any value in the range 0-999 to one in the

(32)
(33)

• One of the obvious disadvantages of hash addressing is that the sequence of stored record occurrences within the stored file will certainly not be the primary key sequence. In fact a stored file in hash addressing organization is usually considered to have no particular sequence.

• Another disadvantage of hash addressing is the possibility of collisions – that is two distinct stored record occurrences whose keys hash to the same SRA.

• For example consider that the sample data also includes a supplier with S# value of S1400.

(34)

• This supplier would collide (at SRA 9) with supplier S100, given the same function as before.

• The implication is that the hash function has to be made more complicated to handle this sort of situation.

(35)

Conclusions

Our survey of some storage structures permitted by the Stored Record Interface is now complete.

In conclusion we stress on the fact that there is no such thing as “best storage structure”.

What is best depends what is important to the enterprise.

It is the responsibility of the DBA to balance a large number of conflicting requirements in choosing a storage structure.

The considerations that must be taken into account include – retrieval performance, the difficulty of applying changes, the amount of available storage space, the ease with which the database may be

(36)

ASSIGNMENT

 What do you understand by Dense Indexing? Explain various indexing techniques.

 What are B-Trees and B+ Trees.

Explain differences among them and

show how B+ tree is better compared

to B-Tree.

References

Related documents

SaLt MaRSheS The latest data indicates salt marshes may be unable to keep pace with sea-level rise and drown, transforming the coastal landscape and depriv- ing us of a

Although a refined source apportionment study is needed to quantify the contribution of each source to the pollution level, road transport stands out as a key source of PM 2.5

These gains in crop production are unprecedented which is why 5 million small farmers in India in 2008 elected to plant 7.6 million hectares of Bt cotton which

INDEPENDENT MONITORING BOARD | RECOMMENDED ACTION.. Rationale: Repeatedly, in field surveys, from front-line polio workers, and in meeting after meeting, it has become clear that

3 Collective bargaining is defined in the ILO’s Collective Bargaining Convention, 1981 (No. 154), as “all negotiations which take place between an employer, a group of employers

With respect to other government schemes, only 3.7 per cent of waste workers said that they were enrolled in ICDS, out of which 50 per cent could access it after lockdown, 11 per

Of those who have used the internet to access information and advice about health, the most trustworthy sources are considered to be the NHS website (81 per cent), charity

Harmonization of requirements of national legislation on international road transport, including requirements for vehicles and road infrastructure ..... Promoting the implementation