Homework 10: File Structures, Indexing, and Query Optimization

This homework will develop your understanding of database indexes and query optimization.

Instructions

You may work either individually or with a partner.

Problem 1: Hash Index

A PARTS file with Part# as the hash key includes records with the following Part# values: 2369, 3760, 4692, 4871, 5659, 1821, 1074, 7115, 1620, 2428, 3943, 4750, 6975, 4981, 9208. The file uses eight buckets, numbered 0 to 7. Each bucket is one disk block and holds two records. Load these records into the file in the given order. Use the hash function h(K) = K mod 8. Use chaining for collision resolution.

Draw the final contents of each bucket, using the notation shown in Figure 16.10
- Note: Your number of buckets and bucket size for this problem should differ from Figure 16.10
Calculate the average number of block accesses for a random retrieval on Part#. Round to one decimal place.
- Note: You can assume that the table mapping bucket numbers to disk block addresses is already stored in memory (Figure 16.9). Thus, once you know the bucket number, retrieving the corresponding main bucket requires a single disk access.

Problem 2: B+ Index

A PARTS file with Part# as the key field includes records with the following Part# values: 37, 60, 46, 48, 71, 59, 18, 10, 74, 15, 16, 20, 24, 43, 47, 50, 69, 8, 49, 33. Suppose that the search field values are inserted in the given order in a \(B^+\)-tree of order \(p = 4\) (each internal node can hold up to 4 search field values) and \(p_\text{leaf} = 3\) (each leaf node can hold up to 3 search field values).

Show what the final tree will look like, using the notation below
Assume each tree and leaf node is stored on a different block. How many block accesses are needed to:
- Locate the data pointer corresponding to the value 60?
- Locate the data pointers corresponding to values 20 to 50?

You may also find it helpful to refer to Figure 17.12 in the textbook.

Problem 3: Query Optimization

Consider the following query on the depicted schema:

SELECT instructor, student_id
FROM enrolled JOIN course ON course_num = number
JOIN teaches ON c_number = number

The student, enrolled, and course tables.

This query could be evaluated using the execution plan:

\[\pi_\text{instructor, student_id} \big( \sigma_\text{course_num = number AND c_number = number} ( \text{ENROLLED} \times \text{COURSE} \times \text{TEACHES} ) \big)\]

The database has the following table and column statistics:

Table	\(n_r\)	\(l_r\)
student	5,000	58 bytes
enrolled	20,000	16 bytes
course	500	98 bytes
teaches	700	58 bytes

V(A, r)
V(id, student) = 5,000
V(student_id, enrolled) = 4,500
V(course_num, enrolled) = 480
V(number, course) = 500
V(c_number, teaches) = 500

Calculate the number of rows generated by the inefficient plan’s cross product
Calculate the size (in bytes) of the intermediate table generated by the inefficient plan’s cross product
- Hint: Multiply the number of rows by the size of each row
A more efficient plan would use natural joins instead of a cross product. Based on the statistics, which tables should be joined first, to have fewest number of rows in the intermediate table?
For the most efficient plan, estimate the number of rows in the intermediate table

Submit

Submit your responses on Gradescope.