Lab: Query Processing and Optimization

In this lab, you will practice using PostgreSQL’s tools for query optimization.

Instructions

You may work either individually or with a partner.

Prepare by reading this overview of the PostgreSQL Planner/Optimizer.

Note: Windows users: if you haven’t already, complete Step 2 from my guide to Configuring Windows for Docker Development.

Step 1: Add Fake Users to MiniFacebook

For this lab, we will use the Django MiniFacebook project.

First, follow the documentation in the repo’s README to run Django MiniFacebook on your device.

Next, use the add_fb_bots.py program to add fake data to the MiniFacebook application:

docker compose exec django python add_fb_bots.py --help
docker compose exec django python add_fb_bots.py

The command will run for two minutes. You can use the docker stats command to monitor its progress.

Note: The faster your computer, the more data will be generated within the time limit.

Step 2: Enable Performance Extensions

By default, PostgreSQL doesn’t make all performance statistics visible to users. To enable this functionality, edit your docker-compose.yml file so it contains:

  postgres:
    image: postgres:16.1
    command: ["postgres", "-c", "shared_preload_libraries=pg_stat_statements"]

Next, rerun docker compose up -d, so these changes to the compose file will take effect.

Finally, enter psql in the PostgreSQL container, and run the following CREATE EXTENSION commands:

docker compose exec postgres bash
psql --username="$POSTGRES_USER" --dbname="$POSTGRES_DB"
CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION pg_buffercache;
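
To confirm that the shared_preload_libraries setting from your compose file took effect, you can also run the following at the psql prompt; its output should include pg_stat_statements:

SHOW shared_preload_libraries;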

Step 3: Review Database Schema

Run the following commands at the psql prompt to view the schema of the MiniFacebook application:

\d minifacebook_profile
\d minifacebook_status
\d minifacebook_poke

Run the following commands to view the size of the entire django_data database, and the size of the MiniFacebook application’s tables, respectively.

SELECT pg_size_pretty( pg_database_size('django_data') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_profile') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_status') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_poke') );
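
Note: pg_total_relation_size includes a table's indexes and TOAST storage. If you later want to see how much of a table's footprint comes from indexes alone, the pg_indexes_size function provides that breakdown, for example:

SELECT pg_size_pretty( pg_indexes_size('minifacebook_status') );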

Finally, run the following commands to show the number of rows in each table:

SELECT COUNT(*) FROM minifacebook_profile;
SELECT COUNT(*) FROM minifacebook_status;
SELECT COUNT(*) FROM minifacebook_poke;

Before proceeding, ensure you are familiar with the columns in each table, and the size of each table.

Step 4: Analyze SQL Queries

We will analyze queries which could be used to generate a page showing the most recent status updates.

Carefully read the following SQL command, which gets the 10 most recent statuses:

EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS)
  SELECT id, message, date_time, profile_id
  FROM minifacebook_status
  ORDER BY date_time DESC
  LIMIT 10;

Notice that it includes the EXPLAIN command, which describes the execution plan.
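
With the ANALYZE option, EXPLAIN actually executes the query and reports real row counts and timings. If you only want to see the plan the optimizer chose, without running the statement, you can use plain EXPLAIN, as in this sketch:

EXPLAIN
  SELECT id, message, date_time, profile_id
  FROM minifacebook_status
  ORDER BY date_time DESC
  LIMIT 10;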

Run the command now, and carefully read the output. Record the Execution Time.

Next, read the following SQL command, which gets the 10 most recent statuses, along with the associated profile’s email address:

EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS)
  SELECT email, message, date_time
  FROM minifacebook_status AS status JOIN minifacebook_profile AS profile
    ON status.profile_id = profile.id
  ORDER BY date_time DESC
  LIMIT 10;

Compare the output from this EXPLAIN to that of the previous query. In particular, notice the increase in total Execution Time, and the additional steps which were performed to get this query’s results. Record the Execution Time.

Next, run the queries without the EXPLAIN line:

SELECT id, message, date_time, profile_id
FROM minifacebook_status
ORDER BY date_time DESC
LIMIT 10;
SELECT email, message, date_time
FROM minifacebook_status AS status JOIN minifacebook_profile AS profile
  ON status.profile_id = profile.id
ORDER BY date_time DESC
LIMIT 10;

Without EXPLAIN, these queries will execute much faster, because the earlier runs left the relevant data in PostgreSQL's cache. This shows the importance of using EXPLAIN with the BUFFERS option: its output reports how many pages were found in cache (shared hit) versus read from disk (read), so you can tell whether a query was fast only because the cache was already warm.
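
If you are curious how much of a table is currently cached, the pg_buffercache extension you enabled earlier can show you. The following query (a rough sketch adapted from the pg_buffercache documentation) counts the 8 kB shared buffers currently holding pages of minifacebook_status:

SELECT c.relname, count(*) AS cached_buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname = 'minifacebook_status'
GROUP BY c.relname;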

Finally, run the following commands to display statistics on all the queries you have run. To avoid errors, copy and run the commands one at a time:

\x on
\pset pager off
SELECT * FROM pg_stat_statements ORDER BY total_exec_time;
\x off

total_exec_time can help you identify the queries the database spends the most time on, in total. mean_exec_time can help you identify slow queries, even if they are not run very often. Refer to the pg_stat_statements documentation for more information.
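
Rather than reading every column from SELECT *, you may find it easier to pull out just the fields of interest. For example, one way to list the five statements with the highest average execution time:

SELECT calls, mean_exec_time, total_exec_time, query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;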

Note: You should see that the queries take tens or hundreds of milliseconds. This is quite slow for a database query, and could be a performance bottleneck for the application.

Step 5: Add An Index

The performance of these two queries can be greatly improved by creating an index on a particular field. This is easy to do using Django. Simply add db_index=True to the appropriate field in django/djangoproject/minifacebook/models.py, like so:

some_field = models.SomeKindOfField(..., db_index=True)
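
Behind the scenes, db_index=True simply tells Django to create a standard SQL index. For a hypothetical blog_post table with a created_at column (not the table or field used in this lab, and with a made-up name suffix), the generated statement would look something like:

CREATE INDEX "blog_post_created_at_0f1a2b" ON "blog_post" ("created_at");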

Next, you should actually create this index in the database by generating and running a migration:

docker compose exec django python manage.py makemigrations
docker compose exec django python manage.py migrate

Finally, confirm that the index was created as expected by viewing the details of the table you modified in the psql shell:

\d TABLE_NAME
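
Alternatively, the pg_indexes system view shows each index's name and full definition directly from SQL:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'TABLE_NAME';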

Record the kind of index that was created.

View the query statistics, and record the CREATE INDEX statement run by your migration.

\x on
\pset pager off
SELECT * FROM pg_stat_statements ORDER BY total_exec_time;
\x off

Step 6: Reanalyze SQL Queries

Finally, go through the same steps as in Step 4 to reanalyze the two SQL queries (i.e., “recent statuses” and “recent statuses, with profile’s email”). You should find that they run much faster than before.

Record the new Execution Time for each query.

Calculate how many times faster the queries executed. (For example, if a query took 200 ms before the index and 2 ms after, it ran 100× faster.)

Based on the execution plans, explain how the index accelerated the queries.

You may notice that pg_stat_statements still includes statistics from runs of the queries before the index was created, so it may be helpful to reset those statistics:

SELECT pg_stat_statements_reset();

Submit

Submit your response on Gradescope, including:

  1. The CREATE INDEX statement used to create the index
  2. The type of index you created
  3. The execution times of the two queries, before and after creating the index
  4. How many times faster each query was after creating the index
  5. A description of how the queries’ execution plans changed after creating the index

This lab will be graded as part of your assignments grade.

Note: After submitting the assignment, I recommend erasing the PostgreSQL database, since the autogenerated data occupies a significant amount of storage. You can erase the containers and volumes using:

docker compose down -v