Lab 8: Query Processing and Optimization

This lab is designed to help you practice using PostgreSQL’s tools for query optimization.

You may work with your project team.

Instructions

Prepare by reading the “Planner/Optimizer” overview in the PostgreSQL documentation.

Step 1: Add Fake Users to MiniFacebook

For this lab, we will use the Django MiniFacebook example.

First, ensure you have the application running on your device. I’ve summarized the commands needed below, though you should refer to the repo’s documentation for more information:

docker compose up -d
docker compose exec django python manage.py migrate

Next, you will use the add_fb_bots.py program to add 50,000 fake profiles to the MiniFacebook application:

docker compose exec django python add_fb_bots.py --help
docker compose exec django python add_fb_bots.py --new-profiles 50000 --max-statuses 4 --max-pokes 2

This should take about a minute to run. You can use the docker stats command to watch the containers’ resource usage while it runs.

Step 2: Enable Performance Extensions

By default, PostgreSQL doesn’t track per-statement performance statistics. To enable this functionality, edit your docker-compose.yml file so the postgres service contains:

  postgres:
    image: postgres:13.1
    command: ["postgres", "-c", "shared_preload_libraries=pg_stat_statements"]

Next, rerun docker compose up -d so these changes take effect; Compose will recreate the postgres container with the new command.

Finally, enter psql in the PostgreSQL container, and run the following CREATE EXTENSION commands:

docker compose exec postgres bash
psql --username="$POSTGRES_USER" --dbname="$POSTGRES_DB"
CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION pg_buffercache;
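
To double-check that both the preload setting and the extensions took effect, you can run these standard PostgreSQL commands at the same psql prompt:

SHOW shared_preload_libraries;
SELECT extname, extversion FROM pg_extension;

The first should list pg_stat_statements, and the second should include rows for pg_stat_statements and pg_buffercache.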

Step 3: Review Database Schema

Run the following commands at the psql prompt to view the schema of the MiniFacebook application:

\d minifacebook_profile
\d minifacebook_status
\d minifacebook_poke

Run the following commands to view the size of the entire django_data database, and the size of the MiniFacebook application’s tables, respectively.

SELECT pg_size_pretty( pg_database_size('django_data') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_profile') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_status') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_poke') );

Finally, run the following commands to show the number of rows in each table:

SELECT COUNT(*) FROM minifacebook_profile;
SELECT COUNT(*) FROM minifacebook_status;
SELECT COUNT(*) FROM minifacebook_poke;
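
If you’d prefer these figures in a single result, one possible query uses the pg_stat_user_tables statistics view. Note that n_live_tup is an estimate maintained by the statistics collector, so it may lag slightly behind COUNT(*):

SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_live_tup AS approx_rows
FROM pg_stat_user_tables
WHERE relname LIKE 'minifacebook%'
ORDER BY pg_total_relation_size(relid) DESC;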

Before proceeding, make sure you are familiar with the columns in each table and the size of each table.

Step 4: Analyze SQL Queries

We will analyze queries which could be used to generate a page showing the most recent status updates.

Carefully read the following SQL command, which gets the 10 most recent statuses:

EXPLAIN (ANALYZE,VERBOSE,COSTS,BUFFERS)
  SELECT id, message, date_time, profile_id
  FROM minifacebook_status
  ORDER BY date_time DESC
  LIMIT 10;

Notice that it begins with the EXPLAIN command, which reports the plan PostgreSQL chooses for the statement; the ANALYZE option additionally executes the statement, so the output includes actual row counts and timings alongside the planner’s estimates.

Run the command now, and carefully read the output. Record the Execution Time. Notice that it should be close to a second, which is quite slow for a database query.
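
As an aside, it is the ANALYZE option that causes the statement to actually execute. If you only want the planner’s estimates, a plain EXPLAIN plans the query without running it:

EXPLAIN
  SELECT id, message, date_time, profile_id
  FROM minifacebook_status
  ORDER BY date_time DESC
  LIMIT 10;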

Next, read the following SQL command, which gets the 10 most recent statuses, along with the associated profile’s email address:

EXPLAIN (ANALYZE,VERBOSE,COSTS,BUFFERS)
  SELECT email, message, date_time
  FROM minifacebook_status AS status JOIN minifacebook_profile AS profile
    ON status.profile_id = profile.id
  ORDER BY date_time DESC
  LIMIT 10;

Compare the output from this EXPLAIN to that of the previous query. In particular, notice the increase in total Execution Time, and the additional steps which were performed to get this query’s results. Record the Execution Time.
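
Before running the queries again, you may want psql to print each statement’s elapsed time; its built-in \timing meta-command does this:

\timing on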

Now run the queries without the EXPLAIN line:

SELECT id, message, date_time, profile_id
FROM minifacebook_status
ORDER BY date_time DESC
LIMIT 10;
SELECT email, message, date_time
FROM minifacebook_status AS status JOIN minifacebook_profile AS profile
  ON status.profile_id = profile.id
ORDER BY date_time DESC
LIMIT 10;

Notice that they complete almost instantly. PostgreSQL does not cache query results; instead, the earlier runs pulled the tables’ pages into its buffer cache, so these repeat runs read everything from memory rather than disk. This shows the importance of EXPLAIN’s BUFFERS option: its shared hit and read counts tell you whether a measured time reflects cached pages or cold reads from disk.
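
Since you enabled pg_buffercache in Step 2, you can inspect this cache directly. The following query, adapted from the pg_buffercache documentation, lists the relations occupying the most buffers in the current database:

SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY 2 DESC
LIMIT 10;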

Finally, run the following commands to display statistics on all the queries you have run:

\x on
\pset pager off
SELECT * FROM pg_stat_statements ORDER BY total_exec_time;
\x off

total_exec_time can help you identify the queries the database spends the most time on, in total. mean_exec_time can help you identify slow queries, even if they are not run very often. Refer to the pg_stat_statements documentation for more information.
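
The full view is quite wide. A trimmed-down query such as the following (using the PostgreSQL 13 column names) can make slow statements easier to spot:

SELECT calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 60) AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;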

Step 5: Add An Index

The performance of these two queries can be greatly improved by creating an index on a particular field. This is easy to do using Django. Simply add db_index=True to the appropriate field in django/djangoproject/minifacebook/models.py, like so:

some_field = models.SomeKindOfField(..., db_index=True)

Next, you should actually create this index in the database by generating and running a migration:

docker compose exec django python manage.py makemigrations
docker compose exec django python manage.py migrate
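
Behind the scenes, the migrate step issues a CREATE INDEX statement. As a rough sketch, it will resemble the following, where some_field and the index name are placeholders rather than the names Django actually generates:

-- Hypothetical sketch; check pg_stat_statements below for the real statement
CREATE INDEX minifacebook_status_some_field_idx
    ON minifacebook_status (some_field);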

Finally, confirm that the index was created as expected by running the following in the psql shell:

\d minifacebook_status

Record the kind of index that was created.

View the query statistics, and record the CREATE INDEX statement run by your migration.

\x on
\pset pager off
SELECT * FROM pg_stat_statements ORDER BY total_exec_time;
\x off

Step 6: Reanalyze SQL Queries

Finally, go through the same steps as in Step 4 to reanalyze the two SQL queries (i.e., “recent statuses” and “recent statuses, with profile’s email”). You should find that they run much faster than before.

Record the new Execution Time for each query, and calculate how many times faster each query is.

Describe how the execution plans differ.

You may notice that pg_stat_statements still contains entries from running the queries before the index existed, so it may be helpful to reset those statistics:

SELECT pg_stat_statements_reset();
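
pg_stat_statements_reset() discards all accumulated entries; if you’d like to confirm, the view should be close to empty immediately afterwards (the reset call itself may appear as one entry):

SELECT count(*) FROM pg_stat_statements;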

Submit

If you worked with your team, only one of you should upload the PDF to Moodle. Your PDF should contain:

  1. The names of your teammates
  2. The CREATE INDEX statement used to create the index
  3. The type of index you created
  4. The execution times of the two queries, before and after creating the index
  5. How many times faster each query is after creating the index
  6. A description of how the queries’ execution plans changed after creating the index

This lab counts toward your assignment grade.