Lab: Query Processing and Optimization
In this lab, you will practice using PostgreSQL’s tools for query optimization.
Instructions
You may work either individually or with a partner.
Prepare by reading this overview of the PostgreSQL Planner/Optimizer.
Note for Windows users: if you haven’t already, complete Step 2 from my guide to Configuring Windows for Docker Development.
Step 1: Add Fake Users to MiniFacebook
For this lab, we will use the Django MiniFacebook project.
First, follow the documentation in the repo’s README to run Django MiniFacebook on your device.
Next, use the add_fb_bots.py program to add fake data to the MiniFacebook application:
docker compose exec django python add_fb_bots.py --help
docker compose exec django python add_fb_bots.py
The command will run for two minutes. You can use the docker stats command to monitor its progress.
Note: The faster your computer, the more data will be generated within the time limit.
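If you would like to watch resource usage for just the Django container, one option (assuming your Compose service is named django, as in the commands above) is to pass its container ID to docker stats:
docker stats $(docker compose ps -q django)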
Step 2: Enable Performance Extensions
By default, PostgreSQL doesn’t make all performance statistics visible to users. To enable this functionality, edit your docker-compose.yml file so it contains:
postgres:
  image: postgres:16.1
  command: ["postgres", "-c", "shared_preload_libraries=pg_stat_statements"]
Next, rerun docker compose up -d, so these changes to the compose file take effect.
Finally, start psql in the PostgreSQL container, and run the following CREATE EXTENSION commands:
docker compose exec postgres bash
psql --username="$POSTGRES_USER" --dbname="$POSTGRES_DB"
CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION pg_buffercache;
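If you would like to verify the setup before moving on, you can check that the library was preloaded and list the installed extensions; both of these are standard PostgreSQL/psql commands:
SHOW shared_preload_libraries;
\dx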
Step 3: Review Database Schema
Run the following commands at the psql prompt to view the schema of the MiniFacebook application:
\d minifacebook_profile
\d minifacebook_status
\d minifacebook_poke
Run the following commands to view the size of the entire django_data database, and the size of the MiniFacebook application’s tables, respectively.
SELECT pg_size_pretty( pg_database_size('django_data') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_profile') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_status') );
SELECT pg_size_pretty( pg_total_relation_size('minifacebook_poke') );
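Note that pg_total_relation_size counts a table’s indexes and TOAST storage as well as the table itself. If you are curious how the space breaks down, you can compare the table-only and index-only sizes, shown here for minifacebook_status:
SELECT pg_size_pretty( pg_relation_size('minifacebook_status') );
SELECT pg_size_pretty( pg_indexes_size('minifacebook_status') );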
Finally, run the following commands to show the number of rows in each table:
SELECT COUNT(*) FROM minifacebook_profile;
SELECT COUNT(*) FROM minifacebook_status;
SELECT COUNT(*) FROM minifacebook_poke;
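On large tables, COUNT(*) can itself be slow, because it must scan the whole table. For a quick sanity check, you can instead read the planner’s approximate row counts from the system catalog (these estimates are refreshed by ANALYZE and autovacuum, so they may lag slightly behind the true counts):
SELECT relname, reltuples::bigint AS approx_rows
FROM pg_class
WHERE relkind = 'r' AND relname LIKE 'minifacebook_%';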
Before proceeding, ensure you are familiar with the columns in each table, and the size of each table.
Step 4: Analyze SQL Queries
We will analyze queries which could be used to generate a page showing the most recent status updates.
Carefully read the following SQL command, which gets the 10 most recent statuses:
EXPLAIN (ANALYZE,VERBOSE,COSTS,BUFFERS)
SELECT id, message, date_time, profile_id
FROM minifacebook_status
ORDER BY date_time DESC
LIMIT 10;
Notice that it includes the EXPLAIN command, which describes the execution plan.
Run the command now, and carefully read the output. Record the Execution Time.
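Tip: running EXPLAIN without the ANALYZE option prints the planner’s estimated plan without actually executing the query, which is useful when you want to preview a potentially expensive statement:
EXPLAIN
SELECT id, message, date_time, profile_id
FROM minifacebook_status
ORDER BY date_time DESC
LIMIT 10;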
Next, read the following SQL command, which gets the 10 most recent statuses, along with the associated profile’s email address:
EXPLAIN (ANALYZE,VERBOSE,COSTS,BUFFERS)
SELECT email, message, date_time
FROM minifacebook_status AS status JOIN minifacebook_profile AS profile
ON status.profile_id = profile.id
ORDER BY date_time DESC
LIMIT 10;
Compare the output from this EXPLAIN to that of the previous query. In particular, notice the increase in total Execution Time, and the additional steps performed to get this query’s results. Record the Execution Time.
Finally, run the queries without the EXPLAIN line:
SELECT id, message, date_time, profile_id
FROM minifacebook_status
ORDER BY date_time DESC
LIMIT 10;
SELECT email, message, date_time
FROM minifacebook_status AS status JOIN minifacebook_profile AS profile
ON status.profile_id = profile.id
ORDER BY date_time DESC
LIMIT 10;
Without EXPLAIN, these queries will execute much faster, because PostgreSQL has now cached the relevant data. This shows the importance of using EXPLAIN: it reveals the time needed when the results cannot simply be read from cache.
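If you want query timings without EXPLAIN’s instrumentation overhead, you can also toggle psql’s built-in timer, which reports the elapsed time of each statement you run:
\timing on
Rerun the two queries, then disable the timer with \timing off.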
Finally, run the following commands to display statistics on all the queries you have run. To avoid errors, copy and run the commands one at a time:
\x on
\pset pager off
SELECT * FROM pg_stat_statements ORDER BY total_exec_time;
\x off
total_exec_time can help you identify the queries the database spends the most time on, in total. mean_exec_time can help you identify slow queries, even if they are not run very often.
Refer to the pg_stat_statements documentation for more information.
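Rather than selecting every column, you may find it easier to read a focused report; for example, this shows the ten most expensive queries by total time:
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;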
Note: You should see that the queries take tens or hundreds of milliseconds. This is quite slow for a database query, and could be a performance bottleneck for the application.
Step 5: Add an Index
The performance of these two queries can be greatly improved by creating an index on a particular field. This is easy to do using Django: simply add db_index=True to the appropriate field in django/djangoproject/minifacebook/models.py, like so:
some_field = models.SomeKindOfField(..., db_index=True)
Next, you should actually create this index in the database by generating and running a migration:
docker compose exec django python manage.py makemigrations
docker compose exec django python manage.py migrate
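Under the hood, the migration issues a CREATE INDEX statement. Sticking with the hypothetical some_field placeholder from above, the generated SQL looks roughly like the following (Django appends a generated suffix to the index name, so your actual statement will differ):
CREATE INDEX "minifacebook_sometable_some_field_xxxxxxxx" ON "minifacebook_sometable" ("some_field");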
Finally, confirm that the migration was created as expected by viewing the details of the table you modified in the psql shell:
\d TABLE_NAME
Record the kind of index that was created.
View the query statistics, and record the CREATE INDEX statement run by your migration:
\x on
\pset pager off
SELECT * FROM pg_stat_statements ORDER BY total_exec_time;
\x off
Step 6: Reanalyze SQL Queries
Finally, go through the same steps as in Step 4 to reanalyze the two SQL queries (i.e., “recent statuses” and “recent statuses, with profile’s email”). You should find that they run much faster than before.
Record the new Execution Time for each query.
Calculate how many times faster the queries executed.
Based on the execution plans, explain how the index accelerated the queries.
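One way to double-check that the planner is actually using your new index (rather than still performing a sequential scan) is to look at its scan count in the pg_stat_user_indexes statistics view:
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname LIKE 'minifacebook_%';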
You may notice that the pg_stat_statements results will still contain entries from running the queries before the index was created, so it may be helpful to reset those statistics:
SELECT pg_stat_statements_reset();
Submit
Submit your response on Gradescope, including:
- The CREATE INDEX statement used to create the index
- The type of index you created
- The execution times of the two queries, before and after creating the index
- How many times faster each query was after creating the index
- A description of how the queries’ execution plans changed after creating the index
This lab will be graded as part of your assignments grade.
Note: After submitting the assignment, I recommend erasing the PostgreSQL database, since the autogenerated data occupies a significant amount of storage. You can erase the containers and volumes using:
docker compose down -v