About Me!

This blog is about my musings and thoughts. I hope you find it useful, at most, and entertaining, at least.

Résumé [PDF]

Other Pages

Quotes

Links

Oak Island

Items for Sale

Presence Elsewhere

jim@jimkeener.com

del.icio.us

Twitter

Facebook

LinkedIn

GitHub

BitBucket

Keybase.io

Understanding PostgreSQL's EXPLAIN output and Query Planner

Date: 2015-09-07
Tags: postgres development

We’ll begin our exploration of the EXPLAIN (more information from the docs) clause and query planner by downloading the MovieLens 20 million ratings file. The README.txt file contained in the zip has informational we’ll use to build our test. (Note: The zip is 132 MB. I wanted a large dataset for reasons we’ll get into.)

(Note: While the EXPLAIN output and syntax shouldn’t change drastically, I would note that all the output comes from PostgreSQL 9.3.9.)


jim@lilith /monongahela/movielens% wget -x http://files.grouplens.org/datasets/movielens/ml-20m.zip
jim@lilith /monongahela/movielens% wget -x http://files.grouplens.org/datasets/movielens/ml-20m.zip.md5
jim@lilith /monongahela/movielens/files.grouplens.org/datasets/movielens% [ "`md5sum ml-20m.zip | awk '{print $1}'`" = "`cat ml-20m.zip.md5 | awk '{print $4}'`" ] && echo "OK"
jim@lilith /monongahela/movielens/files.grouplens.org/datasets/movielens% unzip ml-20m.zip

Then we’ll create the database and the movies table. (We’ll get to the other tables later.) We’ll then import the csv file directly. Lastly, we’ll turn the list of genres each movie has, stored as a pipe-delimeted string, into an text array field.


jim@lilith /monongahela/movielens% createdb planner_example
jim@lilith /monongahela/movielens% psql planner_example
[local]:5432 jim@planner_example=# create table movies (
  movie_id integer, 
  title text, 
  genres_text text, 
  genres text[]);
[local]:5432 jim@planner_example=# copy movies (movie_id, title, genres_text) from '/monongahela/movielens/files.grouplens.org/datasets/movielens/ml-20m/movies.csv' with (format csv, header);
[local]:5432 jim@planner_example=# update movies set genres = regexp_split_to_array(genres_text, '\|');
[local]:5432 jim@planner_example=# alter table movies drop genres_text;

To start, we will do a simple selection by the movie_id. (Note: We havn’t yet created an index, let alone a primary key). Then we’ll want to look at the EXPLAIN output.


[local]:5432 jim@planner_example=# select * from movies where movie_id = 6645;
 movie_id |      title      |             genres              
----------+-----------------+---------------------------------
     6645 | THX 1138 (1971) | {Action,Adventure,Drama,Sci-Fi}
(1 row)
Time: 7.989 ms
[local]:5432 jim@planner_example=# explain select * from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=1 width=77)
   Filter: (movie_id = 6645)
(2 rows)
Time: 0.493 ms

The first thing to notice about the EXPLAIN output is that it’s named “Query Plan” and the second is that it has two lines, the second being indented. The default output1 indents additional steps that need to perform the step above it. Here, we are told that this query will do a “Seq Scan”, a.k.a. Sequence Scan, meaning that it will scan every row in the table in sequence. The indented row tells us that the results of the Seq Scan will be filtered, and what the filter is.

The Seq Scan row has some additional information attached to it: cost, rows, and width. Let’s take these backwards, as cost is the most complex to describe. Width is the approximate average size in bytes of the returned rows.


[local]:5432 jim@planner_example=# explain select title from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=1 width=28)
   Filter: (movie_id = 6645)
(2 rows)
Time: 0.587 ms
[local]:5432 jim@planner_example=# explain select genres from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=1 width=45)
   Filter: (movie_id = 6645)
(2 rows)
Time: 0.521 ms
[local]:5432 jim@planner_example=# explain select movie_id, title from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=1 width=32)
   Filter: (movie_id = 6645)
(2 rows)
Time: 0.568 ms

In the first query we see that the integer type is 4-bytes long, which isn’t entirely unexpected. The title is a text field, which can be arbitrarily long. We see that the table statistics kept by PostgreSQL have it listed as 28-bytes long on average.


[local]:5432 jim@planner_example=# select avg(length(title)) from movies ;
         avg         
---------------------
 27.8817361976684508
(1 row)
Time: 23.780 ms

Those statistics aren’t half bad! The final query, which selects only the movie_id and title has a width of 32, which not-so-suprisingly is 4+28. The width in our original query is 77, again, which is 4+28+45.

Rows is a guess at how many rows will be returned. PostgreSQL keeps statistics on the number of total rows and the cardnality of the values in the row, along with other information, to expect how many rows will be returned. This is important because part of cost, which we cover next, is how much disk IO will be performed. Returned the majority of the table has different IO needs than returning a handful of rows.

Cost is made of two values: start-up cost and approximate cost to complete the step. Cost is not a direct measure of time, but should be proportional. The values configuration constants used by the planner are as follows

Parameter Description Default Relative cost vs seq_page_cost
random_page_cost Nonsequential disk page fetch. 4.00 4x slower
seq_page_cost Sequentially disk page fetch. 1.00
cpu_tuple_cost Processing each tuple. 0.01 100x faster
cpu_index_tuple_cost Processing each entry during an index scan. 0.005 200x faster
cpu_operator_cost Processing each operator or function call. 0.0025 400x faster

Some vocabulary:

  • Tuple is a row of information.
  • Disk page is a block of information that can be read from the disk in a single operation; this may contain multiple tuples.

Each operation that the Query Planner performs is broken into some multiple of each of these operations. As mentioned earlier, the planner attempts to estimate the number of rows returned. This information is partly used to predict if sequential or nonsequential reads will be required. If it becomes cheaper to read the entire table from disk than to read a lot of random table pages, the planner may opt to do so — manifesting in a Seq Scan as opposed to an Index Scan, for exmaple.

These values, taken from many years of experience, can be tweeked in postgresql.conf, though I wouldn’t recommend that unless you are very sure of what you’re doing. For instance, an SSD may have a random-access cost similar to a sequential read.

Additionally, this is why I opted to do this tutorial on such a large example set. When the table size is small, it is often faster to Seq Scan, than it is to read the index and then read the tuples required. When the planner thinks it’ll be reading the majority of the rows anyway, it may forgo indices, even in very large tables, for the sake of decreasing the total amount of disk IO.

Since movie_id is supposed to be the primary key, let’s create a unique index and set it as the primary key.


[local]:5432 jim@planner_example=# create unique index on movies (movie_id);
[local]:5432 jim@planner_example=# alter table movies add primary key using index movies_movie_id_idx ;
[local]:5432 jim@planner_example=# \d movies
     Table "public.movies"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 movie_id | integer | not null
 title    | text    | 
 genres   | text[]  | 
Indexes:
    "movies_movie_id_idx" PRIMARY KEY, btree (movie_id)

Now, let’s rerun the query and see what’s changed.


[local]:5432 jim@planner_example=# select * from movies where movie_id = 6645; movie_id |      title      |             genres              
----------+-----------------+---------------------------------
     6645 | THX 1138 (1971) | {Action,Adventure,Drama,Sci-Fi}
(1 row)
Time: 0.737 ms

The first thing we should notice is that the query ran 10x faster! Let’s look at the explain.


[local]:5432 jim@planner_example=# explain select * from movies where movie_id = 6645;
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Index Scan using movies_movie_id_idx on movies  (cost=0.29..8.30 rows=1 width=77)
   Index Cond: (movie_id = 6645)
(2 rows)
Time: 0.551 ms

We should notice that that Seq Scan from before has been transformed to an Index Scan, which additionally tells us the index being used, and which condition is being used in the index (the indented Index Cond line). The index scan allows us to not read pages from disk and skip directly to the one that will fulfull our query. This is evident in the cost: over 1000 to 8.

Another use of EXPLAIN is EXPLAIN ANALYZE, which actually runs the query and compares the actual values to the predicted ones.


[local]:5432 jim@planner_example=# explain analyze select * from movies where movie_id = 6645;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using movies_movie_id_idx on movies  (cost=0.29..8.30 rows=1 width=77) (actual time=0.023..0.025 rows=1 loops=1)
   Index Cond: (movie_id = 6645)
 Total runtime: 0.057 ms
(3 rows)
Time: 0.561 ms

Again, cost and actual time are in different units and will never be identical — cost is unitless and actual time is in ms. There is a rows value which is the same as in the predicted values. A hint here is that if the predicted number of rows is vastly different from the actual, a VACUUM ANALYZE may need to be run to update table and index statistics. The predicted number of rows won’t often match the actual rows exactly, but should be close. There is an additional value in the actual paramters: loops. This is an indication of how many times this part of the plan was run. We’ll come back to this value.

Now, let’s look at searching on the title.


[local]:5432 jim@planner_example=# select * from movies where title like '%THX%';
 movie_id |                  title                   |             genres              
----------+------------------------------------------+---------------------------------
     6645 | THX 1138 (1971)                          | {Action,Adventure,Drama,Sci-Fi}
   116157 | Electronic Labyrinth THX 1138 4EB (1967) | {Sci-Fi}
(2 rows)
Time: 14.442 ms

We see that the full-length motion picture and the short film that inspired it. Let’s see what the planner did to do the search.


[local]:5432 jim@planner_example=# explain select * from movies where title like '%THX%';
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=3 width=77)
   Filter: (title ~~ '%THX%'::text)
(2 rows)
Time: 0.580 ms

PostgreSQL does a Seq Scan, which isn’t surprising considering we haven’t created an index on the title field. Let’s create an index and see what happens.


[local]:5432 jim@planner_example=# create index on movies (title);
CREATE INDEX
Time: 205.577 ms
[local]:5432 jim@planner_example=# \d movies
     Table "public.movies"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 movie_id | integer | not null
 title    | text    | 
 genres   | text[]  | 
Indexes:
    "movies_movie_id_idx" PRIMARY KEY, btree (movie_id)
    "movies_title_idx" btree (title)
[local]:5432 jim@planner_example=# select * from movies where title like '%THX%';
 movie_id |                  title                   |             genres              
----------+------------------------------------------+---------------------------------
     6645 | THX 1138 (1971)                          | {Action,Adventure,Drama,Sci-Fi}
   116157 | Electronic Labyrinth THX 1138 4EB (1967) | {Sci-Fi}
(2 rows)
Time: 14.207 ms
[local]:5432 jim@planner_example=# explain select * from movies where title like '%THX%';
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=3 width=77)
   Filter: (title ~~ '%THX%'::text)
(2 rows)
Time: 0.738 ms

It doesn’t use the index! What!

Taking a look at table description, we see that movies_title_idx is a btree index. These indecies are fast to use and update, but can only handle equality and inequality comparisions (<, <=, =, >=, and >), not set contains or wild-card searches. Luckily! there is the pg_trgm (trgm = trigram) which creates an index based on all substrings up-to 3 characters long: Widdershins is broken into ‘e’, ‘n’, ‘W’, ‘r’, ‘h’, ‘i’, ‘d’, ‘s’, ‘ns’, ‘er’, ‘Wi’, ‘in’, ‘de’, ‘dd’, ‘rs’, ‘hi’, ‘id’, ‘sh’, ‘dde’, ‘ers’, ‘shi’, ‘der’, ‘ins’, ‘hin’, ‘rsh’, ‘idd’, and ‘Wid’.


[local]:5432 jim@planner_example=# create extension pg_trgm;
[local]:5432 jim@planner_example=# create index on movies using gist (title gist_trgm_ops);
[local]:5432 jim@planner_example=# \d movies
     Table "public.movies"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 movie_id | integer | not null
 title    | text    | 
 genres   | text[]  | 
Indexes:
    "movies_movie_id_idx" PRIMARY KEY, btree (movie_id)
    "movies_title_idx" btree (title)
    "movies_title_idx1" gist (title gist_trgm_ops)
[local]:5432 jim@planner_example=# drop index movies_title_idx;
[local]:5432 jim@planner_example=# select * from movies where title like '%THX%';
 movie_id |                  title                   |             genres              
----------+------------------------------------------+---------------------------------
     6645 | THX 1138 (1971)                          | {Action,Adventure,Drama,Sci-Fi}
   116157 | Electronic Labyrinth THX 1138 4EB (1967) | {Sci-Fi}
(2 rows)
Time: 10.321 ms
[local]:5432 jim@planner_example=# explain select * from movies where title like '%THX%';
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Bitmap Heap Scan on movies  (cost=4.30..15.74 rows=3 width=77)
   Recheck Cond: (title ~~ '%THX%'::text)
   ->  Bitmap Index Scan on movies_title_idx1  (cost=0.00..4.30 rows=3 width=0)
         Index Cond: (title ~~ '%THX%'::text)
(4 rows)
Time: 0.648 ms

There is a lot going on there. pg_trgm is a built-in extension, meaning that, like hstore, it needs enabled with CREATE EXTENSION, but that 3rd part software need not be installed. Next, we’re using GiST index. While I’ll leave an in-depth examination of the types of indices for another post, I will say that b-trees are tree-like structure that keeps data in sorted order and are fast to update and use; GiST indices can keep arbitrary, multi-dimentional indices which are more flexible, a little slower to used than b-trees, and much slower to update. (Note: GiST isn’t an index, actually. It’s an interface for an index to be presented as and therefor may have different performance characteristics for different types. However, all types should offer the same types of operations that can be accelerated by the index.)

We see our query runs a little faster, but not significantly faster. Let’s take a look at what steps go into performing our query. First, we see that a “Bitmap Heap Scan” is being done. Looking beneath, to the -> we see that to do the Bitmap Heap Scan, we need to do the Bitmap Index Scan, which traverses the index and ascertains if a disk page contains a tuple matching the Index Cond, additionally, we are told what index and what condition is being used. The Recheck Cond done by the Bitmap Heap Scan is what condition is used to filter the correct tuple from the disk page (remember! the Bitmap Index Scan only tells us a tuple is on the page, not where!)

———-

1 PostgreSQL has multiple output formats for EXPLAIN, some of which are suitable for consumption by other tools.


[local]:5432 jim@planner_example=# explain (format xml) select * from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 <explain xmlns="http://www.postgresql.org/2009/explain">+
   <Query>                                               +
     <Plan>                                              +
       <Node-Type>Index Scan</Node-Type>                 +
       <Scan-Direction>Forward</Scan-Direction>          +
       <Index-Name>movies_movie_id_idx</Index-Name>      +
       <Relation-Name>movies</Relation-Name>             +
       <Alias>movies</Alias>                             +
       <Startup-Cost>0.29</Startup-Cost>                 +
       <Total-Cost>8.30</Total-Cost>                     +
       <Plan-Rows>1</Plan-Rows>                          +
       <Plan-Width>77</Plan-Width>                       +
       <Index-Cond>(movie_id = 6645)</Index-Cond>        +
     </Plan>                                             +
   </Query>                                              +
 </explain>
(1 row)
Time: 2.919 ms
[local]:5432 jim@planner_example=# explain (format json) select * from movies where movie_id = 6645;
                 QUERY PLAN                 
--------------------------------------------
 [                                         +
   {                                       +
     "Plan": {                             +
       "Node Type": "Index Scan",          +
       "Scan Direction": "Forward",        +
       "Index Name": "movies_movie_id_idx",+
       "Relation Name": "movies",          +
       "Alias": "movies",                  +
       "Startup Cost": 0.29,               +
       "Total Cost": 8.30,                 +
       "Plan Rows": 1,                     +
       "Plan Width": 77,                   +
       "Index Cond": "(movie_id = 6645)"   +
     }                                     +
   }                                       +
 ]
(1 row)
Time: 0.601 ms
[local]:5432 jim@planner_example=# explain (format yaml) select * from movies where movie_id = 6645;
              QUERY PLAN               
---------------------------------------
 - Plan:                              +
     Node Type: "Index Scan"          +
     Scan Direction: "Forward"        +
     Index Name: "movies_movie_id_idx"+
     Relation Name: "movies"          +
     Alias: "movies"                  +
     Startup Cost: 0.29               +
     Total Cost: 8.30                 +
     Plan Rows: 1                     +
     Plan Width: 77                   +
     Index Cond: "(movie_id = 6645)"
(1 row)
Time: 0.537 ms

(The +‘s at the end of the line represent a new line and would not be returned in a programmatic fashion, such as your favorite programming languages postgresql library.)

These formats allow for nesting of the query plan as well. (Ed Note: Replace these with a better example when we get to that point).