Interpolating Old Census Data

Date: 2016-03-15

Tags: census postgis postgres gis

Problem

A group I’m part of is trying to learn about racial displacement around Pittsburgh. To do so we looked at the census data. We quickly learned that the census geometries (e.g. blocks and tracts) aren’t static between censuses.

While this isn’t entirely surprising, it does complicate comparing the demographics of blocks and tracts across censuses, since the overlaps between the two vintages aren’t easy to compute.

I was looking at the Longitudinal Tract Database, clicked through to the maps, and didn’t like the terms there. (I later found that the LTDB has a separate download section, but I couldn’t find the terms of use there either. Also, while I can appreciate that SPSS is very common, it upsets me to see people continuing to distribute data in non-free formats. Thanks be for PSPP!)

Regardless, I couldn’t find terms of use and many of their menus don’t work, so I decided to figure this out on my own. Looking further into the issue, I found that the census publishes a series of Relationship Files that record which geometries were merged and split between 2000 and 2010. Moreover, they contain the area of each geometry AND the area of the intersection between them. (The intersection part is important because, as can be seen in the accompanying image, the geometries don’t align properly and the 2000 geometries are “rougher” than the 2010 ones anyway, so it would be much more difficult to compute the intersection areas from the published geometries alone.)

Solution

Having the intersection areas allows us to do “areal interpolation.” I used a simple method: $$ attribute_{target} = attribute_{source} \cdot \frac{area_{intersection}}{area_{source}} $$. Since each geometry overlaps at most a handful of immediately adjacent ones, and often with 10% or more of its total area, I felt this simpler method was warranted.
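To make the formula concrete with made-up numbers: if a 2000 block of 250 people has 60% of its land area falling in one 2010 block and 40% in another, it contributes 250 × 0.6 = 150 people to the first and 250 × 0.4 = 100 people to the second.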

Now on to the creation of the interpolation layers. First we need census data to interpolate. I’m in Pennsylvania, so we’ll focus on it. I was also specifically focusing on race, which means all my examples are for Census 2010 Table P3 and Census 2000 Table P7.

Importing the Census 2000 Summary Files

wget -x -c -m -r -np http://www2.census.gov/census_2000/datasets/Summary_File_1/Pennsylvania/ will pull all the tables for Summary File 1, which is unsampled data. (Summary File 3 is a 1-in-6 sample and contains information that was later moved into the ACS.)

I then gathered the schema for the 2000 Census from various files such as the MS Access Shell and SAS Programs. (There is a Free Software reader for MS Access called MDB Tools [Warning: SourceForge Link]. Dap is an application that can read SAS files and do much of what SAS can. For my purposes, I was able to get what I needed directly from the SAS program file itself.)

I’ve been using PostgreSQL, but you should be able to modify the queries if you choose to use another RDBMS.

For the sake of time, I created a schema file (it contains all tables, but not all the indices a full import would want) and an import query for the geoheader. Remember to import the sf1geo file into census2000.tmp_geoheader first and then run the import query. I find it easier to work with fips ids, so I create them, as can be seen in the last line of the import query.
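The actual import query builds the fips ids as part of the insert; as a standalone sketch of the concatenation I mean (the column names here follow the SF1 geoheader layout, so adjust them to your schema):

update census2000.sf1geo
   set geoid00 = state || county || tract || block
 where sumlev = '101';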

The census data files, once unzipped, can be imported via a COPY command (\copy in psql), e.g. \copy census2000.sf10001 from "/mnt/tmp/www2.census.gov/census_2000/datasets/Summary_File_1/Pennsylvania/pa00001.uf1"

Importing the 2010 Census data is similar. I’ll dedicate another post to the details of the census and ACS imports.

I wasn’t able to find broken-out views or human-readable-field-name views, so I’m working on finding a file similar to the 2010 field-id-to-name mapping put out by the census. Alternatively, I might attempt to extract that information from the text of the SF1 Technical Document or from FactFinder. Either way, that’s for another blog post!

Create Interpolations

Relationship Files

I then grabbed the Relationship File for the blocks, created this schema and used a COPY command to import the file. From there I ran a query to update the conversion factors (to make later queries simpler).
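The import itself is just another \copy; something like the following sketch (treat the file path as a placeholder for wherever the Pennsylvania block relationship file unpacked, and the csv format as an assumption about the comma-delimited flavor of the file):

\copy public.tabblock_00_10 from '/mnt/tmp/tab2010_tab2000_st42_pa.txt' with (format csv)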

update public.tabblock_00_10 set c10_00 = case when arealand_2010 = 0 then 0 else arealand_int::double precision / arealand_2010::double precision end,
                                 c00_10 = case when arealand_2000 = 0 then 0 else arealand_int::double precision / arealand_2000::double precision end;

Interpolate!

I began by doing the interpolation at the following summary levels [PDF] (HTML table without the block level code):

Summary Level   Description
101             Block Level (State-County-County Subdivision-Place/Remainder-Census Tract-Block Group-Block)
140             State-County-Census Tract
050             State-County
040             State



create table c00p7proj2010 as select

state_2010 as state, county_2010 as county, tract_2010 as tract, blk_2010 as block, concat(state_2010, county_2010, tract_2010, blk_2010) as geoid10, sum(P007001 * c00_10) AS Total_population, sum(P007002 * c00_10) AS White_alone, sum(P007003 * c00_10) AS Black_or_African_American_alone, sum(P007004 * c00_10) AS American_Indian_and_Alaska_Native_alone, sum(P007005 * c00_10) AS Asian_alone, sum(P007006 * c00_10) AS Native_Hawaiian_and_Other_Pacific_Islander_alone, sum(P007007 * c00_10) AS Some_Other_Race_alone, sum(P007008 * c00_10) AS Two_or_More_Races from census2000.sf10002 inner join census2000.sf1geo using (logrecno) inner join tabblock_00_10 using (geoid00) where sumlev in ('101') group by rollup(state_2010, county_2010, tract_2010, blk_2010);

The table c00p7proj2010 now contains the Census 2000 counts interpolated into the Census 2010 blocks. Now I wanted to compute the values for geometries higher in the hierarchy [PDF] made up of the blocks we just computed. “Why not just interpolate each individually?” you may be asking. The reason is that if a tract were split in such a way that an area of low population (e.g. a park) was made its own tract or moved to a different tract, the interpolation would move people who never lived there to the other tract. By performing the interpolation at the smallest possible geometry, this issue is largely avoided, as areas of low population are their own block, not incorporated into others.



create table c00p7proj2010_rollup as select

state, county, tract, block, concat(state, county, tract, block) as geoid10, sum(Total_population) AS Total_population, sum(White_alone) AS White_alone, sum(Black_or_African_American_alone) AS Black_or_African_American_alone, sum(American_Indian_and_Alaska_Native_alone) AS American_Indian_and_Alaska_Native_alone, sum(Asian_alone) AS Asian_alone, sum(Native_Hawaiian_and_Other_Pacific_Islander_alone) AS Native_Hawaiian_and_Other_Pacific_Islander_alone, sum(Some_Other_Race_alone) AS Some_Other_Race_alone, sum(Two_or_More_Races) AS Two_or_More_Races from c00p7proj2010 group by rollup(state, county, tract, block);

Since the 2000 and 2010 data will end up in tables with identically named columns, we’re going to tease out just the fields we want from the 2010 Census.



create table c10p3 as select

state, county, tract, block, geoid10, P0030001 AS Total_population, P0030002 AS White_alone, P0030003 AS Black_or_African_American_alone, P0030004 AS American_Indian_and_Alaska_Native_alone, P0030005 AS Asian_alone, P0030006 AS Native_Hawaiian_and_Other_Pacific_Islander_alone, P0030007 AS Some_Other_Race_alone, P0030008 AS Two_or_More_Races from census2010.sf1_00003 inner join census2010.geo_header_sf1 using (logrecno) where sumlev in ('101', '140', '050', '040');

Now we want to compute the percentage each group makes up of the total population.



create table c10p3_percent as select

geoid10, Total_population, Total_population::double precision / Total_population::double precision AS Total_population_percent, White_alone, White_alone / Total_population::double precision AS White_alone_percent, Black_or_African_American_alone, Black_or_African_American_alone / Total_population::double precision AS Black_or_African_American_alone_percent, American_Indian_and_Alaska_Native_alone, American_Indian_and_Alaska_Native_alone / Total_population::double precision AS American_Indian_and_Alaska_Native_alone_percent, Asian_alone, Asian_alone / Total_population::double precision AS Asian_alone_percent, Native_Hawaiian_and_Other_Pacific_Islander_alone, Native_Hawaiian_and_Other_Pacific_Islander_alone / Total_population::double precision AS Native_Hawaiian_and_Other_Pacific_Islander_alone_percent, Some_Other_Race_alone, Some_Other_Race_alone / Total_population::double precision AS Some_Other_Race_alone_percent, Two_or_More_Races, Two_or_More_Races / Total_population::double precision AS Two_or_More_Races_percent from c10p3 where Total_population::double precision > 0 ;

create table c00p7proj2010_percent as select

geoid10, Total_population, Total_population / Total_population AS Total_population_percent, White_alone, White_alone / Total_population AS White_alone_percent, Black_or_African_American_alone, Black_or_African_American_alone / Total_population AS Black_or_African_American_alone_percent, American_Indian_and_Alaska_Native_alone, American_Indian_and_Alaska_Native_alone / Total_population AS American_Indian_and_Alaska_Native_alone_percent, Asian_alone, Asian_alone / Total_population AS Asian_alone_percent, Native_Hawaiian_and_Other_Pacific_Islander_alone, Native_Hawaiian_and_Other_Pacific_Islander_alone / Total_population AS Native_Hawaiian_and_Other_Pacific_Islander_alone_percent, Some_Other_Race_alone, Some_Other_Race_alone / Total_population AS Some_Other_Race_alone_percent, Two_or_More_Races, Two_or_More_Races / Total_population AS Two_or_More_Races_percent from c00p7proj2010 where Total_population > 0;

Finally, we create a table that contains the absolute differences in population along with the difference in percentage (as percentage points, i.e. not a ratio).



create table diff_00p10_10_00p7_10p3 as select

geoid10, c10p3.Total_population AS c10p3_Total_population, c00p7.Total_population AS c00p7_Total_population, c10p3.Total_population-c00p7.Total_population AS diff_Total_population, c10p3.Total_population_percent AS c10p3_Total_population_percent, c00p7.Total_population_percent AS c00p7_Total_population_percent, c10p3.Total_population_percent-c00p7.Total_population_percent AS diff_Total_population_percent, c10p3.White_alone AS c10p3_White_alone, c00p7.White_alone AS c00p7_White_alone, c10p3.White_alone-c00p7.White_alone AS diff_White_alone, c10p3.White_alone_percent AS c10p3_White_alone_percent, c00p7.White_alone_percent AS c00p7_White_alone_percent, c10p3.White_alone_percent-c00p7.White_alone_percent AS diff_White_alone_percent, c10p3.Black_or_African_American_alone AS c10p3_Black_or_African_American_alone, c00p7.Black_or_African_American_alone AS c00p7_Black_or_African_American_alone, c10p3.Black_or_African_American_alone-c00p7.Black_or_African_American_alone AS diff_Black_or_African_American_alone, c10p3.Black_or_African_American_alone_percent AS c10p3_Black_or_African_American_alone_percent, c00p7.Black_or_African_American_alone_percent AS c00p7_Black_or_African_American_alone_percent, c10p3.Black_or_African_American_alone_percent-c00p7.Black_or_African_American_alone_percent AS diff_Black_or_African_American_alone_percent, c10p3.American_Indian_and_Alaska_Native_alone AS c10p3_American_Indian_and_Alaska_Native_alone, c00p7.American_Indian_and_Alaska_Native_alone AS c00p7_American_Indian_and_Alaska_Native_alone, c10p3.American_Indian_and_Alaska_Native_alone-c00p7.American_Indian_and_Alaska_Native_alone AS diff_American_Indian_and_Alaska_Native_alone, c10p3.American_Indian_and_Alaska_Native_alone_percent AS c10p3_American_Indian_and_Alaska_Native_alone_percent, c00p7.American_Indian_and_Alaska_Native_alone_percent AS c00p7_American_Indian_and_Alaska_Native_alone_percent, c10p3.American_Indian_and_Alaska_Native_alone_percent-c00p7.American_Indian_and_Alaska_Native_alone_percent AS diff_American_Indian_and_Alaska_Native_alone_percent, c10p3.Asian_alone AS c10p3_Asian_alone, c00p7.Asian_alone AS c00p7_Asian_alone, c10p3.Asian_alone-c00p7.Asian_alone AS diff_Asian_alone, c10p3.Asian_alone_percent AS c10p3_Asian_alone_percent, c00p7.Asian_alone_percent AS c00p7_Asian_alone_percent, c10p3.Asian_alone_percent-c00p7.Asian_alone_percent AS diff_Asian_alone_percent, c10p3.Native_Hawaiian_and_Other_Pacific_Islander_alone AS c10p3_Native_Hawaiian_and_Other_Pacific_Islander_alone, c00p7.Native_Hawaiian_and_Other_Pacific_Islander_alone AS c00p7_Native_Hawaiian_and_Other_Pacific_Islander_alone, c10p3.Native_Hawaiian_and_Other_Pacific_Islander_alone-c00p7.Native_Hawaiian_and_Other_Pacific_Islander_alone AS diff_Native_Hawaiian_and_Other_Pacific_Islander_alone, c10p3.Native_Hawaiian_and_Other_Pacific_Islander_alone_percent AS c10p3_Native_Hawaiian_and_Other_Pacific_Islander_alone_percent, c00p7.Native_Hawaiian_and_Other_Pacific_Islander_alone_percent AS c00p7_Native_Hawaiian_and_Other_Pacific_Islander_alone_percent, c10p3.Native_Hawaiian_and_Other_Pacific_Islander_alone_percent-c00p7.Native_Hawaiian_and_Other_Pacific_Islander_alone_percent AS diff_Native_Hawaiian_and_Other_Pacific_Islander_alone_percent, c10p3.Some_Other_Race_alone AS c10p3_Some_Other_Race_alone, c00p7.Some_Other_Race_alone AS c00p7_Some_Other_Race_alone, c10p3.Some_Other_Race_alone-c00p7.Some_Other_Race_alone AS diff_Some_Other_Race_alone, c10p3.Some_Other_Race_alone_percent AS c10p3_Some_Other_Race_alone_percent, 
c00p7.Some_Other_Race_alone_percent AS c00p7_Some_Other_Race_alone_percent, c10p3.Some_Other_Race_alone_percent-c00p7.Some_Other_Race_alone_percent AS diff_Some_Other_Race_alone_percent, c10p3.Two_or_More_Races AS c10p3_Two_or_More_Races, c00p7.Two_or_More_Races AS c00p7_Two_or_More_Races, c10p3.Two_or_More_Races-c00p7.Two_or_More_Races AS diff_Two_or_More_Races from c10p3_percent as c10p3 inner join c00p7proj2010_percent as c00p7 using (geoid10);

In order to display the values on a map, we need to join the values with the Census 2010 Tiger geometries.

create or replace view blk_00p10_10_00p7_10p3_geom as select * from diff_00p10_10_00p7_10p3 inner join tiger2010.tabblock using (geoid10);
create or replace view tract_00p10_10_00p7_10p3_geom as select * from diff_00p10_10_00p7_10p3 inner join tiger2010.tract using (geoid10);
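Before pointing QGIS (or your map renderer of choice) at these views, a quick sanity check that the join produced sensible numbers; the column names are the ones created above:

select geoid10, diff_White_alone_percent, diff_Black_or_African_American_alone_percent from tract_00p10_10_00p7_10p3_geom limit 10;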

2016 New Years Resolutions

Date: 2016-01-25

Tags: resolutions time-management

Starting December 31st and running to the 11th of January, I spent some time learning Emacs’s Org mode and thinking through this year. What would I like to accomplish? More importantly, though, what do I have time to accomplish?

Understanding PostgreSQL's EXPLAIN output and Query Planner

Date: 2015-09-07

Tags: postgres development

We’ll begin our exploration of the EXPLAIN (more information from the docs) clause and query planner by downloading the MovieLens 20 million ratings file. The README.txt file contained in the zip has information we’ll use to build our tests. (Note: The zip is 132 MB. I wanted a large dataset for reasons we’ll get into.)

(Note: While the EXPLAIN output and syntax shouldn’t change drastically, I would note that all the output comes from PostgreSQL 9.3.9.)


jim@lilith /monongahela/movielens% wget -x http://files.grouplens.org/datasets/movielens/ml-20m.zip
jim@lilith /monongahela/movielens% wget -x http://files.grouplens.org/datasets/movielens/ml-20m.zip.md5
jim@lilith /monongahela/movielens/files.grouplens.org/datasets/movielens% [ "`md5sum ml-20m.zip | awk '{print $1}'`" = "`cat ml-20m.zip.md5 | awk '{print $4}'`" ] && echo "OK"
jim@lilith /monongahela/movielens/files.grouplens.org/datasets/movielens% unzip ml-20m.zip

Then we’ll create the database and the movies table. (We’ll get to the other tables later.) We’ll then import the csv file directly. Lastly, we’ll turn the list of genres each movie has, stored as a pipe-delimited string, into a text array field.


jim@lilith /monongahela/movielens% createdb planner_example
jim@lilith /monongahela/movielens% psql planner_example
[local]:5432 jim@planner_example=# create table movies (
  movie_id integer, 
  title text, 
  genres_text text, 
  genres text[]);
[local]:5432 jim@planner_example=# copy movies (movie_id, title, genres_text) from '/monongahela/movielens/files.grouplens.org/datasets/movielens/ml-20m/movies.csv' with (format csv, header);
[local]:5432 jim@planner_example=# update movies set genres = regexp_split_to_array(genres_text, '\|');
[local]:5432 jim@planner_example=# alter table movies drop genres_text;

To start, we will do a simple selection by the movie_id. (Note: We haven’t yet created an index, let alone a primary key.) Then we’ll want to look at the EXPLAIN output.


[local]:5432 jim@planner_example=# select * from movies where movie_id = 6645;
 movie_id |      title      |             genres              
----------+-----------------+---------------------------------
     6645 | THX 1138 (1971) | {Action,Adventure,Drama,Sci-Fi}
(1 row)
Time: 7.989 ms
[local]:5432 jim@planner_example=# explain select * from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=1 width=77)
   Filter: (movie_id = 6645)
(2 rows)
Time: 0.493 ms

The first thing to notice about the EXPLAIN output is that it’s named “Query Plan”; the second is that it has two lines, the second being indented. The default output1 indents the additional steps needed to perform the step above them. Here, we are told that this query will do a “Seq Scan”, a.k.a. Sequential Scan, meaning that it will scan every row in the table in sequence. The indented row tells us that the results of the Seq Scan will be filtered, and what the filter is.

The Seq Scan row has some additional information attached to it: cost, rows, and width. Let’s take these backwards, as cost is the most complex to describe. Width is the approximate average size in bytes of the returned rows.


[local]:5432 jim@planner_example=# explain select title from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=1 width=28)
   Filter: (movie_id = 6645)
(2 rows)
Time: 0.587 ms
[local]:5432 jim@planner_example=# explain select genres from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=1 width=45)
   Filter: (movie_id = 6645)
(2 rows)
Time: 0.521 ms
[local]:5432 jim@planner_example=# explain select movie_id, title from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=1 width=32)
   Filter: (movie_id = 6645)
(2 rows)
Time: 0.568 ms

The integer movie_id is 4 bytes wide, which isn’t entirely unexpected. The title is a text field, which can be arbitrarily long; we see that the table statistics kept by PostgreSQL have it listed as 28 bytes long on average.


[local]:5432 jim@planner_example=# select avg(length(title)) from movies ;
         avg         
---------------------
 27.8817361976684508
(1 row)
Time: 23.780 ms

Those statistics aren’t half bad! The final query, which selects only the movie_id and title, has a width of 32, which not-so-surprisingly is 4+28. The width in our original query was 77, which again is 4+28+45.

Rows is a guess at how many rows will be returned. PostgreSQL keeps statistics on the total number of rows and the cardinality of the values in each column, along with other information, to estimate how many rows will be returned. This is important because part of cost, which we cover next, is how much disk IO will be performed: returning the majority of the table has different IO needs than returning a handful of rows.
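If you want to see the statistics the planner is basing these guesses on, they live in the system catalogs; for example:

[local]:5432 jim@planner_example=# select reltuples, relpages from pg_class where relname = 'movies';
[local]:5432 jim@planner_example=# select attname, avg_width, n_distinct from pg_stats where tablename = 'movies';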

Cost is made up of two values: the start-up cost and the approximate cost to complete the step. Cost is not a direct measure of time, but it should be proportional to it. The configuration constants used by the planner are as follows:

Parameter             Description                                  Default  Relative cost vs seq_page_cost
random_page_cost      Non-sequential disk page fetch.              4.00     4x slower
seq_page_cost         Sequential disk page fetch.                  1.00     (baseline)
cpu_tuple_cost        Processing each tuple.                       0.01     100x faster
cpu_index_tuple_cost  Processing each entry during an index scan.  0.005    200x faster
cpu_operator_cost     Processing each operator or function call.   0.0025   400x faster
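These are ordinary settings, so you can check what your server is actually using:

[local]:5432 jim@planner_example=# select name, setting from pg_settings where name in ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost', 'cpu_index_tuple_cost', 'cpu_operator_cost');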

Some vocabulary:

  • Tuple is a row of information.
  • Disk page is a block of information that can be read from the disk in a single operation; this may contain multiple tuples.

Each operation that the Query Planner performs is broken into some multiple of each of these operations. As mentioned earlier, the planner attempts to estimate the number of rows returned. This information is partly used to predict whether sequential or nonsequential reads will be required. If it becomes cheaper to read the entire table from disk than to read a lot of random table pages, the planner may opt to do so, manifesting in a Seq Scan as opposed to an Index Scan, for example.

These values, taken from many years of experience, can be tweaked in postgresql.conf, though I wouldn’t recommend that unless you are very sure of what you’re doing. For instance, an SSD may have a random-access cost similar to a sequential read.

Additionally, this is why I opted to do this tutorial on such a large example set. When the table size is small, it is often faster to Seq Scan than it is to read the index and then read the tuples required. When the planner thinks it’ll be reading the majority of the rows anyway, it may forgo indices, even in very large tables, for the sake of decreasing the total amount of disk IO.
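To make the 1026.97 estimate above concrete: a Seq Scan with a filter costs roughly pages × seq_page_cost plus rows × (cpu_tuple_cost + cpu_operator_cost). The page and row counts below are made up purely to illustrate the arithmetic, not values pulled from my table:

686 pages × 1.0 + 27,278 rows × (0.01 + 0.0025) ≈ 686 + 341 ≈ 1027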

Since movie_id is supposed to be the primary key, let’s create a unique index and set it as the primary key.


[local]:5432 jim@planner_example=# create unique index on movies (movie_id);
[local]:5432 jim@planner_example=# alter table movies add primary key using index movies_movie_id_idx ;
[local]:5432 jim@planner_example=# \d movies
     Table "public.movies"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 movie_id | integer | not null
 title    | text    | 
 genres   | text[]  | 
Indexes:
    "movies_movie_id_idx" PRIMARY KEY, btree (movie_id)

Now, let’s rerun the query and see what’s changed.


[local]:5432 jim@planner_example=# select * from movies where movie_id = 6645;
 movie_id |      title      |             genres              
----------+-----------------+---------------------------------
     6645 | THX 1138 (1971) | {Action,Adventure,Drama,Sci-Fi}
(1 row)
Time: 0.737 ms

The first thing we should notice is that the query ran 10x faster! Let’s look at the explain.


[local]:5432 jim@planner_example=# explain select * from movies where movie_id = 6645;
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Index Scan using movies_movie_id_idx on movies  (cost=0.29..8.30 rows=1 width=77)
   Index Cond: (movie_id = 6645)
(2 rows)
Time: 0.551 ms

We should notice that the Seq Scan from before has been transformed into an Index Scan, which additionally tells us the index being used and which condition is being applied to it (the indented Index Cond line). The index scan lets us skip reading most of the table's pages from disk and go directly to the one that will fulfill our query. This is evident in the cost: from over 1000 down to about 8.

Another use of EXPLAIN is EXPLAIN ANALYZE, which actually runs the query and compares the actual values to the predicted ones.


[local]:5432 jim@planner_example=# explain analyze select * from movies where movie_id = 6645;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using movies_movie_id_idx on movies  (cost=0.29..8.30 rows=1 width=77) (actual time=0.023..0.025 rows=1 loops=1)
   Index Cond: (movie_id = 6645)
 Total runtime: 0.057 ms
(3 rows)
Time: 0.561 ms

Again, cost and actual time are in different units and will never be identical: cost is unitless and actual time is in ms. There is a rows value, which corresponds to the predicted one. A hint here: if the predicted number of rows is vastly different from the actual, a VACUUM ANALYZE may need to be run to update the table and index statistics. The predicted number of rows won't often match the actual rows exactly, but it should be close. There is an additional value in the actual parameters: loops. This indicates how many times this part of the plan was run. We'll come back to this value.
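If the estimated rows ever drift badly from the actual ones, refreshing the statistics is straightforward:

[local]:5432 jim@planner_example=# vacuum analyze movies;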

Now, let’s look at searching on the title.


[local]:5432 jim@planner_example=# select * from movies where title like '%THX%';
 movie_id |                  title                   |             genres              
----------+------------------------------------------+---------------------------------
     6645 | THX 1138 (1971)                          | {Action,Adventure,Drama,Sci-Fi}
   116157 | Electronic Labyrinth THX 1138 4EB (1967) | {Sci-Fi}
(2 rows)
Time: 14.442 ms

We see both the full-length motion picture and the short film that inspired it. Let’s see what the planner did to perform the search.


[local]:5432 jim@planner_example=# explain select * from movies where title like '%THX%';
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=3 width=77)
   Filter: (title ~~ '%THX%'::text)
(2 rows)
Time: 0.580 ms

PostgreSQL does a Seq Scan, which isn’t surprising considering we haven’t created an index on the title field. Let’s create an index and see what happens.


[local]:5432 jim@planner_example=# create index on movies (title);
CREATE INDEX
Time: 205.577 ms
[local]:5432 jim@planner_example=# \d movies
     Table "public.movies"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 movie_id | integer | not null
 title    | text    | 
 genres   | text[]  | 
Indexes:
    "movies_movie_id_idx" PRIMARY KEY, btree (movie_id)
    "movies_title_idx" btree (title)
[local]:5432 jim@planner_example=# select * from movies where title like '%THX%';
 movie_id |                  title                   |             genres              
----------+------------------------------------------+---------------------------------
     6645 | THX 1138 (1971)                          | {Action,Adventure,Drama,Sci-Fi}
   116157 | Electronic Labyrinth THX 1138 4EB (1967) | {Sci-Fi}
(2 rows)
Time: 14.207 ms
[local]:5432 jim@planner_example=# explain select * from movies where title like '%THX%';
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on movies  (cost=0.00..1026.97 rows=3 width=77)
   Filter: (title ~~ '%THX%'::text)
(2 rows)
Time: 0.738 ms

It doesn’t use the index! What!

Taking a look at the table description, we see that movies_title_idx is a btree index. These indices are fast to use and update, but can only handle equality and inequality comparisons (<, <=, =, >=, and >), not set containment or wild-card searches. Luckily, there is pg_trgm (trgm = trigram), which creates an index based on substrings up to 3 characters long: Widdershins is broken into ‘e’, ‘n’, ‘W’, ‘r’, ‘h’, ‘i’, ‘d’, ‘s’, ‘ns’, ‘er’, ‘Wi’, ‘in’, ‘de’, ‘dd’, ‘rs’, ‘hi’, ‘id’, ‘sh’, ‘dde’, ‘ers’, ‘shi’, ‘der’, ‘ins’, ‘hin’, ‘rsh’, ‘idd’, and ‘Wid’.


[local]:5432 jim@planner_example=# create extension pg_trgm;
[local]:5432 jim@planner_example=# create index on movies using gist (title gist_trgm_ops);
[local]:5432 jim@planner_example=# \d movies
     Table "public.movies"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 movie_id | integer | not null
 title    | text    | 
 genres   | text[]  | 
Indexes:
    "movies_movie_id_idx" PRIMARY KEY, btree (movie_id)
    "movies_title_idx" btree (title)
    "movies_title_idx1" gist (title gist_trgm_ops)
[local]:5432 jim@planner_example=# drop index movies_title_idx;
[local]:5432 jim@planner_example=# select * from movies where title like '%THX%';
 movie_id |                  title                   |             genres              
----------+------------------------------------------+---------------------------------
     6645 | THX 1138 (1971)                          | {Action,Adventure,Drama,Sci-Fi}
   116157 | Electronic Labyrinth THX 1138 4EB (1967) | {Sci-Fi}
(2 rows)
Time: 10.321 ms
[local]:5432 jim@planner_example=# explain select * from movies where title like '%THX%';
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Bitmap Heap Scan on movies  (cost=4.30..15.74 rows=3 width=77)
   Recheck Cond: (title ~~ '%THX%'::text)
   ->  Bitmap Index Scan on movies_title_idx1  (cost=0.00..4.30 rows=3 width=0)
         Index Cond: (title ~~ '%THX%'::text)
(4 rows)
Time: 0.648 ms

There is a lot going on there. pg_trgm is a built-in extension, meaning that, like hstore, it needs to be enabled with CREATE EXTENSION but that no third-party software need be installed. Next, we’re using a GiST index. While I’ll leave an in-depth examination of the types of indices for another post, I will say that b-trees are tree-like structures that keep data in sorted order and are fast to update and use; GiST indices can hold arbitrary, multi-dimensional data, which makes them more flexible, a little slower to use than b-trees, and much slower to update. (Note: GiST isn’t an index type itself, actually. It’s an interface that an index can present, and therefore may have different performance characteristics for different types. However, all of them should offer the same kinds of operations that can be accelerated by the index.)
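If you’re curious what pg_trgm actually extracts for a given string, the extension ships a helper you can call once it’s enabled (it normalizes the input, so its output won’t match my hand-written list above exactly):

[local]:5432 jim@planner_example=# select show_trgm('Widdershins');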

We see our query runs a little faster, but not significantly so. Let’s take a look at what steps go into performing our query. First, we see that a “Bitmap Heap Scan” is being done. Looking beneath it, at the ->, we see that to do the Bitmap Heap Scan we need to do a Bitmap Index Scan, which traverses the index and ascertains whether a disk page contains a tuple matching the Index Cond; additionally, we are told which index and which condition are being used. The Recheck Cond done by the Bitmap Heap Scan is the condition used to filter the correct tuples from the disk page (remember: the Bitmap Index Scan only tells us a tuple is on the page, not where!).

———-

1 PostgreSQL has multiple output formats for EXPLAIN, some of which are suitable for consumption by other tools.


[local]:5432 jim@planner_example=# explain (format xml) select * from movies where movie_id = 6645;
                        QUERY PLAN                        
----------------------------------------------------------
 <explain xmlns="http://www.postgresql.org/2009/explain">+
   <Query>                                               +
     <Plan>                                              +
       <Node-Type>Index Scan</Node-Type>                 +
       <Scan-Direction>Forward</Scan-Direction>          +
       <Index-Name>movies_movie_id_idx</Index-Name>      +
       <Relation-Name>movies</Relation-Name>             +
       <Alias>movies</Alias>                             +
       <Startup-Cost>0.29</Startup-Cost>                 +
       <Total-Cost>8.30</Total-Cost>                     +
       <Plan-Rows>1</Plan-Rows>                          +
       <Plan-Width>77</Plan-Width>                       +
       <Index-Cond>(movie_id = 6645)</Index-Cond>        +
     </Plan>                                             +
   </Query>                                              +
 </explain>
(1 row)
Time: 2.919 ms
[local]:5432 jim@planner_example=# explain (format json) select * from movies where movie_id = 6645;
                 QUERY PLAN                 
--------------------------------------------
 [                                         +
   {                                       +
     "Plan": {                             +
       "Node Type": "Index Scan",          +
       "Scan Direction": "Forward",        +
       "Index Name": "movies_movie_id_idx",+
       "Relation Name": "movies",          +
       "Alias": "movies",                  +
       "Startup Cost": 0.29,               +
       "Total Cost": 8.30,                 +
       "Plan Rows": 1,                     +
       "Plan Width": 77,                   +
       "Index Cond": "(movie_id = 6645)"   +
     }                                     +
   }                                       +
 ]
(1 row)
Time: 0.601 ms
[local]:5432 jim@planner_example=# explain (format yaml) select * from movies where movie_id = 6645;
              QUERY PLAN               
---------------------------------------
 - Plan:                              +
     Node Type: "Index Scan"          +
     Scan Direction: "Forward"        +
     Index Name: "movies_movie_id_idx"+
     Relation Name: "movies"          +
     Alias: "movies"                  +
     Startup Cost: 0.29               +
     Total Cost: 8.30                 +
     Plan Rows: 1                     +
     Plan Width: 77                   +
     Index Cond: "(movie_id = 6645)"
(1 row)
Time: 0.537 ms

(The +’s at the end of each line represent a newline and would not be returned when the plan is retrieved programmatically, such as through your favorite programming language’s PostgreSQL library.)

These formats allow for nesting of the query plan as well. (Ed Note: Replace these with a better example when we get to that point).

Eddie Bauer Langley Set

Date: 2015-08-21

Tags: furniture

I sent this to Target and Dorel Living because of the experience I’ve had with their dresser.

My wife and I received a Langley Dresser and Crib in expectation of our first child.

The assembly of the crib was made simple, sturdy, and quick by the length of the bolts, and metal counter-screws already in the crib frame. I was quite glad it was not a lock-and-cam system that plagues cheaper items, as they always seem to wobble after a while.

The dresser, on the other hand, has been a complete nightmare. The first step has cross beams being mounted to the middle section of the dresser with lock-and-cam fasteners, whose screw-locks never stabilize and just continue to turn as I attempt to tighten them, and whose cams never lock as well as an actual screw would; or worse, the cross beams are simply sitting on a peg, in a hole too large to prevent movement.

Moving on to the second step, and praying that everything from the first step doesn’t just fall apart because nothing is adequately fastened (or fastened at all!) for what should be the core of the dresser and the nucleus that the rest of the piece is assembled around, I found my prayers were not answered: in attempting to attach one of the 5 crossbeams, the core shifted, pulling out the plastic anchors I had already screwed and cammed into place.

Especially after a simple, quick, and sturdy assembly process with the crib, I expected the same simple, quick, and sturdy process with the dresser, and was greatly surprised and very displeased at the shoddy and inadequate fastening system provided with the dresser. I expect as much from sub-$100 pieces of furniture, but not from something that’s supposed to be of better quality; otherwise why would I have bought it?

Beyond the shoddy and inadequate fastening system for a dresser with a long expected lifetime, I am beyond worried that a part designed to hold a screw-lock in position could be pulled out with so little force, that is, by the middle section twisting because nothing is tightly fastened to it. I will more than likely end up using wood glue to secure the part that detached and any others that continue to detach. Moreover, I am also planning on purchasing larger screws that I can use to fasten the dresser properly.

While the crib is exactly what I expected, the dresser was worlds apart. I cannot in good conscience recommend this set to anyone.

UPDATE (2015-08-23):

hello James,

I’m sorry to hear about the trouble you had with the crib you bought at Target. I know how frustrating it can be when an item doesn’t work out.

We’re always looking for ways to make the Target experience better for every guest. I’ll be sure to share your comments with our Merchandising team for further review with our buyers. In the meantime, we’ll keep working to provide you with the high quality merchandise you’ve come to expect at Target.

Thanks for writing. Hearing about your experience is important to us.

Sincerely,

Gagan

Target Guest Relations

www.target.com

Thanks for actually reading my complaint and not just skimming the first line for a product name! /s Also, I’d rather not be contacted if all I’m getting is a form letter.

Row Level Security in PostgreSQL 9.5

Date: 2015-08-21

Tags: postgres development

I am so excited about What’s new in PostgreSQL 9.5 that I wanted to try it out before its final release. Here I’ll be discussing the new Row Level Security (RLS) features in PostgreSQL 9.5. All examples here are done in postgresql-9.5 α2.

RLS is a feature that allows the contents of a table to be filtered before being returned to the user. For instance, if you have a web store and would like to allow a user to only see their own orders, traditionally we would filter with an explicit WHERE clause.

Setup:

rlstest=# create table orders (order_id serial primary key, amount integer not null check (amount > 0.0), time_placed timestamptz default current_timestamp, customer text);
CREATE TABLE
rlstest=# insert into orders (amount, customer) values (1000, 'honor'), (2175, 'nimitz');
INSERT 0 2

Honor is viewing a page that shows her orders.

rlstest=# select order_id, amount, customer from orders where customer = 'honor';
 order_id | amount | customer 
----------+--------+----------
        1 |   1000 | honor      
(1 row)

This is all fine and dandy, until someone, somewhere, inadvertently or maliciously leaves off or makes ineffective the access control clause:

rlstest=# select order_id, amount, customer from orders;
 order_id | amount | customer 
----------+--------+----------
        1 |   1000 | honor      
        2 |   2175 | nimitz     
(2 rows)

Whoops!

How can RLS help prevent issues such as these? The database can be told to allow different roles to see only certain rows, defined once, not for every query. This differs from the normal GRANTs you could give roles because RLS policies filter the rows, rather than simply allowing or disallowing operations on a given table as a whole.

Set up: Let’s create the honor and nimitz roles. Note, this does mean that each of your application users would need a corresponding user in the database.

rlstest=# create role honor;
CREATE ROLE
rlstest=# create role nimitz;
CREATE ROLE

Now let’s enable RLS on our orders table and create a read-only policy that applies when you’re the customer on the order.

rlstest=# alter table orders enable row level security;
ALTER TABLE
rlstest=# create policy orders_reader on orders for select using (customer = current_user);
CREATE POLICY

Now, let’s switch roles to honor and look at our order!

rlstest=> select order_id, amount, customer, time_placed from orders;
ERROR:  permission denied for relation orders

Oh No!

RLS does not supersede the normal GRANT allowances. If the role can’t see a table at all, then an RLS policy can’t change that! Set role back to *YOUR USER* and GRANT select on orders. Now try again.

rlstest=# grant select on orders to honor;
GRANT
rlstest=# grant select on orders to nimitz;
GRANT
rlstest=# set role honor;
SET
rlstest=> select order_id, amount, customer, time_placed from orders;
 order_id | amount | customer |         time_placed          
 ----------+--------+----------+------------------------------
         1 |   1000 | honor    | 2015-08-21 14:05:13.03572-04
         (1 row)

Success!

rlstest=# set role nimitz;
SET
rlstest=> select order_id, amount, customer, time_placed from orders;
 order_id | amount | customer |         time_placed          
 ----------+--------+----------+------------------------------
         2 |   2175 | nimitz   | 2015-08-21 14:05:13.03572-04
         (1 row)

Yes!

Now, being good database developers, we normalize our databases and wouldn’t store the customer’s username in the order. So, let’s create a customers table and update our orders table. (Remember to switch back to your role!)

rlstest=# create table customers (customer_id serial primary key, username text not null unique);  -- You could imagine a real table having more fields for email, phone, address, &c
CREATE TABLE
rlstest=# insert into customers (username) values ('honor'), ('nimitz');
INSERT 0 2
rlstest=# grant select on customers to honor;
GRANT
rlstest=# grant select on customers to nimitz;
GRANT
rlstest=# alter table orders add column customer_id integer references customers;
ALTER TABLE
rlstest=# update orders set customer_id = (select customer_id from customers where username = orders.customer);
UPDATE 2
rlstest=# alter table orders alter column customer_id set not null; -- We couldn't do this when we created the column, because on creation the column has the default (NULL in this case) value
ALTER TABLE
rlstest=# alter table orders drop column customer;
ERROR:  cannot drop table orders column customer because other objects depend on it
DETAIL:  policy orders_reader on table orders depends on table orders column customer
HINT:  Use DROP ... CASCADE to drop the dependent objects too.


Good, we can’t drop the column because there is a policy on it!

rlstest=# alter table orders drop column customer cascade;
NOTICE:  drop cascades to policy orders_reader on table orders
ALTER TABLE

Wait! Won’t everyone be able to see everything!

rlstest=# set role honor;
SET
rlstest=> select * from orders;
 order_id | amount | time_placed | customer_id 
 ----------+--------+-------------+-------------
 (0 rows)

OK, well, still breaks things, but things breaking by not leaking information is a good way to break! Go back to your role.

So, what we want to do is allow people to see orders if their username is the current role and the orders.customer_id matches that customer’s customers.customer_id. If we were writing a straight SQL query, we’d do something like:

select * from orders inner join customers using (customer_id) where username = current_user;

However, we need to do this row by row, not as a set.

rlstest=# create policy orders_reader on orders for select using ( (select true from customers where username = current_user and customers.customer_id = orders.customer_id) );
CREATE POLICY
rlstest=# set role honor;
SET
rlstest=> select * from orders;
 order_id | amount |         time_placed          | customer_id 
 ----------+--------+------------------------------+-------------
         1 |   1000 | 2015-08-21 14:05:13.03572-04 |           1
         (1 row)
rlstest=> set role nimitz;
SET
rlstest=> select * from orders;
 order_id | amount |         time_placed          | customer_id 
 ----------+--------+------------------------------+-------------
         2 |   2175 | 2015-08-21 14:05:13.03572-04 |           2
         (1 row)

But wait! Now everyone can see all the customers!

rlstest=# set role nimitz;
SET
rlstest=> select * from customers;
 customer_id | username 
-------------+----------
           1 | honor
           2 | nimitz
(2 rows)

Well, if we have phone numbers and other PII in there, that’s not good! Now back to your user and enable RLS for customers!

rlstest=# alter table customers enable row level security;
ALTER TABLE

Remember how we couldn’t see any orders before we created policies? The same is now true for customers.

rlstest=> select * from customers;
 customer_id | username 
-------------+----------
(0 rows)

The same holds here. Now, since the policy for orders depends on the customers table, will it affect how we see orders?

rlstest=> select * from orders;
 order_id | amount | time_placed | customer_id 
----------+--------+-------------+-------------
(0 rows)

Yes! Yes it does! Since the policy (or lack thereof) won’t let us see any customers, none of the customers match the criteria for the orders policy!

Go back to your user and let’s create a read-only policy for customers.

rlstest=# create policy customer_reader on customers for select using (username = current_user);
CREATE POLICY

Now, let’s try this again

rlstest=# set role nimitz;
SET
rlstest=> select * from customers;
 customer_id | username 
-------------+----------
           2 | nimitz
(1 row)
rlstest=> select * from orders;
 order_id | amount |         time_placed          | customer_id 
----------+--------+------------------------------+-------------
        2 |   2175 | 2015-08-21 14:05:13.03572-04 |           2
(1 row)

Awesome! It works again!

But wait! In the policy for orders, we are explicitly limiting the customers based on username, and we do that in the policy for customers too, so why do the work twice? Additionally, what if we ever want to change the rules for who can see customers (maybe we’ll add customers who can manage other customers, like a business account)? As your role:

rlstest=# drop policy orders_reader on orders;
DROP POLICY
rlstest=# create policy orders_reader on orders for select using ( (select true from customers where customers.customer_id = orders.customer_id) );
CREATE POLICY

Let’s test it out!

rlstest=> set role honor;
SET
rlstest=> select * from customers;
 customer_id | username 
-------------+----------
           1 | honor
(1 row)
rlstest=> select * from orders;
 order_id | amount |         time_placed          | customer_id 
----------+--------+------------------------------+-------------
        1 |   1000 | 2015-08-21 14:05:13.03572-04 |           1
(1 row)
rlstest=> set role nimitz;
SET
rlstest=> select * from customers;
 customer_id | username 
-------------+----------
           2 | nimitz
(1 row)
rlstest=> select * from orders;
 order_id | amount |         time_placed          | customer_id 
----------+--------+------------------------------+-------------
        2 |   2175 | 2015-08-21 14:05:13.03572-04 |           2
(1 row)

Awesome! Now we are handling the majority of our access control directly in the database, and what’s more, we’re controlling it granularly per table without having to cross-filter from one table to another, letting us work on the access control without leaking permissions and without repeating ourselves (“Keep it DRY”).

There is a caveat to all of this, though: views. Views run as the user who created them, not the user who executes them. As such, if your table owner were to create a view, the view would have the RLS permissions of the table owner, namely, all permissions. As your role:

rlstest=# create view customer_view as select customer_id, username from customers;
CREATE VIEW
rlstest=# drop policy orders_reader on orders;
DROP POLICY
rlstest=# create policy orders_reader on orders for select using ( (select true from customer_view where customer_view.customer_id = orders.customer_id) );
CREATE POLICY
rlstest=# set role honor;
SET
rlstest=> select * from customers;
 customer_id | username 
-------------+----------
           1 | honor
(1 row)
rlstest=> set role jim;
SET
rlstest=# grant select on customer_view to honor;
GRANT
rlstest=# grant select on customer_view to nimitz;
GRANT
rlstest=# set role honor;
SET
rlstest=> select * from customers;
 customer_id | username 
-------------+----------
           1 | honor
(1 row)
rlstest=> select * from customer_view;
 customer_id | username 
-------------+----------
           1 | honor
           2 | nimitz
(2 rows)
rlstest=> select * from orders;
 order_id | amount |         time_placed          | customer_id 
----------+--------+------------------------------+-------------
        1 |   1000 | 2015-08-21 14:05:13.03572-04 |           1
        2 |   2175 | 2015-08-21 14:05:13.03572-04 |           2
(2 rows)

As you can see, the view is being run as the table owner and as such allows access to all rows.

One workaround (if you really need to use a view, say for hierarchical permissions) relies on the fact that functions called from a view are executed as the caller, not the owner of the view.

rlstest=# create function can_see_customer(username text) returns boolean as $$begin
return current_user = username;
end;
$$
language 'plpgsql';
CREATE FUNCTION
rlstest=# create or replace view customer_view as (select * from customers where can_see_customer(username));
CREATE VIEW
rlstest=# set role honor;
SET
rlstest=> select * from customer_view;
 customer_id | username 
-------------+----------
           1 | honor
(1 row)
rlstest=> select * from orders;
 order_id | amount |         time_placed          | customer_id 
----------+--------+------------------------------+-------------
        1 |   1000 | 2015-08-21 14:05:13.03572-04 |           1
(1 row)

This isn’t perfect, though, because you’re duplicating the access rules (for the customers table, in this example).
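(One way to cut down on that duplication, not applied in the \d output below, would be to have the customers policy call the same function; a sketch:)

rlstest=# drop policy customer_reader on customers;
rlstest=# create policy customer_reader on customers for select using (can_see_customer(username));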

Now that we have policies working, we can look at them by describing a table in psql.

rlstest=# \d customers
                                Table "public.customers"
   Column    |  Type   |                            Modifiers                            
-------------+---------+-----------------------------------------------------------------
 customer_id | integer | not null default nextval('customers_customer_id_seq'::regclass)
 username    | text    | not null
Indexes:
    "customers_pkey" PRIMARY KEY, btree (customer_id)
    "customers_username_key" UNIQUE CONSTRAINT, btree (username)
Referenced by:
    TABLE "orders" CONSTRAINT "orders_customer_id_fkey" FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
Policies:
    POLICY "customer_reader" FOR SELECT
      USING ((username = ("current_user"())::text))
rlstest=# \d orders
                                       Table "public.orders"
   Column    |           Type           |                         Modifiers                         
-------------+--------------------------+-----------------------------------------------------------
 order_id    | integer                  | not null default nextval('orders_order_id_seq'::regclass)
 amount      | integer                  | not null
 time_placed | timestamp with time zone | default now()
 customer_id | integer                  | not null
Indexes:
    "orders_pkey" PRIMARY KEY, btree (order_id)
Check constraints:
    "orders_amount_check" CHECK (amount::numeric > 0.0)
Foreign-key constraints:
    "orders_customer_id_fkey" FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
Policies:
    POLICY "orders_reader" FOR SELECT
      USING (( SELECT true AS bool
   FROM customer_view
  WHERE (customer_view.customer_id = orders.customer_id)))

In a later article, I’ll discuss how RLS policies affect the query planner.