About Me!

This blog is about my musings and thoughts. I hope you find it useful, at most, and entertaining, at least.

Résumé [PDF]

Other Pages



Presence Elsewhere




Calculating Service Area, Part 3: Demographics

Date: 2019-01-27
Tags: postgis postgres transit census

In Part 1 we talked about the project and got some of the tools and data we need to start. A quick reminder of our motivation: to figure out the percentage area and population the county transit agency serves by using a "walkshed".

In Part 2, we defined what our walkshed is and defined how we would be linking it to the Census 2010. In this post, we'll be doing the linking

I want to point out sproke's blogpost from 2012 where I got many of these ideas many years back.


Importing the Census 2010 Data

First we need to gather the data from census.gov and initialized our database. I'm going to grab the documentation (sf1.pdf), along with some MS Access database files they use to store the schema.

wget --continue -e robots=off --force-directories --no-parent https://www.census.gov/prod/cen2010/doc/sf1.pdf
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/SF1_Access2003.mdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/SF1_Access2007.accdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/Pennsylvania/
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/SF2_MSAccess_2003.mdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/SF2_MSAccess_2007.accdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/Pennsylvania/

Some notes, --recursive will grab linked files, --continue will try to start mid-file if it's been partially downloaded. -e robots=off is naughty and tels wget to download things even if robots.txt tells you not to. (I don't think it's an issue here, I just add it all the time.) --force-directories will create a directory structure begining with the domain so "https://example.com/a/b/c.d" will be downloaded to a file "example.com/a/b/c.d" -- I find this makes keeping track of what things are and where they came from a lot easier. --no-parent tells wget not to go up the directory tree to get assets, which could cause it to begin going down sibling directories of the root we gave it.

Now, we are going import that data. The first step is to unzip the datafile. Be warned, the unziped data can swell in size. For example, the pa dataset is compressed 90%: the zip is 436MiB and uncompressed is 4.4GiB.

unzip pa2010.sf1.zip

Now that we have the files, let's build the schema. Mostly because I love using the commandline whenever possible, I decided to stream the schema from the Access sample databases through some fix-ups and then into a file.

mdb-schema SF1_Access2003.mdb |
  # MS Access uses [] as idenifier quotes, let's fix that
  tr '[]' '""' |

  # Some tables have a name suffix of `mod` because the first few have been
  # removed to keep the table under 255 fields; let's add them back in.
  sed '/mod"$/{$!{N;s/mod"\n (/"\n (\n        "FILEID"                        text,\n        "STUSAB"                        text,\n        "CHARITER"                      text,\n        "CIFSN"                  text,/g}}' |

  # Remove spaces from identifiers.
  sed -E 's/(\s+)"([^"]+) ([^"]+)"/\1\2_\3/g' |

  # Convert to PG type names
  sed -E 's/Text\s*\([0-9]+\)/text/g' |
  sed -E 's/(Long )?Integer/bigint/g' |
  sed -E 's/Double/double precision/g' |

  # Table 45 is split into 2 parts; let's merge them by deleting the trailing
  # ); of the one and the CREATE TABLE along with the first few fields of the
  # second.
  sed 's/_PT1//g' |
  sed '/);$/{$!{N;N;N;N;N;N;N;N;s/.*_PT2.*LOG[^,]*//g}}' |

  # Just remove the quotes -- they're not needed anymore
  tr -d '"' > SF1_pg.schma.sql

Then, we can send that into postgres, but I want to wrap it all in a commit so that we don't end up with a half-built database if something fails.

cat <(echo "set role census;") \
    <(echo "begin; create schema census2010; set search_path = census2010;") \
    SF1_pg.schma.sql \
    <(echo "commit;") |
  psql -v ON_ERROR_STOP=1 census

Now the fun part! We get to import the datafiles! I'm going to use the copy statement and the psql \copy meta-command. However, like above, I'm going to build up a sql file before running it so that everything can be wrapped in a transaction. Yes, this step can take a bit and if it fails towards the end it sucks, but in the end it doesn't take that long (a few minutes) and it's significantly easier than trying to scrub duplicates. (We haven't built indices yet as having them during the input slows it down considerably.

echo "begin;" > import_script
for input_file in pa000*2010.sf1; do
  table_name=SF1_`echo $input_file | sed 's/pa\(.\{5\}\)2010.sf1/\1/'`
  # Provides some feedback as `copy` will only output the number of rows imported.
  echo "\\\echo $input_file  $table_name" >> import_script
  echo "\\\copy census2010.${table_name} from '${input_file}' with (format csv, header false);\n" >> import_script
echo "commit;" >> import_script
cat import_script | psql -v ON_ERROR_STOP=1 census

Now comes the more interesting part: the geoheader (links demographic records with the geography) is a fixed-width file (i.e. fields are not delimited, but are defined as between character positions on a line). What Sophia did was to build a staging table that would contain the raw lines, and then do a "insert into select" with a select statement that teased out the fields using the substring function. We'll do a very similar thing. The fields are defined in the SF1 documentation

echo "set role census; CREATE TABLE census2010.geo_header_staging(data text);" | psql census
echo "\\\copy census2010.geo_header_staging from 'pageo2010.sf1';" | psql census

Then we'll tease apart that staging table.

INSERT INTO census2010.geo_header_sf1 (fileid, stusab, sumlev, geocomp, chariter, cifsn, logrecno, region, division, state, county, countycc, countysc, cousub, cousubcc, cousubsc, place, placecc, placesc, tract, blkgrp, block, iuc, concit, concitcc, concitsc, aianhh, aianhhfp, aianhhcc, aihhtli, aitsce, aits, aitscc, ttract, tblkgrp, anrc, anrccc, cbsa, cbsasc, metdiv, csa, necta, nectasc, nectadiv, cnecta, cbsapci, nectapci, ua, uasc, uatype, ur, cd, sldu, sldl, vtd, vtdi, reserve2, zcta5, submcd, submcdcc, sdelem, sdsec, sduni, arealand, areawatr, name, funcstat, gcuni, pop100, hu100, intptlat, intptlon, lsadc, partflag, reserve3, uga, statens, countyns, cousubns, placens, concitns, aianhhns, aitsns, anrcns, submcdns, cd113, cd114, cd115, sldu2, sldu3, sldu4, sldl2, sldl3, sldl4, aianhhsc, csasc, cnectasc, memi, nmemi, puma, reserved)
nullif(trim(substring(data,1,6)), '') AS fileid,
nullif(trim(substring(data,7,2)), '') AS stusab,
nullif(trim(substring(data,9,3)), '') AS sumlev,
nullif(trim(substring(data,12,2)), '') AS geocomp,
nullif(trim(substring(data,14,3)), '') AS chariter,
nullif(trim(substring(data,17,2)), '') AS cifsn,
nullif(trim(substring(data,19,7)), '')::bigint AS logrecno,
nullif(trim(substring(data,26,1)), '') AS region,
nullif(trim(substring(data,27,1)), '') AS division,
nullif(trim(substring(data,28,2)), '') AS state,
nullif(trim(substring(data,30,3)), '') AS county,
nullif(trim(substring(data,33,2)), '') AS countycc,
nullif(trim(substring(data,35,2)), '') AS countysc,
nullif(trim(substring(data,37,5)), '') AS cousub,
nullif(trim(substring(data,42,2)), '') AS cousubcc,
nullif(trim(substring(data,44,2)), '') AS cousubsc,
nullif(trim(substring(data,46,5)), '') AS place,
nullif(trim(substring(data,51,2)), '') AS placecc,
nullif(trim(substring(data,53,2)), '') AS placesc,
nullif(trim(substring(data,55,6)), '') AS tract,
nullif(trim(substring(data,61,1)), '') AS blkgrp,
nullif(trim(substring(data,62,4)), '') AS block,
nullif(trim(substring(data,66,2)), '') AS iuc,
nullif(trim(substring(data,68,5)), '') AS concit,
nullif(trim(substring(data,73,2)), '') AS concitcc,
nullif(trim(substring(data,75,2)), '') AS concitsc,
nullif(trim(substring(data,77,4)), '') AS aianhh,
nullif(trim(substring(data,81,5)), '') AS aianhhfp,
nullif(trim(substring(data,86,2)), '') AS aianhhcc,
nullif(trim(substring(data,88,1)), '') AS aihhtli,
nullif(trim(substring(data,89,3)), '') AS aitsce,
nullif(trim(substring(data,92,5)), '') AS aits,
nullif(trim(substring(data,97,2)), '') AS aitscc,
nullif(trim(substring(data,99,6)), '') AS ttract,
nullif(trim(substring(data,105,1)), '') AS tblkgrp,
nullif(trim(substring(data,106,5)), '') AS anrc,
nullif(trim(substring(data,111,2)), '') AS anrccc,
nullif(trim(substring(data,113,5)), '') AS cbsa,
nullif(trim(substring(data,118,2)), '') AS cbsasc,
nullif(trim(substring(data,120,5)), '') AS metdiv,
nullif(trim(substring(data,125,3)), '') AS csa,
nullif(trim(substring(data,128,5)), '') AS necta,
nullif(trim(substring(data,133,2)), '') AS nectasc,
nullif(trim(substring(data,135,5)), '') AS nectadiv,
nullif(trim(substring(data,140,3)), '') AS cnecta,
nullif(trim(substring(data,143,1)), '') AS cbsapci,
nullif(trim(substring(data,144,1)), '') AS nectapci,
nullif(trim(substring(data,145,5)), '') AS ua,
nullif(trim(substring(data,150,2)), '') AS uasc,
nullif(trim(substring(data,152,1)), '') AS uatype,
nullif(trim(substring(data,153,1)), '') AS ur,
nullif(trim(substring(data,154,2)), '') AS cd,
nullif(trim(substring(data,156,3)), '') AS sldu,
nullif(trim(substring(data,159,3)), '') AS sldl,
nullif(trim(substring(data,162,6)), '') AS vtd,
nullif(trim(substring(data,168,1)), '') AS vtdi,
nullif(trim(substring(data,169,3)), '') AS reserve2,
nullif(trim(substring(data,172,5)), '') AS zcta5,
nullif(trim(substring(data,177,5)), '') AS submcd,
nullif(trim(substring(data,182,2)), '') AS submcdcc,
nullif(trim(substring(data,184,5)), '') AS sdelem,
nullif(trim(substring(data,189,5)), '') AS sdsec,
nullif(trim(substring(data,194,5)), '') AS sduni,
nullif(trim(substring(data,199,14)), '')::double precision AS arealand,
nullif(trim(substring(data,213,14)), '')::double precision AS areawatr,
nullif(trim(substring(data,227,90)), '') AS name,
nullif(trim(substring(data,317,1)), '') AS funcstat,
nullif(trim(substring(data,318,1)), '') AS gcuni,
nullif(trim(substring(data,319,9)), '')::bigint AS pop100,
nullif(trim(substring(data,328,9)), '')::bigint AS hu100,
nullif(trim(substring(data,337,11)), '') AS intptlat,
nullif(trim(substring(data,348,12)), '') AS intptlon,
nullif(trim(substring(data,360,2)), '') AS lsadc,
nullif(trim(substring(data,362,1)), '') AS partflag,
nullif(trim(substring(data,363,6)), '') AS reserve3,
nullif(trim(substring(data,369,5)), '') AS uga,
nullif(trim(substring(data,374,8)), '') AS statens,
nullif(trim(substring(data,382,8)), '') AS countyns,
nullif(trim(substring(data,390,8)), '') AS cousubns,
nullif(trim(substring(data,398,8)), '') AS placens,
nullif(trim(substring(data,406,8)), '') AS concitns,
nullif(trim(substring(data,414,8)), '') AS aianhhns,
nullif(trim(substring(data,422,8)), '') AS aitsns,
nullif(trim(substring(data,430,8)), '') AS anrcns,
nullif(trim(substring(data,438,8)), '') AS submcdns,
nullif(trim(substring(data,446,2)), '') AS cd113,
nullif(trim(substring(data,448,2)), '') AS cd114,
nullif(trim(substring(data,450,2)), '') AS cd115,
nullif(trim(substring(data,452,3)), '') AS sldu2,
nullif(trim(substring(data,455,3)), '') AS sldu3,
nullif(trim(substring(data,458,3)), '') AS sldu4,
nullif(trim(substring(data,461,3)), '') AS sldl2,
nullif(trim(substring(data,464,3)), '') AS sldl3,
nullif(trim(substring(data,467,3)), '') AS sldl4,
nullif(trim(substring(data,470,2)), '') AS aianhhsc,
nullif(trim(substring(data,472,2)), '') AS csasc,
nullif(trim(substring(data,474,2)), '') AS cnectasc,
nullif(trim(substring(data,476,1)), '') AS memi,
nullif(trim(substring(data,477,1)), '') AS nmemi,
nullif(trim(substring(data,478,5)), '') AS puma,
nullif(trim(substring(data,483,18)), '') AS reserved
FROM census2010.geo_header_staging;

Now, let's delete that staging table.

drop table census2010.geo_header_staging;

Let's build some indices and foreign keys.

echo "begin;" > create_index
for table in `echo "\\\dt census2010.*" | psql -tA census | grep -v data | cut -f 2 -d \|`; do
  cat <<EOS >> create_index
create index on census2010.${table}(fileid);
create index on census2010.${table}(stusab);
create index on census2010.${table}(chariter);
create index on census2010.${table}(cifsn);
create unique index on census2010.${table}(logrecno);

alter table census2010.${table} alter column fileid set not null;
alter table census2010.${table} alter column stusab set not null;
alter table census2010.${table} alter column chariter set not null;
alter table census2010.${table} alter column cifsn set not null;
alter table census2010.${table} alter column logrecno set not null;

alter table census2010.${table} add primary key using index ${table}_logrecno_idx;

echo "commit;" >> create_index
cat create_index | psql -v ON_ERROR_STOP=1 census

create unique index on census2010.geo_header_sf1(fileid, stusab, chariter, cifsn, logrecno);

echo "begin;" > create_fk
for table in `echo "\\\dt census2010.*" | psql -tA census | grep -v data | grep -v geo | cut -f 2 -d \|`; do
  cat <<EOS >> create_fk
alter table census2010.${table} add constraint ${table}_geo_fk foreign key (logrecno) references census2010.geo_header_sf1(logrecno);
echo "commit;" >> create_fk
cat create_fk | psql -v ON_ERROR_STOP=1 census

Service Area Population

In order to speed up lookups between the TIGER data, let's create an index to use. sumlev (Summary Level) is the aggregation that the row represents, e.g. state, county, track, block group, block, &c. They're defined in the SF1 documentation. 101 is the block level.

create index on census2010.geo_header_sf1 (state, county, tract, block) where sumlev = '101';

Calculating the total population of the county can be done by simply using field p0010001 (total population) in table sf1_00001 (also defined in the SF1 documentation), but only for the rows representing the block summary level. Without filtering on the summary level the population would be counted for every summary level in the county.

select sum(p0010001) 
from census2010.sf1_00001 
join census2010.geo_header_sf1 using (logrecno)
where county = '003' 
  and sumlev = '101';

For the population in our walkshed, we will use the stops_walkshed_blocks table that we generated in Part 2. As a recap, that is every block that is adjacent to a segment of road that is within a quarter mile of a bus stop. (The reason for the group by is that stops_walkshed_blocks is broken down by stop, which obviously leads to some blocks being counted multiple times.)

select sum(p0010001)
from (
  select max(p0010001) as p0010001 
  from census2010.sf1_00001
  join census2010.geo_header_sf1 using (logrecno) 
  join tiger2017.tabblock 
    on  statefp10 = state 
    and countyfp10 = county
    and tractce10 = tract 
    and blockce10 = block
  join stops_walkshed_blocks 
    on stops_walkshed_blocks.geoid10 = tabblock.geoid10
  where county = '003' and sumlev = '101'
  group by tabblock.geoid10) x;


This yields a total population of 1,223,348 and a service area population of 775,835 which is just over 63% of the county's population.