Date: 2019-01-27
Tags: postgis postgres transit census
In Part 1 we talked about the project and got some of the tools and data we need to start. A quick reminder of our motivation: to figure out the percentage of the county's area and population that the transit agency serves, using a "walkshed".
In Part 2, we defined what our walkshed is and how we would link it to the 2010 Census. In this post, we'll do the linking.
I want to point out sproke's 2012 blogpost, which is where I got many of these ideas years back.
First we need to gather the data from census.gov and initialize our database. I'm going to grab the documentation (sf1.pdf), along with the MS Access database files they use to store the schema.
wget --continue -e robots=off --force-directories --no-parent https://www.census.gov/prod/cen2010/doc/sf1.pdf
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/SF1_Access2003.mdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/SF1_Access2007.accdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/Pennsylvania/
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/SF2_MSAccess_2003.mdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/SF2_MSAccess_2007.accdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/Pennsylvania/
Some notes on the flags: --recursive will grab linked files, and --continue will try to resume mid-file if it's been partially downloaded. -e robots=off is naughty and tells wget to download things even if robots.txt says not to (I don't think it's an issue here, I just add it all the time). --force-directories will create a directory structure beginning with the domain, so "https://example.com/a/b/c.d" will be downloaded to a file "example.com/a/b/c.d" -- I find this makes keeping track of what things are and where they came from a lot easier. --no-parent tells wget not to go up the directory tree to get assets, which could cause it to start going down sibling directories of the root we gave it.
Now we are going to import that data. The first step is to unzip the datafile. Be warned, the unzipped data swells considerably; the PA dataset is compressed about 90%, so the 436MiB zip becomes 4.4GiB on disk.
unzip pa2010.sf1.zip
Now that we have the files, let's build the schema. Mostly because I love using the commandline whenever possible, I decided to stream the schema from the Access sample databases through some fix-ups and then into a file.
mdb-schema SF1_Access2003.mdb |
# MS Access uses [] as identifier quotes, let's fix that
tr '[]' '""' |
# Some tables have a name suffix of `mod` because the first few fields have
# been removed to keep the table under 255 fields; let's add them back in.
sed '/mod"$/{$!{N;s/mod"\n (/"\n (\n "FILEID" text,\n "STUSAB" text,\n "CHARITER" text,\n "CIFSN" text,/g}}' |
# Remove spaces from identifiers.
sed -E 's/(\s+)"([^"]+) ([^"]+)"/\1\2_\3/g' |
# Convert to PG type names
sed -E 's/Text\s*\([0-9]+\)/text/g' |
sed -E 's/(Long )?Integer/bigint/g' |
sed -E 's/Double/double precision/g' |
# Table 45 is split into 2 parts; let's merge them by deleting the trailing
# ); of the first part and the CREATE TABLE along with the first few fields
# of the second.
sed 's/_PT1//g' |
sed '/);$/{$!{N;N;N;N;N;N;N;N;s/.*_PT2.*LOG[^,]*//g}}' |
# Just remove the quotes -- they're not needed anymore
tr -d '"' > SF1_pg.schma.sql
Then, we can send that into postgres, but I want to wrap it all in a transaction so that we don't end up with a half-built database if something fails.
cat <(echo "set role census;") \
<(echo "begin; create schema census2010; set search_path = census2010;") \
SF1_pg.schma.sql \
<(echo "commit;") |
psql -v ON_ERROR_STOP=1 census
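Before loading any data, it doesn't hurt to check that the tables actually landed in the new schema. A quick count (just a sanity check, not part of the pipeline) should come back as a few dozen tables:

select count(*) from information_schema.tables where table_schema = 'census2010';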
Now the fun part! We get to import the datafiles! I'm going to use the copy statement via the psql \copy meta-command. However, like above, I'm going to build up a SQL file before running it so that everything can be wrapped in a transaction. Yes, this step can take a while and it sucks if it fails towards the end, but in the end it doesn't take that long (a few minutes) and it's significantly easier than trying to scrub out duplicates. (We haven't built indices yet, as having them during the import slows it down considerably.)
echo "begin;" > import_script
for input_file in pa000*2010.sf1; do
table_name=SF1_`echo $input_file | sed 's/pa\(.\{5\}\)2010.sf1/\1/'`
# Provides some feedback as `copy` will only output the number of rows imported.
echo "\\\echo $input_file $table_name" >> import_script
echo "\\\copy census2010.${table_name} from '${input_file}' with (format csv, header false);\n" >> import_script
done
echo "commit;" >> import_script
cat import_script | psql -v ON_ERROR_STOP=1 census
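As a spot check that rows actually landed: each segment table should hold one row per logical record for the state (we'll lean on that again below when we put a unique index on logrecno). A sketch using the first segment:

select count(*) from census2010.sf1_00001;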
Now comes the more interesting part: the geoheader (which links demographic records to the geography) is a fixed-width file (i.e. fields are not delimited, but are defined by character positions on each line). What Sophia did was build a staging table containing the raw lines, and then do an "insert into ... select" with a select statement that teased the fields out using the substring function. We'll do a very similar thing. The fields are defined in the SF1 documentation.
echo "set role census; CREATE TABLE census2010.geo_header_staging(data text);" | psql census
echo "\\\copy census2010.geo_header_staging from 'pageo2010.sf1';" | psql census
Then we'll tease apart that staging table.
INSERT INTO census2010.geo_header_sf1 (fileid, stusab, sumlev, geocomp, chariter, cifsn, logrecno, region, division, state, county, countycc, countysc, cousub, cousubcc, cousubsc, place, placecc, placesc, tract, blkgrp, block, iuc, concit, concitcc, concitsc, aianhh, aianhhfp, aianhhcc, aihhtli, aitsce, aits, aitscc, ttract, tblkgrp, anrc, anrccc, cbsa, cbsasc, metdiv, csa, necta, nectasc, nectadiv, cnecta, cbsapci, nectapci, ua, uasc, uatype, ur, cd, sldu, sldl, vtd, vtdi, reserve2, zcta5, submcd, submcdcc, sdelem, sdsec, sduni, arealand, areawatr, name, funcstat, gcuni, pop100, hu100, intptlat, intptlon, lsadc, partflag, reserve3, uga, statens, countyns, cousubns, placens, concitns, aianhhns, aitsns, anrcns, submcdns, cd113, cd114, cd115, sldu2, sldu3, sldu4, sldl2, sldl3, sldl4, aianhhsc, csasc, cnectasc, memi, nmemi, puma, reserved)
SELECT
nullif(trim(substring(data,1,6)), '') AS fileid,
nullif(trim(substring(data,7,2)), '') AS stusab,
nullif(trim(substring(data,9,3)), '') AS sumlev,
nullif(trim(substring(data,12,2)), '') AS geocomp,
nullif(trim(substring(data,14,3)), '') AS chariter,
nullif(trim(substring(data,17,2)), '') AS cifsn,
nullif(trim(substring(data,19,7)), '')::bigint AS logrecno,
nullif(trim(substring(data,26,1)), '') AS region,
nullif(trim(substring(data,27,1)), '') AS division,
nullif(trim(substring(data,28,2)), '') AS state,
nullif(trim(substring(data,30,3)), '') AS county,
nullif(trim(substring(data,33,2)), '') AS countycc,
nullif(trim(substring(data,35,2)), '') AS countysc,
nullif(trim(substring(data,37,5)), '') AS cousub,
nullif(trim(substring(data,42,2)), '') AS cousubcc,
nullif(trim(substring(data,44,2)), '') AS cousubsc,
nullif(trim(substring(data,46,5)), '') AS place,
nullif(trim(substring(data,51,2)), '') AS placecc,
nullif(trim(substring(data,53,2)), '') AS placesc,
nullif(trim(substring(data,55,6)), '') AS tract,
nullif(trim(substring(data,61,1)), '') AS blkgrp,
nullif(trim(substring(data,62,4)), '') AS block,
nullif(trim(substring(data,66,2)), '') AS iuc,
nullif(trim(substring(data,68,5)), '') AS concit,
nullif(trim(substring(data,73,2)), '') AS concitcc,
nullif(trim(substring(data,75,2)), '') AS concitsc,
nullif(trim(substring(data,77,4)), '') AS aianhh,
nullif(trim(substring(data,81,5)), '') AS aianhhfp,
nullif(trim(substring(data,86,2)), '') AS aianhhcc,
nullif(trim(substring(data,88,1)), '') AS aihhtli,
nullif(trim(substring(data,89,3)), '') AS aitsce,
nullif(trim(substring(data,92,5)), '') AS aits,
nullif(trim(substring(data,97,2)), '') AS aitscc,
nullif(trim(substring(data,99,6)), '') AS ttract,
nullif(trim(substring(data,105,1)), '') AS tblkgrp,
nullif(trim(substring(data,106,5)), '') AS anrc,
nullif(trim(substring(data,111,2)), '') AS anrccc,
nullif(trim(substring(data,113,5)), '') AS cbsa,
nullif(trim(substring(data,118,2)), '') AS cbsasc,
nullif(trim(substring(data,120,5)), '') AS metdiv,
nullif(trim(substring(data,125,3)), '') AS csa,
nullif(trim(substring(data,128,5)), '') AS necta,
nullif(trim(substring(data,133,2)), '') AS nectasc,
nullif(trim(substring(data,135,5)), '') AS nectadiv,
nullif(trim(substring(data,140,3)), '') AS cnecta,
nullif(trim(substring(data,143,1)), '') AS cbsapci,
nullif(trim(substring(data,144,1)), '') AS nectapci,
nullif(trim(substring(data,145,5)), '') AS ua,
nullif(trim(substring(data,150,2)), '') AS uasc,
nullif(trim(substring(data,152,1)), '') AS uatype,
nullif(trim(substring(data,153,1)), '') AS ur,
nullif(trim(substring(data,154,2)), '') AS cd,
nullif(trim(substring(data,156,3)), '') AS sldu,
nullif(trim(substring(data,159,3)), '') AS sldl,
nullif(trim(substring(data,162,6)), '') AS vtd,
nullif(trim(substring(data,168,1)), '') AS vtdi,
nullif(trim(substring(data,169,3)), '') AS reserve2,
nullif(trim(substring(data,172,5)), '') AS zcta5,
nullif(trim(substring(data,177,5)), '') AS submcd,
nullif(trim(substring(data,182,2)), '') AS submcdcc,
nullif(trim(substring(data,184,5)), '') AS sdelem,
nullif(trim(substring(data,189,5)), '') AS sdsec,
nullif(trim(substring(data,194,5)), '') AS sduni,
nullif(trim(substring(data,199,14)), '')::double precision AS arealand,
nullif(trim(substring(data,213,14)), '')::double precision AS areawatr,
nullif(trim(substring(data,227,90)), '') AS name,
nullif(trim(substring(data,317,1)), '') AS funcstat,
nullif(trim(substring(data,318,1)), '') AS gcuni,
nullif(trim(substring(data,319,9)), '')::bigint AS pop100,
nullif(trim(substring(data,328,9)), '')::bigint AS hu100,
nullif(trim(substring(data,337,11)), '') AS intptlat,
nullif(trim(substring(data,348,12)), '') AS intptlon,
nullif(trim(substring(data,360,2)), '') AS lsadc,
nullif(trim(substring(data,362,1)), '') AS partflag,
nullif(trim(substring(data,363,6)), '') AS reserve3,
nullif(trim(substring(data,369,5)), '') AS uga,
nullif(trim(substring(data,374,8)), '') AS statens,
nullif(trim(substring(data,382,8)), '') AS countyns,
nullif(trim(substring(data,390,8)), '') AS cousubns,
nullif(trim(substring(data,398,8)), '') AS placens,
nullif(trim(substring(data,406,8)), '') AS concitns,
nullif(trim(substring(data,414,8)), '') AS aianhhns,
nullif(trim(substring(data,422,8)), '') AS aitsns,
nullif(trim(substring(data,430,8)), '') AS anrcns,
nullif(trim(substring(data,438,8)), '') AS submcdns,
nullif(trim(substring(data,446,2)), '') AS cd113,
nullif(trim(substring(data,448,2)), '') AS cd114,
nullif(trim(substring(data,450,2)), '') AS cd115,
nullif(trim(substring(data,452,3)), '') AS sldu2,
nullif(trim(substring(data,455,3)), '') AS sldu3,
nullif(trim(substring(data,458,3)), '') AS sldu4,
nullif(trim(substring(data,461,3)), '') AS sldl2,
nullif(trim(substring(data,464,3)), '') AS sldl3,
nullif(trim(substring(data,467,3)), '') AS sldl4,
nullif(trim(substring(data,470,2)), '') AS aianhhsc,
nullif(trim(substring(data,472,2)), '') AS csasc,
nullif(trim(substring(data,474,2)), '') AS cnectasc,
nullif(trim(substring(data,476,1)), '') AS memi,
nullif(trim(substring(data,477,1)), '') AS nmemi,
nullif(trim(substring(data,478,5)), '') AS puma,
nullif(trim(substring(data,483,18)), '') AS reserved
FROM census2010.geo_header_staging;
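Before we drop the staging table, a quick check that every raw line became a parsed row:

select
  (select count(*) from census2010.geo_header_staging) as raw_lines,
  (select count(*) from census2010.geo_header_sf1) as parsed_rows;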
Now, let's delete that staging table.
drop table census2010.geo_header_staging;
Let's build some indices and foreign keys.
echo "begin;" > create_index
for table in `echo "\\\dt census2010.*" | psql -tA census | grep -v data | cut -f 2 -d \|`; do
cat <<EOS >> create_index
create index on census2010.${table}(fileid);
create index on census2010.${table}(stusab);
create index on census2010.${table}(chariter);
create index on census2010.${table}(cifsn);
create unique index on census2010.${table}(logrecno);
alter table census2010.${table} alter column fileid set not null;
alter table census2010.${table} alter column stusab set not null;
alter table census2010.${table} alter column chariter set not null;
alter table census2010.${table} alter column cifsn set not null;
alter table census2010.${table} alter column logrecno set not null;
alter table census2010.${table} add primary key using index ${table}_logrecno_idx;
EOS
done
echo "commit;" >> create_index
cat create_index | psql -v ON_ERROR_STOP=1 census
create unique index on census2010.geo_header_sf1(fileid, stusab, chariter, cifsn, logrecno);
echo "begin;" > create_fk
for table in `echo "\\\dt census2010.*" | psql -tA census | grep -v data | grep -v geo | cut -f 2 -d \|`; do
cat <<EOS >> create_fk
alter table census2010.${table} add constraint ${table}_geo_fk foreign key (logrecno) references census2010.geo_header_sf1(logrecno);
EOS
done
echo "commit;" >> create_fk
cat create_fk | psql -v ON_ERROR_STOP=1 census
In order to speed up lookups against the TIGER data, let's create an index to use. sumlev (Summary Level) is the aggregation level that the row represents, e.g. state, county, tract, block group, block, &c. They're defined in the SF1 documentation; 101 is the block level.
create index on census2010.geo_header_sf1 (state, county, tract, block) where sumlev = '101';
Calculating the total population of the county can be done by simply summing field p0010001 (total population) in table sf1_00001 (also defined in the SF1 documentation), but only over the rows representing the block summary level. Without filtering on the summary level, the population would be counted once for every summary level in the county.
select sum(p0010001)
from census2010.sf1_00001
join census2010.geo_header_sf1 using (logrecno)
where county = '003'
and sumlev = '101';
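As a cross-check, the same total should come back from the single county-level row. This is a sketch that leans on two values from the SF1 documentation: sumlev 050 is the state-county summary level, and geocomp '00' is the total (non-component) row.

select p0010001
from census2010.sf1_00001
join census2010.geo_header_sf1 using (logrecno)
where county = '003'
  and sumlev = '050'
  and geocomp = '00';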
For the population in our walkshed, we will use the stops_walkshed_blocks table that we generated in Part 2. As a recap, that is every block that is adjacent to a segment of road that is within a quarter mile of a bus stop. (The reason for the group by is that stops_walkshed_blocks is broken down by stop, which obviously leads to some blocks being counted multiple times.)
select sum(p0010001)
from (
select max(p0010001) as p0010001
from census2010.sf1_00001
join census2010.geo_header_sf1 using (logrecno)
join tiger2017.tabblock
on statefp10 = state
and countyfp10 = county
and tractce10 = tract
and blockce10 = block
join stops_walkshed_blocks
on stops_walkshed_blocks.geoid10 = tabblock.geoid10
where county = '003' and sumlev = '101'
group by tabblock.geoid10) x;
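If you want the percentage straight out of the database, the two sums can be combined; this is just the two queries above wrapped in CTEs:

with total as (
  select sum(p0010001) as pop
  from census2010.sf1_00001
  join census2010.geo_header_sf1 using (logrecno)
  where county = '003'
    and sumlev = '101'
), walkshed as (
  select sum(p0010001) as pop
  from (
    select max(p0010001) as p0010001
    from census2010.sf1_00001
    join census2010.geo_header_sf1 using (logrecno)
    join tiger2017.tabblock
      on statefp10 = state
      and countyfp10 = county
      and tractce10 = tract
      and blockce10 = block
    join stops_walkshed_blocks
      on stops_walkshed_blocks.geoid10 = tabblock.geoid10
    where county = '003' and sumlev = '101'
    group by tabblock.geoid10) x
)
select walkshed.pop as walkshed_pop,
       total.pop as county_pop,
       round(100.0 * walkshed.pop / total.pop, 1) as pct_served
from walkshed, total;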
This yields a total population of 1,223,348 and a service-area population of 775,835, which is just over 63% of the county's population.