Calculating Service Area, Part 3: Demographics

Date: 20190127

Tags: postgis postgres transit census

In Part 1 we
talked about the project and got some of the tools and data we need to
start. A quick reminder of our motivation: to figure out the percentage
area and population the county transit agency serves by using a
“walkshed”.


In Part 2, we defined our walkshed and how we would link it to the Census 2010
data. In this post, we’ll do the linking.


I want to point out sproke’s blog post from 2012, which is where I got many of
these ideas years ago.


Parts



Importing the Census 2010 Data


First we need to gather the data from census.gov and
initialize our database. I’m going to grab the documentation (sf1.pdf), along with
some MS Access database files they use to store the schema.


wget --continue -e robots=off --force-directories --no-parent https://www.census.gov/prod/cen2010/doc/sf1.pdf
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/SF1_Access2003.mdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/SF1_Access2007.accdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/Pennsylvania/
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/SF2_MSAccess_2003.mdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/SF2_MSAccess_2007.accdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/Pennsylvania/

Some notes: --recursive will grab linked files, and --continue will try to
start mid-file if it’s been partially downloaded. -e robots=off is naughty
and tells wget to download things even if robots.txt tells you not to. (I don’t
think it’s an issue here, I just add it all the time.) --force-directories
will create a directory structure beginning with the domain, so
“https://example.com/a/b/c.d” will be downloaded to a file
“example.com/a/b/c.d”; I find this makes keeping track of what things are and
where they came from a lot easier. --no-parent tells wget not to go up the
directory tree to get assets, which could cause it to begin going down sibling
directories of the root we gave it.


Now we are going to import that data. The first step
is to unzip the datafile. Be warned: the unzipped data can swell in size. For
example, the PA dataset is compressed about 90%: the zip is 436MiB and uncompressed
it is 4.4GiB.


unzip pa2010.sf1.zip

Now that we have the files, let’s build the schema. Mostly because I love
using the commandline whenever possible, I decided to stream the schema
from the Access sample databases through some fix-ups and then into a file.


mdb-schema SF1_Access2003.mdb |
  # MS Access uses [] as identifier quotes; let's fix that.
  tr '[]' '""' |
  # Some tables have a name suffix of `mod` because the first few fields have
  # been removed to keep the table under 255 fields; let's add them back in.
  sed '/mod"$/{$!{N;s/mod"\n (/"\n (\n "FILEID" text,\n "STUSAB" text,\n "CHARITER" text,\n "CIFSN" text,/g}}' |
  # Remove spaces from identifiers.
  sed -E 's/(\s+)"([^"]+) ([^"]+)"/\1\2_\3/g' |
  # Convert to PG type names.
  sed -E 's/Text\s*\([0-9]+\)/text/g' |
  sed -E 's/(Long )?Integer/bigint/g' |
  sed -E 's/Double/double precision/g' |
  # Table 45 is split into 2 parts; let's merge them by deleting the trailing
  # ); of the first and the CREATE TABLE along with the first few fields of the
  # second.
  sed 's/_PT1//g' |
  sed '/);$/{$!{N;N;N;N;N;N;N;N;s/.*_PT2.*LOG[^,]*//g}}' |
  # Just remove the quotes; they're not needed anymore.
  tr -d '"' > SF1_pg.schma.sql

Then, we can send that into postgres, but I want to wrap it all in a commit
so that we don’t end up with a half-built database if something fails.


cat <(echo "set role census;") \
    <(echo "begin; create schema census2010; set search_path = census2010;") \
    SF1_pg.schma.sql \
    <(echo "commit;") |
  psql -v ON_ERROR_STOP=1 census

Now the fun part! We get to import the datafiles! I’m going to use the
copy statement and the
psql \copy meta-command. However, like above,
I’m going to build up a SQL file before running it so that everything can
be wrapped in a transaction. Yes, this step can take a bit, and if it fails
towards the end it sucks, but in the end it doesn’t take that long (a few
minutes) and it’s significantly easier than trying to scrub duplicates. (We
haven’t built indices yet, as having them during the import slows it down
considerably.)


echo "begin;" > import_script
for input_file in pa000*2010.sf1; do
  table_name=SF1_`echo $input_file | sed 's/pa\(.\{5\}\)2010.sf1/\1/'`
  # Provides some feedback as `copy` will only output the number of rows imported.
  echo "\\\echo $input_file  $table_name" >> import_script
  echo "\\\copy census2010.${table_name} from '${input_file}' with (format csv, header false);\n" >> import_script
done
echo "commit;" >> import_script
cat import_script | psql -v ON_ERROR_STOP=1 census

Now comes the more interesting part: the geoheader (which links demographic records
with the geography) is a fixed-width file (i.e. fields are not delimited, but
are defined by character positions on a line). What Sophia did was
to build a staging table that would contain the raw lines, and then do an
“insert into select” with a select statement that teased out the fields using
the substring function. We’ll do a very similar thing. The fields are
defined in the SF1
documentation.

echo "set role census; CREATE TABLE census2010.geo_header_staging(data text);" | psql census
echo "\\\copy census2010.geo_header_staging from 'pageo2010.sf1';" | psql census

Then we’ll tease apart that staging table.


INSERT INTO census2010.geo_header_sf1 (fileid, stusab, sumlev, geocomp, chariter, cifsn, logrecno, region, division, state, county, countycc, countysc, cousub, cousubcc, cousubsc, place, placecc, placesc, tract, blkgrp, block, iuc, concit, concitcc, concitsc, aianhh, aianhhfp, aianhhcc, aihhtli, aitsce, aits, aitscc, ttract, tblkgrp, anrc, anrccc, cbsa, cbsasc, metdiv, csa, necta, nectasc, nectadiv, cnecta, cbsapci, nectapci, ua, uasc, uatype, ur, cd, sldu, sldl, vtd, vtdi, reserve2, zcta5, submcd, submcdcc, sdelem, sdsec, sduni, arealand, areawatr, name, funcstat, gcuni, pop100, hu100, intptlat, intptlon, lsadc, partflag, reserve3, uga, statens, countyns, cousubns, placens, concitns, aianhhns, aitsns, anrcns, submcdns, cd113, cd114, cd115, sldu2, sldu3, sldu4, sldl2, sldl3, sldl4, aianhhsc, csasc, cnectasc, memi, nmemi, puma, reserved)
SELECT
nullif(trim(substring(data,1,6)), '') AS fileid,
nullif(trim(substring(data,7,2)), '') AS stusab,
nullif(trim(substring(data,9,3)), '') AS sumlev,
nullif(trim(substring(data,12,2)), '') AS geocomp,
nullif(trim(substring(data,14,3)), '') AS chariter,
nullif(trim(substring(data,17,2)), '') AS cifsn,
nullif(trim(substring(data,19,7)), '')::bigint AS logrecno,
nullif(trim(substring(data,26,1)), '') AS region,
nullif(trim(substring(data,27,1)), '') AS division,
nullif(trim(substring(data,28,2)), '') AS state,
nullif(trim(substring(data,30,3)), '') AS county,
nullif(trim(substring(data,33,2)), '') AS countycc,
nullif(trim(substring(data,35,2)), '') AS countysc,
nullif(trim(substring(data,37,5)), '') AS cousub,
nullif(trim(substring(data,42,2)), '') AS cousubcc,
nullif(trim(substring(data,44,2)), '') AS cousubsc,
nullif(trim(substring(data,46,5)), '') AS place,
nullif(trim(substring(data,51,2)), '') AS placecc,
nullif(trim(substring(data,53,2)), '') AS placesc,
nullif(trim(substring(data,55,6)), '') AS tract,
nullif(trim(substring(data,61,1)), '') AS blkgrp,
nullif(trim(substring(data,62,4)), '') AS block,
nullif(trim(substring(data,66,2)), '') AS iuc,
nullif(trim(substring(data,68,5)), '') AS concit,
nullif(trim(substring(data,73,2)), '') AS concitcc,
nullif(trim(substring(data,75,2)), '') AS concitsc,
nullif(trim(substring(data,77,4)), '') AS aianhh,
nullif(trim(substring(data,81,5)), '') AS aianhhfp,
nullif(trim(substring(data,86,2)), '') AS aianhhcc,
nullif(trim(substring(data,88,1)), '') AS aihhtli,
nullif(trim(substring(data,89,3)), '') AS aitsce,
nullif(trim(substring(data,92,5)), '') AS aits,
nullif(trim(substring(data,97,2)), '') AS aitscc,
nullif(trim(substring(data,99,6)), '') AS ttract,
nullif(trim(substring(data,105,1)), '') AS tblkgrp,
nullif(trim(substring(data,106,5)), '') AS anrc,
nullif(trim(substring(data,111,2)), '') AS anrccc,
nullif(trim(substring(data,113,5)), '') AS cbsa,
nullif(trim(substring(data,118,2)), '') AS cbsasc,
nullif(trim(substring(data,120,5)), '') AS metdiv,
nullif(trim(substring(data,125,3)), '') AS csa,
nullif(trim(substring(data,128,5)), '') AS necta,
nullif(trim(substring(data,133,2)), '') AS nectasc,
nullif(trim(substring(data,135,5)), '') AS nectadiv,
nullif(trim(substring(data,140,3)), '') AS cnecta,
nullif(trim(substring(data,143,1)), '') AS cbsapci,
nullif(trim(substring(data,144,1)), '') AS nectapci,
nullif(trim(substring(data,145,5)), '') AS ua,
nullif(trim(substring(data,150,2)), '') AS uasc,
nullif(trim(substring(data,152,1)), '') AS uatype,
nullif(trim(substring(data,153,1)), '') AS ur,
nullif(trim(substring(data,154,2)), '') AS cd,
nullif(trim(substring(data,156,3)), '') AS sldu,
nullif(trim(substring(data,159,3)), '') AS sldl,
nullif(trim(substring(data,162,6)), '') AS vtd,
nullif(trim(substring(data,168,1)), '') AS vtdi,
nullif(trim(substring(data,169,3)), '') AS reserve2,
nullif(trim(substring(data,172,5)), '') AS zcta5,
nullif(trim(substring(data,177,5)), '') AS submcd,
nullif(trim(substring(data,182,2)), '') AS submcdcc,
nullif(trim(substring(data,184,5)), '') AS sdelem,
nullif(trim(substring(data,189,5)), '') AS sdsec,
nullif(trim(substring(data,194,5)), '') AS sduni,
nullif(trim(substring(data,199,14)), '')::double precision AS arealand,
nullif(trim(substring(data,213,14)), '')::double precision AS areawatr,
nullif(trim(substring(data,227,90)), '') AS name,
nullif(trim(substring(data,317,1)), '') AS funcstat,
nullif(trim(substring(data,318,1)), '') AS gcuni,
nullif(trim(substring(data,319,9)), '')::bigint AS pop100,
nullif(trim(substring(data,328,9)), '')::bigint AS hu100,
nullif(trim(substring(data,337,11)), '') AS intptlat,
nullif(trim(substring(data,348,12)), '') AS intptlon,
nullif(trim(substring(data,360,2)), '') AS lsadc,
nullif(trim(substring(data,362,1)), '') AS partflag,
nullif(trim(substring(data,363,6)), '') AS reserve3,
nullif(trim(substring(data,369,5)), '') AS uga,
nullif(trim(substring(data,374,8)), '') AS statens,
nullif(trim(substring(data,382,8)), '') AS countyns,
nullif(trim(substring(data,390,8)), '') AS cousubns,
nullif(trim(substring(data,398,8)), '') AS placens,
nullif(trim(substring(data,406,8)), '') AS concitns,
nullif(trim(substring(data,414,8)), '') AS aianhhns,
nullif(trim(substring(data,422,8)), '') AS aitsns,
nullif(trim(substring(data,430,8)), '') AS anrcns,
nullif(trim(substring(data,438,8)), '') AS submcdns,
nullif(trim(substring(data,446,2)), '') AS cd113,
nullif(trim(substring(data,448,2)), '') AS cd114,
nullif(trim(substring(data,450,2)), '') AS cd115,
nullif(trim(substring(data,452,3)), '') AS sldu2,
nullif(trim(substring(data,455,3)), '') AS sldu3,
nullif(trim(substring(data,458,3)), '') AS sldu4,
nullif(trim(substring(data,461,3)), '') AS sldl2,
nullif(trim(substring(data,464,3)), '') AS sldl3,
nullif(trim(substring(data,467,3)), '') AS sldl4,
nullif(trim(substring(data,470,2)), '') AS aianhhsc,
nullif(trim(substring(data,472,2)), '') AS csasc,
nullif(trim(substring(data,474,2)), '') AS cnectasc,
nullif(trim(substring(data,476,1)), '') AS memi,
nullif(trim(substring(data,477,1)), '') AS nmemi,
nullif(trim(substring(data,478,5)), '') AS puma,
nullif(trim(substring(data,483,18)), '') AS reserved
FROM census2010.geo_header_staging;

Now, let’s delete that staging table.


drop table census2010.geo_header_staging;

Let’s build some indices and foreign keys.

echo "begin;" > create_index
for table in `echo "\\\dt census2010.*" | psql -tA census | grep -v data | cut -f 2 -d \|`; do
  cat <<EOS >> create_index
create index on census2010.${table}(fileid);
create index on census2010.${table}(stusab);
create index on census2010.${table}(chariter);
create index on census2010.${table}(cifsn);
create unique index on census2010.${table}(logrecno);

alter table census2010.${table} alter column fileid set not null;
alter table census2010.${table} alter column stusab set not null;
alter table census2010.${table} alter column chariter set not null;
alter table census2010.${table} alter column cifsn set not null;
alter table census2010.${table} alter column logrecno set not null;

alter table census2010.${table} add primary key using index ${table}_logrecno_idx;

EOS
done
echo "commit;" >> create_index
cat create_index | psql -v ON_ERROR_STOP=1 census

create unique index on census2010.geo_header_sf1(fileid, stusab, chariter, cifsn, logrecno);

echo "begin;" > create_fk
for table in `echo "\\\dt census2010.*" | psql -tA census | grep -v data | grep -v geo | cut -f 2 -d \|`; do cat <<EOS >> create_fk
alter table census2010.${table} add constraint ${table}_geo_fk foreign key (logrecno) references census2010.geo_header_sf1(logrecno);
EOS
done
echo "commit;" >> create_fk
cat create_fk | psql -v ON_ERROR_STOP=1 census


Service Area Population


In order to speed up lookups against the TIGER data, let’s create an index to
use. sumlev (Summary Level) is the aggregation that the row represents,
e.g. state, county, tract, block group, block, &c. They’re defined in the
SF1 documentation. 101 is
the block level.


create index on census2010.geo_header_sf1 (state, county, tract, block) where sumlev = '101';

Calculating the total population of the county can be done by simply using
field p0010001 (total population) in table sf1_00001 (also defined in the
SF1 documentation), but only for the rows representing the block summary level.
Without filtering on the summary level the population would be counted for
every summary level in the county.


select sum(p0010001) 
from census2010.sf1_00001 
join census2010.geo_header_sf1 using (logrecno)
where county = '003' 
  and sumlev = '101';

For the population in our walkshed, we will use the stops_walkshed_blocks
table that we generated in Part 2. As a recap, that is every block that is
adjacent to a segment of road that is within a quarter mile of a bus stop. (The
reason for the group by is that stops_walkshed_blocks is broken down by
stop, which obviously leads to some blocks being counted multiple times.)


select sum(p0010001)
from (
  select max(p0010001) as p0010001 
  from census2010.sf1_00001
  join census2010.geo_header_sf1 using (logrecno) 
  join tiger2017.tabblock 
    on  statefp10 = state 
    and countyfp10 = county
    and tractce10 = tract 
    and blockce10 = block
  join stops_walkshed_blocks 
    on stops_walkshed_blocks.geoid10 = tabblock.geoid10
  where county = '003' and sumlev = '101'
  group by tabblock.geoid10) x;

Results


This yields a total population of 1,223,348 and a service area population of
775,835 which is just over 63% of the county’s population.
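
If you’d rather have the database produce the percentage directly, the two
queries above can be combined into one; a sketch reusing the same tables:

select round(100.0 * served.pop / total.pop, 1) as pct_served
from (
  -- service-area population (same query as above)
  select sum(p0010001) as pop
  from (
    select max(p0010001) as p0010001
    from census2010.sf1_00001
    join census2010.geo_header_sf1 using (logrecno)
    join tiger2017.tabblock
      on  statefp10 = state
      and countyfp10 = county
      and tractce10 = tract
      and blockce10 = block
    join stops_walkshed_blocks
      on stops_walkshed_blocks.geoid10 = tabblock.geoid10
    where county = '003' and sumlev = '101'
    group by tabblock.geoid10) x
) served,
(
  -- total county population (same query as above)
  select sum(p0010001) as pop
  from census2010.sf1_00001
  join census2010.geo_header_sf1 using (logrecno)
  where county = '003' and sumlev = '101'
) total;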

Calculating Service Area, Part 4: Next Steps

Date: 20190127

Tags: postgis postgres transit census

In Part 1 we
talked about the project and got some of the tools and data we need to
start. A quick reminder of our motivation: to figure out the percentage
area and population the county transit agency serves by using a
“walkshed”.


Parts



Walkshed


There are two changes I should make to the walkshed calculation before I dig
into per-route statistics: (1) excluding interstates, and (2) giving light rail
and busway stations a half-mile walkshed. I don’t believe they made a
difference at the county level; once I redo the calculations I’ll come back to
see if I have to eat those words.


More In-Depth Demographic Analysis


Since we have all of the census data and all of the route data, and we broke our
walkshed down by stop, it’d be quite easy to look at the service area as a
whole, and at individual routes, against other socio-economic indicators.


Also, more interestingly, who isn’t being served?
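
As a first cut at that question, here’s a sketch of the flip side of the Part 3
population query: the population of county blocks that aren’t adjacent to any
stop’s walkshed (same tables and summary-level filter as before):

select sum(p0010001) as unserved_population
from census2010.sf1_00001
join census2010.geo_header_sf1 using (logrecno)
join tiger2017.tabblock
  on  statefp10 = state
  and countyfp10 = county
  and tractce10 = tract
  and blockce10 = block
where county = '003'
  and sumlev = '101'
  and not exists (
    -- block never shows up in the walkshed table
    select 1
    from stops_walkshed_blocks swb
    where swb.geoid10 = tabblock.geoid10
  );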


Filter Walkshed by Reliable Transit


I’d love to look at what the walkshed is when only stops that are served more
frequently than every 30 minutes mid-day are included. Other alternatives would be
late at night, or weekends. Essentially, what’s the service area within which you
could do a good deal without having to get into a car?
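
I haven’t written that query yet, but a rough sketch of picking out the
“frequent” stops from the GTFS tables might look like the following.
gtfs.stop_times is a standard GTFS table the importer loads; the 10:00 to 14:00
window and the eight-trip threshold (roughly a 30-minute headway over four
hours) are assumptions, and service-day filtering via the calendar tables is
left out of the sketch.

-- Stops with at least 8 midday trips, i.e. a 30-minute headway or better.
select stop_id
from gtfs.stop_times
where arrival_time between '10:00:00' and '14:00:00'
group by stop_id
having count(distinct trip_id) >= 8;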


Build a web-based tool to export this data


I would love to make this data more usable and accessible. I previously made a
little tool using python, mapnik, and a
file from GDAL2Tiles. I’d like to explore if QGIS’ “save project to database”
functionality could be used (i.e. mangled, forced, hacked?) to do this more
cleanly.


Isochrone / Isotrip maps


I’ve been wanting to build an application where you could pick a location
and it’d generate two (interactive) maps of all other stops: an isochrone map
showing how long it’d take to get anywhere else (with 15-minute isochrone
lines showing the boundary of how long it takes to get how far) and an isotrip
map showing how many trips/transfers it’d take to get to the other stop, with
similar isotrip lines.


Create Subject Tables


The Access DBs also contain information on what all the fields mean and which
subject tables they belong to. We can extract it with the mdb-export tool.

mdb-export SF1_Access2003.mdb data_field_descriptors > data_field_descriptors.csv
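
If you want that mapping queryable next to the data, you could load the export
into Postgres as well. A minimal sketch with hypothetical column names (check
the header row mdb-export writes for the real ones):

-- Hypothetical staging table; adjust the columns to match the CSV header.
create table census2010.data_field_descriptors (
  sort_id      text,
  segment      text,
  table_number text,
  field_code   text,
  field_name   text
);
\copy census2010.data_field_descriptors from 'data_field_descriptors.csv' with (format csv, header true)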

Calculating Service Area, Part 2: Defining the Walkshed

Date: 20190110

Tags: postgis pg-routing postgres gis gtfs transit census

In Part 1 we
talked about the project and got some of the tools and data we need to
start. A quick reminder of our motivation: to figure out the percentage
area and population the county transit agency serves by using a
“walkshed”. A walkshed is the set of roads reachable from a bus stop by walking
along the road network for a certain distance. We’ll talk more later in this
article about why it’s important to measure along the road. Once we have a
walkshed, we’ll use census data to compute the statistics we’re interested in.


Parts



Defining Distance


At this point, we have enough data to start playing around with. However, if
we examine a random road from the ways table in QGIS, we’ll notice that its
length is something like “0.0007016736064598”.


Length of a random road before reprojection


You might be thinking that those are some funny units. You wouldn’t be far off;
they’re units of degrees. They’re not insensible units, especially given that
our geographic data is projected in WGS 84 (EPSG
4326) (GPS coördinates,
which are in fact in degrees), but they’re not very useful for what we’re
trying to do.
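
To see the difference concretely, here’s a quick sketch comparing the raw
WGS 84 length with the same geometry measured after transforming to a planar,
foot-based SRID (we’ll formalize the reprojection below):

-- Length in degrees (SRID 4326) vs. feet (SRID 3365) for one road segment.
select
  st_length(the_geom)                     as length_degrees,
  st_length(st_transform(the_geom, 3365)) as length_feet
from osmpgrouting.ways
limit 1;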


Reprojection


In order to deal with more useful units, feet, we need to reproject the data,
i.e. convert it into a new coordinate system. (I discuss projections a little
more at Projections and Why They’re
Important.) I’m going to choose
the Pennsylvania South State
Plane
(EPSG 3365),
which is designed to provide an accurate plane in linear units (3365 is in
feet, 3364 is in
meters) for the southern portion of Pennsylvania. Being able to work in
linear units is significantly easier, and significantly more accurate,
when doing analysis like this that is based on requirements in linear
units.
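
If you ever need to double-check which unit an SRID uses, PostGIS’s
spatial_ref_sys table spells it out; a quick sketch:

-- proj4text includes the unit (e.g. units=us-ft vs. units=m).
select srid, proj4text
from spatial_ref_sys
where srid in (3364, 3365);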


To convert the geometries we currently have, we’ll copy the tables (to
retain the source data) and then reproject the copies in place. The reason we
need to reproject the stops and blocks is that geographic operations
like selecting within a certain distance work best when everything is in
the same projection. (QGIS would display everything correctly, but
PostGIS doesn’t like doing spatial operations over different
projections.) I also took this opportunity to filter the block data to
just Allegheny County, whose FIPS code is ‘003’ within state ‘42’.
(The full FIPS code is ‘42003’.) In Postgres we can accomplish this
with:


create table osmpgrouting.ways_3365 as select * from osmpgrouting.ways;
alter table osmpgrouting.ways_3365 alter column the_geom type geometry(linestring, 3365) using st_transform(the_geom,3365) ;
create index ways_3365_the_geom_idx on osmpgrouting.ways_3365 using gist(the_geom);

create table osmpgrouting.ways_vertices_pgr_3365 as select * from osmpgrouting.ways_vertices_pgr;
alter table osmpgrouting.ways_vertices_pgr_3365 alter column the_geom type geometry(point, 3365) using st_transform(the_geom,3365) ;
create index ways_vertices_pgr_3365_the_geom_idx on osmpgrouting.ways_vertices_pgr_3365 using gist(the_geom);

create table gtfs.stops_3365 as select * from gtfs.stops;
alter table gtfs.stops_3365 alter column the_geom type geometry(point, 3365) using st_transform(the_geom,3365) ;
create index stops_3365_the_geom_idx on gtfs.stops_3365 using gist(the_geom);

create table tiger2017.tabblock_ac_3365 as select * from tiger2017.tabblock where countyfp10 = '003' and statefp10 = '42';
alter table tiger2017.tabblock_ac_3365 alter column geom type geometry(multipolygon, 3365) using st_transform(geom,3365) ;
create index tabblock_ac_3365_geom_idx on tiger2017.tabblock_ac_3365 using gist(geom);


We’re also going to use a handy little feature of Postgres to make the
resulting SQL use an index: expression
indices.
The index is on the start point and end point of the geometry column,
will update if the column is ever updated, and isn’t stored explicitly
as a field.


create index osmpgrouting_ways_startpoint_idx on osmpgrouting.ways_3365 using gist (st_startpoint(the_geom));
create index osmpgrouting_ways_endpoint_idx on osmpgrouting.ways_3365 using gist (st_endpoint(the_geom));

We also need to update the length, cost, and reverse_cost fields
on osmpgrouting.ways_3365.


update osmpgrouting.ways_3365 set length = st_length(the_geom), cost = st_length(the_geom), reverse_cost = st_length(the_geom);

Now, let’s load these layers and work with them.


Length of a random road after reprojection


That same road is now “203.747620652544”, which seems like a reasonable
number of feet. (It’s also spot on the length_m (length in meters)
field, which is part of the OSM import, when converted to feet!)


Finding all roads within a certain distance of a stop


Now, a walkshed is how far you can get by walking along the road for a certain
distance. To clarify why this is important, take the yellow bus stop
below. The grey circle is a ¼mi (1230ft)
buffer around it. This buffer includes part of a loop of road to the
south.
However, the pale orange line of the ruler shows that you’d need to walk
along 0.3mi of road to get to that point from our starting spot.


Example of why the walkshed is important. The yellow bus stop is <¼mi from part of a loop as the crow flies, but not by foot.


What we need is a function that takes the road network and provides the
distance you’d walk or drive on the road from a starting point to get
somewhere, since we can’t just build a buffer around each point.


pgRouting provides a handy function that does exactly what we’re looking
for:
pgr_drivingDistance.

We give pgr_drivingDistance a query containing our weighted graph as a
table of (edge id, source vertex id, target vertex id, cost, reverse
cost). We need to select reverse cost, otherwise pgr_drivingDistance
will be forced to traverse every road from start to end, and not end to
start, so we’d be at the mercy of how the road was imported. There are
ways to take “one-way” roads into account, but since we’re dealing with
pedestrians, we’ll ignore that.


We also give pgr_drivingDistance a source vertex id and a maximum
distance, and it returns every segment of road in the source graph, and
the aggregate cost to get there, that is within the distance given.
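
As a minimal illustration of the call, here’s the shape of a
pgr_drivingDistance invocation on our reprojected ways table (the starting
vertex id here is just a made-up example):

-- Every vertex reachable within 1,320 ft of vertex 12345, with the edge used
-- to reach it, its cost, and the aggregate cost from the start.
select seq, node, edge, cost, agg_cost
from pgr_drivingDistance(
  'select gid as id, source, target, cost, reverse_cost from osmpgrouting.ways_3365',
  12345,  -- hypothetical starting vertex id
  1320    -- maximum aggregate cost, in feet thanks to the reprojection
);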


The following query computes the walkshed. I store it directly in a
table because it takes north of 2 hours to run on my machine. I’m still
working on making this faster, but since we’re essentially computing it
once and will reüse these results, it’s not a huge problem in
practice.


What’s happening here is I build a
CTE that is the
result of running pgr_drivingDistance for each stop. Notice the
keyword lateral in the join. Lateral
Subqueries
allow the execution of the subquery for each record in the FROM
clause; otherwise the subquery is executed independently of the rest of
the query.
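
Here’s the lateral pattern in isolation, using the same nearest-vertex lookup
as the full query below: for each stop, the subquery can reference that stop’s
geometry, which a plain subquery in FROM could not do.

select s.stop_id, nearest.id as vertex_id
from gtfs.stops_3365 s
join lateral (
  -- The reference to s.the_geom is what requires "lateral".
  select v.id
  from osmpgrouting.ways_vertices_pgr_3365 v
  order by v.the_geom <-> s.the_geom
  limit 1
) nearest on true;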


I then build a CTE that pulls in all road segments that have a start or
end point that is the source or target from our results. I do this,
instead of just going by the edges returned from pgr_drivingDistance,
because a particularly long segment of road (which isn’t unheard of
when we’re doing this for ¼mi) would not be included, and that would
skew our results by not counting its area, even though it may be within our
walkshed. I’m erring on the side of including a few features farther out than
I want, rather than excluding features I do want.


Then, because of how the join was done, we simply group everything by
stop and road segment, pulling the smallest distance to a vertex for a
segment of road. (Remember, pgr_drivingDistance gives the distance a
vertex is from the start based on the edges in the graph.) However,
let’s focus on a single stop, ‘S18340’, first since computing walksheds
for all stops takes a good while.


create table stops_walkshed_S18340 as
with all_within as (
  select stop_id, dd.*
  from gtfs.stops_3365
  join lateral (
    select seq, node, edge, cost, agg_cost
      from pgr_drivingDistance(
            -- We need to select reverse cost, otherwise
            -- pgr_drivingDistance will be forced to traverse every road
            -- from start to end, and not end to start, so we'd be at the
            -- mercy of how the road was imported. There are ways to take
            -- "one-way" roads into account, but since we're dealing with
            -- pedestrians, we'll ignore that.
            'SELECT gid as id, source, target, cost, reverse_cost ' ||
            'FROM osmpgrouting.ways_3365 ' ||
            -- Find all road segments that begin or end within the
            -- distance we're looking for. We're going to do 1mi to give
            -- ourselves some flexibility later on, and it doesn't
            -- substantually increase the walltime.
            'where st_dwithin(' ||
              -- This awkwardness is because pgRouting functions take
              -- text to run as a query to get the edges and vertices,
              -- not the results of a subquery.
              'ST_GeomFromEWKB(decode(''' || stops_3365.the_geom::text  || ''', ''hex'')), ' ||
              'ST_StartPoint(ways_3365.the_geom), ' ||
              '5280) or' ||
            ' st_dwithin(' ||
              'ST_GeomFromEWKB(decode(''' || stops_3365.the_geom::text  || ''', ''hex'')), ' ||
              'ST_EndPoint(ways_3365.the_geom), ' ||
              '5280)',
        -- Start from the nearest vertex/intersection to the bus stop.
        (select id
          from osmpgrouting.ways_vertices_pgr_3365
          order by ways_vertices_pgr_3365.the_geom <-> stops_3365.the_geom
          limit 1),
        5280
    )
  ) dd
    on true
  where stop_id = 'S18340'
), simplified as (
  select
    gid, stop_id, agg_cost,
    the_geom
  from all_within
  join osmpgrouting.ways_3365
    on node in (source, target)
)
-- When we do the join above, a road segment could be pulled in
-- multiple times for a single stop, so we'll collect it and count
-- it as the min of either its start or end vertices.
select gid, stop_id, min(agg_cost) as agg_cost, the_geom
from simplified
group by stop_id, gid, the_geom
;

We still have issues with portions of roads that have other portions
outside the distance we’re shooting for. To solve this, we’re going to
split each road into 100ft segments.


source and target are used by pgr_drivingDistance to connect
segments, so we need to ensure that they are still consistent. Examining
the data shows that the max edge and vertex ids are 6 digits long.


census=# select max(gid) as max_gid, max(source) as max_source, max(target) as max_target from osmpgrouting.ways_3365;
 max_gid | max_source | max_target
---------+------------+------------
  287229 |     216247 |     216247
(1 row)

Additionally, gid is a bigint, which maxes out at 9223372036854775807,
which is 19 digits long. As such, I can create a new vertex id that is
the “packed” version of the old vertex id, the edge id, and the sequence
in the split. For example, if I had an edge with gid=287229 and a
source=216247, and it’s the 4th segment after splitting, I can call
this new vertex 287229216247000004. Adding some special logic to keep
the original source and targets the same as they were in the
original data (but formatted appropriately), we can still use the
ways_vertices_pgr vertex table as well. (Note that I got a good
portion/idea for this query from the
ST_LineSubString
documentation.) (This does take a while to run.)
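
To make the packing concrete, the example above works out like this:

-- gid 287229, source vertex 216247, 4th split segment
-- packs into the new vertex id 287229216247000004.
select (to_char(287229, 'FM000000')
     || to_char(216247, 'FM000000')
     || to_char(4,      'FM000000'))::bigint as packed_vertex_id;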


create table osmpgrouting.ways_3365_split100 as
select
  (to_char(gid, 'FM000000') || to_char(n, 'FM000000'))::bigint as gid,
  osm_id, tag_id,
  st_length(new_geom) as length,
  name,
  cast(case when n = 0
         then to_char(source, 'FM000000') || '000000000000'
         else to_char(source, 'FM000000') || to_char(gid, 'FM000000') || to_char(n, 'FM000000')
       end as bigint) as source,
  cast(case when end_percent = 1
         then to_char(target, '000000') || '000000000000'
         else to_char(source, 'FM000000') || to_char(gid, 'FM000000') || to_char(n+1, 'FM000000')
       end as bigint) as target,
  st_length(new_geom) as cost,
  st_length(new_geom) as reverse_cost,
  new_geom as the_geom
from (select *, greatest(length, 1) as orig_length from osmpgrouting.ways_3365) as x
cross join generate_series(0, 10000) as n
join lateral (
  select 100.00*n/orig_length as start_percent,
         case when 100.00*(n+1) < orig_length then 100.00*(n+1)/orig_length else 1 end as end_percent
) y on n*100.00/orig_length < 1
join lateral (
  select st_linesubstring(the_geom, start_percent, end_percent) as new_geom
) z on true;
-- Some lines end up as a single point. Something to look into more :/
delete from osmpgrouting.ways_3365_split100 where ST_GeometryType(the_geom) = 'ST_Point';
alter table osmpgrouting.ways_3365_split100 add primary key (gid);
alter table osmpgrouting.ways_3365_split100 owner to census;
alter table osmpgrouting.ways_3365_split100 alter column the_geom type geometry(linestring, 3365) using st_setsrid(the_geom, 3365);
create index ways_3365_split100_the_geom_idx on osmpgrouting.ways_3365_split100 using gist(the_geom);
create index ways_3365_split100_source_idx on osmpgrouting.ways_3365_split100(source);
create index ways_3365_split100_target_idx on osmpgrouting.ways_3365_split100(target);
vacuum full osmpgrouting.ways_3365_split100;

create table osmpgrouting.ways_vertices_pgr_3365_split100 as select * from osmpgrouting.ways_vertices_pgr_3365;
update osmpgrouting.ways_vertices_pgr_3365_split100 set id = (to_char(id, 'FM000000') || '000000000000')::bigint;
create index ways_vertices_pgr_3365_split100_the_geom_idx on osmpgrouting.ways_vertices_pgr_3365_split100 using gist(the_geom);
vacuum full osmpgrouting.ways_vertices_pgr_3365_split100;


Now, we can run our walkshed calculation again. However, we can clean it
up a little as well, now that we have much smaller segments to work
with. Again, just for S18340 because we want to iterate quickly.


create table stops_walkshed_s18340_split100 as
with all_within as (
  select stop_id, dd.*
  from gtfs.stops_3365
  join lateral (
    select seq, node, edge, cost, agg_cost
      from pgr_drivingDistance(
            'SELECT gid as id, source, target, cost, reverse_cost ' ||
            'FROM osmpgrouting.ways_3365_split100 ' ||
            'where st_dwithin(' ||
              'ST_GeomFromEWKB(decode(''' || stops_3365.the_geom::text  || ''', ''hex'')), ' ||
              'ways_3365_split100.the_geom, ' ||
              '5280)',
        (select id
          from osmpgrouting.ways_vertices_pgr_3365_split100
          order by ways_vertices_pgr_3365_split100.the_geom <-> stops_3365.the_geom
          limit 1),
        5280
    )
  ) dd
    on true
  where stop_id = 'S18340'
), all_within_simplified as (
  select stop_id, node, max(agg_cost) as agg_cost
  from all_within
  group by stop_id, node
)
select
  gid, stop_id, agg_cost, the_geom
from all_within_simplified
join osmpgrouting.ways_3365_split100
  on node in (source, target);

Below is a much more accurate representation of the walkshed in red,
which was computed after splitting the road into segments. The orange
was the “¼ mi” walkshed before we split the road into segments,
and it misses a good portion of territory.


Comparison of walkshed before and after split


Since we computed the walkshed out to 1mi, and we stored the aggregate
distance for each segment, we can use that to build heat maps of what’s
within a certain distance of a stop.
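
For example, a sketch that buckets each segment of the single-stop walkshed
into quarter-mile bands, which can then drive a graduated style in QGIS:

-- width_bucket returns 1 for 0-1320 ft, 2 for 1320-2640 ft, and so on.
select gid, stop_id, agg_cost, the_geom,
       width_bucket(agg_cost, 0, 5280, 4) as quarter_mile_band
from stops_walkshed_s18340_split100;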


Coloring Walkshed by distance


Let’s now join the walkshed against our block data. Since we’re using
different data sources, the road features themselves might not overlap a
block; it’s highly likely that a road sits on one side or the other of the
block boundary. What we can do is build an arbitrary buffer along the road
segment, i.e. make it “thicker”, and test if the buffer overlaps a block. I
chose 10ft. I also chose to filter only for blocks within ¼mi, and not build
the distance metrics into the block.


create table blocks_s18340 as 
select b.* 
from tiger2017.tabblock_ac_3365 b 
join stops_walkshed_s18340_split100 ws 
  on st_overlaps(b.geom, st_buffer(ws.the_geom, 10))
where ws.agg_cost <= 1230;

The blocks themselves are delineated by a road or one of the white
lines. Each block that was adjacent to one of the red lines is now
selected in dark yellow. (As opposed to the rest of the blocks in a
lighter yellow, mustard if you will.) This does add a bit to what the
“walkshed” technically is, but at this point I’m not willing to
interpolate population by building an arbitrary buffer over the
walkshed, as when blocks are large it’s likely that the population is
not evenly spread among it and interpolation can give rather odd
results. (See
Interpolating Old Census Data
for some discussion of those issues.) Using adjacent blocks is more than
sufficient for our purposes, and at worst will overestimate slightly.


Quarter mile walkshed with adjacent blocks


Now, let’s remove the where stop_id = 'S18340' and compute the
walkshed and adjacent blocks for each stop!


create table stops_walkshed_5280 as
with all_within as (
  select stop_id, dd.*
  from gtfs.stops_3365
  join lateral (
    select seq, node, edge, cost, agg_cost
      from pgr_drivingDistance(
            'SELECT gid as id, source, target, cost, reverse_cost ' ||
            'FROM osmpgrouting.ways_3365_split100 ' ||
            'where st_dwithin(' ||
              'ST_GeomFromEWKB(decode(''' || stops_3365.the_geom::text  || ''', ''hex'')), ' ||
              'ways_3365_split100.the_geom, ' ||
              '5280)',
        (select id
          from osmpgrouting.ways_vertices_pgr_3365_split100
          order by ways_vertices_pgr_3365_split100.the_geom <-> stops_3365.the_geom
          limit 1),
        5280
    )
  ) dd
    on true
), all_within_simplified as (
  select stop_id, node, max(agg_cost) as agg_cost
  from all_within
  group by stop_id, node
)
select gid, stop_id, agg_cost, the_geom
from all_within_simplified
join osmpgrouting.ways_3365_split100 on node in (source, target);
delete from stops_walkshed_5280 where ST_GeometryType(the_geom) = 'ST_Point';
vacuum full stops_walkshed_5280;
alter table stops_walkshed_5280 alter column the_geom type geometry(linestring, 3365) using st_setsrid(the_geom, 3365);
create index stops_walkshed_the_geom_idx on stops_walkshed using gist (the_geom);

create table stop_walkshed_blocks as
with stop_walkshed_buffer as (
  select stop_id, st_buffer(st_collect(the_geom), 10) as geom
  from stops_walkshed ws
  where ws.agg_cost <= 1230
  group by stop_id
)
select b.gid, b.geoid10, stop_id, b.geom
from stop_walkshed_buffer swb
join tiger2017.tabblock_ac_3365 b on st_overlaps(b.geom, swb.geom);


(The reason I used a CTE is that I wanted to limit the number of blocks
joined against. I didn’t exhaustively test performance against the more
naïve query of


create table stop_walkshed_blocks_2 as
  select b.gid, b.geoid10, stop_id, b.geom
  from tiger2017.tabblock_ac_3365 b
  join stops_walkshed_5280 ws
    on st_overlaps(b.geom, st_buffer(ws.the_geom, 10))
  where ws.agg_cost <= 1230;

as the CTE version ran fast enough and I was running out of the time allotted for
this step.)


And Voilà, we have our walkshed.


Walkshed


Now, to calculate the area of the county served.


select
(select sum(st_area(geom)) from (select geom from stop_walkshed_blocks group by geom)x)
/
(select sum(st_area(geom)) from tiger2017.tabblock_ac_3365);

Which yields just over 34% of the county by area.


In the next part, we’ll import the census records and compute the percent of the
population served. While 34% seems low, I expect the percent of population to
be substantially higher, as the parts of the county not covered in pink have a
much lower density (and in some cases are very much rural).

Calculating Service Area, Part 1: Introduction and Setup

Date: 20190108

Tags: postgis pg-routing postgres gis gtfs transit census

Motivation


One of my fellow Allegheny County Transit
Council

(ACTC) members asked how much of the county the Port
Authority
serves. After not receiving a
response, I decided to compute it myself. We’re going to define the
service area in terms of both land and people, i.e. percent of the land in
Allegheny County in the service area and also percent of the people in Allegheny
County in the service area.


That still leaves a question of what a “Service Area” actually is. For this
project, I’m going to define the “Service Area” as any Census 2010 Tabulation
Block that is adjacent to a road that is within ¼mi of a bus stop and
½mi from a busway or trolley stop. From there, I can use the area and
population of each tabulation block (block) to build up my statistics.


Parts



Tools Required


I’m using Ubuntu 18.04 and have basic build utilities such as gcc,
autoconf, make, cmake, git, &c already installed.


QGIS


QGIS is a featureful, free (libre and beer,
although paid consulting is available) GIS (Geographic Information
Systems) tool similar in many ways to ArcGIS. I covered a little on
using it in my So You Want to Make a
Map
post. The QGIS Training
Manual

contains information (among many, many other things) about how to
connect QGIS to PostgreSQL. It also has sections on Database
Concepts

and Spatial Database
Concepts

which I would highly recommend. It’s very fast, well documented, and
easy to use. I cannot say enough good things about it.


To install the latest version (3.x), since the Ubuntu repository only has the
2.18.x LTR:


sudo add-apt-repository "deb     https://qgis.org/debian bionic main"
wget -O - https://qgis.org/downloads/qgis-2017.gpg.key | gpg --import
gpg --fingerprint CAEB3DC3BDF7FB45
gpg --export --armor CAEB3DC3BDF7FB45 | sudo apt-key add -
sudo apt-get update
sudo apt-get install qgis python-qgis

Some instructions say to do the following too


sudo add-apt-repository "deb-src https://qgis.org/debian bionic main"
sudo apt-get install qgis-plugin-grass

but they didn’t work for me, nor are they required, so I skipped them.


PostgreSQL, PostGIS, pgRouting


PostgreSQL is an extremely powerful,
full-featured SQL RDBMS. The documentation is top-notch and extensions
such as PostGIS for performing GIS (Geographic
Information System) work and pgRouting for
performing graph algorithms and traversals make it much more valuable. I
truly cannot say enough good things about PostgreSQL.


To install:


sudo apt-get install postgresql-10 postgresql-10-pgrouting postgresql-10-postgis-2.4

mdbtools


mdbtools allows viewing and
manipulating Microsoft Access databases, which is the format the Census
distributes its schemas in. It is a fork of an older tool, also called
mdbtools, but one that I could actually get to compile.


To install:


cd ~/projects/external
git clone git@github.com:brianb/mdbtools.git brianb/mdbtools
cd brianb/mdbtools
sudo apt-get install txt2man
autoreconf -i -f
export DOCBOOK_DSL=/usr/share/sgml/docbook/stylesheet/dsssl/modular/html/docbook.dsl
./configure
make

gtfs-sql-importer


gtfs-sql-importer is a
set of scripts to help import the csv files in the GTFS and generate
some geographic features inside of PostgreSQL.


To install:


cd projects/external
git clone git@github.com:fitnr/gtfs-sql-importer.git fitnr/gtfs-sql-importer
cd fitnr/gtfs-sql-importer
make init
PGDATABASE=census GTFS=~/Downloads/www.portauthority.org/generaltransitfeed/general_transit_CleverTripID_1811.zip make load

I did need to patch the code to handle files with empty lines. However,
this appears to be fixed in the
add-cols
branch, which might be merged into master soon.


diff --git a/src/load.sh b/src/load.sh
index 6d77846..f2f4278 100644
--- a/src/load.sh
+++ b/src/load.sh
@@ -15,7 +15,7 @@ function import_stdin()
   # remove possible BOM
   hed=$(unzip -p "$ZIP" "${1}.txt" | head -n 1 | awk '{sub(/^\xef\xbb\xbf/,"")}{print}')
   echo "COPY gtfs.${1}" 1>&2
-  unzip -p "$ZIP" "${1}.txt" | awk '{ sub(/\r$/,""); sub(",\"\"", ","); print }' | psql -c "COPY gtfs.${1} (${hed}) FROM STDIN WITH DELIMITER AS ',' HEADER CSV"
+  unzip -p "$ZIP" "${1}.txt" | grep -v "^\s*$" | awk '{ sub(/\r$/,""); sub(",\"\"", ","); print }' | psql -c "COPY gtfs.${1} (${hed}) FROM STDIN WITH DELIMITER AS ',' HEADER CSV"
 }

 ADD_DATES=

census-postgres-scripts

census-postgres-scripts is a set of scripts from
Census Reporter to make quick work of
importing ACS and TIGER datasets.


To install:


cd projects/external
git clone git@github.com:censusreporter/census-postgres-scripts.git censusreporter/census-postgres-scripts

I also made some patches to only download data for Pennsylvania (FIPS ID
42) or the whole US when needed and also to download the Roads and
Tabulation Block geometries, which aren’t included by default in these
scripts. I also store mine in a different location and login to postgres
via my personal user, not the census user.

diff --git a/12_download_tiger_2017.sh b/12_download_tiger_2017.sh
index ea17e88..86084b1 100755
--- a/12_download_tiger_2017.sh
+++ b/12_download_tiger_2017.sh
@@ -1,11 +1,12 @@
 # Download census-y TIGER data

-mkdir -p /mnt/tmp/tiger2017
-
-wget --recursive --continue --accept=*.zip \
+wget --recursive --continue \
+  --accept=tl_2017_42_*.zip \
+  --accept=tl_2017_42003*.zip \
+  --accept=tl_2017_us*.zip \
+  --wait=1 --random-wait \
   --no-parent --cut-dirs=3 --no-host-directories \
   -e robots=off \
-  --directory-prefix=/mnt/tmp/tiger2017 \
   https://www2.census.gov/geo/tiger/TIGER2017/AIANNH/ \
   https://www2.census.gov/geo/tiger/TIGER2017/AITSN/ \
   https://www2.census.gov/geo/tiger/TIGER2017/ANRC/ \
@@ -23,11 +24,13 @@ wget --recursive --continue --accept=*.zip \
   https://www2.census.gov/geo/tiger/TIGER2017/NECTADIV/ \
   https://www2.census.gov/geo/tiger/TIGER2017/PLACE/ \
   https://www2.census.gov/geo/tiger/TIGER2017/PUMA/ \
+  https://www2.census.gov/geo/tiger/TIGER2017/ROADS/ \
   https://www2.census.gov/geo/tiger/TIGER2017/SCSD/ \
   https://www2.census.gov/geo/tiger/TIGER2017/SLDL/ \
   https://www2.census.gov/geo/tiger/TIGER2017/SLDU/ \
   https://www2.census.gov/geo/tiger/TIGER2017/STATE/ \
   https://www2.census.gov/geo/tiger/TIGER2017/SUBMCD/ \
+  https://www2.census.gov/geo/tiger/TIGER2017/TABBLOCK/ \
   https://www2.census.gov/geo/tiger/TIGER2017/TBG/ \
   https://www2.census.gov/geo/tiger/TIGER2017/TRACT/ \
   https://www2.census.gov/geo/tiger/TIGER2017/TTRACT/ \
diff --git a/13_import_tiger_2017.sh b/13_import_tiger_2017.sh
index 2d00e54..683c39f 100755
--- a/13_import_tiger_2017.sh
+++ b/13_import_tiger_2017.sh
@@ -1,18 +1,9 @@
 #!/bin/bash
-cd /mnt/tmp/tiger2017

-# For shp2pgsql
-sudo apt-get install -y postgis
+psql -d census -v ON_ERROR_STOP=1 -q -c "DROP SCHEMA tiger2017 CASCADE; CREATE SCHEMA tiger2017;"
+psql -d census -v ON_ERROR_STOP=1 -q -c "ALTER SCHEMA tiger2017 OWNER TO census;"

-if [ -z $PGHOST ]; then
-  echo "You must set PGHOST environment variable to the hostname of the PostgreSQL server to operate on."
-  exit 1
-fi
-
-psql -d census -U census -v ON_ERROR_STOP=1 -q -c "DROP SCHEMA IF EXISTS tiger2017 CASCADE; CREATE SCHEMA tiger2017;"
-psql -d census -U census -v ON_ERROR_STOP=1 -q -c "ALTER SCHEMA tiger2017 OWNER TO census;"
-
-for i in /mnt/tmp/tiger2017/{CBSA,CD,COUNTY,CSA,PLACE,STATE,ELSD,SCSD,ZCTA5,COUSUB,PUMA,SLDL,SLDU,AIANNH,AITSN,ANRC,BG,CNECTA,CONCITY,METDIV,NECTA,NECTADIV,SUBMCD,TBG,TTRACT,TRACT,UAC,UNSD}
+for i in ./{TABBLOCK,ROADS,CBSA,CD,COUNTY,CSA,PLACE,STATE,ELSD,SCSD,ZCTA5,COUSUB,PUMA,SLDL,SLDU,AIANNH,AITSN,ANRC,BG,CNECTA,CONCITY,METDIV,NECTA,NECTADIV,SUBMCD,TBG,TTRACT,TRACT,UAC,UNSD}
 do
   tablename=$(basename $i)
   for j in $i/*.zip
@@ -24,12 +15,12 @@ do
   one_shapefile=`ls -a $i/*.shp | head -n 1`

   # Start by preparing the table
-  shp2pgsql -W "latin1" -s 4326 -p -I $one_shapefile tiger2017.$tablename | psql -d census -U census -v ON_ERROR_STOP=1 -q
+  shp2pgsql -W "latin1" -s 4326 -p -I $one_shapefile tiger2017.$tablename | psql -d census -v ON_ERROR_STOP=1 -q

   # Then append all the geometries
   for j in $i/*.shp
   do
-    shp2pgsql -W "latin1" -s 4326 -a $j tiger2017.$tablename | psql -d census -U census -v ON_ERROR_STOP=1 -q
+    shp2pgsql -W "latin1" -s 4326 -a $j tiger2017.$tablename | psql -d census -v ON_ERROR_STOP=1 -q
   done

   if [ $? -ne 0 ]
diff --git a/13_index_tiger_2017.sql b/13_index_tiger_2017.sql
index 288d8ac..5421efa 100644
--- a/13_index_tiger_2017.sql
+++ b/13_index_tiger_2017.sql
@@ -27,6 +27,8 @@ ALTER TABLE tiger2017.ttract OWNER TO census;
 ALTER TABLE tiger2017.tract OWNER TO census;
 ALTER TABLE tiger2017.uac OWNER TO census;
 ALTER TABLE tiger2017.unsd OWNER TO census;
+ALTER TABLE tiger2017.roads OWNER TO census;
+ALTER TABLE tiger2017.tabblock OWNER TO census;

 -- OGR needs select on geography_columns to do it's thing
 GRANT SELECT ON geography_columns TO census;

osm2pgrouting

osm2pgrouting is a tool
that will import OpenStreetMap dumps as a topology (i.e. connects roads
that are connected in real-life) into a database with PostGIS and
pgRouting.


To install:


sudo apt-get install libpqxx-dev
cd projects/external
git clone git@github.com:pgRouting/osm2pgrouting.git pgRouting/osm2pgrouting
cd pgRouting/osm2pgrouting
mkdir build
cd build
cmake ..
make

osmosis


osmosis is a tool for
manipulating OpenStreetMap dumps. We’ll use it to “crop” the Geofabrik
dump of Pennsylvania to around Allegheny County because osm2pgrouting
requires raw xml, not the Protocol
Buffers
format
Geofabrik distributes in. osm2pgrouting also tries to load the whole
dump into memory before performing its work, so the less data it needs
to load, the better.


To install:


 sudo apt-get install osmosis

or


cd ~/projects/external
mkdir osmosis
cd osmosis
wget https://bretth.dev.openstreetmap.org/osmosis-build/osmosis-latest.tgz
tar xvfz osmosis-latest.tgz
rm osmosis-latest.tgz
chmod a+x bin/osmosis

Database Setup


First we need to create the role/user and database in Postgres. I’ve
traditionally used the “census” user, as that’s what the Census Reporter
scripts use and where I started previously. If you login to
postgres as a superuser, you can create the role and database via the
following SQL.

create role census with nosuperuser login password 'censuspassword';
create database census with owner census;

Alternatively, you could use the following shell commands (run as a user that can sudo to the postgres user).


sudo -u postgres createuser --no-superuser --login --pwprompt census
sudo -u postgres createdb --owner census census

Once created, we can login via your favourite SQL interface, including psql.


psql --username census --password census

and execute the following:


create extension postgis;
create extension hstore;
create extension pgrouting;
create schema osmpgrouting;
create schema gtfs;

Data Needed


To compute all of this, and because I haven’t loaded this data onto this
laptop previously, we’ll need a few pieces of data. We’ll cover
downloading and importing the data in the next post.


TIGER 2010 Tabulation Blocks


The Census provides the geometries corresponding to their summary data, along
with some additional data, such as basic water features and roads.
These geometries are part of a collection called “TIGER” (Topologically
Integrated Geographic Encoding and Referencing). I will be using many of
these datasets at a later time, so I am going to import all of them; if you’re
following along, feel free to only download what you need.


You can follow the instructions that come with
census-postgres-scripts; an abbreviated version is presented here. As I
mentioned before, feel free to only download the TABBLOCK data, as that’s all
we need for this project, and to modify the scripts to only download your
state’s data.


cd projects/external/censusreporter/census-postgres-scripts
./12_download_tiger_2017.sh
./13_import_tiger_2017.sh
./13_index_tiger_2017.sh

Road Topologies


While TIGER does provide road information, it’s actually not the most
useful for this purpose because it does not differentiate between an
at-grade or grade-separated crossing (e.g. an intersection vs a bridge).
This is the essence of a “topology”: the connections of the roads
themselves. For this, we’re going to use the
OpenStreetMap data. Luckily, the folks
over at Geofabrik in Germany produce
up-to-date exports based on administrative boundaries, such as US
States.


First I grabbed the dump for Pennsylvania.


cd ~/Downloads
wget https://download.geofabrik.de/north-america/us/pennsylvania-190107.osm.pbf

Then, I used osmosis to convert it into XML from the Protocol
Buffers
format and also
to “crop” the input to osm2pgrouting to just the county. The
coordinates I’m using were randomly taken to be outside the county in
all directions. It also seemed like osmosis didn’t like ~ or
$HOME, so I used full paths to my files.


osmosis --read-pbf file=/home/jim/Downloads/pennsylvania-190107.osm.pbf --bb left="-80.456" right="-79.542" top="40.787" bottom="40.119" --write-xml file=/home/jim/Downloads/pennsylvania-190107-ac.osm

Now we can actually import the data! Before I decided to trim the input to
osm2pgrouting, this operation would take quite a while and then fail with a
std::bad_alloc, i.e. out of memory. After cropping, it didn’t take as long,
nor use as much memory, but more importantly, finished.


cd projects/external/pgRouting/osm2pgrouting
./osm2pgrouting --conf ../mapconfig.xml --addnodes --attributes --tags --dbname census --username census --password censuspassword --schema osmpgrouting --file ~/Downloads/pennsylvania-190107-ac.osm

Something I noticed when viewing the data was that there were a lot of
lines left over from how I “cropped” the region; I suspect they’re connections of
roads with the same name, i.e. state routes, interstate highways, &c. I
deleted them by filtering on values with a really large cost, but there
were still a handful of outliers that needed to be cleaned up afterwards
by hand.


echo "delete from osmpgrouting.ways where cost &gt; 1" | psql census

Bus Stops


The General Transit Feed
Specification
is produced by most transit agencies and is consumed by services such as Google
Maps, Bing Maps, Transit App, &c. It contains data such as stops,
routes, trips, fares, normal schedules, special schedules, &c. The Port
Authority also produces a GTFS
feed.


I started by grabbing all of the GTFS and GIS data from the Port Authority,
although I only needed the
www.portauthority.org/generaltransitfeed/general_transit_CleverTripID_1811.zip
file and could have only downloaded that. Remember, you may need the patched
version of the code or the add-cols branch for the import to succeed.


cd ~/Downloads
wget --recursive --continue -e robots=off --force-directories --no-parent http://www.portauthority.org/generaltransitfeed/GIS/
cd ~/projects/external/fitnr/gtfs-sql-importer
make init
PGDATABASE=census GTFS=~/Downloads/www.portauthority.org/generaltransitfeed/general_transit_CleverTripID_1811.zip make load

Census 2010 Population Data


While I could use more recent ACS estimates, I’m choosing to use the
decennial census because I would like my area estimations to use
tabulation blocks, and not block
groups
. Block
Groups are groups of blocks in the same Census Tract, and are used to
aggregate data over a larger geographic region. That’s useful for a few
reasons:



  • Fewer data points if you don’t need the geographic specificity of a block. (Let’s face it, 90+% of the time you don’t; tracts will work well-enough.)

  • The ACS is a sampled survey, not an enumeration like the decennial census, and the error would be too high when examining blocks.

  • Statistical Disclosure Limitation whereby data that could be used to identify a person may be edited, especially in geographically fine-grained data such as block-level summaries. However, “By agreement with the Department of Justice (2000) the Census Bureau will provide exact counts at the Census block level for the following variables:

  • Number of people: total, age 18+ (voting age) , and less than age 18,

  • Number of vacant housing units, and

  • Number of householders, which is equal to the number of occupied housing units”


Since this is a little more complicated to import, and we need to generate our
walksheds first, anyway, I’m going to put off the import until a later post.


Next Steps


Part 2: Defining the Walkshed

The Death of jimktrains

Date: 20180514

Tags: social_media

Yesterday I got into yet another argument about the GDPR on hackernews.
It ended with me being rate limited for some long amount of time and no
actual discussion except people insinuating that my position is wrong
because of the GDPR. This is a fundamental misunderstanding of how
arguments work. I can find the GDPR to be silly and baseless while still
complying with it. Their arguments boil down to, in another context,
marijuana is bad because the law says so. That doesn’t help expand the
conversation more. I’m not talking about compliance with the law, I’m
talking about the premise of it.


(I’m working on a longer post discussing my thoughts on the GDPR. I view it in
much the same way as previous cookie legislation: well intentioned, but
ultimately poorly designed. I feel that, unfortunately, the onus is ultimately on
us, the consumers of said services, to understand what we’re sending to these
services and how it might be used. Moreover, we need to understand and be fully
aware that anything we give someone else is no longer ours to control. I’ve
deleted my accounts from a few social media outlets, but I’m under no delusion
that they’re actually gone. Even the best-intentioned places will have
remnants of me in backups and logs at the very least, or at “worst” just set a
“deleted” field on my record(s). I gave this information away and no longer
believe I am in ultimate control of said data.)


I realized that I don’t gain anything meaningful from social media.
It brings out the worst in me, and it’s become a huge time sink that I can’t
afford.


I hardly use Twitter anymore because, for me, the restricted length of a
tweet makes discussion difficult, verging on meaningless. The stream of news
is redundant with 1st-party sources I already check, and many tweets are
often linking to articles of questionable origin.


I used Facebook sparingly. While it is extremely nice to see some of
the heartwarming posts from actual friends, ultimately, I don’t feel like
that furthers our friendship. Those interactions are few and far
between, and stand in contrast to the constant stream of nonsense and
questionable sources posted by others. I was part of a community message
board, but perhaps I can coax them elsewhere or simply have Anne let me
know if there is anything interesting there.


LinkedIn was nothing but recruiter spam and a deletion I should have
done a very long time ago.


HackerNews provides no way to delete an account or older posts. I
emailed the mods and requested my account be deleted. We’ll see what
ultimately happens. It’s funny that I was yelled at because I asserted
websites should control the data you send them and the group-think is
that the GDPR changes that long-standing (and practical and realistic)
convention and that I’m just plain wrong, and yet the website we’re
arguing on provides no means for the users disagreeing with me to
control “their” data (hint: it’s not really and ultimately theirs, no
matter the law).


So anyway, please don’t try to find me on social media. If you’d like to
contact me, send me an email, gchat, or text.