
Visualizing System-wide Headway

Date: 20191113

Tags: postgis postgres gis gtfs transit

I was bored this evening and decided to visualize the system-wide headway,
that is, the amount of time between buses at a stop, for every stop in the
system.


Given the stop-distance table from the previous
post, I set out to compute the headway for each stop segment (the portion
of a route between stops).


-- Compute the max headway for each stop per hour.
create table overall_headway as
select stop_id, hour, max(headway) as max_headway
from (
  select stop_id,
         extract(hour from departure_time) as hour,
         departure_time,
         lag(departure_time, 1) over w as prev_departure_time,
         extract(epoch from (departure_time - lag(departure_time, 1) over w))/60 as headway
  from gtfs.stops_3365
  join gtfs.stop_times using (stop_id)
  window w as (partition by stop_id order by departure_time asc)
) x
group by stop_id, hour;

-- Assign the segments created for the stop-distance calculation
-- a headway from overall_headway.
create table stop_line_headway as
select coalesce(ohs.hour, ohp.hour) as hour,
       sdl.stop_id, sdl.dist, prev_stop_id,
       min(least(ohs.max_headway, ohp.max_headway)) as headway,
       the_geom
from stop_dist_line sdl
left join overall_headway ohs on sdl.stop_id = ohs.stop_id
left join overall_headway ohp on sdl.prev_stop_id = ohp.stop_id
where (ohs.hour is null or ohp.hour is null or (ohs.hour = ohp.hour)) and st_length(the_geom) < 2500
group by coalesce(ohs.hour, ohp.hour), sdl.stop_id, sdl.dist, prev_stop_id, the_geom;

alter table stop_line_headway add id serial primary key;


I then pulled this into QGIS, colored it with a graduated symbology, and filtered per hour.
For each hour, I exported a PNG of the map and tried my hand at KDEnlive for the
first time. (Although ffmpeg could have gotten the job done as well.)


Raw video, images, and more details on building the map to come.

Stop Spacing, Part 1

Date: 20191112

Tags: postgis postgres gis gtfs transit

For reasons I'll get into in another post, I wanted to have a rough idea
of the distribution of distances between stops. To do this I used the
Port Authority of Allegheny County GTFS import
that I did earlier this year when calculating the service area.


A quick note about projections: since I had created a table with the geometry in
the Pennsylvania South (ftUS) State Plane (EPSG 3365),
any distance calculations I do will be in feet.


A quick, rough and dirty experiment went like this:


create table stop_dist_line as
select stop.stop_id, stop.stop_name,
       prev.stop_id as prev_stop_id, prev.stop_name as prev_stop_name,
       -- Just using euclidean distance right now.
       -- This could be improved by using the on-the-street distance.
       stop.the_geom <-> prev.the_geom as dist,
       st_setsrid(st_makeline(stop.the_geom, prev.the_geom), 3365) as the_geom
from (
  -- This gets unique pairs of stops, regardless of ordering.
  select distinct least(stop_id, prev_stop_id) as prev_stop_id,
                  greatest(stop_id, prev_stop_id) as stop_id
  from (
    -- This gets all pairs of stops. In theory there could be duplicate pairs
    -- based on direction.
    select stop_id, lag(stop_id, 1) over stop_order prev_stop_id
    from gtfs.stop_times
    join gtfs.trips using (trip_id)
    window stop_order as (partition by trip_id, direction_id order by stop_sequence)
  ) unordered_pairs
) pairs
join gtfs.stops_3365 stop on stop.stop_id = pairs.stop_id
join gtfs.stops_3365 prev on prev.stop_id = pairs.prev_stop_id;

-- Remove segments that don't have a line because there is no
-- previous stop to make a line with or measure a distance from.
delete from stop_dist_line where prev_stop_id is null;

-- Register the new geometry column.
select Populate_Geometry_Columns();
-- Add a PK to play more nicely with QGIS.
alter table stop_dist_line add id serial primary key;
-- Add an index on the geometry.
create index on stop_dist_line using gist(the_geom);


We can then get some idea of the distribution by bucketing distances into 100 ft bins.


select dist_group,
       ttlsf,
       ttl,
       round(ttlsf/ttl, 2) as perc,
       c
from
  (select dist_group,
          sum(c) over (rows between unbounded preceding and current row) as ttlsf,
          sum(c) over () as ttl,
          c
   from (
      select 100*floor(dist / 100) as dist_group,
             count(*) as c
      from stop_dist_line
      group by dist_group
      order by dist_group) x)y;

Giving


 dist_group | ttlsf | ttl  | perc |  c
------------+-------+------+------+------
          0 |   149 | 7732 | 0.02 |  149
        100 |   223 | 7732 | 0.03 |   74
        200 |   681 | 7732 | 0.09 |  458
        300 |  1568 | 7732 | 0.20 |  887
        400 |  2528 | 7732 | 0.33 |  960
        500 |  3579 | 7732 | 0.46 | 1051
        600 |  4462 | 7732 | 0.58 |  883
        700 |  5135 | 7732 | 0.66 |  673
        800 |  5589 | 7732 | 0.72 |  454
        900 |  5946 | 7732 | 0.77 |  357
       1000 |  6234 | 7732 | 0.81 |  288
       1100 |  6456 | 7732 | 0.83 |  222
       1200 |  6628 | 7732 | 0.86 |  172
       1300 |  6766 | 7732 | 0.88 |  138
       1400 |  6852 | 7732 | 0.89 |   86
       1500 |  6921 | 7732 | 0.90 |   69
       ...

However, because PostgreSQL is awesome, we can also just compute percentiles
directly.


select percentile_cont(0.25) within group (order by dist) as p_25,
       percentile_cont(0.35) within group (order by dist) as p_35,
       percentile_cont(0.45) within group (order by dist) as p_45,
       percentile_cont(0.50) within group (order by dist) as p_50
from stop_dist_line;

Giving


        p_25       |       p_35       |       p_45       |       p_50
------------------+------------------+------------------+------------------
 438.775035648918 | 518.085656645151 | 590.352075165022 | 631.007847560984

Next steps would be to correct many of the little corners cut, e.g. not using
on-the-street distance, and rerunning the calculations.
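As a sketch of what the on-the-street version might look like (not something I ran for this post; the second stop id is a placeholder), the network distance between a single pair of stops can be pulled from pgRouting with pgr_dijkstra over the same ways table used in the service-area posts:


-- Sketch: network (on-the-street) distance, in feet, between two stops.
with endpoints as (
  -- Snap each stop to its nearest routing vertex. 'S18341' is a placeholder.
  select s.stop_id,
         (select v.id
            from osmpgrouting.ways_vertices_pgr_3365 v
            order by v.the_geom <-> s.the_geom
            limit 1) as vid
  from gtfs.stops_3365 s
  where s.stop_id in ('S18340', 'S18341')
)
select max(agg_cost) as street_dist_ft
from pgr_dijkstra(
  'select gid as id, source, target, cost, reverse_cost from osmpgrouting.ways_3365',
  (select vid from endpoints order by stop_id limit 1),
  (select vid from endpoints order by stop_id desc limit 1),
  false);  -- undirected, since pedestrians can walk either way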

Calculating Service Area, Part 3: Demographics

Date: 20190127

Tags: postgis postgres transit census

In Part 1 we
talked about the project and got some of the tools and data we need to
start. A quick reminder of our motivation: to figure out the percentage
area and population the county transit agency serves by using a
“walkshed”.


In Part 2, we defined what our walkshed is and how we would link it to the
Census 2010 data. In this post, we'll do the linking.


I want to point out sproke's blogpost from 2012, where I got many of these ideas
years back.


Parts



Importing the Census 2010 Data


First we need to gather the data from census.gov and
initialize our database. I'm going to grab the documentation (sf1.pdf), along with
some MS Access database files they use to store the schema.


wget --continue -e robots=off --force-directories --no-parent https://www.census.gov/prod/cen2010/doc/sf1.pdf
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/SF1_Access2003.mdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/SF1_Access2007.accdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/04-Summary_File_1/Pennsylvania/
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/SF2_MSAccess_2003.mdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/SF2_MSAccess_2007.accdb
wget --recursive --continue -e robots=off --force-directories --no-parent https://www2.census.gov/census_2010/05-Summary_File_2/Pennsylvania/

Some notes: --recursive will grab linked files, and --continue will try to
start mid-file if it's been partially downloaded. -e robots=off is naughty
and tells wget to download things even if robots.txt says not to. (I don't
think it's an issue here, I just add it all the time.) --force-directories
will create a directory structure beginning with the domain, so
“https://example.com/a/b/c.d” will be downloaded to a file
“example.com/a/b/c.d” — I find this makes keeping track of what things are and
where they came from a lot easier. --no-parent tells wget not to go up the
directory tree to get assets, which could cause it to begin going down sibling
directories of the root we gave it.


Now, we are going to import that data. The first step
is to unzip the datafile. Be warned, the unzipped data can swell in size. For
example, the PA dataset is compressed about 90%: the zip is 436MiB and uncompressed
it is 4.4GiB.


unzip pa2010.sf1.zip

Now that we have the files, let’s build the schema. Mostly because I love
using the commandline whenever possible, I decided to stream the schema
from the Access sample databases through some fix-ups and then into a file.


mdb-schema SF1_Access2003.mdb |
  # MS Access uses [] as identifier quotes; let's fix that.
  tr '[]' '""' |
  # Some tables have a name suffix of `mod` because the first few fields have
  # been removed to keep the table under 255 fields; let's add them back in.
  sed '/mod"$/{$!{N;s/mod"\n (/"\n (\n "FILEID" text,\n "STUSAB" text,\n "CHARITER" text,\n "CIFSN" text,/g}}' |
  # Remove spaces from identifiers.
  sed -E 's/(\s+)"([^"]+) ([^"]+)"/\1\2_\3/g' |
  # Convert to PG type names.
  sed -E 's/Text\s*\([0-9]+\)/text/g' |
  sed -E 's/(Long )?Integer/bigint/g' |
  sed -E 's/Double/double precision/g' |
  # Table 45 is split into 2 parts; let's merge them by deleting the trailing
  # ); of the one and the CREATE TABLE along with the first few fields of the
  # second.
  sed 's/_PT1//g' |
  sed '/);$/{$!{N;N;N;N;N;N;N;N;s/.*_PT2.*LOG[^,]*//g}}' |
  # Just remove the quotes -- they're not needed anymore.
  tr -d '"' > SF1_pg.schma.sql

Then, we can send that into postgres, but I want to wrap it all in a commit
so that we don’t end up with a half-built database if something fails.


cat <(echo "set role census;") \
    <(echo "begin; create schema census2010; set search_path = census2010;") \
    SF1_pg.schma.sql \
    <(echo "commit;") |
  psql -v ON_ERROR_STOP=1 census

Now the fun part! We get to import the datafiles! I'm going to use the
copy statement and the
psql \copy meta-command. However, like above,
I'm going to build up a sql file before running it so that everything can
be wrapped in a transaction. Yes, this step can take a bit, and if it fails
towards the end it sucks, but in the end it doesn't take that long (a few
minutes) and it's significantly easier than trying to scrub duplicates. (We
haven't built indices yet, as having them during the import slows it down
considerably.)


echo "begin;" &gt; import_script
for input_file in pa000*2010.sf1; do
  table_name=SF1_`echo $input_file | sed 's/pa\(.\{5\}\)2010.sf1/\1/'`
  # Provides some feedback as `copy` will only output the number of rows imported.
  echo "\\\echo $input_file  $table_name" &gt;&gt; import_script
  echo "\\\copy census2010.${table_name} from '${input_file}' with (format csv, header false);\n" &gt;&gt; import_script
done
echo "commit;" &gt;&gt; import_script
cat import_script | psql -v ON_ERROR_STOP=1 census

Now comes the more interesting part: the geoheader (which links demographic records
with the geography) is a fixed-width file (i.e. fields are not delimited, but
are defined as spans between character positions on a line). What Sophia did was
to build a staging table containing the raw lines, and then do an
"insert into ... select" with a select statement that teased out the fields using
the substring function. We'll do a very similar thing. The fields are
defined in the SF1
documentation.

echo "set role census; CREATE TABLE census2010.geo_header_staging(data text);" | psql census
echo "\\\copy census2010.geo_header_staging from 'pageo2010.sf1';" | psql census

Then we’ll tease apart that staging table.


INSERT INTO census2010.geo_header_sf1 (fileid, stusab, sumlev, geocomp, chariter, cifsn, logrecno, region, division, state, county, countycc, countysc, cousub, cousubcc, cousubsc, place, placecc, placesc, tract, blkgrp, block, iuc, concit, concitcc, concitsc, aianhh, aianhhfp, aianhhcc, aihhtli, aitsce, aits, aitscc, ttract, tblkgrp, anrc, anrccc, cbsa, cbsasc, metdiv, csa, necta, nectasc, nectadiv, cnecta, cbsapci, nectapci, ua, uasc, uatype, ur, cd, sldu, sldl, vtd, vtdi, reserve2, zcta5, submcd, submcdcc, sdelem, sdsec, sduni, arealand, areawatr, name, funcstat, gcuni, pop100, hu100, intptlat, intptlon, lsadc, partflag, reserve3, uga, statens, countyns, cousubns, placens, concitns, aianhhns, aitsns, anrcns, submcdns, cd113, cd114, cd115, sldu2, sldu3, sldu4, sldl2, sldl3, sldl4, aianhhsc, csasc, cnectasc, memi, nmemi, puma, reserved)
SELECT
nullif(trim(substring(data,1,6)), '') AS fileid,
nullif(trim(substring(data,7,2)), '') AS stusab,
nullif(trim(substring(data,9,3)), '') AS sumlev,
nullif(trim(substring(data,12,2)), '') AS geocomp,
nullif(trim(substring(data,14,3)), '') AS chariter,
nullif(trim(substring(data,17,2)), '') AS cifsn,
nullif(trim(substring(data,19,7)), '')::bigint AS logrecno,
nullif(trim(substring(data,26,1)), '') AS region,
nullif(trim(substring(data,27,1)), '') AS division,
nullif(trim(substring(data,28,2)), '') AS state,
nullif(trim(substring(data,30,3)), '') AS county,
nullif(trim(substring(data,33,2)), '') AS countycc,
nullif(trim(substring(data,35,2)), '') AS countysc,
nullif(trim(substring(data,37,5)), '') AS cousub,
nullif(trim(substring(data,42,2)), '') AS cousubcc,
nullif(trim(substring(data,44,2)), '') AS cousubsc,
nullif(trim(substring(data,46,5)), '') AS place,
nullif(trim(substring(data,51,2)), '') AS placecc,
nullif(trim(substring(data,53,2)), '') AS placesc,
nullif(trim(substring(data,55,6)), '') AS tract,
nullif(trim(substring(data,61,1)), '') AS blkgrp,
nullif(trim(substring(data,62,4)), '') AS block,
nullif(trim(substring(data,66,2)), '') AS iuc,
nullif(trim(substring(data,68,5)), '') AS concit,
nullif(trim(substring(data,73,2)), '') AS concitcc,
nullif(trim(substring(data,75,2)), '') AS concitsc,
nullif(trim(substring(data,77,4)), '') AS aianhh,
nullif(trim(substring(data,81,5)), '') AS aianhhfp,
nullif(trim(substring(data,86,2)), '') AS aianhhcc,
nullif(trim(substring(data,88,1)), '') AS aihhtli,
nullif(trim(substring(data,89,3)), '') AS aitsce,
nullif(trim(substring(data,92,5)), '') AS aits,
nullif(trim(substring(data,97,2)), '') AS aitscc,
nullif(trim(substring(data,99,6)), '') AS ttract,
nullif(trim(substring(data,105,1)), '') AS tblkgrp,
nullif(trim(substring(data,106,5)), '') AS anrc,
nullif(trim(substring(data,111,2)), '') AS anrccc,
nullif(trim(substring(data,113,5)), '') AS cbsa,
nullif(trim(substring(data,118,2)), '') AS cbsasc,
nullif(trim(substring(data,120,5)), '') AS metdiv,
nullif(trim(substring(data,125,3)), '') AS csa,
nullif(trim(substring(data,128,5)), '') AS necta,
nullif(trim(substring(data,133,2)), '') AS nectasc,
nullif(trim(substring(data,135,5)), '') AS nectadiv,
nullif(trim(substring(data,140,3)), '') AS cnecta,
nullif(trim(substring(data,143,1)), '') AS cbsapci,
nullif(trim(substring(data,144,1)), '') AS nectapci,
nullif(trim(substring(data,145,5)), '') AS ua,
nullif(trim(substring(data,150,2)), '') AS uasc,
nullif(trim(substring(data,152,1)), '') AS uatype,
nullif(trim(substring(data,153,1)), '') AS ur,
nullif(trim(substring(data,154,2)), '') AS cd,
nullif(trim(substring(data,156,3)), '') AS sldu,
nullif(trim(substring(data,159,3)), '') AS sldl,
nullif(trim(substring(data,162,6)), '') AS vtd,
nullif(trim(substring(data,168,1)), '') AS vtdi,
nullif(trim(substring(data,169,3)), '') AS reserve2,
nullif(trim(substring(data,172,5)), '') AS zcta5,
nullif(trim(substring(data,177,5)), '') AS submcd,
nullif(trim(substring(data,182,2)), '') AS submcdcc,
nullif(trim(substring(data,184,5)), '') AS sdelem,
nullif(trim(substring(data,189,5)), '') AS sdsec,
nullif(trim(substring(data,194,5)), '') AS sduni,
nullif(trim(substring(data,199,14)), '')::double precision AS arealand,
nullif(trim(substring(data,213,14)), '')::double precision AS areawatr,
nullif(trim(substring(data,227,90)), '') AS name,
nullif(trim(substring(data,317,1)), '') AS funcstat,
nullif(trim(substring(data,318,1)), '') AS gcuni,
nullif(trim(substring(data,319,9)), '')::bigint AS pop100,
nullif(trim(substring(data,328,9)), '')::bigint AS hu100,
nullif(trim(substring(data,337,11)), '') AS intptlat,
nullif(trim(substring(data,348,12)), '') AS intptlon,
nullif(trim(substring(data,360,2)), '') AS lsadc,
nullif(trim(substring(data,362,1)), '') AS partflag,
nullif(trim(substring(data,363,6)), '') AS reserve3,
nullif(trim(substring(data,369,5)), '') AS uga,
nullif(trim(substring(data,374,8)), '') AS statens,
nullif(trim(substring(data,382,8)), '') AS countyns,
nullif(trim(substring(data,390,8)), '') AS cousubns,
nullif(trim(substring(data,398,8)), '') AS placens,
nullif(trim(substring(data,406,8)), '') AS concitns,
nullif(trim(substring(data,414,8)), '') AS aianhhns,
nullif(trim(substring(data,422,8)), '') AS aitsns,
nullif(trim(substring(data,430,8)), '') AS anrcns,
nullif(trim(substring(data,438,8)), '') AS submcdns,
nullif(trim(substring(data,446,2)), '') AS cd113,
nullif(trim(substring(data,448,2)), '') AS cd114,
nullif(trim(substring(data,450,2)), '') AS cd115,
nullif(trim(substring(data,452,3)), '') AS sldu2,
nullif(trim(substring(data,455,3)), '') AS sldu3,
nullif(trim(substring(data,458,3)), '') AS sldu4,
nullif(trim(substring(data,461,3)), '') AS sldl2,
nullif(trim(substring(data,464,3)), '') AS sldl3,
nullif(trim(substring(data,467,3)), '') AS sldl4,
nullif(trim(substring(data,470,2)), '') AS aianhhsc,
nullif(trim(substring(data,472,2)), '') AS csasc,
nullif(trim(substring(data,474,2)), '') AS cnectasc,
nullif(trim(substring(data,476,1)), '') AS memi,
nullif(trim(substring(data,477,1)), '') AS nmemi,
nullif(trim(substring(data,478,5)), '') AS puma,
nullif(trim(substring(data,483,18)), '') AS reserved
FROM census2010.geo_header_staging;

Now, let’s delete that staging table.


drop table census2010.geo_header_staging;

Let’s build some indices and foreign keys.

echo "begin;" > create_index
for table in `echo "\\\dt census2010.*" | psql -tA census | grep -v data | cut -f 2 -d \|`; do
  cat <<EOS >> create_index
create index on census2010.${table}(fileid);
create index on census2010.${table}(stusab);
create index on census2010.${table}(chariter);
create index on census2010.${table}(cifsn);
create unique index on census2010.${table}(logrecno);

alter table census2010.${table} alter column fileid set not null;
alter table census2010.${table} alter column stusab set not null;
alter table census2010.${table} alter column chariter set not null;
alter table census2010.${table} alter column cifsn set not null;
alter table census2010.${table} alter column logrecno set not null;

alter table census2010.${table} add primary key using index ${table}_logrecno_idx;

EOS
done
echo "commit;" >> create_index
cat create_index | psql -v ON_ERROR_STOP=1 census

create unique index on census2010.geo_header_sf1(fileid, stusab, chariter, cifsn, logrecno);

echo "begin;" > create_fk
for table in `echo "\\\dt census2010.*" | psql -tA census | grep -v data | grep -v geo | cut -f 2 -d \|`; do cat <<EOS >> create_fk
alter table census2010.${table} add constraint ${table}_geo_fk foreign key (logrecno) references census2010.geo_header_sf1(logrecno);
EOS
done
echo "commit;" >> create_fk
cat create_fk | psql -v ON_ERROR_STOP=1 census


Service Area Population


In order to speed up lookups against the TIGER data, let's create an index to
use. sumlev (summary level) is the aggregation that the row represents,
e.g. state, county, tract, block group, block, &c. They're defined in the
SF1 documentation. 101 is
the block level.


create index on census2010.geo_header_sf1 (state, county, tract, block) where sumlev = '101';

Calculating the total population of the county can be done by simply using
field p0010001 (total population) in table sf1_00001 (also defined in the
SF1 documentation), but only for the rows representing the block summary level.
Without filtering on the summary level the population would be counted for
every summary level in the county.


select sum(p0010001) 
from census2010.sf1_00001 
join census2010.geo_header_sf1 using (logrecno)
where county = '003' 
  and sumlev = '101';

For the population in our walkshed, we will use the stop_walkshed_blocks
table that we generated in Part 2. As a recap, that is every block that is
adjacent to a segment of road within a quarter mile of a bus stop. (The
reason for the group by is that stop_walkshed_blocks is broken down by
stop, which would otherwise lead to some blocks being counted multiple times.)


select sum(p0010001)
from (
  select max(p0010001) as p0010001 
  from census2010.sf1_00001
  join census2010.geo_header_sf1 using (logrecno) 
  join tiger2017.tabblock 
    on  statefp10 = state 
    and countyfp10 = county
    and tractce10 = tract 
    and blockce10 = block
  join stop_walkshed_blocks 
    on stop_walkshed_blocks.geoid10 = tabblock.geoid10
  where county = '003' and sumlev = '101'
  group by tabblock.geoid10) x;

Results


This yields a total population of 1,223,348 and a service area population of
775,835, which is just over 63% of the county's population.

Calculating Service Area, Part 4: Next Steps

Date: 20190127

Tags: postgis postgres transit census

In Part 1 we
talked about the project and got some of the tools and data we need to
start. A quick reminder of our motivation: to figure out the percentage
area and population the county transit agency serves by using a
“walkshed”.


Parts



Walkshed


There are two changes I should make to the walkshed calculation before I dig
into per-route statistics: (1) excluding interstates, and (2) giving light rail
and busway stations a half-mile walkshed. I don't believe either made a
difference at the county level; once I redo the calculations I'll come back and
see if I have to eat those words.
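As a rough sketch of the first change (I haven't run this; it assumes the osm2pgrouting configuration table is present in the osmpgrouting schema and uses the standard OSM highway tag values):


-- Hypothetical: drop motorways (interstates) and their ramps from the
-- routable edges before recomputing the walkshed.
delete from osmpgrouting.ways_3365_split100
where tag_id in (
  select tag_id
  from osmpgrouting.configuration
  where tag_value in ('motorway', 'motorway_link'));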


More In-Depth Demographic Analysis


Since we have all of the census data and all of the route data, and we broke our
walkshed down by stop, it'd be quite easy to look at the service area as a
whole, and at particular routes, against other socio-economic indicators.


Also, more interestingly, who isn’t being served?
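As a first stab at that question (a sketch, assuming the tables from Parts 2 and 3 are still around), we could total the population of the blocks that fall outside the walkshed:


-- Sketch: block-level population for Allegheny County blocks that are not
-- adjacent to any stop's walkshed.
select sum(p0010001) as unserved_pop
from census2010.sf1_00001
join census2010.geo_header_sf1 using (logrecno)
join tiger2017.tabblock
  on  statefp10 = state
  and countyfp10 = county
  and tractce10 = tract
  and blockce10 = block
where county = '003'
  and sumlev = '101'
  and not exists (
    select 1
    from stop_walkshed_blocks swb
    where swb.geoid10 = tabblock.geoid10);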


Filter Walkshed by Reliable Transit


I'd love to look at what the walkshed is when only stops that are served more
frequently than every 30 minutes mid-day are included. Other alternatives would be
late at night, or weekends. Essentially, what's the service area in which you could
do a good deal without having to get into a car?
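A sketch of that filter, leaning on the overall_headway table from the headway post above; the 10:00 to 14:00 window and the 30-minute threshold are just my guesses at what "mid-day" and "reliable" should mean:


-- Hypothetical: stops whose worst mid-day headway is 30 minutes or better.
-- The walkshed queries could then be restricted to these stops.
select stop_id
from overall_headway
where hour between 10 and 14
group by stop_id
having max(max_headway) <= 30;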


Build a web-based tool to export this data


I would love to make this data more usable and accessible. I previously made a
little tool using python, mapnik, and a
file from GDAL2Tiles. I'd like to explore whether QGIS' "save project to database"
functionality could be used (i.e. mangled, forced, hacked?) to do this more
cleanly.


Isochron / Isotrip maps


I've been wanting to build an application where you could pick a location
and it'd generate two (interactive) maps of all other stops: an isochron map
showing how long it'd take to get anywhere else (with 15-min isochron
lines showing the boundary of how far you can get in how long) and an isotrip
map showing how many trips/transfers it'd take to get to the other stop, with
similar isotrip lines.


Create Subject Tables


The Access DBs also contain information on what all the fields mean and which
subject tables they belong to. We can extract it with the mdb-export tool.

mdb-export SF1_Access2003.mdb data_field_descriptors > data_field_descriptors.csv

Calculating Service Area, Part 2: Defining the Walkshed

Date: 20190110

Tags: postgis pg-routing postgres gis gtfs transit census

In Part 1 we
talked about the project and got some of the tools and data we need to
start. A quick reminder of our motivation: to figure out the percentage
area and population the county transit agency serves by using a
“walkshed”. A walkshed is everything within a certain distance of a bus stop,
measured along the road rather than as the crow flies. We'll talk more later in
this article about why it's important to measure along the road. Once we have a
walkshed, we'll use census data to compute the statistics we're interested in.


Parts



Defining Distance


At this point, we have enough data to start playing around with. However, if
we examine a random road from the ways table in QGIS, we’ll notice that its
length is something like “0.0007016736064598”.


Length of a random road before reprojection


You might be thinking that those are some funny units. You wouldn't be far off;
they're units of degrees. They're not insensible units, especially given that
our geographic data is projected in WGS 84 (EPSG 4326) (GPS coördinates,
which are in fact in degrees), but they're not very useful for what we're
trying to do.
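A quick way to see both units side by side, using an arbitrary row from the ways table (a throwaway check, not part of the pipeline):


-- Length of one road segment in the native units (degrees) and after
-- reprojecting to EPSG 3365 (feet).
select st_length(the_geom)                     as length_degrees,
       st_length(st_transform(the_geom, 3365)) as length_feet
from osmpgrouting.ways
limit 1;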


Reprojection


In order to deal with more useful units (feet), we need to reproject the data,
i.e. convert it into a new coordinate system. (I discuss projections a little
more at Projections and Why They're Important.) I'm going to choose
the Pennsylvania South State Plane (EPSG 3365),
which is designed to provide an accurate plane in linear units (EPSG 3365 is in
feet, EPSG 3364 is in meters) in the southern portion of Pennsylvania. Being
able to work in linear units is significantly easier, and significantly more
accurate, when doing analysis like this that is based on requirements in linear
units.


To convert the geometries we currently have, we'll copy the tables (to
retain the source data) and then reproject the copies in place. The reason
we need to reproject the stops and blocks as well is that geographic operations
like selecting within a certain distance work best when everything is in
the same projection. (QGIS would display everything correctly, but
PostGIS doesn't like doing spatial operations across different
projections.) I also took this opportunity to filter the block data to
just Allegheny County, whose FIPS code is '003' within state '42'.
(The full FIPS code is '42003'.) In Postgres we can accomplish this
with:


create table osmpgrouting.ways_3365 as select * from osmpgrouting.ways;
alter table osmpgrouting.ways_3365 alter column the_geom type geometry(linestring, 3365) using st_transform(the_geom,3365) ;
create index ways_3365_the_geom_idx on osmpgrouting.ways_3365 using gist(the_geom);

create table osmpgrouting.ways_vertices_pgr_3365 as select * from osmpgrouting.ways_vertices_pgr;
alter table osmpgrouting.ways_vertices_pgr_3365 alter column the_geom type geometry(point, 3365) using st_transform(the_geom,3365) ;
create index ways_vertices_pgr_3365_the_geom_idx on osmpgrouting.ways_vertices_pgr_3365 using gist(the_geom);

create table gtfs.stops_3365 as select * from gtfs.stops;
alter table gtfs.stops_3365 alter column the_geom type geometry(point, 3365) using st_transform(the_geom,3365) ;
create index stops_3365_the_geom_idx on gtfs.stops_3365 using gist(the_geom);

create table tiger2017.tabblock_ac_3365 as select * from tiger2017.tabblock where countyfp10 = '003' and statefp10 = '42';
alter table tiger2017.tabblock_ac_3365 alter column geom type geometry(multipolygon, 3365) using st_transform(geom,3365) ;
create index tabblock_ac_3365_geom_idx on tiger2017.tabblock_ac_3365 using gist(geom);


We're also going to use a handy little feature of Postgres to make the
resulting SQL use an index: expression indices.
The index is on the start point and end point of the geometry column,
will update if the column is ever updated, and isn't stored explicitly
as a field.


create index osmpgrouting_ways_startpoint_idx on osmpgrouting.ways_3365 using gist (st_startpoint(the_geom));
create index osmpgrouting_ways_endpoint_idx on osmpgrouting.ways_3365 using gist (st_endpoint(the_geom));

We also need to update the length, cost, and reverse_cost fields
on osmpgrouting.ways_3365.


update osmpgrouting.ways_3365 set length = st_length(the_geom), cost = st_length(the_geom), reverse_cost = st_length(the_geom);

Now, let’s load these layers and work with them.


Length of a random road after reprojection


That same road is now “203.747620652544” which seems like a reasonable
number of feet. (It’s also spot on the length_m (length in meters)
field, that is part of the OSM import, when converted to feet!)


Finding all roads within a certain distance of a stop


Now, a walkshed is everything you can reach by walking along the road for
a certain distance. To clarify why this is important, take the yellow bus stop
below. The grey circle is a ¼mi (1230ft)
buffer around it. This buffer includes part of a loop of road to the
south.
However, the pale orange line of the ruler shows that you'd need to walk
along 0.3mi of road to get to that point from our starting spot.


Example of why the walkshed matters. The yellow bus stop is <¼mi from part of a loop as the crow flies, but not by foot.


What we need is a function that takes the road network and provides the
distance you’d walk or drive on the road from a starting point to get
somewhere, since we can’t just build a buffer around each point.


pgRouting provides a handy function that does exactly what we’re looking
for:
pgr_drivingDistance.

We give pgr_drivingDistance a query containing our weighted graph as a
table of (edge id, source vertex id, target vertex id, cost, reverse
cost). We need to select reverse cost, otherwise pgr_drivingDistance
will be forced to traverse every road from start to end, and not end to
start, so we’d be at the mercy of how the road was imported. There are
ways to take “one-way” roads into account, but since we’re dealing with
pedestrians, we’ll ignore that.


We also give pgr_drivingDistance a source vertex id and a maximum
distance, and it returns every segment of road in the source graph, and
the aggregate cost to get there, that is within the distance given.


The following query computes the walkshed. I store it directly in a
table because it takes north of 2 hours to run on my machine. I’m still
working on making this faster, but since we’re essentially computing it
once and will reüse these results, it’s not a huge problem in
practice.


What's happening here is I build a
CTE that is the
result of running pgr_drivingDistance for each stop. Notice the
keyword lateral in the join. Lateral subqueries
allow the subquery to be executed for each record in the FROM
clause; otherwise the subquery is executed independently of the rest of
the query.


I then build a CTE that pulls in all road segments that have a start or
end point that is the source or target from our results. I do this,
instead of just going by the edges returned from pgr_drivingDistance,
because a particularly long segment of road (which isn't unheard of
when doing this for ¼mi) would otherwise not be included, and that would
skew our results by not counting its area even though it may be within our
walkshed. I'm erring on the side of including a few features farther away than
I want, to avoid excluding features I do want.


Then, because of how the join was done, we simply group everything by
stop and road segment, pulling the smallest distance to a vertex for a
segment of road. (Remember, pgr_drivingDistance gives the distance a
vertex is from the start based on the edges in the graph.) However,
let’s focus on a single stop, ‘S18340’, first since computing walksheds
for all stops takes a good while.


create table stops_walkshed_S18340 as
with all_within as (
  select stop_id, dd.*
  from gtfs.stops_3365
  join lateral (
    select seq, node, edge, cost, agg_cost
      from pgr_drivingDistance(
            -- We need to select reverse cost, otherwise
            -- pgr_drivingDistance will be forced to traverse every road
            -- from start to end, and not end to start, so we'd be at the
            -- mercy of how the road was imported. There are ways to take
            -- "one-way" roads into account, but since we're dealing with
            -- pedestrians, we'll ignore that.
            'SELECT gid as id, source, target, cost, reverse_cost ' ||
            'FROM osmpgrouting.ways_3365 ' ||
            -- Find all road segments that begin or end within the
            -- distance we're looking for. We're going to do 1mi to give
            -- ourselves some flexibility later on, and it doesn't
            -- substantually increase the walltime.
            'where st_dwithin(' ||
              -- This awkwardness is because pgRouting functions take
              -- text to run as a query to get the edges and vertecies,
              -- not the results of a subquery.
              'ST_GeomFromEWKB(decode(''' || stops_3365.the_geom::text  || ''', ''hex'')), ' ||
              'ST_StartPoint(ways_3365.the_geom), ' ||
              '5280) or' ||
            ' st_dwithin(' ||
              'ST_GeomFromEWKB(decode(''' || stops_3365.the_geom::text  || ''', ''hex'')), ' ||
              'ST_EndPoint(ways_3365.the_geom), ' ||
              '5280)',
        -- Start from the nearest vertex/intersection to the bus stop.
        (select id
          from osmpgrouting.ways_vertices_pgr_3365
          order by ways_vertices_pgr_3365.the_geom <-> stops_3365.the_geom
          limit 1),
        5280
    )
  ) dd
    on true
  where stop_id = 'S18340'
), simplified as (
  select
    gid, stop_id, agg_cost,
    the_geom
  from all_within
  join osmpgrouting.ways_3365
    on node in (source, target)
)
-- When we do the join above, a road segment could be pulled in
-- multiple times for a single stop, so we'll collect it and count
-- it as the min of either its start or end vertecies.
select gid, stop_id, min(agg_cost) as agg_cost, the_geom
from simplified
group by stop_id, gid, the_geom
;

We still have issues with portions of roads whose other portions lie
outside the distance we're shooting for. To solve this, we're going to
split each road into 100ft segments.


source and target are used by pgr_drivingDistance to connect
segments, so we need to ensure that they remain consistent. Examining
the data shows that the max edge and vertex ids are 6 digits long.


census=# select max(gid) as max_gid, max(source) as max_source, max(target) as max_target from osmpgrouting.ways_3365;
 max_gid | max_source | max_target
---------+------------+------------
  287229 |     216247 |     216247
(1 row)

Additionally, gid is a bigint, which maxes out at 9223372036854775807,
which is 19 digits long. As such, I can create a new vertex id that is
the “packed” version of the old vertex id, the edge id, and the sequence
in the split. For example, if I had an edge with gid=287229 and a
source=216247, and it’s the 4th segment after splitting, I can call
this new vertex 287229216247000004. Adding some special logic to keep
the original source and targets the same as they were in the
original data (but formatted appropriately), we can still use the
ways_vertices_pgr vertex table as well. (Note that I got a good
portion/idea for this query from the
ST_LineSubString
documentation.) (This does take a while to run.)


create table osmpgrouting.ways_3365_split100 as
select (to_char(gid, 'FM000000') || to_char(n, 'FM000000'))::bigint as gid,
       osm_id, tag_id,
       st_length(new_geom) as length,
       name,
       cast(case when n = 0
                 then to_char(source, 'FM000000') || '000000000000'
                 else to_char(source, 'FM000000') || to_char(gid, 'FM000000') || to_char(n, 'FM000000')
            end as bigint) as source,
       cast(case when end_percent = 1
                 then to_char(target, 'FM000000') || '000000000000'
                 else to_char(source, 'FM000000') || to_char(gid, 'FM000000') || to_char(n+1, 'FM000000')
            end as bigint) as target,
       st_length(new_geom) as cost,
       st_length(new_geom) as reverse_cost,
       new_geom as the_geom
from (select *, greatest(length, 1) as orig_length from osmpgrouting.ways_3365) as x
cross join generate_series(0, 10000) as n
join lateral (
  select 100.00*n/orig_length as start_percent,
         case when 100.00*(n+1) < orig_length then 100.00*(n+1)/orig_length else 1 end as end_percent
) y on n*100.00/orig_length < 1
join lateral (
  select st_linesubstring(the_geom, start_percent, end_percent) as new_geom
) z on true;

-- Some lines end up as a single point. Something to look into more :/
delete from osmpgrouting.ways_3365_split100 where ST_GeometryType(the_geom) = 'ST_Point';
alter table osmpgrouting.ways_3365_split100 add primary key (gid);
alter table osmpgrouting.ways_3365_split100 owner to census;
alter table osmpgrouting.ways_3365_split100 alter column the_geom type geometry(linestring, 3365) using st_setsrid(the_geom, 3365);
create index ways_3365_split100_the_geom_idx on osmpgrouting.ways_3365_split100 using gist(the_geom);
create index ways_3365_split100_source_idx on osmpgrouting.ways_3365_split100(source);
create index ways_3365_split100_target_idx on osmpgrouting.ways_3365_split100(target);
vacuum full osmpgrouting.ways_3365_split100;

create table osmpgrouting.ways_vertices_pgr_3365_split100 as select * from osmpgrouting.ways_vertices_pgr_3365;
update osmpgrouting.ways_vertices_pgr_3365_split100 set id = (to_char(id, 'FM000000') || '000000000000')::bigint;
create index ways_vertices_pgr_3365_split100_the_geom_idx on osmpgrouting.ways_vertices_pgr_3365_split100 using gist(the_geom);
vacuum full osmpgrouting.ways_vertices_pgr_3365_split100;


Now, we can run our walkshed calculation again. However, we can clean it
up a little as well, now that we have much smaller segments to work
with. Again, just for S18340 because we want to iterate quickly.


create table stops_walkshed_s18340_split100 as
with all_within as (
  select stop_id, dd.*
  from gtfs.stops_3365
  join lateral (
    select seq, node, edge, cost, agg_cost
      from pgr_drivingDistance(
            'SELECT gid as id, source, target, cost, reverse_cost ' ||
            'FROM osmpgrouting.ways_3365_split100 ' ||
            'where st_dwithin(' ||
              'ST_GeomFromEWKB(decode(''' || stops_3365.the_geom::text  || ''', ''hex'')), ' ||
              'ways_3365_split100.the_geom, ' ||
              '5280)',
        (select id
          from osmpgrouting.ways_vertices_pgr_3365_split100
          order by ways_vertices_pgr_3365_split100.the_geom <-> stops_3365.the_geom
          limit 1),
        5280
    )
  ) dd
    on true
  where stop_id = 'S18340'
), all_within_simplified as (
  select stop_id, node, max(agg_cost) as agg_cost
  from all_within
  group by stop_id, node
)
select
  gid, stop_id, agg_cost, the_geom
from all_within_simplified
join osmpgrouting.ways_3365_split100
  on node in (source, target);

Below is a much more accurate representation of the walkshed in red,
which was computed after splitting the road into segments. The orange
was the “¼ mi” walkshed before we split the road into segments,
and it misses a good portion of territory.


Comparison of walkshed before and after split


Since we computed the walkshed out to 1mi, and we stored the aggregate
distance for each segment, we can use that to build heat maps of what’s
within a certain distance of a stop.
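For example, something like the following (band cut-offs chosen arbitrarily here) gives each segment a label that QGIS can color by:


-- Sketch: label each walkshed segment with a distance band for styling.
select gid, stop_id, agg_cost, the_geom,
       case when agg_cost <= 1320 then '1/4 mi'
            when agg_cost <= 2640 then '1/2 mi'
            when agg_cost <= 3960 then '3/4 mi'
            else '1 mi'
       end as band
from stops_walkshed_s18340_split100;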


Coloring Walkshed by distance


Let's now join the walkshed against our block data. The road
features themselves might not overlap a block: since we're using
different data sources, it's highly likely that a road sits on one
side or the other of the block boundary. What we can do is build an
arbitrary buffer along the road segment, i.e. make it "thicker", and
test whether the buffer overlaps a block. I chose 10ft. I also chose to
filter only for blocks within ¼mi, and not to build the distance
metrics into the block.


create table blocks_s18340 as 
select b.* 
from tiger2017.tabblock_ac_3365 b 
join stops_walkshed_s18340_split100 ws 
  on st_overlaps(b.geom, st_buffer(ws.the_geom, 10))
where ws.agg_cost <= 1230;

The blocks themselves are delineated by a road or one of the white
lines. Each block that was adjacent to one of the red lines is now
selected in dark yellow. (As opposed to the rest of the blocks in a
lighter yellow, mustard if you will.) This does add a bit to what the
“walkshed” technically is, but at this point I’m not willing to
interpolate population by building an arbitrary buffer over the
walkshed, as when blocks are large it's likely that the population is
not spread evenly across them and interpolation can give rather odd
results. (See
Interpolating Old Census Data
for some discussion of those issues.) Using adjacent blocks is more than
sufficient for our purposes, and at worst will overestimate slightly.


Quarter mile walkshed with adjacent blocks


Now, let’s remove the where stop_id = 'S18340' and compute the
walkshed and adjacent blocks for each stop!


create table stops_walkshed_5280 as
with all_within as (
  select stop_id, dd.*
  from gtfs.stops_3365
  join lateral (
    select seq, node, edge, cost, agg_cost
      from pgr_drivingDistance(
            'SELECT gid as id, source, target, cost, reverse_cost ' ||
            'FROM osmpgrouting.ways_3365_split100 ' ||
            'where st_dwithin(' ||
              'ST_GeomFromEWKB(decode(''' || stops_3365.the_geom::text  || ''', ''hex'')), ' ||
              'ways_3365_split100.the_geom, ' ||
              '5280)',
        (select id
          from osmpgrouting.ways_vertices_pgr_3365_split100
          order by ways_vertices_pgr_3365_split100.the_geom <-> stops_3365.the_geom
          limit 1),
        5280
    )
  ) dd
    on true
), all_within_simplified as (
  select stop_id, node, max(agg_cost) as agg_cost
  from all_within
  group by stop_id, node
)
select gid, stop_id, agg_cost, the_geom
from all_within_simplified
join osmpgrouting.ways_3365_split100 on node in (source, target);

delete from stops_walkshed_5280 where ST_GeometryType(the_geom) = 'ST_Point';
vacuum full stops_walkshed_5280;
alter table stops_walkshed_5280 alter column the_geom type geometry(linestring, 3365) using st_setsrid(the_geom, 3365);
create index stops_walkshed_5280_the_geom_idx on stops_walkshed_5280 using gist (the_geom);

create table stop_walkshed_blocks as
with stop_walkshed_buffer as (
  select stop_id, st_buffer(st_collect(the_geom), 10) as geom
  from stops_walkshed_5280 ws
  where ws.agg_cost <= 1230
  group by stop_id
)
select b.gid, b.geoid10, stop_id, b.geom
from stop_walkshed_buffer swb
join tiger2017.tabblock_ac_3365 b on st_overlaps(b.geom, swb.geom);


(The reason I used a CTE is that I wanted to limit the number of blocks
joined against. I didn't exhaustively test performance against the more
naïve query of


create table stop_walkshed_blocks_2 as
  select b.gid, b.geoid10, stop_id, b.geom
  from tiger2017.tabblock_ac_3365 b
  join stops_walkshed_5280 ws
    on st_overlaps(b.geom, st_buffer(ws.the_geom, 10))
  where ws.agg_cost <= 1230;

as the CTE version ran fast enough and I was running out of time allotted for
this step.)


And Voilà, we have our walkshed.


Walkshed


Now, to calculate the area of the county served.


select
(select sum(st_area(geom)) from (select geom from stop_walkshed_blocks group by geom)x)
/
(select sum(st_area(geom)) from tiger2017.tabblock_ac_3365);

Which yields just over 34% of the county by area.


While 34% of the area seems low, I expect the percent of the population served
to be substantially higher, as the parts of the county not covered in pink have
a much lower density (and in some cases are very much rural). In the next part,
we'll import the census records and compute the percent of the population served.