This blog is about my musings and thoughts. I hope you find it useful, at most, and entertaining, at least.
Date: 2019-11-12
Tags: postgis postgres gis gtfs transit
For reasons I'll get into in another post, I wanted to have a rough idea of the distribution of the distance between stops. To do this I used the GTFS import for the Port Authority of Allegheny County GTFS that I did earlier this year when calculating the service area.
A quick note about projections. Since I had created a table with a geometry in the Pennsylvania South (ftUS) state plane( EPGS 3365), any distance calculations I do will be done in feet.
A quick, rough and dirty experiment went like this:
create table stop_dist_line as
select stop.stop_id,
stop.stop_name,
prev.stop_id as prev_stop_id,
prev.stop_name as prev_stop_name,
-- Just using euclidean distance right now.
-- This could be improved by using the on-the-street distance.
stop.the_geom <-> prev.the_geom as dist,
st_setsrid(st_makeline(stop.the_geom, prev.the_geom), 3365) as the_geom
from (
-- This gets unique pairs of stops, regardless of ordering.
select distinct least(stop_id, prev_stop_id) as prev_stop_id,
greatest(stop_id, prev_stop_id) as stop_id
-- This gets all pairs of stops. In theory there could be duplicate pairs
-- based on direction.
from (select stop_id,
lag(stop_id, 1) over stop_order prev_stop_id
from gtfs.stop_times
join gtfs.trips using (trip_id)
window stop_order as (partition by trip_id, direction_id order by stop_sequence)
) unordered_pairs
) pairs
join gtfs.stops_3365 stop on stop.stop_id = pairs.stop_id
join gtfs.stops_3365 prev on prev.stop_id = pairs.prev_stop_id;
-- Remove segments that don't have a line because there is no
-- previous stop to have a line or have a distance from.
delete from stop_dist_line where prev_stop_id is null;
-- Register the new geometry column
select Populate_Geometry_Columns();
-- Add a PK to play more nicely with QGIS
alter table stop_dist_line add id serial primary key;
-- Add an index on the geometry.
create index on stop_dist_line using gist(the_geom);
We can then get some idea of the distribution by bucketing by 100s.
select dist_group,
ttlsf,
ttl,
round(ttlsf/ttl, 2) as perc,
c
from
(select dist_group,
sum(c) over (rows between unbounded preceding and current row) as ttlsf,
sum(c) over () as ttl,
c
from (
select 100*floor(dist / 100) as dist_group,
count(*) as c
from stop_dist_line
group by dist_group
order by dist_group) x)y;
Giving
dist_group | ttlsf | ttl | perc | c
------------+-------+------+------+------
0 | 149 | 7732 | 0.02 | 149
100 | 223 | 7732 | 0.03 | 74
200 | 681 | 7732 | 0.09 | 458
300 | 1568 | 7732 | 0.20 | 887
400 | 2528 | 7732 | 0.33 | 960
500 | 3579 | 7732 | 0.46 | 1051
600 | 4462 | 7732 | 0.58 | 883
700 | 5135 | 7732 | 0.66 | 673
800 | 5589 | 7732 | 0.72 | 454
900 | 5946 | 7732 | 0.77 | 357
1000 | 6234 | 7732 | 0.81 | 288
1100 | 6456 | 7732 | 0.83 | 222
1200 | 6628 | 7732 | 0.86 | 172
1300 | 6766 | 7732 | 0.88 | 138
1400 | 6852 | 7732 | 0.89 | 86
1500 | 6921 | 7732 | 0.90 | 69
...
However, because PostgreSQL is awesome, we can also just compute percentiles directly.
select percentile_cont(0.25) within group (order by dist) as p_25,
percentile_cont(0.35) within group (order by dist) as p_35,
percentile_cont(0.45) within group (order by dist) as p_45,
percentile_cont(0.50) within group (order by dist) as p_50
from stop_dist_line;
Giving
p_25 | p_35 | p_45 | p_50
------------------+------------------+------------------+------------------
438.775035648918 | 518.085656645151 | 590.352075165022 | 631.007847560984
Next steps would be to correct many of the little corners cut, e.g. not using on-the-street distance, and rerunning the calculations.