Data analysis demo¶
This page is a demo analysis of a day of 7 train arrival times from June
1 2019, using data from the MTA GTFS-RT
archive processed
using gtfs_tripify
. This tutorial will show you how to work with
arrival logbooks.
Although not required, it is helpful to have already skimmed the Tutorial, which demonstrates how logbooks are built. For the purposes of this demo we will use a dataset we prepared in advance.
import requests
import pandas as pd
pd.options.display.float_format = '{:.0f}'.format
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import seaborn as sns
trains = pd.read_csv('https://raw.githubusercontent.com/ResidentMario/gtfs-tripify-demo-data/master/7_trains.csv')
trains.head()
trip_id | route_id | action | minimum_time | maximum_time | stop_id | latest_information_time | stop_name | unique_trip_id | |
---|---|---|---|---|---|---|---|---|---|
0 | 042200_7..N | 7 | STOPPED_OR_SKIPPED | 1559386833 | 1559386852 | 726N | 1559386852 | 34 St - 11 Av | 24746f24-b4a2-11e9-9912-8c8590adc94b |
1 | 042200_7..N | 7 | STOPPED_OR_SKIPPED | 1559387014 | 1559387034 | 725N | 1559387034 | Times Sq - 42 St | 24746f24-b4a2-11e9-9912-8c8590adc94b |
2 | 042200_7..N | 7 | STOPPED_OR_SKIPPED | 1559387124 | 1559387144 | 724N | 1559387144 | 5 Av | 24746f24-b4a2-11e9-9912-8c8590adc94b |
3 | 042200_7..N | 7 | STOPPED_OR_SKIPPED | 1559387214 | 1559387234 | 723N | 1559387234 | Grand Central - 42 St | 24746f24-b4a2-11e9-9912-8c8590adc94b |
4 | 042200_7..N | 7 | STOPPED_OR_SKIPPED | 1559387394 | 1559387414 | 721N | 1559387414 | Vernon Blvd - Jackson Av | 24746f24-b4a2-11e9-9912-8c8590adc94b |
trains.stop_name.value_counts(dropna=False).sort_values(ascending=False).plot.bar(
color='steelblue', figsize=(12, 6),
title='7 Train Stops Made by Station'
)
<matplotlib.axes._subplots.AxesSubplot at 0x1161105c0>
The 7 train is split into a local service which makes all stops and an express service which makes an expedited set of stops during rush hours only. The rush hour trains skipped the second of stops in this dataset: 33 Street, 46 Street, and so on. Looking at this chart, we can see that there were around 275 or so total 7 train trips that day. About half of those trips ran as local trains and half ran as express trains.
The nan
values here are for stop IDs with no corresponding stop name
in the GTFS record. This seems like a data error on the MTA’s part;
maybe these stops are for the train yard?
trains.stop_id.map(lambda stop_id: 'Northbound' if 'N' in stop_id else 'Southbound').value_counts().plot.bar(
title='7 Train Stops Made by Service Direction'
)
<matplotlib.axes._subplots.AxesSubplot at 0x119bf6940>
The 7 train made over 3500 southbound stops that day, but only a touch over 2100 northbound ones. Even though the 7 train is primarily east-west, the MTA codes all of its train lines as north-west. In this case “north” means “west” (train trips starting in Flushing, Queens and ending at Hudson Yards in Manhattan) and “south” means “east” (train trips starting at Hudson Yards and ending in Flushing, Queens).
trains.groupby('unique_trip_id').route_id.head(1).value_counts()
7 332
7X 3
Name: route_id, dtype: int64
The route_id
identifies the headsign a train is running under. Just
three of the trains that ran that day were identified as 7X
, e.g. “7
express” trains. Most of the trains that ran the express service were
identified as regular 7
trains in the data stream.
trains.action.value_counts().plot.bar(title='7 Train Stops Made by Type')
<matplotlib.axes._subplots.AxesSubplot at 0x116137128>
It may seem strange that this record contains only
STOPPED_OR_SKIPPED
records, which tell us that a train passed
through a station, and no STOPPED_AT
records, which tell us that a
train stopped at a station for sure. However, this is just a consequence
of how the MTA codes their systems. Messages about the 7 train always
jump straight from “en route to this station” to “en route to the next
station”. It’s almost completely safe to consider a
STOPPED_OR_SKIPPED
record as evidence that the train actually
stopped at that station; a train skipping a scheduled stop can only
occur due to operator error.
trains.assign(n=0).groupby('unique_trip_id').agg(len).n.value_counts().sort_index().plot.bar(
figsize=(12, 8), color='steelblue', title='7 Train Trips by Length'
)
<matplotlib.axes._subplots.AxesSubplot at 0x112ed5780>
If we look at the trips that were made based on their length, we can clearly see two peaks: one at 22, the number of stops typically made by local trains, and one at 12, the number of stops typically made by express trains.
Trains may make fewer or more stops per trip due to delays or incidents forcing trains to skip certain stops. This is likely the reason for the relatively large number of train that made 11 stops that day, for example, instead of 12.
However, because of how the MTA codes its systems, it’s also possible
for a train to seem to make fewer or more stops than it actually made
due to trip fragmentation. This occurs when the MTA scrubs a
schedule whose estimates have become inaccurate, usually due to delays,
and inserts a new schedule with a new trip_id
instead. Since it’s
not always possible to detect when this has happened, and the result is
that a single train trip will get cut into two (or more!) pieces.
For example, consider the 20 or so case where a train seemed to make only 1 stop. This occurs when the MTA scrubs a schedule for a train that is currently in a station, assigns a new one, and then scrubs that schedule as well.
The gt.ops.cut_cancellations
method we ran earlier removes stops
that are actually artifacts of trip fragmentation. This is discussed in
more detail in the section of the documentation on Additional
methods.
But due to the way the data is formatted, it’s often not possible to
patch over train trips that are split into several different segments.
The amount of trip fragmentation varies from line to line, and is especially bad on days in which trains suffer heavy dealys. Since every 7 train begins and ends its service at either Flushing – Main Street or 34 St - 11 Avenue, depending on the heading (northbound or southbound) of the train, checking fragmentation is relatively easy:
def check_if_trip_is_complete(df):
first_and_last = df.iloc[[0, -1]].stop_name
if pd.isnull(first_and_last).any():
return False
else:
return list(sorted(df.iloc[[0, -1]].stop_name)) == ['34 St - 11 Av', 'Flushing - Main St']
(trains.groupby('unique_trip_id')
.apply(check_if_trip_is_complete)
.value_counts().plot.bar())
<matplotlib.axes._subplots.AxesSubplot at 0x116738c18>
Two thirds of the train trips included in this dataset are complete—e.g. they record all of the stops the train took, from the first stop all the way to the last. The remaining one-third of stops are incomplete; they record trips that have been split into two or more segments.
A schedule getting scrubbed almost always means that the train is experiencing significant delays. By looking at what station the train was sitting in or going to right before the schedule was scrubbed, we can see which stops trains tended to get delayed at.
def get_problematic_station(df):
first_and_last = df.iloc[[0, -1]].stop_name
if pd.isnull(first_and_last).any():
return None
elif not list(sorted(df.iloc[[0, -1]].stop_name)) == ['34 St - 11 Av', 'Flushing - Main St']:
last_station = df.iloc[-1].stop_name
if last_station == 'Flushing - Main St' or last_station == '34 St - 11 Av':
return None
else:
return df.iloc[-1].stop_name
(trains.groupby('unique_trip_id')
.apply(get_problematic_station)
.value_counts().plot.bar(color='steelblue', figsize=(12, 8), title='7 Train Delays by Station'))
<matplotlib.axes._subplots.AxesSubplot at 0x1161dc630>
It looks like many of the delays that day occurred near Queensbororo Plaza.
The Ibry chart is a structured visualization of train stops across time and station that was famously applied to Paris-Lyon train line in France, and which later appeared on the cover of Edward Tufte’s seminal book on data visualization, “The Visual Display of Quantitative Information”. The code that follows builds two such charts for our 7 train data—one for northbound (Flushing to Hudson Yards) trips, and one for southbound (Hudson Yard to Flushing) trips.
sns.set_style('white')
stop_sequence = [
'Flushing - Main St',
'Mets - Willets Point',
'111 St',
'103 St - Corona Plaza',
'Junction Blvd',
'90 St - Elmhurst Av',
'82 St - Jackson Hts',
'74 St - Broadway',
'69 St',
'Woodside - 61 St',
'52 St',
'46 St',
'40 St',
'33 St',
'Queensboro Plaza',
'Court Sq',
'Hunters Point Av',
'Vernon Blvd - Jackson Av',
'Grand Central - 42 St',
'5 Av',
'Times Sq - 42 St',
'34 St - 11 Av'
]
def plot_trains(trains, title):
estimated_times = []
for min_timestamp, max_timestamp in zip(trains.minimum_time, trains.maximum_time):
estimated_times.append(
min_timestamp + (max_timestamp - min_timestamp) / 2
)
timetable = pd.pivot_table(
trains.assign(estimated_arrival_time=estimated_times),
index='unique_trip_id',
columns='stop_name',
values='minimum_time'
).T.reindex(stop_sequence)
ax = timetable.rename_axis(None).plot.line(legend=False, color='black', linewidth=1, figsize=(16, 24))
ax.axvline(0, color='black', linewidth=1)
ax.axvline(21, color='black', linewidth=1)
plt.xticks([0 , 22], [stop_sequence[0], stop_sequence[21]], fontsize=16)
plt.yticks([], [])
plt.title(title, fontsize=16)
sns.despine(left=True, bottom=True)
Time on the left axis, and stop number on the bottom:
northbound_trains = trains[trains.stop_id.map(lambda v: 'N' in v)]
plot_trains(northbound_trains, 'Northbound 7 Train Service, June 1 2019')
All but one of the northbound 7 trains ran express that day! This
explains why there were so many trips that were 12 stops in length, but
which had the local service 7
headsign instead of the express
service 7X
headsign.
Most trips are neatly sequential, as you would expect. Service looked very smooth that day overall; there were some delays, demarkated by increasing slopes in the lines, but the trains were mostly able to maintain a consistent headway (time between train arrivals) throughout the day. We can also see how there were far fewer trains running in the early morning hours e.g. between midnight and around 7 AM, then there were later in the day. The number of trains was otherwise pretty consistent across all times.
We can see some data and library artifacts in this plot:
Trains seem to cross over one another going into the last stop in the line, Flushing - Main Street. This occurs because trains are staying in the data stream after arriving at the last stop, which throws off the accuracy of our naive estimate (which is just
minimum_time + (maximum_time - minimum_time) / 2
).Some trains seem to jump back in time on arrival to their final station; this is a data error that I’m still investigating.
Some train trips are discontiguous, indicating places where the service schedule got scrubbed and rebuilt.
Trains are sometimes added to the feed one-at-a-time, sometimes in groups. Some of the trains were added to the feed seconds before they set off, causing us to miss their first stop, 34 St - 11 Av, completely.
If we look at southbound 7 train service, we see a different picture:
southbound_trains = trains[trains.stop_id.map(lambda v: 'S' in v)]
plot_trains(southbound_trains, 'Southbound 7 Train Service, June 1 2019')
In contrast to the northbound 7 trains, all of the southbound 7 trains that day made all local stops. This makes sense, as it would impossible otherwise to access any of the stations that the northbound trains skipped.
The Flushing - Main Street dispatch, like the 34 St - 11 Av dispatch, sometimes adds multiple trains at once to the feed. They also add trains seconds before they leave the station, but seem to do so more rarely than the 34 St - 11 Av dispatch does.
For the most part, we see the same smooth headway pattern we saw with the northbound trains, indicating that this was a pretty good service day for the southbound 7 trains as well.
That concludes this short demo. For more example applications of this data stream, see the section Further reading. To learn how to roll a dataset like this yourself, see the Tutorial section.