gtfs-tripify is a CLI and Python library for transforming archival GTFS-Realtime messages into
a tabular dataset of historical vehicle arrival and departure times. In this section of the
documentation, I will build a quick demonstration dataset using the gtfs_tripify CLI.
To begin, make sure that you have Python 3.6 or newer installed and active. Then run the
following pip command from your command line to install gtfs_tripify (pip comes bundled with Python):
pip install gtfs_tripify
We will also need some data. For the purposes of this demo, we’ll use some example data from the
MTA archive (this snippet uses the curl Unix utility; Windows users can use an
equivalent tool or download these files by hand):
curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31
curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-36
curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-41
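For readers without curl, the same three downloads can be sketched with Python's standard library. This is a minimal sketch, not part of gtfs_tripify: the helper names snapshot_urls and download_snapshots are hypothetical, and only the URLs come from the commands above.

```python
import urllib.request
from pathlib import Path

# Base URL and snapshot names are copied from the curl commands above.
BASE_URL = "https://datamine-history.s3.amazonaws.com"
SNAPSHOTS = [
    "gtfs-2014-09-17-09-31",
    "gtfs-2014-09-17-09-36",
    "gtfs-2014-09-17-09-41",
]


def snapshot_urls(base_url=BASE_URL, names=SNAPSHOTS):
    """Return (url, filename) pairs, one per snapshot (hypothetical helper)."""
    return [(f"{base_url}/{name}", name) for name in names]


def download_snapshots(dest_dir="."):
    """Save each snapshot into dest_dir under its remote name, as `curl -O` does."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url, name in snapshot_urls():
        # urlretrieve writes the response body to the given local path.
        urllib.request.urlretrieve(url, str(dest / name))
```

Calling download_snapshots() from a Python session in your working folder should leave the same three files on disk as the curl commands.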
This will create three GTFS-Realtime files on your machine, each containing a snapshot of the
state of the MTA system as of a certain date and time. We can turn these hard-to-read
binary-encoded messages into a simple CSV table using the gtfs_tripify logify command:
gtfs_tripify logify ./ stops.csv --to csv --no-clean
This command tells gtfs_tripify to “logify” (transform into a tabular trip log) every
GTFS-Realtime message in the current folder and write the result to stops.csv.
The --to csv flag tells gtfs_tripify to output the data in CSV format, and the
--no-clean flag tells gtfs_tripify not to drop partial trips from the file (we are using
just fifteen minutes of data in this demo).
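If you would rather drive this step from a script, one option is to call the same command through Python's subprocess module. The sketch below only wraps the CLI invocation shown above and does not use the library's Python API; build_logify_command and run_logify are hypothetical helper names.

```python
import subprocess


def build_logify_command(input_dir="./", output_path="stops.csv"):
    """Assemble the CLI call shown above as an argument list (hypothetical helper)."""
    return [
        "gtfs_tripify", "logify",
        input_dir,       # folder containing the GTFS-Realtime messages
        output_path,     # file the trip log is written to
        "--to", "csv",   # output format
        "--no-clean",    # keep partial trips in the output
    ]


def run_logify(input_dir="./", output_path="stops.csv"):
    """Run the command; check=True raises CalledProcessError on a nonzero exit."""
    subprocess.run(build_logify_command(input_dir, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems if the input or output path contains spaces.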
The resulting file looks something like this:
trip_id,route_id,action,minimum_time,maximum_time,stop_id,latest_information_time,unique_trip_id
131750_7..N,7,STOPPED_OR_SKIPPED,1559440299.0,1559440695.0,726N,1559440315
131750_7..N,7,STOPPED_OR_SKIPPED,1559440846.0,1559440860.0,725N,1559440860
131750_7..N,7,STOPPED_OR_SKIPPED,1559440936.0,1559440950.0,724N,1559440950
131750_7..N,7,STOPPED_OR_SKIPPED,1559441016.0,1559441030.0,723N,1559441030
131750_7..N,7,STOPPED_OR_SKIPPED,1559441211.0,1559441226.0,721N,1559441226
131750_7..N,7,STOPPED_OR_SKIPPED,1559441291.0,1559441306.0,720N,1559441306
131750_7..N,7,STOPPED_OR_SKIPPED,1559441411.0,1559441426.0,719N,1559441426
131750_7..N,7,STOPPED_OR_SKIPPED,1559441561.0,1559441591.0,718N,1559441591
131750_7..N,7,STOPPED_OR_SKIPPED,1559441942.0,1559441956.0,712N,1559441956
This dataset has the following schema:
trip_id: The ID assigned to the trip in the GTFS-Realtime record.
route_id: The ID of the route (e.g.
stop_id: The ID assigned to the stop in question. To resolve this value to stop names, please see the GTFS file for this transit system. The MTA for example hosts this file at https://api.mta.info/.
action: The action that the given train took at the given stop. One of
minimum_time: The minimum arrival time. Unix timestamp. If the first snapshot included in the feed parse has this vehicle already at a station, this value will be set to null.
maximum_time: The maximum departure time. Unix timestamp. If the trip cuts out without a vehicle having arrived at some of its stations, this value will be set to null.
latest_information_time: The timestamp of the most recent GTFS-Realtime data feed containing information pertinent to this record. Unix timestamp.
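To make the schema concrete, here is a minimal sketch that reads a trip log with Python's standard csv module and converts the Unix timestamps to UTC datetimes. It uses two rows from the sample output above. The helper names to_utc and read_trip_log are hypothetical, and the set of strings treated as null is an assumption, since how gtfs_tripify writes null values in CSV is not stated here.

```python
import csv
import io
from datetime import datetime, timezone

# Two rows copied from the sample output above.
SAMPLE = """trip_id,route_id,action,minimum_time,maximum_time,stop_id,latest_information_time,unique_trip_id
131750_7..N,7,STOPPED_OR_SKIPPED,1559440299.0,1559440695.0,726N,1559440315
131750_7..N,7,STOPPED_OR_SKIPPED,1559440846.0,1559440860.0,725N,1559440860
"""

TIME_COLUMNS = ("minimum_time", "maximum_time", "latest_information_time")
# Assumption: a null timestamp shows up as one of these strings (or is absent).
NULL_MARKERS = {"", "null", "nan"}


def to_utc(value):
    """Convert a Unix timestamp string to an aware UTC datetime, or None if null."""
    if value is None or value.strip().lower() in NULL_MARKERS:
        return None
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


def read_trip_log(handle):
    """Read trip log rows from an open text file, parsing the time columns."""
    rows = []
    for row in csv.DictReader(handle):
        for column in TIME_COLUMNS:
            row[column] = to_utc(row.get(column))
        rows.append(row)
    return rows


rows = read_trip_log(io.StringIO(SAMPLE))
first = rows[0]
# Width of the interval between the minimum arrival and maximum departure times.
window = first["maximum_time"] - first["minimum_time"]
print(first["stop_id"], window.total_seconds())
```

To run this against the file produced earlier, pass open("stops.csv", newline="") to read_trip_log in place of the in-memory sample. Columns missing from a row, such as unique_trip_id in the sample rows, come back as None from csv.DictReader.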
That concludes this brisk introduction. For a more detailed demo see the tutorial. To get a better idea of what you can do with this data, see the data analysis demo.