Quickstart

gtfs-tripify is a CLI and Python library for transforming archival GTFS-Realtime messages into a tabular dataset of historical vehicle arrival and departure times. In this section of the documentation, we will build a quick demonstration dataset using the gtfs-tripify command-line interface.

To begin, make sure that you have Python 3.6 or newer installed and active. Then install gtfs-tripify from your command line using pip, the package manager bundled with Python:

pip install gtfs_tripify

We will also need some data. For the purposes of this demo, we’ll use some example data from the MTA archive (this code snippet uses the curl Unix utility; Windows users, use a curl alternative or download these files by hand):

curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31
curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-36
curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-41
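
For readers without curl, the same three files can be fetched with Python's standard library. The following is a minimal sketch rather than part of gtfs-tripify: the helper names snapshot_urls and download_snapshots are hypothetical, and only the URLs come from the commands above.

```python
import urllib.request
from pathlib import Path

# Bucket and file names are taken from the curl commands above.
BASE_URL = "https://datamine-history.s3.amazonaws.com/"
SNAPSHOTS = [
    "gtfs-2014-09-17-09-31",
    "gtfs-2014-09-17-09-36",
    "gtfs-2014-09-17-09-41",
]


def snapshot_urls():
    """Return the full URL of each example snapshot (hypothetical helper)."""
    return [BASE_URL + name for name in SNAPSHOTS]


def download_snapshots(folder="."):
    """Save each snapshot into `folder` under its original file name."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in snapshot_urls():
        path = target / url.rsplit("/", 1)[-1]
        # urlretrieve writes the response body straight to disk.
        urllib.request.urlretrieve(url, str(path))
        saved.append(path)
    return saved
```

Calling download_snapshots() from a Python prompt performs the same three downloads as the curl commands.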

This will create three GTFS-Realtime files on your machine, each containing a snapshot of the state of the MTA system as of a certain date and time. We can turn these hard-to-read binary-encoded messages into a simple CSV table using the gtfs_tripify CLI:

gtfs_tripify logify ./ stops.csv --to csv --no-clean

This command tells gtfs_tripify to “logify” (transform into a tabular trip log) every GTFS-Realtime message in the current folder and write the result to stops.csv. The --to csv option sets the output format to CSV, and --no-clean tells gtfs_tripify not to drop partial trips from the file. We keep partial trips here because this demo uses only three snapshots taken five minutes apart, so few trips are observed from start to finish.

The resulting file looks something like this (the sample rows shown here omit the value for the trailing unique_trip_id column):

trip_id,route_id,action,minimum_time,maximum_time,stop_id,latest_information_time,unique_trip_id
131750_7..N,7,STOPPED_OR_SKIPPED,1559440299.0,1559440695.0,726N,1559440315
131750_7..N,7,STOPPED_OR_SKIPPED,1559440846.0,1559440860.0,725N,1559440860
131750_7..N,7,STOPPED_OR_SKIPPED,1559440936.0,1559440950.0,724N,1559440950
131750_7..N,7,STOPPED_OR_SKIPPED,1559441016.0,1559441030.0,723N,1559441030
131750_7..N,7,STOPPED_OR_SKIPPED,1559441211.0,1559441226.0,721N,1559441226
131750_7..N,7,STOPPED_OR_SKIPPED,1559441291.0,1559441306.0,720N,1559441306
131750_7..N,7,STOPPED_OR_SKIPPED,1559441411.0,1559441426.0,719N,1559441426
131750_7..N,7,STOPPED_OR_SKIPPED,1559441561.0,1559441591.0,718N,1559441591
131750_7..N,7,STOPPED_OR_SKIPPED,1559441942.0,1559441956.0,712N,1559441956

This dataset has the following schema:

  • trip_id: The ID assigned to the trip in the GTFS-Realtime record.

  • route_id: The ID of the route (e.g. the 2 train or the 3 train).

  • stop_id: The ID assigned to the stop in question. To resolve this value to a stop name, see the GTFS file for this transit system. The MTA, for example, hosts this file at https://api.mta.info/.

  • action: The action that the given train took at the given stop. One of STOPPED_AT, STOPPED_OR_SKIPPED, or EN_ROUTE_TO.

  • minimum_time: The earliest time at which the vehicle could have arrived at the stop, as a Unix timestamp. If the vehicle is already at a station in the first snapshot included in the feed parse, this value will be set to null.

  • maximum_time: The latest time at which the vehicle could have departed from the stop, as a Unix timestamp. If the trip cuts out before the vehicle has arrived at some of its stations, this value will be set to null.

  • latest_information_time: The timestamp of the most recent GTFS-Realtime data feed containing information pertinent to this record. Unix timestamp.
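
To make the schema concrete, here is a minimal sketch of reading the file with Python's standard library and converting the Unix timestamps into datetimes. It uses two rows copied from the sample output above. The helper names parse_time and read_stops are hypothetical, and the sketch assumes that a null is written as an empty field; check your own output before relying on that.

```python
import csv
import io
from datetime import datetime, timezone

# Two rows copied from the sample output above. For a real run, pass
# open("stops.csv", newline="") to read_stops instead of this StringIO.
SAMPLE = """\
trip_id,route_id,action,minimum_time,maximum_time,stop_id,latest_information_time,unique_trip_id
131750_7..N,7,STOPPED_OR_SKIPPED,1559440299.0,1559440695.0,726N,1559440315
131750_7..N,7,STOPPED_OR_SKIPPED,1559440846.0,1559440860.0,725N,1559440860
"""


def parse_time(value):
    """Convert a Unix timestamp string to a timezone-aware UTC datetime.

    Assumption: a null appears as an empty field. DictReader also yields
    None for a column that is missing from a short row.
    """
    if value is None or value.strip() == "":
        return None
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


def read_stops(handle):
    """Return one dict per stop record, with times parsed."""
    rows = []
    for record in csv.DictReader(handle):
        earliest = parse_time(record["minimum_time"])
        latest = parse_time(record["maximum_time"])
        # Width of the interval between minimum_time and maximum_time,
        # in seconds, when both bounds are known.
        window = None
        if earliest is not None and latest is not None:
            window = (latest - earliest).total_seconds()
        rows.append({
            "trip_id": record["trip_id"],
            "stop_id": record["stop_id"],
            "action": record["action"],
            "earliest": earliest,
            "latest": latest,
            "window_seconds": window,
        })
    return rows


stops = read_stops(io.StringIO(SAMPLE))
for stop in stops:
    print(stop["stop_id"], stop["earliest"], stop["window_seconds"])
```

The window_seconds value is only a rough measure of how tightly the snapshots pin down the stop; it is not a field produced by gtfs_tripify itself.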

That concludes this brisk introduction. For a more detailed demo, see the tutorial. To get a better idea of what you can do with this data, see the data analysis demo.