Quickstart¶
gtfs-tripify
is a CLI and Python library for transforming archival GTFS-Realtime messages into
a tabular dataset of historical vehicle arrival and departure times. In this section of the
documentation, I will build a quick demonstration dataset using the gtfs-tripify
command-line
interface.
To begin, make sure that you have Python 3.6 or newer installed and active. Then run the
following pip
package manager (comes included) command from your command line to install
gtfs-tripify
:
pip install gtfs_tripify
We will also need some data. For the purposes of this demo, we’ll use some example data from the
MTA archive (this code snippet uses the curl
Unix utility; Windows users, use a curl
alternative or download these files by hand):
curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31
curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-36
curl -O https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-41
This will create three GTFS-Realtime files on your machine, each containing a snapshot of the
state of the MTA system as of a certain date and time. We can turn these hard-to-read
binary-encoded messages into a simple CSV table using the gtfs_tripify
CLI:
gtfs_tripify logify ./ stops.csv --to csv --no-clean
This command tells gtfs_tripify
to “logify” (transform into a tabular trip log) every
GTFS-Realtime message in the current folder and output the result to stops.csv
. --to csv
instructs gtfs_tripify
to output the data in a CSV format, and --no-clean
instructs
gtfs_tripify
not to drop partial trips from the file (we are using just fifteen minutes of
data in this demo).
The resulting file looks something like this:
trip_id,route_id,action,minimum_time,maximum_time,stop_id,latest_information_time,unique_trip_id
131750_7..N,7,STOPPED_OR_SKIPPED,1559440299.0,1559440695.0,726N,1559440315
131750_7..N,7,STOPPED_OR_SKIPPED,1559440846.0,1559440860.0,725N,1559440860
131750_7..N,7,STOPPED_OR_SKIPPED,1559440936.0,1559440950.0,724N,1559440950
131750_7..N,7,STOPPED_OR_SKIPPED,1559441016.0,1559441030.0,723N,1559441030
131750_7..N,7,STOPPED_OR_SKIPPED,1559441211.0,1559441226.0,721N,1559441226
131750_7..N,7,STOPPED_OR_SKIPPED,1559441291.0,1559441306.0,720N,1559441306
131750_7..N,7,STOPPED_OR_SKIPPED,1559441411.0,1559441426.0,719N,1559441426
131750_7..N,7,STOPPED_OR_SKIPPED,1559441561.0,1559441591.0,718N,1559441591
131750_7..N,7,STOPPED_OR_SKIPPED,1559441942.0,1559441956.0,712N,1559441956
This dataset has the following schema:
trip_id
: The ID assigned to the trip in the GTFS-Realtime record.route_id
: The ID of the route (e.g.2
train,3
train, etcetera).stop_id
: The ID assigned to the stop in question. To resolve this value to stop names, please see the GTFS file for this transit system. The MTA for example hosts this file at https://api.mta.info/.action
: The action that the given train took at the given stop. One ofSTOPPED_AT
,STOPPED_OR_SKIPPED
, orEN_ROUTE_TO
.minimum_time
: The minimum arrival time. Unix timestamp. If the first snapshot included in the feed parse has this vehicle already at a station, this value will be set to null.maximum_time
: The maximum departure time. If the trip cuts out without a vehicle having arrived at some of its stations this value will be set to null.latest_information_time
: The timestamp of the most recent GTFS-Realtime data feed containing information pertinent to this record. Unix timestamp.
That concludes this brisk introduction. For a more detailed demo see the tutorial. To get a better idea of what you can do with this data, see the data analysis demo.