API Reference

tripify.logify()

Builds a logbook.

A logbook is a dict of pandas DataFrame “logs”, each of which contains all known information about a specific trip. Logbooks are writable to disk in CSV (gt.ops.to_csv) or GTFS (gt.ops.to_gtfs) format and contiguously mergeable (using gt.ops.merge_logbooks).

For further reference on logbooks, including a schema definition and code samples, refer to the online documentation at https://residentmario.github.io/gtfs-tripify/index.html.

Input should be a list of bytes objects corresponding to raw Protobuf messages. Already-parsed dict objects, as would be returned by dictify, are also accepted as a convenience for testing.

Output is a (logbook, timestamps, parse_errors) tuple. logbook is the resultant logbook. timestamps is a dict of logbook timestamps, which are required when merging logbooks. parse_errors is a list of non-fatal errors (schema violations or data corruption) discovered while building the logbook. For a reference on parse error types, see the corresponding section of the online documentation: https://residentmario.github.io/gtfs-tripify/parse_errors.html.
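
For instance, the parse_errors list can be triaged before further processing. A hypothetical sketch, assuming each entry is a dict with a 'type' key — the real error shape is described in the parse error reference linked above, and the type names below are placeholders:

```python
from collections import Counter

def summarize_parse_errors(parse_errors):
    """Count parse errors by type (hypothetical helper; assumes each
    error is a dict with a 'type' key)."""
    return Counter(err['type'] for err in parse_errors)

# Illustrative error records; these type names are placeholders,
# not the library's real error identifiers.
errors = [
    {'type': 'schema_violation'},
    {'type': 'data_corruption'},
    {'type': 'schema_violation'},
]
summary = summarize_parse_errors(errors)
```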

Data Cleaning

ops.discard_partial_logs()

Removes partial logs from a logbook. Example usage:

import gtfs_tripify as gt
import requests

response1 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31')
response2 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-36')
response3 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-41')
stream = [response1.content, response2.content, response3.content]

logbook, timestamps, parse_errors = gt.logify(stream)
logbook = gt.ops.discard_partial_logs(logbook)

Logbooks are constructed on a “time slice” of data. Trips that appear in the first or last message included in the time slice are necessarily incomplete. These incomplete trips may be:

  • Left as-is.

  • Completed by merging this logbook with a time-contiguous one (using gt.ops.merge_logbooks).

  • Partitioned out (using gt.ops.partition_on_incomplete).

  • Pruned from the logbook (using this method).

The best course of action is dependent on your use case.

ops.cut_cancellations()

Removes reassigned stops from a logbook. Example usage:

import gtfs_tripify as gt
import requests

response1 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31')
response2 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-36')
response3 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-41')
stream = [response1.content, response2.content, response3.content]

logbook, timestamps, parse_errors = gt.logify(stream)
logbook = gt.ops.cut_cancellations(logbook)

GTFS-Realtime messages from certain transit providers suffer from trip fragmentation: trains may be reassigned IDs and schedules mid-trip. gtfs_tripify naively assumes that trips that disappeared from the record in this way completed all of their remaining scheduled stops, even though they didn’t.

This method uses a best-effort heuristic to remove stops that almost assuredly did not happen from a logbook. cut_cancellations is robust if and only if the transition from the second-to-last stop to the last stop on the route takes more than $TIME_INTERVAL seconds, where $TIME_INTERVAL is the time between feed messages.

If this constraint is violated, either because the interval between the last two stops in the service is unusually short, or due to downtime in the underlying feed, some data will be unfixably ambiguous and may be lost.
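
The idea behind this heuristic can be illustrated on a single simplified trip log. This sketch is not the library's implementation; it assumes each stop record carries a boolean 'confirmed' flag indicating whether the feed ever confirmed the stop:

```python
def cut_unconfirmed_tail(stops):
    """Drop trailing stops that were never confirmed in the feed
    (a simplified sketch of the cut_cancellations idea).

    `stops` is a list of dicts with a boolean 'confirmed' field.
    Trailing unconfirmed stops are treated as likely cancellations.
    """
    last_confirmed = -1
    for i, stop in enumerate(stops):
        if stop['confirmed']:
            last_confirmed = i
    return stops[:last_confirmed + 1]

trip = [
    {'stop_id': '101N', 'confirmed': True},
    {'stop_id': '102N', 'confirmed': True},
    {'stop_id': '103N', 'confirmed': False},  # never confirmed: likely cancelled
]
trimmed = cut_unconfirmed_tail(trip)
```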

Merge and Partition

ops.merge_logbooks()

Given a list of trip logbooks and their corresponding timestamp data, performs a merge and returns the combined logbook and combined timestamps.

The input logbooks must be in time-contiguous order. In other words, the first logbook should cover the time slice (t(1), …, t(n)), the second the time slice (t(n + 1), …, t(n + m)), and so on.

Example usage:

import gtfs_tripify as gt
import requests

response1 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31')
response2 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-36')
response3 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-41')
response4 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-46')
response5 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-51')
response6 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-56')

stream1 = [response1.content, response2.content, response3.content]
stream2 = [response4.content, response5.content, response6.content]
logbook1, timestamps1, parse_errors1 = gt.logify(stream1)
logbook2, timestamps2, parse_errors2 = gt.logify(stream2)

logbook, timestamps = gt.ops.merge_logbooks([(logbook1, timestamps1), (logbook2, timestamps2)])
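
Before merging, the contiguity requirement can be sanity-checked. The helper below is hypothetical, not part of the library, and assumes each logbook's timestamps can be reduced to a sorted list of unix feed-update times:

```python
def is_time_contiguous(timestamp_lists):
    """Check that each logbook's feed-update times strictly follow
    the previous logbook's (hypothetical pre-merge sanity check)."""
    for earlier, later in zip(timestamp_lists, timestamp_lists[1:]):
        if not earlier or not later or max(earlier) >= min(later):
            return False
    return True

# Feed updates five minutes (300 seconds) apart:
slice1 = [1410946260, 1410946560, 1410946860]
slice2 = [1410947160, 1410947460, 1410947760]
```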

ops.partition_on_incomplete(logbook, timestamps)

Partitions incomplete logs in a logbook into a separate logbook. Incomplete logs are logs in the logbook for trips that were already in progress as of the first feed update included in the parsed messages, or were still in progress as of the last feed update included in the parsed messages.

This operation is useful when merging logbooks. See also gt.ops.discard_partial_logs. Example usage:

import gtfs_tripify as gt
import requests

response1 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31')
response2 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-36')
response3 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-41')
stream = [response1.content, response2.content, response3.content]
logbook, timestamps, parse_errors = gt.logify(stream)

complete_logbook, complete_timestamps, incomplete_logbook, incomplete_timestamps = \
    gt.ops.partition_on_incomplete(logbook, timestamps)

ops.partition_on_route_id(logbook, timestamps)

Partitions a logbook on route_id. Outputs a dict of logbooks and a dict of timestamps, each keyed on route ID. Example usage:

import gtfs_tripify as gt
import requests

response1 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31')
response2 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-36')
response3 = requests.get('https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-41')
stream = [response1.content, response2.content, response3.content]
logbook, timestamps, parse_errors = gt.logify(stream)

route_logbooks, route_timestamps = gt.ops.partition_on_route_id(logbook, timestamps)

File I/O

ops.to_csv(logbook, filename, output=False)

Write a logbook to a CSV file.

The output file is readable using an ordinary CSV reader, e.g. pandas.read_csv. Alternatively you may read it back into a logbook format using gt.ops.from_csv.
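
Because the output is an ordinary CSV file, it round-trips through any standard CSV reader. A minimal sketch using the standard library's csv module; the column names and values here are illustrative, not the exact logbook schema:

```python
import csv
import io

# Illustrative rows, not the real logbook schema.
rows = [
    {'trip_id': '047850_2..N08R', 'stop_id': '140N',
     'minimum_time': '1410960621', 'maximum_time': '1410961221'},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=['trip_id', 'stop_id', 'minimum_time', 'maximum_time'])
writer.writeheader()
writer.writerows(rows)

# Read the CSV text back into dicts, as any CSV reader could.
buf.seek(0)
read_back = list(csv.DictReader(buf))
```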

ops.to_gtfs(logbook, filename, tz=None, output=False)

Write a logbook to a GTFS stops.txt record. This method should only be run on complete logbooks (i.e., ones you have already run gt.ops.cut_cancellations and gt.ops.discard_partial_logs on), as the GTFS spec does not allow null values or hypothetical stops in stops.txt. For general use cases, gt.ops.to_csv is preferable.

Some edge case behaviors to keep in mind:

  • If there is no known minimum_time for a stop, a time 15 seconds before the maximum_time will be imputed. GTFS does not allow for null values.

  • If there is no known maximum time for a stop, the stop will not be included in the file.

  • If the train is still en route to a stop, that stop will not be included in the file.
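
These imputation rules can be sketched as follows. This is not the library's implementation; stop records are modeled as plain dicts with possibly-missing unix times:

```python
def impute_stop_times(stops):
    """Apply the to_gtfs edge-case rules (sketch):
    - missing minimum_time: impute maximum_time - 15 seconds
    - missing maximum_time (including en-route stops): drop the stop
    """
    out = []
    for stop in stops:
        if stop.get('maximum_time') is None:
            continue  # en route or unknown: not writable to stops.txt
        minimum = stop.get('minimum_time')
        if minimum is None:
            minimum = stop['maximum_time'] - 15  # GTFS forbids nulls
        out.append({**stop, 'minimum_time': minimum})
    return out

stops = [
    {'stop_id': '101N', 'minimum_time': None, 'maximum_time': 1410961221},
    {'stop_id': '102N', 'minimum_time': None, 'maximum_time': None},
]
cleaned = impute_stop_times(stops)
```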

ops.from_csv(filename)

Read a logbook from a CSV file (as written by gt.ops.to_csv).