If we want to measure how long it actually takes transit-dependent residents to reach grocery stores, we need to calculate travel times at scale. For 408 census tracts and 6,613 grocery stores, that means 2,698,104 origin-destination pairs. Using the Google Maps API, this would cost approximately $13,500 at current rates ($5 per 1,000 routes).[1] We can calculate all of them for $0.
Why does this matter? Many food access questions require travel time calculations, but commercial API pricing puts large-scale analysis out of reach for unfunded projects. Studying transit-based food access at scale therefore means turning to free alternatives. They exist, but they involve assembling multiple data sources and learning new tools.
This post walks through what we learned building this workflow: how to acquire GTFS transit data, build a pedestrian network from OpenStreetMap, and route with r5py. All code is available in a public GitHub repository.[2]
Worth noting up front: r5py is not the only free option. OSRM handles car, bike, and walking routes but lacks transit support. OpenTripPlanner offers full-featured transit routing with real-time updates, though it requires deploying a Java server. Conveyal Analysis provides a web-based interface to the same R5 engine r5py wraps, with built-in visualization but less flexibility for custom pipelines. We chose r5py because it handles batch travel time matrices directly in Python, which tends to be what large-scale research needs. The Alternatives section below goes into more detail.
The Tool Stack: GTFS, OpenStreetMap, and r5py
Four free, open-source components make this possible:
| Component | Tool | Purpose |
|---|---|---|
| Transit schedules | GTFS feeds | Bus and rail routes, stops, schedules |
| Street network | OpenStreetMap | Walking paths between transit and destinations |
| Routing engine | r5py (Conveyal R5) | Calculate multimodal travel times |
| Destination data | Google Places API | Grocery store locations (some API cost here) |
GTFS (General Transit Feed Specification) is the standard format for transit data. Most US transit agencies publish free GTFS feeds that include routes, stops, and schedules.[3]
OpenStreetMap provides free street network data worldwide. For transit routing, we need pedestrian paths: sidewalks, crosswalks, and building entrances.
r5py is a Python wrapper for Conveyal's R5 routing engine, designed for rapid accessibility analysis.[4] It calculates travel times accounting for walking, waiting, riding, and transferring.
Step 1: Acquire GTFS Data
Every transit agency maintains its own GTFS feed. For California, the Cal-ITP project aggregates feeds from 200+ agencies into a single download.[5]
import requests
from pathlib import Path
# Cal-ITP statewide stops, aggregated from all agencies. CALITP_URL is
# the dataset landing page; copy the direct CSV resource link from that
# page into STOPS_CSV_URL -- appending filenames to the page URL will 404.
CALITP_URL = "https://data.ca.gov/dataset/cal-itp-gtfs-ingest-pipeline-dataset"
STOPS_CSV_URL = "<direct resource link copied from the landing page>"
# Download the aggregated stops file
Path("data").mkdir(exist_ok=True)
response = requests.get(STOPS_CSV_URL, timeout=60)
response.raise_for_status()
Path("data/calitp_stops.csv").write_bytes(response.content)
For single-agency analysis, download directly from the agency:
# Example: VTA (Santa Clara County)
VTA_GTFS_URL = "https://www.vta.org/sites/default/files/google_transit.zip"
Path("data/gtfs").mkdir(parents=True, exist_ok=True)
response = requests.get(VTA_GTFS_URL, timeout=60)
response.raise_for_status()
Path("data/gtfs/vta_gtfs.zip").write_bytes(response.content)
A word of caution on GTFS quality: Some agency feeds contain errors that can silently corrupt results -- stops with coordinates of (0, 0), routes with no trips, or stop_times with impossible schedules. It helps to validate feeds before routing. The function below checks for the most common issues:
# Validate GTFS before using
import zipfile
import pandas as pd

def validate_gtfs(gtfs_path):
    """Return the stops table, dropping rows with unusable coordinates."""
    with zipfile.ZipFile(gtfs_path) as zf:
        stops = pd.read_csv(zf.open("stops.txt"))
    # Check for missing or (0, 0) coordinates, which silently corrupt routing
    invalid = stops[
        (stops["stop_lat"] == 0)
        | (stops["stop_lon"] == 0)
        | (stops["stop_lat"].isna())
        | (stops["stop_lon"].isna())
    ]
    if len(invalid) > 0:
        print(f"Warning: {len(invalid)} stops with invalid coordinates")
        stops = stops.drop(invalid.index)
    return stops
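Running it against the VTA feed downloaded above returns the cleaned stops table:
stops = validate_gtfs("data/gtfs/vta_gtfs.zip")
print(f"{len(stops)} stops with usable coordinates")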
Key files in a GTFS feed (the sketch after this list shows how they fit together):
- stops.txt: Stop locations (latitude, longitude)
- routes.txt: Route names and types
- trips.txt: Individual scheduled trips
- stop_times.txt: Arrival/departure times at each stop
- calendar.txt: Service patterns (weekday, weekend)
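To make the relationships concrete, here is a short sketch that joins stop_times to stops to find the busiest stops in a feed. It assumes the VTA feed downloaded above; any GTFS zip works the same way:
import zipfile
import pandas as pd

with zipfile.ZipFile("data/gtfs/vta_gtfs.zip") as zf:
    stops = pd.read_csv(zf.open("stops.txt"))
    stop_times = pd.read_csv(zf.open("stop_times.txt"))
# Each stop_times row is one scheduled arrival, keyed to stops by stop_id
arrivals = (
    stop_times.merge(stops[["stop_id", "stop_name"]], on="stop_id")
    .groupby("stop_name")["trip_id"]
    .count()
    .sort_values(ascending=False)
)
print(arrivals.head(10))  # busiest stops by scheduled arrivals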
Step 2: Download OpenStreetMap Data
r5py needs a pedestrian network to route walking segments. OpenStreetMap provides this.
# Download California OSM extract from Geofabrik
OSM_URL = "https://download.geofabrik.de/north-america/us/california-latest.osm.pbf"
response = requests.get(OSM_URL, stream=True)
with open("data/osm/california.osm.pbf", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
For smaller regions, it is worth extracting a subset. Large OSM files (statewide or larger) can cause r5py to run slowly or fail entirely during network building. Extracting only the region we need avoids this overhead and keeps build times reasonable. The osmium command-line tool handles this well (install it through a system package manager, e.g. brew install osmium-tool on macOS or apt install osmium-tool on Debian/Ubuntu):
# Using osmium-tool to extract a bounding box
osmium extract -b -122.5,37.0,-121.5,37.8 california.osm.pbf -o bay_area.osm.pbf
We can also calculate the bounding box programmatically from our data, which ensures the extract covers all origins and destinations with a small buffer:
# Calculate bounding box from our origin and destination data
min_lon = min(origins.geometry.x.min(), destinations.geometry.x.min()) - 0.1
max_lon = max(origins.geometry.x.max(), destinations.geometry.x.max()) + 0.1
min_lat = min(origins.geometry.y.min(), destinations.geometry.y.min()) - 0.1
max_lat = max(origins.geometry.y.max(), destinations.geometry.y.max()) + 0.1
# Extract using osmium
import subprocess
# check=True surfaces failures instead of continuing with a stale extract
subprocess.run([
    "osmium", "extract",
    "-b", f"{min_lon},{min_lat},{max_lon},{max_lat}",
    "california.osm.pbf",
    "-o", "region.osm.pbf",
], check=True)
The Bay Area extract is about 200 MB and builds in roughly 5 minutes; statewide California is about 1 GB and takes 20-30 minutes to build.
Step 3: Set Up r5py
r5py requires Java and the R5 JAR file. Installation:
# Install r5py via pip
pip install r5py
# r5py automatically downloads the R5 JAR on first use
Build the transport network:
import r5py
# Build network from GTFS and OSM
transport_network = r5py.TransportNetwork(
    osm_pbf="data/osm/bay_area.osm.pbf",
    gtfs=["data/gtfs/vta_gtfs.zip"]  # Can include multiple agencies
)
Network building takes 5-30 minutes depending on area size and number of GTFS feeds. The result is cached for reuse.
Step 4: Prepare Origins and Destinations
Origins are census tract centroids (where people live). Destinations are grocery stores (where people need to go).
The id column serves as the unique identifier for each point in the travel time matrix. For origins, we use the census tract GEOID; for destinations, the store ID. r5py uses these identifiers to label the rows and columns in its output, so they need to be unique and meaningful for later analysis.
import pandas as pd
import geopandas as gpd
# Load census tract centroids
tracts = gpd.read_file("data/census_tracts.geojson")
origins = tracts[["GEOID", "geometry"]].copy()
origins["id"] = origins["GEOID"] # Unique identifier for each origin tract
# Load grocery store locations
stores = pd.read_csv("data/grocery_stores.csv")
destinations = gpd.GeoDataFrame(
    stores,
    geometry=gpd.points_from_xy(stores.longitude, stores.latitude),
    crs="EPSG:4326"
)
destinations["id"] = destinations["store_id"]  # Unique identifier for each destination
Important: Both origins and destinations need an id column and geometry column in a GeoDataFrame.
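Because the routing run is long, it is worth verifying these assumptions up front. A minimal check, using the column names from the code above:
# Fail fast before a long routing run: r5py keys its output on these
# ids, so duplicates would silently scramble the travel time matrix
assert origins["id"].is_unique, "origin ids must be unique"
assert destinations["id"].is_unique, "destination ids must be unique"
assert origins.geometry.notna().all(), "origins missing geometry"
assert destinations.geometry.notna().all(), "destinations missing geometry"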
Step 5: Calculate Travel Times
The departure_time_window parameter controls how long r5py samples departures after the initial departure time. Setting a 2-hour window (say, 9:00 AM to 11:00 AM) means r5py simulates multiple departures across that period and reports the median travel time. This captures service variation: a trip departing at 9:05 might catch a bus immediately, while one at 9:15 might wait 20 minutes for the next one. The wider the window, the more representative the result tends to be.
r5py's TravelTimeMatrixComputer calculates travel times between all origin-destination pairs:
from datetime import datetime, timedelta
# Set departure time window
departure = datetime(2024, 11, 15, 9, 0) # Friday 9:00 AM
departure_window = timedelta(hours=2) # Allow departures until 11:00 AM
# Configure the travel time calculation
travel_time_computer = r5py.TravelTimeMatrixComputer(
    transport_network,
    origins=origins,
    destinations=destinations,
    departure=departure,
    # departure_time_window: r5py samples multiple departures across this
    # period and reports the median, capturing schedule variation
    departure_time_window=departure_window,
    transport_modes=[r5py.TransportMode.WALK, r5py.TransportMode.TRANSIT],
    max_time=timedelta(minutes=60),  # Maximum travel time to consider
    percentiles=[50]  # Median travel time across departure window
)
# Calculate!
travel_times = travel_time_computer.compute_travel_times()
For 408 tracts x 6,613 stores = 2,698,104 pairs, this takes about 45 minutes on a laptop with 16 GB RAM and an M1 processor. The result is a DataFrame with columns from_id, to_id, and travel_time.
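Before moving on, a quick sanity check on the output is cheap insurance. In recent r5py versions, pairs that exceed the 60-minute cutoff come back with missing travel times, so the reachable share is worth knowing:
# Inspect the result: row count and share of reachable pairs
print(travel_times.shape)
reachable_share = travel_times["travel_time"].notna().mean()
print(f"{reachable_share:.1%} of pairs reachable within 60 minutes")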
Memory note: Calculating all pairs at once may exceed available RAM for large matrices. If we run into memory issues, batching the origins into groups of 50 keeps memory usage manageable:
# Process in batches of 50 tracts to avoid memory limits
batch_size = 50
results = []
# Loop through origins in chunks, calculating travel times for each batch
# separately. This keeps peak memory usage proportional to batch_size
# rather than the full origin set.
for i in range(0, len(origins), batch_size):
    batch_origins = origins.iloc[i:i+batch_size]
    computer = r5py.TravelTimeMatrixComputer(
        transport_network,
        origins=batch_origins,
        destinations=destinations,
        departure=departure,
        departure_time_window=departure_window,
        transport_modes=[r5py.TransportMode.WALK, r5py.TransportMode.TRANSIT],
        max_time=timedelta(minutes=60),
        percentiles=[50]
    )
    batch_results = computer.compute_travel_times()
    results.append(batch_results)
# Combine all batches
travel_times = pd.concat(results, ignore_index=True)
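Since the full run takes the better part of an hour, it also helps to persist the combined matrix rather than recompute it each session. A minimal sketch, assuming pyarrow (or fastparquet) is installed for parquet support:
# Save the ~2.7M-row matrix for reuse; parquet is compact and fast to reload
travel_times.to_parquet("data/travel_times.parquet")
# Later sessions can then skip routing entirely:
# travel_times = pd.read_parquet("data/travel_times.parquet")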
Step 6: Find Nearest Store by Transit
With travel times calculated, we can find the minimum for each origin:
# Group by origin and find minimum travel time
nearest_by_transit = (
    travel_times
    .groupby("from_id")
    .agg(
        min_travel_time=("travel_time", "min"),
        stores_within_30_min=("travel_time", lambda x: (x <= 30).sum()),
        stores_within_45_min=("travel_time", lambda x: (x <= 45).sum())
    )
    .reset_index()
)
# Merge back to tract data
tracts_with_transit = tracts.merge(
    nearest_by_transit,
    left_on="GEOID",
    right_on="from_id"
)
A typical output row might look like: tract 06085503102, nearest store 14.2 minutes by transit, 8 stores reachable within 30 minutes, 23 stores within 45 minutes. This gives us both a "nearest option" measure and a "choice set" measure for each census tract, which turn out to tell quite different stories about food access.
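Building on that, a short sketch flags tracts at the low end of access. The 45-minute threshold is illustrative rather than a finding from this analysis, and note that min_travel_time can be missing when no store is reachable at all:
# Flag tracts whose nearest store exceeds 45 minutes by transit,
# or is unreachable entirely (missing min_travel_time)
low_access = tracts_with_transit[
    tracts_with_transit["min_travel_time"].isna()
    | (tracts_with_transit["min_travel_time"] > 45)
]
print(f"{len(low_access)} of {len(tracts_with_transit)} tracts "
      f"lack a grocery store within 45 minutes by transit")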
Troubleshooting Reference
The three most common issues we encountered are now addressed inline in the steps above: GTFS validation (see Step 1), OSM file sizing (see Step 2), and memory limits during batch processing (see Step 5). If something goes wrong during routing, those are the first places to check.
With the full travel time matrix in hand, the natural question becomes: when does this approach make sense, and when might something else work better?
When to Use This Approach
r5py's batch routing works best for research projects that need large travel time matrices on a budget. The tradeoffs become clearer once we consider the use case.
Good fit:
- Research requiring many origin-destination pairs
- Budget constraints preclude commercial APIs
- Need for reproducibility (GTFS + OSM are public data)
- Batch processing is acceptable (not real-time queries)
Less suitable:
- Real-time routing for individual trips
- Need for traffic-aware car routing (GTFS is transit only)
- Very small analyses where API costs are negligible
- Regions where transit agencies do not publish GTFS feeds (some rural areas, demand-responsive services) or where OpenStreetMap pedestrian paths are incomplete (typically rural or newly developed areas)
Alternatives
Several tools cover similar ground, each with a different cost-complexity tradeoff.
Commercial APIs (Google Maps, Mapbox): Easier setup, real-time capabilities, but costly at scale. Google Maps charges $5 per 1,000 routes for transit directions.
OSRM (Open Source Routing Machine): Free and fast, but car/bike/walk only. No transit support.
OpenTripPlanner: Full-featured transit router with real-time updates and trip planning APIs. Deployment requires running a Java server and configuring multiple components (graph building, API endpoints, frontend). This overhead makes sense for transit agency websites serving individual trip queries, but for batch research calculating millions of routes, r5py's single Python script is simpler.
Conveyal Analysis: Web-based interface to R5 with built-in isochrone mapping and accessibility visualization. Handles data loading and visualization without code. The limitation: we cannot easily customize routing parameters, export raw travel time matrices, or integrate results into a larger analysis pipeline.
Limitations
Free tools involve real tradeoffs. Here are the ones that seem to matter most for research applications.
Schedule-based, not real-time: GTFS represents planned service, not actual arrivals. Delays, cancellations, and disruptions are not captured.
Walking assumptions: r5py assumes constant walking speed. Elderly residents, those with mobility limitations, or those carrying groceries may walk slower than the model expects.
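One mitigation is a sensitivity run with a slower walking speed: the model still assumes a constant pace, but a slower one bounds the effect. Recent r5py versions expose a speed_walking parameter in km/h (check the documentation for your installed version); the 2.5 km/h figure below is an illustrative assumption, not a calibrated value:
# Recompute with a slower walking pace to approximate reduced mobility
slow_walk_computer = r5py.TravelTimeMatrixComputer(
    transport_network,
    origins=origins,
    destinations=destinations,
    departure=departure,
    departure_time_window=departure_window,
    transport_modes=[r5py.TransportMode.WALK, r5py.TransportMode.TRANSIT],
    max_time=timedelta(minutes=60),
    speed_walking=2.5,  # km/h; the default is roughly 3.6 km/h
)
slow_travel_times = slow_walk_computer.compute_travel_times()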
Code Availability
Complete code for this analysis is available in the GitHub repository.[2]
The repository includes:
- scripts/50_download_calitp_gtfs.py: GTFS download and validation
- scripts/53_extract_regional_osm.py: OSM extraction utilities
- scripts/41_calculate_transit_all_counties.py: r5py routing implementation with batch processing
- scripts/82_mobility_deserts_calitp.py: Post-processing to identify mobility deserts
- data/: Sample data files and expected outputs
References and Notes
[1] Google Maps Platform pricing as of November 2024. Transit Directions API: $5.00 per 1,000 requests. 2,698,104 requests x $0.005 = $13,490.52.
[2] GitHub repository: github.com/dphdmae/foodsecurity_mobility. MIT License.
[3] GTFS was developed by Google and TriMet (Portland, OR) in 2005. Over 2,500 transit agencies worldwide now publish GTFS feeds. See gtfs.org for specification details.
[4] Conway, M. W., Byrd, A., & van Eggermond, M. (2018). "Accounting for uncertainty and variation in accessibility metrics for public transport sketch planning." Journal of Transport and Land Use, 11(1), 541-558. For the r5py implementation, see Pereira, R. H. M., Saraiva, M., Herszenhut, D., Braga, C. K. V., & Conway, M. W. (2021). "r5r: Rapid Realistic Routing on Multimodal Transport Networks with R5 in R." Findings. https://doi.org/10.32866/001c.21262
[5] Cal-ITP (California Integrated Travel Project) aggregates GTFS data from California transit agencies. Data available at data.ca.gov.
Tags: #FoodSecurity #TransitAnalysis #GTFS #r5py #OpenStreetMap #Methods #Tutorial #OpenSource