We ran transit routing across 7 California counties: 11 GTFS feeds, hundreds of census tracts, thousands of grocery stores. Four counties failed on the first attempt. The feeds had downloaded successfully, contained all required files, and looked perfectly valid. But when r5py (a Python transit routing library) tried to build transit networks from them, everything broke.
Caltrain's frequencies table was empty. Two agencies' download URLs had quietly stopped working. Three counties' routing crashed because the OpenStreetMap file was too large, and by the time we figured that out, we'd already burned hours debugging what we assumed was a GTFS problem.
Unfortunately, a GTFS feed can pass every basic check and still be functionally useless for research. Fortunately, walking through a validation workflow can catch those problems before they reach the routing engine.[1] Let's walk through it together.
Before diving in, it helps to understand that GTFS validation operates on two levels. Structural validation checks whether files exist and fields conform to the spec: correct column names, valid data types, required files present. Content validation checks whether the data makes sense: reasonable coordinates, active service dates, non-empty route lists. A feed can be structurally perfect and still fail content validation in ways that break routing engines. The workflow below addresses both levels.
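To make the distinction concrete, here is a toy example (the stops data is fabricated for illustration): a stops table with the right columns parses cleanly and passes a structural check, yet fails a content check because its one stop sits at (0, 0).

```python
import io
import pandas as pd

# A table can be structurally valid (right columns, parseable) yet fail
# content validation. This stops.txt parses fine but places a stop at (0, 0).
raw = "stop_id,stop_name,stop_lat,stop_lon\nS1,Main St,0,0\n"
stops = pd.read_csv(io.StringIO(raw))

# Structural: expected columns are present
structurally_valid = {'stop_id', 'stop_lat', 'stop_lon'} <= set(stops.columns)

# Content: no placeholder (0, 0) coordinates
content_valid = not bool(((stops['stop_lat'] == 0) & (stops['stop_lon'] == 0)).any())
```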
Several GTFS validation tools already exist. The MobilityData Canonical Validator checks spec compliance.[4] QGIS offers visual inspection of stop locations. Transitland provides feed discovery and archiving. Our Python approach here complements these by integrating validation directly into data pipelines, catching the content-level problems that spec-compliance tools tend to miss.
Six Layers, Six Failure Modes
Why six layers? Each targets a different category of data quality issue. Download failures, missing files, bad coordinates, expired calendars, broken relational integrity, and multi-feed conflicts are all distinct failure modes. A feed can pass five layers and still break on the sixth. The layered approach means we catch problems at the earliest possible stage, before they cascade into harder-to-diagnose downstream failures.
| Component | Tool | Purpose |
|---|---|---|
| Transit data | GTFS feeds | Bus/rail routes, stops, schedules |
| Feed registry | Cal-ITP / Transitland | Find and download feeds with fallbacks |
| Structural validation | zipfile + pandas | Check file presence, parse tables |
| Geographic validation | pandas + bounding box | Coordinate sanity checks |
| Calendar validation | pandas datetime | Expired services, active date ranges |
| Content validation | pandas | Relational integrity between tables |
| Downstream testing | r5py | Confirm feeds actually produce routes |
To make this concrete, here is what a typical validation report looks like for a single feed: SFMTA -- 72 routes, 3,241 trips, date range 2024-01-15 to 2024-07-14, 0 coordinate outliers, 2 warnings (missing agency_url, empty frequencies.txt). Each layer contributes a piece of that picture, and the warnings are what keep bad data from reaching the routing engine.
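Assembling that picture from the per-layer outputs is straightforward. The sketch below is illustrative, not our actual pipeline code: the `build_report` helper and its field names are ours, but the shape matches the layer results produced in the steps that follow.

```python
# Hypothetical sketch: merge per-layer results into one feed-level report.
def build_report(feed_id, layer_results):
    """Collect warnings from any layer that flagged issues; feed is valid
    only if every layer reported valid."""
    warnings = []
    for layer, result in layer_results.items():
        for issue, count in result.get('issues', {}).items():
            warnings.append(f"{layer}: {issue} ({count})")
    return {
        'feed': feed_id,
        'valid': all(r.get('valid', True) for r in layer_results.values()),
        'warnings': warnings,
    }

report = build_report('sfmta', {
    'structure': {'valid': True, 'issues': {}},
    'content': {'valid': False, 'issues': {'empty_frequencies': 1}},
})
```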
Step 1: Download with Fallbacks
Transit agencies publish their own GTFS feeds, but the URLs are unstable. Two of our 11 target agencies returned HTTP errors without warning: no redirect, no deprecation notice.
The fix: always have a fallback source. Transitland maintains a feed archive covering thousands of agencies worldwide.[2]
```python
import requests
import zipfile
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed

TRANSITLAND_FALLBACKS = {
    'sfmta': 'https://transit.land/api/v2/feeds/f-9q8y-sfmta/download_latest_feed_version',
    'bart': 'https://transit.land/api/v2/feeds/f-9q9-bart/download_latest_feed_version',
    'actransit': 'https://transit.land/api/v2/feeds/f-9q9-actransit/download_latest_feed_version',
}

def download_feed(feed_id, primary_url, output_dir):
    """Download GTFS feed with Transitland fallback."""
    zip_path = output_dir / f"{feed_id}_gtfs.zip"
    for url in [primary_url, TRANSITLAND_FALLBACKS.get(feed_id)]:
        if url is None:
            continue
        try:
            response = requests.get(url, timeout=60, allow_redirects=True)
            if response.status_code == 200:
                # Write first, then verify the payload is actually a zip --
                # some endpoints return an HTML error page with status 200
                zip_path.write_bytes(response.content)
                if zipfile.is_zipfile(zip_path):
                    return zip_path
        except requests.RequestException:
            continue
    return None
```
For multiple agencies, parallel downloads save significant time:
```python
# Download 11 feeds in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(download_feed, fid, info['url'], GTFS_DIR): fid
        for fid, info in GTFS_FEEDS.items()
    }
    for future in as_completed(futures):
        result = future.result()
        if result is None:
            print(f"  Failed: {futures[future]}")
    }
```
Step 2: Structural Validation
A valid GTFS feed is a zip file containing specific text files. Five are strictly required by the specification (agency, stops, routes, trips, stop_times); calendar.txt is conditionally required, since an agency may publish calendar_dates.txt instead.[3] We treat calendar.txt as required here and handle the exception in Step 4.
One thing to watch for before even parsing tables: file sizes. A stops.txt under 1 KB probably means an empty or near-empty feed. A stop_times.txt over 2 GB suggests a statewide aggregation that may cause memory problems downstream. The structural check below captures these sizes alongside the file-presence validation.
```python
def validate_structure(gtfs_path):
    """Check required and optional GTFS files."""
    required = ['agency.txt', 'routes.txt', 'trips.txt',
                'stops.txt', 'stop_times.txt', 'calendar.txt']
    optional = ['calendar_dates.txt', 'shapes.txt',
                'fare_attributes.txt', 'feed_info.txt']

    if not zipfile.is_zipfile(gtfs_path):
        return {'valid': False, 'error': 'Not a valid zip file'}

    with zipfile.ZipFile(gtfs_path) as zf:
        files = zf.namelist()
        missing = [f for f in required if f not in files]
        present_optional = [f for f in optional if f in files]
        # Check file sizes -- empty required files are a red flag
        sizes = {f: zf.getinfo(f).file_size for f in files if f in required + optional}

    return {
        'valid': len(missing) == 0,
        'missing': missing,
        'optional_present': present_optional,
        'file_sizes': sizes
    }
```
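To see the file-presence check in isolation, here is a toy example; the zip is built in memory with fabricated contents and deliberately omits calendar.txt so the check fires:

```python
import io
import zipfile

REQUIRED = ['agency.txt', 'routes.txt', 'trips.txt',
            'stops.txt', 'stop_times.txt', 'calendar.txt']

# Build an in-memory "feed" missing its last required file
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    for name in REQUIRED[:-1]:
        zf.writestr(name, 'placeholder\n')

# Re-open and run the same presence check used in validate_structure
with zipfile.ZipFile(buf) as zf:
    missing = [f for f in REQUIRED if f not in zf.namelist()]
```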
Step 3: Coordinate Validation
Stops coded at (0, 0), a common placeholder in GTFS feeds, will place transit service in the Gulf of Guinea off the coast of West Africa. The coordinate checks below seem obvious, but we include them because this pattern appears more often than it should. Even well-maintained feeds occasionally contain placeholder coordinates from data entry errors.
```python
import pandas as pd

def validate_coordinates(gtfs_path, expected_bounds):
    """Check stop coordinates for common problems."""
    with zipfile.ZipFile(gtfs_path) as zf:
        stops = pd.read_csv(zf.open('stops.txt'))

    issues = {}

    # Missing coordinates
    missing = stops['stop_lat'].isna() | stops['stop_lon'].isna()
    if missing.sum() > 0:
        issues['missing_coords'] = int(missing.sum())

    # Zero coordinates (common placeholder -- would map to Gulf of Guinea)
    zeros = (stops['stop_lat'] == 0) | (stops['stop_lon'] == 0)
    if zeros.sum() > 0:
        issues['zero_coords'] = int(zeros.sum())

    # Out of expected bounds.
    # lat_min/lat_max define the north-south range for the study area;
    # lon_min/lon_max define the east-west range. Stops outside these
    # bounds may indicate a wider service area or data errors.
    out_of_bounds = (
        (stops['stop_lat'] < expected_bounds['lat_min']) |
        (stops['stop_lat'] > expected_bounds['lat_max']) |
        (stops['stop_lon'] < expected_bounds['lon_min']) |
        (stops['stop_lon'] > expected_bounds['lon_max'])
    )
    if out_of_bounds.sum() > 0:
        issues['out_of_bounds'] = int(out_of_bounds.sum())

    return {
        'total_stops': len(stops),
        'valid': len(issues) == 0,
        'issues': issues,
        'bounds': {
            'lat': [float(stops['stop_lat'].min()), float(stops['stop_lat'].max())],
            'lon': [float(stops['stop_lon'].min()), float(stops['stop_lon'].max())]
        }
    }
```
The bounding box needs to match the study area. For Santa Clara County, we used lat 37.0-37.6 and lon -122.2 to -121.5. Stops outside those bounds could mean the feed covers a wider service area (BART extends across multiple counties) or that something is wrong.
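The Santa Clara bounds can be exercised on a toy stops table (the three stops below are fabricated: one in-bounds, one (0, 0) placeholder, one north of the county):

```python
import pandas as pd

# Santa Clara County bounds from the text
bounds = {'lat_min': 37.0, 'lat_max': 37.6, 'lon_min': -122.2, 'lon_max': -121.5}

stops = pd.DataFrame({
    'stop_lat': [37.33, 0.0, 38.1],
    'stop_lon': [-121.89, 0.0, -121.9],
})

# Same masks as validate_coordinates
zeros = (stops['stop_lat'] == 0) | (stops['stop_lon'] == 0)
out_of_bounds = (
    (stops['stop_lat'] < bounds['lat_min']) |
    (stops['stop_lat'] > bounds['lat_max']) |
    (stops['stop_lon'] < bounds['lon_min']) |
    (stops['stop_lon'] > bounds['lon_max'])
)
```

Note that the (0, 0) stop trips both masks: it is a placeholder and it is out of bounds, which is why the zero-coordinate check runs first.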
Geographic validity is necessary but not sufficient. A feed with perfect coordinates is useless if all its services expired last month.
Step 4: Calendar Validation
This is the check that would have saved us the most time. A GTFS feed with all services past their end date will produce zero transit routes in r5py. No error message, just empty results that look like every tract has no transit access. The calendar check below catches this before it reaches the routing engine.
In our California food access study, this problem showed up as a silent failure. r5py failed to build San Francisco's transit network but gave no indication whether the problem was in the GTFS data or the routing configuration. We ended up removing feeds one at a time until routing succeeded, a process that took several hours of trial-and-error before isolating Caltrain as the culprit. The feed's calendar was active, but the frequencies table was functionally empty (see Step 5). The takeaway: calendar validation is necessary but not sufficient on its own.
```python
from datetime import datetime

def validate_calendar(gtfs_path):
    """Check for active services and expired feeds."""
    with zipfile.ZipFile(gtfs_path) as zf:
        # Read dates as strings so the %Y%m%d parse below is unambiguous
        calendar = pd.read_csv(zf.open('calendar.txt'),
                               dtype={'start_date': str, 'end_date': str})

    calendar['start_date'] = pd.to_datetime(calendar['start_date'], format='%Y%m%d')
    calendar['end_date'] = pd.to_datetime(calendar['end_date'], format='%Y%m%d')

    today = datetime.now()
    active = calendar[
        (calendar['start_date'] <= today) &
        (calendar['end_date'] >= today)
    ]

    return {
        'total_services': len(calendar),
        'active_services': len(active),
        'earliest_start': calendar['start_date'].min().strftime('%Y-%m-%d'),
        'latest_end': calendar['end_date'].max().strftime('%Y-%m-%d'),
        'feed_expired': len(active) == 0
    }
```
Some agencies use calendar_dates.txt for exception-based scheduling instead of calendar.txt. If calendar.txt shows zero active services, it is worth checking whether calendar_dates.txt carries the schedule instead. Either way, the departure date parameter in the routing engine needs to fall within the feed's service window.
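Checking for exception-based service is a small extension. In calendar_dates.txt, exception_type 1 means service is added on that date and 2 means it is removed, so a feed with no calendar.txt rows can still be live if it has type-1 exceptions. The two-row table below is fabricated for illustration:

```python
import io
import pandas as pd

# Fabricated calendar_dates.txt: one added service date, one removed
raw = "service_id,date,exception_type\nwkdy,20240115,1\nwkdy,20240116,2\n"
cal_dates = pd.read_csv(io.StringIO(raw), dtype={'date': str})
cal_dates['date'] = pd.to_datetime(cal_dates['date'], format='%Y%m%d')

# exception_type 1 = service added on that date (GTFS spec)
added = cal_dates[cal_dates['exception_type'] == 1]
has_exception_service = len(added) > 0
```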
Step 5: Content Validation
This is where the Caltrain problem lived. The feed passed every check above: all files present, valid coordinates, active calendar. But the frequencies table was functionally empty.
What does "functionally empty" look like? An empty frequencies.txt contains only the header row:
```
trip_id,start_time,end_time,headway_secs
```
A populated one would have entries like:
```
trip_id,start_time,end_time,headway_secs
CT-LOCAL-1,06:00:00,09:00:00,1200
CT-LOCAL-1,09:00:00,15:00:00,1800
CT-LOCAL-1,15:00:00,19:00:00,1200
```
That first entry indicates 20-minute headways (1200 seconds) during the morning peak. Without these rows, the routing engine has no frequency information to work with, even though the file technically exists and has valid column headers.
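Detecting a header-only table takes one line of pandas: parse it and count the rows. The `frequencies_empty` helper below is ours, not from the pipeline, but it captures the check:

```python
import io
import pandas as pd

# A "functionally empty" frequencies.txt: valid headers, zero data rows
empty_txt = "trip_id,start_time,end_time,headway_secs\n"
populated_txt = empty_txt + "CT-LOCAL-1,06:00:00,09:00:00,1200\n"

def frequencies_empty(text):
    """True when the table parses but contains no data rows."""
    return len(pd.read_csv(io.StringIO(text))) == 0
```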
Another content-level issue that bit us: agency URLs disappearing entirely. Configuring downloads from 11 agencies revealed how fragile these endpoints can be. County Connection and Tri Delta Transit (both serving Contra Costa County) returned HTTP errors with no redirect and no deprecation page. The endpoints simply stopped responding. This left Contra Costa with only BART coverage, missing two of its three transit agencies. For a food access study, that means undercounting transit options for residents who rely on local bus service rather than regional rail. The Transitland fallback approach from Step 1 handles this, though it introduces a different risk: Transitland's archived version may not be the most current feed.
```python
def validate_content(gtfs_path):
    """Check relational integrity between GTFS tables."""
    with zipfile.ZipFile(gtfs_path) as zf:
        routes = pd.read_csv(zf.open('routes.txt'))
        trips = pd.read_csv(zf.open('trips.txt'))
        stop_times = pd.read_csv(zf.open('stop_times.txt'), low_memory=False)

    issues = {}

    # Routes with no trips
    routes_with_trips = set(trips['route_id'].unique())
    orphan_routes = set(routes['route_id']) - routes_with_trips
    if orphan_routes:
        issues['orphan_routes'] = len(orphan_routes)

    # Trips with no stop_times
    trips_with_times = set(stop_times['trip_id'].unique())
    orphan_trips = set(trips['trip_id']) - trips_with_times
    if orphan_trips:
        issues['orphan_trips'] = len(orphan_trips)

    # Check route_type values (0-7, plus 11 trolleybus and 12 monorail,
    # per the GTFS spec)
    valid_route_types = set(range(8)) | {11, 12}
    invalid_types = routes[~routes['route_type'].isin(valid_route_types)]
    if len(invalid_types) > 0:
        issues['invalid_route_types'] = len(invalid_types)

    return {
        'routes': len(routes),
        'trips': len(trips),
        'stop_times': len(stop_times),
        'valid': len(issues) == 0,
        'issues': issues
    }
```
Structural presence is not the same as relational integrity. A routes table with 50 entries means nothing if those routes connect to zero trips. A frequencies table with correct column headers means nothing if the referenced trips have no stop times.
Step 6: Multi-Agency Smoke Test
When combining feeds from multiple agencies (for example, SFMTA + BART for San Francisco), conflicts can emerge that no single-feed validation catches.
In our study, four of seven counties failed the first routing run: San Francisco (the Caltrain issue from Step 5), Sacramento, Orange, and San Diego. When that many counties fail, per-county configuration becomes necessary. We ended up writing a separate retry script (41b_calculate_transit_failed_counties.py) to handle the adjustments. A smoke test like the one below would have caught these integration failures before we committed to the full routing run.
```python
import r5py

def smoke_test_network(osm_path, gtfs_paths):
    """Try building an r5py network to catch integration issues."""
    try:
        network = r5py.TransportNetwork(
            osm_pbf=osm_path,
            gtfs=gtfs_paths
        )
        return {'status': 'success', 'feeds_loaded': len(gtfs_paths)}
    except Exception as e:
        return {'status': 'failed', 'error': str(e)}
```
This is a blunt instrument: it tells us whether r5py can build a network, not whether the results will be correct. But it catches the category of failure that cost us the most time: feeds that look valid individually but break when combined.
OSM file size is another failure point; three of our counties crashed because the statewide extract was too large for r5py's memory. Regional extracts from Geofabrik or BBBike solve this, but that's outside the GTFS validation scope.
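Chaining the six layers into a single gate is then a matter of running each check in order and stopping at the first failure, so the report names the earliest layer that broke. The `run_layers` orchestrator below is a hypothetical sketch (the stand-in lambdas replace the real Step 2-6 validators), not the study's actual driver script:

```python
# Sketch: run validation layers in order, stop at the first failure.
def run_layers(feed_id, checks):
    """checks: ordered list of (name, fn); each fn returns a result dict
    with a 'valid' key, like the validators in Steps 2-5."""
    for name, check in checks:
        result = check()
        if not result.get('valid', True):
            return {'feed': feed_id, 'failed_at': name, 'result': result}
    return {'feed': feed_id, 'failed_at': None}

# Stand-in checks showing a feed that passes structure but fails content
outcome = run_layers('caltrain', [
    ('structure', lambda: {'valid': True}),
    ('content', lambda: {'valid': False, 'issues': {'empty_frequencies': 1}}),
])
```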
Debugging in Practice
The validation steps above emerged from three specific failures, and we've integrated the details of each into the relevant steps:
- Caltrain's empty frequencies table -- diagnosed through a painstaking process of removing feeds one at a time. The frequencies example and debugging narrative appear in Step 4 and Step 5.
- Disappearing agency URLs -- County Connection and Tri Delta Transit endpoints stopped responding without warning. The fallback strategy and its tradeoffs are discussed in Step 5.
- Four-county retry script -- when over half the counties failed routing, per-county configuration became necessary. The context for this appears in Step 6.
Those debugging sessions clarified when this validation overhead is worth the investment versus when simpler checks suffice.
When to Use This Approach
GTFS validation is overhead, and not every project needs all six layers. The investment tends to pay off when bad feeds can cascade into downstream failures that are harder to diagnose than to prevent.
Good fit:
- Multi-agency studies where one bad feed can poison the whole analysis
- Longitudinal studies where feeds may expire between data collection and analysis
- Any pipeline feeding into r5py, OpenTripPlanner, or similar routing engines
Less suitable:
- Real-time transit applications (use GTFS-realtime validation instead)
- Single-agency analysis with a known-good feed
- Quick exploratory work where a few missing stops are tolerable
Limitations
These checks are practical safeguards, not a comprehensive test suite. Several constraints are worth noting.
- Pattern-based, not exhaustive. These checks catch problems we actually encountered. Other feeds will have other problems.
- Point-in-time. Feeds update regularly. A feed that validates today may expire next month.
- Engine-specific tolerance. r5py, OpenTripPlanner, and Valhalla each handle GTFS quirks differently. A feed that crashes r5py may work fine in OTP, and vice versa.
- Requires geographic knowledge. Bounding box checks only work if we know where the stops should be.
- Does not cover GTFS-realtime. Real-time feeds (vehicle positions, trip updates, service alerts) use a different specification and require different validation tools.
Code Availability
Complete validation and routing code: GitHub repository
Key files:
- scripts/30_validate_gtfs_feed.py -- Single-agency validation
- scripts/40_download_all_gtfs_feeds.py -- Multi-agency download with fallbacks
- scripts/41b_calculate_transit_failed_counties.py -- Retry script with per-county fixes
References and Notes
[1] This validation workflow was developed as part of a food access study measuring transit travel times to grocery stores across 7 California counties. See our transit routing tutorial for the routing methodology.
[2] Transitland is a community-edited open data platform maintained by Interline Technologies. Feed archive: transit.land. For California specifically, Cal-ITP (California Integrated Travel Project) aggregates GTFS feeds from 96% of the state's transit agencies: data.ca.gov.
[3] GTFS (General Transit Feed Specification) is maintained by MobilityData. Over 10,000 transit agencies in 100+ countries publish GTFS feeds. Specification: gtfs.org. Repository: github.com/google/transit
[4] MobilityData. "Canonical GTFS Schedule Validator." github.com/MobilityData/gtfs-validator