When we started this analysis, we focused on 7 Bay Area and major metro counties: approximately 2,000 census tracts.[1] But to understand California's food access landscape, we need to see the whole state. Expanding to all 58 counties means 9,039 residential tracts, 200+ transit agencies, and statewide grocery store coverage. The computational scale increases 4.5-fold, but as we discovered, the methodological challenges increase more.
Here is what we learned when we scaled up: what worked, what broke, and what the statewide data reveals that county-level analysis could not show us.
The Scale of Statewide Analysis
| Metric | 7-County Pilot | Statewide |
|---|---|---|
| Census tracts | ~2,000 | 9,039 |
| Transit agencies | 8 | 200+ |
| Transit stops | 24,421 | 64,060 |
| Grocery stores | ~6,600 | 24,850+ |
| Origin-destination pairs | ~13M | ~225M |

Figure 10.1: Data Scale Expansion from Seven-County Pilot to Statewide California Analysis
Comparison of pilot (7 Bay Area counties) vs. statewide scope: census tracts (2,000 vs. 9,039), transit agencies (8 vs. 200+), transit stops (24,400 vs. 64,100), grocery stores (6,600 vs. 24,900). Transit stops deduplicated; grocery stores from SafeGraph Core Places (2023).
The 9,039 residential tracts exclude non-residential census units (parks, airports, industrial zones, water bodies). California has approximately 9,100 total census tracts under 2020 boundaries; filtering for population density > 100/sq mi and population > 500 produces the residential subset.
What Scaled Well
Census Data Acquisition
ACS data downloads scaled linearly. The same API calls work for one county or 58:
```python
from census import Census

c = Census("YOUR_API_KEY")

# Get data for all California tracts
income = c.acs5.state_county_tract(
    fields=["B19013_001E"],  # Median household income
    state_fips="06",         # California
    county_fips="*",         # All counties
    tract="*"                # All tracts
)
```
Processing time increased proportionally but remained manageable (minutes, not hours).
Grocery Store Distance Calculations
The KD-tree spatial indexing approach handles large datasets efficiently:
```python
from scipy.spatial import cKDTree

# Note: project lat/lon to a planar CRS (meters) before building the tree;
# raw degrees distort east-west distances at California's latitudes.
store_coords = stores[['latitude', 'longitude']].values
tree = cKDTree(store_coords)

# Query the nearest store for each tract centroid (vectorized)
distances, indices = tree.query(tract_centroids, k=1)
```
Finding the nearest store among 24,850 options for 9,039 tracts takes under 30 seconds. The O(n log n) scaling of KD-trees makes this tractable even at larger scales.
Vulnerability Index Calculation
The composite index calculation is pure arithmetic on DataFrames. It scales to any number of rows without special handling:
```python
# Each component normalized 0-1, then weighted
df['vulnerability_index'] = (
    0.25 * df['food_access_score'] +
    0.25 * df['poverty_score'] +
    0.20 * df['renter_score'] +
    0.15 * df['minority_score'] +
    0.15 * df['sprawl_score']
)
```
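Before weighting, each component must be brought onto the 0-1 scale. A minimal sketch of one way to do this, min-max scaling, with an illustrative column name (the sprawl component instead uses percentile ranks, discussed below):

```python
import pandas as pd

def min_max_score(series: pd.Series) -> pd.Series:
    """Scale a raw component to the 0-1 range used by the index."""
    rng = series.max() - series.min()
    if rng == 0:
        # Constant component: return zeros rather than dividing by zero
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / rng

# Illustrative data, not values from the analysis
df = pd.DataFrame({"poverty_rate": [0.05, 0.12, 0.30, 0.21]})
df["poverty_score"] = min_max_score(df["poverty_rate"])
```

The zero-range guard matters in practice: a component that is constant across all tracts would otherwise produce a division by zero.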
What Required Adaptation
Transit Data Aggregation
Individual agency GTFS downloads do not scale. Acquiring feeds from 200+ agencies manually would take weeks and produce inconsistent results.
Solution: Use Cal-ITP's aggregated statewide feed.[2] This provides standardized GTFS data from all California transit agencies in a single download.
```python
import pandas as pd

# Cal-ITP statewide stops (aggregated). The URL below is the dataset
# landing page; substitute the direct CSV resource URL from that page.
CALITP_URL = "https://data.ca.gov/dataset/cal-itp-gtfs-ingest-pipeline"
stops = pd.read_csv(CALITP_URL + "/stops.csv")

# Filter to California's bounding box
stops = stops[
    (stops['stop_lat'] >= 32) & (stops['stop_lat'] <= 42) &
    (stops['stop_lon'] >= -125) & (stops['stop_lon'] <= -114)
]
```
The trade-off is that Cal-ITP provides stop locations but not full GTFS schedules for all agencies. For statewide stop-based analysis (mobility desert classification), this works. For detailed transit time routing, individual GTFS feeds are still needed.
Memory Management
Loading 225 million origin-destination pairs into memory exceeds typical laptop RAM. Several strategies helped:
- Chunked processing: Process tracts in batches of 500-1,000, write results to disk, then concatenate.
- Sparse representation: For binary classifications (is this tract a mobility desert?), store only the classification rather than all underlying data.
- On-disk computation: Use Dask or similar tools for out-of-core DataFrames when necessary.
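The chunked pattern can be sketched as follows. The batch size and the placeholder per-batch computation are illustrative; the real pipeline writes each batch to disk before concatenating:

```python
import pandas as pd

CHUNK = 500  # tracts per batch

def process_batch(batch: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the real per-tract spatial computation."""
    out = batch.copy()
    out["processed"] = True
    return out

def process_in_chunks(tracts: pd.DataFrame, chunk: int = CHUNK) -> pd.DataFrame:
    # In the real pipeline each batch would be written to disk and
    # concatenated at the end; a list suffices for this sketch.
    results = []
    for start in range(0, len(tracts), chunk):
        results.append(process_batch(tracts.iloc[start:start + chunk]))
    return pd.concat(results, ignore_index=True)

tracts = pd.DataFrame({"geoid": range(1_200)})
result = process_in_chunks(tracts)
```

Only one batch's intermediate data is ever in memory at a time, which is what keeps the 225M-pair computation within laptop RAM.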
Sprawl Index Normalization
The pilot analysis used raw sprawl index values. These ranged from 0.005 (dense urban) to 27,942 (rural ranch land). When combined with other normalized (0-1) components, the raw sprawl values dominated the vulnerability index.
Solution: Percentile rank normalization:
```python
# Rank-based normalization to 0-1 scale
df['sprawl_score'] = df['sprawl_index'].rank(pct=True)
```
This preserves the ordering (more sprawl = higher score) while constraining values to the same 0-1 range as other components.
What Statewide Data Reveals
Finding 1: Regional Clustering
The statewide map reveals geographic clusters invisible in county-level analysis:
Central Valley corridor: Merced, Madera, Fresno, Tulare, and Kern form a contiguous high-vulnerability band. This suggests regional factors (agricultural economy, transportation infrastructure, and historical investment patterns) that transcend county boundaries.
Coastal gradient: Vulnerability increases inland from the coast throughout California. This gradient is continuous, not county-bounded.
Bay Area exception: The Bay Area shows a reversed pattern, with higher vulnerability in some coastal areas (San Francisco's Tenderloin, Oakland's flatlands) than inland suburbs.
Finding 2: County Averages Hide Extreme Tracts
Los Angeles County's mean vulnerability (0.38) is moderate, but the county contains tracts ranging from 0.15 to 0.72. Statewide analysis reveals that some of California's highest-vulnerability tracts are in LA County, embedded within an overall moderate-vulnerability county.
Similar patterns appear in San Diego, Sacramento, and Fresno. County-level reporting would miss these pockets of extreme vulnerability.
Finding 3: Small Counties Drive Extreme Rankings
The most vulnerable counties by mean score are often small:
- Alpine: 1,200 residents, 2 tracts
- Sierra: 3,200 residents, 3 tracts
- Modoc: 8,700 residents, 5 tracts
With few tracts, county averages are heavily influenced by individual tract values. A single high-vulnerability tract in a 2-tract county pulls the county average dramatically.
This creates interpretation challenges; Alpine's #2 vulnerability ranking does not mean 1,200 people face worse conditions than 39 million people in lower-ranked counties. Population-weighted statewide analysis provides better perspective.
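The gap between unweighted and population-weighted means can be sketched with toy tract values (the numbers here are illustrative, not results from the analysis):

```python
import pandas as pd

def weighted_mean(df: pd.DataFrame, value_col: str, weight_col: str) -> float:
    """Population-weighted mean of a tract-level value."""
    return (df[value_col] * df[weight_col]).sum() / df[weight_col].sum()

# Illustrative: a tiny county with one extreme tract vs. a huge county
tracts = pd.DataFrame({
    "county": ["Alpine", "Alpine", "Los Angeles", "Los Angeles"],
    "vulnerability_index": [0.70, 0.30, 0.40, 0.36],
    "population": [600, 600, 2_000_000, 2_000_000],
})

unweighted = tracts["vulnerability_index"].mean()            # treats all tracts equally
statewide = weighted_mean(tracts, "vulnerability_index", "population")
```

The unweighted mean (0.44) is pulled up by Alpine's 0.70 tract; the population-weighted mean (~0.38) reflects where people actually live.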
Finding 4: The 12% Mobility Desert Finding
The corrected statewide analysis identified 1,086 mobility desert tracts (12% of residential tracts).[3] This finding required Cal-ITP's aggregated data from 200+ agencies. The 7-county pilot, using feeds from only 8 major regional agencies, missed municipal bus systems that serve suburban areas and could not accurately classify tracts outside those agencies' service areas.
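Given per-tract distances and stop counts, the classification rule from note 3 (grocery store within 1 mile, but transit stop more than 0.5 miles away or fewer than 2 stops within 0.5 miles) reduces to a boolean expression. A sketch with illustrative column names:

```python
import pandas as pd

# Illustrative per-tract inputs (distances in miles)
tracts = pd.DataFrame({
    "grocery_dist_mi":   [0.4, 0.8, 2.5, 0.6],
    "transit_dist_mi":   [0.2, 0.7, 0.3, 0.4],
    "stops_within_half": [5,   0,   3,   1],
})

# Mobility desert: stores are reachable by car but transit access is poor
tracts["mobility_desert"] = (
    (tracts["grocery_dist_mi"] <= 1.0)
    & ((tracts["transit_dist_mi"] > 0.5) | (tracts["stops_within_half"] < 2))
)
```

Note the first condition: tracts far from any grocery store are food deserts, a separate category, which is why a tract at 2.5 miles is not flagged here.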
Computational Architecture
The statewide analysis pipeline:
1. Data Acquisition (parallel)
├── Census demographics (ACS API)
├── Grocery stores (Google Places + USDA)
├── Transit stops (Cal-ITP)
└── Tract geometries (Census TIGER)
2. Preprocessing
├── Filter to residential tracts
├── Calculate population-weighted centroids
├── Geocode and validate store locations
└── Deduplicate transit stops
3. Spatial Analysis
├── KD-tree for grocery distances
├── KD-tree for transit stop distances
└── Count stops within radius
4. Index Calculation
├── Normalize each component
├── Weight and combine
└── Classify vulnerability levels
5. Aggregation
├── Tract-level output
├── County summaries
└── Regional aggregations
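One preprocessing step above worth sketching is transit stop deduplication. Rounding coordinates to four decimal places (roughly 11 m) is one plausible way to collapse co-located stops that appear in multiple agency feeds; the pipeline's actual tolerance may differ:

```python
import pandas as pd

# Illustrative stops: the first two are the same physical stop
# reported by two different agency feeds
stops = pd.DataFrame({
    "stop_id":  ["a1", "b7", "c3"],
    "stop_lat": [37.77490, 37.77491, 34.05220],
    "stop_lon": [-122.41940, -122.41941, -118.24370],
})

# Round to ~11 m so near-identical coordinates collapse to one record
stops["lat_r"] = stops["stop_lat"].round(4)
stops["lon_r"] = stops["stop_lon"].round(4)
deduped = stops.drop_duplicates(subset=["lat_r", "lon_r"])
```

Without this step, overlapping agency feeds would inflate stop counts and bias the stops-within-radius metric upward in multi-agency regions.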
Total runtime: approximately 45 minutes on a modern laptop (M1 MacBook Pro). Most time is spent in data download and initial preprocessing. The core calculations are fast.
Validation Checks for Statewide Analysis
Scaling up increases error risk. These validation steps helped ensure quality:
1. Row Count Verification
After each merge operation, verify row counts match expectations:
```python
assert len(merged_df) == 9039, f"Expected 9039 tracts, got {len(merged_df)}"
```
Silent row loss from failed joins is a common bug in scaled analysis.
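pandas can also enforce these expectations at merge time: `validate=` raises on unexpected join cardinality, and `indicator=True` exposes rows that failed to match. A sketch with toy frames:

```python
import pandas as pd

tracts = pd.DataFrame({"geoid": ["06001400100", "06001400200"]})
income = pd.DataFrame({"geoid": ["06001400100"], "median_income": [85_000]})

merged = tracts.merge(
    income, on="geoid", how="left",
    validate="one_to_one",   # raise if geoid repeats on either side
    indicator=True,          # adds a _merge column marking match status
)

# Tracts that found no income record, rather than silently vanishing
unmatched = merged[merged["_merge"] == "left_only"]
```

A `how="left"` join preserves all tracts, so missing data surfaces as flagged rows instead of silently shrinking the frame.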
2. Missing Value Audit
Check for unexpected nulls in key fields:
```python
missing_report = df.isnull().sum()
assert missing_report['vulnerability_index'] == 0
```
3. Distribution Sanity Checks
Verify that value distributions are plausible:
```python
# Vulnerability should be roughly normal, bounded 0-1
assert df['vulnerability_index'].min() >= 0
assert df['vulnerability_index'].max() <= 1
assert 0.2 < df['vulnerability_index'].mean() < 0.5
```
4. Geographic Coverage Verification
Map the data to verify all counties are present:
```python
assert df['county_fips'].nunique() == 58
```
5. External Benchmark Comparison
Compare key metrics against published data:
- Total California population (Census)
- Grocery store counts (USDA)
- Transit agency counts (Cal-ITP documentation)
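Such comparisons can be automated with a tolerance check. The benchmark values below are hard-coded illustrations, not the published figures themselves:

```python
def within_tolerance(computed: float, published: float, tol: float = 0.05) -> bool:
    """True if a computed total is within tol (default 5%) of a published benchmark."""
    return abs(computed - published) / published <= tol

# Illustrative: computed tract-population sum vs. a published state total
assert within_tolerance(computed=38_900_000, published=39_000_000)
```

Loose tolerances are deliberate: benchmark sources use different vintages and definitions, so the check should catch order-of-magnitude errors, not demand exact agreement.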
Remaining Limitations
Transit Time Routing Not Scaled
The statewide analysis uses stop proximity as a transit accessibility proxy. Full transit time routing (using r5py with GTFS schedules) was performed only for the 7-county pilot.
Computing transit times for all 9,039 tracts to all ~25,000 stores would require:
- Acquiring and validating GTFS from 200+ agencies
- Building routing networks for each region
- Processing ~225 million origin-destination pairs
- Estimated compute time: 50-100 hours
This is feasible but was beyond scope for this analysis. Stop proximity serves as a reasonable proxy for most classification purposes.
Temporal Snapshot
All data reflects a single time point (November 2025 for most sources, 2019-2023 for ACS). Conditions change. Stores open and close. Transit routes change. Populations shift.
The analysis provides a snapshot rather than a longitudinal trend; repeating it annually would reveal change over time.
Uneven Data Quality by Region
Data quality varies by county:
- Urban counties have better grocery store coverage because Google Places has more data where there are more businesses and users.
- Rural counties may have undercounted stores, particularly independent grocers and general stores not in commercial databases.
- Cal-ITP includes stops from 200+ agencies, but some demand-responsive services (dial-a-ride) and smaller rural shuttles are not in the system; stop-level metadata (accessibility features, shelter presence) also varies by agency.
These quality variations don't bias results systematically but add uncertainty to rural county estimates.
Replication Materials
All code and data for the statewide analysis are available on GitHub:
Repository: github.com/vpcholette/food-security-analysis
Key scripts:
- 50_download_calitp_gtfs.py - Transit data acquisition
- 60_calculate_grocery_distances_regional.py - Spatial analysis
- 70_create_master_analysis.py - Index calculation
- 82_mobility_deserts_calitp.py - Mobility desert classification
Output files:
- master_analysis_ca_2023.csv - Tract-level results
- county_rankings_ca_2023.csv - County summaries
- mobility_deserts_calitp_ca_2023.csv - Mobility desert classification
Notes
[1] Seven-county pilot included San Francisco, San Mateo, Santa Clara, Alameda, Contra Costa, Sacramento, and San Diego. Analysis completed October 2025.
[2] Cal-ITP (California Integrated Travel Project) aggregates GTFS data from California transit agencies. Data available at data.ca.gov.
[3] Mobility desert classification: tracts with grocery store within 1 mile but transit stop > 0.5 miles or < 2 stops within 0.5 miles.
Tags: #FoodSecurity #California #Scaling #DataPipeline #GTFS #CalITP #Methodology #BigData
Next in this series: Who gets left behind? Analyzing transit access disparities by race and ethnicity.
How to Cite This Research
Too Early To Say. "Scaling Up: From 7 Counties to Statewide." December 2025. https://www.tooearlytosay.com/research/food-security/scaling-statewide/