When we started this analysis, we focused on 7 Bay Area and major metro counties: approximately 2,000 census tracts.[1] But to understand California's food access landscape, we need to see the whole state. Expanding to all 58 counties means 9,039 residential tracts, 200+ transit agencies, and statewide grocery store coverage. The computational scale increases 4.5-fold, but as we discovered, the methodological challenges increase more.
Here is what we learned when we scaled up: what worked, what broke, and what the statewide data reveals that county-level analysis could not show us.
The Scale of Statewide Analysis
| Metric | 7-County Pilot | Statewide |
|---|---|---|
| Census tracts | ~2,000 | 9,039 |
| Transit agencies | 8 | 200+ |
| Transit stops | 24,421 | 64,060 |
| Grocery stores | ~6,600 | 24,850+ |
| Origin-destination pairs | ~13M | ~225M |

Figure 10.1: Data Scale Expansion from Seven-County Pilot to Statewide California Analysis
Comparison of pilot (7 Bay Area counties) vs. statewide scope: census tracts (2,000 vs. 9,039), transit agencies (8 vs. 200+), transit stops (24,400 vs. 64,100), grocery stores (6,600 vs. 24,900). Transit stops deduplicated; grocery stores from SafeGraph Core Places (2023).
The 9,039 residential tracts exclude non-residential census units (parks, airports, industrial zones, water bodies). California has approximately 9,100 total census tracts under 2020 boundaries; filtering for population density > 100/sq mi and population > 500 produces the residential subset.
What Scaled Well
Census Data Acquisition
ACS data downloads scaled linearly. The same API calls work for one county or 58:
```python
from census import Census

c = Census("YOUR_API_KEY")

# Get data for all California tracts
income = c.acs5.state_county_tract(
    fields=["B19013_001E"],  # Median household income
    state_fips="06",         # California
    county_fips="*",         # All counties
    tract="*"                # All tracts
)
```
Processing time increased proportionally but remained manageable (minutes, not hours).
Grocery Store Distance Calculations
The KD-tree spatial indexing approach handles large datasets efficiently:
```python
from scipy.spatial import cKDTree

# Note: project lat/lon to a planar CRS (meters) before building the tree;
# raw degrees distort east-west distances at California's latitudes.
store_coords = stores[['latitude', 'longitude']].values
tree = cKDTree(store_coords)

# Query the nearest store for each tract centroid (vectorized)
distances, indices = tree.query(tract_centroids, k=1)
```
Finding the nearest store among 24,850 options for 9,039 tracts takes under 30 seconds. The O(n log n) scaling of KD-trees makes this tractable even at larger scales.
Vulnerability Index Calculation
The composite index calculation is pure arithmetic on DataFrames. It scales to any number of rows without special handling:
```python
# Each component normalized 0-1, then weighted
df['vulnerability_index'] = (
    0.25 * df['food_access_score'] +
    0.25 * df['poverty_score'] +
    0.20 * df['renter_score'] +
    0.15 * df['minority_score'] +
    0.15 * df['sprawl_score']
)
```
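Before weighting, each component must be brought onto the 0-1 scale. A minimal sketch of one way to do this, min-max scaling, with an illustrative column name (the sprawl component instead uses percentile ranks, discussed below):

```python
import pandas as pd

def min_max_score(series: pd.Series) -> pd.Series:
    """Scale a raw component to the 0-1 range used by the index."""
    rng = series.max() - series.min()
    if rng == 0:
        # Constant component: return zeros rather than dividing by zero
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / rng

# Illustrative data, not values from the analysis
df = pd.DataFrame({"poverty_rate": [0.05, 0.12, 0.30, 0.21]})
df["poverty_score"] = min_max_score(df["poverty_rate"])
```

The zero-range guard matters in practice: a component that is constant across all tracts would otherwise produce a division by zero.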
What Required Adaptation
Transit Data Aggregation
Individual agency GTFS downloads do not scale. Acquiring feeds from 200+ agencies manually would take weeks and produce inconsistent results.
Solution: Use Cal-ITP's aggregated statewide feed.[2] This provides standardized GTFS data from all California transit agencies in a single download.
```python
import pandas as pd

# Cal-ITP statewide stops (aggregated). The URL below is the dataset
# landing page; substitute the direct CSV resource URL from that page.
CALITP_URL = "https://data.ca.gov/dataset/cal-itp-gtfs-ingest-pipeline"
stops = pd.read_csv(CALITP_URL + "/stops.csv")

# Filter to California's bounding box
stops = stops[
    (stops['stop_lat'] >= 32) & (stops['stop_lat'] <= 42) &
    (stops['stop_lon'] >= -125) & (stops['stop_lon'] <= -114)
]
```
The trade-off is that Cal-ITP provides stop locations but not full GTFS schedules for all agencies. For statewide stop-based analysis (mobility desert classification), this works. For detailed transit time routing, individual GTFS feeds are still needed.
Memory Management
Loading 225 million origin-destination pairs into memory exceeds typical laptop RAM. Several strategies helped:
- Chunked processing: Process tracts in batches of 500-1,000, write results to disk, then concatenate.
- Sparse representation: For binary classifications (is this tract a mobility desert?), store only the classification rather than all underlying data.
- On-disk computation: Use Dask or similar tools for out-of-core DataFrames when necessary.
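The chunked pattern can be sketched as follows. The batch size and the placeholder per-batch computation are illustrative; the real pipeline writes each batch to disk before concatenating:

```python
import pandas as pd

CHUNK = 500  # tracts per batch

def process_batch(batch: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the real per-tract spatial computation."""
    out = batch.copy()
    out["processed"] = True
    return out

def process_in_chunks(tracts: pd.DataFrame, chunk: int = CHUNK) -> pd.DataFrame:
    # In the real pipeline each batch would be written to disk and
    # concatenated at the end; a list suffices for this sketch.
    results = []
    for start in range(0, len(tracts), chunk):
        results.append(process_batch(tracts.iloc[start:start + chunk]))
    return pd.concat(results, ignore_index=True)

tracts = pd.DataFrame({"geoid": range(1_200)})
result = process_in_chunks(tracts)
```

Only one batch's intermediate data is ever in memory at a time, which is what keeps the 225M-pair computation within laptop RAM.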
Sprawl Index Normalization
The pilot analysis used raw sprawl index values. These ranged from 0.005 (dense urban) to 27,942 (rural ranch land). When combined with other normalized (0-1) components, the raw sprawl values dominated the vulnerability index.
Solution: Percentile rank normalization:
```python
# Rank-based normalization to 0-1 scale
df['sprawl_score'] = df['sprawl_index'].rank(pct=True)
```
This preserves the ordering (more sprawl = higher score) while constraining values to the same 0-1 range as other components.
What Statewide Data Reveals
Finding 1: Regional Clustering
The statewide map reveals geographic clusters invisible in county-level analysis:
Central Valley corridor: Merced, Madera, Fresno, Tulare, and Kern form a contiguous high-vulnerability band. This suggests regional factors (agricultural economy, transportation infrastructure, and historical investment patterns) that transcend county boundaries.
Coastal gradient: Vulnerability increases inland from the coast throughout California. This gradient is continuous, not county-bounded.
Bay Area exception: The Bay Area shows a reversed pattern, with higher vulnerability in some coastal areas (San Francisco's Tenderloin, Oakland's flatlands) than inland suburbs.
Finding 2: County Averages Hide Extreme Tracts
Los Angeles County's mean vulnerability (0.38) is moderate, but the county contains tracts ranging from 0.15 to 0.72. Statewide analysis reveals that some of California's highest-vulnerability tracts are in LA County, embedded within an overall moderate-vulnerability county.
Similar patterns appear in San Diego, Sacramento, and Fresno. County-level reporting would miss these pockets of extreme vulnerability.
Finding 3: Small Counties Drive Extreme Rankings
The most vulnerable counties by mean score are often small:
- Alpine: 1,200 residents, 2 tracts
- Sierra: 3,200 residents, 3 tracts
- Modoc: 8,700 residents, 5 tracts
With few tracts, county averages are heavily influenced by individual tract values. A single high-vulnerability tract in a 2-tract county pulls the county average dramatically.
This creates interpretation challenges; Alpine's #2 vulnerability ranking does not mean 1,200 people face worse conditions than 39 million people in lower-ranked counties. Population-weighted statewide analysis provides better perspective.
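The gap between unweighted and population-weighted means can be sketched with toy tract values (the numbers here are illustrative, not results from the analysis):

```python
import pandas as pd

def weighted_mean(df: pd.DataFrame, value_col: str, weight_col: str) -> float:
    """Population-weighted mean of a tract-level value."""
    return (df[value_col] * df[weight_col]).sum() / df[weight_col].sum()

# Illustrative: a tiny county with one extreme tract vs. a huge county
tracts = pd.DataFrame({
    "county": ["Alpine", "Alpine", "Los Angeles", "Los Angeles"],
    "vulnerability_index": [0.70, 0.30, 0.40, 0.36],
    "population": [600, 600, 2_000_000, 2_000_000],
})

unweighted = tracts["vulnerability_index"].mean()            # treats all tracts equally
statewide = weighted_mean(tracts, "vulnerability_index", "population")
```

The unweighted mean (0.44) is pulled up by Alpine's 0.70 tract; the population-weighted mean (~0.38) reflects where people actually live.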
Finding 4: The 12% Mobility Desert Finding
The corrected statewide analysis identified 1,086 mobility desert tracts (12% of residential tracts).[3] This finding required Cal-ITP's aggregated data from 200+ agencies. The 7-county pilot, using feeds from only 8 major regional agencies, missed municipal bus systems that serve suburban areas and could not accurately classify tracts outside those agencies' service areas.
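Given per-tract distances and stop counts, the classification rule from note 3 (grocery store within 1 mile, but transit stop more than 0.5 miles away or fewer than 2 stops within 0.5 miles) reduces to a boolean expression. A sketch with illustrative column names:

```python
import pandas as pd

# Illustrative per-tract inputs (distances in miles)
tracts = pd.DataFrame({
    "grocery_dist_mi":   [0.4, 0.8, 2.5, 0.6],
    "transit_dist_mi":   [0.2, 0.7, 0.3, 0.4],
    "stops_within_half": [5,   0,   3,   1],
})

# Mobility desert: stores are reachable by car but transit access is poor
tracts["mobility_desert"] = (
    (tracts["grocery_dist_mi"] <= 1.0)
    & ((tracts["transit_dist_mi"] > 0.5) | (tracts["stops_within_half"] < 2))
)
```

Note the first condition: tracts far from any grocery store are food deserts, a separate category, which is why a tract at 2.5 miles is not flagged here.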
Computational Architecture
The statewide analysis pipeline:
1. Data Acquisition (parallel)
├── Census demographics (ACS API)
├── Grocery stores (Google Places + USDA)
├── Transit stops (Cal-ITP)
└── Tract geometries (Census TIGER)
2. Preprocessing
├── Filter to residential tracts
├── Calculate population-weighted centroids
├── Geocode and validate store locations
└── Deduplicate transit stops
3. Spatial Analysis
├── KD-tree for grocery distances
├── KD-tree for transit stop distances
└── Count stops within radius
4. Index Calculation
├── Normalize each component
├── Weight and combine
└── Classify vulnerability levels
5. Aggregation
├── Tract-level output
├── County summaries
└── Regional aggregations
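One preprocessing step above worth sketching is transit stop deduplication. Rounding coordinates to four decimal places (roughly 11 m) is one plausible way to collapse co-located stops that appear in multiple agency feeds; the pipeline's actual tolerance may differ:

```python
import pandas as pd

# Illustrative stops: the first two are the same physical stop
# reported by two different agency feeds
stops = pd.DataFrame({
    "stop_id":  ["a1", "b7", "c3"],
    "stop_lat": [37.77490, 37.77491, 34.05220],
    "stop_lon": [-122.41940, -122.41941, -118.24370],
})

# Round to ~11 m so near-identical coordinates collapse to one record
stops["lat_r"] = stops["stop_lat"].round(4)
stops["lon_r"] = stops["stop_lon"].round(4)
deduped = stops.drop_duplicates(subset=["lat_r", "lon_r"])
```

Without this step, overlapping agency feeds would inflate stop counts and bias the stops-within-radius metric upward in multi-agency regions.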
Total runtime: approximately 45 minutes on a modern laptop (M1 MacBook Pro). Most time is spent in data download and initial preprocessing. The core calculations are fast.
Validation Checks for Statewide Analysis
Scaling up increases error risk. These validation steps helped ensure quality:
1. Row Count Verification
After each merge operation, verify row counts match expectations:
```python
assert len(merged_df) == 9039, f"Expected 9039 tracts, got {len(merged_df)}"
```
Silent row loss from failed joins is a common bug in scaled analysis.
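pandas can also enforce these expectations at merge time: `validate=` raises on unexpected join cardinality, and `indicator=True` exposes rows that failed to match. A sketch with toy frames:

```python
import pandas as pd

tracts = pd.DataFrame({"geoid": ["06001400100", "06001400200"]})
income = pd.DataFrame({"geoid": ["06001400100"], "median_income": [85_000]})

merged = tracts.merge(
    income, on="geoid", how="left",
    validate="one_to_one",   # raise if geoid repeats on either side
    indicator=True,          # adds a _merge column marking match status
)

# Tracts that found no income record, rather than silently vanishing
unmatched = merged[merged["_merge"] == "left_only"]
```

A `how="left"` join preserves all tracts, so missing data surfaces as flagged rows instead of silently shrinking the frame.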
2. Missing Value Audit
Check for unexpected nulls in key fields:
```python
missing_report = df.isnull().sum()
assert missing_report['vulnerability_index'] == 0
```
3. Distribution Sanity Checks
Verify that value distributions are plausible:
```python
# Vulnerability should be roughly normal, bounded 0-1
assert df['vulnerability_index'].min() >= 0
assert df['vulnerability_index'].max() <= 1
assert 0.2 < df['vulnerability_index'].mean() < 0.5
```
4. Geographic Coverage Verification
Map the data to verify all counties are present:
```python
assert df['county_fips'].nunique() == 58
```
5. External Benchmark Comparison
Compare key metrics against published data:
- Total California population (Census)
- Grocery store counts (USDA)
- Transit agency counts (Cal-ITP documentation)
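Such comparisons can be automated with a tolerance check. The benchmark values below are hard-coded illustrations, not the published figures themselves:

```python
def within_tolerance(computed: float, published: float, tol: float = 0.05) -> bool:
    """True if a computed total is within tol (default 5%) of a published benchmark."""
    return abs(computed - published) / published <= tol

# Illustrative: computed tract-population sum vs. a published state total
assert within_tolerance(computed=38_900_000, published=39_000_000)
```

Loose tolerances are deliberate: benchmark sources use different vintages and definitions, so the check should catch order-of-magnitude errors, not demand exact agreement.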
Remaining Limitations
Transit Time Routing Not Scaled
The statewide analysis uses stop proximity as a transit accessibility proxy. Full transit time routing (using r5py with GTFS schedules) was performed only for the 7-county pilot.
Computing transit times for all 9,039 tracts to all ~25,000 stores would require:
- Acquiring and validating GTFS from 200+ agencies
- Building routing networks for each region
- Processing ~225 million origin-destination pairs
- Estimated compute time: 50-100 hours
This is feasible but was beyond scope for this analysis. Stop proximity serves as a reasonable proxy for most classification purposes.
Temporal Snapshot
All data reflects a single time point (November 2025 for most sources, 2019-2023 for ACS). Conditions change. Stores open and close. Transit routes change. Populations shift.
The analysis provides a snapshot rather than a longitudinal trend; repeating it annually would reveal change over time.
Uneven Data Quality by Region
Data quality varies by county:
- Urban counties have better grocery store coverage because Google Places has more data where there are more businesses and users.
- Rural counties may have undercounted stores, particularly independent grocers and general stores not in commercial databases.
- Cal-ITP includes stops from 200+ agencies, but some demand-responsive services (dial-a-ride) and smaller rural shuttles are not in the system; stop-level metadata (accessibility features, shelter presence) also varies by agency.
These quality variations don't bias results systematically but add uncertainty to rural county estimates.
Replication Materials
All code and data for the statewide analysis are available on GitHub:
Repository: github.com/vpcholette/food-security-analysis
Key scripts:
- 50_download_calitp_gtfs.py - Transit data acquisition
- 60_calculate_grocery_distances_regional.py - Spatial analysis
- 70_create_master_analysis.py - Index calculation
- 82_mobility_deserts_calitp.py - Mobility desert classification
Output files:
- master_analysis_ca_2023.csv - Tract-level results
- county_rankings_ca_2023.csv - County summaries
- mobility_deserts_calitp_ca_2023.csv - Mobility desert classification
Notes
[1] Seven-county pilot included San Francisco, San Mateo, Santa Clara, Alameda, Contra Costa, Sacramento, and San Diego. Analysis completed October 2025.
[2] Cal-ITP (California Integrated Travel Project) aggregates GTFS data from California transit agencies. Data available at data.ca.gov.
[3] Mobility desert classification: tracts with grocery store within 1 mile but transit stop > 0.5 miles or < 2 stops within 0.5 miles.
Tags: #FoodSecurity #California #Scaling #DataPipeline #GTFS #CalITP #Methodology #BigData
Next in this series: Who gets left behind? Analyzing transit access disparities by race and ethnicity.
How to Cite This Research
Too Early To Say. "Scaling Up: From 7 Counties to Statewide." December 2025. https://www.tooearlytosay.com/research/food-security/scaling-statewide/