Spatial Analysis with GeoPandas: From Joins to Autocorrelation

The foundation of most spatial analysis is a spatial join: merging point data (e.g., provider locations) with polygon boundaries (e.g., census tracts) using gpd.sjoin(). Once we have spatially joined data, we can build neighborhood weights and test for spatial patterns. But that pipeline can break in ways that never raise an error. Coordinate reference systems can disagree silently. Tracts with missing geometries can vanish from the output without a trace. And even when the join succeeds, spatial autocorrelation in the underlying data can invalidate any statistical test we run downstream.

This walkthrough follows the full arc: loading boundaries, computing centroids, constructing spatial weights, testing for autocorrelation with Moran's I, and identifying local clusters via LISA. The workflow comes from a food security mobility study of a California county, where 108 census tracts needed accurate point-to-polygon matching and spatial diagnostics. The goal here is to flag the places where things go wrong and explain why the defaults are not always safe.

The PySAL ecosystem offers extensive spatial analysis capabilities: spreg for spatial regression, esda for exploratory analysis, and libpysal.weights for flexible neighborhood definitions. For point-in-polygon joins specifically, GeoPandas sjoin wraps the spatial index operations and CRS alignment that we handle manually below. Here we take a more deliberate approach, using pyshp for boundary loading and scipy for distance computations, keeping the workflow transparent so each assumption is visible. For production pipelines or analyses requiring contiguity-based weights, PySAL or GeoPandas would likely be more practical.

To catch the failures that convenience functions can hide, we need tools that expose the full pipeline. Here is what we use and why each component matters.

Tool Stack: pyshp, scipy, and numpy

Component	Tool	Purpose
Geometries	Census TIGER/Line shapefiles (2023)	Tract boundary polygons
Spatial operations	`shapefile` + `scipy`	Point-in-polygon matching, distance matrix construction
Spatial statistics	Manual Moran's I implementation	Autocorrelation testing without PySAL dependency
Validation	Quartile classification	Urban/rural pattern sanity checks

Step 1: Load the Census Tract Shapefile

Why not load the entire state shapefile and filter later? Memory. California's 2023 TIGER/Line tract file contains over 9,000 records with complex polygon geometries. If we only need one county, filtering at read time cuts memory use dramatically.

One thing to keep in mind: the shapeRecords() iterator still touches every record in the file, only discarding rather than storing. For truly large shapefiles, a bounding-box pre-filter using the .shx index file would be more efficient. For county-level extraction, though, iterating and filtering works without additional dependencies.

Let's start by reading the shapefile with pyshp and pulling only the tracts where COUNTYFP == '085':

import shapefile
import numpy as np

sf = shapefile.Reader("tl_2023_06_tract")
fields = [field[0] for field in sf.fields[1:]]
geoid_idx = fields.index('GEOID')
county_idx = fields.index('COUNTYFP')

tracts = {}
for record in sf.shapeRecords():
    if record.record[county_idx] == '085':
        geoid = record.record[geoid_idx]
        points = record.shape.points
        parts = record.shape.parts
        tracts[geoid] = {
            'points': points,
            'parts': parts,
            'record': record.record
        }

print(f"Loaded {len(tracts)} tracts for target county")
# Expected: 108 tracts

Step 2: Extract Tract Centroids from Polygon Geometries

With tract boundaries loaded, we need centroids for spatial statistics. The naive approach, averaging all polygon vertices, is surprisingly adequate for convex-ish census tracts, though it breaks down for tracts with unusual shapes (river-following boundaries, donut holes).

One defensive check worth building in: any tract with an empty geometry list gets skipped, and that skip is silent unless we add a warning ourselves. In our county-level analysis, all 108 tracts had valid geometries. But when we ran the same pipeline on a statewide dataset during testing, 3 tracts dropped out without any error. The len(points) == 0 guard below is the kind of check that feels unnecessary until the day it catches something.

coords = []
geoids = []

for geoid, tract in tracts.items():
    points = tract['points']
    if len(points) == 0:
        print(f"WARNING: Tract {geoid} has no geometry points")
        continue
    centroid_x = np.mean([p[0] for p in points])
    centroid_y = np.mean([p[1] for p in points])
    coords.append([centroid_x, centroid_y])
    geoids.append(geoid)

coords = np.array(coords)
print(f"Computed centroids for {len(coords)} tracts")

Step 3: Define Which Tracts Are Neighbors (k-NN Weights)

Spatial autocorrelation statistics require a weights matrix that defines which tracts are "neighbors." There are several ways to define this: contiguity (shared borders), distance bands, or k-nearest-neighbors.

Why k=8? It's a common default in spatial econometrics: enough neighbors to smooth over boundary artifacts, few enough to preserve genuinely local variation. For our 108 tracts, k=8 means each tract's spatial lag draws from roughly 7% of all tracts, which keeps the weights matrix sparse.

The code below also row-standardizes the weights, an adjustment that is easy to overlook but changes the results substantially. Without it, tracts in dense urban clusters effectively receive higher weight in aggregate statistics, because their k nearest neighbors are physically closer and more numerous in a given radius. Row-standardization forces each tract to contribute equally regardless of how tightly packed its neighbors are.

from scipy.spatial import distance_matrix

dist_matrix = distance_matrix(coords, coords)
n = len(coords)
k = 8

weights = np.zeros((n, n))
for i in range(n):
    # Sort by distance, skip self (index 0 after sort)
    nearest = np.argsort(dist_matrix[i])[1:k+1]
    weights[i, nearest] = 1

# Row-standardize: each row sums to 1
row_sums = weights.sum(axis=1)
# Guard against isolated tracts (row_sum == 0)
row_sums[row_sums == 0] = 1
weights_std = weights / row_sums[:, np.newaxis]

Step 4: Test Whether SNAP Participation Clusters Spatially

Is SNAP participation spatially clustered, or randomly distributed? Global Moran's I gives us a single number between -1 and +1, where positive values indicate clustering and negative values indicate dispersion.

def morans_i(values, W):
    n = len(values)
    x_bar = np.mean(values)
    z = values - x_bar

    S0 = W.sum()
    numerator = 0.0
    for i in range(n):
        for j in range(n):
            numerator += W[i, j] * z[i] * z[j]

    denominator = np.sum(z ** 2)

    I = (n / S0) * (numerator / denominator)

    # Expected value under null hypothesis
    E_I = -1.0 / (n - 1)

    # Variance (normality assumption)
    S1 = 0.5 * np.sum((W + W.T) ** 2)
    S2 = np.sum((W.sum(axis=1) + W.sum(axis=0)) ** 2)

    A = n * ((n**2 - 3*n + 3) * S1 - n * S2 + 3 * S0**2)
    B = (n**2 - n) * S1 - 2 * n * S2 + 6 * S0**2
    C = (n - 1) * (n - 2) * (n - 3) * S0**2

    # Kurtosis of z
    D = np.sum(z**4) / (np.sum(z**2)**2 / n)

    var_I = (A - D * B) / C - E_I**2

    z_score = (I - E_I) / np.sqrt(var_I)

    return I, E_I, var_I, z_score

# Example with SNAP participation rates per tract
I, E_I, var_I, z_score = morans_i(snap_rates, weights_std)
print(f"Moran's I: {I:.4f}, z-score: {z_score:.2f}")

In our food security analysis, SNAP participation rates returned a Moran's I above 0.3, with a z-score that rejected the null hypothesis of spatial randomness. To put that in concrete terms: a Moran's I of 0.32 with p < 0.001 suggests moderate positive spatial autocorrelation. Areas with high SNAP participation tend to cluster near other high-participation areas, and vice versa. This isn't random; there's a geographic pattern worth investigating. Any regression model that ignores this spatial structure risks inflated significance.

About the variance formula. The S0/S1/S2 components are easy to get wrong, and the result will still look plausible. A Moran's I of 0.32 with a z-score of 3.1 seems reasonable whether the variance is correct or not. The only way to verify is to compare against a known implementation (PySAL's esda.Moran, for instance) on the same data. During development, this cross-check caught an off-by-one error in the S2 calculation that had shifted our p-values by about 0.02.

Step 5: Identify Local Clusters via LISA

Global Moran's I tells us that clustering exists. It doesn't tell us where. Local Indicators of Spatial Association (LISA) decompose the global statistic into tract-level contributions. LISA identifies four cluster types: High-High (hot spots where high values cluster near other high values), Low-Low (cold spots where low values cluster), and two spatial outlier types (High-Low and Low-High) where a tract differs sharply from its neighbors. The outlier types often reveal the most interesting stories.

def lisa(values, W):
    n = len(values)
    x_bar = np.mean(values)
    z = values - x_bar
    m2 = np.sum(z ** 2) / n

    local_i = np.zeros(n)
    for i in range(n):
        lag = np.sum(W[i, :] * z)
        local_i[i] = (z[i] / m2) * lag

    return local_i, z

local_i, z = lisa(snap_rates, weights_std)

# Classify into quadrants
spatial_lag = weights_std @ z
for i in range(n):
    if z[i] > 0 and spatial_lag[i] > 0:
        cluster_type = "HH"  # Hot spot
    elif z[i] < 0 and spatial_lag[i] < 0:
        cluster_type = "LL"  # Cold spot
    elif z[i] > 0 and spatial_lag[i] < 0:
        cluster_type = "HL"  # Spatial outlier (high amid low)
    elif z[i] < 0 and spatial_lag[i] > 0:
        cluster_type = "LH"  # Spatial outlier (low amid high)

In our analysis, a handful of tracts in otherwise low-SNAP areas showed elevated participation (HL outliers), suggesting localized pockets of food insecurity that a non-spatial analysis would average away.

Step 6: Validate Against Known Urban/Rural Patterns

The last step is a reality check. Do the spatial clusters align with what we know about the county's geography? Dense urban core tracts should show different patterns than affluent suburban areas. If the LISA map shows hot spots in wealthy neighborhoods, something went wrong upstream.

# Quartile classification for quick visual validation
quartiles = np.percentile(snap_rates, [25, 50, 75])
classifications = np.digitize(snap_rates, quartiles)

# Cross-tabulate with LISA clusters
for q in range(4):
    mask = classifications == q
    hh_count = np.sum((local_i[mask] > 0) & (z[mask] > 0))
    print(f"Quartile {q+1}: {hh_count} hot-spot tracts out of {mask.sum()}")

If quartile 4 (highest SNAP participation) contains most of the HH clusters and quartile 1 contains most of the LL clusters, the spatial structure is consistent. Misalignment between quartile rank and cluster type warrants investigation: it could indicate a data error, a CRS mismatch, or genuinely unusual spatial patterns.

That completes the core workflow. But each of these steps can fail silently, producing output that looks plausible until we discover tracts are missing or p-values are wrong.

What Can Go Wrong

Several failure modes here produce plausible-looking output rather than errors:

CRS mismatch between shapefile and point data produces joins that look correct but assign points to wrong polygons. TIGER/Line shapefiles use NAD83 (EPSG:4269). If our point data is in WGS84 (EPSG:4326), the coordinates differ by meters, enough to shift boundary assignments in dense urban areas.
The Moran's I variance formula requires S0/S1/S2 matrices, and getting it wrong produces plausible but incorrect p-values. The formula involves fourth-moment kurtosis corrections that differ between normality and randomization assumptions. Using the wrong assumption shifts p-values, potentially crossing significance thresholds.
Large shapefiles consume memory fast. California's full tract file is manageable, but loading nationwide TIGER/Line data without county filtering will consume several gigabytes. Filtering by FIPS code before constructing geometries is essential for statewide or national analyses.

Given these pitfalls, most of which produce silent failures rather than crashes, when does this manual approach make sense?

When to Use This Approach

This manual approach trades convenience for transparency. Whether that tradeoff makes sense depends on the project.

Good fit:

Moderate-scale analyses (50-500 spatial units) where spatial dependence must be diagnosed before modeling
Projects where minimizing dependencies matters (this approach uses only shapefile, scipy, and numpy)
Exploratory work where understanding the mechanics of spatial statistics, rather than the output alone, is the goal

Less suitable:

Large-scale production pipelines (10,000+ units) where vectorized PySAL operations are significantly faster
Analyses requiring contiguity-based weights (shared borders), which are more complex to extract from raw shapefiles
Projects where reproducibility demands a well-tested, community-maintained library rather than manual implementations

Limitations

Building spatial operations from scratch exposes assumptions, but it also means accepting simplifications that established libraries handle better.

Centroid approximation. Averaging polygon vertices produces reasonable centroids for compact tracts but distorts for elongated or non-convex shapes. Area-weighted centroids would be more accurate.
Fixed k for all tracts. Using k=8 neighbors everywhere assumes similar spatial density. Adaptive bandwidth methods (varying k by local density) would better capture rural-urban differences.
Normality assumption in variance. Our Moran's I significance test assumes normally distributed values. For skewed distributions (common in SNAP participation data), a permutation-based test would be more robust.
No edge correction. Tracts at the county boundary have fewer potential neighbors, which can bias LISA statistics toward non-significance at the margins.
Single time period. This workflow analyzes a cross-sectional snapshot. Temporal dynamics in SNAP participation or store openings/closures require panel-data extensions.

Code Availability

The full implementation, including data preprocessing and visualization scripts, is available at github.com/dphdmae/foodsecurity_mobility. The spatial validation workflow corresponds to scripts/phase5_spatial_validation.py.

References and Notes

[1] TIGER/Line shapefiles are produced by the U.S. Census Bureau and available at census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html. The 2023 vintage used here includes boundary updates from the 2020 Census.

[2] Moran's I was introduced by Patrick Moran in 1950. The variance formula used here follows the normality assumption derivation in Cliff and Ord (1981), which remains the standard reference for spatial autocorrelation statistics.

[3] The study county contains 108 census tracts as of the 2020 Census. The tract count reflects post-2020 redistricting and may differ from older TIGER/Line vintages.