Research Methods

Reproducible applied economics research combining AI-assisted workflows with rigorous data validation. Every article includes Python code, methodology documentation, and links to GitHub replication materials.

- 2.7M transit routes calculated
- 94% classifier accuracy
- 6,613 stores validated
- 18 replication repos

Why Methods Matter: The Reproducibility Crisis and AI-Assisted Research

Applied economics faces a reproducibility problem. Studies report findings, but replication attempts often fail. Sometimes the data isn't available. Sometimes the code doesn't run. Sometimes the methodology described in the paper doesn't match what was actually done. The result: a literature where individual findings are difficult to verify and systematic reviews struggle to synthesize evidence.

This publication takes a different approach. Every article includes complete methodology documentation, working Python code, and links to public GitHub repositories containing all data and scripts needed to reproduce the analysis. When we calculate 2.7 million transit routes or validate 6,613 grocery stores, anyone can verify our work.

AI-Assisted Research: A New Paradigm

The integration of AI into research workflows represents a fundamental shift in how applied economics gets done. Large language models like Claude can translate methodology paragraphs into working code, identify edge cases in data pipelines, and help refactor sprawling research codebases into maintainable systems.

But AI assistance isn't a shortcut around rigor—it's a force multiplier for careful methodology. When we use Claude Code to build a transit routing pipeline, the AI doesn't replace domain expertise. It accelerates the translation of that expertise into working code. The researcher still needs to understand what a GTFS feed contains, why origin-destination matrices matter, and how to interpret accessibility indices. The AI helps implement those ideas more quickly and with fewer bugs.

Our approach to AI-assisted research centers on a simple artifact: the CLAUDE.md file. This context file captures project requirements, data constraints, variable definitions, and methodological decisions in one place. When context is explicit, AI assistance becomes more reliable. When requirements are documented, code reviews become more meaningful. When methodology is written down, replication becomes possible.
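A minimal CLAUDE.md might look like the sketch below. The section names and details are illustrative, not the publication's actual file; the point is that requirements, definitions, and decisions live in one explicit place:

```markdown
# Project: Transit access to grocery stores

## Data sources
- GTFS feeds: agency schedules, version pinned in the repo
- Store list: cross-validated against USDA, SNAP retailer DB, CA ABC licenses

## Variable definitions
- travel_time: minutes, transit + walk, per departure time
- access_index: residual from regressing log travel time on density and income

## Methodological decisions
- Departure window: weekday AM peak
- A store counts as "grocery" only if the validated classifier agrees
```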

Data Validation: The Foundation of Credible Research

Most applied economics papers describe their data sources in a paragraph or two. They might mention "grocery store locations from the USDA Food Access Research Atlas" or "transit schedules from the General Transit Feed Specification." But they rarely document the validation process: How accurate are these data? What are the known limitations? How were edge cases handled?

Our approach makes validation explicit. When we collected 6,613 California grocery stores, we didn't just download the USDA's list. We cross-validated against multiple authoritative sources: the official SNAP retailer database, California ABC license records, and manual verification of edge cases. When Google Places classified a store as "grocery," we checked whether it actually sold groceries or was instead a convenience store, liquor store, or pharmacy.

The result was a grocery store classifier with 94% accuracy—validated through 400 hand-labeled examples across multiple rounds of iterative improvement. That 94% matters because conclusions about food access depend on knowing which stores actually sell food. When published research uses unvalidated lists, the findings inherit whatever errors exist in the source data.
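The cross-validation step can be sketched as a source-agreement check: does each candidate store appear in the independent retailer lists? This is a simplified, hypothetical version (the function names and crude normalization rule are illustrative; the published pipeline is in the linked repository):

```python
def normalize(name: str, address: str) -> tuple[str, str]:
    """Crude matching key: lowercase, keep only letters, digits, and spaces."""
    clean = lambda s: "".join(ch for ch in s.lower() if ch.isalnum() or ch == " ").strip()
    return clean(name), clean(address)

def cross_validate(candidates, snap_retailers, abc_licenses):
    """Split candidate stores by how many independent sources confirm them.

    Stores confirmed by zero external sources go to manual review rather
    than being silently kept or dropped.
    """
    snap_keys = {normalize(n, a) for n, a in snap_retailers}
    abc_keys = {normalize(n, a) for n, a in abc_licenses}
    confirmed, needs_review = [], []
    for name, address in candidates:
        key = normalize(name, address)
        hits = (key in snap_keys) + (key in abc_keys)
        (confirmed if hits >= 1 else needs_review).append((name, address, hits))
    return confirmed, needs_review
```

In practice, matching needs fuzzier keys (chain-name aliases, geocoded coordinates), which is exactly where the hand-labeled review rounds come in.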

Transit Routing at Scale: From GTFS to Accessibility Indices

Transit accessibility research requires calculating travel times from many origins to many destinations. A typical county-level study might need travel times from 1,000 census tracts to 200 grocery stores at multiple departure times—that's 200,000+ routing calculations per departure time. Commercial tools like ArcGIS Network Analyst or Google Directions API can do this, but they're expensive at scale.

The r5py library offers a free alternative. Built on Conveyal's R5 routing engine, r5py can calculate millions of transit routes using publicly available GTFS data. Our methodology article walks through the complete process: downloading transit feeds, setting up the routing engine, handling coordinate system transformations, and interpreting the results. The code is available on GitHub.

But transit routing is just the first step. Raw travel times need context. A 45-minute transit trip might be acceptable in a dense urban area but prohibitive in a suburb. The residualized accessibility index addresses this by controlling for baseline accessibility—what transit access would we expect given population density, income levels, and urban form? Residuals identify communities with unusually good or poor access relative to their characteristics.
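The residualization step is an ordinary least-squares regression: predict each tract's log travel time from its covariates, then keep the residual as the index. A minimal numpy sketch, with illustrative covariates and simulated data:

```python
import numpy as np

def residualized_access(travel_time, density, income):
    """Residual of log travel time after controlling for density and income.

    Positive residuals flag worse-than-expected access given tract
    characteristics; negative residuals flag better-than-expected access.
    """
    y = np.log(travel_time)
    # Design matrix: intercept plus covariates.
    X = np.column_stack([np.ones_like(density), density, income])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Simulated tracts (stand-ins for real census data).
rng = np.random.default_rng(0)
density = rng.uniform(1, 10, 200)          # persons per hectare
income = rng.uniform(30, 120, 200)         # median income, $1,000s
travel_time = np.exp(2 + 0.1 * density - 0.01 * income + rng.normal(0, 0.2, 200))

resid = residualized_access(travel_time, density, income)
```

Because the regression includes an intercept, the residuals are mean-zero by construction, so the index is interpreted purely relative to comparable tracts.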

API Collection: Building Resilient Data Pipelines

Modern applied economics increasingly relies on data from APIs: Google Places, Census Bureau, Bureau of Labor Statistics, OpenStreetMap. Each API has its own rate limits, authentication requirements, and failure modes. A robust data collection pipeline needs to handle all of them gracefully.

Our API collection methodology emphasizes resilience. When we collected 6,613 store locations at a cost of $147, the pipeline handled network timeouts, rate limit errors, and intermittent API failures without losing data. Checkpointing saved progress every 50 stores. Exponential backoff prevented rate limit violations. Detailed logging captured every API response for later validation.

The investment in resilient infrastructure pays dividends. When collection runs for hours or days, failures are inevitable. A pipeline that can resume from the last checkpoint, retry failed requests, and log everything for debugging is the difference between a weekend of work and a month of frustration.
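The checkpoint-and-retry pattern described above fits in a few lines. This is a sketch, not the production pipeline: the fetch callable, checkpoint path, and intervals are illustrative, and a real run would also log responses and handle API-specific error types:

```python
import json
import time
from pathlib import Path

def collect(ids, fetch, checkpoint_path="checkpoint.json",
            checkpoint_every=50, max_retries=5):
    """Fetch each id, resuming from a checkpoint file and retrying
    transient failures with exponential backoff."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for i, item_id in enumerate(ids, start=1):
        if item_id in done:  # already collected in an earlier run
            continue
        for attempt in range(max_retries):
            try:
                done[item_id] = fetch(item_id)
                break
            except (TimeoutError, ConnectionError):
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between retries
        if i % checkpoint_every == 0:
            path.write_text(json.dumps(done))  # periodic checkpoint
    path.write_text(json.dumps(done))  # final save
    return done
```

Restarting after a crash re-reads the checkpoint and skips everything already collected, which is what makes a multi-day collection run survivable.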

Key Methodological Insights

Context Files Enable AI Assistance

CLAUDE.md files that capture project requirements, variable definitions, and methodological decisions make AI-assisted coding more reliable. Explicit context reduces hallucination and enables meaningful code review.

Validation Requires Multiple Sources

Cross-validating grocery stores against USDA, SNAP retailer lists, and ABC license records identified classification errors invisible to any single source. Reaching 94% accuracy required 400 hand-labeled examples and multiple rounds of iterative improvement.

Free Tools Scale to Millions

r5py with public GTFS data calculated 2.7 million transit routes at zero cost. The methodology works for any US metro area with publicly available transit feeds.

Resilient Pipelines Prevent Data Loss

Checkpointing, exponential backoff, and comprehensive logging let our API collection run for days without losing data to network errors or rate limits. 6,613 stores, $147, zero lost records.

Frequently Asked Questions

What is AI-assisted research?

AI-assisted research uses large language models like Claude to accelerate the translation of methodological expertise into working code. The researcher provides domain knowledge, variable definitions, and methodological decisions through context files (CLAUDE.md). The AI helps implement these ideas as code, identifies edge cases, and assists with refactoring. AI assistance doesn't replace expertise—it multiplies its impact. See our article on context files in research.

How do you calculate transit accessibility for free?

We use r5py, a Python library built on Conveyal's R5 routing engine. Combined with publicly available GTFS transit feeds, it can calculate millions of multimodal routes at zero cost. Our r5py tutorial walks through the complete process with working code examples.

How do you validate data quality?

We cross-validate against multiple authoritative sources. For grocery store data, we compared USDA Food Access Atlas listings against the official SNAP retailer database, California ABC license records, and manual verification. This iterative process, documented in our grocery store classifier article, achieved 94% accuracy through 400 hand-labeled examples.

Can I replicate your research?

Yes. Every article links to a public GitHub repository containing all data and code needed to reproduce the analysis. Our main replication repository contains 18 research projects with complete documentation.