Open-Source Methods

Python tutorials for researchers transitioning from licensed software. The same econometrics, spatial analysis, and machine learning, implemented with free tools and reproducible code.

Why Python for Applied Economics

Most applied economists learn Stata in graduate school. It works, it is well documented, and departments have site licenses. The problem surfaces after graduation: a Stata/SE license costs $595 per year [1], research assistants need their own copies, and collaborators at different institutions may not have access at all.

Python eliminates these barriers. pandas handles the same data manipulation as Stata's generate and replace commands. statsmodels runs the same regressions [2]. scikit-learn and XGBoost provide machine learning tools that Stata offers only in limited form [3]. GeoPandas opens up spatial analysis that would otherwise require an ArcGIS license at $1,500 per year.
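As a rough illustration of the pandas side of that mapping, here is a minimal sketch. The toy DataFrame and its column names (income, hh_size) are hypothetical and not taken from any of the tutorials. The Stata commands in the comments are the ones each pandas line is meant to mirror.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame(
    {"income": [20000.0, 55000.0, 120000.0], "hh_size": [1, 4, 2]}
)

# Stata: generate log_income = ln(income)
df["log_income"] = np.log(df["income"])

# Stata: generate high_income = 0
#        replace high_income = 1 if income > 100000
df["high_income"] = 0
df.loc[df["income"] > 100_000, "high_income"] = 1

# Stata: generate income_pc = income / hh_size
df["income_pc"] = df["income"] / df["hh_size"]

print(df)
```

The boolean mask inside df.loc plays the role of Stata's if qualifier: only rows where the condition holds are overwritten.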

The transition is not painless. Python requires more boilerplate, error messages are less informative than Stata's, and the ecosystem is fragmented across dozens of packages. These tutorials exist because we went through the transition ourselves and documented every edge case, silent failure, and workaround along the way.

Each tutorial solves a real problem from our research: how to validate GTFS feeds before they crash the routing engine, how to catch the census tracts that spatial joins silently drop, how to build a classifier when 94% accuracy means the model learned to predict the majority class. The code is available on GitHub, and every analysis can be reproduced from raw data to final output.
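The 94% accuracy problem can be reproduced in a few lines. The sketch below uses made-up labels (94 negatives, 6 positives) and a stand-in "model" that always predicts the majority class. It is not the classifier from the tutorial, and the metrics are computed by hand so the example has no dependencies. scikit-learn's accuracy_score and recall_score compute the same quantities.

```python
# Toy labels: 94 negatives, 6 positives (hypothetical data).
y_true = [0] * 94 + [1] * 6

# A stand-in "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives the model found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.94
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy looks strong only because the classes are imbalanced. Recall on the minority class shows that the model never identifies a single positive case.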

[1] StataCorp. "Stata pricing." stata.com/order/new/edu/profplus/student-pricing/, accessed February 2026.
[2] Seabold, S. and Perktold, J. (2010). "Statsmodels: Econometric and statistical modeling with Python." Proceedings of the 9th Python in Science Conference.
[3] Pedregosa, F. et al. (2011). "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research, 12, 2825-2830.

Python vs. Licensed Software for Applied Economics

| Capability | Python (Free) | Stata ($595/yr) | ArcGIS ($1,500/yr) |
|---|---|---|---|
| OLS / IV / Panel regression | statsmodels, linearmodels | Built-in | N/A |
| Difference-in-differences | statsmodels + manual event study | Built-in + csdid | N/A |
| Machine learning (RF, XGBoost, SHAP) | scikit-learn, xgboost, shap | Limited (Stata 18 lasso only) | N/A |
| Spatial joins & analysis | GeoPandas, shapely, scipy | Limited (spmap) | Built-in |
| Census API integration | requests + custom pipeline | Manual download | Manual download |
| Transit routing (GTFS) | r5py, gtfs-kit | N/A | Network Analyst ($$$) |
| Annual cost (single user) | $0 | $595 | $1,500 |

Python Tutorials

Tutorial

How to Estimate Difference-in-Differences in Python

A statsmodels workflow for event study estimation, with the diagnostics that separate credible estimates from noise.

February 2026

Tutorial

How to Build a Census Data Pipeline That Doesn't Silently Fail

A Python workflow for pulling ACS data from the Census API, with the validation checks that prevent bad data from reaching the analysis.

February 2026

Tutorial

Spatial Analysis with GeoPandas: From Joins to Autocorrelation

A spatial analysis workflow from point-to-polygon joins through spatial weights, Moran's I, and LISA cluster detection.

February 2026

Tutorial

How to Interpret a Classifier with SHAP Values

A Python workflow for understanding what drives model predictions, and what SHAP importance actually measures.

February 2026

Tutorial

How to Build a Classifier When 94% Accuracy Means Nothing

A scikit-learn workflow for imbalanced classification, with the evaluation metrics that actually matter.

February 2026

Tutorial

How to Validate GTFS Feeds Before They Break the Routing Engine

A Python workflow for catching the transit data problems that structural checks miss.

February 2026

Tutorial

How to Calculate 2.7M Transit Routes for Free

Step-by-step guide to r5py, GTFS data, and multimodal accessibility analysis.

November 2025

Data Collection & Validation

6,613 Stores, $147, Zero Lost Data

Building resilient data pipelines that handle API failures, rate limits, and edge cases.

November 2025

EBT Verification Methodology

Cross-validating SNAP retailer data against multiple authoritative sources.

October 2025

400 Labels to 94% Accuracy

Building and validating a grocery store classifier through iterative labeling.

October 2025

91% of "Grocery Stores" Aren't Really Groceries

How we classified 25,000 stores without setting foot in one, and what we found about food environment quality in California.

October 2025

Spatial & Geographic Methods

Residualized Accessibility Index

Separating transit access from confounding factors using regression residuals.

November 2025

Why County Rankings Mislead: Policy vs Context

Merced County's vulnerability index is 2.3x higher than San Francisco's. Before drawing policy conclusions, we need to understand what that number measures.

July 2025

Crime Geography: 22x Variation Within One County

Swap county averages for neighborhood data and a 22-fold range in crime rates emerges.

July 2025

Causal Inference & Evaluation

Understanding the Limits of Parallel Trends Tests

Why a high p-value on parallel trends tests can mislead, and how sensitivity analysis reveals fragile causal claims.

September 2025

Scaling Up: From 7 Counties to 9,039 Tracts Statewide

Expanding from 2,000 to 9,039 census tracts reveals what scales linearly and what requires adaptation.

September 2025

Key takeaways

  • Python replaces $2,000+/year in licensed software for econometrics, spatial analysis, and machine learning with free, open-source alternatives.
  • Every tutorial solves a real research problem encountered during applied economics work, with edge cases and validation steps documented.
  • All code is publicly available on GitHub with data and documentation for full reproducibility.
  • Stata and R knowledge transfers: tutorials assume familiarity with econometric concepts and explain Python equivalents of common operations.

Frequently Asked Questions

What Python version do the tutorials use?

All tutorials use Python 3.11+ with standard data science libraries. Each tutorial lists exact package versions in its requirements file.

Can I follow these if I only know Stata or R?

Yes. The tutorials assume familiarity with econometric concepts but not Python syntax. Each explains the Python equivalent of common Stata or R operations.

Where can I download the replication data?

Every tutorial links to a public GitHub repository. The main replication repository contains 18 research projects with complete documentation.

How are these different from documentation?

Each tutorial solves a real research problem. They document what went wrong, what edge cases surfaced, and what validation steps caught errors that simpler approaches would miss.