Open-Source Methods
Python tutorials for researchers transitioning from licensed software. The same econometrics, spatial analysis, and machine learning, implemented with free tools and reproducible code.
Why Python for Applied Economics
Most applied economists learn Stata in graduate school. It works, it is well-documented, and departments have site licenses. The problem surfaces after graduation: a Stata/SE annual license costs $595 [1], research assistants need their own copies, and collaborators at different institutions may not have access at all.
Python removes the licensing barrier. pandas handles the same data manipulation as Stata's generate and replace commands. statsmodels runs the same regressions [2]. scikit-learn and XGBoost provide machine learning tools that Stata covers only in part [3]. GeoPandas opens up spatial analysis that would otherwise require an ArcGIS license at $1,500 per year.
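As a rough illustration of the generate/replace mapping mentioned above, here is a minimal sketch in pandas. The data and column names (income, hh_size) are hypothetical and chosen only for illustration. One difference worth noticing: Stata returns missing for ln(0) and for division by zero, while NumPy and pandas return infinities unless those cases are masked first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "income": [42000, 0, 58000, -9, 31000],
    "hh_size": [2, 1, 4, 3, 0],
})

# Stata: replace income = . if income < 0
# Series.where keeps values where the condition holds and sets NaN elsewhere.
df["income"] = df["income"].where(df["income"] >= 0)

# Stata: generate log_income = ln(income)
# Mask non-positive values first so log(0) becomes NaN instead of -inf.
df["log_income"] = np.log(df["income"].where(df["income"] > 0))

# Stata: generate income_pc = income / hh_size
# Mask zero denominators so the result is NaN instead of inf.
df["income_pc"] = df["income"] / df["hh_size"].where(df["hh_size"] > 0)

print(df)
```

Masking with where before the transformation is one way to reproduce Stata's missing-value behavior; checking the result for infinite values afterwards is another option.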
The transition is not painless. Python requires more boilerplate, error messages are less informative than Stata's, and the ecosystem is fragmented across dozens of packages. These tutorials exist because we went through the transition ourselves and documented every edge case, silent failure, and workaround along the way.
Each tutorial solves a real problem from our research: how to validate GTFS feeds before they crash the routing engine, how to catch the census tracts that spatial joins silently drop, how to build a classifier when 94% accuracy means the model learned to predict the majority class. The code is available on GitHub, and every analysis can be reproduced from raw data to final output.
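To make the majority-class problem concrete, the following sketch uses made-up labels (94 negatives, 6 positives) and a placeholder "model" that always predicts the majority class. It is written in plain Python so it runs without dependencies; the metrics are computed by hand here, and scikit-learn's recall_score and balanced_accuracy_score provide the same quantities.

```python
# Hypothetical labels: 94 majority-class cases (0) and 6 minority-class cases (1).
y_true = [0] * 94 + [1] * 6

# A placeholder "model" that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Share of cases with the given true label that were predicted as that label."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, 1)
# Balanced accuracy is the mean of the per-class recalls.
balanced_acc = (recall(y_true, y_pred, 0) + minority_recall) / 2

print(f"accuracy={acc:.2f} minority_recall={minority_recall:.2f} "
      f"balanced_accuracy={balanced_acc:.2f}")
# prints: accuracy=0.94 minority_recall=0.00 balanced_accuracy=0.50
```

The headline accuracy looks strong, yet the model never identifies a single minority-class case, which is why per-class metrics are the more informative check on imbalanced data.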
[1] StataCorp. "Stata pricing." stata.com/order/new/edu/profplus/student-pricing/, accessed February 2026.
[2] Seabold, S. and Perktold, J. (2010). "Statsmodels: Econometric and statistical modeling with Python." Proceedings of the 9th Python in Science Conference.
[3] Pedregosa, F. et al. (2011). "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research, 12, 2825-2830.
Python vs. Licensed Software for Applied Economics
| Capability | Python (Free) | Stata ($595/yr) | ArcGIS ($1,500/yr) |
|---|---|---|---|
| OLS / IV / Panel regression | statsmodels, linearmodels | Built-in | N/A |
| Difference-in-differences | statsmodels + manual event study | Built-in + csdid | N/A |
| Machine learning (RF, XGBoost, SHAP) | scikit-learn, xgboost, shap | Limited (Stata 18 lasso only) | N/A |
| Spatial joins & analysis | GeoPandas, shapely, scipy | Limited (spmap) | Built-in |
| Census API integration | requests + custom pipeline | Manual download | Manual download |
| Transit routing (GTFS) | r5py, gtfs-kit | N/A | Network Analyst ($$$) |
| Annual cost (single user) | $0 | $595 | $1,500 |
Python Tutorials
How to Estimate Difference-in-Differences in Python
A statsmodels workflow for event study estimation, with the diagnostics that separate credible estimates from noise.
How to Build a Census Data Pipeline That Doesn't Silently Fail
A Python workflow for pulling ACS data from the Census API, with the validation checks that prevent bad data from reaching the analysis.
Spatial Analysis with GeoPandas: From Joins to Autocorrelation
A spatial analysis workflow from point-to-polygon joins through spatial weights, Moran's I, and LISA cluster detection.
How to Interpret a Classifier with SHAP Values
A Python workflow for understanding what drives model predictions, and what SHAP importance actually measures.
How to Build a Classifier When 94% Accuracy Means Nothing
A scikit-learn workflow for imbalanced classification, with the evaluation metrics that actually matter.
How to Validate GTFS Feeds Before They Break the Routing Engine
A Python workflow for catching the transit data problems that structural checks miss.
How to Calculate 2.7M Transit Routes for Free
Step-by-step guide to r5py, GTFS data, and multimodal accessibility analysis.
Data Collection & Validation
6,613 Stores, $147, Zero Lost Data
Building resilient data pipelines that handle API failures, rate limits, and edge cases.
EBT Verification Methodology
Cross-validating SNAP retailer data against multiple authoritative sources.
400 Labels to 94% Accuracy
Building and validating a grocery store classifier through iterative labeling.
91% of "Grocery Stores" Aren't Really Groceries
How we classified 25,000 stores without setting foot in one, and what we found about food environment quality in California.
Spatial & Geographic Methods
Residualized Accessibility Index
Separating transit access from confounding factors using regression residuals.
Why County Rankings Mislead: Policy vs Context
Merced County's vulnerability index is 2.3x higher than San Francisco's. Before drawing policy conclusions, we need to understand what that number measures.
Crime Geography: 22x Variation Within One County
Swap county averages for neighborhood data and a 22-fold range in crime rates emerges.
Causal Inference & Evaluation
Understanding the Limits of Parallel Trends Tests
Why a high p-value on parallel trends tests can mislead, and how sensitivity analysis reveals fragile causal claims.
Scaling Up: From 7 Counties to 9,039 Tracts Statewide
Expanding from 2,000 to 9,039 census tracts reveals what scales linearly and what requires adaptation.
Key takeaways
- Free, open-source Python tools replace $2,000+/year in licensed software (Stata at $595 plus ArcGIS at $1,500) for econometrics, spatial analysis, and machine learning.
- Every tutorial solves a real research problem encountered during applied economics work, with edge cases and validation steps documented.
- All code is publicly available on GitHub with data and documentation for full reproducibility.
- Stata and R knowledge transfers: tutorials assume familiarity with econometric concepts and explain Python equivalents of common operations.
Frequently Asked Questions
What Python version do the tutorials use?
All tutorials use Python 3.11+ with standard data science libraries. Each tutorial lists exact package versions in its requirements file.
Can I follow these if I only know Stata or R?
Yes. The tutorials assume familiarity with econometric concepts but not Python syntax. Each explains the Python equivalent of common Stata or R operations.
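As one example of the kind of mapping involved, the sketch below shows a statsmodels counterpart to Stata's regress y x, vce(robust). The data are simulated and the variable names are hypothetical. Passing cov_type="HC1" requests the heteroskedasticity-robust standard errors with the same small-sample correction Stata applies by default for vce(robust).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only: y = 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": 1.0 + 2.0 * x + rng.normal(size=200)})

# Stata: regress y x, vce(robust)
result = smf.ols("y ~ x", data=df).fit(cov_type="HC1")

print(result.params)  # coefficient estimates (Intercept, x)
print(result.bse)     # robust standard errors
```

The formula interface adds the intercept automatically, much as regress does; the array-based sm.OLS interface does not, which is a common source of confusion when switching.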
Where can I download the replication data?
Every tutorial links to a public GitHub repository. The main replication repository contains 18 research projects with complete documentation.
How are these different from documentation?
Each tutorial solves a real research problem. They document what went wrong, what edge cases surfaced, and what validation steps caught errors that simpler approaches would miss.