Can I follow these tutorials if I only know Stata or R?

Yes. The tutorials assume familiarity with econometric concepts but not Python syntax. Each article explains the Python equivalent of common Stata or R operations, making the transition from licensed to open-source tools practical.

How are the tutorials different from documentation?

Each tutorial solves a real research problem encountered during our applied economics work. They document what went wrong, what edge cases surfaced, and what validation steps caught errors that simpler approaches would miss.

Open-Source Methods

Where can I download the replication data?

Every tutorial links to a public GitHub repository containing all data and scripts needed to reproduce the analysis. The main replication repository at github.com/dphdame/tooearlytosay-analysis contains 18 research projects.

Python tutorials for researchers transitioning from licensed software. The same econometrics, spatial analysis, and machine learning, implemented with free tools and reproducible code.

Why Python for Applied Economics

Most applied economists learn Stata in graduate school. It works, it is well-documented, and departments have site licenses. The problem surfaces after graduation: a Stata/SE license costs $595 per year [1], research assistants need their own copies, and collaborators at different institutions may not have access at all.

Python eliminates these barriers. pandas handles the same data manipulation as Stata's generate and replace commands. statsmodels runs the same regressions [2]. scikit-learn and XGBoost provide machine learning tools that Stata lacks entirely [3]. GeoPandas opens up spatial analysis that would otherwise require an ArcGIS license at $1,500 per year.
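As a rough illustration of the generate and replace mapping, here is a minimal sketch in pandas. The data frame, column names, and thresholds are hypothetical and are not taken from any of the tutorials; the Stata lines in the comments are approximate equivalents, not a line-by-line translation guide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "income": [42000.0, 58000.0, -9.0, 75000.0],
    "hh_size": [2, 4, 3, 1],
})

# Stata: replace income = . if income < 0
# Negative codes are treated here as missing-value sentinels (an assumption).
df.loc[df["income"] < 0, "income"] = np.nan

# Stata: generate income_pc = income / hh_size
df["income_pc"] = df["income"] / df["hh_size"]

# Stata: generate high_income = income > 50000 if !missing(income)
# In pandas, NaN > 50000 evaluates to False, so the indicator is masked
# explicitly to keep it missing wherever income is missing.
df["high_income"] = (
    (df["income"] > 50000).astype("float").where(df["income"].notna())
)

print(df)
```

One difference worth checking when porting code: Stata treats a missing value as larger than any number, so an unguarded `income > 50000` is true for missing income in Stata but false in pandas. Masking the indicator explicitly, as above, gives the same answer in both.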

The transition is not painless. Python requires more boilerplate, error messages are less informative than Stata's, and the ecosystem is fragmented across dozens of packages. These tutorials exist because we went through the transition ourselves and documented every edge case, silent failure, and workaround along the way.

Each tutorial solves a real problem from our research: how to validate GTFS feeds before they crash the routing engine, how to catch the census tracts that spatial joins silently drop, how to build a classifier when 94% accuracy means the model learned to predict the majority class. The code is available on GitHub, and every analysis can be reproduced from raw data to final output.
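The majority-class trap mentioned above can be checked in a few lines. The sketch below uses only the standard library and made-up labels (94 "other" and 6 "grocery", chosen purely for illustration, not drawn from the tutorial data). It compares a model's accuracy against the accuracy of always predicting the most common label, and reports recall on the minority class.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


def recall(labels, predictions, positive):
    """Share of true `positive` cases that the predictions recover."""
    hits = sum(
        1 for y, p in zip(labels, predictions) if y == positive and p == positive
    )
    total = sum(1 for y in labels if y == positive)
    return hits / total


# Hypothetical labels: a heavily imbalanced two-class problem.
labels = ["other"] * 94 + ["grocery"] * 6

# A degenerate "model" that has only learned the majority class.
predictions = ["other"] * len(labels)

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
baseline = majority_baseline_accuracy(labels)
grocery_recall = recall(labels, predictions, positive="grocery")

print(f"accuracy={accuracy:.2f} baseline={baseline:.2f} recall={grocery_recall:.2f}")
```

When accuracy equals the baseline and minority-class recall is zero, the headline accuracy says nothing about whether the classifier can find the class of interest. Reporting per-class recall alongside accuracy is one simple guard against this.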

[1] StataCorp. "Stata pricing." stata.com/order/new/edu/profplus/student-pricing/, accessed February 2026.
[2] Seabold, S. and Perktold, J. (2010). "Statsmodels: Econometric and statistical modeling with Python." Proceedings of the 9th Python in Science Conference.
[3] Pedregosa, F. et al. (2011). "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research, 12, 2825-2830.

Python vs. Licensed Software for Applied Economics

Capability	Python (Free)	Stata ($595/yr)	ArcGIS ($1,500/yr)
OLS / IV / Panel regression	statsmodels, linearmodels	Built-in	N/A
Difference-in-differences	statsmodels + manual event study	Built-in + csdid	N/A
Machine learning (RF, XGBoost, SHAP)	scikit-learn, xgboost, shap	Limited (Stata 18 lasso only)	N/A
Spatial joins & analysis	GeoPandas, shapely, scipy	Limited (spmap)	Built-in
Census API integration	requests + custom pipeline	Manual download	Manual download
Transit routing (GTFS)	r5py, gtfs-kit	N/A	Network Analyst ($$$)
Annual cost (single user)	$0	$595	$1,500

November 2025

Data Collection & Validation

6,613 Stores, $147, Zero Lost Data

Building resilient data pipelines that handle API failures, rate limits, and edge cases.

November 2025

EBT Verification Methodology

Cross-validating SNAP retailer data against multiple authoritative sources.

October 2025

400 Labels to 94% Accuracy

Building and validating a grocery store classifier through iterative labeling.

October 2025

91% of "Grocery Stores" Aren't Really Groceries

How we classified 25,000 stores without setting foot in one, and what we found about food environment quality in California.

October 2025

Spatial & Geographic Methods

Residualized Accessibility Index

Separating transit access from confounding factors using regression residuals.

November 2025

Why County Rankings Mislead: Policy vs Context

Merced County's vulnerability index is 2.3x higher than San Francisco's. Before drawing policy conclusions, we need to understand what that number measures.

July 2025

Crime Geography: 22x Variation Within One County

Swap county averages for neighborhood data and a 22-fold range in crime rates emerges.

July 2025

Causal Inference & Evaluation

Understanding the Limits of Parallel Trends Tests

Why a high p-value on parallel trends tests can mislead, and how sensitivity analysis reveals fragile causal claims.

September 2025

Scaling Up: From 7 Counties to 9,039 Tracts Statewide

Expanding from 2,000 to 9,039 census tracts reveals what scales linearly and what requires adaptation.

September 2025

Key takeaways

Python replaces $2,000+/year in licensed software for econometrics, spatial analysis, and machine learning with free, open-source alternatives.
Every tutorial solves a real research problem encountered during applied economics work, with edge cases and validation steps documented.
All code is publicly available on GitHub with data and documentation for full reproducibility.
Stata and R knowledge transfers: tutorials assume familiarity with econometric concepts and explain Python equivalents of common operations.

Frequently Asked Questions

What Python version do the tutorials use?

All tutorials use Python 3.11+ with standard data science libraries. Each tutorial lists exact package versions in its requirements file.
