How to Estimate Difference-in-Differences in Python

A statsmodels workflow for event study estimation, with the diagnostics that separate credible estimates from noise.

COVID-19 hit every county, but it didn't hit every labor market the same way. When we want to estimate how income shifted across different types of communities (high-inequality metros versus low-inequality rural areas, for instance), we need a research design that separates the COVID effect from pre-existing trends. Difference-in-differences (DiD) does exactly this, if the assumptions hold. And that "if" carries a lot of weight. The parallel trends assumption is the backbone of any DiD design, and violating it quietly turns our estimates into noise dressed up as findings.

Several packages handle DiD estimation: linearmodels offers PanelOLS with absorbed fixed effects, the R package fixest (accessible via rpy2) provides fast two-way fixed effects with Sun-Abraham corrections, and csdid implements the Callaway-Sant'Anna estimator for staggered designs. Here we use statsmodels for its transparency and accessibility, so that each step of the estimation is visible and modifiable.

One structural point worth noting upfront: ACS microdata is a repeated cross-section, not a true panel. We observe different individuals each year, identified only by their labor market area (LMA). So our coefficients capture how the income distribution within an LMA shifted over time, not how particular people's incomes changed. This distinction matters throughout the workflow.

Let's walk through a complete statsmodels workflow: from constructing interaction terms to clustering standard errors, with the diagnostics that tell us whether our estimates are worth trusting.

Before we can run those diagnostics, we need the right estimation infrastructure. Here is the tool stack this workflow relies on.


Tool Stack: statsmodels, WLS, and ACS Microdata

Component        Tool                    Purpose
Estimation       statsmodels WLS         Weighted least squares with fixed effects
Standard errors  get_robustcov_results   Cluster-robust SEs at LMA level
Data             ACS microdata           Person-year repeated cross-section, 2018–2023
Weights          perwt                   Census person weights

The American Community Survey provides person-level observations across six years, giving us a repeated cross-section at the labor market area (LMA) level. Each person carries a census weight (perwt) that makes our estimates representative of the population, but those weights need careful handling in estimation, as we'll see.


Step 1: Construct Year × Treatment Interaction Terms

The core of an event study DiD is a set of interaction terms between year indicators and treatment group membership. We need to exclude one reference year (here, 2019, the last pre-treatment period) so that all coefficients are interpreted relative to that baseline.

Why 2019 instead of 2018? The convention is to use the period immediately before treatment as the reference. This makes pre-treatment coefficients (here, 2018 alone) into a direct test of parallel trends: if the 2018 coefficient is statistically indistinguishable from zero, the treated and control groups were on similar trajectories before COVID arrived. One pre-treatment period is minimal for this test. More is better, and with ACS data starting in 2018, it's what we have.

# Create year dummies, excluding 2019 as reference year.
# Each dummy == 1 for observations in that year, 0 otherwise.
# 2019 is omitted so all coefficients measure change relative to the last pre-COVID year.
for year in [2018, 2020, 2021, 2022, 2023]:
    df[f'year_{year}'] = (df['year'] == year).astype(int)

Step 2: Define Treatment Groups

Our treatment isn't binary. Labor market areas fall into typologies based on two dimensions: income level and income inequality. This gives us four groups:

  • HI-HI: High income, high inequality
  • HI-LI: High income, low inequality
  • LI-HI: Low income, high inequality
  • LI-LI: Low income, low inequality (reference group)

The logic behind this grouping: COVID-era policies like expanded SNAP benefits, enhanced unemployment insurance, and stimulus payments landed differently depending on a community's pre-existing income structure. High-inequality areas may have had larger populations eligible for transfers, while high-income areas may have had stronger labor market recoveries. By interacting these typologies with year indicators, we can disentangle how each combination of income level and inequality experienced the shock differently from the LI-LI reference group.

# Create treatment group indicators
df['hi_hi'] = ((df['income_level'] == 'high') & (df['inequality_level'] == 'high')).astype(int)
df['hi_li'] = ((df['income_level'] == 'high') & (df['inequality_level'] == 'low')).astype(int)
df['li_hi'] = ((df['income_level'] == 'low') & (df['inequality_level'] == 'high')).astype(int)

# Generate interaction terms: year × treatment group
for year in [2018, 2020, 2021, 2022, 2023]:
    for group in ['hi_hi', 'hi_li', 'li_hi']:
        df[f'{group}_x_{year}'] = df[group] * df[f'year_{year}']

This produces 5 years × 3 treatment groups = 15 interaction terms per outcome. The 3 treatment group main effects are absorbed into the LMA fixed effects in practice, so they don't appear as separate coefficients. Across three outcomes (labor income, transfer income, and total income), that's 15 interaction coefficients per outcome, or 45 event study coefficients total.
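As a sanity check on this bookkeeping, we can enumerate the interaction column names from Steps 1 and 2 and count them:

```python
# Quick bookkeeping check, using the naming convention from Steps 1 and 2
years = [2018, 2020, 2021, 2022, 2023]
groups = ['hi_hi', 'hi_li', 'li_hi']
interaction_cols = [f'{g}_x_{y}' for g in groups for y in years]

print(len(interaction_cols))       # 15 interaction terms per outcome
print(len(interaction_cols) * 3)   # 45 coefficients across three outcomes
```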


Step 3: Set Up Weighted Least Squares

Census person weights ensure our estimates reflect the actual population rather than the sample alone. But raw perwt values can be enormous (some exceed 100), which creates numerical instability in matrix operations. Normalizing to mean 1 preserves the relative weighting while keeping the math well-behaved.

import statsmodels.api as sm
from statsmodels.regression.linear_model import WLS

# Normalize weights to mean 1
weights = df['perwt'].values / df['perwt'].mean()

# Interaction columns built in Steps 1 and 2
interaction_cols = [f'{g}_x_{y}'
                    for g in ['hi_hi', 'hi_li', 'li_hi']
                    for y in [2018, 2020, 2021, 2022, 2023]]

# Define outcome and design matrix (fe_cols is constructed in Step 4)
y = df['labor_income'].values
X = df[interaction_cols + fe_cols].values  # interactions + fixed effects
X = sm.add_constant(X)

model = WLS(y, X, weights=weights)

Here's the subtlety: WLS in statsmodels multiplies each observation by the square root of its weight. So if we pass raw weights that range from 1 to 300, we're effectively telling the estimator that some observations are 300 times more informative than others. Normalization preserves relative influence across observations while preventing the weight magnitudes from causing floating-point issues during estimation.


Step 4: Estimate with Year Fixed Effects

Year fixed effects absorb any shocks common to all LMAs in a given year: national recessions, federal stimulus, inflation. Without them, our interaction terms would conflate the treatment-specific effect with the aggregate time trend.

One trap to watch for: if we include both year fixed effects and a post-treatment dummy (e.g., post_2020 = 1 for years >= 2020), the post dummy is a linear combination of the year dummies. The model will either drop a column silently or throw a singular matrix error. The event study specification, using individual year × treatment interactions, already captures the post-treatment effect year by year, so a separate post dummy is redundant.
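A toy rank check makes the trap concrete: stack five year dummies next to a post-2020 indicator over six years of data, and the matrix has fewer independent columns than columns.

```python
import numpy as np
import pandas as pd

# Toy demonstration of the collinearity trap described above
years = pd.Series([2018, 2019, 2020, 2021, 2022, 2023])
D = pd.DataFrame({f'year_{y}': (years == y).astype(int)
                  for y in [2018, 2020, 2021, 2022, 2023]})
D['post_2020'] = (years >= 2020).astype(int)

# post_2020 == year_2020 + year_2021 + year_2022 + year_2023,
# so the matrix is rank-deficient.
print(np.linalg.matrix_rank(D.values), D.shape[1])  # 5 6
```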

# Year fixed effects (2019 absorbed as reference)
for year in [2018, 2020, 2021, 2022, 2023]:
    df[f'fe_year_{year}'] = (df['year'] == year).astype(int)

# LMA fixed effects absorb time-invariant group differences
lma_dummies = pd.get_dummies(df['lma_code'], prefix='lma', drop_first=True, dtype=int)
df = pd.concat([df, lma_dummies], axis=1)

# Combine into design matrix
fe_cols = [c for c in df.columns if c.startswith('fe_year_') or c.startswith('lma_')]
X = df[interaction_cols + fe_cols]
X = sm.add_constant(X)

results = WLS(y, X.values, weights=weights).fit()

Step 5: Apply Cluster-Robust Standard Errors

Observations within the same LMA are correlated across years. Standard OLS errors assume independence and will be too small, which means we'll reject null hypotheses we shouldn't. Clustering at the LMA level accounts for this within-group correlation.

results_robust = results.get_robustcov_results(
    cov_type='cluster',
    groups=df['lma_code'].values
)

But cluster-robust standard errors have their own assumption: we need enough clusters for the asymptotic approximation to work. The usual rule of thumb is 30–50 clusters minimum. If our LMA typology produces only, say, 15 clusters in the HI-HI group, the cluster-robust SEs may be unreliable. In that case, HC1 (heteroskedasticity-robust but unclustered) is a defensible fallback, with the caveat that it ignores within-LMA serial correlation.

# Fallback if cluster count is low
if df['lma_code'].nunique() < 30:
    results_robust = results.get_robustcov_results(cov_type='HC1')
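The overall cluster count can also mask thin cells: 40 LMAs in total may still mean only a handful in one typology. A quick per-group count flags this (sketched on a toy frame; the real check would run on the analysis df):

```python
import pandas as pd

# Count distinct clusters within each treatment cell before trusting
# clustered SEs. Toy frame for illustration.
toy = pd.DataFrame({
    'lma_code': [101, 101, 102, 201, 201, 202, 301, 301],
    'typology': ['HI-HI', 'HI-HI', 'HI-HI',
                 'LI-LI', 'LI-LI', 'LI-LI',
                 'HI-LI', 'HI-LI'],
})
clusters_per_group = toy.groupby('typology')['lma_code'].nunique()
print(clusters_per_group.to_dict())  # {'HI-HI': 2, 'HI-LI': 1, 'LI-LI': 2}
```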

To get a sense of what these estimates might look like in practice: if the interaction term coefficient for, say, hi_hi_x_2020 comes out to -0.03 with p < 0.05 (with log labor income as the outcome), that would suggest the COVID shock was associated with roughly a 3 percent reduction in labor income in high-income, high-inequality LMAs relative to the LI-LI reference group, beyond what the common year trend explains. Coefficients in later years (2021, 2022, 2023) then trace out whether that gap persisted, widened, or closed, which is the whole point of the event study structure.
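If the outcome is in logs, reading the coefficient directly as a percent change is a small-coefficient approximation; the exact implied change is exp(β) − 1:

```python
import numpy as np

# Exact percent change implied by a log-outcome coefficient; the direct
# percent reading relies on the approximation exp(beta) - 1 ~ beta.
beta = -0.03
print(round((np.exp(beta) - 1) * 100, 2))  # -2.96
```

For coefficients this small the approximation is harmless; for larger ones (say, |β| > 0.1) the exact form is worth reporting.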


Step 6: Extract Event Study Coefficients with 95% CI

Before extracting the post-treatment coefficients, it's worth pausing on the diagnostic question: do the 2018 coefficients overlap zero? If the confidence interval for, say, hi_hi_x_2018 excludes zero, the HI-HI group was already diverging from the reference before COVID. That's a red flag for parallel trends. A small divergence may still be tolerable, though it warrants investigation. With that check in mind, we can pull out the full set of interaction coefficients and their confidence intervals to visualize the event study.

import numpy as np

coef_names = [c for c in X.columns if '_x_' in c]
coefs = results_robust.params[[X.columns.get_loc(c) for c in coef_names]]
ses = results_robust.bse[[X.columns.get_loc(c) for c in coef_names]]

# 95% confidence intervals
ci_lower = coefs - 1.96 * ses
ci_upper = coefs + 1.96 * ses

event_study_df = pd.DataFrame({
    'term': coef_names,
    'estimate': coefs,
    'se': ses,
    'ci_lower': ci_lower,
    'ci_upper': ci_upper
})
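The standard way to present these coefficients is an event study plot: point estimates with confidence bars, a zero line, and a marker at treatment onset. A standalone sketch (the estimate values here are illustrative placeholders, not real results; in the pipeline the frame comes from Step 6):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative placeholder values, not real estimates
event_study_df = pd.DataFrame({
    'term': [f'hi_hi_x_{y}' for y in [2018, 2020, 2021, 2022, 2023]],
    'estimate': [0.002, -0.031, -0.018, -0.009, -0.004],
    'ci_lower': [-0.010, -0.048, -0.033, -0.022, -0.016],
    'ci_upper': [0.014, -0.014, -0.003, 0.004, 0.008],
})

years = [int(t.split('_x_')[1]) for t in event_study_df['term']]
# errorbar expects distances from the point estimate, not CI endpoints
yerr = [event_study_df['estimate'] - event_study_df['ci_lower'],
        event_study_df['ci_upper'] - event_study_df['estimate']]

fig, ax = plt.subplots()
ax.errorbar(years, event_study_df['estimate'], yerr=yerr, fmt='o', capsize=3)
ax.axhline(0, linewidth=0.8, color='gray')
ax.axvline(2019.5, linestyle='--', linewidth=0.8)  # treatment onset
ax.set_xlabel('Year')
ax.set_ylabel('Coefficient relative to 2019')
fig.savefig('event_study_hi_hi.png')
```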

With these coefficients extracted, a natural next question is whether the estimates hold up under alternative specifications. Alternative treatment definitions, reference groups, and placebo timing offer ways to test that.


Step 7: Robustness Checks

No single specification tells the whole story. Here are a few checks that can strengthen or weaken our confidence:

# Alternative treatment definition: continuous inequality measure
df['post_2020'] = (df['year'] >= 2020).astype(int)
df['gini_x_post'] = df['gini_coefficient'] * df['post_2020']

# Alternative reference group: use HI-LI instead of LI-LI
df_alt = df[df['typology'] != 'LI-LI']
# Re-estimate with HI-LI as reference

# Placebo test: pretend treatment happened in 2019 and re-estimate
# on pre-COVID data only; a significant "effect" is a red flag
df['placebo_post'] = (df['year'] >= 2019).astype(int)
df_pre = df[df['year'] < 2020]

If estimates are sensitive to the reference group or treatment definition, that itself is a finding worth reporting. It tells us the effect depends on how we carve up the comparison.


What Can Go Wrong

Several design-level problems can undermine the entire analysis, beyond the implementation-level warnings we've already covered inline:

Unbalanced panels. If certain LMAs drop out of the sample in post-treatment years (perhaps because they're too small for ACS to reliably estimate), the missingness may correlate with treatment. This is a form of selection bias that fixed effects won't solve. Checking for balanced panels before estimation is basic yet often skipped.
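A balance check is a few lines of pandas: flag any LMA that does not appear in every year (sketched on a toy frame; the real check would use df[['lma_code', 'year']]).

```python
import pandas as pd

# Flag LMAs missing from any year of the sample
toy = pd.DataFrame({
    'lma_code': [1, 1, 1, 2, 2, 3, 3, 3],
    'year':     [2018, 2019, 2020, 2018, 2019, 2018, 2019, 2020],
})
n_years = toy['year'].nunique()
years_per_lma = toy.groupby('lma_code')['year'].nunique()
unbalanced = years_per_lma[years_per_lma < n_years].index.tolist()
print(unbalanced)  # [2]  <- LMA 2 drops out of 2020
```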

Pre-trends tests are necessary but not sufficient. A non-significant 2018 coefficient does not prove parallel trends; it could reflect low power rather than true equality. With only one pre-treatment period, the test is especially weak. Reporting the point estimate and confidence interval matters more than the p-value here.

Staggered treatment timing. Standard two-way fixed effects (TWFE) DiD can produce biased estimates when treatment turns on at different times for different units. If some LMAs experienced COVID-related economic disruption earlier than others (plausible given regional variation in shutdown timing), TWFE implicitly uses already-treated units as controls and can place negative weights on some group-time comparisons, contaminating the estimate. Callaway-Sant'Anna or Sun-Abraham corrections address this, though they require explicitly defined treatment cohorts, more structure than a single common COVID onset in ACS repeated cross-sections provides.


When This Design Works Well

Not every policy question fits this framework. DiD with event study estimation tends to work well when the data structure supports its core assumptions:

  • Treatment timing is sharp and common across treated units (COVID onset in early 2020)
  • A credible control group exists (LMAs with different structural characteristics but similar pre-trends)
  • Panel data spans enough pre-treatment periods to test parallel trends
  • The outcome responds to treatment within the observation window

When It's Less Suitable

  • Treatment is continuous or gradual rather than discrete
  • No plausible control group shares pre-treatment trends
  • Anticipation effects exist (units change behavior before treatment)
  • Sample is too small for reliable clustering

When the standard statsmodels approach does not fit the data structure, particularly with staggered treatment timing or very large panels, the alternative packages mentioned at the top of this article can help.


Alternatives Worth Considering

Package         Language    Strength
linearmodels    Python      Built-in panel data models, absorbed fixed effects
csdid           Python/R    Callaway-Sant'Anna estimator for staggered DiD
did             R           Comprehensive toolkit for heterogeneous treatment effects
fixest          R           Fast TWFE with Sun-Abraham corrections built in

For projects where staggered treatment timing is a concern, csdid implements the Callaway and Sant'Anna (2021) estimator, which avoids the negative weighting problem in standard TWFE. The linearmodels package offers PanelOLS with absorbed fixed effects, which is more memory-efficient than manually creating dummy variables for large panels.


Limitations

The single pre-treatment year (2018) limits our ability to test parallel trends rigorously. More pre-treatment periods would strengthen the design considerably.

LMA-level treatment assignment means we're working with a relatively small number of clusters, which affects standard error reliability as discussed in Step 5.


Code Availability

The full estimation pipeline is implemented in two scripts:

  • 08_event_study.py (467 lines): Event study DiD with WLS, cluster-robust SEs, and visualization
  • 09_triple_diff.py (409 lines): Triple-difference extension incorporating a third dimension of heterogeneity

Methodological Notes

[1] LMA typologies are constructed from ACS 5-year estimates of median household income and Gini coefficients, classified at the median split. Alternative classifications (terciles, continuous measures) serve as robustness checks.

[2] Person weights (perwt) from IPUMS reflect the Census Bureau's sampling design and post-stratification adjustments. Using unweighted estimates would over-represent areas where ACS sampling rates are higher relative to population.