Before reaching for difference-in-differences (DiD), the first question is whether it is the right tool for the comparison in front of us, and where its conclusions stop being reliable.
Four worked analyses anchor that question. Each includes open code and data, with the COVID-19 income analysis as a running example. The analyst owns the estimation. The agent drafts the implementation and runs the specified diagnostics.
This is written for applied researchers who know the DiD formula but want a clearer rule for when to reach for it. A familiar failure mode: a two-group, before-and-after comparison is easy to run and easy to over-trust. The design choice that determines whether the estimate has a causal interpretation often gets made by habit rather than explicitly.
DiD supports a causal interpretation only if the comparison group represents what the treated group would have done in the absence of treatment. Each assumption below tests that claim, and each failure mode shows how it can break.
When DiD is the right tool
DiD answers a specific question: how much did an outcome change for a treated group, relative to a comparison group, over the same period?
The method shifts the burden from requiring the groups to be similar in levels to requiring that they would have moved in parallel absent treatment. This is a weaker and more defensible condition, and parallel movement before treatment is the observable evidence for it.
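As a minimal sketch of that point, the two-group, two-period calculation can be written out by hand. The cell means below are hypothetical and do not come from any of the cases. The helper name `did_2x2` is also an illustration rather than part of any library.

```python
# Hypothetical cell means (illustrative only, not from any case in this article).
means = {
    ("treated", "pre"): 40.0,
    ("treated", "post"): 46.0,
    ("control", "pre"): 30.0,
    ("control", "post"): 33.0,
}

def did_2x2(m):
    """Change in the treated group minus change in the comparison group."""
    treated_change = m[("treated", "post")] - m[("treated", "pre")]
    control_change = m[("control", "post")] - m[("control", "pre")]
    return treated_change - control_change

estimate = did_2x2(means)

# Shift the treated group's level by a constant in both periods.
# Only changes over time enter the calculation, so the estimate is unchanged.
shifted = dict(means)
shifted[("treated", "pre")] += 100.0
shifted[("treated", "post")] += 100.0

print(estimate)          # 3.0
print(did_2x2(shifted))  # 3.0
```

The groups start 10 units apart in the first version and 110 apart in the second, yet the estimate is the same. Level differences cancel, and what remains to defend is the claim about trends.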
In the COVID-19 income analysis, the shock affected all counties, but labor markets adjusted at different rates. The design compares income changes in high-inequality metros to those in low-inequality areas over the same period, and we read that difference as the differential COVID effect.
DiD fits when three conditions hold:
- The treatment timing is clearly defined for a specific group.
- A valid untreated or later-treated comparison group is available.
- The groups track each other in the pre-treatment period.
When any of these fail, the sections below describe what goes wrong.
The assumptions as decisions
Each DiD assumption is a decision made before trusting the estimate.
Parallel trends. The comparison group stands in for the treated group’s untreated path. The empirical check is whether pre-treatment trends move together. The harder question is sensitivity: how much would small deviations change the estimate?
No anticipation. The treated group does not change behavior in advance of treatment. If a policy is announced well before implementation, the “pre” period may already reflect treatment effects.
Stable composition. The groups represent the same underlying units over time, rather than a changing mix. In the COVID income example, units are defined at the labor market level, so the estimate reflects changes in place-level income distributions, not individual income trajectories.
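The parallel-trends check can be sketched for the simplest setting: two groups and a common treatment date. In that setting, each event-study coefficient is the treated-minus-comparison gap in a period, measured against the gap in a reference period. The group means below are hypothetical. Using the last pre-treatment period as the reference is a common convention, assumed here rather than taken from the cases.

```python
import numpy as np

# Hypothetical group means by event time (illustrative values only).
event_time = np.array([-3, -2, -1, 0, 1, 2])
treated_mean = np.array([10.1, 10.5, 11.0, 13.0, 13.5, 14.0])
control_mean = np.array([8.0, 8.5, 9.0, 9.5, 10.0, 10.5])

reference = -1  # assumed reference period: the last pre-treatment period

gap = treated_mean - control_mean
ref_gap = gap[event_time == reference][0]
coef = gap - ref_gap  # event-study coefficients relative to the reference period

pre = coef[event_time < reference]  # leads: should sit near zero
post = coef[event_time >= 0]        # lags: estimated dynamic effects

for k, c in zip(event_time, coef):
    print(f"event time {int(k):+d}: {c:+.2f}")

# The size of the largest lead is the natural yardstick for sensitivity analysis.
largest_pre_deviation = np.max(np.abs(pre))
```

A regression-based event study adds standard errors and covariates. The point estimates in this balanced two-group case follow the same arithmetic.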
When it breaks
Below are the main ways DiD stops being credible, each shown in a Too Early To Say case.
Weak power in pre-trends tests
A high p-value in a parallel-trends test does not rule out meaningful violations. In the SNAP sensitivity case, pre-treatment event-study coefficients are statistically indistinguishable from zero, yet the main effect remains significant only under small deviations from parallel trends.
Rambachan-Roth sensitivity bounds quantify this fragility. They scale allowable post-treatment violations relative to the largest pre-treatment deviation and report the breakdown point where the effect loses significance. In this case, the breakdown value is M = 0.35, meaning the effect stays significant only if post-treatment violations are no more than 35 percent as large as the largest pre-treatment deviation. The breakdown value carries more information than the p-value alone.
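The logic of a breakdown value can be illustrated with a deliberately simplified calculation. This sketch is not the Rambachan-Roth procedure. It treats the largest pre-treatment deviation as known without error and widens a conventional confidence interval by M times that deviation. The full method bounds period-to-period changes and accounts for sampling uncertainty in the pre-period estimates. The function names and inputs are hypothetical, and the inputs are not the SNAP estimates.

```python
def robust_interval(beta_hat, se, max_pre_deviation, m, z=1.96):
    """Conventional interval widened by m * max_pre_deviation on each side."""
    half_width = z * se + m * max_pre_deviation
    return beta_hat - half_width, beta_hat + half_width

def breakdown_m(beta_hat, se, max_pre_deviation, z=1.96):
    """Largest m for which the widened interval still excludes zero.

    Simplified illustration: the pre-period deviation is treated as a known
    constant, which the full sensitivity analysis does not assume.
    """
    if max_pre_deviation <= 0:
        raise ValueError("max_pre_deviation must be positive")
    slack = abs(beta_hat) - z * se  # distance from the conventional CI to zero
    return max(slack, 0.0) / max_pre_deviation

# Hypothetical inputs for illustration only.
beta_hat, se, max_pre = 0.08, 0.02, 0.05
m_star = breakdown_m(beta_hat, se, max_pre)
lower, upper = robust_interval(beta_hat, se, max_pre, m_star)

print(round(m_star, 3))  # 0.816
```

At `m_star`, the lower end of the widened interval touches zero. For real analyses, the reference implementation is the HonestDiD R package, not this shortcut.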
A second case shows the same issue from the failing side. In the SNAP BBCE replication, a two-way fixed effects model on a 51-state panel estimates a +1.37 percentage point effect on take-up, a 3.35 percent increase over a 0.410 baseline, with p = 0.18. The event study fails the parallel-trends check on one pre-treatment lead. Rather than discarding the design, the analysis asks what remains once that violation is taken seriously.
Staggered adoption and small treated samples
When treatment timing varies across units, standard two-way fixed effects can produce biased estimates because it uses already-treated units as controls for newly treated ones, which matters whenever effects change over time or differ across cohorts. When the treated group is very small, cluster-robust standard errors tend to understate uncertainty.
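A small noise-free example makes the staggered-adoption problem concrete. Everything in this sketch is an assumption chosen for illustration: two equal-sized cohorts (one row each), no never-treated group, and an effect that grows by one unit per period after adoption. The two-way fixed effects coefficient is computed by double-demeaning, which is exact for a balanced panel.

```python
import numpy as np

periods = np.arange(6)
adopt = np.array([2, 4])  # early cohort adopts at t=2, late cohort at t=4

# Treatment indicator: rows are cohorts, columns are periods.
D = (periods[None, :] >= adopt[:, None]).astype(float)

# Hypothetical dynamic effect: 1 in the adoption period, growing by 1 each period.
event_time = periods[None, :] - adopt[:, None]
effect = np.where(D == 1, event_time + 1, 0).astype(float)

# Outcome = unit effect + common time effect + treatment effect (no noise).
unit_fe = np.array([[5.0], [3.0]])
time_fe = 0.5 * periods[None, :]
y = unit_fe + time_fe + effect

def double_demean(x):
    """Remove unit and time fixed effects from a balanced panel."""
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

d_tilde = double_demean(D)
y_tilde = double_demean(y)
twfe = (d_tilde * y_tilde).sum() / (d_tilde ** 2).sum()

true_att = effect[D == 1].mean()  # average effect over treated observations

print(f"TWFE estimate: {twfe:.3f}")   # 0.500
print(f"True ATT:      {true_att:.3f}")  # 2.167
```

There is no noise and parallel trends holds exactly, yet the coefficient lands far below the true average effect. The gap comes entirely from using the early cohort, whose effect is still growing, as a control for the late cohort.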
In the rolling DiD case, both issues appear. A rolling estimator (lwdid) recovers a reasonable effect with one treated unit and three controls, while a standard specification overstates the effect by a factor of two. Unit-specific detrending brings the estimate back in line.
With so few clusters, inference shifts away from asymptotic cluster-robust methods toward exact t-tests, HC3 corrections, and randomization inference. The key step is matching the estimator to the treatment structure before interpreting any coefficient.
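Randomization inference with one treated unit and three controls can be sketched in a few lines. Each unit is reassigned in turn as the "treated" one, and the observed estimate is ranked among the placebo estimates. The pre-to-post changes below are hypothetical, and the helper `did_if_treated` is illustrative rather than part of lwdid or any other package.

```python
import numpy as np

# Hypothetical pre-to-post changes for four units; unit 0 is the treated one.
change = np.array([3.0, 1.0, 1.2, 0.8])
treated_index = 0

def did_if_treated(i, change):
    """DiD estimate if unit i were the single treated unit."""
    others = np.delete(change, i)
    return change[i] - others.mean()

observed = did_if_treated(treated_index, change)
placebo = np.array([did_if_treated(i, change) for i in range(len(change))])

# Share of assignments at least as extreme as the observed one (observed included).
# The small tolerance guards against floating-point ties.
p_value = np.mean(np.abs(placebo) >= np.abs(observed) - 1e-12)

print(f"observed DiD: {observed:.2f}")
print(f"randomization p-value: {p_value:.2f}")  # 0.25
```

With four units, the smallest attainable p-value is 1/4, no matter how large the effect is. That floor is an honest statement of how little the design can say, which asymptotic cluster-robust standard errors would hide.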
Correct design, incorrect implementation
A valid design can still produce incorrect results through implementation errors. The Python implementation case walks through a statsmodels event-study workflow and highlights three common pitfalls:
- A post-treatment indicator that is collinear with year fixed effects and dropped from the model without any error.
- Standard errors clustered at the wrong level.
- Interaction terms built using the wrong reference period.
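The first pitfall can be caught before estimation with a rank check on the design matrix. The sketch below uses NumPy alone rather than the statsmodels workflow from the case, and the panel dimensions and treatment year are hypothetical.

```python
import numpy as np

# Hypothetical panel: 6 years, 4 rows per year.
years = np.repeat(np.arange(2015, 2021), 4)
post = (years >= 2018).astype(float)  # hypothetical treatment year

# Intercept plus year dummies (first year omitted as the base category).
year_levels = np.unique(years)[1:]
year_dummies = (years[:, None] == year_levels[None, :]).astype(float)
intercept = np.ones((len(years), 1))

X = np.hstack([intercept, year_dummies, post[:, None]])

rank = np.linalg.matrix_rank(X)
n_cols = X.shape[1]
print(f"columns: {n_cols}, rank: {rank}")  # columns: 7, rank: 6
if rank < n_cols:
    print("Design matrix is rank deficient: 'post' is a sum of year dummies.")
```

The post indicator is the sum of the 2018 to 2020 dummies, so it adds no information once year fixed effects are included. The treated-by-post interaction remains identified because it varies across groups within a year.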
A simple safeguard is to test the code on simulated data with a known effect and confirm that the implementation recovers it before using real data.
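A planted-effect check might look like the following sketch. It assumes a balanced panel with a common treatment date, and it uses NumPy least squares in place of the statsmodels workflow so the example stays dependency-light. All sizes, the seed, and the tolerance are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods, first_post = 400, 10, 5
true_effect = 2.0  # the planted effect

unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
treated = (unit < n_units // 2).astype(float)  # first half of units treated
post = (period >= first_post).astype(float)

# Outcome = unit effect + time trend + planted effect + noise.
unit_fe = rng.normal(0, 1, n_units)[unit]
time_fe = 0.3 * period
noise = rng.normal(0, 1, unit.size)
y = unit_fe + time_fe + true_effect * treated * post + noise

# OLS with an interaction term; the last coefficient is the DiD estimate.
X = np.column_stack([np.ones_like(y), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
did_hat = coef[3]

print(f"planted: {true_effect:.2f}, recovered: {did_hat:.2f}")
assert abs(did_hat - true_effect) < 0.3, "failed to recover the planted effect"
```

The same harness can wrap the real estimation function. Swap the least-squares call for the production code path and keep the assertion, so a silently dropped term or a wrong reference period fails loudly on data where the answer is known.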
A decision table
One rule per situation. Read across: the case in front of us, the tool to use, and the check that determines whether the result is credible.
| If the situation is | Use | And check |
|---|---|---|
| One treated group, one comparison group, and a known treatment date | Canonical two-way fixed-effects DiD | That pre-treatment trends run in parallel |
| The parallel-trends test passes and the stakes are high | DiD with Rambachan-Roth sensitivity bounds | The breakdown value M, not the p-value alone (SNAP case: M = 0.35) |
| The pre-trends test fails on a lead | The same design, re-examined rather than discarded | Whether the violation is isolated or systematic (SNAP BBCE: +1.37 percentage points remains after one failing lead) |
| Units adopt treatment at different times | A heterogeneity-robust estimator (Callaway-Sant’Anna or rolling lwdid) | The Goodman-Bacon decomposition for forbidden comparisons |
| The treated set includes only a few units | Rolling DiD with small-sample inference | Exact t-tests, HC3, or randomization inference, not cluster-robust asymptotics |
| Moving from a textbook formula to implemented code | A statsmodels or linearmodels workflow | That a planted-effect simulation is recovered before applying the code to real data |
The discipline
Across all four cases, the same standard applies: state the assumption as a decision, validate the implementation with simulation, report the breakdown value alongside the estimate, and match the estimator to the treatment rollout. Each case links to open code and data, so the underlying decisions can be rerun and challenged.
- The parallel-trends sensitivity case shows when a passing pre-trends test still leaves a fragile result, using Rambachan-Roth breakdown bounds.
- The SNAP BBCE TWFE-DiD replication shows what remains when the parallel-trends test fails on one lead.
- The rolling DiD (lwdid) case shows how to estimate effects when treatment reaches only a few units or adoption is staggered.
- The difference-in-differences in Python walkthrough shows the statsmodels implementation and the diagnostics that separate signal from noise.
Cite this article
Cholette, V. (2026, July 2). Using difference-in-differences in practice. Too Early To Say. https://tooearlytosay.com/research/methodology/when-to-use-did/