Rolling difference-in-differences with few units: An article-reported simulation

Victoria Cholette

June 2026

An instructional guide to transformation choice and small-sample inference. The exact simulation estimates, standard errors, p-values, and Python rebuild described below are article-reported and not publicly reproduced.

Difference-in-differences (DiD) is the workhorse of policy evaluation. One place changes a policy, similar places do not, and we compare how some outcome moves in the treated place against how it moves in the others. The untreated places stand in for what would have happened to the treated place without the policy. When their trends would have stayed parallel, the comparison gives us a clean read on what the policy did.

Real policy settings strain that logic in two ways at once. The places adopt at different times, so there is no single clean before-and-after. And there are very few of them: often one treated state and a handful of comparison states. With only a handful, the two checks we lean on to defend a DiD each fail in a specific way. The first, a test that the groups were moving together before the policy, has little power when the panel is short, so it rarely catches a real violation. The second, the standard errors behind the confidence interval, rests on large-sample approximations that assume many independent groups; with three or four groups, the formulas tend to understate the true uncertainty and the interval comes out too narrow. The reported estimate looks more precise than the data can support.

A recent method takes direct aim at that corner. The rolling difference-in-differences estimator of Lee and Wooldridge, packaged as the Stata command lwdid, handles staggered adoption and carries an explicit small-sample path, with inference designed for a handful of units rather than a crowd.¹ This piece is about how to put it to work: what it estimates, the choices it asks of us, how to read what it returns, and how to check our own answer on a real few-unit policy problem.

For readers who work in Python, the article describes a small-sample procedure rebuilt from the method description in a few hundred lines. That rebuild is not publicly available, so this page cannot establish that it matches the maintained Stata command or the published estimator. The teaching goal is narrower: explain how a planted-truth simulation can expose sensitivity to transformation and inference choices before a method is applied where the true effect is unknown. The estimator’s formal properties come from the underlying papers, not from the unreleased Python rebuild.

Rolling difference-in-differences, in plain English

Start with what we are trying to estimate. For each group of units that adopts the policy at the same time, we want the average treatment effect on the treated (ATT): how that group’s outcome moved after adoption relative to units not yet treated, read group-by-group across time and then averaged into an event-time path. The design rests on two conditions: units do not anticipate treatment before adoption, and, absent the policy, treated and comparison units would have moved in parallel.

We can look at each unit’s pre-treatment periods to describe what it was doing before the policy: its average level, or its level plus a linear trend, or its level, trend, and seasonal swing. We subtract that description from the unit’s later outcomes, and what remains is the part the past did not predict. Then, in each post-treatment period, we compare that remainder for treated units against not-yet-treated units with a simple cross-sectional regression. The coefficient is the treatment effect for that period.
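The subtract-then-compare logic can be sketched in a few lines of plain Python. This is an illustrative sketch, not the lwdid command and not the article's unreleased rebuild: the function names (demean, detrend, post_period_gap) and the toy numbers are hypothetical. A treated-minus-control difference in means stands in for the cross-sectional regression, since a regression on an intercept and a single treatment dummy returns that same difference as its coefficient.

```python
from statistics import mean

def demean(y, n_pre):
    """Subtract the unit's pre-treatment mean from every period."""
    base = mean(y[:n_pre])
    return [v - base for v in y]

def detrend(y, n_pre):
    """Fit level + linear trend on pre-treatment periods only, then
    subtract the fitted (and extrapolated) line from every period."""
    t_pre = list(range(n_pre))
    t_bar, y_bar = mean(t_pre), mean(y[:n_pre])
    sxx = sum((t - t_bar) ** 2 for t in t_pre)
    slope = sum((t - t_bar) * (v - y_bar) for t, v in zip(t_pre, y[:n_pre])) / sxx
    intercept = y_bar - slope * t_bar
    return [v - (intercept + slope * t) for t, v in enumerate(y)]

def post_period_gap(panel, treated, n_pre, period, transform):
    """Treated-minus-control mean of the transformed outcome in one period."""
    resid = {u: transform(y, n_pre)[period] for u, y in panel.items()}
    treat = [r for u, r in resid.items() if u in treated]
    ctrl = [r for u, r in resid.items() if u not in treated]
    return mean(treat) - mean(ctrl)

# Hypothetical toy panel: 4 pre-treatment periods, 2 post-treatment periods.
# Unit "A" is treated from period 4 with a planted jump of +1.
panel = {
    "A": [1.0, 2.0, 3.0, 4.0, 6.0, 7.0],   # own slope of 1, then the jump
    "B": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5],   # slope 0.5, never treated
    "C": [2.0, 2.5, 3.0, 3.5, 4.0, 4.5],   # slope 0.5, never treated
}
gap_demean = post_period_gap(panel, {"A"}, n_pre=4, period=4, transform=demean)
gap_detrend = post_period_gap(panel, {"A"}, n_pre=4, period=4, transform=detrend)
print(gap_demean, gap_detrend)
```

In this noise-free toy, the treated unit runs on a steeper path than the controls, so the demeaned gap absorbs the trend difference while the detrended gap isolates the planted jump.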

Two features make this worth the trouble in small-sample policy work. The detrending choice lets each unit keep its own linear trend, which relaxes the strict parallel-trends requirement that sinks so many few-unit designs. And the same subtraction, applied to the pre-treatment periods, hands us placebo tests for free: if the method is sound, the estimated effect before the policy should sit near zero.

The method needs four columns: an identifier for each unit (here, each state); a time index (the quarter or the year); a marker for the period when each unit adopted the policy, with a zero for units that never did; and the outcome we care about. Beyond those columns, its main requirement is enough pre-treatment periods, because we can only subtract a pattern we can estimate from the pre-period. A mean asks for almost nothing; a linear trend needs at least two pre-treatment periods; a quarterly seasonal pattern needs enough quarters to estimate its four seasonal terms.
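A small sketch of that layout, with a check on how many pre-treatment periods each adopting unit contributes. Everything here is hypothetical: the unit labels, the numbers, the helper names, and the MIN_PRE thresholds, which simply encode the one-period and two-period minimums described in the text and are not taken from the lwdid documentation. The seasonal variants are left out because their exact minimum is not stated here.

```python
from collections import defaultdict

# Assumed minimum pre-treatment periods per transformation (see lead-in).
MIN_PRE = {"demean": 1, "detrend": 2}

def pre_period_counts(rows):
    """Count observed pre-treatment periods for each ever-treated unit.
    Each row is (unit, period, first_treated_period, outcome), with
    first_treated_period == 0 marking a never-treated unit."""
    counts = defaultdict(int)
    for unit, period, first_treated, _outcome in rows:
        if first_treated != 0 and period < first_treated:
            counts[unit] += 1
    return dict(counts)

def feasible_transforms(n_pre):
    """Transformations whose assumed minimum is met by n_pre periods."""
    return [name for name, need in MIN_PRE.items() if n_pre >= need]

rows = [
    ("u1", 1, 2, 5.0), ("u1", 2, 2, 6.1), ("u1", 3, 2, 6.3),   # adopts in period 2
    ("u2", 1, 3, 4.0), ("u2", 2, 3, 4.2), ("u2", 3, 3, 5.5),   # adopts in period 3
    ("u3", 1, 0, 4.8), ("u3", 2, 0, 4.9), ("u3", 3, 0, 5.0),   # never treated
]
counts = pre_period_counts(rows)
plan = {unit: feasible_transforms(n) for unit, n in counts.items()}
print(plan)
```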

An article-reported Python rebuild

Before applying the method to a real policy question, it helps to run an implementation where we already know the answer, and data for that comes in two kinds. We can simulate a panel ourselves: choose the number of treated and control units, plant a treatment effect we know, and build in the trend differences and seasonality we want to test. Or we can use a real policy panel with staggered timing and a small control pool. The difference between them is the answer key. Simulated data comes with the true effect we set, so we can score each estimate against it; real data does not, so it shows whether the method behaves on messy inputs but cannot tell us whether an estimate is correct. We start with the simulated panel for that reason, dialing the trend gap and the seasonal swing up and down to watch the estimator respond. Recovering a planted effect is a necessary implementation check, not proof that the rebuild is correct or that an estimate from real data is identified.
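A planted-truth panel of this kind takes only a few lines to build. The sketch below is not the article's simulation, whose data-generating process is not public. The design values (adoption in period 8, a base slope of 0.1, a trend gap of 0.2, a noise standard deviation of 0.1) and all function names are assumptions chosen for illustration, and seasonality is omitted to keep it short.

```python
import random
from statistics import mean

def simulate_panel(n_treated=3, n_control=3, n_periods=16, adopt=8,
                   effect=1.0, trend_gap=0.2, noise_sd=0.1, seed=0):
    """Return {unit: (is_treated, outcomes)} with a planted effect."""
    rng = random.Random(seed)
    panel = {}
    for i in range(n_treated + n_control):
        treated = i < n_treated
        level = rng.gauss(0.0, 1.0)                      # unit-specific level
        slope = 0.1 + (trend_gap if treated else 0.0)    # treated trend faster
        y = [level + slope * t
             + (effect if treated and t >= adopt else 0.0)
             + rng.gauss(0.0, noise_sd)
             for t in range(n_periods)]
        panel[i] = (treated, y)
    return panel

def residual_post_mean(y, adopt, use_trend):
    """Average post-period outcome net of the pre-period mean or fitted line."""
    pre_t = list(range(adopt))
    t_bar, y_bar = mean(pre_t), mean(y[:adopt])
    slope = 0.0
    if use_trend:
        sxx = sum((t - t_bar) ** 2 for t in pre_t)
        slope = sum((t - t_bar) * (v - y_bar)
                    for t, v in zip(pre_t, y[:adopt])) / sxx
    return mean([v - (y_bar + slope * (t - t_bar))
                 for t, v in enumerate(y) if t >= adopt])

def att(panel, adopt, use_trend):
    """Treated-minus-control gap in the transformed post-period average."""
    treat = [residual_post_mean(y, adopt, use_trend)
             for d, y in panel.values() if d]
    ctrl = [residual_post_mean(y, adopt, use_trend)
            for d, y in panel.values() if not d]
    return mean(treat) - mean(ctrl)

panel = simulate_panel()
print("demean :", round(att(panel, 8, use_trend=False), 3))
print("detrend:", round(att(panel, 8, use_trend=True), 3))
```

With noise_sd set to 0, the detrended estimate returns the planted 1.0, and the demeaned estimate equals the planted effect plus trend_gap times the distance between the post- and pre-period midpoints (0.2 × 8 = 1.6 in this toy design). Those figures describe this sketch only, not the article's reported outputs.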

The article reports four lessons from simulations with the true effect set to 1.0. The exact outputs are not publicly reproduced.

First, the article reports that transformation choice barely changes the estimate in an easy simulated case with 6 treated units, 6 controls, parallel trends, and no seasonality. It reports ATT estimates of 0.958 for demean, 0.859 for detrend, and 0.823 for detrend-plus-seasonality, with standard errors of 0.089, 0.136, and 0.169. It also reports exact-t and HC3 p-values below 0.001 and randomization-inference p-values around 0.002. Without a public run package, these values illustrate the claimed pattern but do not establish a reproduced benchmark.

Second, the article reports strong transformation sensitivity in a hard simulated case with 3 treated units, 3 controls, diverging trends, and quarterly seasonality. The reported ATT is 2.061 under demeaning, 1.127 under detrending, and 1.525 under detrending plus seasonality, against a planted effect of 1.0. The article attributes the spread to differential trends and limited pre-treatment information. That interpretation is plausible within the stated design, but neither the simulation nor the implementation is public enough to verify the exact values or diagnosis.

Third, the spread across transformations is a sensitivity diagnostic. Agreement shows that these transformations produce similar estimates in the stated simulation; it cannot rule out a violation shared by all of them. Divergence shows that the answer depends on the transformation, but additional diagnostics are needed to identify why.

Fourth, with few units, the inference mode decides how much we can claim. The article reports that both hard-case estimates use the same six units, so the exact-t uses four degrees of freedom (N minus two) either way; the contrast concerns inference, not sample size. It reports demean exact-t p = 0.001 and RI p = 0.041, then detrend exact-t p = 0.108 and RI p = 0.179. It also reports detrended pre-treatment placebo estimates of -0.014, -0.001, and -0.020 and a later upward event-time drift. These values are article-reported and not independently reproduced. Near-zero placebos would be consistent with the stated pre-treatment fit, but they would not by themselves validate the design or the rebuild.
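One reason inference is so constrained with few units is that the set of possible treatment assignments is tiny. The sketch below enumerates every assignment exactly, which is a substitution for the 2,000 random permutations the article describes, and uses a difference in means as the test statistic, which is also an assumption. The function name ri_pvalue and the input values are hypothetical.

```python
from itertools import combinations
from statistics import mean

def ri_pvalue(values, treated_idx):
    """Two-sided exact randomization p-value for a difference in means.
    Enumerates every way to choose len(treated_idx) treated units."""
    n, k = len(values), len(treated_idx)

    def gap(idx):
        chosen = set(idx)
        treat = [v for i, v in enumerate(values) if i in chosen]
        ctrl = [v for i, v in enumerate(values) if i not in chosen]
        return mean(treat) - mean(ctrl)

    observed = abs(gap(treated_idx))
    draws = [abs(gap(idx)) for idx in combinations(range(n), k)]
    # Small tolerance so floating-point ties count as "at least as extreme".
    extreme = sum(d >= observed - 1e-12 for d in draws)
    return extreme / len(draws), len(draws)

# Hypothetical transformed post-period outcomes, one value per unit.
six_units = [1.9, 2.2, 2.0, 0.1, -0.2, 0.3]    # first three are treated
p6, n6 = ri_pvalue(six_units, (0, 1, 2))

four_units = [1.5, 0.1, -0.2, 0.3]             # first one is treated
p4, n4 = ri_pvalue(four_units, (0,))

print(n6, p6)
print(n4, p4)
```

Three treated units out of six allow 20 distinct assignments, and one treated unit out of four allows only four, so in the one-treated layout no exact randomization p-value can fall below 0.25 however large the gap. In an even split under this statistic, swapping the treated and control labels reproduces the same absolute gap, which is why the six-unit example returns 2/20.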

The table below records the article's reported simulation design and outputs. It is a documentation summary, not a reproduced result table.

What each part of the analysis checks

Article-reported simulation design and results, not publicly reproduced
Part of the analysis	How we set it	What it checks	In plain terms
Panel structure	6 treated + 6 control (easy); 3 treated + 3 control (hard); 16 quarterly periods	Whether the estimator behaves with realistically small unit counts	Real small-N policy studies have only a few places; this is the setting the method is meant for
Treatment timing	One common adoption period (easy and hard); staggered cohorts adopting at three different times (event study)	Single-period logic and the cohort-by-time-to-event aggregation	Some policies start everywhere at once, others phase in across years; we test both
Known effect	True effect set to 1.0 in every design	Grading: every estimate is scored against a fixed, known answer	A planted answer can test an implementation once the code, data, and run record are available
Diverging trends	Treated places trending faster than controls (hard and staggered); same trend (easy)	Whether the method recovers the truth when parallel trends fail; isolates the demeaning bias	Treated and comparison places can be on different paths even without the policy, the case that breaks plain DiD
Seasonality	A fixed quarter-of-year pattern switched on (hard) or off (easy and staggered)	Whether seasonal swings contaminate estimates the seasonal adjustment should remove	Outcomes that rise and fall by quarter can fool the estimate unless removed first
Transformations	demean, detrend, detrendq, each fit on pre-treatment periods only	Sensitivity: the spread across the three is the diagnostic (easy 0.96/0.86/0.82; hard 2.06/1.13/1.52)	The three ways to strip out a unit’s past; if they disagree, the answer depends on a modeling choice
Inference modes	exact-t, HC3-robust, and randomization inference (2,000 permutations)	Small-sample sensitivity: hard-case demean exact-t p=0.001 vs randomization p=0.041; detrend randomization p=0.179	Three inference procedures with different assumptions. A permutation test is valid only when the proposed treatment assignments are defensibly exchangeable.
Pre-policy placebo periods	Effect estimated for periods before adoption, under detrend (about -0.01 to -0.02)	Design check: estimates near zero before the policy are consistent with the fitted pre-policy trend	Near-zero pre-policy estimates are one diagnostic, not proof that the design or rebuild is valid
Easy vs hard contrast	The same estimator on a parallel-trends panel and on a diverging-trends-plus-seasonal panel	Where the method is stable (all transforms near 1.0) versus where it breaks (demean 2.06 vs detrend 1.13)	An easy-versus-hard comparison can reveal where assumptions need closer scrutiny; the exact outputs still require reproduction

Arizona’s Medicaid freeze: small-N DiD in practice

The Arizona Medicaid enrollment freeze is the kind of setting small-sample DiD was built for. On July 8, 2011, Arizona froze new and re-enrollment in Medicaid for childless adults below 100% of the federal poverty level. The policy question is direct: when a state shuts a coverage door, who shows up at the hospital uninsured? The rolling-DiD paper measures two insurance-composition outcomes among adults aged 50 to 64 admitted with non-deferrable diagnoses, such as heart attacks and strokes: the share of hospital discharges whose primary payer is Medicaid, and the share recorded as self-pay.³ A rising self-pay share means more uninsured hospitalizations, which is consequential for both patients and the hospitals absorbing the cost.

The data come from the State Inpatient Databases (SID) within the Healthcare Cost and Utilization Project (HCUP), spanning the first quarter of 2008 through the fourth quarter of 2013. Arizona is the treated state. Arkansas, Maryland, and New Jersey serve as controls, none having contracted Medicaid comparably in the window. Discharge microdata are aggregated to the state-by-quarter level, yielding a balanced panel of four units, one treated and three control, with a common treatment date in the third quarter of 2011 and a quarterly frequency.

Four features make this hard, and each maps onto something the lwdid implementation of the rolling estimator is built to handle. First, four units is a small-sample problem. Cluster-robust standard errors and large-sample asymptotics are unreliable when the cluster count is three controls plus one treated state; this is why exact-t, HC3, and randomization inference enter the picture rather than conventional standard errors. Second, hospital discharge composition carries quarter-of-year seasonality, which the deseasonalizing options (demeanq, detrendq) are meant to strip out. Third, a standard two-way fixed-effects (TWFE) DiD event study shows differential pre-trends: Arizona’s Medicaid share was already rising relative to controls before the freeze, reflecting an earlier waiver-based expansion. That pre-existing drift biases the unadjusted DiD toward zero, and the unit-specific detrending (detrend, detrendq) is designed to remove exactly this kind of trend. Fourth, the discharge records are repeated cross-sections, yet the rolling logic still applies once outcomes are formed at the state-quarter level.

The control pool stays thin for a real reason. Few states ran a comparable Medicaid contraction in the same window, so we cannot pad the comparison group with 40 more states that happen to resemble Arizona. The single common treatment date here is also a simplification of the broader policy world; many real Medicaid changes arrive staggered across states and years, which is the case the rolling estimator is more generally written to address.¹ Arizona is the clean common-timing instance; the staggered version is where the method’s machinery is meant to scale.

Applying it to a few-unit panel: a checklist

Applying the method is not just running it once. With four units and adjustment options that each move the estimate, the answer we report depends on choices we make, so applying it well means running it several ways and seeing whether the answer holds. The checklist below is the minimum we would run on our own panel before reporting a result from it.

Pre-trust checklist: rolling DiD on a small-N policy panel

Detrending choice. Re-run with demean and with detrend, and report both. If the estimate shifts as strongly as in the article's reported hard case, from 2.06 under demeaning to 1.13 under detrending against a planted effect of 1.0, the trend assumption may be driving the answer. Those exact values remain unreproduced.

Seasonality adjustment. Compare estimates from the unadjusted transformations (demean, detrend) against their deseasonalized counterparts (demeanq, detrendq). Quarter-of-year composition swings should not be driving the effect.

Placebo adoption dates. Assign a fake treatment date in the pre-period and re-estimate. Pre-period cells should stay near zero; a large placebo effect signals a confound.

Leave-one-control-out. Drop each control state one at a time and re-estimate. With only three controls, check whether removing any single state flips the sign or the significance.

Event-time pattern. Plot the effect by event time. Look for a clean jump at adoption versus a pre-existing drift that started before the freeze.

Cross-method comparison. Re-estimate with at least one familiar staggered DiD approach (Callaway-Sant’Anna csdid, or plain TWFE), then explain any difference in the comparison groups, weighting, or identifying assumptions.

Each line is a question with a pass condition built in. A reader can lift the checklist, swap in a different panel, and run it without the surrounding text.
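The leave-one-control-out item is the easiest of these to automate. The sketch below assumes the outcomes have already been transformed and averaged over the post-period, one value per unit. The labels, the values, and the helper names are hypothetical, and a difference in means again stands in for the cross-sectional regression.

```python
from statistics import mean

def att_gap(treated_vals, control_vals):
    """Treated-minus-control difference in mean transformed outcomes."""
    return mean(treated_vals) - mean(control_vals)

def leave_one_control_out(treated_vals, controls):
    """Re-estimate the gap, dropping each control unit in turn.
    `controls` maps a control label to its transformed outcome."""
    results = {}
    for dropped in controls:
        kept = [v for name, v in controls.items() if name != dropped]
        results[dropped] = att_gap(treated_vals, kept)
    return results

treated_vals = [1.2]                               # hypothetical treated unit
controls = {"c1": 0.1, "c2": -0.3, "c3": 0.9}      # hypothetical controls

full = att_gap(treated_vals, list(controls.values()))
loo = leave_one_control_out(treated_vals, controls)

# Flag any control whose removal flips the sign of the estimate.
sign_flips = [name for name, est in loo.items() if (est > 0) != (full > 0)]
spread = max(loo.values()) - min(loo.values())

print(round(full, 3), {k: round(v, 3) for k, v in loo.items()})
print(sign_flips, round(spread, 3))
```

An empty sign_flips list with a wide spread still deserves a sentence in the write-up: with three controls, each one carries a third of the comparison.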

Applying it well

Applying the method comes down to two choices: one sets the estimate, the other sets how much certainty we can attach to it. The first is the transformation. Demean assumes treated and control places were on parallel paths; detrend lets each place keep its own trend. The article's unreproduced hard-case simulation reports a gap from 2.06 to 1.13 against a planted effect of 1.0. In an actual analysis, we make the transformation choice deliberately and report what the alternative would have produced. The second choice is the inference mode. Exact-t, HC3, and randomization inference answer different questions under different assumptions. Randomization inference is appropriate only when the treatment-assignment or permutation scheme is defensible for the observational policy setting; small N by itself does not make it valid.

The rest is discipline, and it is the discipline in the checklist. We run the transformations side by side and report the spread; we read the pre-policy placebos as one diagnostic rather than proof that the design holds; we drop each control in turn to see whether one state is carrying the result. We also watch the event-time path, because detrending extrapolates each unit’s trend forward and that extrapolation drifts at long horizons, so the latest post-policy estimates deserve the least weight.

The Arizona freeze is one example of such a question, a single state changing Medicaid rules against three controls; the same steps apply to any policy that reaches only a few places at a time.

Public-material status

No public reproduction package currently supports the reported Python simulation results. An independent check would require the rebuild source, simulation-generating process, seeds, dependency versions, run commands, and saved estimates for every transformation and inference mode. Until those materials are released, this page should be used as an instructional guide to the method and its diagnostics, not as evidence that the exact reported outputs were verified.

Notes

1. Lee, S. J., Wooldridge, J. M., & Hur, E. K. (2026, June 10). Rolling difference-in-differences estimation for small and large panels (SSRN No. 6502558). https://doi.org/10.2139/ssrn.6502558. Introduces the lwdid Stata command, available via ssc install lwdid.
2. The rolling transformation and its small-sample inference procedures are developed in Lee, S. J., & Wooldridge, J. M., “A Simple Transformation Approach to Difference-in-Differences Estimation for Panel Data,” and a companion paper on inference with small cross-sectional sample sizes; cited as Lee and Wooldridge (2026a, 2026b) in [1].
3. The Arizona Medicaid enrollment-freeze application is drawn from Hur (2026), as presented in [1]; data are from the Healthcare Cost and Utilization Project (HCUP) State Inpatient Databases.

Cite this article

Cholette, V. (2026, June 13). Rolling DiD with few units: Article-reported simulation. Too Early To Say. https://tooearlytosay.com/research/methodology/lwdid-rolling-difference-in-differences/
