When we rebuild an estimator from a paper or its original package in Python, the code can run without error, return a plausible coefficient, and still be wrong. That is true whether the first draft came from us, a colleague, or an AI assistant. A clean run shows only that the code executed, not that the implementation matches the estimator we meant to use.
That distinction matters more now because AI assistants make a reimplementation that runs fast to generate. They hand back a few hundred lines that import cleanly and return numbers, and that can still fail on the part the instruction left underspecified. Faster drafting does not change what verification requires: checking whether the estimand, weighting, sample construction, and indexing match the source definition.
In one rolling difference-in-differences (DiD) exercise, an assistant-written reimplementation returned 0.18 where the planted true effect was 1.0 and the correct estimate was 1.13 (note 2). That is the kind of coefficient an analyst could summarize in a memo as “no meaningful effect.” It was also wrong by construction. The failure had nothing to do with whether the code ran and everything to do with whether the implementation matched the estimator.
The companion piece is about the rolling DiD method itself and the small-N policy settings, those with one treated unit and a few controls, that it is built for (note 1). This piece is about the verification routine that makes a reimplementation trustworthy.
Name the failure points in the spec
A hand-built implementation usually gets the headline regression right. The mistakes sit in the lower-visibility decisions around it, and that is what the spec is for.
In an estimator like this one, the failure points are predictable:
- the estimand: which exact quantity is estimated, and how it is aggregated;
- the comparison set: which units count as controls in each cohort-time comparison;
- the weighting: how cohort-time effects are combined into an event-time path;
- the transformation window: which periods are used to remove levels, trends, or seasonality;
- the indexing: how event time is aligned to each cohort’s treatment date.
A loose instruction such as “aggregate cohort-time effects to event time” leaves room for error; a usable spec says exactly how. In the rolling DiD paper, the event-time aggregation is a cohort-size-weighted average across cohort-time cells that share a common relative time r = t - g (note 1).
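To make that aggregation rule concrete, here is a minimal sketch in Python with NumPy. The function name, argument names, and the toy numbers are ours for illustration and are not taken from the paper or from the lwdid package; the only thing carried over from the text is the rule itself, a cohort-size-weighted average over cells that share r = t - g.

```python
import numpy as np


def event_time_path(g, t, tau, n_g):
    """Cohort-size-weighted average of cohort-time effects at each r = t - g.

    g, t : cohort (first treated period) and calendar period of each cell
    tau  : estimated effect in each cohort-time cell
    n_g  : number of treated units in the cell's cohort (the weight)
    Returns a dict mapping event time r to the aggregated effect.
    All names here are hypothetical, chosen for this illustration.
    """
    g, t, tau, n_g = map(np.asarray, (g, t, tau, n_g))
    r = t - g  # event time relative to each cohort's treatment date
    return {
        int(k): float(np.average(tau[r == k], weights=n_g[r == k]))
        for k in np.unique(r)
    }


# Two cohorts: one treated unit starting in period 3, three starting in period 4.
path = event_time_path(
    g=[3, 3, 4, 4],
    t=[3, 4, 4, 5],
    tau=[1.0, 2.0, 3.0, 4.0],
    n_g=[1, 1, 3, 3],
)
print(path)  # {0: 2.5, 1: 3.5}; a plain mean would give {0: 2.0, 1: 3.0}
```

Writing the rule as code before reading a generated implementation gives us a small reference to compare against: any implementation that returns 2.0 and 3.0 on this input has averaged where the spec weights.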
A precise spec is necessary but insufficient on its own: we named cohort-size weighting in it, and the implementation dropped the weight anyway. The spec sets the target; verification checks whether the code reached it.
Check against a known answer
The cleanest verification is one economists already use in Monte Carlo work. Specify a data-generating process (DGP) we control, choose a true value for the estimand, simulate data from that process, and check whether the implementation recovers the planted value across many draws.
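A stripped-down version of that check might look like the sketch below. The DGP (unit levels, common time shocks, one treated unit, five controls) and the simple pre-period-demeaned DiD are stand-ins we chose for illustration; they are not the rolling DiD estimator or the simulation design behind the numbers in this piece. The point is the shape of the test: plant a value, draw many panels, and require the implementation to recover it within Monte Carlo error.

```python
import numpy as np

TRUE_EFFECT = 1.0  # planted value the implementation must recover


def simulate_panel(rng, n_controls=5, n_periods=10, first_treated=6):
    """Toy DGP (an assumption): unit levels + common shocks + noise.
    Unit 0 is treated from period `first_treated` onward."""
    n_units = n_controls + 1
    levels = rng.normal(0.0, 2.0, size=(n_units, 1))
    shocks = rng.normal(0.0, 1.0, size=(1, n_periods))
    y = levels + shocks + rng.normal(0.0, 0.5, size=(n_units, n_periods))
    post = np.arange(n_periods) >= first_treated
    y[0, post] += TRUE_EFFECT
    return y, post


def did_estimate(y, post):
    """Stand-in estimator: remove each unit's pre-period mean, then compare
    the treated unit's post-period average with the controls' average."""
    demeaned = y - y[:, ~post].mean(axis=1, keepdims=True)
    return demeaned[0, post].mean() - demeaned[1:, post].mean()


rng = np.random.default_rng(0)
draws = np.array([did_estimate(*simulate_panel(rng)) for _ in range(2000)])
mc_se = draws.std(ddof=1) / np.sqrt(len(draws))

# The check itself: the average estimate must sit within Monte Carlo
# error of the planted value, not merely look plausible.
assert abs(draws.mean() - TRUE_EFFECT) < 4 * mc_se
```

The tolerance is expressed in Monte Carlo standard errors rather than as a fixed number, so the test tightens as the number of draws grows. To test a real implementation, `did_estimate` would be replaced by a call into the code under review.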
Setting the true effect to 1.0 makes the result a direct check rather than a judgment about whether a coefficient looks plausible. Against that known 1.0, the correct implementation recovered 1.13 and the broken one returned 0.18 (note 2). The broken result was wrong by construction, and a clean run did not reveal it.
The defect was an input-ordering error: the estimator ran correctly on the wrong simulated panel, so nothing crashed and no warning appeared. Read on its own, 0.18 looks like a plausible null. Read against the planted truth, it is a failed implementation.
The logic extends past this estimator. Wherever we can write down a DGP and plant a true value for the target parameter, we can test whether the implementation recovers it. A clean run that misses the planted value is a failed run; a clean run that hits it is the start of trust, not the end of it.
Where the simulation check is not enough
A known-truth simulation catches only the errors that change the estimate on the test data we happened to build. If those data are too symmetric, a whole class of mistakes passes unnoticed.
Take the dropped weight. When every cohort holds the same number of treated units, a cohort-size-weighted mean and a plain mean are identical, so the unweighted code passes the planted-truth check exactly. In the saved output, that is what happened: with equal cohorts the weighted and unweighted event-time paths matched at every event time. Rebuilt with unequal cohorts, the two diverged, by as much as 0.14 at the longest horizon (note 2).
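The blind spot is easy to reproduce on a small made-up example. In the sketch below, the effect values and cohort sizes are hypothetical and unrelated to the saved reproduction output; they only illustrate why equal cohorts hide a dropped weight and unequal cohorts expose it.

```python
import numpy as np


def aggregate(tau, n_g, weighted):
    """Average a (cohorts x event times) array of effects across cohorts,
    using cohort sizes as weights or, if weighted=False, a plain mean."""
    weights = n_g if weighted else np.ones_like(n_g)
    return np.average(tau, axis=0, weights=weights)


# Hypothetical cohort-by-event-time effects that differ across cohorts.
tau = np.array([
    [1.0, 1.2, 1.4],  # cohort A at r = 0, 1, 2
    [0.6, 0.7, 0.8],  # cohort B
    [0.2, 0.2, 0.2],  # cohort C
])

equal = np.array([4, 4, 4])    # same number of treated units per cohort
unequal = np.array([8, 3, 1])  # symmetry broken

gap_equal = np.abs(
    aggregate(tau, equal, True) - aggregate(tau, equal, False)
).max()
gap_unequal = np.abs(
    aggregate(tau, unequal, True) - aggregate(tau, unequal, False)
).max()

print(gap_equal)    # zero up to floating-point rounding
print(gap_unequal)  # clearly nonzero: the dropped weight is now visible
```

Unequal cohort sizes are not sufficient on their own. If the cell effects were identical across cohorts, every set of weights would give the same average, so the test data also need effects that vary by cohort.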
That points to a two-part fix. First, build simulation data that break the convenient symmetries: unequal cohort sizes, unbalanced timing, missing cells. Second, read the implementation against its source on every weight, aggregation rule, and index. Some errors appear only in the code review, not in the simulation output.
That was the case here. A line-by-line comparison against the paper exposed the missing weight at once, because the paper weights and the implementation averaged. A separate defect was only a false label: a run marked “4 cohorts” was built on a three-cohort panel. It changed no estimate, and it was wrong on every figure that carried the label.
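Label errors of this kind are cheap to guard against if the label is derived from the panel, or checked against it, instead of typed by hand. A minimal sketch, assuming a per-unit vector of first-treated periods in which 0 marks never-treated units (the coding and the function name are our assumptions):

```python
import numpy as np


def checked_cohort_count(cohort_of_unit, claimed):
    """Return the number of distinct treated cohorts in the panel, raising
    if it differs from the count a run name or figure label claims."""
    cohorts = np.asarray(cohort_of_unit)
    # Assumed coding: 0 means never treated, so it is not a cohort.
    actual = len(np.unique(cohorts[cohorts > 0]))
    if actual != claimed:
        raise ValueError(f"label claims {claimed} cohorts, panel has {actual}")
    return actual


# First treated period per unit; three controls and three treated cohorts.
cohort_of_unit = [0, 0, 0, 5, 5, 7, 9]

# Build the label from the data so it cannot drift from the panel.
label = f"{checked_cohort_count(cohort_of_unit, claimed=3)} cohorts"
print(label)  # 3 cohorts
```

Calling the same function with `claimed=4` on this panel raises a `ValueError`, which is the behaviour we want from a run that would otherwise be marked “4 cohorts” without complaint.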
The routine
Stripped of the rolling DiD example, the routine is short enough to reuse:
- Write down the estimand from the paper or canonical implementation before reading the generated code.
- In the spec, name the low-visibility choices implementations get wrong: comparison set, weighting, transformation window, indexing.
- Build a controlled DGP and plant the true value of the parameter.
- Run the implementation across many draws and require it to recover that value.
- Break the easy symmetries in the test data so weighting and aggregation mistakes become visible.
- Read the implementation against the source on each weight, aggregation step, index, and any label that asserts a fact about the data.
The first two steps define the target; the rest check whether the implementation reaches it. The verification standard comes from the estimator and the source definition, not from the fact that the code ran.
When a reimplementation is worth the cost
This routine is the cost of rebuilding an estimator outside its original software, and it is worth paying only when the reimplementation gives us something the canonical package cannot. A reimplementation is worth the cost when:
- the analysis has to run in an open or Python-based workflow;
- the estimator needs to sit inside a larger reproducible pipeline;
- the specification we need does not exist in a maintained package and would have to be built anyway.
When a maintained package already covers the specification, the result is one-off, and nothing downstream needs the new language, the package is the better choice. A reimplementation adds the full verification burden; left unpaid, the new code is worse than the original, carrying the same appearance of precision with less assurance behind it.
AI assistants change only one side of this calculation: they lower the cost of producing a first draft. They do not lower the cost of verification; if anything, they raise it by making omissions easier to miss.
Closing
The question is whether the implementation matches the estimator and the design we claim to be using, not whether the code runs. That has always been true in econometrics; AI assistants make it more immediate, because they cut the cost of a plausible first draft without cutting the cost of verification.
The routine is portable: write down the estimand, name the low-visibility choices, plant a known truth in a DGP, break the convenient symmetries, and read the implementation against the source. It applies to any estimator, in any language, to code written by us, by colleagues, or by an assistant. The tool drafts the code; the standard it has to meet comes from the estimator, and meeting that standard is our work.
Notes
- Cholette, V. (2026, June 13). When a policy reaches only a few units: rolling difference-in-differences (lwdid). Too Early To Say. https://tooearlytosay.com/research/methodology/lwdid-rolling-difference-in-differences/
- Numbers are from the saved reproduction output. Against a planted true effect of 1.0, the corrected code returns a detrend estimate of 1.13 and the input-ordering-error version returns 0.18 (results_hard_buggy.csv vs. results_hard_correct.csv). For the weighting check (weighting_blindspot.csv): on the equal-cohort panel the weighted and unweighted event-time paths are identical (maximum absolute difference 0.000); on the unequal-cohort panel they diverge, with a maximum absolute difference of 0.139 at event time r = 6 (weighted 0.54, unweighted 0.68).
Cite this article
Cholette, V. (2026, June 17). How do we know an AI's estimator does what we meant? Too Early To Say. https://tooearlytosay.com/research/methodology/validate-ai-econometric-code/