How do we know an AI's estimator does what we meant?

Victoria Cholette

June 2026

Part of the AI for Applied Researchers series · Step 4: Quality assurance

When we rebuild an estimator from a paper or its original package in Python, the code can run without error, return a plausible coefficient, and still be wrong. Four moves catch it: name the low-visibility choices in a spec, plant a known truth in a simulation, break the easy symmetries, and read the code against the source.

A reimplementation can clear every check we usually trust and still be wrong: it imports, it runs, it returns a coefficient in a plausible range. That is true whether the first draft came from us, a colleague, or an AI assistant. A clean run shows only that the code executed, not that the implementation matches the estimator we meant to use.

That distinction matters more now because AI assistants make a runnable reimplementation fast to generate. They hand back a few hundred lines that import cleanly and return numbers, then fail on the part the instruction left underspecified. Faster drafting does not change what verification requires: checking whether the estimand, weighting, sample construction, and indexing match the source definition.

In one rolling difference-in-differences (DiD) exercise, a reimplementation that used the wrong within-unit transformation returned an effect of about 2.03 against a planted true effect of 1.0, roughly double the truth.² The code ran without error and the coefficient looked reportable. Only a planted-truth simulation exposed the problem: averaged over draws from a process with a known effect of 1.0, the wrong transformation recovered about 2.03 while the correct one recovered 1.0. The failure had nothing to do with whether the code ran and everything to do with whether the implementation matched the estimator the design required.

The companion piece is about the rolling DiD method itself and the small-N policy settings, those with one treated unit and a few controls, that it is built for.¹ This piece is about the verification routine that makes a reimplementation trustworthy.

Name the failure points in the spec

A hand-built implementation usually gets the headline regression right. The mistakes concentrate in the lower-visibility decisions around it, and that is what the spec is for.

In an estimator like this one, the failure points are predictable:

the estimand: which exact quantity is estimated, and how it is aggregated;
the comparison set: which units count as controls in each cohort-time comparison;
the weighting: how cohort-time effects are combined into an event-time path;
the transformation window: which periods are used to remove levels, trends, or seasonality;
the indexing: how event time is aligned to each cohort’s treatment date.

A loose instruction such as “aggregate cohort-time effects to event time” leaves room for error; a usable spec says exactly how. In the rolling DiD paper, the event-time aggregation is a cohort-size-weighted average across cohort-time cells that share a common relative time r = t - g.¹
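Written as an equation, that sentence leaves no room for a silent default. The notation below is assumed for illustration and is not taken from the paper: N_g is the number of treated units in cohort g, \hat{\tau}_{g,t} is the estimated effect for cohort g in period t, and \mathcal{G}_r is the set of cohorts with an estimated cell at relative time r.

$$
\hat{\theta}(r) \;=\; \frac{\sum_{g \in \mathcal{G}_r} N_g \, \hat{\tau}_{g,\,g+r}}{\sum_{g \in \mathcal{G}_r} N_g},
\qquad
\mathcal{G}_r = \{\, g : \text{cell } (g,\, g+r) \text{ is estimated} \,\}
$$

A spec in this form names the weight, the cells being pooled, and the index that aligns them, which are the three things a loose instruction leaves open.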

A precise spec is necessary but insufficient on its own: we named cohort-size weighting in it, and the implementation dropped the weight anyway.

Check against a known answer

The cleanest verification is one economists already use in Monte Carlo work. Specify a data-generating process (DGP) we control, choose a true value for the estimand, simulate data from that process, and check whether the implementation recovers the planted value across many draws.

Setting the true effect to 1.0 makes the result a direct check rather than a judgment about whether a coefficient looks plausible. The check fits in one small, reusable helper: plant a known effect, run the estimator across many simulated draws, and compare the average against the truth. It is short enough to lift into any project.

import numpy as np

def verify_estimator(estimator, simulate_dgp, true_effect, tol, reps=1000, seed=0):
    """Plant a known effect, run the estimator across many draws, and check
    whether the mean recovers the truth. Catches biasing bugs; it does NOT
    validate identification, and does not detect bugs that do not bias the
    estimate (see below)."""
    est = np.array([estimator(simulate_dgp(true_effect, np.random.default_rng(seed + i)))
                    for i in range(reps)])
    bias = est.mean() - true_effect
    return dict(mean=est.mean(), bias=bias, passed=abs(bias) <= tol)

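# the two rolling-DiD transformations, planted true effect = 1.0
# (demean_estimator, detrend_estimator, and simulate_hard are assumed to be
# imported from the reproduction package; they are not defined in this snippet)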
negative_control = verify_estimator(demean_estimator, simulate_hard, 1.0, tol=0.10)
candidate = verify_estimator(detrend_estimator, simulate_hard, 1.0, tol=0.10)

# Unexpected statuses stop the process with a nonzero exit.
assert not negative_control["passed"], "negative control unexpectedly passed"
assert candidate["passed"], f"candidate failed planted-truth gate: {candidate}"

Run against a hard design with diverging unit trends, the two transformations part ways. The demean transformation, which removes each unit's level but not its trend, recovered about 2.03 against the planted 1.0, biased upward by roughly four sampling standard deviations. The correct unit-specific detrend recovered 1.0.²

Transformation	Recovered (planted effect = 1.0)	What the check showed
Demean (wrong for diverging trends)	2.03	biased about 4 SDs high; fails the check
Unit-specific detrend (correct)	1.00	recovers the planted effect; passes
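
The mechanism behind that table can be seen in a toy panel small enough to read in full. The sketch below is not the package's simulate_hard design or its estimators: simulate_toy_panel, toy_demean_estimator, toy_detrend_estimator, and every parameter value are hypothetical, chosen only so that one treated unit trends faster than its controls.

```python
import numpy as np

def simulate_toy_panel(true_effect, rng, slope_gap=0.1, noise_sd=0.1,
                       n_controls=5, n_periods=10, g=6):
    """Hypothetical DGP: unit 0 is treated from period g on and trends
    faster than the controls by slope_gap per period."""
    t = np.arange(n_periods)
    n_units = n_controls + 1
    levels = rng.normal(0.0, 1.0, size=n_units)
    slopes = np.full(n_units, 0.3)
    slopes[0] += slope_gap
    noise = rng.normal(0.0, noise_sd, size=(n_units, n_periods))
    y = levels[:, None] + slopes[:, None] * t + noise
    y[0, t >= g] += true_effect
    return {"y": y, "t": t, "g": g}

def _post_gap(panel, transform):
    """Transform each unit using its own pre-period data, then compare
    post-period means: treated unit minus the average control."""
    y, t, g = panel["y"], panel["t"], panel["g"]
    pre, post = t < g, t >= g
    post_means = np.array([transform(row, t, pre)[post].mean() for row in y])
    return post_means[0] - post_means[1:].mean()

def _demean(row, t, pre):
    # Removes the unit's pre-period level only.
    return row - row[pre].mean()

def _detrend(row, t, pre):
    # Removes a unit-specific linear trend fitted on pre-period data.
    slope, intercept = np.polyfit(t[pre], row[pre], 1)
    return row - (intercept + slope * t)

def toy_demean_estimator(panel):
    return _post_gap(panel, _demean)

def toy_detrend_estimator(panel):
    return _post_gap(panel, _detrend)

def summarize(estimator, true_effect=1.0, reps=500, seed=0):
    """Average the estimator over seeded draws and express bias in sampling SDs."""
    draws = np.array([
        estimator(simulate_toy_panel(true_effect, np.random.default_rng(seed + i)))
        for i in range(reps)
    ])
    bias = draws.mean() - true_effect
    return {"mean": draws.mean(), "bias": bias,
            "bias_in_sds": bias / draws.std(ddof=1)}

for name, est in [("demean", toy_demean_estimator),
                  ("detrend", toy_detrend_estimator)]:
    print(name, summarize(est))
```

The figures this prints belong to the toy design only; they are not the 2.03 and 1.00 from the package. Reporting bias in sampling standard deviations, as the table does, helps separate a real bias from Monte Carlo noise.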

The logic extends past this estimator. Wherever we can write down a DGP and plant a true value for the target parameter, we can test whether the implementation recovers it. A clean run that misses the planted value is a failed run; a clean run that hits it is the start of trust, not the end of it.

Where the simulation check is not enough

A known-truth simulation cannot catch every error, and knowing where it stops matters as much as running it. It catches only errors that bias the estimate on the process we chose to simulate. Three kinds of error fall outside that reach.

The first is the process itself. The demean check above fails only because the simulated design has diverging unit trends; on a design without them, demean recovers 1.0 and passes. The check is only as strong as the process we plant, and it does not catch a bug the chosen DGP never exercises.
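
A short derivation shows why the process decides the outcome. Assume, for illustration only, a single treated unit, linear unit trends, and outcomes y_{it} = a_i + b_i t + \tau D_i 1[t \ge g] + \varepsilon_{it} with mean-zero noise; this is a simplification, not the package's DGP. Subtracting each unit's pre-period mean and comparing post-period averages between the treated unit and its controls gives:

$$
\mathbb{E}\big[\hat{\tau}_{\text{demean}}\big]
\;=\; \tau \;+\; \big(b_{\text{treated}} - \bar{b}_{\text{control}}\big)\,\big(\bar{t}_{\text{post}} - \bar{t}_{\text{pre}}\big)
$$

Here \bar{t}_{\text{pre}} and \bar{t}_{\text{post}} are the mean period indices of the pre and post windows. The second term vanishes when treated and control slopes match, so under these assumptions a planted-truth check run on parallel trends cannot tell demeaning from detrending.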

The second is a bug that need not bias the estimate in expectation. In the same exercise, an input-ordering error changed shared random-number state before the simulated panel was drawn. On the reference seed, the faithful buggy pipeline returned 0.18 and the correct pipeline returned 1.13 against the planted 1.0. The public package reproduces the bug faithfully on that one seed only. For its 2,000-seed comparison, it uses a faster raw-draw proxy that shifts the random-number stream rather than rerunning the full randomization-inference pipeline. The proxy produces statistically indistinguishable means, which supports the narrower point that a stream shift can change a draw without moving the expected estimate. What exposes the actual execution-order defect is reading the source and asserting that the panel passed to the estimator is the panel that was generated.²
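
One way to make that assertion mechanical is to fingerprint the panel when it is generated and compare the fingerprint when it is estimated. The sketch below is a hypothetical illustration, not the package's pipeline: run_pipeline, fingerprint, the panel dimensions, and the permutation used as a stand-in for a randomization-inference step are all assumptions. It also shows one possible remedy, giving each stage its own stream with NumPy's SeedSequence.spawn.

```python
import hashlib
import numpy as np

N_UNITS, N_PERIODS = 6, 10  # hypothetical panel dimensions

def fingerprint(panel):
    """Stable hash of a panel's bytes, for comparing panels across stages."""
    return hashlib.sha256(np.ascontiguousarray(panel).tobytes()).hexdigest()

def run_pipeline(seed, inference_first, isolate_streams):
    """Return the fingerprint of the panel the estimator would receive."""
    if isolate_streams:
        # Independent child streams: one for the panel, one for inference.
        panel_ss, infer_ss = np.random.SeedSequence(seed).spawn(2)
        panel_rng = np.random.default_rng(panel_ss)
        infer_rng = np.random.default_rng(infer_ss)
    else:
        # Shared state: any earlier draw shifts every later one.
        panel_rng = infer_rng = np.random.default_rng(seed)

    if inference_first:
        infer_rng.permutation(N_UNITS)  # stand-in for an inference shuffle
        panel = panel_rng.normal(size=(N_UNITS, N_PERIODS))
    else:
        panel = panel_rng.normal(size=(N_UNITS, N_PERIODS))
        infer_rng.permutation(N_UNITS)
    return fingerprint(panel)

# With shared state, execution order changes the panel that gets estimated.
shared_differs = run_pipeline(0, True, False) != run_pipeline(0, False, False)
# With isolated streams, the panel is the same under either order.
isolated_matches = run_pipeline(0, True, True) == run_pipeline(0, False, True)
print(shared_differs, isolated_matches)
```

A mean over many seeds would not flag the shared-state case, because each shifted draw is still a valid draw from the same process. The fingerprint comparison flags it on a single seed.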

The third is symmetry. Make the test data too symmetric, and a whole class of mistakes passes unnoticed.

Take a dropped cohort-size weight. When every cohort holds the same number of treated units, a cohort-size-weighted mean and a plain mean are algebraically identical, so symmetric test data cannot expose the missing weight. Unequal cohort sizes are the needed negative control. The current pinned package does not include that unequal-cohort experiment or a saved weighting output, so this page treats it as a test-design rule rather than a reproduced numerical result.

That points to a two-part fix. First, build simulation data that break the convenient symmetries: unequal cohort sizes, unbalanced timing, missing cells. Second, read the implementation against its source on every weight, aggregation rule, and index. Some errors appear only in the code review, not in the simulation output.
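
The symmetry argument can be checked on a handful of invented cohort-time cells. This is a sketch of the test-design rule, not a reproduced result from the package: aggregate_event_time, make_cells, the cohort sizes, and the cohort effects are all hypothetical.

```python
from collections import defaultdict

def aggregate_event_time(cells, weighted=True):
    """cells: iterable of (g, t, n_g, tau) tuples.
    Returns {r: average effect at relative time r = t - g}, weighted by
    cohort size n_g when weighted=True and unweighted otherwise."""
    num, den = defaultdict(float), defaultdict(float)
    for g, t, n_g, tau in cells:
        w = float(n_g) if weighted else 1.0
        num[t - g] += w * tau
        den[t - g] += w
    return {r: num[r] / den[r] for r in sorted(num)}

def make_cells(cohort_sizes, cohort_effects, horizons=(0, 1)):
    """Build cohort-time cells with a constant effect per cohort."""
    return [(g, g + r, n_g, cohort_effects[g])
            for g, n_g in cohort_sizes.items() for r in horizons]

effects = {3: 1.0, 5: 2.0}                        # effect differs by cohort
symmetric = make_cells({3: 10, 5: 10}, effects)   # equal cohort sizes
asymmetric = make_cells({3: 30, 5: 10}, effects)  # unequal cohort sizes

for label, cells in [("symmetric", symmetric), ("asymmetric", asymmetric)]:
    print(label,
          aggregate_event_time(cells, weighted=True),
          aggregate_event_time(cells, weighted=False))
```

With equal cohort sizes both versions return 1.5 at every relative time, so a test on that data passes whether or not the weight is applied. With sizes 30 and 10 the weighted average is 1.25 and the unweighted one stays at 1.5, which is the gap a negative control needs.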

The same review applies to labels and metadata. A cohort count, sample label, or figure annotation should be asserted against the object that generated it. The current pinned release checks estimator statuses and finite outputs; it does not include a public cohort-label test, so this page makes no reproduced claim for one.
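
A label check of that kind can be very small. The sketch below is hypothetical and is not part of the pinned release: cohort_label, assert_label_matches, and the first_treated mapping are invented names, and the label format is only an example.

```python
def cohort_label(first_treated):
    """Build the annotation from the data instead of typing it by hand.
    first_treated maps unit id -> first treated period (None if never treated)."""
    cohorts = {g for g in first_treated.values() if g is not None}
    n_treated = sum(g is not None for g in first_treated.values())
    return f"{len(cohorts)} cohorts, {n_treated} treated units"

def assert_label_matches(label, first_treated):
    """Fail loudly when a stored label disagrees with the data behind it."""
    expected = cohort_label(first_treated)
    if label != expected:
        raise AssertionError(
            f"label {label!r} does not match data: expected {expected!r}"
        )

first_treated = {"a": 2019, "b": 2019, "c": 2021, "d": None}
assert_label_matches(cohort_label(first_treated), first_treated)
```

Generating the label from the object removes the chance of a stale hand-typed count; the assertion covers labels that were stored earlier or arrive from another step.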

The routine

Stripped of the rolling DiD example, the routine is short enough to reuse:

1. Write down the estimand from the paper or canonical implementation before reading the generated code.
2. In the spec, name the low-visibility choices implementations get wrong: comparison set, weighting, transformation window, indexing.
3. Build a controlled DGP and plant the true value of the parameter.
4. Run the implementation across many draws and require it to recover that value.
5. Break the easy symmetries in the test data so weighting and aggregation mistakes become visible.
6. Read the implementation against the source on each weight, aggregation step, index, and any label that asserts a fact about the data.

The first two steps define the target; the rest check whether the implementation reaches it. The verification standard comes from the estimator and the source definition, not from the fact that the code ran.

The reproduction is public and rerunnable. The generic verify_estimator helper and the worked check that produced the numbers above are in the pinned public package, validate-ai-econometric-code.² Clone it, plant an effect, and the check reports whether the estimator recovers it. The release assertions require the intentionally broken negative control to fail and the candidate implementation to pass; an unexpected status exits nonzero.

When a reimplementation is worth the cost

This routine is the cost of rebuilding an estimator outside its original software, and it is worth paying only when the reimplementation gives us something the canonical package cannot. A reimplementation is worth the cost when:

the analysis has to run in an open or Python-based workflow;
the estimator needs to sit inside a larger reproducible pipeline;
the specification we need does not exist in a maintained package and would have to be built anyway.

When a maintained package already covers the specification, the result is one-off, and nothing downstream needs the new language, the package is the better choice. A reimplementation adds the full verification burden; left unpaid, the new code is worse than the original, carrying the same appearance of precision with less assurance behind it.

AI assistants change only one side of this calculation. They lower the cost of producing a first draft. They do not lower the cost of verification; if anything they raise it, because a draft that runs cleanly makes omissions easier to miss.

Closing

The question is whether the implementation matches the estimator and the design we claim to be using, not whether the code runs. That has always been true in econometrics; AI assistants only make it more urgent.

The routine travels: it applies to any estimator, in any language, to code written by us, by colleagues, or by an assistant. The tool drafts the code; meeting the estimator's standard is our work.

Notes

1. Cholette, V. (2026, June 13). When a policy reaches only a few units: rolling difference-in-differences (lwdid). Too Early To Say. https://tooearlytosay.com/research/methodology/lwdid-rolling-difference-in-differences/
2. Numbers are from seeded, rerunnable reproduction code (pinned public package, validate-ai-econometric-code). Against a planted true effect of 1.0, the demean transformation recovers a mean of 2.03 (bias +1.03, about four sampling standard deviations) and the correct detrend recovers 1.00; the correct estimator's nominal 95% confidence interval has empirical coverage of 0.956. The faithful input-ordering bug returns 0.18 on the reported seed while the correct pipeline returns 1.13. For speed, the package's 2,000-seed comparison uses a raw-draw proxy for the random-number stream consumed by randomization inference rather than rerunning the faithful pipeline. The proxy orderings have statistically indistinguishable means (Welch p = 0.62). The package does not contain a public unequal-cohort weighting experiment or cohort-label result.

Cite this article

Cholette, V. (2026, June 17). How do we know an AI's estimator does what we meant? Too Early To Say. https://tooearlytosay.com/research/methodology/validate-ai-econometric-code/
