AI Econometrics: Using AI for Code, Not for Identification

Victoria Cholette

June 2026

AI econometrics is the use of an AI assistant as an instrument inside econometric work: the assistant drafts code and runs specifications, while we set the estimand, argue the identification, and verify the code. It is not the model doing the econometrics, and it is not automated causal inference. Two worked instances show what the division of labor looks like in practice.

The term is still settling. “AI econometrics,” “AI for econometrics,” and “LLM econometrics” all name roughly the same thing, and which one wins is not decided. We use “AI econometrics” here for the workflow itself, and treat the other two as labels for the same practice. The practice matters more than the label, so it is worth defining the practice precisely before the label fixes a looser meaning in place.

The definition we work from:

AI econometrics is the use of an AI assistant as an instrument inside econometric work. The assistant drafts code, reimplements an estimator from its description, and runs specifications on command. We make the judgments that decide whether an estimate means anything: setting the estimand, arguing the identification, and verifying that the code computes what the design claims. The assistant lowers the cost of a first draft. It does not change what makes a result credible.

That block is the whole claim. Everything below is what it rules out, what it looks like when applied, and how the labor divides.

What AI econometrics is not

Two readings of the term need to be set aside before the definition can do any work.

It is not the model doing the econometrics. The assistant produces code; whether that code matches the estimator we meant to use is a separate question the assistant cannot answer about itself. A clean run shows that the code executed, not that the implementation is correct. The judgment about correctness stays with the analyst.

It is not automated causal inference. Identification is an argument about the world: which units are valid controls, what assumption lets a comparison stand in for the missing counterfactual, when that assumption fails. That argument is made by the economist from knowledge of the setting. No assistant supplies it, and a coefficient returned without it is a number, not a causal effect.

These two exclusions are the same point from two directions. The mechanical labor can be delegated. The claim that an estimate is identified and correctly computed cannot.

What it looks like in practice: two worked instances

The definition is easier to hold against concrete work than in the abstract. Two pieces on this site are instances of it, and each carries one number that shows the division of labor at the point where it matters.

Instance	What the assistant produced	The number	The judgment the economist made
Reimplementing a rolling DiD estimator	Runnable Python from the method’s description	2.06 under demeaning vs 1.13 under detrending, against a planted true effect of 1.0	Which transformation matched the parallel-paths assumption: an identification call
Verifying assistant-written estimator code	A reimplementation that ran with no error or warning	0.18 returned where the truth was 1.0 and the correct estimate 1.13	Planting a known truth to expose an input-ordering bug: a verification call

The first is a reimplementation: a rolling difference-in-differences (DiD) estimator rebuilt in Python from its description, the kind of small-sample method built for one treated unit and a few controls.¹ An assistant can draft that code. What the assistant cannot decide is the transformation choice, and in a hard case with diverging trends the choice decided the answer: demeaning reported an effect of 2.06 against a planted true effect of 1.0, while unit-specific detrending recovered 1.13. The spread from 2.06 to 1.13 is not a coding question. It is an identification question the economist answers: whether treated and control units were on parallel paths.
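The transformation choice can be made concrete with a small simulation. What follows is a minimal sketch under our own assumptions, not the lwdid implementation and not the simulation behind the 2.06 and 1.13 figures: the data-generating process, the slopes, and the function names (`simulate_panel`, `transformed_post_mean`, `rolling_did`) are all hypothetical. One treated unit sits on a steeper linear path than four controls, and a true effect of 1.0 is planted in the post-period. Demeaning subtracts each unit's pre-period mean; detrending subtracts the extrapolation of each unit's own pre-period linear trend.

```python
import numpy as np


def simulate_panel(seed=0, n_controls=4, n_pre=10, n_post=5, true_effect=1.0):
    """Hypothetical DGP: unit-specific linear trends, one treated unit (row 0)."""
    rng = np.random.default_rng(seed)
    n_units = n_controls + 1
    t = np.arange(n_pre + n_post)
    intercepts = rng.normal(0.0, 1.0, size=n_units)
    slopes = np.full(n_units, 0.10)
    slopes[0] = 0.20  # treated unit diverges from controls: parallel paths fail
    y = intercepts[:, None] + slopes[:, None] * t[None, :]
    y += rng.normal(0.0, 0.05, size=y.shape)  # small idiosyncratic noise
    y[0, n_pre:] += true_effect  # plant the known effect in the post-period
    treated = np.zeros(n_units, dtype=bool)
    treated[0] = True
    return y, treated, n_pre


def transformed_post_mean(y, n_pre, how):
    """Per unit: average post-period outcome net of a pre-period baseline."""
    t = np.arange(y.shape[1])
    pre_t, post_t = t[:n_pre], t[n_pre:]
    out = np.empty(y.shape[0])
    for i, series in enumerate(y):
        if how == "demean":
            baseline = np.full(post_t.shape, series[:n_pre].mean())
        elif how == "detrend":
            # np.polyfit returns coefficients highest power first
            slope, intercept = np.polyfit(pre_t, series[:n_pre], deg=1)
            baseline = intercept + slope * post_t
        else:
            raise ValueError(f"unknown transformation: {how!r}")
        out[i] = (series[n_pre:] - baseline).mean()
    return out


def rolling_did(y, treated, n_pre, how):
    """Treated-minus-control difference in the transformed outcome."""
    z = transformed_post_mean(y, n_pre, how)
    return z[treated].mean() - z[~treated].mean()


y, treated, n_pre = simulate_panel()
for how in ("demean", "detrend"):
    print(how, round(rolling_did(y, treated, n_pre, how), 2))
```

In this setup the demeaned estimate is biased in expectation by the slope gap times the distance between the average post-period and the average pre-period, 0.10 × 7.5 = 0.75, while the detrended estimate targets the planted 1.0. Nothing in the code says which transformation is right for a real panel. That depends on whether the units were on parallel paths, which is a claim about the setting.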

The second is verification of assistant-written estimator code.² Here an assistant-written reimplementation of the same estimator returned 0.18 where the planted true effect was 1.0 and the correct estimate was 1.13. An analyst could have summarized 0.18 in a memo as “no meaningful effect.” It was wrong by construction: an input-ordering error meant the estimator ran correctly on the wrong simulated panel, so nothing crashed and no warning appeared. The defect was caught by planting a known truth in a simulation and checking whether the implementation recovered it, the same Monte Carlo logic economists already use. The assistant wrote the code. The check that exposed the failure was the economist’s.
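A planted-truth check of this kind is short to write. The sketch below is an illustration under our own assumptions: the estimator, the data-generating process, and every function name are hypothetical, and the bug shown (panel rows reversed while the treatment mask is left in its original order) is a stand-in for an input-ordering error, not the specific defect from the cited piece. The broken version runs without raising anything. Only the comparison against the planted effect flags it.

```python
import numpy as np


def detrended_did(y, treated, n_pre):
    """Detrend each unit on its pre-period, then compare post-period means."""
    t = np.arange(y.shape[1])
    gaps = np.empty(y.shape[0])
    for i, series in enumerate(y):
        slope, intercept = np.polyfit(t[:n_pre], series[:n_pre], deg=1)
        gaps[i] = (series[n_pre:] - (intercept + slope * t[n_pre:])).mean()
    return gaps[treated].mean() - gaps[~treated].mean()


def misordered_estimator(y, treated, n_pre):
    """Stand-in bug: rows are reversed before estimation, the mask is not."""
    return detrended_did(y[::-1], treated, n_pre)


def planted_panel(rng, true_effect, n_units=5, n_pre=10, n_post=5):
    """Simulated panel with a known effect planted on unit 0 (hypothetical DGP)."""
    t = np.arange(n_pre + n_post)
    slopes = rng.uniform(0.05, 0.25, size=n_units)
    y = slopes[:, None] * t + rng.normal(0.0, 0.05, size=(n_units, t.size))
    treated = np.zeros(n_units, dtype=bool)
    treated[0] = True
    y[0, n_pre:] += true_effect
    return y, treated, n_pre


def recovers_planted_truth(estimator, true_effect=1.0, reps=200, tol=0.1, seed=0):
    """Monte Carlo check: is the average estimate within tol of the truth?"""
    rng = np.random.default_rng(seed)
    estimates = [estimator(*planted_panel(rng, true_effect)) for _ in range(reps)]
    mean_estimate = float(np.mean(estimates))
    return abs(mean_estimate - true_effect) <= tol, mean_estimate


for name, estimator in [("correct", detrended_did), ("misordered", misordered_estimator)]:
    passed, mean_estimate = recovers_planted_truth(estimator)
    print(f"{name}: passed={passed}, mean estimate={mean_estimate:.2f}")
```

The tolerance and the number of replications are judgment calls. The check is only as informative as the simulated design, so the planted panel has to resemble the case the estimator will actually face.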

Both instances make the case in numbers. The assistant produced runnable code in each. The number that separated a correct result from a misleading one came from a judgment the assistant did not make.

The division of labor

The split is worth stating plainly, because it is the operational content of the definition.

The assistant drafts code from a specification, reimplements an estimator from its description, and runs specifications across the variations we ask for. These are mechanical: search, translation, first-pass implementation, repeated execution. Delegating them is the source of whatever time the workflow saves.

We do three things the assistant cannot. Before any code runs, we fix the estimand: which quantity to estimate, and how to aggregate it. We argue the identification from the setting: the assumption that lets the comparison recover a causal effect. And we verify that the code computes that estimand under that identification, running a planted-truth simulation and reading the source line by line.

The cost structure is what makes this a stable division rather than a temporary one. An assistant lowers the cost of producing a plausible first draft. It does not lower the cost of verification; if anything it raises that cost, because a draft that runs cleanly makes omissions easier to miss. The work therefore shifts toward the part the economist keeps. A faster draft is worth little if the verification it now requires is skipped, because unchecked code carries the same appearance of precision with less assurance behind it.

Where it goes

The practice is young enough that its hard questions are still open, and naming them honestly is more useful than predicting how they resolve.

How much of verification can itself be delegated without circularity. An assistant can draft a simulation harness, but if the same kind of process that wrote the estimator also writes its test, a shared blind region in the test data can let the same class of error pass twice. Where the independent check has to come from outside the assistant, and where it can be assisted, is unsettled.

What a usable specification looks like for an estimator. A loose instruction leaves the low-visibility choices (weighting, comparison set, transformation window, indexing) to a default the analyst never chose. A precise spec is necessary and still insufficient: in one case we named cohort-size weighting in the spec and the implementation dropped it anyway. The form a spec should take so that verification can check it against the code is an open methodological question, not a solved one.
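One way to make a spec checkable is to write the low-visibility choices as named fields and then plant a design in which each choice changes the answer. The sketch below is a hypothetical illustration, not a proposed standard: `EstimatorSpec`, `aggregate`, and `aggregate_ignoring_spec` are names we introduce here, and the cohort effects and sizes are made-up planted values. The second function stands in for an implementation that accepts the spec and silently drops the weighting.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class EstimatorSpec:
    """Hypothetical spec: each low-visibility choice is a named field."""
    weighting: str        # "cohort_size" or "equal"
    transformation: str   # "demean" or "detrend" (recorded, unused below)
    n_pre: int            # pre-treatment window length (recorded, unused below)


def aggregate(cohort_effects, cohort_sizes, spec):
    """Aggregate cohort-level effects the way the spec asks for."""
    effects = np.asarray(cohort_effects, dtype=float)
    sizes = np.asarray(cohort_sizes, dtype=float)
    if spec.weighting == "cohort_size":
        return float(np.average(effects, weights=sizes))
    if spec.weighting == "equal":
        return float(effects.mean())
    raise ValueError(f"unknown weighting: {spec.weighting!r}")


def aggregate_ignoring_spec(cohort_effects, cohort_sizes, spec):
    """Stand-in for an implementation that drops the weighting without error."""
    return float(np.mean(cohort_effects))


spec = EstimatorSpec(weighting="cohort_size", transformation="detrend", n_pre=10)

# Planted design: unequal cohort sizes AND different cohort effects,
# so that the weighting choice moves the aggregate.
planted_effects = [0.5, 2.0]
planted_sizes = [30, 10]
target = (30 * 0.5 + 10 * 2.0) / (30 + 10)  # cohort-size-weighted truth

print("target:", target)
print("follows spec:", aggregate(planted_effects, planted_sizes, spec))
print("ignores spec:", aggregate_ignoring_spec(planted_effects, planted_sizes, spec))

# With equal cohort sizes the two implementations agree, so a test built
# on that design cannot tell them apart.
print("equal sizes:", aggregate(planted_effects, [20, 20], spec),
      aggregate_ignoring_spec(planted_effects, [20, 20], spec))
```

The last line is the blind region in miniature: a test design with equal cohort sizes lets the dropped weighting pass. Choosing planted values that separate the alternatives is part of the verification work, and it depends on knowing which choices the spec was supposed to pin down.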

When a reimplementation is worth its verification cost at all. Rebuilding an estimator outside its maintained package adds the full verification burden, justified only when the workflow needs the open language, a reproducible pipeline, or a specification no package supplies. The assistant moves that calculation in one direction only: drafts get cheaper, and the verification side stays where it was.

The label will settle on its own. The division of labor under it is the part worth getting right: the assistant drafts the code and runs the specifications, and the economist keeps the estimand, the identification, and the check that the implementation matches the design we claim to be using.

Notes

1. Cholette, V. (2026, June 13). When a policy reaches only a few units: rolling difference-in-differences (lwdid). Too Early To Say. https://tooearlytosay.com/research/methodology/lwdid-rolling-difference-in-differences/ The 2.06 (demean) versus 1.13 (detrend) contrast against a planted true effect of 1.0 is from the hard-case simulation reported there.
2. Cholette, V. (2026, June 15). How do we know an AI's estimator does what we meant? Too Early To Say. https://tooearlytosay.com/research/methodology/validate-ai-econometric-code/ Against a planted true effect of 1.0, the corrected implementation returns 1.13 and the input-ordering-error version returns 0.18.

Cite this article

Cholette, V. (2026, June 21). AI econometrics: Using AI for code, not for identification. Too Early To Say. https://tooearlytosay.com/research/methodology/ai-econometrics/
