What Agents Actually Do (And What They Don't)

In a companion post, we argued that AI output quality tracks specification quality, not model capability. A specific version of this argument has been circulating in recent Substack discussions about AI for econometrics, and it involves agents: the kind that read project files, write code, and run analyses autonomously.

The Multi-Language Audit Fallacy

A proposal from a recent Substack series on agents for econometrics goes something like this: to verify an agent’s econometric code, run the same estimator in multiple languages. Estimate a regression in R, Stata, and Python. If the outputs diverge, something is wrong. The reasoning: if LLM errors are stochastic (random syntax mistakes that vary by language), they should be independent across implementations. Cross-language agreement means the code is correct.

This checks whether the code runs correctly. It does not check whether the code runs the right analysis. Software engineers call this a unit test: give a piece of code a known input and verify it produces the expected output. It is good practice for catching implementation errors. If we hand the did package in R, csdid in Stata, and a Python implementation the same data, the same covariates, and the same option flags, correct code will produce the same numbers. If the numbers disagree, something is broken.
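A minimal sketch of that check in Python, assuming each implementation has already reported a point estimate for the same parameter. The function name, the tolerances, and the numbers are hypothetical, chosen only for illustration.

```python
import math

def estimates_agree(estimates, rel_tol=1e-6, abs_tol=1e-8):
    """Return True when every implementation reports the same estimate within tolerance."""
    values = list(estimates.values())
    reference = values[0]
    return all(
        math.isclose(v, reference, rel_tol=rel_tol, abs_tol=abs_tol)
        for v in values[1:]
    )

# Hypothetical point estimates for one treatment effect, one per language.
matching = {"R": 0.1234567, "Stata": 0.1234567, "Python": 0.1234567}
diverging = {"R": 0.1234567, "Stata": 0.1234567, "Python": 0.1310000}

print(estimates_agree(matching))   # all three runs agree
print(estimates_agree(diverging))  # one run disagrees, so something is broken
```

Agreement here confirms that the three runs executed the same instructions. It says nothing about whether those instructions described the right analysis.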

Where it stops helping is with errors that do not live in the syntax. An LLM might make up a function name in R that does not exist, and that mistake would not appear in Stata. Cross-language checking catches implementation errors like these. But when a prompt says “control for county characteristics” without listing which variables, the same ambiguous instruction goes to all three implementations. All three agents might select the same wrong variables, or different wrong variables. Either way, the source of the error is upstream of the language.

Consider a concrete case. A methods section says “include demographic controls.” Agent A in R selects population, median income, and percent nonwhite. Agent B in Stata selects population density, poverty rate, and unemployment. Agent C in Python selects a mix of both. All three pass syntax checks. All three produce different estimates. The multi-language audit sees disagreement and flags a “bug.” The real problem is that “demographic controls” was never defined. Three correct implementations of an ambiguous spec.
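A small simulation makes the point concrete. This is a sketch on made-up data, not a replication of anything: the true effect of 1.0, the two stand-in controls (income and poverty), and every coefficient are assumptions built into the simulation, and ols_coefs is a helper written for this example using only the standard library.

```python
import random

def ols_coefs(y, columns):
    """OLS via the normal equations; returns [intercept, b1, b2, ...]."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

# Simulated data: both controls confound the treatment (all numbers are assumptions).
rng = random.Random(0)
n = 2000
income = [rng.gauss(0, 1) for _ in range(n)]
poverty = [rng.gauss(0, 1) for _ in range(n)]
treat = [0.5 * inc + 0.8 * pov + rng.gauss(0, 1) for inc, pov in zip(income, poverty)]
y = [
    1.0 * t + 2.0 * inc + 3.0 * pov + rng.gauss(0, 1)
    for t, inc, pov in zip(treat, income, poverty)
]

# Three readings of "include demographic controls", each implemented correctly.
agent_a = ols_coefs(y, [treat, income])[1]           # income only
agent_b = ols_coefs(y, [treat, poverty])[1]          # poverty only
agent_c = ols_coefs(y, [treat, income, poverty])[1]  # both

print(round(agent_a, 2), round(agent_b, 2), round(agent_c, 2))
```

In this setup the three coefficients on treat land far apart even though each regression is computed correctly. Only the version that conditions on both simulated confounders recovers something close to the effect of 1.0 that was built in. No amount of rerunning the same regression in another language would reveal which reading was intended.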

When one agent clusters standard errors at the state level and another at the county level because the methods section says “at the appropriate level,” that gap tells us the specification left a choice open. But the multi-language approach was not built to surface it. It holds the specification fixed and only checks execution.

Packages can also diverge on default settings. The R did package and Stata’s csdid are a real example: R defaults to multiplier bootstrap standard errors and simultaneous confidence bands, while Stata defaults to asymptotic standard errors and pointwise confidence intervals. They even use different doubly robust estimators (dripw in R, drimp in Stata). Same data, same method, different numbers by default. The fix is to read the package docs and align options explicitly. This is a documentation problem.
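One way to make the alignment step explicit is to write the defaults down side by side and diff them before running anything. The sketch below records the defaults exactly as described in the paragraph above; the dictionary keys are descriptive labels invented for this example, not real option names in either package, and the values should be confirmed against the package documentation.

```python
# Defaults as described in this post. Keys are hypothetical labels, not package options.
R_DID_DEFAULTS = {
    "standard_errors": "multiplier bootstrap",
    "confidence_bands": "simultaneous",
    "dr_estimator": "dripw",
}
STATA_CSDID_DEFAULTS = {
    "standard_errors": "asymptotic",
    "confidence_bands": "pointwise",
    "dr_estimator": "drimp",
}

def differing_settings(a, b):
    """Map each setting that differs to its (a, b) pair of values."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }

for setting, (r_value, stata_value) in differing_settings(
    R_DID_DEFAULTS, STATA_CSDID_DEFAULTS
).items():
    print(f"{setting}: R={r_value!r}, Stata={stata_value!r}")
```

Every setting the diff prints is one that has to be set explicitly in both languages before cross-language agreement means anything.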

For deterministic code, the design is the verification problem. Execution is the easy part.

What Agents Actually Are

Here is a working definition. An agent is a process that reads project state (files, data, documentation), decides what to do next, takes action (writes code, runs commands, edits files), and observes results. Then it adjusts.

The distinction from a chatbot matters. A chatbot answers one question. An agent can read the data dictionary, notice a variable is coded differently than expected, write a cleaning step, run it, and check the output. The value is in the loop: sequenced actions informed by project context.
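The loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular agent framework works: ProjectState, decide, act, and run_agent are hypothetical names, the “decision” is a single hard-coded rule, and the rescaling factor of 1000 is assumed for the example.

```python
from dataclasses import dataclass, field

@dataclass
class ProjectState:
    docs: dict   # what the agent can read, e.g. a data dictionary
    data: dict   # column name -> list of values
    history: list = field(default_factory=list)

def decide(state):
    """Toy policy: find a column whose documented unit differs from the expected one."""
    for column, unit in state.docs["units"].items():
        if unit != state.docs["expected_units"][column]:
            return ("rescale", column)
    return ("stop", None)

def act(state, action):
    """Apply the chosen cleaning step and report what was done."""
    kind, column = action
    if kind == "rescale":
        # Assumption for this toy: the only mismatch is thousands of dollars vs dollars.
        state.data[column] = [value * 1000 for value in state.data[column]]
        state.docs["units"][column] = state.docs["expected_units"][column]
    return f"{kind}:{column}"

def run_agent(state, max_steps=10):
    for _ in range(max_steps):
        action = decide(state)        # read project state and decide
        if action[0] == "stop":
            break
        result = act(state, action)   # take action
        state.history.append(result)  # observe the result, then loop again
    return state

state = ProjectState(
    docs={"units": {"income": "thousands_usd"}, "expected_units": {"income": "usd"}},
    data={"income": [52.0, 48.5]},
)
run_agent(state)
print(state.data["income"], state.history)
```

Everything the toy agent does is driven by what is in docs. Remove the expected units from the data dictionary and decide has nothing to compare against, which is the point of the next paragraph.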

The implication: agent output quality depends on the quality of what the agent can read. If our project docs are thin, our variable names are cryptic, and our methods section is vague, the agent has little to work with. The model does not make up for missing context with hidden knowledge.

We wrote about this in The Cold Start Problem: every new session starts empty. The agent’s effectiveness depends entirely on what context it can access. A CLAUDE.md file that persists project context across sessions changes what the agent can do. The process has better inputs, so it produces better outputs. The model stayed the same.

An agent, then, is a specification-bounded process: it can only be as good as what we give it to work with. Less exciting than a personality with hidden talents waiting to be unlocked by the right prompt. More useful, because it tells us exactly where to invest.

When Agents Matter for Research

If agents are specification-bounded processes, which patterns actually matter for empirical research? Four seem to hold up across the work we have documented on this site.

Context persistence

The agent that knows our variable naming conventions, our identification strategy, and our data structure does better work than one that does not. This is not about model quality. It is about input quality.

A CLAUDE.md file that specifies “treatment is defined as the county’s first year above the 75th percentile of the distribution” removes an entire class of ambiguity. Without it, the agent has to guess. Or more likely, it makes a choice we did not intend, and we have to catch it during review. We covered the mechanics of this in CLAUDE.md for Research Context, and the pattern has held up across multiple projects since.

Phase-aware prompting

Exploration, implementation, and documentation are different tasks. They need different things from an agent.

During exploration (what does this data look like, where is missingness concentrated), we want the agent to surface surprises. During implementation (estimate this staggered DiD with these exact specifications), we want it to follow the spec precisely. During documentation (write up what was estimated and why), we want it to reference what was actually done, not hallucinate a cleaner version of the analysis.

Treating all three phases the same wastes the agent’s strengths. We laid out phase-specific prompting strategies in an earlier post, and the pattern still holds. Matching the prompt to the research phase produces better output than a generic “do the analysis” request.

The verification tax

Every line of agent-generated code needs checking. This is not a flaw in the technology. It is the cost of the speed gain.

The 93% time reduction we documented when converting a well-specified methods section into code still included verification time. The generation was fast. The checking still had to happen. We have written about the verification tax elsewhere: the speed gain on generation comes with a checking cost that does not go away.

What seems to modulate the tax is specification precision. When the prompt is precise (exact estimator, exact control group, exact clustering level, exact robustness checks), there are fewer places for the agent to make unforced choices. Fewer unforced choices means fewer things to verify. The tax is real. It also responds to specification quality.

Specification as the bottleneck

A well-written methodology section is already most of the implementation spec. If our methods section says “estimate a staggered DiD using Callaway and Sant’Anna (2021) with never-treated as the control group, clustered at the state level, with county and year fixed effects,” that is nearly executable. An agent can turn that into running code. Few ambiguous choices remain.

A vague methods section (“we use difference-in-differences with appropriate controls”) produces vague code regardless of how capable the agent is. The agent cannot invent the research design. It cannot decide what “appropriate” means. That judgment is upstream.
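One way to see the difference is to write both methods sections down as structured specs and count what each leaves open. The sketch below is illustrative only: EstimationSpec and unforced_choices are hypothetical names, and the five fields are one possible checklist, not a complete one.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple

@dataclass
class EstimationSpec:
    estimator: Optional[str] = None
    control_group: Optional[str] = None
    cluster_level: Optional[str] = None
    fixed_effects: Optional[Tuple[str, ...]] = None
    controls: Optional[Tuple[str, ...]] = None

def unforced_choices(spec):
    """Fields the methods section left open, i.e. choices the agent would make for us."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]

# The precise methods sentence quoted above, field by field.
precise = EstimationSpec(
    estimator="Callaway and Sant'Anna (2021)",
    control_group="never-treated",
    cluster_level="state",
    fixed_effects=("county", "year"),
)

# "We use difference-in-differences with appropriate controls."
vague = EstimationSpec(estimator="difference-in-differences")

print(unforced_choices(precise))
print(unforced_choices(vague))
```

The precise sentence leaves one field open in this checklist (the covariate list); the vague one leaves four. Each open field is a decision the agent will make silently unless we make it first.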

Fred Brooks made a version of this argument in 1986: the essential difficulty of software is deciding what to build, not the coding. The “accidental” complexity of syntax and compilation can be automated away. The “essential” complexity of specification cannot. Agents automate the accidental part faster than ever. The essential part remains.

We saw this when converting a methodology section to code: the time savings scaled with how precise the methods section was, not with any trick in the prompting. Even with a precise spec, context window budgeting remains a practical constraint. We have to be strategic about what the agent sees, because it cannot see everything at once.

All four patterns point in the same direction. The agent shifts the bottleneck, but it does not eliminate the hard part.

The Researcher’s Role (For Now)

What remains human, and why?

Agents shift effort from “how do I code this” to “what exactly should I measure and why.” The implementation gets faster. The design thinking does not.

Research design: what comparison identifies the causal effect? Identification strategy: what assumptions are we making and are they credible? Result interpretation: what does this estimate mean? These remain human tasks, at least for now. They require judgment about the world (knowledge of institutions, politics, implementation realities) that the data and the prompt do not contain. An agent can estimate a treatment effect. It cannot judge whether parallel trends is plausible given what we know about the policy process.

The prompt is the research judgment. When we write a detailed prompt specifying the estimator, the control group, the clustering level, and the robustness checks, we are making research decisions. The agent executes them. The quality of those decisions determines the quality of the output.

This is why the copy-paste ceiling we wrote about earlier exists: at some point, the workflow demands more than copying model output into a script. It demands that we know what we want the script to do and why.

“AI will replace researchers” and “AI is useless for research” are both wrong, for the same reason. Both treat the agent as if its value depends on raw model capability: either it is smart enough to do research or it is not. The actual mechanism is different. The tool is powerful when the researcher is clear about what to ask for. It is weak when the researcher is vague. The constraint moved. It did not disappear.

What to Do Differently

Five changes that seem to matter, based on the patterns above.

1. Audit the specification, not the execution. When agent output is wrong, the first question should be: was the specification precise enough that two independent researchers would make the same implementation choices? If not, the fix is upstream of the agent.
2. Invest in project context. A CLAUDE.md file, clear variable naming, documented data dictionaries. These are the input that determines output quality.
3. Match the prompting to the phase. Exploration prompts should invite surprises. Implementation prompts should constrain choices. Documentation prompts should reference what was actually done, not what we wish we had done.
4. Budget for verification. Plan to check everything. The time savings come from faster generation; the review still happens. The verification tax is lower when the specification is tighter, and it never reaches zero.
5. Write the methods section first. If we cannot write a precise methods section, the agent cannot write precise code. The methods section is the spec. This is the same logic behind pre-analysis plans: locking the design before seeing results prevents loose choices from becoming researcher degrees of freedom. Investing in it pays off twice: once for the paper, once for the implementation.

Too Early to Say

Whether agents transform empirical research depends less on model capability and more on whether researchers learn to specify precisely what they want. That is a human skill. It is hard. Too early to say how it plays out.

For those who want to start somewhere concrete: CLAUDE.md for Research Context covers the project context piece, and From Methodology to Code walks through the specification-to-implementation pipeline in practice.

Suggested Citation

Cholette, V. (2026, March 2). What agents actually do (and what they don't). Too Early To Say. https://tooearlytosay.com/research/methodology/what-agents-actually-do/
