Documentation: Make the Analysis Reproducible by a Stranger

Victoria Cholette

AI for Applied Researchers · Step 5 of 5

Updated July 21, 2026


This step is the pattern we use to turn a finished analysis into a result someone else can rerun. It produces a reproducibility document that ties raw data, specification, diagnostics, and headline numbers into a single trail.

The problem this step solves

An analysis is only as credible as a reader's ability to retrace it: which raw data, which specification, which diagnostics, and which number traces to which line of code. When that trail is missing, a reader cannot tell a defensible estimate from an artifact.

In the Broad-Based Categorical Eligibility (BBCE) case, the article describes a two-way fixed-effects difference-in-differences estimate built from an Integrated Public Use Microdata Series (IPUMS-USA) extract of the American Community Survey 2005-2016, aggregated to 612 state-year observations across 51 states. A matching public package is not currently linked. A headline estimate of +1.37 percentage points on Supplemental Nutrition Assistance Program (SNAP) take-up means little to a reader who cannot see how the panel was built, which states were ever treated, and how standard errors were clustered. Documentation supplies that trail.

When to use this step, and when not to

The right moment is once the analysis has stabilized. The specification is chosen, the diagnostics have run, and the numbers in the draft match the numbers the code returns. Documentation written before the estimate settles documents a moving target and goes stale on the next rerun.

We defer this step while the work is still exploratory and the specification is changing daily. There is no value in a polished data dictionary for a panel we may rebuild tomorrow. The signal that documentation is due is simple: we are about to hand the result to a co-author, a referee, or our future self, and we want that reader to reproduce it without asking a single question.

Inputs required

Before drafting, we gather:

The raw data source and the exact extract. For BBCE, the IPUMS-USA American Community Survey 2005-2016 pull and the screen that defines eligibility, such as households at or below 130 percent of the federal poverty guideline.
The construction steps from raw rows to the analysis panel. For BBCE, aggregation to state-year cells, the 612-observation panel, and the split into 41 ever-treated and 10 never-treated states.
The specification. For BBCE, a two-way fixed-effects single-event design with state and year fixed effects, SNAP take-up as the headline outcome, log SNAP per capita as a robustness outcome, and standard errors clustered at the state level.
The diagnostics already run. For BBCE, the pre-period parallel-trends F-test, the placebo test, the leave-one-out sweep across all 41 ever-treated states, and the Goodman-Bacon decomposition.
The code repository that produced the numbers, so every documented value can be traced back to its source.
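Before handing these inputs to the model, it can help to confirm that the panel counts we plan to document agree with each other. The sketch below is one way to do that, not the article's code. It assumes the panel is available as (state, year) pairs. The function name check_panel and the placeholder state labels are ours, and only the counts (612 observations, 2005-2016, 41 ever-treated and 10 never-treated states) come from the text.

```python
from itertools import product

# Counts quoted in the text; the state labels further down are hypothetical.
FIRST_YEAR, LAST_YEAR = 2005, 2016
N_EVER_TREATED, N_NEVER_TREATED = 41, 10
EXPECTED_OBS = 612


def check_panel(rows, ever_treated):
    """Return a list of problems found in a state-year panel.

    rows: iterable of (state, year) pairs, one per observation.
    ever_treated: collection of state identifiers that ever adopt the policy.
    """
    rows = list(rows)
    cells = set(rows)
    states = {state for state, _ in cells}
    years = range(FIRST_YEAR, LAST_YEAR + 1)
    problems = []

    if len(rows) != len(cells):
        problems.append("duplicate state-year cells")
    missing = set(product(states, years)) - cells
    if missing:
        problems.append(f"{len(missing)} state-year cells missing")
    if len(cells) != EXPECTED_OBS:
        problems.append(f"{len(cells)} observations, expected {EXPECTED_OBS}")

    n_treated = len(states & set(ever_treated))
    n_never = len(states) - n_treated
    if (n_treated, n_never) != (N_EVER_TREATED, N_NEVER_TREATED):
        problems.append(
            f"split is {n_treated}/{n_never}, "
            f"expected {N_EVER_TREATED}/{N_NEVER_TREATED}"
        )
    return problems


# Synthetic stand-in panel: 51 placeholder units, the first 41 marked ever-treated.
states = [f"S{i:02d}" for i in range(51)]
panel = list(product(states, range(FIRST_YEAR, LAST_YEAR + 1)))
ever_treated = set(states[:41])

print(check_panel(panel, ever_treated))      # balanced panel: no problems
print(check_panel(panel[1:], ever_treated))  # one cell dropped: two problems
```

An empty list means the documented counts are internally consistent: 51 states times 12 years gives 612 cells, and 41 plus 10 gives 51 states. It says nothing about whether the underlying extract is the right one.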

The AI-assisted move

We hand the analysis code and the raw-data description to the model. Its job is to draft the reproducibility documentation: a data-provenance section, a specification section, a diagnostics section, and a numbered replication path. The model is fast at turning a working script into prose that names each input, transformation, and output.

The model does not get the last word on any number. We treat its draft as a starting point and then reconcile every quantitative claim against the code. For BBCE, that means checking the +1.37 percentage-point estimate, the 612-observation panel count, the 41-and-10 state split, the leave-one-out range, the parallel-trends F-test at p = 0.041, and the Goodman-Bacon forbidden-comparison weight share of 0.512. Documentation inherits the same standard as the analysis. It states what the code returns, including the awkward parts.
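The reconciliation can be partly mechanized. The following is a minimal sketch, assuming the analysis code exposes its results as a dictionary of unrounded values. The key names and the helper functions are hypothetical, and the computed values are placeholders with one entry made deliberately wrong to show what a failure looks like. The leave-one-out range is left out because the text does not state its endpoints.

```python
def matches_as_printed(printed, value):
    """True if value, rounded to the decimals shown in printed, equals printed."""
    text = printed.lstrip("+")
    decimals = len(text.split(".")[1]) if "." in text else 0
    return f"{value:.{decimals}f}" == text


def reconcile(documented, computed):
    """Map each mismatched name to (documented string, computed value)."""
    mismatches = {}
    for name, printed in documented.items():
        value = computed.get(name)
        if value is None or not matches_as_printed(printed, value):
            mismatches[name] = (printed, value)
    return mismatches


# Values as they appear in the draft documentation (taken from the text).
documented = {
    "estimate_pp": "+1.37",
    "n_obs": "612",
    "n_ever_treated": "41",
    "n_never_treated": "10",
    "pretrend_p": "0.041",
    "forbidden_weight_share": "0.512",
}

# Placeholder for what the code returns; NOT results from the article.
computed = {
    "estimate_pp": 1.37,
    "n_obs": 612,
    "n_ever_treated": 41,
    "n_never_treated": 10,
    "pretrend_p": 0.041,
    "forbidden_weight_share": 0.5,  # deliberately wrong, to trigger a mismatch
}

print(reconcile(documented, computed))
```

Comparing at the printed precision keeps the check honest about rounding: a documented "+1.37" passes only if the code's value rounds to 1.37 at two decimals.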

Copy-paste prompt

We paste the analysis script or the key functions and the raw-data description in place of the bracketed blocks, then run this prompt:

You are helping document an empirical analysis so a stranger can
reproduce it from raw data without contacting the author.

Here is the analysis code:
[PASTE SCRIPT OR KEY FUNCTIONS]

Here is the raw data source and how the analysis panel is built:
[PASTE: data source, extract details, screens, aggregation steps,
 unit of observation, sample window, treated/control split]

Write reproducibility documentation with these sections:

1. DATA PROVENANCE
   - Name the exact raw source and extract.
   - List every screen and transformation from raw rows to the
     analysis panel, in order.
   - State the unit of observation and the final sample counts.

2. SPECIFICATION
   - State the estimator, the fixed effects, the outcome variable(s),
     and the standard-error clustering.
   - Define every variable, including how the outcome is constructed.

3. DIAGNOSTICS REPORTED
   - For each diagnostic in the code (e.g. parallel-trends test,
     placebo, leave-one-out, decomposition), state what it tests
     and the value it returns. Do not soften a failing test.

4. REPLICATION PATH
   - A numbered list of commands a stranger runs to go from a clean
     checkout of the repository to the headline numbers.

Rules:
- Pull every number, count, and variable name from the code or the
  data description I gave you. Do not invent values.
- For any number you cannot trace to what I pasted, write
  [VERIFY: ] instead of guessing.
- Be specific. "Standard errors are clustered" is not enough; say
  on what.

Failure check and validation

The draft is not done until two checks pass.

First is the traceability check. We read the documentation against the code and confirm that every number, count, and variable name appears in the script or in its outputs. Any value the model could not source should arrive marked with the [VERIFY: ] placeholder the prompt asks for. If the model produced a clean-looking number with no basis in the code, that is a fabrication and the documentation fails.

A concrete test is simple. Pick three numbers from the draft; for BBCE, the +1.37 percentage-point estimate, the 612 observations, and the leave-one-out range. Search the repository for each. If a number is not in the code or in the logged outputs, it does not belong in the documentation.
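The search itself can be scripted. The sketch below is one possible version, assuming the repository holds plain-text scripts and logs. The function names, the list of file suffixes, and the temporary results.log file are all hypothetical stand-ins for a real repository.

```python
import re
import tempfile
from pathlib import Path

VERIFY_MARKER = re.compile(r"\[VERIFY:[^\]]*\]")
TEXT_SUFFIXES = {".py", ".r", ".do", ".log", ".txt", ".csv", ".md"}


def appears(number, text):
    """True if number occurs in text as a whole numeric token."""
    # Reject matches that are part of a longer number, e.g. "612" in "1612".
    pattern = rf"(?<![\d.]){re.escape(number)}(?!\d|\.\d)"
    return re.search(pattern, text) is not None


def untraceable(numbers, repo_dir):
    """Return the numbers that appear in no text file under repo_dir."""
    texts = [
        path.read_text(errors="ignore")
        for path in Path(repo_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES
    ]
    return [n for n in numbers if not any(appears(n, t) for t in texts)]


def unresolved_markers(doc_text):
    """Return every [VERIFY: ...] placeholder still present in the draft."""
    return VERIFY_MARKER.findall(doc_text)


with tempfile.TemporaryDirectory() as repo:
    # Hypothetical logged output; a real check points at the actual repository.
    Path(repo, "results.log").write_text("n_obs = 612\natt_pp = 1.37\n")
    missing = untraceable(["1.37", "612", "0.512"], repo)

print(missing)
print(unresolved_markers("Clustered at [VERIFY: level] with 612 observations."))
```

One limitation is worth stating. The match is on the printed form, so a log that records only an unrounded value will not match a rounded number in the documentation. In that case we either log the rounded value as well or reconcile that number by hand.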

Second is the stranger-rerun check. We follow our own replication path on a clean checkout, running each numbered command in order. If a step assumes a file, a package, or an environment variable that the documentation never mentioned, the path is broken and we add the missing step. The documentation passes when a clean checkout reaches the headline numbers with no undocumented move.
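A small runner makes the stranger-rerun check repeatable. This is a sketch under stated assumptions: each documented step is expressed as a command list, and the three commands below are placeholders that stand in for a real replication path. The function name run_replication_path and the file names panel.csv and crosswalk.csv are ours.

```python
import subprocess
import sys
import tempfile


def run_replication_path(commands, workdir):
    """Run each command in order inside workdir; stop at the first failure.

    Returns (failing step number, stderr) or (None, "") if every step passes.
    """
    for step, command in enumerate(commands, start=1):
        result = subprocess.run(
            command, cwd=workdir, capture_output=True, text=True
        )
        if result.returncode != 0:
            return step, result.stderr.strip()
    return None, ""


# Placeholder steps: build a panel file, then read it back.
build = [
    sys.executable, "-c",
    "import pathlib; pathlib.Path('panel.csv').write_text('state,year')",
]
estimate = [
    sys.executable, "-c",
    "import pathlib; pathlib.Path('panel.csv').read_text()",
]
# A step that depends on a file the path never creates.
undocumented = [
    sys.executable, "-c",
    "import pathlib; pathlib.Path('crosswalk.csv').read_text()",
]

with tempfile.TemporaryDirectory() as clean_dir:
    ok = run_replication_path([build, estimate], clean_dir)
with tempfile.TemporaryDirectory() as clean_dir:
    broken = run_replication_path([build, undocumented, estimate], clean_dir)

print(ok)         # every step passed
print(broken[0])  # number of the first failing step
```

Each run starts in an empty temporary directory, which approximates a clean checkout: any file a step needs must be produced by an earlier documented step. The sketch does not isolate installed packages or environment variables, so those still need a check on a fresh machine or environment.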

Deliverable

The deliverable is a reproducibility document that travels with the analysis. It includes a data-provenance section naming the raw source and every transformation, a specification section defining the estimator and every variable, a diagnostics section reporting each test and its value, and a numbered replication path from a clean checkout to the headline numbers.

For the BBCE case, the article describes the document that should sit alongside code and data. A matching public package is not currently linked, so the stranger-rerun check remains a publication requirement rather than a completed public verification.

Provenance from our work

This standard comes from the SNAP BBCE article, "When the parallel-trends test fails on one lead, what is left." The article traces the +1.37 percentage-point estimate, the 612-observation panel, the 41-and-10 state split, and the leave-one-out range through its narrative. It reports the awkward parts plainly, including the parallel-trends F-test failing at p = 0.041 and the 0.512 forbidden-comparison weight share. Public-material status: article only. A stranger-rerunnable package is not currently linked.