Quality assurance

Victoria Cholette

AI for Applied Researchers · Step 4 of 5

Updated July 21, 2026

This step is the pattern we use for headline difference-in-differences estimates. It produces a QA block with a one-line verdict and the diagnostics that support it.

The problem this step solves

A difference-in-differences or event-study estimate can clear a conventional pre-trend test even when that test has little power against violations large enough to matter. Quality assurance therefore asks a design-specific question: could the available pre-periods detect the departure that would change the study's conclusion?

A precise specification is what lets an artificial intelligence (AI) assistant do useful work. The assistant can implement prespecified power, placebo, sensitivity, and trend checks; we choose the meaningful deviation, the comparison, and the rule for changing the causal sentence.

The bank-closure and Supplemental Nutrition Assistance Program (SNAP) participation case illustrates the problem: the article documenting it reports a conventional pre-trend result and harder diagnostics that point toward different interpretations, while the numerical record remains unreproduced.

When to use this step, and when not to

We use this step after freezing the estimand and primary specification but before writing causal language for a difference-in-differences or event-study result. It requires a defined comparison group, observed pre-treatment periods, treatment timing, and code that can rerun the same estimand under diagnostic specifications.

Decision rule: run the QA block when those inputs exist and a substantively meaningful violation can be stated on the outcome scale before diagnostics are inspected. If the design lacks an identified comparison or usable pre-periods, return to design rather than classify the estimate.

Inputs required

Before bringing in the assistant, we assemble:

The headline estimate, standard error, estimand, estimator, analysis sample, and clustering level.
Pre-treatment event-study coefficients and their full variance-covariance matrix.
The estimation code and treatment-timing field, so every diagnostic preserves the same sample and outcome definition.
A prespecified violation of substantive concern, placebo window, sensitivity model, unit-trend specification, and decision rule.

The AI-assisted move

We hand the assistant the frozen estimate, event-study output, covariance matrix, decision thresholds, and code. Its role is to run four traceable checks under the same estimand and sample.

It evaluates power against the prespecified pre-trend shape and magnitude using the design's covariance structure. A high parallel-trends p-value is reported as a failure to reject, alongside the probability of detecting the violation of concern.
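As a rough illustration of the power check, the sketch below treats the joint pre-trend test as a Wald test: under a hypothesized violation delta, the statistic follows a noncentral chi-square distribution with noncentrality delta' V^{-1} delta, where V is the pre-period covariance matrix. This is a minimal standard-library sketch, assuming a Wald-type joint test and normally distributed coefficients. The function names, the covariance matrix, and the linear-trend slope are hypothetical and are not taken from the case.

```python
import math

def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x), by its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, 10_000):
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_quantile(p, k):
    """Quantile of the central chi-square with k degrees of freedom (bisection)."""
    lo, hi = 0.0, float(k)
    while reg_lower_gamma(k / 2, hi / 2) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if reg_lower_gamma(k / 2, mid / 2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def pretrend_power(delta, V, alpha=0.05):
    """Rejection probability of the joint Wald pre-trend test when the
    true pre-period coefficients equal delta and their covariance is V."""
    k = len(delta)
    lam = sum(d * y for d, y in zip(delta, solve(V, delta)))  # noncentrality
    crit = chi2_quantile(1 - alpha, k)
    # Noncentral chi-square CDF as a Poisson(lam / 2) mixture of central CDFs.
    weight, cum_weight, cdf, j = math.exp(-lam / 2), 0.0, 0.0, 0
    while cum_weight < 1 - 1e-12 and j < 2000:
        cdf += weight * reg_lower_gamma(k / 2 + j, crit / 2)
        cum_weight += weight
        j += 1
        weight *= (lam / 2) / j
    return 1.0 - cdf

# Hypothetical inputs for illustration only; not taken from any study.
V = [[0.040, 0.010, 0.005],
     [0.010, 0.030, 0.010],
     [0.005, 0.010, 0.020]]
slope = 0.05                                # assumed linear pre-trend per period
delta = [slope * e for e in (-3, -2, -1)]   # violation at e = -3, -2, -1
print(round(pretrend_power(delta, V), 3))
```

Because the calculation uses the full covariance matrix, two designs with identical standard errors can have different power against the same violation. With a zero violation the function returns the test's size, which is a quick sanity check on the inputs.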

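It writes a fake-timing placebo that shifts treatment backward and re-estimates on the shifted timing. Because no treatment occurred at the fake date, the placebo estimate should fall inside the prespecified equivalence band around zero.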
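One way to build the fake-timing sample is sketched below. It assumes a long-format panel held as a list of dictionaries. The field names (`unit`, `period`, `treat_period`) are hypothetical, and dropping observations from the real treatment date onward is an assumption of this sketch, made so that the real effect cannot leak into the placebo estimate. Comparison units keep all periods here; a project may prefer to truncate them as well.

```python
def make_placebo_rows(rows, k):
    """Shift each treated unit's treatment date back k periods.

    rows: list of dicts with 'unit', 'period', and 'treat_period'
          (None for never-treated units). Hypothetical field names.
    Returns new rows with 'placebo_treat_period' and 'placebo_post' added.
    """
    if k <= 0:
        raise ValueError("k must be a positive number of periods")
    placebo = []
    for row in rows:
        real = row["treat_period"]
        if real is not None and row["period"] >= real:
            continue  # drop periods affected by the real treatment
        fake = None if real is None else real - k
        placebo.append({
            **row,
            "placebo_treat_period": fake,
            "placebo_post": int(fake is not None and row["period"] >= fake),
        })
    return placebo

# Toy panel: unit A is treated in period 5, unit B is never treated.
panel = (
    [{"unit": "A", "period": t, "treat_period": 5} for t in range(1, 7)]
    + [{"unit": "B", "period": t, "treat_period": None} for t in range(1, 7)]
)
placebo_panel = make_placebo_rows(panel, k=2)
for row in placebo_panel:
    print(row["unit"], row["period"], row["placebo_post"])
```

The placebo sample is then passed to the same estimator as the headline result, with `placebo_post` standing in for the real post-treatment indicator.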

It reports the breakdown M, the smallest violation of parallel trends that makes the confidence interval (CI) include zero, using a sensitivity-bounds method.¹
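Once a sensitivity package has produced robust confidence intervals over a grid of M values, reading off the breakdown value is a simple search. The sketch below assumes those intervals already exist; it does not implement the Rambachan-Roth bounds themselves. The function name and the grid of intervals are hypothetical and do not come from the case.

```python
def breakdown_m(robust_cis):
    """Return the smallest grid value of M whose robust CI contains zero.

    robust_cis: dict mapping M -> (lower, upper), as produced elsewhere
                by a sensitivity-bounds routine.
    Returns None if no interval on the grid contains zero.
    """
    for m in sorted(robust_cis):
        lower, upper = robust_cis[m]
        if lower <= 0.0 <= upper:
            return m
    return None

# Hypothetical intervals for illustration only.
grid = {
    0.0: (0.8, 2.0),
    0.5: (0.3, 2.5),
    1.0: (-0.2, 3.0),
    1.5: (-0.7, 3.5),
}
print(breakdown_m(grid))
```

The answer is only as fine as the grid: the true breakdown value lies between the returned M and the grid point just below it, so a coarse grid should be refined around the crossing before the number is reported.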

It adds unit-specific linear time trends to see whether the effect holds under a specification that absorbs pre-existing trajectories.
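The mechanics of the unit-trend check can be sketched with the Frisch-Waugh-Lovell result: residualize both the outcome and the treatment indicator on a unit-specific intercept and linear trend, then regress one residual on the other. This is a deliberately stripped-down illustration. It omits period fixed effects, staggered-adoption weighting, and clustered standard errors, all of which the real specification would need. The field names are hypothetical and the demo data are synthetic.

```python
from collections import defaultdict

def detrend(ts, xs):
    """Residuals from regressing xs on an intercept and a linear trend in ts."""
    n = len(ts)
    t_bar = sum(ts) / n
    x_bar = sum(xs) / n
    s_tt = sum((t - t_bar) ** 2 for t in ts)  # needs at least two distinct periods
    slope = sum((t - t_bar) * (x - x_bar) for t, x in zip(ts, xs)) / s_tt
    return [(x - x_bar) - slope * (t - t_bar) for t, x in zip(ts, xs)]

def unit_trend_estimate(rows):
    """Treatment coefficient after partialling out unit-specific linear trends.

    rows: list of dicts with 'unit', 'period', 'treated' (0/1), and 'y'.
    """
    by_unit = defaultdict(list)
    for r in rows:
        by_unit[r["unit"]].append(r)
    num = den = 0.0
    for obs in by_unit.values():
        ts = [o["period"] for o in obs]
        y_res = detrend(ts, [o["y"] for o in obs])
        d_res = detrend(ts, [o["treated"] for o in obs])
        num += sum(d * y for d, y in zip(d_res, y_res))
        den += sum(d * d for d in d_res)
    return num / den

def make_unit(unit, intercept, slope, treat_start, effect=2.0):
    """Synthetic unit with its own linear trend and a known treatment effect."""
    rows = []
    for t in range(1, 7):
        d = int(treat_start is not None and t >= treat_start)
        rows.append({"unit": unit, "period": t, "treated": d,
                     "y": intercept + slope * t + effect * d})
    return rows

# Two units with different trends; the true effect is 2.0 by construction.
demo = make_unit("A", 1.0, 0.5, 4) + make_unit("B", 3.0, -0.2, None)
print(unit_trend_estimate(demo))
```

On noiseless synthetic data the sketch recovers the built-in effect, which makes it usable as a known-truth check before the same logic is trusted on real data. Note that unit trends can also absorb part of a real effect that builds gradually, so a shrinking estimate is a prompt for interpretation, not an automatic rejection.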

In the bank-closure and Supplemental Nutrition Assistance Program (SNAP) participation case, the same four-check record organizes the conflict among the article-reported diagnostics. The project figures remain in Provenance because no matching public run reproduces them.

The assistant runs and records the checks. We judge whether each diagnostic is appropriate, interpret the design-specific thresholds, and decide whether the causal sentence survives.

Copy-paste protocol

Paste the following into the assistant, filling the bracketed fields with our own numbers, design, and code.

You are a hostile referee for a difference-in-differences result.
Your job is to break the causal claim, not to confirm it. Do not
reassure me; find the weakest point.

CONTEXT
- Design: [e.g., Callaway-Sant'Anna staggered difference-in-differences]
- Outcome: [e.g., SNAP participation rate, percentage points]
- Treatment: [e.g., bank branch closure]
- Units: [N treated], [N comparison], [panel periods]
- Average treatment effect on the treated (ATT): [value]
- Standard error (SE): [value]
- Pre-treatment event-study coefficients (event time: coef, SE):
    [e=-3: ..., e=-2: ..., e=-1: ...]
- Pre-treatment coefficient covariance matrix: [paste matrix]
- Estimation code is below.

RUN AND REPORT, IN THIS ORDER:
1. POWER. Use the pre-period covariance matrix to evaluate power
   against [PRESPECIFIED PRE-TREND SHAPE] with magnitude [DELTA] on
   the outcome scale. Report the rejection probability for the joint
   pre-trend test. Do not infer power from CI width alone and do not
   treat a high p-value as evidence that trends are parallel.
2. PLACEBO. Write a fake-timing test that shifts treatment back
   [k] periods and re-estimates. Report the estimate and confidence
   interval, then test whether the interval is fully contained in the
   prespecified equivalence band [-DELTA, +DELTA].
3. SENSITIVITY. Set up Rambachan-Roth sensitivity bounds. Report the
   breakdown M (smallest M whose robust CI includes zero) and
   the 95% CI at M = 0, 0.5, and 1. Do not apply a universal cutoff
   to M. Interpret the breakdown value against a justified deviation
   model and the study context.
4. UNIT TRENDS. Re-estimate adding unit-specific linear time
   trends. Report ATT, SE, and p-value. Compare the change with the
   prespecified material-change rule; do not choose the rule after
   seeing the result.

OUTPUT
- A verdict line: ROBUST / FRAGILE / NOT IDENTIFIED.
- The single check that most threatens the claim.
- Runnable code for each test, using my variable names.
- A machine-readable QA record containing inputs, thresholds, sample,
  package versions, estimates, and output paths.

If a required value or variable name is missing, state what is missing.
Do not guess.

CODE:
[paste estimation code]

STOP after the requested output.

Failure check and validation

Failure condition: the placebo confidence interval is not fully contained inside the prespecified equivalence band. Define the band before running the placebo, using an effect small enough to leave the causal interpretation unchanged.

Pass condition: the placebo confidence interval lies entirely inside the prespecified band, and the saved QA record reproduces the estimate from the frozen sample and code. The power, sensitivity, and unit-trend results remain separate evidence in the QA block and are interpreted against their own prespecified design criteria.
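The failure and pass conditions above can be written down as a small check. This sketch treats the band as inclusive at its edges and uses an absolute tolerance for the reproduction test; both choices are assumptions to be fixed in the prespecified decision rule. The function names and the tolerance default are hypothetical.

```python
def placebo_in_band(ci_low, ci_high, delta):
    """True if the placebo CI lies entirely inside [-delta, +delta]."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    return -delta <= ci_low and ci_high <= delta

def qa_passes(placebo_ci, delta, recorded_att, rerun_att, tol=1e-8):
    """Pass only if the placebo CI is inside the band AND the rerun from
    the frozen sample and code reproduces the recorded estimate."""
    reproduced = abs(recorded_att - rerun_att) <= tol
    return placebo_in_band(placebo_ci[0], placebo_ci[1], delta) and reproduced

# Hypothetical numbers for illustration only.
print(placebo_in_band(-0.10, 0.10, delta=0.20))   # interval inside the band
print(placebo_in_band(-0.10, 0.30, delta=0.20))   # upper limit outside the band
```

This is an equivalence-style criterion, not a significance test: a narrow placebo interval that excludes zero can still pass if it sits inside the band, and a wide interval that includes zero fails if it extends beyond it.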

The deliverable

The deliverable is a QA block we can paste into the paper. It contains a one-line classification such as robust, fragile, or not identified; the power read on the parallel-trends test; the placebo estimate, confidence interval, and equivalence-band result; the Rambachan-Roth breakdown M with confidence intervals at selected M values; and the unit-trend re-estimate.

The block also names the frozen sample, estimand, thresholds, code version, output paths, and the single diagnostic that most changes the interpretation.
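A machine-readable version of the QA block might look like the sketch below. The field names follow the items listed above, but the schema itself is an assumption of this example, not a format the protocol prescribes. All values shown are placeholders, including the verdict.

```python
import json
from dataclasses import dataclass, asdict, field

ALLOWED_VERDICTS = {"ROBUST", "FRAGILE", "NOT IDENTIFIED"}

@dataclass
class QARecord:
    """Hypothetical schema for a saved QA record."""
    verdict: str
    most_threatening_check: str
    estimand: str
    sample_id: str
    code_version: str
    thresholds: dict
    estimates: dict
    package_versions: dict = field(default_factory=dict)
    output_paths: dict = field(default_factory=dict)

    def __post_init__(self):
        # Fail loudly on anything outside the three allowed classifications.
        if self.verdict not in ALLOWED_VERDICTS:
            raise ValueError(f"verdict must be one of {sorted(ALLOWED_VERDICTS)}")

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Placeholder values only; fill these from the frozen run.
record = QARecord(
    verdict="FRAGILE",
    most_threatening_check="<name of check>",
    estimand="<estimand>",
    sample_id="<frozen sample identifier>",
    code_version="<commit or tag>",
    thresholds={"delta": None, "placebo_shift_k": None},
    estimates={"att": None, "se": None, "placebo_ci": None,
               "breakdown_m": None, "unit_trend_att": None},
    package_versions={"<package>": "<version>"},
    output_paths={"qa_block": "<path>"},
)
print(record.to_json())
```

Saving the thresholds alongside the estimates lets a later reader confirm that the decision rule was fixed before the diagnostics were inspected.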

Provenance from our work

The worked bank-closure case is documented in Parallel-Trends Sensitivity: An Article-Reported Case. The article reports the following values:

The panel contains 1,408 counties.
The Callaway-Sant'Anna estimate is -0.47 percentage points.
The reported standard error is 0.22.
The two-way fixed-effects estimate is -0.50.
The joint pre-trend p-value is 0.9997.
The placebo p-value is 0.04.
The Rambachan-Roth breakdown M is 0.35.
The county-trend estimate is +0.003.
The reported county-trend p-value is 0.87.

These figures are article-reported and not publicly reproduced. The reported pre-period uncertainty requires a design-specific power calculation, not a rule comparing confidence-interval width with the point estimate.

Public-material status: article only. No matching analysis script, frozen input manifest, run record, or saved output is currently available for this case. The CAPHE interactive methods lab is the closest relevant public instructional material, but it does not reproduce the bank-closure analysis.

The open known-truth validation package demonstrates the fail-loud QA protocol and is open to rerun. It does not reproduce the bank-closure estimates.

References

Rambachan, A., & Roth, J. (2023). A more credible approach to parallel trends. The Review of Economic Studies, 90(5), 2555-2591.