The problem this step solves
Can a result clear every usual check and still be wrong? The bank-closure case shows that it can.
In that analysis, the joint parallel-trends test returned p = 0.9997. The Callaway-Sant'Anna average treatment effect on the treated was -0.47 percentage points, the two-way fixed-effects estimate was -0.50 percentage points, and the effect built over time the way a real barrier would. Every standard diagnostic suggested that the causal claim was solid.
Adding county-specific linear time trends flipped the sign to +0.003 and pushed the p-value to 0.87. The apparent effect reflected a pre-existing trajectory, not causation.
Quality assurance here is about the gap between a result that passes the usual checks and one that is still correct under harder ones. The AI assistant comes in as a second reader whose job is to propose and run the checks a tired analyst might skip.
When to use this step, and when not
We reach for this step when we have a headline estimate from a difference-in-differences or event-study design and we are about to write a causal sentence. It is especially important when pre-treatment periods are few and standard errors are large, the conditions under which a parallel-trends test can pass without meaning much. In the bank-closure case there were only three pre-treatment periods, with standard errors of 0.5 to 0.6, wider than the -0.47 effect itself.
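A back-of-the-envelope calculation makes the power problem concrete. The sketch below uses the standard normal-approximation formula for the minimum detectable effect of a single coefficient, (z for a two-sided test at level alpha plus z for the target power) times the standard error. The 5 percent level, the 80 percent power target, and the helper name are our assumptions for illustration; a joint test over three coefficients would behave somewhat differently.

```python
from statistics import NormalDist

def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true coefficient a two-sided z-test detects with the given power.

    Normal approximation for a single coefficient; hypothetical helper.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se

att = -0.47  # headline ATT from the bank-closure case, percentage points

for se in (0.5, 0.6):  # range of pre-period standard errors reported in the case
    mde = minimum_detectable_effect(se)
    print(f"SE={se}: MDE={mde:.2f} pp, {mde / abs(att):.1f}x the |ATT|")
```

Under these assumptions, a pre-period violation would need to be roughly three times the size of the headline effect before a single pre-period coefficient would reliably flag it.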
This step does not substitute for clinical or institutional judgment. A sensitivity bound tells us how fragile a number is, not whether the mechanism is plausible. Nor is it a way to manufacture confidence: the goal is to find the weakest point, not to collect a passing stamp. If a design has no pre-periods to test or no comparison group, the right move is to redesign, not to run QA on an unidentified estimate.
Inputs required
Before bringing in the assistant, we assemble:
- The point estimate and its standard error. In the bank-closure case, an average treatment effect on the treated of -0.47 percentage points with a standard error of 0.22.
- Pre-treatment event-study coefficients and their standard errors. For example, -0.056 percentage points at event time -3, -0.030 at event time -2, and 0.000 at event time -1, with standard errors around 0.5 to 0.6.
- The estimation code, so the assistant can propose a placebo and a unit-trend specification that run on the same data.
- The number of treated and comparison units and the panel structure. In this case, 1,408 counties with staggered closure timing.
The AI-assisted move
We hand the assistant the estimate, the pre-period coefficients, and the code. Its role is a hostile referee. It runs four checks we might otherwise rush through.
It reads the pre-period standard errors against the effect size and flags when a high parallel-trends p-value reflects low power rather than evidence.
It writes a fake-timing placebo that shifts treatment backward and re-estimates, a test that should return null.
It reports the breakdown M, the smallest violation of parallel trends that makes the confidence interval include zero, using a sensitivity-bounds method (Rambachan & Roth, 2023).
It adds unit-specific linear time trends to see whether the effect holds under a specification that absorbs pre-existing trajectories.
In the bank-closure case, those four moves separated a real effect from a spurious one. The assistant runs the code we ask for. The verdict stays with us.
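The fake-timing placebo is mostly a data-preparation step. Here is a minimal sketch, assuming a long-format panel held as a list of dictionaries. The field names (`unit`, `period`, `y`, `first_treated`) and the helper name are hypothetical, and the re-estimation itself is left to whatever estimator produced the headline number.

```python
def make_placebo_panel(rows, k):
    """Shift each treated unit's first-treatment period back by k.

    Observations from the true post-treatment window are dropped, so any
    "effect" the estimator finds afterwards cannot come from the real treatment.
    Never-treated units have first_treated=None and are kept unchanged.
    """
    placebo = []
    for row in rows:
        g = row["first_treated"]
        if g is not None and row["period"] >= g:
            continue  # truly treated observation: exclude from the placebo sample
        fake_g = None if g is None else g - k
        placebo.append({
            **row,
            "first_treated": fake_g,
            "post": fake_g is not None and row["period"] >= fake_g,
        })
    return placebo

# Toy panel: unit "A" is treated in period 5, unit "B" is never treated.
panel = [
    {"unit": u, "period": t, "y": 0.0, "first_treated": g}
    for u, g in (("A", 5), ("B", None))
    for t in range(1, 7)
]
placebo = make_placebo_panel(panel, k=2)
fake_post = [(r["unit"], r["period"]) for r in placebo if r["post"]]
print(fake_post)  # [('A', 3), ('A', 4)]
```

Dropping the true post-treatment rows is a design choice: it keeps the real treatment from contaminating the placebo estimate, at the cost of a shorter panel.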
Copy-paste protocol
Paste the following into the assistant, filling the bracketed fields with our own numbers, design, and code.
You are a hostile referee for a difference-in-differences result.
Your job is to break the causal claim, not to confirm it. Do not
reassure me; find the weakest point.
CONTEXT
- Design: [e.g., Callaway-Sant'Anna staggered DiD]
- Outcome: [e.g., SNAP participation rate, percentage points]
- Treatment: [e.g., bank branch closure]
- Units: [N treated], [N comparison], [panel periods]
- Point estimate (ATT): [value] (SE [value])
- Pre-treatment event-study coefficients (event time: coef, SE):
[e=-3: ..., e=-2: ..., e=-1: ...]
- Estimation code is below.
RUN AND REPORT, IN THIS ORDER:
1. POWER. Compare each pre-period SE to the |ATT|. State plainly
whether the parallel-trends p-value can detect a violation the
size of the ATT. If pre-period CIs are wider than the ATT, say so.
2. PLACEBO. Write a fake-timing test that shifts treatment back
[k] periods and re-estimates. Report the placebo p-value. It
should be null; flag it if it is significant.
3. SENSITIVITY. Set up Rambachan-Roth (2023) bounds. Report the
breakdown M (smallest M at which the robust CI includes zero) and
the 95% CI at M = 0, 0.5, and 1. Note that the conventional
robust threshold is M > 1.
4. UNIT TRENDS. Re-estimate adding unit-specific linear time
trends. Report ATT, SE, p-value. State whether the effect
survives or disappears.
OUTPUT
- A verdict line: ROBUST / FRAGILE / NOT IDENTIFIED.
- The single check that most threatens the claim.
- Runnable code for each test, using my variable names.
CODE:
[paste estimation code]
Failure check and validation
The protocol passes only when all four checks line up. We treat the result as fragile if any one of these conditions holds.
- A pre-period confidence interval is wider than the point estimate. In the bank-closure case the pre-period standard errors of 0.5 to 0.6 dwarfed the -0.47 effect, so p = 0.9997 was a power artifact rather than evidence of parallel trends.
- The fake-timing placebo is significant. Here it came back at p = 0.04 when it should have been null, the first clear warning sign.
- The Rambachan-Roth breakdown M is below a robust threshold such as 1. In this case breakdown M was 0.35, meaning the effect tolerates a violation only about 35 percent as large as the maximum observed pre-trend movement before the confidence interval includes zero.
- The effect changes sign or loses significance under unit-specific trends. Here the ATT moved from -0.47 with p = 0.036 to +0.003 with p = 0.87.
A single failure shifts the burden of proof. All four failing together, as they did in the bank-closure case, means the headline number was driven by pre-existing trajectories and should be reported as an association rather than a cause.
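For readers who want the breakdown value stated formally, the following assumes the relative-magnitudes restriction from Rambachan and Roth (2023), which matches the "percent of the maximum observed pre-trend movement" reading above. Here \delta_t denotes the difference in trends between treated and comparison units at event time t, and CI(\bar{M}) is the robust confidence interval for the effect under the restriction; the notation is ours.

```latex
\Delta^{RM}(\bar{M}) = \Bigl\{ \delta : \lvert \delta_{t+1} - \delta_t \rvert
    \le \bar{M} \cdot \max_{s<0} \lvert \delta_{s+1} - \delta_s \rvert
    \ \ \forall\, t \ge 0 \Bigr\},
\qquad
\bar{M}^{*} = \inf \bigl\{ \bar{M} \ge 0 : 0 \in \mathrm{CI}(\bar{M}) \bigr\}
```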
Deliverable
The deliverable is a QA block we can paste into the paper. It contains a one-line classification such as robust, fragile, or not identified; the power read on the parallel-trends test; the placebo p-value; the Rambachan-Roth breakdown M with confidence intervals at selected M values; and the unit-trend re-estimate.
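A minimal sketch of how the QA block could be assembled from the four results, applying the failure conditions listed earlier. The function name, the 0.05 significance cutoff, and the normal-approximation confidence interval are our assumptions; the confidence intervals at selected M values would be appended from the sensitivity output, which this sketch does not compute.

```python
def qa_block(att, pre_ses, placebo_p, breakdown_m, trend_att, trend_p,
             alpha=0.05, m_threshold=1.0):
    """Apply the four failure conditions and return (verdict, report lines)."""
    if not pre_ses:
        return "NOT IDENTIFIED", ["No pre-periods to test; redesign before running QA."]

    # Width of the 95% CI for the noisiest pre-period coefficient (normal approximation).
    widest_ci = 2 * 1.96 * max(pre_ses)
    checks = {
        "power": widest_ci > abs(att),            # pre-period CI wider than the effect
        "placebo": placebo_p < alpha,             # fake timing should be null
        "sensitivity": breakdown_m < m_threshold,
        "unit trends": (trend_att * att < 0) or (trend_p >= alpha),
    }
    failed = [name for name, bad in checks.items() if bad]
    verdict = "FRAGILE" if failed else "ROBUST"
    lines = [
        f"Verdict: {verdict}",
        f"Power: widest pre-period 95% CI spans {widest_ci:.2f} pp vs |ATT| {abs(att):.2f} pp",
        f"Placebo p-value: {placebo_p}",
        f"Breakdown M: {breakdown_m}",
        f"Unit-trend ATT: {trend_att:+.3f} (p = {trend_p})",
        "Failed checks: " + (", ".join(failed) or "none"),
    ]
    return verdict, lines

# Bank-closure numbers; pre_ses uses the endpoints of the reported 0.5 to 0.6 range.
verdict, lines = qa_block(att=-0.47, pre_ses=[0.5, 0.6], placebo_p=0.04,
                          breakdown_m=0.35, trend_att=0.003, trend_p=0.87)
print("\n".join(lines))
```

With the bank-closure inputs, all four checks fail and the block opens with a FRAGILE verdict.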
In the bank-closure case that verdict rewrote the conclusion. Instead of a clean causal claim, we wrote that bank closures are associated with a 0.5 percentage point reduction in Supplemental Nutrition Assistance Program (SNAP) participation rates, that sensitivity analysis shows treated counties were already on declining trajectories, and that causality cannot be established with the current identification strategy. That is less exciting than a causal finding, but it is honest.
Provenance from our work
These four checks are the sequence that overturned our own SNAP-and-bank-closure result. The case is documented in Understanding the Limits of Parallel Trends Tests. The estimate cleared parallel trends at p = 0.9997 but failed the placebo at p = 0.04, broke at a Rambachan-Roth M of 0.35, and flipped sign under county trends. The published article and code record the full path from headline effect to demoted association.
References
- Rambachan, A., & Roth, J. (2023). A more credible approach to parallel trends. The Review of Economic Studies, 90(5), 2555-2591.