Understanding the Limits of Parallel Trends Tests

A sensitivity analysis case study showing why a high p-value on parallel trends tests can mislead, and how Rambachan-Roth bounds reveal fragile causal claims.

When we run a difference-in-differences analysis, what does due diligence look like? At minimum, we check parallel trends. If the p-value is high, we're good, right?

Let's take a look at a case where the parallel trends test returned p = 0.9997, yet the causal claim still fell apart.


The Setup

Say we want to know whether bank branch closures affect SNAP (food stamp) participation. The hypothesis makes sense: when banks close, residents lose access to services that facilitate benefit delivery. Transaction costs rise. Enrollment might fall.

Here's what makes this a natural experiment. From 2010 to 2020, thousands of bank branches closed across the United States, largely driven by mergers and consolidation. We can track SNAP participation rates at the county level before and after closures.

With 1,408 counties in our sample, some experiencing closures and others not, we can apply the Callaway-Sant'Anna (2021) estimator. This handles staggered treatment timing without the bias problems that plague traditional two-way fixed effects.
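To make the setup concrete, here's a stripped-down sketch of the group-time ATT building block on simulated data. The simulation, column names, and cohort years are hypothetical, and the real Callaway-Sant'Anna estimator adds doubly-robust estimation, aggregation weights, and proper inference; this only shows the core comparison:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated county-year panel with staggered adoption: g is the year a
# county's banks close (0 = never treated). All numbers are illustrative.
counties, years = np.arange(200), np.arange(2010, 2021)
group = rng.choice([0, 2014, 2017], size=counties.size)
panel = pd.DataFrame([(c, y, group[c]) for c in counties for y in years],
                     columns=["county", "year", "g"])
county_fe = rng.normal(0, 1, counties.size)
panel["snap"] = (county_fe[panel["county"]]
                 + 0.1 * (panel["year"] - 2010)                  # common trend
                 - 0.47 * ((panel["g"] > 0) & (panel["year"] >= panel["g"]))
                 + rng.normal(0, 0.5, len(panel)))               # true ATT: -0.47

def att_gt(df, g, t):
    """ATT(g, t): a 2x2 DiD comparing cohort g with the never-treated,
    between the last pre-treatment period (g - 1) and period t."""
    tr, ct = df[df["g"] == g], df[df["g"] == 0]
    d_tr = (tr.loc[tr["year"] == t, "snap"].mean()
            - tr.loc[tr["year"] == g - 1, "snap"].mean())
    d_ct = (ct.loc[ct["year"] == t, "snap"].mean()
            - ct.loc[ct["year"] == g - 1, "snap"].mean())
    return d_tr - d_ct

# Simple average of the post-treatment ATT(g, t) cells.
cells = [(g, t) for g in (2014, 2017) for t in years if t >= g]
att = np.mean([att_gt(panel, g, t) for g, t in cells])
print(f"aggregated ATT: {att:.2f} pp")
```

In practice you would use the authors' `did` R package (or a maintained port) rather than hand-rolling these comparisons, but the 2x2 structure above is what the estimator averages over.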

The main estimate: a 0.47 percentage point decline in SNAP participation following bank closures. Aggregated across treated counties, that represents thousands of households potentially losing access to food assistance.

So far, so good. But can we trust this estimate?


The Standard Checks

Let's run through the usual diagnostics.

Parallel trends test: The joint test of pre-treatment coefficients gives us p = 0.9997. That's about as clean as these tests get. Here are the pre-treatment coefficients:

Event Time    Coefficient    Standard Error
e = -3        -0.056 pp      0.620
e = -2        -0.030 pp      0.582
e = -1        +0.000 pp      0.549

All statistically indistinguishable from zero. The event study shows flat pre-trends, then declining SNAP participation after treatment.
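For intuition about where that p-value comes from, the joint test can be approximated straight from the table above as a Wald statistic. Treating the three coefficients as uncorrelated is an assumption made here for illustration; the real test uses their full estimated covariance matrix:

```python
import numpy as np
from scipy import stats

# Pre-treatment event-study coefficients and SEs from the table above.
beta = np.array([-0.056, -0.030, 0.000])   # e = -3, -2, -1 (pp)
se = np.array([0.620, 0.582, 0.549])

# Wald test of H0: all pre-treatment coefficients equal zero,
# assuming zero covariance between them (illustration only).
W = beta @ np.linalg.inv(np.diag(se ** 2)) @ beta
p = stats.chi2.sf(W, df=len(beta))
print(f"W = {W:.3f}, p = {p:.4f}")   # → W = 0.011, p = 0.9997
```

Even this crude version recovers the reported p-value: the coefficients are so small relative to their standard errors that the test statistic is nearly zero.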

Estimator consistency: The Callaway-Sant'Anna estimate (-0.47 pp) matches our two-way fixed effects estimate (-0.50 pp). When different estimators agree, that's reassuring.

Plausible dynamics: The effect builds over time, as we'd expect if bank closures created persistent barriers. Counties don't suddenly drop SNAP participation; they drift downward over several years.

At this point, the causal claim looks solid. But here's where things get interesting.


A Warning Sign

One robustness check gives us pause: the fake timing test. The idea is to artificially shift treatment backward by two years and re-estimate. If parallel trends truly hold, this placebo should produce a null result.

It doesn't. The fake timing test is significant at p = 0.04.

What does this mean? A significant placebo suggests that "pre-treatment" periods (under the fake timing) still show negative effects. That pattern is consistent with pre-existing trends, not parallel trends.
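Concretely, the placebo can be sketched as shifting each treated cohort's adoption year back and restricting the sample so the fake "post" window is still genuinely pre-treatment. The panel layout and column names (`g` as adoption year, 0 for never treated) are hypothetical:

```python
import pandas as pd

def fake_timing_panel(panel: pd.DataFrame, shift: int = 2) -> pd.DataFrame:
    """Move each treated cohort's adoption year back by `shift` years and
    drop genuinely post-treatment observations, so any 'effect' estimated
    under the fake timing must come from pre-existing trends."""
    placebo = panel.copy()
    treated = placebo["g"] > 0
    placebo.loc[treated, "g"] -= shift
    # Keep untreated counties, and treated counties only before real treatment.
    keep = ~treated | (placebo["year"] < placebo["g"] + shift)
    return placebo[keep]

# Re-estimating the event study on fake_timing_panel(panel) should give a
# null result if parallel trends hold; here it was significant at p = 0.04.
```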

We could dismiss this as a fluke. The main parallel trends test passed overwhelmingly. One alternative test failing doesn't necessarily invalidate everything.

But let's dig deeper.


The Power Problem

Look again at those pre-treatment coefficients. Notice anything about the standard errors? They're enormous. Each pre-period coefficient has a confidence interval spanning more than 2 percentage points.

Our treatment effect is only -0.47 pp. A pre-trend of similar magnitude would fall well within the confidence interval around zero.

Here's the thing: the parallel trends test can't actually detect violations of the size that would matter.

This is the statistical power problem. A "passing" parallel trends test can mean two things:

  1. Parallel trends genuinely hold, or
  2. The test lacks power to detect violations

With only three pre-treatment periods and standard errors of 0.55-0.62, option two becomes very plausible. We're not testing whether parallel trends hold. We're testing whether our data can reject parallel trends, and it can't.

That high p-value (0.9997) doesn't validate our assumption. It reflects our inability to detect violations.
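A back-of-the-envelope power calculation makes this concrete. Suppose a pre-trend violation really were as large as the estimated effect (0.47 pp), and a pre-period coefficient has a standard error of about 0.55. How often would a 5% two-sided test catch it?

```python
from scipy import stats

delta = 0.47   # hypothetical violation as large as the treatment effect (pp)
se = 0.55      # typical pre-period standard error from the table above
alpha = 0.05

z = stats.norm.ppf(1 - alpha / 2)              # critical value, about 1.96
shift = delta / se                             # true effect in SE units
power = stats.norm.cdf(-z + shift) + stats.norm.cdf(-z - shift)
print(f"power = {power:.2f}")                  # → power = 0.14
```

About 14%. A test that misses a decisive violation six times out of seven cannot validate the identifying assumption, no matter how high its p-value.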


This Is Where Sensitivity Analysis Comes In

Rambachan and Roth (2023) offer a framework that doesn't assume parallel trends hold exactly. Instead, it parameterizes potential violations through a parameter M:

  • M = 0: Parallel trends assumed exactly (the standard assumption)
  • M = 1: Violations can be as large as the maximum observed pre-treatment coefficient movement
  • Breakdown M: The smallest M at which the robust confidence interval includes zero

The idea here is to ask: how large do violations need to be before our conclusion changes?

Let's calculate the bounds for different values of M:

M       ATT Bounds        95% CI            Excludes Zero?
0       [-0.47, -0.47]    [-0.91, -0.03]    Yes
0.25    [-0.49, -0.45]    [-0.93, -0.01]    Yes
0.35    (breakdown point)
0.50    [-0.51, -0.42]    [-0.95, +0.01]    No
1.0     [-0.56, -0.38]    [-0.99, +0.06]    No

Our breakdown M is 0.35.

What does that tell us? The effect is only robust to violations 35% as large as the maximum observed pre-trend movement. At M = 0.5, the confidence interval already includes zero. At M = 1, the confidence interval extends well into positive territory.

A common benchmark for a "robust" finding is surviving M ≥ 1: the effect should withstand violations at least as large as what we observed in the pre-period. Our result doesn't come close.
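To see the mechanics, here's a stylized version of these bounds: the identified set widens linearly in M, and a naive confidence interval just adds the M = 0 sampling error. The `delta` scale for the largest pre-period movement is a hypothetical value chosen for illustration; the real confidence sets come from the Rambachan-Roth procedure (the authors' HonestDiD R package), not this simple widening:

```python
att = -0.47     # baseline point estimate (pp)
ci_hw = 0.44    # half-width of the 95% CI at M = 0 (table: [-0.91, -0.03])
delta = 0.08    # hypothetical scale of the largest pre-period movement (pp)

def bounds(m):
    """Identified set and naive 95% CI allowing violations up to m * delta."""
    lo, hi = att - m * delta, att + m * delta
    return (lo, hi), (lo - ci_hw, hi + ci_hw)

# Breakdown M: the smallest m at which the CI reaches zero.
breakdown = (abs(att) - ci_hw) / delta
print(f"breakdown M ~ {breakdown:.2f}")
for m in (0.0, 0.25, 0.5, 1.0):
    (lo, hi), (cl, ch) = bounds(m)
    print(f"M={m:4.2f}  set=[{lo:+.2f}, {hi:+.2f}]  CI=[{cl:+.2f}, {ch:+.2f}]")
```

Even this crude sketch lands near the reported breakdown of 0.35; the exact value depends on how the robust confidence sets are constructed.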

[Figure: Sensitivity bounds expand as we allow larger deviations from parallel trends. The breakdown point at M = 0.35 indicates a fragile causal claim.]

Sensitivity analysis tells us the result is fragile. But it doesn't tell us what's actually happening. Is there selection into treatment?

Let's try adding county-specific linear time trends to our specification. This absorbs pre-existing trajectories. If the treatment effect is real, it should be identified off deviations from each county's own trend. If the effect is driven by pre-trends, it should disappear.

Specification                    ATT       SE       p-value
Baseline (County + Year FE)      -0.47     0.22     0.036
With County Trends               +0.003    0.016    0.87

The effect doesn't just weaken. It disappears entirely. The sign flips from negative to positive. The p-value goes from 0.036 to 0.87.

The entire "effect" gets absorbed by county-specific trends. Counties that experienced bank closures were already on declining SNAP trajectories before the closures happened.
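The logic of this check can be demonstrated with a small simulation in which treatment is assigned to units already on declining trajectories and there is no true effect. Everything here (names, magnitudes, the statsmodels formulas) is illustrative, not the actual study code:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical panel: county-specific slopes, NO true treatment effect.
counties, years = np.arange(60), np.arange(2010, 2021)
panel = pd.DataFrame([(c, y) for c in counties for y in years],
                     columns=["county", "year"])
slope = rng.normal(0, 0.2, counties.size)
treated = slope < np.median(slope)        # selection on declining trajectories
panel["post"] = (treated[panel["county"]] & (panel["year"] >= 2015)).astype(int)
panel["snap"] = (slope[panel["county"]] * (panel["year"] - 2010)
                 + rng.normal(0, 0.3, len(panel)))

# Baseline two-way fixed effects: picks up a spurious negative 'effect'.
base = smf.ols("snap ~ post + C(county) + C(year)", data=panel).fit()
# Adding county-specific linear trends absorbs the trajectories.
trend = smf.ols("snap ~ post + C(county) + C(year) + C(county):year",
                data=panel).fit()
print(f"baseline: {base.params['post']:+.3f}, "
      f"with county trends: {trend.params['post']:+.3f}")
```

The baseline regression reports a large negative coefficient even though the true effect is zero, and it vanishes once each county gets its own trend, the same pattern as in the table above.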


What's Going On Here?

The pattern now makes sense. Bank closures don't happen randomly. They happen in counties experiencing economic decline, population loss, reduced commercial activity. These same forces also reduce SNAP participation: fewer eligible residents, out-migration of low-income families, changing local economies.

Our difference-in-differences design captured the association between bank closures and SNAP declines. But it couldn't separate the causal effect of closures from the pre-existing trajectories of declining counties.

The parallel trends test passed because it lacked power, not because parallel trends held. The fake timing test, which detected something odd, was the correct warning signal.


So What Can We Take Away?

Passing is not validation. A high p-value on a parallel trends test provides some evidence, but it's not proof. When pre-treatment periods are limited and standard errors are large, the test can't detect meaningful violations. We should report the power of our parallel trends tests alongside p-values.

Sensitivity analysis should be standard. Rambachan-Roth bounds tell us how robust our findings are to deviations from parallel trends. A breakdown M of 0.35 raises red flags immediately. This check should be routine.

Unit-specific trends are informative. Adding county-specific trends is a demanding specification, and some argue it's too conservative. But when the effect disappears entirely, we learn something important: the baseline result was driven by pre-existing trajectories.

Listen to the warning signs. The fake timing test failed at p = 0.04. That's the signal to investigate, not dismiss.


The Revised Conclusion

We didn't publish a causal claim about bank closures and SNAP participation. Instead, we reported the finding as an association:

"Bank closures are associated with a 0.5 percentage point reduction in SNAP participation rates. However, sensitivity analysis reveals that treated counties were already on declining SNAP trajectories prior to bank closures. Causality cannot be established with the current identification strategy."

Less exciting than a clean causal finding. But honest.


A Practical Checklist

For anyone running difference-in-differences analyses, here's what to check:

  1. How many pre-treatment periods do we have? Three or fewer often means low power. Be cautious about interpreting passing parallel trends tests.
  2. How large are the pre-treatment standard errors? If confidence intervals around pre-treatment coefficients are wider than the treatment effect, the test can't detect violations that matter.
  3. What is the Rambachan-Roth breakdown M? Values below 0.5 suggest fragility. Values above 1 suggest robustness. This should be reported in every diff-in-diff paper.
  4. Does the effect survive unit-specific trends? This is a demanding check, but informative. If the effect disappears, pre-existing trajectories may explain the result.
  5. Do alternative diagnostic tests agree? When the joint test passes but fake timing or placebo tests fail, investigate the discrepancy.

Sensitivity analysis doesn't make causal inference harder. It makes it honest.


References

Callaway, B., & Sant'Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.

Rambachan, A., & Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5), 2555-2591.


This article describes actual research conducted on bank closures and SNAP participation. Full analysis code and data documentation are available in the project repository.

This project serves as the basis for an interactive methods lab at CAPHE (California Association of Public Health Economists): Understanding the Limits of Parallel Trends Tests. The lab includes an interactive Rambachan-Roth slider that lets readers explore how sensitivity bounds expand as M increases.

Suggested Citation

Cholette, V. (2025, December 10). When parallel trends tests lie. Too Early To Say. https://tooearlytosay.com/research/methodology/parallel-trends-sensitivity/