CLASSICAL STATISTICS IN THE AGE OF AI — PART 1 OF 3

Prediction-powered inference corrects AI-imputed survey estimates

Treating AI-imputed survey responses as real observations can understate a prevalence estimate by a factor of three while reporting a tight confidence interval. A regression adjustment a decade older than the models lets the predictions sharpen the estimate without ever making it worse.

The first-order answer to a small survey is always more and better data, and nothing in this series argues otherwise. But applied work keeps handing us samples that cannot grow: the population is hard to reach, the field window has closed, the budget is spent, or the subgroup we care most about is simply rare. When the data cannot get bigger, the practical question becomes what else might buy back statistical power. One tempting answer: an artificial intelligence (AI) model can read each respondent’s demographic record and guess how the people we never reached would have answered. Let it fill in the missing responses, the idea goes, and the small survey starts to behave like a bigger one: narrower intervals, and subgroup estimates that were too noisy to publish.

Whether that hope is justified is the question Avi Feller took up in a recent Stanford talk, “Classical Statistics in the Age of AI.”1 His answer, roughly: used one way, AI guesses push the estimate off target while shrinking its reported uncertainty; used another way, through a regression adjustment older than the models, they can sharpen it without ever making it worse. His paper and code are not yet public; we can still implement the method as he described it on other data and learn something either way. This first post tests both uses.

A guess dressed up as an observation

Every respondent in our analysis frame from the Health Information National Trends Survey (HINTS 7), a national health survey, answered the four-item Patient Health Questionnaire (PHQ-4), a brief screen for psychological distress. The frame keeps only complete responses, and HINTS itself is a survey, not a census, with plenty of households that never returned it. The distress share we report below, 14.2%, is therefore the truth about this frame rather than about the country, and for the method test that is exactly enough: we know the right answer for every person in the frame, so when we pretend we never reached some of them and let the model guess, the guesses are gradeable against what those same people actually said. One more choice makes the test easy for the model: we pick who to “miss” at random, whereas real nonresponse is rarely random.

The staging has three layers. The analysis frame holds 5,975 respondents. A 45% slice trains the predictive model, a gradient-boosted classifier that guesses distress status from six demographic and health covariates. The remaining 55%, 3,287 respondents, is the evaluation universe, and its true distress share is 14.2%; that figure is the target every estimate below should hit. Within the evaluation universe, 500 respondents play the sample we managed to reach, and the model guesses the answer for the other roughly 2,800.

The naive estimate then does what the tempting idea says to do. Each of those roughly 2,800 guesses is thresholded to a yes or a no, and the share of yes is read straight off the guesses, as if the guesses themselves were the collected data. The estimate comes back at 4.3%. The real share, again, is 14.2%. The procedure does not stumble or warn; it returns a clean number that understates distress by a factor of three. The dressed-up guess earns the name naive imputation, and the mechanism of the miss generalizes: a predictive model asked to guess a rare outcome pulls its guesses toward the typical case, so a smooth guesser produces fewer flagged cases than the world contains.
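The smoothing mechanism can be reproduced on synthetic data. The sketch below is not our HINTS pipeline. It assumes each person has a true risk drawn from a Beta(1.4, 8.6) distribution (mean 0.14, a hypothetical choice) and hands the guesser those risks exactly, which is the best case for any model. The function name and the 0.5 cutoff are also illustrative assumptions.

```python
import numpy as np

def threshold_vs_truth(n=50_000, a=1.4, b=8.6, cutoff=0.5, seed=0):
    """Compare the realized share of a rare outcome with two summaries of the guesses."""
    rng = np.random.default_rng(seed)
    risk = rng.beta(a, b, size=n)      # each person's true probability of a "yes"
    answer = rng.random(n) < risk      # the yes/no answer each person actually gives
    return {
        "true_share": answer.mean(),                    # what a full survey would find
        "thresholded_share": (risk >= cutoff).mean(),   # guesses forced to yes/no
        "mean_probability": risk.mean(),                # guesses left as probabilities
    }

result = threshold_vs_truth()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the thresholded share falls far below the realized share even though the guesser knows every risk exactly, while the average of the untouched probabilities stays close to it. The shortfall comes from forcing a yes or a no out of risks that mostly sit below the cutoff, not from a badly trained model.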

The second failure is false precision. Real respondents disagree with each other; a model’s guesses mostly agree, because every guess comes from the same trained model, and a single model maps similar inputs to similar answers. Feed thousands of agreeing guesses into a standard-error formula and the interval shrinks as if thousands of independent people had answered. Our run did not save that interval, so no number appears here; the shrinkage follows from imputation itself, not from any quirk of our model.

When the sample cannot grow, nothing in the output flags either failure; detecting them requires exactly the known truth a real survey lacks. Our staged version has that truth, and that is the only reason the 4.3% is visible as an error rather than publishable as a finding.

The guess as a covariate, in plain terms

Figure: Horizontal bar chart titled 'Treating AI guesses as data understates distress threefold.' Three estimates of moderate-to-severe distress prevalence are shown against a dashed vertical line marking the true value of 14.2 percent: the naive estimate treating AI guesses as data comes in at 4.3 percent, far short of the truth; the survey-only estimate from 500 real interviews comes in at 14.8 percent with a wide confidence interval; and the corrected do-no-harm estimate comes in at 15.0 percent. Both survey-based estimates land within one percentage point of the true value.

Feller’s alternative is older than the AI it disciplines: treat the AI’s guess as a covariate rather than as the answer, one more column in a regression, sitting alongside the real outcomes we actually collected.

Suppose we run a randomized experiment, with real outcomes for the people we measured and an AI prediction of that outcome for everyone. The adjusted estimator regresses the real outcomes on three terms: a treatment indicator, the AI prediction centered at its mean, and the product of those two, the treatment-by-prediction interaction. The coefficient on the treatment indicator is the estimate. The properties of this adjustment were established for experimental data in 2013, well before the AI models it now disciplines existed.2 The estimate still rests entirely on the real observations; the prediction’s only job is to explain away noise.
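The regression just described fits in a few lines. What follows is a minimal sketch on synthetic data, not Feller’s code and not our replication code; the function names and the data-generating choices (standard normal prediction, correlation 0.9, effect 0.5) are assumptions made for illustration.

```python
import numpy as np

def difference_in_means(y, treat):
    """Unadjusted estimate: mean outcome of treated minus mean outcome of controls."""
    treat = np.asarray(treat, dtype=bool)
    y = np.asarray(y, dtype=float)
    return y[treat].mean() - y[~treat].mean()

def adjusted_effect(y, treat, pred):
    """Regress y on treatment, the centered prediction, and their interaction.

    The coefficient on the treatment indicator is the adjusted estimate.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(treat, dtype=float)
    pred_c = np.asarray(pred, dtype=float) - np.mean(pred)   # center at the mean
    X = np.column_stack([np.ones_like(y), t, pred_c, t * pred_c])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

# Synthetic experiment (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
n, rho, effect = 2000, 0.9, 0.5
pred = rng.normal(size=n)                       # the AI prediction
treat = rng.permutation(n) < n // 2             # exactly half treated, at random
y = rho * pred + np.sqrt(1 - rho**2) * rng.normal(size=n) + effect * treat

print("difference in means:", round(difference_in_means(y, treat), 3))
print("adjusted estimate:  ", round(adjusted_effect(y, treat, pred), 3))
```

Both numbers estimate the same effect from the same real outcomes. The prediction never replaces an outcome; it only enters the design matrix.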

The setting just moved: the imputation failure was demonstrated on a survey mean, while the guarantee that follows is established for randomized experiments. A do-no-harm estimator exists for survey means too, under the name prediction-powered inference.3 We have not yet run that estimator on HINTS, so the rest of this post stays in the experimental setting, where we can show the guarantee operating.
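For readers who want to see the shape of the survey-mean version, here is a hedged sketch of a prediction-powered mean with a tuning weight. It has not been run on HINTS. The weight formula is the variance-minimizing choice for this two-sample estimator, in the spirit of the power-tuned variant sometimes called PPI++; treating it as the estimator the post refers to is our assumption. The function name, the synthetic risks, and the deliberately shrunken predictions are all hypothetical.

```python
import numpy as np

def ppi_mean(y_lab, pred_lab, pred_unlab, tune=True, z=1.96):
    """Prediction-powered estimate of a mean, with a normal-approximation interval.

    y_lab, pred_lab : real answers and model guesses for the people we reached
    pred_unlab      : model guesses for the people we did not reach
    """
    y_lab = np.asarray(y_lab, dtype=float)
    pred_lab = np.asarray(pred_lab, dtype=float)
    pred_unlab = np.asarray(pred_unlab, dtype=float)
    n, N = len(y_lab), len(pred_unlab)
    lam = 1.0                                   # untuned: trust the guesses fully
    if tune:
        var_f = np.var(np.concatenate([pred_lab, pred_unlab]), ddof=1)
        cov_yf = np.cov(y_lab, pred_lab)[0, 1]
        lam = cov_yf / ((1 + n / N) * var_f) if var_f > 0 else 0.0
    # Mean of the guesses, corrected by how far the guesses miss on real answers.
    est = lam * pred_unlab.mean() + (y_lab - lam * pred_lab).mean()
    se = np.sqrt(lam**2 * pred_unlab.var(ddof=1) / N
                 + (y_lab - lam * pred_lab).var(ddof=1) / n)
    return est, (est - z * se, est + z * se), lam

# Synthetic illustration: 500 reached, 2,787 not reached, guesses that run low.
rng = np.random.default_rng(1)
risk_lab, risk_unlab = rng.beta(1.4, 8.6, 500), rng.beta(1.4, 8.6, 2787)
y_lab = (rng.random(500) < risk_lab).astype(float)
est, ci, lam = ppi_mean(y_lab, 0.3 * risk_lab, 0.3 * risk_unlab)
print(f"estimate {est:.3f}, interval ({ci[0]:.3f}, {ci[1]:.3f}), weight {lam:.2f}")
```

The correction term is computed only from people whose real answers we hold, so a systematic miss in the guesses is subtracted out rather than published. When the guesses carry no signal, the fitted weight sits near zero and the estimate is close to the plain sample mean.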

This design has a property the naive approach lacks, and it is the property that makes the whole idea usable: it does no harm. The guarantee is asymptotic, meaning it is proved for large samples; in our 2,000-person runs below it held. If the AI’s predictions track the truth, the adjustment soaks up variance and the confidence interval narrows. If the predictions are useless, the adjustment contributes nothing and the estimator falls back to the plain difference in means, the same answer we would have gotten with no AI at all. The guess can help; it cannot take over.

The contrast with imputation is the whole story. Imputation substitutes the model’s guesses for the people we failed to reach. Adjustment uses the guesses only to reduce noise in an estimate that the real respondents still determine.

Putting numbers on the guarantee

The simulation gives the guarantee a measurable form. Each run draws 2,000 people and a continuous outcome with a true treatment effect of 0.5. Half the people are assigned to treatment at random. Every person also carries an AI prediction, constructed so that its correlation with the true outcome equals a chosen value rho. We repeat this 2,000 times per value of rho and measure the variance of the treatment-effect estimate across the runs. Rho is the AI-quality knob: at rho = 0.9 the model is an excellent guesser; at rho = 0 it produces noise that merely looks like a prediction.
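A compact version of that design is sketched below. It is a reconstruction under stated assumptions, not the replication code: the outcome and prediction are standard normal, rho is the correlation between the prediction and the outcome absent treatment, and the demo uses 500 replications per rho to keep the run short where the post used 2,000.

```python
import numpy as np

def variance_reduction(rho, n=2000, reps=2000, effect=0.5, seed=0):
    """Share of the difference-in-means variance removed by the adjustment."""
    rng = np.random.default_rng(seed)
    unadjusted = np.empty(reps)
    adjusted = np.empty(reps)
    for r in range(reps):
        pred = rng.normal(size=n)                         # AI prediction
        treat = rng.permutation(n) < n // 2               # half treated at random
        y = rho * pred + np.sqrt(1 - rho**2) * rng.normal(size=n) + effect * treat
        unadjusted[r] = y[treat].mean() - y[~treat].mean()
        pred_c = pred - pred.mean()
        X = np.column_stack([np.ones(n), treat, pred_c, treat * pred_c])
        adjusted[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]   # treatment coefficient
    # Variance of each estimator across replications, expressed as a reduction.
    return 1 - adjusted.var(ddof=1) / unadjusted.var(ddof=1)

for rho in (0.0, 0.3, 0.6, 0.9):
    print(f"rho = {rho}: variance reduction {variance_reduction(rho, reps=500):.1%}")
```

Under this construction the share of outcome variance the prediction can explain is rho squared, so the reduction should land near that value at each setting, give or take simulation noise.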

At rho = 0.9, the adjusted estimator’s variance comes in 82% below the unadjusted difference in means. Two cautions about reading that number, because it is easy to over-celebrate. First, variance and interval width are different units: an 82% variance reduction means the confidence interval narrows by about 58%, since width scales with the square root of variance. The interval we would publish is a bit less than half as wide (about 42% of the unadjusted width), which is the practical equivalent of having roughly five and a half times the sample size for this estimate (one divided by the unreduced variance share, 1/0.18). Second, that payoff is conditional on the AI actually being that good, which on hard outcomes it often is not.
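The unit conversion is short enough to check directly. The two helper names below are ours, introduced only for this illustration.

```python
import math

def interval_narrowing(variance_reduction):
    """Fractional drop in interval width implied by a fractional drop in variance."""
    # Width is proportional to the standard error, the square root of the variance.
    return 1 - math.sqrt(1 - variance_reduction)

def sample_size_multiplier(variance_reduction):
    """How many times larger an unadjusted sample would need to be to match."""
    # Variance of a mean-type estimator scales as 1/n.
    return 1 / (1 - variance_reduction)

print(round(interval_narrowing(0.82), 3))      # about 0.576, the "about 58%" above
print(round(sample_size_multiplier(0.82), 2))  # about 5.56, i.e. 1/0.18
```

The same two lines convert any variance reduction from the simulation into the quantities a reader of a published interval cares about.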

The other end of the dial is the do-no-harm case. At rho = 0, the variance reduction is -0.2%, and the negative sign is simulation noise around zero rather than evidence of harm; a finite set of runs never lands on exact zero. The estimator with a useless AI strapped to it performs like the estimator with no AI at all; it has reverted to the plain difference in means, exactly as the theory says it should. We pay nothing for bringing a worthless prediction to the table.

And across every quality level we ran, from useless to excellent, the 95% confidence intervals covered the truth between 94.5% and 95.8% of the time for both the adjusted and unadjusted estimators, right at the nominal rate, the 95% the intervals advertise. That is the line the naive imputation crossed. The adjusted estimator never claims precision it does not have, whether the AI is brilliant or broken.
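Coverage can be checked the same way. The sketch below pairs the adjusted estimate with an HC0 sandwich standard error and counts how often the 95% interval contains the true effect. The choice of standard error is our assumption, since nothing above says which one the reported intervals used, and the synthetic design and function names are again illustrative.

```python
import numpy as np

def adjusted_interval(y, treat, pred, z=1.96):
    """Adjusted treatment estimate with an HC0 (sandwich) 95% interval."""
    pred_c = pred - pred.mean()
    X = np.column_stack([np.ones(len(y)), treat, pred_c, treat * pred_c])
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ (X.T @ y)
    resid = y - X @ beta
    meat = (X * (resid**2)[:, None]).T @ X        # sum of e_i^2 * x_i x_i'
    se = np.sqrt((bread @ meat @ bread)[1, 1])    # standard error of the treatment term
    return beta[1] - z * se, beta[1] + z * se

def coverage(rho, n=2000, reps=500, effect=0.5, seed=0):
    """Share of replications whose interval contains the true effect."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        pred = rng.normal(size=n)
        treat = (rng.permutation(n) < n // 2).astype(float)
        y = rho * pred + np.sqrt(1 - rho**2) * rng.normal(size=n) + effect * treat
        low, high = adjusted_interval(y, treat, pred)
        hits += int(low <= effect <= high)
    return hits / reps

for rho in (0.0, 0.9):
    print(f"rho = {rho}: coverage {coverage(rho):.3f}")
```

With a few hundred replications the coverage figure itself carries about a percentage point of simulation noise, so small departures from 0.95 in a run of this size are not evidence of a problem.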

This clean pattern, a gain at high rho, reversion to the difference in means at rho = 0, and advertised coverage throughout, is exactly what the 2013 theory predicts,2 written down more than a decade before this simulation existed. We did not record an advance document for this simulation. An advance document is the set of predictions written down before an analysis runs, a discipline parts 2 and 3 of this series lean on heavily. The clean pattern here is therefore best read as a check of the theory’s standing forecast rather than as one of our own registered predictions.

The fine print on “do no harm”

Figure: Two-panel line chart. The left panel, titled 'Good AI sharpens the estimate; useless AI does no harm,' shows confidence-interval narrowing rising with AI prediction quality, measured as correlation with the outcome: 0 percent narrowing at correlation 0, 4 percent at 0.3, 19 percent at 0.6, and 58 percent at 0.9. The right panel, titled 'Coverage stays honest throughout,' shows coverage of the 95 percent confidence interval staying between roughly 0.945 and 0.958 across all prediction-quality levels, hugging the dashed nominal 0.95 line.

The do-no-harm property protects one thing: the variance of the estimator, in large samples, when the adjustment is fitted within the design. It does not repair a broken sample. If nonresponse is informative, meaning the people who decline to answer differ in the outcome from the people who do, the adjustment inherits that bias untouched. If the frame, the list from which the sample is drawn, misses part of the population, the adjustment cannot conjure the missing people. If the measured outcome is itself off, the adjustment estimates the wrong quantity more precisely. A prediction entering as one more covariate can shrink the noise around whatever answer the sample points to; it cannot change which answer the sample points to.

The failure mode above is demonstrated on real data, the HINTS imputation, while the recovery guarantee is demonstrated in simulation, where we set the prediction quality rho by construction. Applying the corrected estimator end to end on HINTS itself remains future work for the replication code; the simulation result is evidence for the guarantee, and it is simulation evidence.

The guarantee also stays silent about gains. The 82% figure lives at rho = 0.9, and a prediction that tracks a genuinely hard human outcome that closely is rare. The summary of the simulation is asymmetric in a useful way: the downside is capped near zero, and the upside is real when, and only when, the predictions are good. That asymmetry is what makes the method usable in practice: we do not have to know rho, or audit anyone’s accuracy claims in advance, to avoid being hurt by a poor prediction.

What the correction buys, then, is the right to be agnostic. We can bolt the AI onto the estimator without first settling the argument about whether the AI is any good. The data settle it for us, run by run.

Where the equity stakes concentrate

The temptation to impute is strongest where the data are thinnest: small subgroups, hard-to-reach populations, languages the survey under-samples, the cells with the widest intervals. We flag before going further that what follows is a mechanistic expectation, untested in our implementation, which reports a single aggregate and contains no subgroup analysis. The people in those thin cells are also the ones the model has seen least in training, so they are where its guesses deserve the least trust. If the smoothing mechanism operates the way the aggregate result suggests, the naive approach would concentrate its bias on the very groups the survey is supposed to illuminate, while reporting its tightest, most misleading intervals about them. That concentration is a prediction of the mechanism, awaiting a subgroup table we have not built.

The adjusted estimator’s refusal to fake precision is most valuable in those same cells. Where the model guesses well, the subgroup interval narrows. Where the model guesses poorly, the interval stays wide, and a wide interval for an underrepresented group is information: it says the survey needs more real interviews there, a budget argument the fake-precision version would have erased. A method that tells us where our data are inadequate seems more useful, for equity purposes, than one that papers over the gaps.

The rule this leaves us with

A workable rule falls out of these results: never substitute predictions for observations; always keep a real probability sample, a sample in which every member of the population has a known chance of being selected, even a modest one; and fold the AI predictions in as a regression adjustment,2 where they can shrink the interval when they earn it and cost nothing when they do not. The coverage stays at the advertised rate either way, so the published uncertainty remains a promise we can keep.

It is a rule we can defend without first resolving how good the AI actually is.

Two questions remain open, and they are the rest of this series. Part 2 asks whether we can make the AI a better guesser in the first place by nudging its internal representations, and what classical object that nudge secretly is. Part 3 takes the whole apparatus to a head-to-head on real survey data, where the AI meets a plain logistic regression.

Notes

  1. Feller, A. "Classical Statistics in the Age of AI." Talk, Stanford Bay Area Tech Economics Seminar, June 4, 2026. The paper and code are not public; this series implements the method as described in the talk, and all numbers reported here are from our implementation, not his.
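  2. Lin, W. (2013). Agnostic notions of regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics, 7(1), 295-318. Establishes the agnostic do-no-harm property used here.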
  3. Angelopoulos, A. N., et al. (2023). Prediction-powered inference. Science.

Cite this article

Cholette, V. (2026, June 11). Prediction-powered inference corrects AI-imputed survey estimates. Too Early To Say. https://tooearlytosay.com/research/methodology/prediction-powered-inference-survey-imputation/