Prediction-powered inference for AI-assisted survey estimation

Victoria Cholette

CLASSICAL STATISTICS IN THE AGE OF AI · PART 1 OF 3

June 2026

This article explains why AI predictions should enter estimation as covariates rather than as substitutes for observations. Its local HINTS imputation example and separate regression-adjustment simulation are article-reported and not publicly reproduced.

The first-order answer to a small survey is always more and better data, and nothing in this series argues otherwise. But applied work keeps handing us samples that cannot grow: the population is hard to reach, the field window has closed, the budget is spent, or the subgroup we care most about is simply rare. When the data cannot get bigger, the practical question becomes what else might buy back statistical power. One tempting answer: an artificial intelligence (AI) model can read each person’s demographic record and guess how the people we never reached would have answered. Let it fill in the missing responses, the idea goes, and the small survey starts to behave like a bigger one, with narrower intervals and publishable estimates for subgroups that had been too noisy to report.

Whether that hope is justified is the question Avi Feller took up in a recent Stanford talk, “Classical Statistics in the Age of AI.”¹ His answer, roughly: used one way, AI guesses push the estimate off target while shrinking its reported uncertainty; used another way, through a regression adjustment older than the models, they can sharpen it without, in large samples, making it worse. His paper and code are not yet public. This article describes a separate local implementation of the method on other data. Its scripts, data provenance, and matched outputs are also not public, so the numerical claims below are article-reported rather than independently reproduced.

A guess dressed up as an observation

In the article’s local analysis frame from HINTS 7, every respondent answered the four-item Patient Health Questionnaire (PHQ-4), a brief screen for psychological distress. The frame keeps only complete responses, and HINTS itself is a survey, not a census, with plenty of households that never returned it. The article reports 14.2% as the complete-case frame’s reference value rather than a population estimate. For the method test, that frame provides a known answer against which the staged guesses can be compared. One more choice makes the test easy for the model: the local implementation selects who to “miss” at random, whereas real nonresponse is rarely random.

The article reports three staging layers. The analysis frame holds 5,975 respondents. A 45% slice trains the predictive model, a gradient-boosted classifier that guesses distress status from six demographic and health covariates. The remaining 55%, 3,287 respondents, is the evaluation universe, and its reported distress share is 14.2%. Within the evaluation universe, 500 respondents play the sample reached, and the model guesses the answer for the other roughly 2,800. These counts have not been matched to a public data-provenance record or run output.

In the article-reported local run, the naive estimate does what the tempting idea says to do. Each of those roughly 2,800 guesses is thresholded to a yes or a no, and the share of yes is read straight off the guesses, as if the guesses themselves were collected data. The article reports 4.3% against the frame reference value of 14.2%, an understatement by about a factor of three. The dressed-up guess earns the name naive imputation, and the mechanism of the miss generalizes: a predictive model asked to guess a rare outcome pulls its guesses toward the typical case, so a smooth guesser produces fewer flagged cases than the world contains.
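The shrinkage mechanism can be seen in a small synthetic sketch, which is not the article's HINTS run. It assumes each person's true chance of a yes is drawn from a Beta(1, 6) distribution (mean about 14%) and that the model is an idealized guesser that knows that chance exactly. The function name, the Beta parameters, and the 0.5 threshold are all illustrative assumptions.

```python
import random


def simulate_thresholded_share(n=20_000, alpha=1.0, beta=6.0,
                               threshold=0.5, seed=0):
    """Compare the true share of 'yes' with the share of thresholded guesses.

    Hypothetical setup: the guesser outputs each person's true risk, so any
    shortfall comes from thresholding, not from a badly fitted model.
    """
    rng = random.Random(seed)
    true_yes = 0
    guessed_yes = 0
    for _ in range(n):
        risk = rng.betavariate(alpha, beta)   # assumed true chance of a "yes"
        true_yes += rng.random() < risk       # what the person would answer
        guessed_yes += risk > threshold       # the yes/no guess after thresholding
    return true_yes / n, guessed_yes / n


true_share, naive_share = simulate_thresholded_share()
print(f"true share: {true_share:.3f}  thresholded-guess share: {naive_share:.3f}")
```

Under these assumptions even a perfectly calibrated guesser flags far fewer cases than actually occur, because few individual risks for a rare outcome exceed the threshold.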

The second failure is false precision. Real respondents disagree with each other; a model’s guesses mostly agree, because every guess comes from the same trained model, and a single model maps similar inputs to similar answers. Feed thousands of agreeing guesses into a standard-error formula and the interval shrinks as if thousands of independent people had answered. The article says the local run did not preserve that interval, so no value can be checked against a saved output.
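A rough way to see the size of the effect is the textbook standard error for a proportion from a simple random sample. That formula is an assumption here, since the article does not say which one its local run used, and the counts plugged in are the article's reported 500 reached respondents and roughly 2,800 guesses.

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\qquad
\frac{\mathrm{SE}_{\text{guesses}}}{\mathrm{SE}_{\text{sample}}}
\approx \sqrt{\frac{n_{\text{sample}}}{n_{\text{guesses}}}}
= \sqrt{\frac{500}{2800}} \approx 0.42
\quad \text{(holding } \hat{p} \text{ fixed)}
```

This is arithmetic on the reported counts, not a reconstruction of the interval the local run did not preserve. The formula treats every guess as an independent respondent, which is exactly the assumption the guesses violate.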

When the sample cannot grow, nothing in the output flags either failure; detecting them requires exactly the known answer a real survey lacks. Within this staged example, that reference value is the only reason the reported 4.3% is visible as an error rather than publishable as a finding.

The guess as a covariate, in plain terms

Feller’s alternative is older than the AI it disciplines: treat the AI’s guess as a covariate rather than as the answer, one more column in a regression, sitting alongside the real outcomes we actually collected.

Suppose we run a randomized experiment, with real outcomes for the people we measured and an AI prediction of that outcome for everyone. The adjusted estimator regresses the real outcomes on three terms: a treatment indicator, the AI prediction centered at its mean, and the product of those two, the treatment-by-prediction interaction. The coefficient on the treatment indicator is the estimate. The properties of this adjustment were established for experimental data in 2013, well before the AI models it now disciplines existed.² The estimate still rests entirely on the real observations; the prediction’s only job is to explain away noise.
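Written out, the regression looks like this. The notation is ours rather than the article's: Y_i is the observed outcome, T_i the treatment indicator, f_i the AI prediction, and f-bar the mean prediction across everyone in the experiment.

```latex
Y_i = \alpha + \tau\, T_i + \beta\,(f_i - \bar{f}) + \gamma\, T_i\,(f_i - \bar{f}) + \varepsilon_i
```

The least-squares estimate of tau is the adjusted treatment-effect estimate. Centering the prediction is what lets tau be read as an average effect rather than as the effect for someone whose prediction happens to equal zero.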

The setting just moved: the imputation failure was demonstrated on a survey mean, while the guarantee that follows is established for randomized experiments. A do-no-harm estimator exists for survey means too, under the name prediction-powered inference.³ We have not yet run that estimator on HINTS, so the rest of this post stays in the experimental setting, where we can show the guarantee operating.
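For orientation, the basic prediction-powered estimate of a mean in the paper cited above takes the average guess over the unreached cases and corrects it by the average guess error measured on the reached sample. The sketch below applies that form to synthetic data, not to HINTS. The function names, the Beta(1, 6) risk distribution, the 0.5 threshold, and the sample sizes are illustrative assumptions.

```python
import random
import statistics


def ppi_mean(guesses_unlabeled, guesses_labeled, outcomes_labeled):
    """Mean guess on unlabeled cases plus the mean guess error on labeled ones."""
    rectifier = statistics.fmean(
        [y - g for y, g in zip(outcomes_labeled, guesses_labeled)]
    )
    return statistics.fmean(guesses_unlabeled) + rectifier


def draw_people(rng, n, threshold=0.5):
    """Return (outcomes, thresholded guesses) for n synthetic people."""
    outcomes, guesses = [], []
    for _ in range(n):
        risk = rng.betavariate(1.0, 6.0)            # assumed true chance of a "yes"
        outcomes.append(1 if rng.random() < risk else 0)
        guesses.append(1 if risk > threshold else 0)
    return outcomes, guesses


rng = random.Random(0)
y_reached, g_reached = draw_people(rng, 500)        # outcomes observed
_, g_unreached = draw_people(rng, 2800)             # outcomes never seen

naive = statistics.fmean(g_unreached)
corrected = ppi_mean(g_unreached, g_reached, y_reached)
print(f"naive share: {naive:.3f}  prediction-powered share: {corrected:.3f}")
```

Because the correction term is computed only from real responses, guesses that miss almost every case leave the estimate close to the plain sample mean instead of dragging it toward the guesses.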

This design has a property the naive approach lacks, and it is the property that makes the whole idea usable: it does no harm, at least asymptotically, meaning the guarantee holds in large samples. The article reports that the property held in its local 2,000-person simulation, whose outputs are not publicly reproduced. If the AI’s predictions track the truth, the adjustment soaks up variance and the confidence interval narrows. If the predictions are useless, the adjustment contributes nothing and the estimator falls back to the plain difference in means, the same answer we would have gotten with no AI at all. The guess can help; it cannot take over.

The contrast with imputation is the whole story. Imputation substitutes the model’s guesses for the people we failed to reach. Adjustment uses the guesses only to reduce noise in an estimate that the real respondents still determine.

Article-reported simulation results

The article’s local simulation gives the guarantee a measurable form. Each run draws 2,000 people and a continuous outcome with a treatment effect set to 0.5. Half the people are assigned to treatment at random. Every person also carries an AI prediction, constructed so that its correlation with the outcome equals a chosen value rho. The article reports 2,000 repetitions per value of rho and measures the variance of the treatment-effect estimate across the runs. Rho is the AI-quality knob: at rho = 0.9 the model is an excellent guesser; at rho = 0 it produces noise that merely looks like a prediction.
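A minimal standard-library sketch of a simulation with this shape follows. It is not the article's script: the data-generating process (standard normal prediction and noise, a constant treatment effect, correlation rho with the untreated outcome), the function names, and the reduced repetition count are assumptions chosen so the example runs quickly. It relies on the fact that the treatment coefficient in the interacted regression equals the gap between the two arm-specific fitted lines evaluated at the overall mean prediction.

```python
import random
import statistics


def arm_fit(preds, outcomes):
    """Least-squares line within one arm: (mean prediction, mean outcome, slope)."""
    mean_f = statistics.fmean(preds)
    mean_y = statistics.fmean(outcomes)
    sxx = sum((f - mean_f) ** 2 for f in preds)
    sxy = sum((f - mean_f) * (y - mean_y) for f, y in zip(preds, outcomes))
    return mean_f, mean_y, sxy / sxx


def one_run(rng, n, tau, rho):
    """Simulate one experiment; return (unadjusted, adjusted) effect estimates."""
    preds = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(treated)                      # complete randomization, half treated
    noise_sd = (1.0 - rho ** 2) ** 0.5        # keeps the untreated outcome at variance 1
    outcomes = [
        rho * f + noise_sd * rng.gauss(0.0, 1.0) + tau * t
        for f, t in zip(preds, treated)
    ]
    fits = {}
    for arm in (0, 1):
        f_arm = [f for f, t in zip(preds, treated) if t == arm]
        y_arm = [y for y, t in zip(outcomes, treated) if t == arm]
        fits[arm] = arm_fit(f_arm, y_arm)
    (f0, y0, b0), (f1, y1, b1) = fits[0], fits[1]
    grand_f = statistics.fmean(preds)
    unadjusted = y1 - y0                      # plain difference in means
    # Each arm's fitted line evaluated at the overall mean prediction.
    adjusted = (y1 - b1 * (f1 - grand_f)) - (y0 - b0 * (f0 - grand_f))
    return unadjusted, adjusted


def variance_reduction(rho, n=2000, tau=0.5, reps=200, seed=0):
    """Share of the unadjusted estimator's variance removed by the adjustment."""
    rng = random.Random(seed)
    runs = [one_run(rng, n, tau, rho) for _ in range(reps)]
    var_unadjusted = statistics.variance([u for u, _ in runs])
    var_adjusted = statistics.variance([a for _, a in runs])
    return 1.0 - var_adjusted / var_unadjusted


results = {rho: variance_reduction(rho) for rho in (0.0, 0.9)}
for rho, gain in results.items():
    print(f"rho = {rho}: variance reduction {gain:.1%}")
```

Under this particular data-generating process, large-sample theory puts the reduction near rho squared, so about 81% at rho = 0.9 and about zero at rho = 0, with run-to-run noise around both.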

In the article-reported local simulation, the adjusted estimator’s variance at rho = 0.9 is 82% below the unadjusted difference in means. Two cautions matter. First, variance and interval width are on different scales: an 82% variance reduction implies an interval about 58% narrower, since width scales with the square root of variance. The article translates that result to roughly five and a half times the sample size for this estimate. Second, that payoff is conditional on the prediction being that accurate, which on hard outcomes it often is not. These exact values are not matched to a public saved output.
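Both conversions are short arithmetic, assuming interval width is proportional to the standard error and the estimator's variance is inversely proportional to sample size:

```latex
\frac{\text{width}_{\text{adj}}}{\text{width}_{\text{unadj}}}
= \sqrt{1 - 0.82} = \sqrt{0.18} \approx 0.42
\;\Rightarrow\; \text{about } 58\% \text{ narrower},
\qquad
\frac{n_{\text{equivalent}}}{n} = \frac{1}{1 - 0.82} = \frac{1}{0.18} \approx 5.6
```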

At the other end of the dial, the article reports a variance reduction of -0.2% at rho = 0. The interpretation is simulation noise around zero rather than evidence of harm. In this local example, the estimator with a useless prediction performs like the estimator with no prediction and reverts to the plain difference in means, as the theory says it should.

Across the quality levels in the local simulation, the article reports that the 95% confidence intervals covered the set treatment effect between 94.5% and 95.8% of the time for both the adjusted and unadjusted estimators. Those values sit near the nominal 95% rate, but they have not been independently reproduced from a public script and matched output.

The reported pattern, a gain at high rho, reversion to the difference in means at rho = 0, and near-nominal coverage, is what the 2013 theory predicts.² The article reports no document written in advance of this local simulation stating what it expected to find, a discipline parts 2 and 3 discuss. The values are therefore best read as an article-reported check of the theory’s standing forecast rather than as a registered prediction.

The fine print on “do no harm”

The do-no-harm property protects one thing: the variance of the estimator, in large samples, when the adjustment is fitted within the design. It does not repair a broken sample. If nonresponse is informative, meaning the people who decline to answer differ in the outcome from the people who do, the adjustment inherits that bias untouched. If the frame, the list from which the sample is drawn, misses part of the population, the adjustment cannot conjure the missing people. If the measured outcome is itself off, the adjustment estimates the wrong quantity more precisely. A prediction entering as one more covariate can shrink the noise around whatever answer the sample points to; it cannot change which answer the sample points to.

The failure mode above is illustrated with the article’s local HINTS analysis, while the recovery guarantee is illustrated in simulation, where prediction quality rho is set by construction. Applying the corrected estimator end to end on HINTS itself remains future work for this local project. No public replication package currently reproduces either the HINTS example or the simulation, so the numerical values above should be read as article-reported results. The simulation is meant to illustrate the guarantee, and it remains simulation evidence.

The guarantee also says nothing about gains. The article-reported 82% figure lives at rho = 0.9, and a prediction that tracks a genuinely hard human outcome that closely is rare. Within that unverified local simulation, the reported downside is near zero and the upside appears when the predictions are good. The theoretical asymmetry is what makes the method usable in practice: we do not have to know rho, or audit anyone’s accuracy claims in advance, to avoid being hurt by it.

What the correction buys, then, is the right to be agnostic. We can bolt the AI onto the estimator without first settling the argument about whether the AI is any good. The data settle it for us, run by run.

Where the equity stakes concentrate

The temptation to impute is strongest where the data are thinnest: small subgroups, hard-to-reach populations, languages the survey under-samples, the cells with the widest intervals. We flag before going further that what follows is a mechanistic expectation, untested in our implementation, which reports a single aggregate and contains no subgroup analysis. The people in those thin cells are also the ones the model has seen least in training, so they are where its guesses deserve the least trust. If the smoothing mechanism operates the way the aggregate result suggests, the naive approach would concentrate its bias on the very groups the survey is supposed to illuminate, while reporting its tightest, most misleading intervals about them. That concentration is a prediction of the mechanism, awaiting a subgroup table we have not built.

The adjusted estimator’s refusal to fake precision is most valuable in those same cells. Where the model guesses well, the subgroup interval narrows. Where the model guesses poorly, the interval stays wide, and a wide interval for an underrepresented group is information: it says the survey needs more real interviews there, a budget argument the fake-precision version would have erased. A method that tells us where our data are inadequate seems more useful, for equity purposes, than one that papers over the gaps.

The rule this leaves us with

The theory suggests a workable rule: never substitute predictions for observations; always keep a real probability sample, even a modest one, meaning a sample in which every member of the population has a known, nonzero chance of being selected; and fold the AI predictions in as a regression adjustment,² where they can shrink the interval when they earn it and, in large samples, cost nothing when they do not. In the article’s local simulation, the reported coverage stayed near the advertised rate. A public script and matched output would be needed to independently verify that numerical check.

It is a rule we can defend without first resolving how good the AI actually is.

Two questions remain open, and they are the rest of this series. Part 2 asks whether we can make the AI a better guesser in the first place by nudging its internal representations, and what classical object that nudge secretly is. Part 3 takes the whole apparatus to a head-to-head on real survey data, where the AI meets a plain logistic regression.

Notes

1. Feller, A. "Classical Statistics in the Age of AI." Talk, Stanford Bay Area Tech Economics Seminar, June 4, 2026. The paper and code are not public. This series describes a separate local implementation; its scripts, data provenance, and matched outputs are also not public, so its numerical claims are not independently reproduced.
2. Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Annals of Applied Statistics, 7(1), 295–318. Establishes the agnostic do-no-harm property used here.
3. Angelopoulos, A. N., et al. (2023). Prediction-powered inference. Science.

Cite this article

Cholette, V. (2026, June 11). Prediction-powered inference for AI-assisted survey estimation. Too Early To Say. https://tooearlytosay.com/research/methodology/prediction-powered-inference-survey-imputation/
