Comparing logistic regression with language-model readouts for survey prediction

Victoria Cholette

CLASSICAL STATISTICS IN THE AGE OF AI · PART 3 OF 3

June 2026

A local analysis compares three approaches to predicting psychological distress from the same seven survey covariates. The numerical results are reported by this article alone; no public frozen script or matched output reproduces them.

Converting each survey respondent into a short text persona, running the persona through a language model, and reading a distress prediction off the model’s internal activations is an idea with real appeal: the model brings everything it absorbed in pretraining to a prediction task that has only a few thousand labeled rows. The comparison that belongs on the table is the statistician’s default: the same prediction task, the same input facts, and a plain logistic regression. This final part of the three-part series reports that comparison from a local analysis. One caveat about the model under test belongs up front: the main language model is GPT-2 at 124 million parameters, small by 2026 standards, so the exercise tests the architecture-class claim at a scale that can run locally.

The series follows Avi Feller’s talk “Classical Statistics in the Age of AI” (Stanford, June 4, 2026). His paper and code are not yet public. This article describes a separate local implementation on HINTS data, not a replication of his results. It reports three items from a local advance document: a prediction that steering would leave the ranking unimproved, a prediction that a gradient-boosted tree would outpredict the language-model readout, and a stopping rule that the implementation tripped and overrode. The logistic regression entered at analysis time as a standard baseline. The advance document, frozen script, and matched output are not public, so every exact value below remains an article-reported local result.

The task

The article reports using the Health Information National Trends Survey (HINTS 7, fielded in 2024), a national survey on health information and wellbeing. The outcome is moderate-to-severe psychological distress, defined as a score of 6 or higher on the four-item Patient Health Questionnaire (PHQ-4). The local analysis is described as a case-enriched subsample of 2,342 respondents, constructed by sampling up to 1,500 respondents per class, with a distress prevalence of 0.36. In the full complete-case frame, the article reports a prevalence of about 14%: 13.9% unweighted and 15.2% under the survey’s design weights. The subsample splits 60/40 into a build set and an evaluation set, and the results below are reported on the 937 held-out evaluation respondents with 95% confidence intervals. A single fixed random seed, 20260609, is said to govern the subsample draw, split, and readout training. These design details have not been matched to a public data-provenance record or run output.
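A minimal sketch of a class-capped subsample and a 60/40 split, run on synthetic labels rather than HINTS data. The function name, the simple random (unstratified) split, and the rounding of the build-set size are assumptions for illustration; the article’s own script is not public, so this is not its implementation.

```python
import random

def case_enriched_split(labels, per_class_cap=1500, build_frac=0.6, seed=20260609):
    """Return (build_idx, eval_idx) for a class-capped subsample.

    labels: list of 0/1 outcomes, one per respondent in the full frame.
    The helper and its defaults are hypothetical, chosen to mirror the text.
    """
    rng = random.Random(seed)
    chosen = []
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        chosen.extend(members[:per_class_cap])  # keep at most the cap per class
    rng.shuffle(chosen)
    n_build = round(len(chosen) * build_frac)   # rounding choice is an assumption
    return chosen[:n_build], chosen[n_build:]

# Synthetic frame, not survey data: 10,000 rows with 1,400 cases (14%).
labels = [1] * 1400 + [0] * 8600
build_idx, eval_idx = case_enriched_split(labels)
subsample = build_idx + eval_idx
prevalence = sum(labels[i] for i in subsample) / len(subsample)
print(len(build_idx), len(eval_idx), round(prevalence, 3))
```

Capping the majority class is what lifts prevalence in the subsample above its value in the full frame, which is why the subsample results are method comparisons rather than population estimates.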

The scoring metric is AUC, the area under the receiver operating characteristic curve: a measure of how well a model sorts who has the outcome from who does not, where 0.5 is a coin flip and 1.0 is perfect.
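As an illustration of the metric, AUC can be computed directly as the share of (distressed, non-distressed) pairs that the scores order correctly. The function below is a sketch with a hypothetical name and toy inputs; it is quadratic in the number of respondents and is meant to show the definition, not to replace a library routine.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # case scored above non-case
            elif p == n:
                wins += 0.5      # tie counts as half a correct ordering
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 case/non-case pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```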

Three predictors, the same seven facts

Each model sees exactly the same information: seven demographic facts per respondent. The seven are age group, race and ethnicity, education level, household income bracket, health insurance status, self-rated general health, and primary language (English-speaking, Spanish-speaking, or bilingual). For the language model, the facts become one persona sentence, on this pattern: “A 35-49 year-old Hispanic adult, some college education, household income $35-50k, insured, self-rated health fair, Spanish-speaking.”
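The persona pattern can be sketched as a simple template. The dictionary keys below are hypothetical field names, not HINTS variable names, and the template is inferred from the single example sentence above, so other respondents’ wording in the local analysis may differ.

```python
def persona_sentence(r):
    """Render seven covariates as one persona sentence.

    r: dict with hypothetical keys; values are already human-readable strings.
    """
    return (
        f"A {r['age_group']} year-old {r['race_ethnicity']} adult, "
        f"{r['education']} education, household income {r['income']}, "
        f"{r['insurance']}, self-rated health {r['general_health']}, "
        f"{r['language']}."
    )

respondent = {
    "age_group": "35-49",
    "race_ethnicity": "Hispanic",
    "education": "some college",
    "income": "$35-50k",
    "insurance": "insured",
    "general_health": "fair",
    "language": "Spanish-speaking",
}
sentence = persona_sentence(respondent)
print(sentence)
```

Because each covariate is categorical, the template can only produce a finite set of distinct sentences, which is the root of the near-identical-wording concern taken up later in the article.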

The first predictor is the language-model pipeline. The persona sentence passes through GPT-2, the 124-million-parameter model, and we read off its activations, the vector of numbers the model computes at each layer as it reads text. We take the activations at layer 6 of the model’s 12 blocks, average them over the sentence’s tokens, and get one 768-number vector per respondent. The vectors are then centered: the all-respondent mean is subtracted from each. A readout model maps each vector to a distress probability. The readout is a small neural network with one 64-unit hidden layer, trained on the build split’s vectors to predict distress. We make it nonlinear deliberately, for a mechanical reason: adding the same fixed vector to every respondent’s activations cannot change a linear readout’s ranking at all, so a linear readout would make the steering experiment below vacuous before it began. The nonlinear readout gives steering a genuine chance to change ranks.
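The pooling, centering, and linear-invariance steps can be sketched on synthetic arrays. Nothing here loads GPT-2: the random tensor stands in for layer-6 activations, the equal token count per persona is a simplification, and the linear weights are placeholders. The point is only the mechanical claim that a fixed shift cannot reorder a linear readout’s scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for layer-6 activations: 5 respondents, 9 tokens, 768 dims.
# Real personas vary in token count, so a padding mask would be needed there.
hidden = rng.normal(size=(5, 9, 768))

pooled = hidden.mean(axis=1)               # average over tokens -> shape (5, 768)
centered = pooled - pooled.mean(axis=0)    # subtract the all-respondent mean

# A linear readout: score = w . x + b (random placeholder weights).
w = rng.normal(size=768)
b = 0.1
shift = rng.normal(size=768)               # the same fixed vector for everyone

before = centered @ w + b
after = (centered + shift) @ w + b

# Every score moves by the same constant, shift . w, so the ordering is unchanged.
same_order = np.array_equal(np.argsort(before), np.argsort(after))
print(same_order, np.allclose(after - before, shift @ w))
```

A sigmoid applied to these scores is monotone, so the predicted probabilities keep the same order too. A hidden layer breaks this guarantee, which is why the readout needs one for steering to have any chance of changing ranks.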

The second predictor is a gradient-boosted tree fit directly on the seven covariates, a standard machine-learning default. The third is a plain logistic regression on the same seven covariates.

Article-reported results

Article-reported local model comparison; not independently reproduced from a public script and matched output.
Model	AUC	95% CI
GPT-2 activation readout	0.747	[0.714, 0.780]
Gradient-boosted tree	0.744	[0.711, 0.777]
Logistic regression	0.769	[0.737, 0.800]

In the article-reported local comparison, the intervals overlap, so the paired tests carry the weight. DeLong’s test is a standard paired test for comparing two AUCs computed on the same respondents. The article reports the tree versus readout difference as -0.003 with a DeLong interval of [-0.018, +0.013] and p = 0.73. It reports logistic regression ahead of the readout by +0.022 AUC, with a DeLong interval of [+0.007, +0.036], a bootstrap interval of [+0.008, +0.036], and p = 0.003. The regression entered as an analysis-time baseline, so the comparison was not a declared advance prediction. The exact tests and intervals cannot be independently checked against a public matched output.
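A paired comparison of two AUCs on the same respondents can be sketched with a percentile bootstrap. This is not DeLong’s test, and the article does not say which bootstrap variant produced its interval, so the resampling scheme, function names, and synthetic scores below are all assumptions.

```python
import numpy as np

def auc(y, s):
    """Pairwise AUC; ties count half. y and s are NumPy arrays."""
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def paired_bootstrap_auc_diff(y, s_a, s_b, n_boot=2000, seed=0):
    """Point estimate and 95% percentile interval for AUC(a) - AUC(b)."""
    rng = np.random.default_rng(seed)
    y, s_a, s_b = np.asarray(y), np.asarray(s_a, float), np.asarray(s_b, float)
    n = len(y)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)   # resample respondents, keeping pairs intact
        yb = y[idx]
        if yb.min() == yb.max():           # skip resamples with only one class
            continue
        diffs.append(auc(yb, s_a[idx]) - auc(yb, s_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return auc(y, s_a) - auc(y, s_b), (float(lo), float(hi))

# Synthetic scores, not the article's models: s_b is a noisier copy of s_a.
rng = np.random.default_rng(42)
y = (rng.uniform(size=300) < 0.36).astype(int)
s_a = y + rng.normal(scale=1.0, size=300)
s_b = s_a + rng.normal(scale=1.0, size=300)
point, (lo, hi) = paired_bootstrap_auc_diff(y, s_a, s_b, n_boot=500)
print(round(point, 3), round(lo, 3), round(hi, 3))
```

Resampling respondents rather than models is what makes the comparison paired: both models are scored on the same resampled people, so the shared noise cancels in the difference. That is how an interval for the difference can exclude zero even when the two separate intervals overlap.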

What would the article-reported +0.022 AUC mean in practice? AUC is the probability that a randomly chosen distressed respondent gets ranked above a randomly chosen non-distressed one. Within this local analysis, the reported gain corresponds to correctly ordering about two more pairs per hundred comparisons. That interpretation explains the metric; it does not turn the unverified local value into a general performance benchmark.

The twist: steering moves the model and changes nothing useful

One natural objection: maybe the readout underuses the model, and a nudge to the internals would surface more signal. Activation steering does exactly that. We learn a distress direction in the model’s activation space: the difference in mean activations between distressed and non-distressed respondents in the build split, scaled to unit length. This is the same difference-in-means recipe part 2 dissects. Steering then adds K times the typical activation length times that unit vector to each evaluation respondent’s activations, and we re-score.
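The difference-in-means direction and the steering step can be sketched as follows. The text does not define “typical activation length”; the sketch assumes the mean L2 norm of the build-split vectors. The function names and synthetic data are likewise illustrative.

```python
import numpy as np

def steering_direction(X_build, y_build):
    """Unit vector from the non-distressed mean to the distressed mean."""
    diff = X_build[y_build == 1].mean(axis=0) - X_build[y_build == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(X_eval, direction, k, typical_norm):
    """Add k * typical_norm * direction to every evaluation vector."""
    return X_eval + k * typical_norm * direction

# Synthetic activations, not GPT-2 output.
rng = np.random.default_rng(0)
X_build = rng.normal(size=(100, 768))
y_build = np.array([1] * 36 + [0] * 64)
X_eval = rng.normal(size=(40, 768))

direction = steering_direction(X_build, y_build)
typical_norm = np.linalg.norm(X_build, axis=1).mean()   # assumed definition

steered = {k: steer(X_eval, direction, k, typical_norm) for k in (0.05, 0.1, 0.2)}
for k, X_k in steered.items():
    moved = np.linalg.norm(X_k - X_eval, axis=1).mean() / typical_norm
    print(k, round(float(moved), 3))   # shift as a fraction of the typical norm
```

Every respondent receives the identical displacement, which is the property that makes the ranking so hard to move.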

The article reports three strengths, K = 0.05, 0.1, and 0.2, which shift the activations by 5, 10, and 20 percent of the typical activation norm. The reported AUC values are 0.747, 0.747, and 0.746, against 0.747 unsteered. In the local analysis, AUC moves by at most 0.001. These values have not been reproduced from a public run.

In that local run, the article reports that the activations move by 5 to 20 percent of their own length and the predicted probabilities shift by 4 to 16 percentage points on average, with a maximum shift of 0.35. The reported shift is nearly uniform across respondents, so almost nobody changes rank. AUC is a rank-based metric, and a rank-based metric cannot detect a nearly uniform probability shift.

The article also reports calibration degradation while the ordering stays nearly fixed. At the strongest setting, the reported Brier score rises from 0.191 to 0.228, roughly a 19% increase in mean squared probability error. The reported expected calibration error grows from 0.064 to 0.184. These are local article claims, not publicly reproduced estimates.
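The mechanism can be illustrated on synthetic probabilities: an identical upward shift for everyone leaves AUC untouched while Brier score and expected calibration error get worse. The 0.15 shift, the ten equal-width bins for calibration error, and the simulated data are assumptions; the article does not state its binning, and these outputs are not its results.

```python
import numpy as np

def auc(y, p):
    """Pairwise AUC; ties count half."""
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def brier(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((p - y) ** 2))

def ece(y, p, n_bins=10):
    """Expected calibration error with equal-width bins (binning is an assumption)."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            total += m.mean() * abs(y[m].mean() - p[m].mean())
    return float(total)

rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.75, size=2000)
y = (rng.uniform(size=2000) < p).astype(int)   # outcomes drawn so p is calibrated
p_shift = p + 0.15                             # the same upward shift for everyone

for name, probs in (("original", p), ("shifted", p_shift)):
    print(name, round(float(auc(y, probs)), 3),
          round(brier(y, probs), 3), round(ece(y, probs), 3))
```

In this toy setup the share of respondents above a 0.5 cutoff also grows after the shift, even though nobody is sorted any better.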

Within the article’s local results, this is close to the worst available trade. The ranking, the one thing steering set out to improve, stays where it started while the reported probabilities become less trustworthy. If an outreach cutoff sits at “predicted risk above 50%,” a model that inflates everyone’s probability can redefine who crosses that line without sorting anyone any better. That mechanism is the transferable lesson; the size of the reported calibration change remains locally reported.

Why this result makes sense

The local advance document reportedly predicted that steering would leave the ranking unimproved. Its tabular prediction fared differently: the gradient-boosted tree was expected to outpredict the language-model pipeline on AUC, while the article reports a statistical tie, a difference of -0.003 with p = 0.73. The language model only sees the seven facts placed in the persona sentence. It cannot manufacture information absent from its input; the best it can do is re-encode those facts. A logistic regression that uses the seven facts directly faces no text-encoding bottleneck.

The pipeline takes seven clean columns, passes them through a text encoder, and hands back a re-encoding of the same information. The article-reported tree tie and regression advantage are both consistent with that information argument, read after the fact. The argument caps what the model can know at the seven input facts without saying in advance whether the tree would tie or pull ahead.

The article-reported stopping-rule override

The article reports one pre-declared item from the nonpublic local advance document. The concern was that seven categorical facts could produce persona sentences with near-identical wording. If the activations failed to distinguish one persona from another, every respondent would get nearly the same vector and the readout would have nothing respondent-specific to learn from. The reported rule was to compute the mean pairwise cosine across personas and stop if it exceeded 0.95. Cosine similarity measures the angle between two vectors, where 1 means they point in the same direction. Because the advance document is not public, the rule’s timing cannot be independently audited from this page.

The article reports that the gate tripped: the raw-activation mean pairwise cosine was 0.998 against the local 0.95 threshold. The analysis proceeded under a rationale adopted at analysis time. Raw transformer activations share a dominant common direction, so every respondent’s vector can contain the same large shared component. After subtracting that component, the article reports a mean absolute cosine of about 0.35 over 200 personas and an effective subspace of 16.4 directions for mean-pooled activations and 19.9 for last-token activations. It cites the reported 0.747 readout AUC as an operational check against a fully degenerate representation, which would score 0.5. The override is an analysis-time judgment. The centered diagnostics exist only in local working files and have not been matched to a public script, results file, or run record.
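The gate and the effect of centering can be sketched on synthetic vectors that share one large common component. The data are simulated, so the printed values will not match the article’s 0.998 or 0.35, and the effective-subspace figure is not reproduced here because the text does not define it.

```python
import numpy as np

def mean_pairwise_cosine(X, absolute=False):
    """Mean cosine similarity over all distinct pairs of rows in X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = unit @ unit.T
    upper = cos[np.triu_indices(len(X), k=1)]    # each pair once, no diagonal
    return float(np.mean(np.abs(upper) if absolute else upper))

rng = np.random.default_rng(0)
shared = 50.0 * rng.normal(size=768)             # large component common to all rows
X = shared + rng.normal(size=(200, 768))         # 200 synthetic personas

raw = mean_pairwise_cosine(X)
centered = mean_pairwise_cosine(X - X.mean(axis=0), absolute=True)

gate_tripped = raw > 0.95                        # the pre-declared threshold
print(round(raw, 4), round(centered, 4), gate_tripped)
```

When every vector carries the same dominant component, raw cosines sit near 1 regardless of how much respondent-specific variation remains. This is the rationale the analysis adopted after the fact for reading the centered diagnostics instead.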

Does the demographic signal depend on the case-enriched sample?

The case-enriched subsample raises a fair question: do the seven facts still predict distress in the population frame, where the article reports prevalence near 14% rather than 0.36? The article describes a full complete-case frame of 6,045 respondents and a weighted logistic regression evaluated with 5-fold cross-validation. It reports a weighted AUC of 0.759 with a replicate-weight jackknife interval of [0.729, 0.790], and an unweighted AUC of 0.767. Within the local analysis, a gap below 0.01 AUC suggests stability to weighting. These weighted and unweighted values are not independently reproduced from a public artifact.
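A design-weighted AUC can be sketched by weighting each (case, non-case) pair by the product of the two respondents’ weights. This is one common definition, assumed here for illustration; the article does not state its formula, and the cross-validation and replicate-weight jackknife steps are not shown.

```python
import numpy as np

def weighted_auc(y, s, w):
    """AUC where each case/non-case pair counts with weight w_case * w_noncase."""
    y = np.asarray(y)
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    pos, neg = y == 1, y == 0
    diff = s[pos][:, None] - s[neg][None, :]
    pair_w = w[pos][:, None] * w[neg][None, :]
    correct = (diff > 0) + 0.5 * (diff == 0)     # ties count half
    return float((pair_w * correct).sum() / pair_w.sum())

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
unweighted = weighted_auc(y, s, [1, 1, 1, 1])
upweighted = weighted_auc(y, s, [1, 3, 1, 1])    # the misordered non-case counts 3x
print(unweighted, upweighted)
```

With equal weights the function reduces to the ordinary AUC, so a small gap between weighted and unweighted values indicates that the ranking errors are not concentrated among heavily weighted respondents.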

Scope and limits

The article-reported numbers carry three standing limits. First, the case-enriched subsample makes them local method-comparison values rather than population estimates of how well anyone can predict distress in the United States. Second, the language model receives only seven demographic facts in a persona sentence. A pipeline fed richer text, open-ended survey responses, or clinical notes would answer a different question.

The third limit is model scale, and the article reports two local checks with larger models. The primary comparison uses GPT-2 at 124 million parameters. The local working notes reportedly treated logistic regression’s 0.769 as the standing benchmark before a Qwen2.5-7B-Instruct run. The article reports that larger model’s readout at AUC 0.722 [0.689, 0.756] on the same 937 evaluation respondents. It also reports a Qwen2.5-1.5B-Instruct readout at AUC 0.741 with a bootstrap interval of [0.705, 0.776] on a smaller local subsample of 817 evaluation respondents. These model-scale values come from local work only. The scripts and matched outputs are not public, so readers cannot independently rerun the exact package from this page.

What the local comparison suggests

Across the three parts of this series, classical statistical tools provide the organizing logic. Part 1 explains a do-no-harm regression adjustment and illustrates it with article-reported local results. Part 2 treats a steering vector as an estimator and examines seed counts and paired confidence intervals. Here in part 3, the article reports that logistic regression on seven facts exceeds a language-model readout by +0.022 AUC with p = 0.003, while the reported 7-billion-parameter readout reaches 0.722. Because no public frozen package reproduces those values, they should be treated as a worked, article-reported comparison rather than a verified benchmark. The durable method lesson is to put a simple statistical baseline on the table before adding a language model. A public script, data-provenance record, and matched output would be required before others can independently retest the exact result.

Cite this article

Cholette, V. (2026, June 11). Comparing logistic regression with language-model readouts for survey prediction. Too Early To Say. https://tooearlytosay.com/research/methodology/logistic-regression-beats-llm-survey-prediction/