A regression view of steering vectors

Victoria Cholette

CLASSICAL STATISTICS IN THE AGE OF AI · PART 2 OF 3

June 2026

A regression view of steering vectors

Some activation-steering directions can be compared with regression-gradient estimators under explicit model and distribution assumptions. This article reports a local GPT-2 experiment whose scripts, advance document, and matched outputs are not public.

An activation-steering direction can be treated as an estimated direction in activation space and compared with regression-based directions. By activations we mean the vectors of numbers the model computes at each layer as it reads text; ours are read at layer 6 of GPT-2’s 12 blocks and averaged over each sentence’s tokens, so every sentence becomes one 768-dimensional vector. A difference in class means, a logistic-probe coefficient, a Fisher discriminant, and a semiparametric average derivative are not automatically the same estimand. They are proportional only under particular functional-form and activation-distribution assumptions. Each is still a statistic with sampling error, so the usual questions about target, variance, and validation apply.

The local comparison is empirical rather than an identity. It contrasts the positive-minus-negative activation mean with the coefficient direction from an L2-regularized logistic probe, then measures their cosine similarity against a bar of 0.70 recorded in a local timestamped working document dated 2026-06-08. That document is not public, is not registered with OSF or AsPredicted, and is not part of a public replication package. The article reports an observed cosine of 0.63, short of that bar. A cosine reports directional agreement in this sample; it cannot establish that the two constructions estimate the same population object.

This is the second piece in a three-part series working through Avi Feller’s talk “Classical Statistics in the Age of AI” (Stanford Bay Area Tech Economics Seminar, June 4, 2026). His paper and code remain unpublished. This article describes a separate local implementation of the method from his abstract. Part 1 explains his “do no harm” correction for AI-assisted estimation, and Part 3 compares a language model’s internal representations with a plain logistic regression on a survey. The local scripts and matched outputs are not public, so the numerical comparisons below are not independently reproduced.

What activation steering is

Activation steering means adding a fixed vector to a language model’s activations so that its output shifts in a chosen direction: more positive, more honest, more refusing. Run a batch of positive examples through the model and record the activations; do the same for negative examples; subtract the two mean vectors; add the difference, scaled, at generation time, and the output’s measured tone changes.

To an economist, “subtract two group means” is a familiar contrast. Under additional assumptions, that contrast can be proportional to a discriminant or regression direction. Average derivatives are a different target: they average how a specified conditional mean changes with the activation coordinates. The experiment asks whether these related constructions point in similar directions and whether one steers the fixed model more strongly. It does not assume in advance that a statistically more efficient estimator of one target must be a better intervention direction.

The setup

Everything runs on a laptop. The model is GPT-2 (124 million parameters), open weights, run locally, with activations read at layer 6 and mean-pooled as above. The article reports a balanced construction corpus of roughly 400 short sentences selected or sampled from combinations of 15 everyday subjects, positive and negative adjective lists, and 5 templates. It is not the full Cartesian product, which would be much larger. Because the corpus generator and realized rows are not public, the exact sampling rule and duplicate handling cannot be audited. The judge is DistilBERT fine-tuned on SST-2. The article does not claim direct reuse of SST-2 rows in the template corpus, but that does not establish distributional independence between the judge’s training data and these sentences.

The evaluation protocol is the same for every estimator. We take 8 fixed neutral prompts (“I went to the new restaurant downtown and”, “The weather today was”, and six more). For each direction, we add plus or minus K times the unit vector to the layer-6 activations during generation, where K is 5 percent of the typical activation length, and generate 30 tokens greedily. The judge scores each completion’s positive-class probability on a 0-to-1 scale. Separation is the mean score under positive steering minus the mean under negative steering; larger separation means the intervention changed the judge score more.

For a Monte Carlo stability check, the local workflow repeats everything across 25 seeded bootstrap resamples. One seed draws, with replacement, from the same 400 construction sentences; all four directions are re-estimated on that resample, and the same 8 evaluation prompts are scored. These draws reuse one corpus and are not 25 independent experiments. The nonpublic record does not specify how the bracketed endpoints below were constructed, so this page treats them as article-reported descriptive resampling summaries, not confidence intervals for a population effect. Treating the 25 resamples as independent observations would also make a conventional p-value invalid.

Does the nudge point where the regression points?

The article reports a local cosine similarity of 0.63 with a bracketed resampling summary of [0.55, 0.71]. The point estimate is below the local advance document’s 0.70 rule for calling the directions “the same object.” The bracket does not establish whether the population parameter exceeds 0.70. An earlier 5-seed local run had recorded 0.749, but the rerun overwrote its raw output and the value remains only in a local contemporaneous note. Neither that note nor a matched output is public. The 25-seed result is an article-reported local stability summary, not an independently reproduced benchmark.

The estimator ladder

The experiment compares a ladder of related directions. Each rung produces one 768-dimensional vector. The sign rule should be explicit: choose the sign so its dot product with the positive-minus-negative mean contrast is nonnegative, using only the construction sample, then divide by its Euclidean norm. That convention prevents an arbitrary eigenvector or score sign from reversing the intervention. The values and brackets in the ladder are article-reported local point summaries and descriptive bootstrap summaries. They have not been independently reproduced from a public script and matched output.

Raw difference-in-means. Subtract the mean activation of the negative sentences from the mean activation of the positive ones. The 768 entries of that difference are the direction. Separation: 0.755 [0.705, 0.804].
Logistic-probe direction. Train an L2-regularized logistic regression on the 400 activations to predict the label. If p(h) = logistic(a + h'b), then b is the gradient of the fitted log odds, not the gradient of the fitted probability. The probability gradient is p(h)(1 - p(h))b, so its sample average is a positive scalar multiple of b when probabilities are strictly between zero and one. The unit directions coincide within this fitted linear-logit model, while their magnitudes and estimands do not. Separation: 0.818 [0.771, 0.866].
Regularized Fisher direction. Define a pooled within-class covariance matrix Sigma_w, add a pre-specified ridge for numerical stability, and solve (Sigma_w + lambda I)v = mu_pos - mu_neg. This is the regularized Fisher linear-discriminant direction. It is Bayes-motivated under a shared-covariance Gaussian class model, but it remains a descriptive discriminant direction when that model does not hold. It is not a general average-gradient estimator and should not be described as generalized least squares. The nonpublic local implementation is labeled “whitened” in the saved account; without its code, the exact covariance and ridge conventions cannot be checked. Separation: 0.863 [0.823, 0.904].
Cross-fit orthogonal average-derivative estimator. First, fit the principal-component transformation on each training fold and standardize its retained 30 scores. Second, fit a conditional-mean model m(h) = E[Y | H = h] on that fold. Third, evaluate on the held-out fold the moment grad m(h) + alpha(h)(y - m(h)), where alpha is the Riesz representer for the derivative functional. With density f and suitable boundary conditions, alpha(h) = -grad log f(h). Under a standard-normal working model, the density score is -h, so alpha(h) = h; using the density score itself would reverse the correction sign. Swap folds and average. Finally, map the component-space derivative back through the PCA loadings and undo the score scaling. The moment is Neyman-orthogonal only if the target, nuisance fits, Riesz representer, and cross-fitting scheme are correctly specified. Separation: 0.794 [0.754, 0.835].

The article-reported pattern is 0.755, then 0.818, then 0.863, then 0.794. Within the local experiment, the probe direction improves on the raw difference; the class-covariance adjustment improves on both; the orthogonal estimate is lower than the Fisher estimate. This is a steering-performance comparison among different targets in one local setup, not an efficiency ranking for a common estimand.

In the local results reported here, whitening exceeds the raw direction by +0.109, with a descriptive paired-resample summary of [+0.051, +0.167] across the same 25 seeds. Relative to the article-reported raw baseline of 0.755, the point comparison is roughly a 14 percent larger shift in the judge’s positive-class probability. This is a conditional comparison on one corpus, model, prompt set, and bootstrap procedure. The exact comparison is not backed by a public matched output.

The local advance document states two different thresholds, and the article-reported gap falls within that interval. The hypothesis text says a gradient-based direction should exceed raw by a margin larger than the across-seed standard deviation, one SD; the formal decision rule says by more than 2 seed-SDs. The article reports SDs of about 0.10 for the whitened estimator and 0.12 for raw, so the implied descriptive thresholds range from about 0.10 to about 0.25. The reported +0.109 exceeds the least stringent threshold by less than 0.01 and is below the stricter thresholds. The written decision rule is therefore not met. A paired test was adopted at analysis time, but its p-value is not used here because the 25 bootstrap resamples are not independent experimental units.

The local advance document also called for a Gaussianity check, which the article reports as failed in the same 30-component principal-component representation used by the orthogonal estimator. The reported Mardia kurtosis is 1164 against d(d+2) = 30 times 32 = 960 under multivariate normality. That result is inconsistent with the standard-normal density-score working model used for alpha(h). It also weakens a Gaussian Bayes interpretation of the Fisher direction, although Fisher’s direction can still be computed without claiming Gaussian truth. A dose-response curve was run at four strengths, but the article says its values were not preserved. A secondary check on another open model was attempted but produced no preserved output. None of these local artifacts is public.

The article reports the fully flexible orthogonal estimator at 0.794, with a descriptive paired-resample difference against the whitened estimator of -0.069 [-0.126, -0.012]. Within the local experiment, the point comparison is about 8 percent less steering effect than the covariance adjustment. The local advance document predicted that the orthogonal estimator would have the largest separation and reserved a fallback sentence, “the flexible version adds nothing here,” if it did not exceed whitening. The 25-seed summary reports the orthogonal point estimate below whitening in this setup, but it does not supply an independent-sample significance test. Because neither the advance document nor matched outputs are public, this scorecard is not independently reproducible from the page.

One plausible post-hoc explanation for the deficit is nuisance-estimation error: the orthogonal construction must estimate a conditional mean and a Riesz representer, while the Fisher direction uses means and a covariance. Orthogonality removes first-order sensitivity under its regularity and product-rate conditions; it does not cancel every finite-sample error or repair a misspecified Gaussian score. The local results do not isolate which mechanism mattered. There is no universal sample-size threshold at which 400 observations becomes “large” or “small”; that depends on complexity, signal, regularization, and nuisance error. A sample-size sweep and out-of-fold nuisance diagnostics would be needed to test this explanation. The advance document did not contain it and predicted the orthogonal estimator would do best.

A note on resamples, because the summary changed

The article reports a 5-seed first pass at the whitened-versus-raw comparison with a point gap of +0.091. At 25 seeds, it reports +0.109 with a bracketed descriptive summary of [0.051, 0.167]. The p-values formerly attached to those runs are omitted because bootstrap draws from one corpus are not independent observations. The remaining values are local article claims rather than public replication results.

The advance document fixed the seed count only as “at least 5,” and the implementation expanded to 25 seeds as a second stage. The article says no on-disk document records that decision from before the 5-seed summary was seen, so it does not claim the expansion was planned in advance. The point comparison moved from +0.091 to +0.109 as the Monte Carlo approximation used more draws. That is a stability diagnostic for the chosen resampling procedure, not evidence that a population effect became statistically distinguishable.

More seeded draws can make the Monte Carlo approximation of this bootstrap distribution less sensitive to the particular random draws. They do not add new sentences, prompts, models, or independent experiments. Whether the ladder’s ordering generalizes to other corpora and other models is untested here.

What an applied economist might take from this

The translation from steering vector to average gradient implies checkable predictions, and the article reports mixed local results. Within that experiment, the point comparison for the covariance-adjusted vector is about 14 percent larger than for the raw one. The value and its resampling summary remain unverified by a public frozen package and do not establish a population effect.

The article’s local scorecard has three lines: one point comparison in the predicted direction but short of the written margin rule, one ordering opposite the prediction, and one criterion missed. The reported orthogonal point estimate is below whitening, while the reported cosine of 0.63 is below the local 0.70 criterion. Nuisance-estimation error and a misspecified Gaussian score are post-hoc hypotheses for the orthogonal deficit, not established explanations. As a method lesson, the directions can be compared without claiming that this local experiment established a shared estimand.

These AI interventions use longstanding statistical objects, which is the point of Feller’s title. Each constructed steering direction can be analyzed as an estimator, but the raw contrast, log-odds gradient, Fisher direction, and average derivative target different objects unless assumptions connect them. Once the estimand is named, familiar questions about variance, sample size, model fit, and transport to new prompts become available.

In Part 3, we compare whether the model’s internal activations forecast real survey responses more accurately than a plain logistic regression. That comparison also requires an evidence design that matches the claim.

Cite this article

Cholette, V. (2026, June 11). A regression view of steering vectors. Too Early To Say. https://tooearlytosay.com/research/methodology/steering-vectors-regression-gradient/