What We Mistake for AI Capability

AI output quality tracks specification precision, not model capability. Wide tolerance makes the model look smart. Narrow tolerance reveals the gaps.

AI-generated slides are impressive. They cut the labor of formatting and the mental cost of distilling information into a clear sequence. That is impressive on its own terms, before we even theorize about why. But the explanation people reach for is telling: the model has seen every slide deck ever posted online, so it must have extracted the tacit knowledge of rhetoric and pedagogy. It knows Aristotelian rhetoric. It has internalized the theory of decks. The output is good because the model understands something deep.

This has a seductive logic. The model trained on good decks and bad decks, conference talks and corporate pitches. Surely it internalized something about what works. And maybe it has. But rhetoric depends on context. Aristotle himself distinguished between audiences and occasions for a reason. Knowing “the theory of decks” does not tell the model whether this deck is for methodologists who want coefficient plots or policymakers who need one takeaway per slide. It does not tell the model whether the talk runs 12 minutes or an hour, or whether the goal is to teach, to persuade, or to survive a job market. That context comes from the prompt, not the training data.

Without specification, the model does not return an optimum. It returns something like an average across everything it has seen: the kind of output that would be roughly acceptable across the widest range of contexts in its training data. When our context is stable enough (same audience, same field, same purpose, same room), that average can feel like an optimum. We never notice the specification is missing because it never changes. The problem surfaces when the context shifts and the output that used to feel right suddenly feels generic.
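A toy calculation makes the "average that feels like an optimum" point concrete. This is a minimal sketch, not a claim about how any model works internally: the audiences, the 0-to-10 "technical detail" scale, and the assumption that an unspecified request returns the plain mean are all hypothetical.

```python
from statistics import mean

# Hypothetical "ideal level of technical detail" per audience, on a 0-10 scale.
ideal_detail = {
    "general public": 2,
    "policymakers": 3,
    "undergraduates": 5,
    "practitioners": 6,
    "methodologists": 9,
}

# Assumption: with no specification, the output sits near the average
# of every context, not at the ideal for any one of them.
unspecified_output = mean(ideal_detail.values())

def miss(audience):
    """Distance between the unspecified output and what this audience needs."""
    return abs(ideal_detail[audience] - unspecified_output)

for audience in ideal_detail:
    print(f"{audience:15s} miss = {miss(audience)}")
```

In this sketch, someone who only ever presents to undergraduates sees no miss at all and never learns a specification was missing. The same unspecified output shown to methodologists is far off target.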

The Variation Tolerance Test

Our impression of how capable an AI model is at a task depends on how much variation we tolerate in the output.

When the acceptable range is wide (a slide deck that looks professional, follows field conventions, uses readable fonts), the model clears the bar easily. Any point in that wide distribution works. We call this “good output” and credit the model.

When the acceptable range is narrow, the model struggles. Take an economics diagram where indifference curves must be convex, the budget constraint must pivot correctly, and the production possibilities frontier (PPF) must bow outward with the “right” curvature. An economist described (in a recent LinkedIn post) uploading example images, writing detailed rules, and creating custom project context, and still getting unconvincing results. Their conclusion: maybe AI is not as capable as advertised.

Same model in both cases. The difference is how precisely the output must match a specific target. When we tolerate a wide range, the average is fine. When we need a narrow range, the average is wrong in ways that matter.

This means our assessments of AI capability are partly measuring the task’s tolerance for variation, not the model’s ability. A task where many outputs are acceptable will always make the model look smarter than a task where only one output is correct.
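The tolerance argument can be sketched as a small simulation. Everything here is an illustrative assumption: the "model" is just a fixed set of random draws around a generic average, and the target and tolerance values are made up. The point is only that the same outputs score very differently once the acceptable range narrows.

```python
import random

def acceptance_rate(outputs, target, tolerance):
    """Share of outputs that land within `tolerance` of `target`."""
    hits = sum(1 for x in outputs if abs(x - target) <= tolerance)
    return hits / len(outputs)

# Hypothetical stand-in for one fixed model: outputs scatter around a
# generic average (0.0) with unit spread. Nothing about it changes below.
rng = random.Random(0)
outputs = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Two tasks scored against the same outputs. Only the tolerance differs.
wide = acceptance_rate(outputs, target=0.5, tolerance=2.0)
narrow = acceptance_rate(outputs, target=0.5, tolerance=0.1)

print(f"wide tolerance:   {wide:.0%} acceptable")
print(f"narrow tolerance: {narrow:.0%} acceptable")
```

The outputs are identical in both calls. What moves the score is the width of the band we are willing to accept around the target.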

The model did not get smarter. Our standards got wider.

What We Are Actually Measuring

When someone says “the model is amazing at slides” and someone else says “the model is terrible at economics diagrams,” they are both right about the output. They are both wrong about the explanation. The model did not get smarter for slides and dumber for diagrams. The spec changed.

This matters because the mistake (crediting the model’s “deep understanding” when the task was actually underspecified) prevents us from asking the more useful questions. What do we actually know about our audience? What level of technical detail serves this room? What should each slide accomplish? What teaching purpose does this diagram serve, and what should the student see first?

These are questions most of us skip. When the production cost was high (hours fighting with TikZ or PowerPoint formatting), we spent all our time on execution and had none left for design thinking. Gen AI reduced that production cost from hours to minutes. The rational move is to invest the freed-up time upstream. What makes a presentation effective in a specific context? What makes a diagram clear for teaching? What makes any visual work for its audience?

The irony: the model is quite good at helping us think through those questions. We could use it to reflect on our audience, to articulate what makes a diagram work for teaching trade theory versus presenting a research result. We could use it to generate more precise specifications, if we use it for that instead of assuming it already knows the answer.

The Specification Is the Skill

The pattern generalizes beyond slides and diagrams. Every task where we evaluate AI output follows the same logic: wide tolerance makes the model look capable, narrow tolerance reveals the gaps, and the gap is almost always in the specification.

The skill that matters most is specifying what we want precisely enough that the output lands in the narrow range we actually need. That is harder than it sounds. It requires understanding our own standards well enough to articulate them. For most tasks (slides, diagrams, writing, code) we have never had to do that. The tacit knowledge (the kind we use without thinking about it) was in our heads, applied unconsciously during manual production, and never written down. We have called this the copy-paste ceiling: at some point, the workflow demands more than accepting model output. It demands knowing what we want and why.
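One way to see what writing a standard down looks like is to turn a tacit rule into an explicit check. The sketch below takes the "indifference curves must be convex" rule from the diagram example and states it as a test on plotted points. The function name, the sample curves, and the choice to test convexity through non-decreasing slopes are illustrative assumptions, not a description of any particular workflow.

```python
def is_downward_and_convex(points):
    """Check a curve given as (x, y) points with strictly increasing x.

    The curve passes if every segment slopes downward and the slopes
    never decrease from left to right (convex to the origin).
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]
    downward = all(s < 0 for s in slopes)
    convex = all(later >= earlier for earlier, later in zip(slopes, slopes[1:]))
    return downward and convex

# Hypothetical curves. The first follows y = 4 / x; the second bends the wrong way.
good_curve = [(x, 4 / x) for x in (1, 2, 4, 8)]
bad_curve = [(1, 4.0), (2, 3.5), (4, 2.0), (8, 0.5)]

print(is_downward_and_convex(good_curve))
print(is_downward_and_convex(bad_curve))
```

The check itself is short. The effort goes into noticing that "looks right" meant "slopes downward and flattens as x grows," which is exactly the part that was never written down.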

Michael Polanyi had a phrase for this: “we can know more than we can tell.” He called it tacit knowledge in 1966, and telling what we know is exactly what AI now forces us to do. The knowledge that lived in our hands while making slides, in our eyes while drawing diagrams, now has to be written down in a prompt. The production cost dropped. The specification cost is now the hard part. And unlike model capability, specification is something we can improve with practice. (We explored one concrete version of this in From Methodology to Code, where the time savings scaled directly with how precise the methods section was.)

Whether this makes AI more useful or less depends on whether we recognize the shift. Too early to say.

For a deeper look at how this plays out in empirical research, where specification precision determines whether agents produce reliable code or plausible-looking nonsense, see What Agents Actually Do (And What They Don’t).

Suggested Citation

Cholette, V. (2026, March 2). What we mistake for AI capability. Too Early To Say. https://tooearlytosay.com/research/methodology/what-we-mistake-for-ai-capability/