The AI Industry Has Pivoted to Evals — and Is Dodging the Real Question
In 2026, one of the hottest engineering practices in AI is building evaluation systems — evals — for models and agents.
The playbook is well established: accumulate a gold-standard dataset from real failures, train a scorer you trust, use a large model “aligned with human reviewers” as your judge, and put a CI gate in front of every release to block quality regressions. Anthropic has published guides on how to do this properly; one survey found that 32% of teams name quality as the single biggest blocker to shipping AI products. Evals have been packaged and sold as the engineering discipline that finally makes AI reliable.
The approach works. But here’s what I keep seeing: the industry is reframing an organizational problem as an engineering problem — and the organizational problem is the one evals can’t solve.
Strip away the engineering wrapper — what is an eval, really?
Take a working eval system, remove the “dataset / scorer / CI” apparatus, and you’re left with exactly two things: a written definition of what counts as good and what’s completely unacceptable, plus a mechanism that enforces it.
Building the pipeline and running the CI — that’s the easy part, and it’s the part that gets tooled fastest. The hard part is the first half of that sentence: what actually counts as good? That’s not an engineering question. It’s a judgment question. And judgment is precisely what evals want to sidestep — and can’t.
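To make that two-part reduction concrete, here is a minimal Python sketch, not anyone's production setup. The `Rule` class, the three example rules, and the `evaluate` function are all hypothetical names invented for illustration. The point it tries to show: the enforcement loop is a few lines, while every line inside `RULES` is a judgment call someone had to make.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One written standard: a name, a check, and how seriously to take a miss."""
    name: str
    check: Callable[[str], bool]  # returns True when the output passes
    blocking: bool                # True means a violation is "completely unacceptable"


# Hypothetical standards. This list is the part that encodes judgment.
RULES = [
    Rule("no_empty_answer", lambda out: bool(out.strip()), blocking=True),
    Rule("no_leaked_secret", lambda out: "API_KEY" not in out, blocking=True),
    Rule("concise", lambda out: len(out.split()) <= 50, blocking=False),
]


def evaluate(outputs: list[str], rules: list[Rule] = RULES) -> dict:
    """The enforcement mechanism: apply every rule to every output."""
    blocking, soft = [], []
    for index, out in enumerate(outputs):
        for rule in rules:
            if not rule.check(out):
                target = blocking if rule.blocking else soft
                target.append((index, rule.name))
    # A CI gate would fail the build whenever "passed" is False.
    return {"passed": not blocking, "blocking": blocking, "soft": soft}


if __name__ == "__main__":
    report = evaluate(["Hello there.", "  ", "key is API_KEY=123"])
    print(report)
```

Swapping in a different `RULES` list changes what the gate lets through without touching `evaluate` at all, which is the asymmetry the argument rests on.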
“Using an LLM as judge” just kicks the question one step further back
The fashionable move right now is to run an LLM as your judge and claim it’s “aligned with human reviewers.” It sounds scientific until you push on it: aligned with which humans? Whose taste?
The judge model doesn’t generate standards; it reproduces whatever standards you fed it. The judgment baked into your gold-standard dataset, whoever’s it is, sets the ceiling of your eval. The dataset you accumulate “from real failures” is, at bottom, a values document disguised as test data: it records what this particular team refuses to tolerate.
Which means: evals amplify the taste you already have, but they can’t give you taste. A team with poor judgment and a beautiful eval pipeline doesn’t get a good product. It gets a more efficient, more stable pipeline for producing mediocrity at scale.
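The “aligned with which humans?” question can be made concrete with a small Python sketch. Everything here is an assumption for illustration: the `agreement` helper is hypothetical, and the verdict lists are made-up labels, not measurements from any real judge or reviewer. It shows one judge scoring the same six outputs and being compared against two reviewers who disagree with each other.

```python
def agreement(judge: list[bool], reviewer: list[bool]) -> float:
    """Fraction of items on which the judge and one reviewer give the same verdict."""
    if not judge or len(judge) != len(reviewer):
        raise ValueError("verdict lists must be non-empty and the same length")
    matches = sum(j == r for j, r in zip(judge, reviewer))
    return matches / len(judge)


# Made-up pass/fail verdicts on the same six outputs (illustrative only).
judge_verdicts = [True, True, False, True, False, True]
reviewer_a = [True, True, False, True, False, False]   # hypothetical reviewer A
reviewer_b = [False, True, True, True, True, False]    # hypothetical reviewer B

if __name__ == "__main__":
    print(f"agreement with A: {agreement(judge_verdicts, reviewer_a):.2f}")
    print(f"agreement with B: {agreement(judge_verdicts, reviewer_b):.2f}")
```

In this toy data the judge matches reviewer A on five of six items and reviewer B on two of six, so whether it counts as “aligned” depends entirely on whose labels are treated as the reference. Raw agreement also ignores matches that happen by chance; a chance-corrected statistic such as Cohen's kappa would be a stricter choice.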
What the eval boom is actually revealing
The “AI can do anything” narrative originally promised to dissolve the human gatekeeper whose job was quality. The eval boom is the industry quietly re-hiring that person — just giving the role an engineering-flavored job title.
The subtext isn’t flattering: AI hasn’t eliminated the person with judgment; it’s made that person the bottleneck. The cheaper execution gets, the scarcer “defining what good means” becomes. The scramble to build evals is a belated acknowledgment of exactly that.
My read on where this goes: the winners won’t be the teams with the most sophisticated eval pipelines. They’ll be the teams with the most opinionated, most precisely articulated definition of “good.” Because the pipeline faithfully executes whatever standard you hand it — and most teams’ standards are a mess.
Evals were never a measurement problem. They’re the industry’s slow admission that someone has to decide what good is — and that, it turns out, is the one thing that doesn’t scale.