AI evals are becoming the new compute bottleneck

Remember when the big complaint was that training models cost too much? We’ve traded one bottleneck for another, and it might be worse.

A few weeks ago, the Holistic Agent Leaderboard (HAL) published their cost accounting: $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. That’s not training. That’s just checking how well the things work. A single GAIA run on a frontier model can hit $2,829 before you even think about caching. Exgentic spent $22,000 sweeping agent configurations and found a 33× cost spread on identical tasks — the scaffold choice alone drove most of it.
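For a sense of scale, here is the per-rollout average implied by the two HAL totals quoted above. The variable names are mine, and the average hides the large spread between tasks and models.

```python
# Average cost per rollout implied by the HAL totals quoted above.
hal_total_usd = 40_000
hal_rollouts = 21_730

avg_cost_per_rollout = hal_total_usd / hal_rollouts
print(f"Average cost per rollout: ${avg_cost_per_rollout:.2f}")  # $1.84
```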

This is higher than I expected, and I’ve been watching this space for years.

The problem was already brewing before agents

When Stanford’s HELM launched in 2022, the per-model accounting already showed API costs ranging from $85 (OpenAI’s code-cushman-001) to $10,926 (AI21’s J1-Jumbo). Across HELM’s 30 models and 42 scenarios, the aggregate came to roughly $100,000. At the time, people shrugged — training was the expensive part.

But Perlitz et al. noticed something worse. EleutherAI’s Pythia released 154 checkpoints for each of 16 models across 8 sizes — 2,464 checkpoints total. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training. Their 2024 paper noted that evaluation costs “may even surpass those of pretraining when evaluating checkpoints.” For small models, evaluation becomes the dominant compute line item across the whole development cycle.
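The multiplier is easy to see as arithmetic. This sketch takes the Pythia checkpoint counts from the paragraph above; the per-checkpoint cost is a made-up placeholder, since the real figure depends on the harness configuration and hardware.

```python
# Pythia checkpoint counts as quoted above.
checkpoints_per_model = 154
models = 16
total_checkpoints = checkpoints_per_model * models  # 2,464


def total_eval_cost(cost_per_checkpoint_eval: float) -> float:
    """Cost of evaluating every checkpoint once, in whatever unit you pass in."""
    return total_checkpoints * cost_per_checkpoint_eval


# Hypothetical: one harness pass costs 2 GPU-hours per checkpoint.
print(total_checkpoints, total_eval_cost(2.0))  # 2464 4928.0
```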

And when you scale inference-time compute, evaluation costs scale right along with it, because every benchmark item now burns more tokens per answer.

The good news: Perlitz et al. also found that a 100× to 200× reduction in compute preserved nearly the same ranking order on static benchmarks. Flash-HELM turned that into a coarse-to-fine procedure: run cheap, low-resolution evaluations first, then spend high-resolution evaluation only on the top candidates. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error. The Open LLM Leaderboard collapsed from 29,000 examples to 180.
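To make the coarse-to-fine idea concrete, here is a minimal sketch. It is not Flash-HELM's actual procedure, only an illustration of the general pattern under my own assumptions: the function name `coarse_to_fine`, its parameters, and the toy scorer are all hypothetical.

```python
import random
from typing import Callable, Dict, Sequence


def coarse_to_fine(
    models: Sequence[str],
    items: Sequence[int],
    score_item: Callable[[str, int], float],
    coarse_n: int,
    top_k: int,
    seed: int = 0,
) -> Dict[str, float]:
    """Score every model on a small random subset, then re-score only the
    top_k models on the full item set. Returns one score per model."""
    rng = random.Random(seed)
    coarse_items = rng.sample(list(items), coarse_n)

    def mean_score(model: str, subset: Sequence[int]) -> float:
        return sum(score_item(model, i) for i in subset) / len(subset)

    coarse = {m: mean_score(m, coarse_items) for m in models}
    finalists = sorted(models, key=coarse.get, reverse=True)[:top_k]
    # Finalists get the expensive full pass; the rest keep their coarse estimate.
    return {m: (mean_score(m, items) if m in finalists else coarse[m]) for m in models}


# Toy scorer (hypothetical): "a" solves 80% of items, "b" 60%, "c" 30%.
calls = {"n": 0}


def toy_score(model: str, item: int) -> float:
    calls["n"] += 1
    threshold = {"a": 8, "b": 6, "c": 3}[model]
    return 1.0 if item % 10 < threshold else 0.0


scores = coarse_to_fine(["a", "b", "c"], range(1000), toy_score, coarse_n=50, top_k=1)
print(scores["a"], calls["n"])  # 0.8 1150
```

In this toy setup the scorer is called 1,150 times instead of the 3,000 a full pass over all three models would need. The saving grows with the number of models you can eliminate early.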

Static benchmarks had a weakness you could exploit: model differences concentrate in a small subset of items, so ranking can survive aggressive subsampling. That trick worked beautifully until agents showed up.

Agent evals are a different beast

HAL’s $40,000 headline hides something uglier. Behind that aggregate, the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks. One run of Claude Opus 4.1 on a single benchmark can cost more than a hundred runs of Gemini 2.0 Flash on the same task — the API pricing spread is two orders of magnitude on input alone ($15 vs $0.10 per million tokens).
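The pricing gap alone is worth a quick check. This snippet uses only the two input prices quoted above and assumes, purely for illustration, that both models would consume the same number of input tokens, which agents rarely do.

```python
import math

# Input-token prices quoted above, in USD per million tokens.
opus_input_price = 15.00   # Claude Opus 4.1
flash_input_price = 0.10   # Gemini 2.0 Flash

price_ratio = opus_input_price / flash_input_price
orders_of_magnitude = math.log10(price_ratio)
print(f"{price_ratio:.0f}x on input price, about {orders_of_magnitude:.1f} orders of magnitude")
```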

Agent benchmarks rarely benchmark “the model” in isolation. They benchmark a model × scaffold × token-budget product. Small scaffold choices can multiply costs 10×. And here’s the kicker: higher spend does not reliably buy better results.

On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes “a 9× difference in cost despite just a two-percentage-point difference in accuracy.” On GAIA, a HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686: about twice the accuracy at roughly 60% of the cost.

CLEAR found across 6 SOTA agents on 300 enterprise tasks that “accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives” with comparable real-world performance.
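"Pareto-efficient" has a simple operational meaning: a configuration is on the frontier if no other configuration is at least as cheap and at least as accurate, and strictly better on one of the two. Here is a small sketch that applies that definition to the Online Mind2Web and GAIA numbers above. The `Run` type and `pareto_front` function are my own illustration, not code from HAL or CLEAR.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str
    cost_usd: float
    accuracy: float  # percent


def pareto_front(runs: List[Run]) -> List[Run]:
    """Keep runs that no other run beats on both cost and accuracy."""
    front = []
    for r in runs:
        dominated = any(
            o.cost_usd <= r.cost_usd
            and o.accuracy >= r.accuracy
            and (o.cost_usd < r.cost_usd or o.accuracy > r.accuracy)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return front


mind2web = [
    Run("Browser-Use + Claude Sonnet 4", 1577, 40.0),
    Run("SeeAct + GPT-5 Medium", 171, 42.0),
]
gaia = [
    Run("HAL Generalist + o3 Medium", 2828, 28.5),
    Run("other agent (unnamed in the text)", 1686, 57.6),
]

for benchmark, runs in [("Online Mind2Web", mind2web), ("GAIA", gaia)]:
    print(benchmark, [r.name for r in pareto_front(runs)])
```

In both pairs the more expensive run is dominated outright: it costs more and scores lower.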

This means the field is burning money on configurations that don’t meaningfully improve results. The cost problem isn’t just about scale — it’s about waste.

Scientific ML evals are even worse

If you think LLM evals are expensive, look at The Well. Evaluating one new architecture costs about 960 H100-hours. A full four-baseline sweep? 3,840 H100-hours. At current cloud rates, that sweep runs somewhere between $5,000 and $15,000, depending on your provider and whether you get spot instances.
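Here is the arithmetic behind those figures, assuming the dollar range refers to the full four-baseline sweep. The implied hourly rates are back-derived from the numbers above, not quoted from any provider.

```python
h100_hours_per_architecture = 960
baselines = 4
sweep_hours = h100_hours_per_architecture * baselines  # 3,840

# Hourly rates implied by the $5,000 to $15,000 range quoted above.
low_rate = 5_000 / sweep_hours
high_rate = 15_000 / sweep_hours
print(f"{sweep_hours} H100-hours, implied ${low_rate:.2f} to ${high_rate:.2f} per hour")
```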

And unlike LLMs, you can’t easily compress these benchmarks. The physics doesn’t compress. The simulations don’t compress. You either run the simulation or you don’t.

What this means for the field

We’re watching evaluation costs reshape who can do AI research. Academic labs that can afford a few hundred GPU-hours for training are now staring down evaluation budgets that match or exceed their training budgets. Small companies and independent researchers are priced out of rigorous evaluation entirely.

This is a structural problem. When evaluation costs are this high, you get fewer independent evaluations, more reliance on self-reported numbers, and less reproducibility. The field becomes more opaque, not less.

The compression techniques that worked for static benchmarks (subsampling, Item Response Theory, coarse-to-fine procedures) don’t carry over cleanly to agent benchmarks. Agent evals are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction. And when you try to make these evals more reliable, the repeated runs multiply the cost again.
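The repeated-runs penalty follows a familiar trade-off: cost grows linearly with the number of repeats, while the standard error of the mean score shrinks only with the square root. A minimal sketch, assuming independent runs and using hypothetical numbers for the per-run cost and the run-to-run noise:

```python
import math


def total_cost(cost_per_run_usd: float, repeats: int) -> float:
    """Repeats multiply cost linearly."""
    return cost_per_run_usd * repeats


def std_error_of_mean(run_to_run_std: float, repeats: int) -> float:
    """Standard error of the mean score over independent repeated runs."""
    return run_to_run_std / math.sqrt(repeats)


# Hypothetical: $1,000 per run, 4 accuracy points of run-to-run noise.
for k in (1, 4, 16):
    print(k, total_cost(1_000, k), std_error_of_mean(4.0, k))
```

Under these assumptions, halving the error bar takes four times the spend.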

UK-AISI recently scaled agentic steps into the millions to study inference-time compute. That’s not a one-off experiment; it’s a sign of where we’re heading. Evaluation is becoming a first-order cost driver in AI research, and nobody has figured out how to make it cheap again.

Some people are working on it. Flash-HELM’s coarse-to-fine approach could be adapted. There’s work on learned surrogate models that predict evaluation results without running the full benchmark. But nothing has cracked the agent evaluation problem yet.

For now, if you’re planning an AI research project, budget for evaluation like you budget for training. Maybe more. The bottleneck has shifted.
