VAKRA: A Reality Check for AI Agents That Actually Use Tools

I’ve been following the agent space long enough to know that most benchmarks are designed to make models look good. The VAKRA benchmark from IBM Research is not one of them.

VAKRA stands for something, but honestly, what matters is what it does: it runs agents against over 8,000 locally hosted APIs backed by real databases across 62 domains, with tasks that require 3-7 step reasoning chains. No shortcuts. No simulated environments where you can fudge the results. The agent has to actually call the API, get the data, and produce the right answer.

And the results? Models perform poorly. That’s not a knock on the models — it’s a reflection of how hard this problem actually is.

What VAKRA Actually Tests

The benchmark is split into four capabilities, each designed to stress a different aspect of agentic reasoning. Let me walk through the ones that stood out to me.

Capability 1: API Chaining with Business Intelligence APIs

This one has 2,077 test instances across 54 domains. The agent has to chain between 1 and 12 tool calls to get the final answer. The data sources come from the SLOT-BIRD and SEL-BIRD collections, which are essentially Tableau-like interfaces exposed as tools.

Here’s a concrete example from the paper. The query is: “Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?” The agent has to call get_data first to initialize the data source (which returns a lightweight preview rather than the full dataset, a smart design choice to avoid MCP protocol bloat), then chain multiple select_data_equal_to calls to filter down to the right row, and finally call get_team_name.

The answer is FC Barcelona. Simple for a human. For an agent? It has to figure out the order of operations, handle the tool signatures, and not get lost in the intermediate results.

What I find interesting here is the design decision to return a preview instead of the full dataset. The authors explicitly mention this prevents “inefficient transfer of large data over the MCP protocol.” That’s a real-world engineering concern that most benchmarks ignore. In production, you can’t just dump a 10GB dataset into a context window and call it a day.
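To make the chain concrete, here is a minimal Python sketch of that call sequence against a toy in-memory table. Only the tool names (get_data, select_data_equal_to, get_team_name) and the query values come from the paper's example. The DataSession class, the column names, the preview format, and the other rows are my own assumptions for illustration, not VAKRA's actual implementation.

```python
from typing import Any

# Toy rows, invented for illustration. Only the FC Barcelona values
# (31, 53, 32) come from the example query; column names are assumed.
_TABLE = [
    {"team_name": "Example United", "speed": 60, "dribbling": 40, "passing": 50},
    {"team_name": "FC Barcelona", "speed": 31, "dribbling": 53, "passing": 32},
    {"team_name": "Sample City", "speed": 31, "dribbling": 48, "passing": 32},
]


class DataSession:
    """Hypothetical stand-in: rows stay server-side, tools return small previews."""

    def __init__(self) -> None:
        self.rows = None  # nothing to filter until get_data is called

    def _preview(self, limit: int = 2) -> dict[str, Any]:
        # Return a row count and a short sample instead of the whole dataset.
        return {"row_count": len(self.rows), "sample": self.rows[:limit]}

    def get_data(self) -> dict[str, Any]:
        self.rows = list(_TABLE)
        return self._preview()

    def select_data_equal_to(self, column: str, value: Any) -> dict[str, Any]:
        if self.rows is None:
            raise RuntimeError("call get_data first")
        self.rows = [row for row in self.rows if row[column] == value]
        return self._preview()

    def get_team_name(self) -> list[str]:
        if self.rows is None:
            raise RuntimeError("call get_data first")
        return [row["team_name"] for row in self.rows]


# The chain an agent would need to produce for the example query.
session = DataSession()
session.get_data()
session.select_data_equal_to("speed", 31)
session.select_data_equal_to("dribbling", 53)
session.select_data_equal_to("passing", 32)
answer = session.get_team_name()
print(answer)
```

Each filter call hands back only a row count and a couple of sample rows, so the agent can see whether it is narrowing in on the answer without the full table ever crossing the wire.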

Capability 2: Tool Selection with Dashboard APIs

This one has 1,597 instances across 17 domains, and it’s where things get messy. The tool sets range from 6 to 328 tools per domain, with an average of 116. The agent has to pick the right one from a potentially huge list.

Here’s the kicker: the OpenAI API Specification restricts the tool list input to a maximum of 128 tools. So if you’re using GPT-4 or similar, you literally cannot pass all 328 tools in one go. You need a strategy — maybe retrieval-augmented tool selection, maybe hierarchical grouping, maybe something else entirely.
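One possible strategy, sketched below, is to rank tools against the query and send only the top 128. This is not something the paper prescribes. The word-overlap scoring is a deliberately crude placeholder (a real system would more likely use embeddings), and the function name, tool dictionaries, and sample query are all hypothetical.

```python
import re

MAX_TOOLS = 128  # the per-request tool cap mentioned above


def _tokens(text: str) -> set[str]:
    # Lowercase alphanumeric words; underscores in tool names act as separators.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(query: str, tools: list[dict], limit: int = MAX_TOOLS) -> list[dict]:
    """Rank tools by word overlap with the query and keep the top `limit`."""
    query_words = _tokens(query)

    def score(tool: dict) -> int:
        return len(query_words & _tokens(tool["name"] + " " + tool["description"]))

    # sorted() is stable, so ties keep their original order.
    return sorted(tools, key=score, reverse=True)[:limit]


# Hypothetical domain with 328 tools, one of which is actually relevant.
tools = [
    {"name": f"tool_{i}", "description": f"generic dashboard metric {i}"}
    for i in range(327)
]
tools.insert(300, {
    "name": "get_revenue_by_region",
    "description": "Return revenue totals grouped by sales region",
})

shortlist = shortlist_tools("What was total revenue by region last quarter?", tools)
print(len(shortlist), shortlist[0]["name"])
```

The obvious weakness is recall: if the ranking misses the right tool, the agent never sees it, so the shortlist step becomes a failure point of its own.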

VAKRA doesn’t tell you how to solve this. It just exposes the problem and measures whether you can. That’s the kind of benchmark I respect.

Failure Modes: Where Agents Fall Apart

The paper catalogs several failure modes, and none of them are surprising if you’ve worked with agents in production.

Failure to initialize properly. Agents often skip the get_data call and try to call downstream tools directly. This is like walking into a library and asking for a book without checking the catalog first. The tool doesn’t exist until the data source is initialized.

Tool hallucination. Agents invent tool names that don’t exist. This happens more often than you’d think. The model sees a pattern in the tool names and extrapolates to a function that isn’t in the API spec. This is the agent equivalent of a language model making up a citation.

Chain depth issues. The longer the chain, the more likely the agent loses track. After 5-6 steps, performance drops off a cliff. This aligns with what I’ve seen in practice: agents are good at 2-3 step tasks, but anything beyond that requires careful prompt engineering or external memory.

Tool selection overload. With 100+ tools available, agents often pick the wrong one. The model doesn’t have a good internal representation of what each tool does, especially when tool names are similar. get_data_by_name vs get_data_by_id — the agent might pick the wrong one and never recover.
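The first two failure modes can be caught before a call ever reaches the API. Here is a rough sketch of a pre-flight check. It rejects unknown tool names, suggests close matches using Python's standard difflib module, and blocks downstream calls until get_data has run. The validate_call function and its return format are my own invention, not part of VAKRA.

```python
import difflib


def validate_call(tool_name, registered, initialized, init_tool="get_data"):
    """Return (ok, message) for a proposed tool call. Hypothetical helper."""
    if tool_name not in registered:
        # Hallucinated name: offer the closest registered names, if any.
        close = difflib.get_close_matches(tool_name, registered, n=3, cutoff=0.6)
        hint = f" Did you mean: {', '.join(close)}?" if close else ""
        return False, f"Unknown tool '{tool_name}'.{hint}"
    if not initialized and tool_name != init_tool:
        # Downstream tool requested before the data source was initialized.
        return False, f"Call '{init_tool}' before '{tool_name}'."
    return True, "ok"


registered = ["get_data", "select_data_equal_to", "get_team_name"]

print(validate_call("get_team_names", registered, initialized=True))
print(validate_call("select_data_equal_to", registered, initialized=False))
print(validate_call("get_data", registered, initialized=False))
```

Feeding the message back to the model as the tool result gives it a chance to correct course instead of failing silently. Whether it actually recovers is a separate question.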

Why This Benchmark Matters

Most agent benchmarks test in isolation. “Here’s a question, here’s a set of tools, go.” VAKRA tests compositionally — chaining tools, reading documents, handling errors, and doing it all under realistic constraints like the 128-tool limit.

The difficulty is higher than I expected. I’ve seen enough agent demos to know that the gap between a demo and production is enormous. VAKRA quantifies that gap.

I also appreciate that the benchmark is executable. You can’t cheat by having a human in the loop or by pre-computing answers. The agent has to actually run the tools and produce results. That’s the only way to know if the thing works.

The Elephant in the Room

No benchmark is perfect. VAKRA focuses on enterprise-like environments, which means it might not generalize to consumer-facing use cases. The tool interfaces are also somewhat stylized — real enterprise APIs are often messier, with inconsistent naming conventions, undocumented endpoints, and authentication issues.

But as a stress test for agentic reasoning, it’s one of the best I’ve seen. The failure modes it exposes are real problems that anyone building agents at scale will encounter.

If you’re working on agents, I’d recommend taking a close look at the VAKRA dataset and leaderboard. Not because it’s fun (it’s not — your models will probably fail), but because it tells you what you need to fix.

And if you’re a vendor claiming your agent framework can handle complex enterprise workflows? VAKRA is the reality check you didn’t ask for but probably need.