Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
Summary
In a post published on April 15, 2026, IBM Research introduced VAKRA, an executable benchmark designed to evaluate AI agents' reasoning and action capabilities in enterprise-like environments. Unlike benchmarks that test isolated skills, VAKRA assesses compositional reasoning across APIs and documents and scores agents on their full execution traces. Its environment provides over 8,000 locally hosted APIs across 62 domains, along with domain-aligned document collections. Tasks require reasoning chains of 3 to 7 steps that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints. The benchmark covers four capabilities: API Chaining (2,077 instances), Tool Selection (1,597 instances), Multi-Hop Reasoning (869 instances), and Multi-Hop, Multi-Source Reasoning and Policy Adherence (644 instances). Initial model performance on VAKRA is poor, which points to significant gaps in robust, end-to-end agent reliability.
Key takeaway
For AI Architects designing and deploying agents in complex enterprise settings, VAKRA reveals that current models struggle with compositional reasoning, multi-hop interactions, and policy adherence. You should prioritize developing agents that can reliably execute multi-step workflows, manage large tool sets, and strictly follow explicit tool-use policies, as these are critical for robust real-world deployment. Consider using VAKRA to identify specific failure modes in your agent designs.
Key insights
VAKRA evaluates AI agents' compositional reasoning and tool use in complex, multi-step enterprise workflows.
Principles
- Compositional reasoning is critical for real-world agent reliability.
- Execution-centric evaluation reveals true agent capabilities.
- Policy adherence is a significant challenge for current models.
Method
VAKRA uses a waterfall-style evaluation pipeline built from three kinds of checks: it programmatically verifies policy adherence, compares predicted tool-call sequences against ground truth, and uses LLM-based judges to assess whether the final response is grounded and factually consistent.
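The staged checks described above can be sketched roughly as follows. This is an illustrative sketch, not VAKRA's actual evaluator: the names `ToolCall`, `call`, `Verdict`, and `evaluate` are hypothetical, the stage order simply follows the order in which the checks are listed, exact sequence matching is an assumed comparison rule, and the LLM-based judge is replaced by a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Set


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation in an agent's execution trace (hypothetical shape)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so equal calls compare equal


def call(name: str, **kwargs) -> ToolCall:
    """Build a ToolCall with arguments in a canonical order."""
    return ToolCall(name, tuple(sorted(kwargs.items())))


@dataclass
class Verdict:
    passed: bool
    stage: str   # the stage that decided the outcome
    reason: str


def evaluate(
    predicted: Sequence[ToolCall],
    expected: Sequence[ToolCall],
    forbidden_tools: Set[str],
    response: str,
    judge: Callable[[str, Sequence[ToolCall]], bool],
) -> Verdict:
    """Run the checks in order and stop at the first failure (waterfall)."""
    # Stage 1: programmatic policy check. Assumption: a policy is modeled
    # here as a set of tool names the agent must not call.
    for c in predicted:
        if c.name in forbidden_tools:
            return Verdict(False, "policy", f"forbidden tool used: {c.name}")

    # Stage 2: compare the predicted tool-call sequence with ground truth.
    # Assumption: exact, order-sensitive match.
    if list(predicted) != list(expected):
        return Verdict(False, "trajectory", "tool calls differ from ground truth")

    # Stage 3: judge the final response. The callable stands in for an
    # LLM-based judge of grounding and factual consistency.
    if not judge(response, predicted):
        return Verdict(False, "response", "response judged ungrounded")

    return Verdict(True, "response", "all stages passed")
```

With a stub judge such as `lambda response, calls: True`, a trace that calls a forbidden tool fails at the `policy` stage before its tool-call sequence or final response is ever examined, which is the point of ordering the checks as a waterfall.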
In practice
- Test agents against multi-step workflows with diverse APIs.
- Implement robust tool shortlisting for large API sets.
- Focus on agent adherence to explicit tool-use policies.
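As one way to approach the tool-shortlisting item above, the sketch below ranks a tool catalog by lexical overlap with the user request and keeps only the top few candidates for the agent's prompt. It is a minimal illustration under stated assumptions: `tokenize`, `shortlist_tools`, and the sample catalog are hypothetical, and Jaccard similarity over word sets is a deliberately simple stand-in for whatever retrieval method a production system would use.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(query: str, catalog: Dict[str, str], k: int = 3) -> List[str]:
    """Return up to k tool names ranked by Jaccard similarity to the query.

    catalog maps a tool name to its natural-language description
    (an assumed format). Tools with no token overlap are dropped.
    """
    query_tokens = tokenize(query)
    scored = []
    for name, description in catalog.items():
        tool_tokens = tokenize(name + " " + description)
        overlap = query_tokens & tool_tokens
        if overlap:
            score = len(overlap) / len(query_tokens | tool_tokens)
            scored.append((score, name))
    # Highest score first; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical catalog, used only to show the call shape.
catalog = {
    "get_invoice_by_id": "Fetch a single invoice by its id",
    "list_customer_orders": "List orders placed by a customer",
    "get_weather_forecast": "Return the weather forecast for a city",
}
candidates = shortlist_tools("list all orders for customer 42", catalog, k=2)
```

Here `list_customer_orders` ranks first and `get_invoice_by_id` is dropped because it shares no tokens with the request. A lexical filter like this is cheap enough to run over thousands of tools, but it misses paraphrases, so it is better treated as a first-pass filter than as the final selection step.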
Topics
- VAKRA Benchmark
- AI Agent Evaluation
- Tool Use Reasoning
- Multi-Hop Reasoning
- Policy Adherence
Best for: Research Scientist, AI Architect, AI Scientist, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.