Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

2026-03-31 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, long

Summary

IBM Research introduced VAKRA (published April 15, 2026), an executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments. Unlike benchmarks that test isolated skills, VAKRA assesses compositional reasoning across APIs and documents and scores agents on their full execution traces. Its environment provides over 8,000 locally hosted APIs across 62 domains, together with domain-aligned document collections. Tasks require reasoning chains of 3 to 7 steps that combine structured API interaction with unstructured retrieval, subject to tool-use constraints stated in natural language. The benchmark covers four capabilities: API Chaining (2,077 instances), Tool Selection (1,597 instances), Multi-Hop Reasoning (869 instances), and Multi-Hop, Multi-Source Reasoning and Policy Adherence (644 instances). Models tested so far perform poorly on VAKRA, which indicates that robust end-to-end agent reliability remains a significant challenge.

Key takeaway

For AI Architects designing and deploying agents in complex enterprise settings, VAKRA reveals that current models struggle with compositional reasoning, multi-hop interactions, and policy adherence. You should prioritize developing agents that can reliably execute multi-step workflows, manage large tool sets, and strictly follow explicit tool-use policies, as these are critical for robust real-world deployment. Consider using VAKRA to identify specific failure modes in your agent designs.

Key insights

VAKRA evaluates AI agents' compositional reasoning and tool use in complex, multi-step enterprise workflows.

Principles

Compositional reasoning is critical for real-world agent reliability.
Execution-centric evaluation reveals true agent capabilities.
Policy adherence is a significant challenge for current models.

Method

VAKRA employs a waterfall-style evaluation pipeline built from three checks: it programmatically verifies policy adherence, compares the predicted tool call sequence against ground truth, and uses LLM-based judges to assess whether the final response is grounded and factually consistent.
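
To make the waterfall idea concrete, here is a minimal sketch of a staged evaluator that stops at the first failing check. It is an illustration only, not VAKRA's actual code: the names `Verdict`, `call`, and `evaluate_waterfall`, the forbidden-tool form of the policy, the exact-match comparison of tool calls, and the stubbed `judge` callable are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Set, Tuple

# Hypothetical trace format: (tool name, sorted (argument, value) pairs).
ToolCall = Tuple[str, tuple]


@dataclass
class Verdict:
    passed: bool
    failed_stage: str  # empty string when every stage passed


def call(name: str, **args) -> ToolCall:
    """Build a hashable tool call whose argument order does not matter."""
    return (name, tuple(sorted(args.items())))


def evaluate_waterfall(
    predicted: Sequence[ToolCall],
    expected: Sequence[ToolCall],
    forbidden_tools: Set[str],
    final_answer: str,
    judge: Callable[[str, Sequence[ToolCall]], bool],
) -> Verdict:
    # Stage 1 (programmatic): the policy here is assumed to be a simple
    # list of tools the agent must not call.
    if any(name in forbidden_tools for name, _ in predicted):
        return Verdict(False, "policy")

    # Stage 2: compare the predicted trace with the ground-truth trace.
    # Exact sequence equality is an assumption of this sketch.
    if list(predicted) != list(expected):
        return Verdict(False, "tool_sequence")

    # Stage 3: defer to a judge (an LLM in practice, a plain callable here)
    # for grounding and factual consistency of the final answer.
    if not judge(final_answer, predicted):
        return Verdict(False, "response")

    return Verdict(True, "")
```

Because each stage returns as soon as it fails, the more expensive judge is only invoked for traces that already satisfy the policy and match the expected tool calls, and `failed_stage` records where a run fell out of the waterfall.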

In practice

Test agents against multi-step workflows with diverse APIs.
Implement robust tool shortlisting for large API sets.
Focus on agent adherence to explicit tool-use policies.
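
One way to read the tool-shortlisting item above is as a cheap pre-filter that narrows a large API catalog before the agent chooses a tool. The sketch below ranks tools by word overlap between the query and each tool's name and description. It is a deliberately simple stand-in (a real system would more likely use embeddings or a learned retriever), and `tokenize`, `shortlist_tools`, and the sample catalog are hypothetical names invented for this example.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(
    query: str, tool_descriptions: Dict[str, str], k: int = 5
) -> List[str]:
    """Return up to k tool names ranked by token overlap with the query."""
    query_tokens = tokenize(query)
    scored = []
    for name, description in tool_descriptions.items():
        # Underscores are not matched by the token pattern, so a name like
        # "list_orders" contributes the tokens "list" and "orders".
        tool_tokens = tokenize(name + " " + description)
        overlap = len(query_tokens & tool_tokens)
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical three-tool catalog used only for illustration.
catalog = {
    "get_weather": "Return the current weather forecast for a city",
    "list_orders": "List orders placed by a customer",
    "get_customer": "Look up a customer record by id",
}
print(shortlist_tools("Which orders did customer 7 place?", catalog, k=2))
# prints ['list_orders', 'get_customer']
```

Tools with no overlap are dropped entirely, so the shortlist can be shorter than `k`; whether that is desirable depends on how the downstream agent handles an empty candidate set.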

Topics

VAKRA Benchmark
AI Agent Evaluation
Tool Use Reasoning
Multi-Hop Reasoning
Policy Adherence

Best for: Research Scientist, AI Architect, AI Scientist, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.