$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Summary
Existing benchmarks for conversational agents often evaluate retrieval or tool use in isolation, so they miss the realistic, fully agentic setting in which an agent works over unstructured data across long-horizon interactions. To address this, researchers introduce $τ$-Knowledge, an extension of $τ$-Bench for evaluating agents that must combine external natural-language knowledge with tool outputs to produce verifiable state changes. The new $τ$-Banking domain models realistic fintech customer support: agents navigate approximately 700 interconnected knowledge documents and execute tool-mediated account updates. Even frontier models achieve only a ~25.5% pass rate, and reliability degrades over repeated trials, primarily because agents struggle to retrieve the correct documents and to reason accurately over complex internal policies. $τ$-Knowledge thus offers a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
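The note about reliability over repeated trials can be made concrete with a small sketch. The original $τ$-Bench reports a pass^k metric, the probability that all k independent trials of the same task succeed, averaged over tasks. The summary above does not say which estimator $τ$-Knowledge uses, so the function below assumes the same definition. The function name `pass_hat_k` and the sample counts are hypothetical and purely illustrative; they are not results from the benchmark.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k, assuming the tau-Bench style definition.

    successes_per_task: for each task, how many of its num_trials runs succeeded.
    For a task with c successes out of n trials, the chance that k trials
    drawn without replacement all succeed is C(c, k) / C(n, k).
    The estimate is the mean of that quantity over tasks.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= num_trials:
            raise ValueError("success count out of range")
        # comb(c, k) is 0 when c < k, so tasks with too few successes add nothing
        total += comb(c, k) / comb(num_trials, k)
    return total / len(successes_per_task)


# Hypothetical data: 4 tasks, 4 trials each (illustrative numbers only)
successes = [4, 2, 1, 0]
for k in range(1, 5):
    print(k, round(pass_hat_k(successes, 4, k), 4))
```

With k = 1 the estimate is the ordinary average pass rate. As k grows the value can only stay flat or fall, because a task counts toward pass^k only to the extent that it succeeds on every sampled trial, which is one way to read "reliability degrading over repeated trials".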
Key takeaway
A new benchmark, $τ$-Knowledge, evaluates conversational agents' ability to integrate unstructured knowledge with tool use in complex, long-horizon interactions. In the $τ$-Banking domain, frontier models achieve only a ~25.5% pass rate, struggling to retrieve the correct documents from a set of ~700 interlinked knowledge documents and to reason over complex policies. This highlights critical gaps in current agent capabilities for real-world, knowledge-intensive, human-facing deployments.
Topics
- Conversational Agents
- Agent Evaluation
- Unstructured Knowledge
- Information Retrieval
- Fintech AI
Best for: AI Scientist, Research Scientist, AI Architect, AI Researcher, AI Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.