AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

AGORA is a benchmark for evaluating large language models (LLMs) acting as agents for archive-grounded workplace document reasoning. It targets the challenge of locating sparse evidence across extensive, unstructured collections of workplace files, where agents must reconcile inconsistent terminology, units, and time conventions before they can compute an answer. Unlike existing benchmarks, AGORA jointly stresses archive-groundedness, agentic exploration, and broad cross-domain coverage. It comprises 362 questions paired with eight distinct domain collections, totaling 9,664 authentic documents and 372 million tokens, a scale designed to necessitate deliberate agentic exploration rather than exhaustive scanning. The benchmark is built using an agentic pipeline that incorporates cross-document task synthesis and leakage-preventing obfuscation. Even the strongest of the eight evaluated models achieves only 59.4% accuracy, which indicates that the task remains largely unsolved, and performance varies significantly across domains.
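To make the scale argument concrete, the following back-of-the-envelope sketch derives average sizes from the figures reported above. It assumes, purely for illustration, that tokens are spread evenly across documents and collections (the summary does not say so) and uses a hypothetical 1,000,000-token context window; the 372 million figure is itself rounded.

```python
# Figures reported in the summary.
TOTAL_TOKENS = 372_000_000
TOTAL_DOCUMENTS = 9_664
NUM_COLLECTIONS = 8

# Hypothetical context window, chosen only for illustration.
ASSUMED_CONTEXT_WINDOW = 1_000_000


def scale_estimates(total_tokens, total_documents, num_collections, context_window):
    """Return rough averages, assuming an even spread of tokens (an assumption)."""
    tokens_per_collection = total_tokens / num_collections
    return {
        "tokens_per_document": total_tokens / total_documents,
        "tokens_per_collection": tokens_per_collection,
        # How many full context windows one average collection would fill.
        "windows_per_collection": tokens_per_collection / context_window,
    }


estimates = scale_estimates(
    TOTAL_TOKENS, TOTAL_DOCUMENTS, NUM_COLLECTIONS, ASSUMED_CONTEXT_WINDOW
)
for name, value in estimates.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions an average collection holds about 46.5 million tokens, so even a very large context window would cover only a small fraction of one collection in a single pass.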

Key takeaway

For AI engineers building agentic LLM systems for enterprise document processing, AGORA highlights the current limits of archive-grounded reasoning. Even strong models are likely to struggle with sparse evidence retrieval and data reconciliation across large, messy document collections. The benchmark's results suggest prioritizing two areas: agentic exploration strategies, and robust handling of inconsistent terminology and units.
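As a minimal sketch of what reconciling terminology and units can look like, the example below maps differently worded metrics and scale words onto one canonical form before comparing them. The alias table, scale table, `Fact` class, and `normalize` function are all hypothetical names introduced for illustration; none of them come from AGORA or the paper.

```python
from dataclasses import dataclass

# Hypothetical alias table: surface terms mapped to one canonical metric name.
TERM_ALIASES = {
    "revenue": "revenue",
    "net sales": "revenue",
    "turnover": "revenue",
}

# Hypothetical scale words mapped to multipliers.
SCALE_FACTORS = {
    "thousand": 1_000,
    "million": 1_000_000,
    "billion": 1_000_000_000,
}


@dataclass(frozen=True)
class Fact:
    metric: str   # canonical metric name
    value: float  # amount expressed in base units


def normalize(term: str, amount: float, scale: str) -> Fact:
    """Convert a raw (term, amount, scale) triple into canonical form.

    Raises KeyError for unknown terms or scales instead of guessing.
    """
    metric = TERM_ALIASES[term.strip().lower()]
    factor = SCALE_FACTORS[scale.strip().lower()]
    return Fact(metric=metric, value=amount * factor)


# Two documents report the same metric with different wording and scales.
a = normalize("Net sales", 1.5, "billion")
b = normalize("Turnover", 950, "million")
difference = a.value - b.value
print(a.metric == b.metric, difference)
```

Failing loudly on unknown terms is a deliberate choice in this sketch: a silent fallback would hide exactly the kind of mismatch the benchmark is designed to expose.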

Key insights

Archive-grounded reasoning over large, messy document collections remains a significant challenge for LLM agents: the strongest of the eight evaluated models reaches only 59.4% accuracy.

Principles

Method

AGORA is built via an agentic pipeline that combines cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering.
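The summary does not describe how these stages work internally, so the following is only a toy sketch of two of them under stated assumptions: obfuscation is modeled as replacing known entity names with placeholders, and difficulty filtering as dropping questions that a closed-book baseline already answers correctly. The functions `obfuscate`, `difficulty_filter`, and `closed_book_baseline`, along with the sample data, are hypothetical and not taken from the paper.

```python
from typing import Callable


def obfuscate(text: str, entity_map: dict[str, str]) -> str:
    """Replace each known entity name with its placeholder, longest names first."""
    for name in sorted(entity_map, key=len, reverse=True):
        text = text.replace(name, entity_map[name])
    return text


def difficulty_filter(
    questions: list[dict], baseline: Callable[[str], str]
) -> list[dict]:
    """Keep only the questions the baseline answers incorrectly."""
    return [q for q in questions if baseline(q["question"]) != q["answer"]]


def closed_book_baseline(question: str) -> str:
    """Stand-in for a model answering without archive access (hypothetical)."""
    return "412" if "headcount" in question else "unknown"


# Made-up sample data for illustration only.
entity_map = {"Acme Corp": "Org-1"}
raw_questions = [
    {"question": "What was Acme Corp's Q3 headcount?", "answer": "412"},
    {"question": "What was Acme Corp's Q3 travel budget?", "answer": "18,000"},
]

obfuscated = [
    {"question": obfuscate(q["question"], entity_map), "answer": q["answer"]}
    for q in raw_questions
]
kept = difficulty_filter(obfuscated, closed_book_baseline)
print(kept)
```

In this toy run only the travel-budget question survives, because the stand-in baseline already answers the headcount question without consulting any documents.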

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.