SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

2026-06-30 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

SpreadsheetBench 2 is a workflow-level benchmark for evaluating AI agents on end-to-end business spreadsheet tasks rather than isolated operations. It comprises 321 expert-annotated tasks, averaging 11.8 worksheets and 593.5 cell modifications, derived from authentic financial reports and corporate filings. The tasks cover generation (financial modeling, template completion), debugging, and visualization. Evaluations of eight frontier large language models and several LLM-based spreadsheet products reveal large performance gaps: the best model achieved only 34.89% overall accuracy, with debugging accuracy as low as 12.00%. The analysis identifies insufficient spreadsheet inspection and incorrect target-cell selection as the primary bottlenecks, which positions SpreadsheetBench 2 as a challenging testbed for spreadsheet agents.

Key takeaway

AI engineers developing spreadsheet agents should treat current LLMs as far from reliable on real-world, multi-sheet workflows. The reported bottlenecks point to three priorities: inspecting complex spreadsheets thoroughly, selecting target cells accurately, and maintaining cross-cell consistency. Robust error diagnosis and the grounding of abstract financial knowledge in specific workbook structures are the areas most likely to improve end-to-end task accuracy.

Key insights

LLM agents struggle with end-to-end spreadsheet workflows because these tasks require reasoning over complex cross-sheet dependencies.

Principles

Real-world spreadsheet tasks require workflow-level evaluation, not isolated operations.
Debugging tasks pose the greatest challenge for current LLM agents.
Agent performance is driven by reliable per-step execution, not just more interaction steps.

Method

SpreadsheetBench 2 is constructed in three stages: collection of authentic business data, expert-annotated task creation, and independent expert validation. Agents are evaluated within a multi-turn agent scaffold.
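
To make "incorrect target-cell selection" concrete, the following is a minimal Python sketch of how an agent's output could be compared against an expert reference at the cell level. It is not the benchmark's scoring procedure, which this summary does not describe. The in-memory workbook shape and the names modified_cells and target_selection_report are hypothetical, introduced only for illustration.

```python
from typing import Dict, Set, Tuple

# Assumption: a workbook is modeled as {sheet name: {cell address: value}}.
# This is an illustrative stand-in, not the benchmark's data format.
Workbook = Dict[str, Dict[str, object]]
CellRef = Tuple[str, str]


def modified_cells(before: Workbook, after: Workbook) -> Set[CellRef]:
    """Return (sheet, address) pairs whose value differs between two states."""
    changed: Set[CellRef] = set()
    for sheet in set(before) | set(after):
        old, new = before.get(sheet, {}), after.get(sheet, {})
        for addr in set(old) | set(new):
            if old.get(addr) != new.get(addr):
                changed.add((sheet, addr))
    return changed


def target_selection_report(
    original: Workbook, reference: Workbook, produced: Workbook
) -> Dict[str, Set[CellRef]]:
    """Split an agent's edits into missed, spurious, and wrong-value cells."""
    expected = modified_cells(original, reference)  # cells the expert changed
    touched = modified_cells(original, produced)    # cells the agent changed
    wrong_value = {
        (s, a)
        for s, a in expected & touched
        if produced.get(s, {}).get(a) != reference.get(s, {}).get(a)
    }
    return {
        "missed": expected - touched,     # should have been edited, was not
        "spurious": touched - expected,   # edited, but should not have been
        "wrong_value": wrong_value,       # right cell, wrong content
    }


original = {"Income": {"B2": 100, "B3": 40, "B4": None}, "Summary": {"A1": None}}
reference = {"Income": {"B2": 100, "B3": 40, "B4": 60}, "Summary": {"A1": 60}}
produced = {"Income": {"B2": 100, "B3": 45, "B4": 60}, "Summary": {"A1": None}}

report = target_selection_report(original, reference, produced)
```

In this toy case the agent fills Income!B4 correctly, overwrites Income!B3 without cause, and never touches Summary!A1, so the report separates a spurious edit from a missed target instead of collapsing both into a single failure.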

In practice

Focus agent development on cross-sheet consistency and end-to-end correctness.
Improve error diagnosis and localization capabilities in complex spreadsheets.
Enhance grounding of abstract domain knowledge within specific workbook structures.
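
As one illustration of the first two points, here is a minimal Python sketch of an inspection pass that an agent could run before editing: it scans formulas for cross-sheet references and reports any that point at a missing sheet or an empty cell. The workbook model, the simplified reference pattern, and the names cross_sheet_refs and dangling_refs are assumptions made for this example and do not come from the paper. The pattern handles single-cell references only, not the full spreadsheet formula grammar.

```python
import re
from typing import Dict, List, Tuple

# Assumption: a workbook is modeled as {sheet name: {cell address: content}},
# where formulas are strings starting with "=".
Workbook = Dict[str, Dict[str, object]]

# Simplified pattern for references such as Income!B4 or 'Balance Sheet'!$C$10.
CROSS_REF = re.compile(
    r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_]*))!\$?([A-Z]{1,3})\$?([0-9]+)"
)


def cross_sheet_refs(formula: str) -> List[Tuple[str, str]]:
    """Extract (sheet, address) pairs for each cross-sheet reference."""
    return [
        (quoted or unquoted, col + row)
        for quoted, unquoted, col, row in CROSS_REF.findall(formula)
    ]


def dangling_refs(workbook: Workbook) -> List[Tuple[str, str, str, str]]:
    """List (sheet, cell, ref_sheet, ref_cell) where the referenced cell is absent."""
    problems = []
    for sheet, cells in workbook.items():
        for addr, content in cells.items():
            if not (isinstance(content, str) and content.startswith("=")):
                continue  # skip constants and empty cells
            for ref_sheet, ref_addr in cross_sheet_refs(content):
                if ref_addr not in workbook.get(ref_sheet, {}):
                    problems.append((sheet, addr, ref_sheet, ref_addr))
    return problems


workbook = {
    "Income": {"B2": 100, "B3": 40, "B4": "=B2-B3"},
    "Summary": {
        "A1": "=Income!B4",
        "A2": "='Balance Sheet'!C10",  # sheet does not exist
        "A3": "=Income!B9",            # cell is empty
    },
}

issues = dangling_refs(workbook)
```

A check of this kind localizes one class of error to a specific sheet and cell, which is the sort of diagnostic signal a debugging agent would need before deciding which cells to modify.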

Topics

Spreadsheet Agents
LLM Evaluation
Business Workflows
Financial Modeling
Spreadsheet Debugging
Data Visualization
Multi-sheet Spreadsheets

Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.