SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows
Summary
SpreadsheetBench 2 introduces a workflow-level benchmark for evaluating AI agents on end-to-end business spreadsheet tasks, moving beyond isolated operations. It comprises 321 expert-annotated tasks derived from authentic financial reports and corporate filings, averaging 11.8 worksheets and 593.5 cell modifications. The benchmark covers generation (financial modeling, template completion), debugging, and visualization. Evaluations of eight frontier large language models and several LLM-based spreadsheet products reveal significant performance gaps: the best model achieved only 34.89% overall accuracy, with debugging accuracy as low as 12.00%. Analysis identifies insufficient spreadsheet inspection and incorrect target-cell selection as the primary bottlenecks, which positions SpreadsheetBench 2 as a challenging testbed for spreadsheet agents.
Key takeaway
AI engineers developing spreadsheet agents should recognize that current LLMs are far from reliable on real-world, multi-sheet workflows. Prioritize improving agents' ability to inspect complex spreadsheets, select target cells accurately, and maintain cross-cell consistency. Robust error diagnosis and the grounding of abstract financial knowledge in specific workbook structures are the areas most likely to improve end-to-end task accuracy.
Key insights
LLM agents struggle with end-to-end spreadsheet workflows because these workflows demand reasoning over complex cross-sheet dependencies.
Principles
- Real-world spreadsheet tasks require workflow-level evaluation, not isolated operations.
- Debugging tasks pose the greatest challenge for current LLM agents.
- Agent performance is driven by reliable per-step execution, not just more interaction steps.
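One simple way to see why per-step reliability matters more than step count is a toy compounding model. The sketch below assumes, purely for illustration, that every step of a workflow must succeed and that steps fail independently; neither assumption comes from the paper, and `workflow_success` is a hypothetical helper name.

```python
def workflow_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps
    that each succeed with probability p_step (an illustrative assumption)."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


if __name__ == "__main__":
    # Compare a few per-step reliabilities over a 20-step workflow.
    for p in (0.90, 0.95, 0.99):
        print(f"p_step={p:.2f}  20-step success={workflow_success(p, 20):.3f}")
```

Under this model, adding interaction steps never helps unless each step is itself reliable, which is consistent with the principle stated above.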
Method
SpreadsheetBench 2 is constructed in three stages: collection of authentic business data, expert-annotated task creation, and independent expert validation. Agents are evaluated within a multi-turn agent scaffold.
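The summary does not describe the scaffold's interface, so the following is only a minimal sketch of what a multi-turn loop over a workbook might look like. The `Workbook` class, the action dictionaries, `run_agent`, and `scripted_policy` are all hypothetical; the scripted policy stands in for an LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class Workbook:
    # Hypothetical in-memory model: sheet name -> {cell address -> value or formula}
    sheets: dict = field(default_factory=dict)


def run_agent(workbook, policy, max_turns=10):
    """Run a multi-turn loop: each turn the policy picks an action from the
    history so far, and the loop applies it and records the observation."""
    history = []
    for _ in range(max_turns):
        action = policy(workbook, history)
        if action["type"] == "finish":
            break
        if action["type"] == "inspect":
            # Return a copy of the sheet contents as the observation.
            observation = dict(workbook.sheets.get(action["sheet"], {}))
        elif action["type"] == "write":
            sheet = workbook.sheets.setdefault(action["sheet"], {})
            sheet[action["cell"]] = action["value"]
            observation = "ok"
        else:
            observation = f"unknown action: {action['type']}"
        history.append((action, observation))
    return history


def scripted_policy(workbook, history):
    """Stand-in for a model: inspect first, then write, then stop."""
    script = [
        {"type": "inspect", "sheet": "Income"},
        {"type": "write", "sheet": "Summary", "cell": "B2", "value": "=Income!B10"},
        {"type": "finish"},
    ]
    return script[len(history)]


if __name__ == "__main__":
    wb = Workbook({"Income": {"B10": 1200}})
    for action, observation in run_agent(wb, scripted_policy):
        print(action["type"], "->", observation)
```

Putting the inspect action ahead of any write reflects the bottleneck noted in the summary: agents that edit without first inspecting the workbook tend to pick the wrong target cells.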
In practice
- Focus agent development on cross-sheet consistency and end-to-end correctness.
- Improve error diagnosis and localization capabilities in complex spreadsheets.
- Enhance grounding of abstract domain knowledge within specific workbook structures.
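As a rough illustration of the first two points, the sketch below scans a workbook for cross-sheet formula references and reports any that point at a missing sheet or an empty cell. It assumes a plain dictionary model of the workbook rather than a real spreadsheet library, and the regular expression is a simplification (for a range such as `Income!B2:B10` it captures only the first cell). The names `cross_sheet_refs` and `dangling_refs` are hypothetical.

```python
import re

# Simplified pattern for references like Income!B10 or 'Balance Sheet'!$C$4.
CROSS_REF = re.compile(
    r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_]*))!\$?([A-Z]{1,3})\$?(\d+)"
)


def cross_sheet_refs(formula):
    """Return (sheet, cell) pairs for each cross-sheet reference in a formula."""
    refs = []
    for quoted, bare, col, row in CROSS_REF.findall(formula):
        refs.append((quoted or bare, f"{col}{row}"))
    return refs


def dangling_refs(workbook):
    """Find formulas whose cross-sheet targets are missing from the workbook.

    workbook: {sheet name: {cell address: value or formula string}}
    Returns a list of (sheet, cell, target_sheet, target_cell) tuples.
    """
    problems = []
    for sheet, cells in workbook.items():
        for addr, value in cells.items():
            if not (isinstance(value, str) and value.startswith("=")):
                continue  # Only formulas can contain references.
            for target_sheet, target_cell in cross_sheet_refs(value):
                if target_cell not in workbook.get(target_sheet, {}):
                    problems.append((sheet, addr, target_sheet, target_cell))
    return problems


if __name__ == "__main__":
    workbook = {
        "Summary": {"B2": "=Income!B10", "B3": "='Balance Sheet'!C4"},
        "Income": {"B10": 1200},
        "Balance Sheet": {},
    }
    for problem in dangling_refs(workbook):
        print("dangling reference:", problem)
```

A check like this only localizes structural errors; diagnosing a formula that resolves correctly but encodes the wrong financial logic still requires domain knowledge grounded in the specific workbook.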
Topics
- Spreadsheet Agents
- LLM Evaluation
- Business Workflows
- Financial Modeling
- Spreadsheet Debugging
- Data Visualization
- Multi-sheet Spreadsheets
Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.