SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

SpreadsheetBench 2 is a workflow-level benchmark for evaluating AI agents on end-to-end business spreadsheet tasks rather than isolated operations. It comprises 321 expert-annotated tasks, averaging 11.8 worksheets and 593.5 cell modifications per task, derived from authentic financial reports and corporate filings. The tasks cover generation (financial modeling, template completion), debugging, and visualization. Evaluations of eight frontier large language models and several LLM-based spreadsheet products reveal significant performance gaps: the best model achieved only 34.89% overall accuracy, with debugging accuracy as low as 12.00%. The analysis identifies insufficient spreadsheet inspection and incorrect target-cell selection as the primary bottlenecks, positioning SpreadsheetBench 2 as a challenging testbed.

Key takeaway

AI engineers developing spreadsheet agents should recognize that current LLMs are far from reliable on real-world, multi-sheet workflows. Prioritize improving an agent's ability to inspect complex spreadsheets, select target cells accurately, and maintain cross-cell consistency. Robust error diagnosis and grounding abstract financial knowledge in the specific structure of each workbook are the areas most likely to improve end-to-end task accuracy.
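
As an illustration of what checking target-cell selection could look like during agent development, the sketch below compares the cells an agent actually modified against the cells a task expected it to modify. It is a minimal example under stated assumptions, not the benchmark's scoring procedure: workbooks are modeled as plain nested dictionaries, and the names changed_cells and target_cell_report are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumption: a workbook snapshot is modeled as {sheet name: {cell address: value}}.
Workbook = Dict[str, Dict[str, object]]
CellRef = Tuple[str, str]  # (sheet name, cell address)


def changed_cells(before: Workbook, after: Workbook) -> Set[CellRef]:
    """Return every (sheet, address) whose value differs between two snapshots."""
    changed: Set[CellRef] = set()
    for sheet in set(before) | set(after):
        old, new = before.get(sheet, {}), after.get(sheet, {})
        for addr in set(old) | set(new):
            # A missing cell is treated the same as an empty (None) cell.
            if old.get(addr) != new.get(addr):
                changed.add((sheet, addr))
    return changed


def target_cell_report(expected: Set[CellRef], modified: Set[CellRef]) -> Dict[str, Set[CellRef]]:
    """Split the disagreement into cells the agent missed and cells it touched unexpectedly."""
    return {"missed": expected - modified, "unexpected": modified - expected}


# Hypothetical task: fill Income!B4 and Summary!A1.
before = {"Income": {"B2": 100, "B3": 40, "B4": None}, "Summary": {"A1": None}}
after = {"Income": {"B2": 100, "B3": 40, "B4": "=B2-B3"}, "Summary": {"A2": "=Income!B4"}}
expected = {("Income", "B4"), ("Summary", "A1")}

report = target_cell_report(expected, changed_cells(before, after))
print(report)
```

In this made-up example the agent wrote the right formula into the wrong cell on the Summary sheet, so the report lists Summary!A1 as missed and Summary!A2 as unexpected.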

Key insights

LLM agents struggle with end-to-end spreadsheet workflows because of complex cross-sheet dependencies and the reasoning those dependencies require.
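
To make the notion of a cross-sheet dependency concrete, the sketch below scans formula strings for sheet-qualified references and builds a map of which sheets each sheet depends on. It is only a rough illustration: the regular expression is a simplification that ignores external-workbook references, escaped quotes in sheet names, and named ranges, and the function name cross_sheet_dependencies and the sample workbook are hypothetical.

```python
import re
from typing import Dict, Set

# Simplified pattern for references such as Income!B4 or 'Balance Sheet'!$C$10.
# Assumption: sheet names contain no escaped quotes and no external-workbook prefix.
SHEET_REF = re.compile(r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_.]*))!\$?[A-Z]{1,3}\$?\d+")


def cross_sheet_dependencies(formulas: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each sheet to the set of other sheets its formulas refer to."""
    deps: Dict[str, Set[str]] = {sheet: set() for sheet in formulas}
    for sheet, cells in formulas.items():
        for formula in cells.values():
            for quoted, bare in SHEET_REF.findall(formula):
                referenced = quoted or bare
                if referenced != sheet:  # ignore a sheet referring to itself
                    deps[sheet].add(referenced)
    return deps


# Hypothetical three-sheet workbook, formulas only.
formulas = {
    "Summary": {"B2": "=Income!B4+'Balance Sheet'!$C$10", "B3": "=B2*2"},
    "Income": {"B4": "=B2-B3"},
    "Balance Sheet": {"C10": "=SUM(C2:C9)"},
}

deps = cross_sheet_dependencies(formulas)
print(deps)
```

An agent that builds such a map before editing can see that a change on the Income sheet may propagate to Summary, which is the kind of inspection the summary identifies as often missing.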

Method

SpreadsheetBench 2 is constructed in three stages: collection of authentic business data, expert-annotated task creation, and independent expert validation. Models are evaluated with a multi-turn agent scaffold.
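
The summary does not describe the scaffold's interface, so the following is only a generic sketch of what a multi-turn agent loop over a workbook might look like. Everything here is an assumption for illustration: the action vocabulary (inspect, write, finish), the dictionary-based workbook, and the names run_episode and scripted_policy. A real scaffold would call a language model where the scripted policy stands in.

```python
from typing import Callable, Dict, List

Action = Dict[str, str]
Workbook = Dict[str, Dict[str, object]]  # assumption: {sheet: {cell address: value}}


def run_episode(policy: Callable[[List[str]], Action], workbook: Workbook,
                max_turns: int = 10) -> List[str]:
    """Alternate between policy actions and workbook observations until 'finish'."""
    history: List[str] = []
    for _ in range(max_turns):
        action = policy(history)
        kind = action["type"]
        if kind == "inspect":
            # Return the sheet's contents so the policy can ground its next step.
            cells = workbook.get(action["sheet"], {})
            observation = f"inspect {action['sheet']}: {sorted(cells.items())}"
        elif kind == "write":
            workbook.setdefault(action["sheet"], {})[action["cell"]] = action["value"]
            observation = f"wrote {action['sheet']}!{action['cell']}"
        elif kind == "finish":
            history.append("finish")
            break
        else:
            observation = f"unknown action {kind!r}"
        history.append(observation)
    return history


def scripted_policy(history: List[str]) -> Action:
    """Stand-in for a model: inspect first, then write one formula, then stop."""
    script = [
        {"type": "inspect", "sheet": "Income"},
        {"type": "write", "sheet": "Income", "cell": "B4", "value": "=B2-B3"},
        {"type": "finish"},
    ]
    return script[len(history)]


wb = {"Income": {"B2": 100, "B3": 40}}
trace = run_episode(scripted_policy, wb)
print(trace)
```

The point of the loop structure is that inspection happens as an explicit turn before any write, which gives a natural place to measure how much of the workbook an agent looked at before it chose its target cells.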

Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.