PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks
Summary
PPT-Eval is a benchmark for evaluating computer-use agents on real-world Microsoft PowerPoint tasks, a ubiquitous, multimodal activity. It launches with 120 tasks across 12 files, covering both content creation and presentation editing, with tasks categorized by difficulty. Its central contribution is an evaluation framework built on task-specific rubrics, which handle the complexity of multimodal tasks and capture partial agent progress. The rubrics award partial credit for intermediate steps, penalize unnecessary changes and poor aesthetics, and return natural language feedback. Scores from this framework reach a Kendall's τ-b correlation of 0.77 with human judgments. Initial testing shows that existing frontier agents, such as Claude-4.5-Opus, still struggle, achieving only a 45% success rate and an average partial score of 57%. The benchmark is publicly available at https://microsoft.github.io/ppteval.
Key takeaway
For AI engineers building computer-use agents, PPT-Eval shows how hard it remains to automate complex, multimodal tasks such as presentation creation. Use rubric-based evaluation in your development cycle to measure partial progress and output quality instead of relying on binary success metrics. The benchmark helps pinpoint specific agent weaknesses and guide improvements on real-world tasks.
Key insights
PPT-Eval benchmarks computer-use agents on complex PowerPoint tasks using a novel rubric-based evaluation system that awards partial credit.
Principles
- Complex multimodal tasks require nuanced evaluation.
- Partial progress metrics are crucial for agent assessment.
- Rubric-based scoring improves correlation with human judgment.
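Agreement between automatic scores and human judgments is reported as Kendall's τ-b, a rank correlation that corrects for ties. The sketch below shows how that statistic can be computed from paired scores. It is a minimal pure-Python illustration, not the benchmark's own code; the function name `kendall_tau_b` and the sample scores are assumptions made up for this example.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length score lists (handles ties)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from both factors
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # both rankings order the pair the same way
        else:
            discordant += 1  # the rankings disagree on this pair
    ranked = concordant + discordant
    denom = sqrt((ranked + ties_x) * (ranked + ties_y))
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Hypothetical data: rubric scores vs. human ratings for five agent runs.
rubric_scores = [0.2, 0.5, 0.5, 0.9, 1.0]
human_ratings = [1, 2, 3, 4, 5]
tau = kendall_tau_b(rubric_scores, human_ratings)
print(f"tau-b = {tau:.3f}")
```

The pairwise loop is quadratic in the number of runs, which is fine for a benchmark of this size. For larger datasets, `scipy.stats.kendalltau` computes the same tau-b variant by default.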
Method
PPT-Eval's evaluation framework creates task-specific rubrics for PowerPoint tasks, awarding partial credit, penalizing unnecessary changes and poor aesthetics, and providing natural language feedback to achieve nuanced scoring.
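The scoring idea can be sketched as a weighted checklist. This is an illustrative assumption about how such a rubric might be encoded, not PPT-Eval's actual implementation: the `Criterion` class, the weights, the clamping to the range 0 to 1, and the example criteria are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    weight: float    # positive = credit for a step, negative = penalty
    satisfied: bool  # for a penalty, True means the problem occurred


def score_rubric(criteria):
    """Return (score in [0, 1], feedback lines) for one task."""
    max_credit = sum(c.weight for c in criteria if c.weight > 0)
    if max_credit <= 0:
        raise ValueError("rubric needs at least one positive-weight criterion")
    # Satisfied positive criteria add credit; triggered penalties subtract it.
    earned = sum(c.weight for c in criteria if c.satisfied)
    score = min(1.0, max(0.0, earned / max_credit))
    feedback = []
    for c in criteria:
        if c.weight > 0:
            status = "done" if c.satisfied else "missing"
        else:
            status = "penalty applied" if c.satisfied else "ok"
        feedback.append(f"{status}: {c.description}")
    return score, feedback


# Hypothetical rubric for one editing task.
rubric = [
    Criterion("Title slide added", 2.0, True),
    Criterion("Chart inserted on slide 3", 2.0, True),
    Criterion("Chart uses the requested data range", 1.0, False),
    Criterion("Unrelated slides were modified", -1.0, True),
]
score, feedback = score_rubric(rubric)
print(f"score = {score:.2f}")
for line in feedback:
    print(line)
```

Here the agent earns 4 of 5 possible points and loses 1 to a penalty, so it scores 0.60 instead of failing outright. The feedback lines stand in for the natural language feedback the framework provides.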
In practice
- Test agents on 120 diverse PowerPoint tasks.
- Use rubric-based scoring for complex agent outputs.
- Benchmark agent performance against Claude-4.5-Opus.
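When comparing agents, it helps to report the binary success rate and the average partial score side by side, since the two can diverge. The sketch below assumes a task counts as a success only when its rubric score reaches 1.0; that threshold, the `summarize` function, and the sample scores are assumptions for illustration and are not taken from the benchmark.

```python
def summarize(scores, success_threshold=1.0):
    """Aggregate per-task rubric scores into two headline metrics."""
    if not scores:
        raise ValueError("no scores to summarize")
    successes = sum(1 for s in scores if s >= success_threshold)
    return {
        "success_rate": successes / len(scores),
        "mean_partial_score": sum(scores) / len(scores),
    }


# Hypothetical per-task scores for one agent (not real benchmark results).
task_scores = [1.0, 1.0, 0.6, 0.25, 0.0]
summary = summarize(task_scores)
print(f"success rate:       {summary['success_rate']:.0%}")
print(f"mean partial score: {summary['mean_partial_score']:.0%}")
```

A mean partial score that sits above the success rate indicates that the agent often completes part of a task without finishing it, which a binary metric alone would hide.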
Topics
- PPT-Eval
- Computer-Use Agents
- PowerPoint Automation
- AI Benchmarking
- Rubric-Based Evaluation
- Multimodal AI
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.