PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

PPT-Eval is a benchmark for evaluating computer-use agents on real-world Microsoft PowerPoint tasks, an activity that is both ubiquitous and multimodal. It comprises 120 tasks across 12 files, covers both content creation and presentation editing, and categorizes tasks by difficulty. Its central contribution is the evaluation framework: because multimodal output is hard to judge and agents often make only partial progress, each task is scored against a task-specific rubric. The rubrics award partial credit for intermediate steps, penalize unnecessary changes and poor aesthetics, and return natural language feedback. Scores produced this way reached a Kendall's τ-b correlation of 0.77 with human judgments. Initial testing shows that existing frontier agents, such as Claude-4.5-Opus, still struggle, achieving only a 45% success rate and an average partial score of 57%. The benchmark is publicly available at https://microsoft.github.io/ppteval.
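For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Kendall's τ-b, the tie-corrected rank correlation, written in plain Python. The function name and the sample scores are illustrative assumptions; this is not the benchmark's own analysis code.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists (tie-corrected)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)

    # Count pairs ranked in the same order (concordant) or opposite order
    # (discordant). Pairs tied in either list contribute to neither count.
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) // 2
    # Number of pairs tied within each list, summed over groups of equal values.
    ties_x = sum(t * (t - 1) // 2 for t in Counter(xs).values())
    ties_y = sum(t * (t - 1) // 2 for t in Counter(ys).values())

    denominator = sqrt((total_pairs - ties_x) * (total_pairs - ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denominator


# Hypothetical rubric scores and human ratings for four agent runs.
rubric_scores = [1, 2, 2, 3]
human_ratings = [1, 2, 3, 3]
tau = kendall_tau_b(rubric_scores, human_ratings)
```

A value of 1 means the two rankings agree on every untied pair, 0 means no association, and -1 means the rankings are reversed.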

Key takeaway

For AI engineers developing computer-use agents, PPT-Eval shows how hard it remains to automate complex, multimodal tasks such as presentation creation. Binary success metrics miss partial progress, so integrate rubric-based evaluation into your development cycle to score intermediate steps and nuanced outputs. The benchmark gives you a tool for pinpointing specific agent weaknesses and driving improvements on real-world tasks.

Key insights

PPT-Eval benchmarks computer-use agents on complex PowerPoint tasks using a novel rubric-based evaluation system that awards partial credit.

Method

PPT-Eval's evaluation framework creates task-specific rubrics for PowerPoint tasks, awarding partial credit, penalizing unnecessary changes and poor aesthetics, and providing natural language feedback to achieve nuanced scoring.
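The scoring idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PPT-Eval's actual implementation: the `Criterion` and `Penalty` classes, the weights, the deduction sizes, and the flooring at zero are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line for a task (hypothetical structure)."""
    description: str
    weight: float   # relative importance of this step
    credit: float   # fraction achieved, judged in [0, 1]


@dataclass
class Penalty:
    """A deduction, e.g. for unnecessary changes or poor aesthetics."""
    reason: str
    deduction: float  # fraction of the full score to subtract


def score_task(criteria, penalties=()):
    """Return (partial score in [0, 1], list of feedback strings)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs positive total weight")

    # Weighted partial credit, with each judged credit clamped to [0, 1].
    earned = sum(c.weight * min(max(c.credit, 0.0), 1.0) for c in criteria)
    raw = earned / total_weight

    # Subtract penalties, flooring at zero (an assumed policy).
    score = max(0.0, raw - sum(p.deduction for p in penalties))

    feedback = [f"Incomplete: {c.description}" for c in criteria if c.credit < 1.0]
    feedback += [f"Penalty: {p.reason}" for p in penalties]
    return score, feedback


# Hypothetical rubric for an editing task.
criteria = [
    Criterion("Title slide updated with new presenter name", 2.0, 1.0),
    Criterion("Chart on slide 3 recolored to match theme", 1.0, 0.5),
    Criterion("Speaker notes added to slide 5", 1.0, 0.0),
]
penalties = [Penalty("Deleted an unrelated slide", 0.125)]
score, feedback = score_task(criteria, penalties)
```

Here the agent earns 2.5 of 4 weighted points (0.625) and loses 0.125 to the penalty, for a partial score of 0.5, while the feedback list names the two incomplete steps and the unnecessary change. A binary metric would report the same run simply as a failure.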

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.