PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

PPT-Eval is a benchmark for evaluating computer-use agents on real-world Microsoft PowerPoint tasks, a ubiquitous and inherently multimodal activity. It comprises 120 tasks across 12 files, covering both content creation and presentation editing, with each task categorized by difficulty. Its central contribution is an evaluation framework built on task-specific rubrics, which handle the complexity of multimodal outputs and capture partial agent progress. The rubrics award partial credit for intermediate steps, penalize unnecessary changes and poor aesthetics, and return natural language feedback. Rubric scores reach a Kendall's τ-b correlation of 0.77 with human judgments. Initial testing shows that existing frontier agents, such as Claude-4.5-Opus, still struggle, achieving only a 45% success rate and an average partial score of 57%. The benchmark is publicly available at https://microsoft.github.io/ppteval.

Key takeaway

For AI Engineers developing computer-use agents, PPT-Eval highlights the significant challenges in automating complex, multimodal tasks like presentation creation. You should integrate robust, rubric-based evaluation methods into your development cycles to accurately assess partial progress and nuanced outputs, moving beyond binary success metrics. This benchmark provides a critical tool to identify specific agent weaknesses and drive improvements in real-world application capabilities.

Key insights

PPT-Eval benchmarks computer-use agents on complex PowerPoint tasks using a novel rubric-based evaluation system that awards partial credit.

Principles

Complex multimodal tasks require nuanced evaluation.
Partial progress metrics are crucial for agent assessment.
Rubric-based scoring improves correlation with human judgment.
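
Agreement with human judgment is reported here as Kendall's τ-b, a rank correlation that corrects for ties. The following is a minimal sketch of how that statistic could be computed for paired rubric and human scores. The function name and the sample scores are hypothetical and are not taken from the benchmark.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length score lists (handles ties)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue          # tied in both rankings: counts toward neither term
        elif dx == 0:
            ties_x += 1       # tied only in x
        elif dy == 0:
            ties_y += 1       # tied only in y
        elif dx * dy > 0:
            concordant += 1   # pair ordered the same way in both rankings
        else:
            discordant += 1   # pair ordered in opposite ways
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one ranking is constant")
    return (concordant - discordant) / denom


# Hypothetical rubric scores and human ratings for four agent runs.
rubric_scores = [1, 2, 2, 3]
human_ratings = [1, 2, 3, 3]
tau = kendall_tau_b(rubric_scores, human_ratings)
```

A value near 1 means the rubric ranks agent runs in nearly the same order as human raters do; 0.77 is the figure reported for PPT-Eval.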

Method

PPT-Eval's evaluation framework creates task-specific rubrics for PowerPoint tasks, awarding partial credit, penalizing unnecessary changes and poor aesthetics, and providing natural language feedback to achieve nuanced scoring.
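
To make the scoring idea concrete, here is a minimal sketch of rubric-based scoring with partial credit, penalties, and feedback. It is an illustration under assumptions, not PPT-Eval's implementation: the `RubricItem` structure, the point weights, and the normalization to the range 0 to 1 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str
    points: float    # positive = credit for a step, negative = penalty
    satisfied: bool  # for a penalty item, True means the penalty applies


def score_rubric(items):
    """Return (score in [0, 1], list of natural language feedback lines)."""
    max_credit = sum(i.points for i in items if i.points > 0)
    if max_credit <= 0:
        raise ValueError("rubric needs at least one positive-credit item")
    # Credit earned for completed steps, reduced by any triggered penalties.
    earned = sum(i.points for i in items if i.satisfied)
    feedback = []
    for i in items:
        if i.points > 0 and not i.satisfied:
            feedback.append(f"Missing: {i.description}")
        elif i.points < 0 and i.satisfied:
            feedback.append(f"Penalty: {i.description}")
    # Clamp so penalties cannot push the score below zero.
    score = min(1.0, max(0.0, earned / max_credit))
    return score, feedback


# Hypothetical rubric for an editing task.
rubric = [
    RubricItem("slide title updated", 2.0, True),
    RubricItem("chart inserted on slide 3", 3.0, False),
    RubricItem("unrelated slide font changed", -1.0, True),
]
score, feedback = score_rubric(rubric)
```

In this example the agent earns 2 of 5 available points and loses 1 to a penalty, so it scores 0.2 rather than a flat failure, and the feedback lists both the missing step and the unnecessary change.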

In practice

Test agents on 120 diverse PowerPoint tasks.
Utilize rubric-based scoring for complex agent outputs.
Benchmark agent performance against Claude-4.5-Opus.
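
When comparing an agent against reported results, both headline numbers can be derived from per-task rubric scores. The sketch below assumes that a task counts as a success only when it reaches the full score. That threshold is an assumption, and the sample scores are hypothetical.

```python
def summarize(scores, success_threshold=1.0):
    """Aggregate per-task scores (each in [0, 1]) into two headline metrics."""
    if not scores:
        raise ValueError("scores must not be empty")
    successes = sum(1 for s in scores if s >= success_threshold)
    return {
        "success_rate": successes / len(scores),
        "mean_partial_score": sum(scores) / len(scores),
    }


# Hypothetical per-task scores for one agent over four tasks.
report = summarize([1.0, 0.5, 0.0, 1.0])
```

Reporting both values shows how much progress an agent makes on the tasks it does not fully complete, which a binary success rate alone hides.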

Topics

PPT-Eval
Computer-Use Agents
PowerPoint Automation
AI Benchmarking
Rubric-Based Evaluation
Multimodal AI

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.