Measuring AI Ability to Complete Long Software Tasks
Summary
METR's new paper introduces the "50%-task-completion time horizon" metric to track AI progress in software engineering. The metric is the length of a software task, measured by how long humans take to complete it, that an AI model can finish with a 50% success rate. Evaluating 12 frontier AI models on 170 tasks across the HCAST, RE-Bench, and SWAA benchmarks, supported by over 800 human baselines, the researchers found that this time horizon has doubled every 7 months since 2019. GPT-2 handled 2-second tasks, while the o3 model reached 110 minutes. Extrapolating this trend, AI is projected to reach a one-month time horizon (167 working hours) between mid-2028 and mid-2031, enough to build a SaaS MVP or migrate a large codebase autonomously. These figures assume a 50% success rate, however: the time horizon at an 80% success rate is 4-6x shorter, and performance is lower in messy environments and on tasks that require deep institutional knowledge.
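The extrapolation in the summary above is doubling arithmetic, which a short sketch can make concrete. It assumes the 7-month doubling time holds unchanged and uses only the figures quoted above (110 minutes, 167 working hours, the 4-6x gap); the function names are illustrative and do not come from the METR paper.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time quoted in the summary


def months_to_reach(current_minutes: float, target_minutes: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months until the time horizon grows from current to target,
    assuming it keeps doubling at a constant rate (an assumption)."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


def horizon_at_80_percent(horizon_50_minutes: float) -> tuple[float, float]:
    """Range implied by the summary's '4-6x shorter' figure."""
    return horizon_50_minutes / 6, horizon_50_minutes / 4


current = 110.0          # minutes, 50% horizon quoted for o3
one_month = 167 * 60.0   # 167 working hours, expressed in minutes

months = months_to_reach(current, one_month)
low, high = horizon_at_80_percent(current)
print(f"{months:.1f} months (~{months / 12:.1f} years) to a one-month horizon")
print(f"80% horizon: roughly {low:.0f} to {high:.0f} minutes")
```

Under these assumptions the gap from 110 minutes to 167 hours is about six and a half doublings, or a little under four years, which is consistent with the 2029 figure cited in this summary.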
Key takeaway
Directors of AI/ML planning future engineering roadmaps should recognize that AI capable of month-long software tasks by 2029 would fundamentally alter development economics. Prepare for a shift in which AI-augmented developers become 5-10x more productive. Capturing that efficiency requires organizational restructuring to manage the added complexity. Focus on defining clear specifications and robust evaluation, as these will become the new bottlenecks.
Key insights
The length of software task that AI can complete at a 50% success rate has doubled every 7 months since 2019, potentially reaching month-long tasks by 2029.
Principles
- The required success rate strongly affects effective capability: the 80% time horizon is 4-6x shorter than the 50% one.
- AI excels in structured, well-tested environments.
- AI performance resembles that of a contractor with little context on the codebase or organization.
Method
METR's "50%-task-completion time horizon" metric evaluates a model by finding the task length, measured in human completion time, at which the model succeeds 50% of the time across diverse software benchmarks.
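One way to turn per-task results into a single time horizon is to fit a logistic curve of success probability against the logarithm of human completion time and read off where it crosses 50%. The sketch below does this in plain Python. The task records are invented purely for illustration, the gradient-ascent fit is a stand-in chosen to keep the example dependency-free, and none of the names come from METR's code.

```python
import math

# Hypothetical records: (human completion time in minutes, AI succeeded?).
# These values are made up for illustration and are not METR data.
RUNS = [(1, 1), (2, 1), (4, 1), (8, 1), (16, 0),
        (32, 1), (64, 0), (128, 0), (256, 0), (512, 0)]


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(runs, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a + b * (log2(minutes) - center))
    by gradient ascent on the mean log-likelihood."""
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [ok for _, ok in runs]
    center = sum(xs) / len(xs)  # centering the feature speeds up convergence
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(a + b * (x - center))
            grad_a += residual
            grad_b += residual * (x - center)
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)
    return a, b, center


def time_horizon(a, b, center, success_rate=0.5):
    """Task length in minutes at which the fitted curve equals success_rate."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2.0 ** (center + (logit - a) / b)


a, b, center = fit_logistic(RUNS)
h50 = time_horizon(a, b, center, 0.5)
h80 = time_horizon(a, b, center, 0.8)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Because success becomes less likely as tasks get longer, the fitted slope is negative and the 80% horizon comes out shorter than the 50% one, mirroring the gap described in the summary.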
In practice
- Prioritize well-structured, tested codebases for AI integration.
- Scope AI tasks to isolated coding challenges.
- Account for contractor-level AI performance on complex tasks.
Topics
- AI Software Engineering
- Model Evaluation
- Task Completion Metrics
- AI Productivity
- Software Development Economics
- Organizational Change
Best for: Investor, Entrepreneur, CTO, AI Scientist, Director of AI/ML, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Metadata.