Measuring AI Ability to Complete Long Software Tasks

Source: Metadata · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Emerging Technologies & Innovation) · Depth: Intermediate, medium

Summary

METR's new paper introduces the "50%-task-completion time horizon," a metric for tracking AI progress in software engineering. It is the length of a software task, measured by how long humans take to complete it, that an AI can finish with a 50% success rate. The researchers evaluated 12 frontier AI models on 170 tasks drawn from the HCAST, RE-Bench, and SWAA benchmarks, supported by over 800 human baselines, and found that this time horizon has doubled every 7 months since 2019. GPT-2 handled 2-second tasks, while the o3 model reached 110 minutes. Extrapolating the trend, AI is projected to reach a one-month time horizon (167 working hours) between mid-2028 and mid-2031, which would cover work such as autonomously building a SaaS MVP or migrating a large codebase. These figures assume a 50% success rate: the time horizon at an 80% success rate is 4-6x shorter. Performance is also lower in messy environments and on tasks that require deep institutional knowledge.
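The extrapolation in the summary can be checked with a few lines of arithmetic. The sketch below assumes a constant 7-month doubling time and starts from the 110-minute figure quoted above. The function names are illustrative and not taken from the paper, and the paper's own projection accounts for uncertainty that this simple calculation ignores.

```python
import math

def months_until_horizon(current_minutes, target_minutes, doubling_months=7.0):
    """Months for the time horizon to grow from current to target,
    assuming it doubles every `doubling_months` (a steady exponential trend)."""
    if current_minutes <= 0 or target_minutes <= 0:
        raise ValueError("horizons must be positive")
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months

def horizon_after(current_minutes, months, doubling_months=7.0):
    """Projected time horizon (minutes) after `months` of steady doubling."""
    return current_minutes * 2.0 ** (months / doubling_months)

current = 110.0          # minutes, the o3 figure quoted in the summary
target = 167 * 60.0      # one working month = 167 hours, in minutes

months = months_until_horizon(current, target)
print(f"doublings needed: {math.log2(target / current):.2f}")
print(f"months at a 7-month doubling time: {months:.1f}")
```

Under these assumptions the gap is about six and a half doublings, or a little under four years of steady progress from the 110-minute starting point, which sits inside the mid-2028 to mid-2031 window the summary cites.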

Key takeaway

For Directors of AI/ML planning engineering roadmaps: if the trend holds, AI could handle month-long software tasks as early as 2029, which would fundamentally alter development economics. Prepare for a shift in which AI-augmented developers may become 5-10x more productive. Capturing that efficiency will require organizational restructuring to manage the added complexity. Focus on writing clear specifications and building robust evaluation, because these will become the new bottlenecks.

Key insights

The length of software tasks that AI can complete has doubled every 7 months since 2019. If the trend continues, AI could reach month-long tasks at a 50% success rate as early as 2029.

Method

METR's "50%-task-completion time horizon" is the human completion time of the tasks that an AI model finishes with a 50% success rate, measured across diverse software benchmarks (HCAST, RE-Bench, and SWAA).
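One way to turn pass/fail results into such a time horizon is to fit a logistic curve of success probability against the logarithm of human completion time, then read off where the curve crosses 50%. The sketch below illustrates that idea under stated assumptions and is not METR's actual pipeline: the results table is made up for the example, and the fit is a plain Newton's-method logistic regression written with the standard library only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, successes, attempts, iterations=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method on grouped data."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in zip(xs, successes, attempts):
            p = sigmoid(a + b * x)
            r = y - n * p              # residual: observed minus expected successes
            w = n * p * (1.0 - p)      # binomial variance weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def time_horizon(a, b, success_rate=0.5):
    """Human task length (minutes) at which the fitted success rate equals `success_rate`."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2.0 ** ((logit - a) / b)

# Hypothetical results: human completion time in minutes, successes out of 10 attempts.
human_minutes = [1, 2, 4, 8, 16, 32, 64, 128]
successes = [10, 9, 8, 6, 4, 2, 1, 0]
attempts = [10] * len(human_minutes)

xs = [math.log2(t) for t in human_minutes]
a, b = fit_logistic(xs, successes, attempts)
h50 = time_horizon(a, b, 0.5)
h80 = time_horizon(a, b, 0.8)
print(f"50% time horizon: {h50:.1f} min, 80% time horizon: {h80:.1f} min")
```

The made-up data is symmetric around the midpoint between 8 and 16 minutes, so the fitted 50% horizon lands at about 11.3 minutes. The 80% horizon comes out shorter, which mirrors the summary's point that demanding higher reliability shrinks the task length a model can handle.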

Best for: Investor, Entrepreneur, CTO, AI Scientist, Director of AI/ML, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Metadata.