Measuring AI Ability to Complete Long Software Tasks

2026-03-28 · Source: Metadata · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Emerging Technologies & Innovation · Depth: Intermediate, medium

Summary

METR's new paper introduces the "50%-task-completion time horizon" metric to track AI progress in software engineering. The metric is the length of a software task, measured by how long humans take to complete it, that an AI can finish with a 50% success rate. Evaluating 12 frontier AI models on 170 tasks across the HCAST, RE-Bench, and SWAA benchmarks, supported by over 800 human baselines, researchers found that this time horizon has doubled every 7 months since 2019. GPT-2 handled 2-second tasks, while the o3 model reached 110 minutes. Extrapolating, AI is projected to reach a one-month time horizon (167 working hours) between mid-2028 and mid-2031, enabling autonomous SaaS MVP builds or large codebase migrations. However, these figures assume a 50% success rate: the time horizon at an 80% success rate is 4-6x shorter. Performance is also lower in messy environments and on tasks requiring deep institutional knowledge.
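The extrapolation above can be checked with simple arithmetic. The following is a minimal sketch, assuming steady exponential growth at the 7-month doubling time and starting from the 110-minute figure quoted in the summary. The function and constant names are hypothetical, chosen only for illustration.

```python
import math

DOUBLING_MONTHS = 7             # doubling time quoted in the summary
CURRENT_HORIZON_MIN = 110       # o3's 50% time horizon, per the summary
TARGET_HORIZON_MIN = 167 * 60   # one working month (167 hours) in minutes


def months_until(target_min, current_min=CURRENT_HORIZON_MIN,
                 doubling_months=DOUBLING_MONTHS):
    """Months of steady exponential growth needed to reach target_min."""
    doublings = math.log2(target_min / current_min)
    return doublings * doubling_months


months = months_until(TARGET_HORIZON_MIN)
print(f"Doublings needed: {months / DOUBLING_MONTHS:.2f}")
print(f"Months needed:    {months:.1f}")

# The summary says the 80% time horizon is 4-6x shorter than the 50% one.
horizon_80_low = CURRENT_HORIZON_MIN / 6
horizon_80_high = CURRENT_HORIZON_MIN / 4
print(f"80% horizon range: {horizon_80_low:.1f} to {horizon_80_high:.1f} minutes")
```

Under these assumptions the gap is about six and a half doublings, or a little under four years from the point at which the 110-minute figure was measured. This is a single point estimate. The wider mid-2028 to mid-2031 range in the summary presumably reflects uncertainty that this calculation ignores.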

Key takeaway

Directors of AI/ML planning future engineering roadmaps should recognize that AI's projected capacity for month-long software tasks, potentially by 2029, will fundamentally alter development economics. Prepare for a shift in which AI-augmented developers become 5-10x more productive. This requires organizational restructuring to manage the increased complexity and capture the efficiency gains. Focus on defining clear specifications and robust evaluation, as these will become the new bottlenecks.

Key insights

The length of software task that AI can complete with 50% success has doubled every 7 months since 2019, potentially reaching month-long tasks by 2029.

Principles

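The required success rate strongly affects effective capability: at 80% success the time horizon is 4-6x shorter than at 50%.
AI excels in structured, well-tested environments.
AI performance resembles that of a contractor with little context on the codebase or organization.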

Method

METR's "50%-task-completion time horizon" metric rates each task by how long humans take to complete it, then reports the task length at which a model succeeds 50% of the time, measured across diverse software benchmarks.
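One way to turn per-task results into such a horizon is to fit a logistic curve of success against the logarithm of human completion time, then solve for the time at which predicted success is 50%. The sketch below illustrates that idea and is not necessarily METR's exact procedure. The task results are invented for illustration, and `fit_logistic` and `time_horizon` are hypothetical helper names.

```python
import math

# Hypothetical (human_minutes, model_succeeded) results, invented for illustration.
RESULTS = [(1, 1), (2, 1), (4, 1), (8, 1), (16, 0),
           (32, 1), (64, 0), (128, 0), (256, 0)]


def fit_logistic(xs, ys, iters=25):
    """Fit P(success) = sigmoid(a + b*x) by Newton's method (two parameters)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood w.r.t. a
            g1 += (y - p) * x      # gradient of the log-likelihood w.r.t. b
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def time_horizon(results, success_rate=0.5):
    """Human-time task length (minutes) at which predicted success equals success_rate."""
    xs = [math.log2(minutes) for minutes, _ in results]
    ys = [outcome for _, outcome in results]
    a, b = fit_logistic(xs, ys)
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2 ** ((logit - a) / b)


horizon_50 = time_horizon(RESULTS)
horizon_80 = time_horizon(RESULTS, success_rate=0.8)
print(f"50% horizon: {horizon_50:.1f} min, 80% horizon: {horizon_80:.1f} min")
```

Because the fitted curve slopes downward with task length, the 80% horizon always comes out shorter than the 50% horizon. How much shorter depends on how steep the fitted curve is for a given model.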

In practice

Prioritize well-structured, tested codebases for AI integration.
Scope AI tasks to isolated coding challenges.
On complex tasks, expect AI to perform like a contractor with limited context.

Topics

AI Software Engineering
Model Evaluation
Task Completion Metrics
AI Productivity
Software Development Economics
Organizational Change

Best for: Investor, Entrepreneur, CTO, AI Scientist, Director of AI/ML, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Metadata.