47 - David Rein on METR Time Horizons
Summary
METR researchers David Rein and Thomas Kwa, along with Ben West, co-led the paper "Measuring AI Ability to Complete Long Tasks," which introduces a new metric for tracking AI progress: the "time horizon." The time horizon is the length of task, measured by how long humans take to complete it, at which an AI model succeeds 50% of the time. Unlike traditional benchmarks, which saturate quickly, it allows long-term tracking of AI capabilities across diverse tasks, from seconds-long file identification to multi-hour software engineering challenges such as fixing permuted neural network embeddings or finding MD5 hash collisions. The study found a robust, systematic exponential trend in the length of tasks AI can complete over the past five years, with a doubling time of approximately seven months, shortening to roughly four months from 2024 onward. The work focuses primarily on software engineering, data analysis, ML, and cybersecurity tasks because they are relevant to AI risk models and easy to evaluate automatically, and it notes that current AI systems are significantly more cost-effective than human experts on these tasks.
Key takeaway
For research scientists and AI developers tracking model capabilities, the time horizon metric offers a robust, long-term indicator of progress, especially in software engineering domains. Consider using it to forecast future capabilities and assess potential risks, while recognizing that performance on "messier" or more qualitative tasks may still lag, so evaluation methods for those tasks need careful thought.
Key insights
AI's ability to complete long tasks, measured by human completion time, has increased exponentially over the past five years.
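As a rough illustration of what a fixed doubling time implies, the sketch below extrapolates a time horizon forward under the roughly seven-month doubling time cited in the summary. The function names, the 60-minute starting horizon, and the assumption that the trend simply continues are all illustrative and not taken from the paper.

```python
import math


def projected_horizon(current_minutes, months_ahead, doubling_months=7.0):
    """Extrapolate a time horizon, assuming it doubles every `doubling_months`."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_to_reach(current_minutes, target_minutes, doubling_months=7.0):
    """Months until the horizon reaches `target_minutes` under the same assumption."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Hypothetical starting point: a 60-minute horizon today.
in_14_months = projected_horizon(60, 14)      # two doublings at 7 months each
to_workday = months_to_reach(60, 480)         # three doublings to reach 8 hours
faster_trend = months_to_reach(60, 480, 4.0)  # same target at a 4-month doubling time
```

Swapping the doubling time from seven months to four shows how sensitive such forecasts are to which trend is assumed.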
Principles
- Task length for humans predicts AI success rates.
- Economic models can reveal robust trends in AI progress.
Method
Measure human completion time for a wide range of tasks, then assess AI success rates to determine the task length at which models achieve a 50% success likelihood, tracking this over time.
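The method above can be sketched as a logistic fit of success against the logarithm of human completion time, then solving for the task length where the predicted success rate is 50%. This is a minimal sketch under stated assumptions: the task data are invented, the helper names are hypothetical, and the plain gradient-ascent fit stands in for whatever fitting procedure the paper actually uses.

```python
import math

# Hypothetical results for one model: human completion time per task (minutes)
# and whether the model succeeded (1) or failed (0).
HUMAN_MINUTES = [1, 1, 2, 4, 4, 8, 15, 15, 30, 30, 60, 60, 120, 240, 480]
SUCCEEDED = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]


def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon_minutes(human_minutes, successes, p=0.5):
    """Task length (in human minutes) at which predicted success equals `p`."""
    xs = [math.log2(m) for m in human_minutes]
    mean_x = sum(xs) / len(xs)
    # Center the log-times so the intercept and slope are easier to fit.
    a, b = fit_logistic([x - mean_x for x in xs], successes)
    logit_p = math.log(p / (1.0 - p))
    return 2 ** (mean_x + (logit_p - a) / b)


horizon_50 = time_horizon_minutes(HUMAN_MINUTES, SUCCEEDED)
horizon_80 = time_horizon_minutes(HUMAN_MINUTES, SUCCEEDED, p=0.8)
```

Repeating this for models released at different dates and plotting the resulting horizons against release date gives the trend the summary describes. A stricter threshold such as 80% yields a shorter horizon for the same data.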
In practice
- Focus AI evaluation on tasks relevant to specific threat models.
- Use open-source platforms such as Inspect for standardized evaluations.
Topics
- AI Time Horizons
- AI Capability Evaluation
- Software Engineering Benchmarks
- Recursive Self-Improvement
- AI Risk Assessment
Best for: Research Scientist, Investor, Entrepreneur, AI Researcher, AI Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by AXRP - the AI X-risk Research Podcast.