AXRP Episode 47 - David Rein on METR Time Horizons
Summary
AXRP host Daniel Filan talks with METR researcher David Rein about the paper "Measuring AI Ability to Complete Long Tasks," which introduces a metric for tracking AI progress: the "time horizon." The time horizon is the length of task, measured by how long a human takes to complete it, that an AI model can complete with a 50% success rate. The study found that time horizons have grown exponentially over the past five to six years, doubling approximately every seven months, with the doubling time shortening to about four months from 2024 onwards. The tasks focus primarily on software engineering, data analysis, and cybersecurity, and range from seconds (e.g., identifying password files) to several hours (e.g., fixing permuted neural network embeddings or finding hash collisions). The discussion covers the methodology, including estimating each task's length as the geometric mean of the completion times of successful attempts by relevant human experts, and explores the implications for understanding AI progress and potential risks.
Key takeaway
For research scientists evaluating AI capabilities and forecasting future trends, the "time horizon" metric is worth understanding. It tracks the length of tasks AI can complete, and it shows exponential growth in AI's ability to handle complex, multi-step problems. You should consider this trend when assessing the potential for rapid AI acceleration and recursive self-improvement, especially in domains like software engineering and AI R&D. The four-month doubling time observed from 2024 suggests a faster pace of progress than the longer-run seven-month estimate, which calls for continuous monitoring and refinement of evaluation benchmarks.
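To make the doubling-time figures concrete, here is a minimal arithmetic sketch. It assumes a pure exponential trend, horizon(t) = horizon(0) * 2^(t / doubling time), which is a simplification; the function names and the example starting and target horizons are hypothetical and not taken from the paper.

```python
import math

def growth_factor(months, doubling_months):
    """Multiplicative growth in time horizon after `months`,
    assuming a constant doubling time (an idealized trend)."""
    return 2 ** (months / doubling_months)

def months_to_reach(current_horizon, target_horizon, doubling_months):
    """Months until the horizon grows from current to target,
    under the same constant-doubling-time assumption."""
    return doubling_months * math.log2(target_horizon / current_horizon)

# Growth over one year under the two doubling times quoted in the summary.
yearly_at_7 = growth_factor(12, 7)   # roughly 3.3x per year
yearly_at_4 = growth_factor(12, 4)   # 8x per year

# Hypothetical example: going from a 1-hour to an 8-hour horizon.
wait_at_4 = months_to_reach(1, 8, 4)  # three doublings at 4 months each
print(yearly_at_7, yearly_at_4, wait_at_4)
```

The gap between the two rates compounds quickly: over a year, a four-month doubling time yields more than twice the growth of a seven-month one, which is why the choice of trend matters so much for forecasts.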
Key insights
AI's ability to complete long tasks, measured by human completion time, has increased exponentially.
Principles
- Task length for humans predicts AI success rates.
- AI progress in task completion shows exponential trends.
- Economic models can inform AI capability forecasting.
Method
Measure AI capability by the length of tasks, in human completion time, that a model can complete with a 50% success rate. Estimate each task's length as the geometric mean of the completion times of successful expert human attempts, across a diverse set of tasks centered on software engineering.
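The method above can be sketched in a few lines. This is an illustrative sketch, not the paper's code: it assumes the 50% point is read off a logistic curve fitted to success against log2 of human task length (the summary does not spell out the fitting procedure), and all data and function names below are hypothetical.

```python
import math

def geometric_mean(times):
    """Geometric mean of positive human completion times."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

def fit_logistic(xs, ys, steps=2000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * (x - mean_x)) by plain
    gradient ascent on the log-likelihood (assumed fitting method)."""
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering improves conditioning
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b, mean_x

def time_horizon_50(task_minutes, outcomes):
    """Task length (minutes) at which fitted success probability is 50%."""
    xs = [math.log2(t) for t in task_minutes]
    a, b, mean_x = fit_logistic(xs, outcomes)
    # sigmoid(a + b * (x - mean_x)) equals 0.5 where x = mean_x - a / b
    return 2 ** (mean_x - a / b)

# Hypothetical task length from two successful expert attempts (minutes).
task_length = geometric_mean([10, 40])

# Hypothetical model results: task length in minutes, 1 = success, 0 = failure.
task_minutes = [1] * 3 + [4] * 3 + [16] * 2 + [64] * 3 + [256] * 3
outcomes = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
horizon = time_horizon_50(task_minutes, outcomes)
print(task_length, horizon)
```

Working in log2 of task length means each unit step is one doubling of human time, which matches how the trend is reported. The geometric mean is the natural average on that scale, since it is less dominated by a single slow attempt than the arithmetic mean.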
In practice
- Use time horizon as a unified metric for AI progress.
- Focus on software engineering for measurable AI capabilities.
- Consider task "messiness" alongside time horizon for prediction.
Topics
- AI Time Horizons
- AI Capability Evaluation
- Exponential AI Progress
- Recursive Self-Improvement
- AI Software Engineering
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.