47 - David Rein on METR Time Horizons
Summary
David Rein, a researcher at METR, discusses a paper co-led by Thomas Kwa and Ben West that introduces a novel metric for evaluating AI agent capabilities: the "time horizon." This metric measures the length of tasks, in human-equivalent time, that AI models are 50% likely to complete successfully. Unlike traditional benchmarks, which struggle to track long-term AI progress because each is restricted to a particular domain or difficulty range, the time horizon metric offers a unified, long-term view. The research found an exponential increase in the length of tasks AI models can complete over the past five years, with a doubling time of about seven months overall and a faster doubling time of about four months since 2024. The tasks range from seconds (e.g., identifying a password file) to hours (e.g., complex ML research engineering or cybersecurity challenges like finding an MD5 hash collision). METR focuses on software engineering, data analysis, ML, and cybersecurity tasks because they are relevant to models of catastrophic AI risk, are a focus of AI developers, and are easy to evaluate automatically.
Key takeaway
For AI scientists and research engineers tracking AI capabilities, the exponential increase in AI's time horizon for task completion, particularly the recent four-month doubling time, signals rapid advancement. You should integrate this metric into your evaluations to better forecast future AI capabilities and assess potential risks, especially concerning AI's ability to contribute to its own development. Be mindful that current benchmarks may overestimate capabilities due to algorithmic scoring and task specificity, necessitating a broader, more ecologically valid assessment.
Key insights
The length of tasks that AI can complete, measured by how long they take humans, is increasing exponentially, and this measure provides a unified metric for long-term progress.
Principles
- AI progress in task completion time follows an exponential trend.
- Measuring human-equivalent task duration offers a unified metric across diverse benchmarks.
- AI systems exhibit different capability profiles than humans, excelling in some complex tasks while struggling with others.
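To make the exponential trend concrete, the arithmetic behind a fixed doubling time can be sketched as follows. This is an illustrative sketch only: the function names and the 60-minute starting horizon are hypothetical and not taken from the paper. Only the seven-month and four-month doubling times come from the summary above.

```python
import math


def projected_horizon(current_minutes, months_ahead, doubling_months):
    """Horizon after `months_ahead` months if it doubles every `doubling_months` months."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_until(current_minutes, target_minutes, doubling_months):
    """Months needed to grow from the current horizon to the target horizon."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Hypothetical starting point: a 60-minute time horizon (an assumption for illustration).
start = 60.0

# With a seven-month doubling time, 14 months means two doublings.
print(projected_horizon(start, months_ahead=14, doubling_months=7))

# With a four-month doubling time, reaching an 8-hour (480-minute) horizon
# takes three doublings.
print(months_until(start, target_minutes=480, doubling_months=4))
```

The same starting point leads to very different forecasts depending on which doubling time is assumed, which is why the difference between the long-run and the post-2024 trend matters for forecasting.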
Method
The time horizon metric operationalizes AI capability as the length of task, in human-equivalent time, that a model is 50% likely to complete successfully. Each task's length is estimated as the geometric mean of the completion times of relevant human experts who finished it successfully.
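The computation described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not METR's implementation: it assumes success probability is modeled as a logistic function of log2 task length, fits that curve with plain gradient ascent, and uses made-up task lengths and outcomes. All function names are hypothetical.

```python
import math


def geometric_mean(times_minutes):
    """Geometric mean of positive durations, e.g. successful human completion times."""
    logs = [math.log(t) for t in times_minutes]
    return math.exp(sum(logs) / len(logs))


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon_50(task_minutes, successes):
    """Task length (in minutes) at which the fitted success probability is 50%."""
    xs = [math.log2(t) for t in task_minutes]
    a, b = fit_logistic(xs, successes)
    # sigmoid(a + b * x) equals 0.5 exactly where a + b * x = 0.
    return 2 ** (-a / b)


# Made-up example: human-equivalent task lengths and one model's outcomes (1 = success).
task_minutes = [1, 2, 4, 8, 16, 32, 64, 128]
successes = [1, 1, 1, 0, 1, 0, 0, 0]

print(geometric_mean([2.0, 8.0]))  # task length from two human completion times
print(time_horizon_50(task_minutes, successes))
```

The geometric mean is a natural choice here because completion times vary over orders of magnitude, and fitting against log task length lets a single curve cover tasks from seconds to hours.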
In practice
- Use the time horizon metric to track AI progress over multi-year spans.
- Focus on software engineering and cyber security tasks for risk assessment.
- Consider the cost-effectiveness of AI vs. human labor for long tasks.
Topics
- AI Capability Evaluation
- Time Horizon Metric
- Exponential AI Progress
- Software Engineering Benchmarks
- AI Risk Assessment
Best for: AI Scientist, Research Scientist, Investor, AI Researcher, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AXRP.