47 - David Rein on METR Time Horizons

Source: AXRP - the AI X-risk Research Podcast · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems) · Depth: Advanced, extended

Summary

METR researchers David Rein and Thomas Kwa, along with Ben West, co-led the paper "Measuring AI Ability to Complete Long Tasks," which introduces a new metric for tracking AI progress: the "time horizon." The time horizon is the length of task, measured by how long the task takes a human to complete, that an AI model completes successfully 50% of the time. Unlike traditional benchmarks, which saturate quickly, this metric supports long-term tracking of AI capabilities across diverse tasks, from identifying a file in a few seconds to multi-hour software engineering challenges such as fixing permuted neural network embeddings or finding MD5 hash collisions. The study found a robust, systematic exponential trend in the length of tasks AI can complete over the past five years, with a doubling time of approximately seven months, shortening to roughly four months from 2024 onwards. The work focuses primarily on software engineering, data analysis, ML, and cybersecurity tasks, because these are relevant to AI risk models and easy to evaluate automatically. It also notes that current AI systems are significantly more cost-effective than human experts on these tasks.

Key takeaway

For research scientists and AI developers tracking model capabilities, the time horizon metric offers a robust, long-term indicator of progress, especially in software engineering domains. Consider using this approach to forecast future capabilities and assess potential risks. Keep in mind that although AI systems are increasingly capable, their performance on "messier" or more qualitative tasks may still lag, so evaluation methods for those tasks need careful thought.

Key insights

AI's ability to complete long tasks, measured by human completion time, has increased exponentially over the past five years.
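The doubling-time arithmetic behind an exponential trend like this can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names `doubling_time_months` and `project_horizon` are hypothetical, the example horizons are made-up numbers, and the seven-month default is simply the figure quoted in the summary.

```python
import math


def doubling_time_months(h_start, h_end, months_elapsed):
    """Doubling time implied by growth from h_start to h_end over months_elapsed.

    Assumes pure exponential growth: h_end = h_start * 2 ** (months / T).
    """
    if h_start <= 0 or h_end <= h_start:
        raise ValueError("expected 0 < h_start < h_end")
    return months_elapsed * math.log(2) / math.log(h_end / h_start)


def project_horizon(h_now, months_ahead, doubling_months=7.0):
    """Extrapolate a time horizon forward, assuming the doubling time holds."""
    return h_now * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(doubling_time_months(15.0, 60.0, 14.0))  # horizon grew 4x in 14 months
    print(project_horizon(60.0, 14.0))             # two doublings at 7 months each
```

Any such projection is only as good as the assumption that the trend continues; with a four-month doubling time instead of seven, the same 14-month projection would be considerably higher.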

Principles

Method

Measure human completion time for a wide range of tasks, then assess AI success rates to determine the task length at which models achieve a 50% success likelihood, tracking this over time.
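One way to turn the method above into a number is to fit a logistic curve of success against the logarithm of human completion time and read off where it crosses 50%. The sketch below assumes that fitting approach (the summary itself does not specify one), uses plain gradient ascent rather than any particular library, and runs on hypothetical task data; `fit_time_horizon` is an illustrative name.

```python
import math


def fit_time_horizon(human_minutes, successes, steps=5000, lr=0.1):
    """Estimate the 50% time horizon in minutes.

    Fits P(success) = sigmoid(a + b * (log2(minutes) - mean)) by gradient
    ascent on the log-likelihood, then solves for the 50% crossing point.
    """
    xs = [math.log2(m) for m in human_minutes]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering keeps the fit stable
    n = len(xs)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    if b >= 0:
        raise ValueError("success does not decrease with task length")
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0
    return 2 ** (mean_x - a / b)


if __name__ == "__main__":
    # Hypothetical data: one model attempt per task, 1 = success, 0 = failure.
    minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
    outcome = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
    print(f"50% time horizon: {fit_time_horizon(minutes, outcome):.1f} minutes")
```

Repeating this fit for each model and plotting the resulting horizon against release date gives the trend line that the doubling time is read from. Working in log task length reflects that the tasks span seconds to hours; a fit on raw minutes would be dominated by the longest tasks.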

Best for: Research Scientist, Investor, Entrepreneur, AI Researcher, AI Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by AXRP - the AI X-risk Research Podcast.