How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR
Summary
Joel Becker of METR discusses the relationship between compute growth and AI capabilities, suggesting that a slowdown in compute growth, driven by physical or financial constraints, could significantly delay AI milestones. He highlights that current AI predictions often assume no unpredictable technological advances and that log-linear plots have been highly effective forecasting tools. The discussion also explores familiarity as a confounding factor in developer productivity studies: developers may go through a J-curve (an initial dip followed by gains) as they learn new AI tools, but this effect may be less significant than perceived. Becker also touches on the challenges of generalizing findings from small sample sizes and specific developer populations (e.g., open-source experts) to broader contexts, and on the difficulty of measuring AI's impact in complex domains like data science, where real-world data is messy and contradictory.
Key takeaway
If you are an AI scientist or research scientist evaluating AI progress and capabilities, recognize that sustained compute growth is critical to maintaining the current rate of advancement. Your assessments should account for the "J-curve" (an initial productivity dip when new AI tools are adopted) and for the inherent challenges AI faces with unstructured, contradictory real-world data, particularly in domains like data science. Do not over-rely on self-reported productivity metrics; instead, prioritize rigorous, randomized studies and qualitative observations to gauge AI's true impact and limitations.
Key insights
Compute growth directly influences AI capability timelines, with slowdowns potentially causing substantial delays.
Principles
- Log-linear plots are effective AI forecasting tools.
- Adopting a new AI tool can slow developers down at first; productivity may improve as familiarity grows.
- Real-world data complexity limits AI utility in data science.
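To make the log-linear principle concrete, here is a minimal sketch of how such a forecast could be computed. It assumes a capability metric that grows roughly exponentially over time, so that its base-2 logarithm is close to a straight line. The function names and the data points are hypothetical and synthetic; they are not METR's measurements or METR's code.

```python
import math

def fit_log_linear(times, values):
    """Least-squares fit of log2(value) = intercept + slope * time."""
    logs = [math.log2(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    slope = sxy / sxx  # doublings per unit of time
    intercept = mean_l - slope * mean_t
    return slope, intercept

def doubling_time(slope):
    """Time needed for the metric to double under the fitted trend."""
    return 1.0 / slope

def time_to_reach(target, slope, intercept):
    """Time at which the fitted trend line reaches the target value."""
    return (math.log2(target) - intercept) / slope

# Synthetic data (assumption): a metric that doubles every 0.5 years.
years = [0.0, 0.5, 1.0, 1.5, 2.0]
metric = [1.0, 2.0, 4.0, 8.0, 16.0]

slope, intercept = fit_log_linear(years, metric)
print(doubling_time(slope))
print(time_to_reach(64.0, slope, intercept))
```

A forecast like this only extrapolates the fitted line. If the underlying driver slows (for example, compute growth), the slope drops, the doubling time lengthens, and every projected milestone date moves later.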
Method
METR's research involves randomized controlled trials (RCTs) to compare AI-allowed versus AI-disallowed groups on natural tasks, alongside qualitative analysis of developer screen recordings to understand AI's impact on productivity and task completion.
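The comparison between AI-allowed and AI-disallowed groups can be sketched as follows. This is an illustration under stated assumptions, not METR's analysis pipeline: the completion times are invented, the effect is summarized as a ratio of geometric mean completion times, and significance is checked with a simple permutation test. All function names are hypothetical.

```python
import math
import random

def geometric_mean(xs):
    """Geometric mean, a common summary for right-skewed completion times."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def time_ratio(ai_allowed, ai_disallowed):
    """Ratio above 1 means tasks took longer when AI was allowed."""
    return geometric_mean(ai_allowed) / geometric_mean(ai_disallowed)

def permutation_p_value(ai_allowed, ai_disallowed, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute log time ratio."""
    rng = random.Random(seed)
    observed = abs(math.log(time_ratio(ai_allowed, ai_disallowed)))
    pooled = list(ai_allowed) + list(ai_disallowed)
    k = len(ai_allowed)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # re-randomize the group labels
        stat = abs(math.log(time_ratio(pooled[:k], pooled[k:])))
        if stat >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Invented completion times in minutes (assumption, not study data).
ai_allowed_minutes = [60.0, 90.0, 120.0, 45.0]
ai_disallowed_minutes = [50.0, 80.0, 100.0, 40.0]

ratio = time_ratio(ai_allowed_minutes, ai_disallowed_minutes)
p_value = permutation_p_value(ai_allowed_minutes, ai_disallowed_minutes)
print(ratio, p_value)
```

Because the estimate comes from measured times under random assignment, it does not depend on how fast developers feel they were, which is why it is a useful check against self-reported gains.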
In practice
- Consider compute growth as a leading indicator for AI progress.
- Account for familiarity curves when evaluating new AI tool adoption.
- Be skeptical of self-reported productivity gains with AI.
Topics
- AI Capability Measurement
- Developer Productivity
- Compute Growth
- AI in Data Science
- Software-Only Singularity
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.