How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR

· Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Advanced, extended

Summary

Joel Becker of METR discusses the relationship between compute growth and AI capabilities, suggesting that a slowdown in compute growth, whether from physical or financial constraints, could significantly delay AI milestones. He notes that current AI forecasts typically assume no unpredictable technological advances, and that straight-line extrapolation on log-linear plots has so far been a highly effective forecasting tool. The discussion also covers tool familiarity as a confounder in developer productivity studies: developers may go through a J-curve with new AI tools (an initial dip in productivity followed by recovery), but this effect may be smaller than commonly perceived. Becker also addresses the difficulty of generalizing findings from small samples and specific developer populations (e.g., open-source experts) to broader contexts, and the difficulty of measuring AI's impact in complex domains like data science, where real-world data is messy and contradictory.

Key takeaway

AI scientists and research scientists evaluating AI progress and capabilities should recognize that sustained compute growth is critical to maintaining the current rate of advancement. Assessments should account for the "J-curve" of initial productivity dips with new AI tools and for the difficulty AI has with unstructured, contradictory real-world data, particularly in domains like data science. Do not over-rely on self-reported productivity metrics; instead, prioritize rigorous randomized studies and qualitative observation to gauge AI's true impact and limitations.

Key insights

Compute growth directly influences AI capability timelines, with slowdowns potentially causing substantial delays.
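A rough way to see why a slowdown translates into delay: if a capability metric follows a straight line on a log-linear plot, it has a constant doubling time, and the time to any milestone scales in proportion to that doubling time. The sketch below works through that arithmetic. The function names and all numbers are hypothetical and are not taken from the talk, and the link between compute growth and the metric's doubling time is assumed here rather than modeled.

```python
import math


def years_to_milestone(current: float, target: float, doubling_time_years: float) -> float:
    """Years until `current` reaches `target` under steady exponential growth.

    Assumes the metric doubles every `doubling_time_years` (a straight line
    on a log-linear plot). All inputs are illustrative placeholders.
    """
    if current <= 0 or target <= 0 or doubling_time_years <= 0:
        raise ValueError("all inputs must be positive")
    doublings_needed = math.log2(target / current)
    # A target already reached needs no additional time.
    return max(0.0, doublings_needed * doubling_time_years)


def delay_from_slowdown(
    current: float,
    target: float,
    baseline_doubling_years: float,
    slowed_doubling_years: float,
) -> float:
    """Extra years to reach `target` if the doubling time lengthens."""
    baseline = years_to_milestone(current, target, baseline_doubling_years)
    slowed = years_to_milestone(current, target, slowed_doubling_years)
    return slowed - baseline


if __name__ == "__main__":
    # Hypothetical: the milestone is 8x the current level (3 doublings away).
    print(years_to_milestone(1.0, 8.0, 0.5))        # baseline doubling time
    print(years_to_milestone(1.0, 8.0, 1.0))        # doubling time twice as long
    print(delay_from_slowdown(1.0, 8.0, 0.5, 1.0))  # resulting delay
```

With these made-up inputs, stretching the doubling time from 0.5 to 1.0 years moves a milestone three doublings away from 1.5 years out to 3.0 years out. The delay grows with the distance to the milestone, which is why a sustained slowdown matters more for far-off milestones than for near ones.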

Method

METR's research uses randomized controlled trials (RCTs) that compare AI-allowed and AI-disallowed conditions on natural tasks, alongside qualitative analysis of developer screen recordings, to understand AI's effect on productivity and task completion.
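A comparison of this kind can be sketched in a few lines. The code below is a minimal illustration, not METR's actual analysis: it assumes randomization happens per task, measures the effect as a ratio of geometric-mean completion times, and uses a simple percentile bootstrap for uncertainty. The function names, the condition labels, and any data passed in are hypothetical.

```python
import math
import random
from statistics import fmean


def assign_conditions(task_ids, seed=0):
    """Randomly assign each task to a condition (assumed per-task randomization)."""
    rng = random.Random(seed)
    return {task: rng.choice(["ai_allowed", "ai_disallowed"]) for task in task_ids}


def geometric_mean_ratio(ai_allowed_hours, ai_disallowed_hours):
    """Ratio of geometric-mean completion times, AI-allowed over AI-disallowed.

    A value above 1.0 means AI-allowed tasks took longer on average;
    below 1.0 means they were faster.
    """
    log_ai = fmean([math.log(h) for h in ai_allowed_hours])
    log_no_ai = fmean([math.log(h) for h in ai_disallowed_hours])
    return math.exp(log_ai - log_no_ai)


def bootstrap_interval(ai_allowed_hours, ai_disallowed_hours,
                       n_resamples=2000, seed=0, level=0.95):
    """Percentile bootstrap interval for the ratio, resampling within each arm."""
    rng = random.Random(seed)
    ratios = sorted(
        geometric_mean_ratio(
            rng.choices(ai_allowed_hours, k=len(ai_allowed_hours)),
            rng.choices(ai_disallowed_hours, k=len(ai_disallowed_hours)),
        )
        for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return ratios[lower_index], ratios[upper_index]
```

Working with logged times keeps a few very long tasks from dominating the estimate. The interval matters as much as the point estimate: with the small samples the talk warns about, it will usually be wide.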

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.