MirrorCode: Evidence that AI can already do some weeks-long coding tasks

· Source: METR · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

MirrorCode, a project co-developed by METR and Epoch AI, provides evidence that AI systems can already complete some complex, weeks-long coding tasks. The project is part of METR's broader mission to research, develop, and run evaluations of frontier AI systems' autonomous capabilities and potential for societal harm. Other featured research includes a survey in which technical workers self-reported a median 1.4-2x productivity increase from AI tools; preliminary monitorability evaluations testing AI agents' ability to bypass oversight; and an analysis of time-horizon trends across nine benchmarks, which observes generally similar 7-month doubling times in scientific reasoning, math, robotics, computer use, and self-driving domains.
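To make the doubling-time figure concrete, here is a minimal sketch of the arithmetic, assuming a time horizon grows as a steady exponential with the 7-month doubling time cited above. The function names and the starting and target horizons are hypothetical illustrations, not values reported by METR.

```python
import math


def projected_horizon(start_hours: float, months_elapsed: float,
                      doubling_months: float = 7.0) -> float:
    """Project a time horizon forward, assuming steady exponential growth.

    Assumption: horizon(t) = start_hours * 2 ** (t / doubling_months).
    """
    return start_hours * 2 ** (months_elapsed / doubling_months)


def months_to_reach(start_hours: float, target_hours: float,
                    doubling_months: float = 7.0) -> float:
    """Months needed to grow from start_hours to target_hours under the same assumption."""
    return doubling_months * math.log2(target_hours / start_hours)


# Hypothetical inputs: a 1-hour horizon today, and an 8-hour target.
horizon_after_14_months = projected_horizon(1.0, 14.0)  # two doublings
months_for_8x = months_to_reach(1.0, 8.0)               # three doublings
```

Under this assumption, every 7 months doubles the horizon, so an 8x increase takes three doublings. Real trends may deviate from a clean exponential, so the sketch is an aid to reading the claim rather than a forecast.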

Key takeaway

For Directors of AI/ML evaluating project timelines and resource allocation, this evidence is a reason to re-evaluate AI's role in complex software development. Your teams should explore using advanced AI for tasks previously considered too long or complex to automate. Given METR's focus on autonomous capabilities and oversight, also prioritize robust monitoring and safety protocols when deploying AI systems with significant autonomy.

Key insights

AI systems can already complete some complex, weeks-long coding tasks autonomously.

Method

Prototype evaluations test monitors' ability to catch AI agents performing side tasks and the agents' ability to bypass this monitoring.

Best for: CTO, VP of Engineering/Data, Research Scientist, AI Scientist, Director of AI/ML, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by METR.