Evidence on AI R&D Progress from NanoGPT

2026-04-21 · Source: METR · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Advanced, medium

Summary

An analysis of the NanoGPT speedrun, a public challenge to train a 124M-parameter GPT-2-small language model on FineWeb as quickly as possible using 8xH100 GPUs, offers evidence on the pace of AI R&D progress. From May 2024 to March 2026, 36 human contributors achieved a 31x speedup across 77 records, cutting training time from 45 minutes to 1.43 minutes. When contributions are classified by optimization depth and provenance, shallow and moderate improvements account for approximately 21x of the total speedup. Early records mostly imported or adapted existing techniques, while later records increasingly featured newly invented ideas, which made up 33% of contributions between January 2025 and March 2026. Four recent records, set between late 2025 and early 2026, are credited to AI agents such as Hiverge and Station, though their contributions appear relatively shallow. The study also highlights challenges in interpreting this evidence, including data contamination and scale-dependence.
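The headline figure can be recomputed from the two reported training times. The sketch below does that arithmetic and, under the assumption that speedups from different depth categories compound multiplicatively (an assumption made here for illustration, not a claim taken from this summary), estimates the factor left over for deeper contributions. The function names are hypothetical.

```python
def speedup(start_minutes: float, end_minutes: float) -> float:
    """Ratio of the original training time to the current one."""
    if start_minutes <= 0 or end_minutes <= 0:
        raise ValueError("training times must be positive")
    return start_minutes / end_minutes


def residual_factor(total: float, partial: float) -> float:
    """Factor not explained by `partial`, ASSUMING speedups multiply."""
    if partial <= 0:
        raise ValueError("partial speedup must be positive")
    return total / partial


# Figures reported in the summary above.
total_speedup = speedup(45.0, 1.43)          # roughly 31x
shallow_and_moderate = 21.0                  # reported as approximate

# Illustrative only: relies on the multiplicative assumption.
deeper_factor = residual_factor(total_speedup, shallow_and_moderate)

print(f"total speedup: {total_speedup:.1f}x")
print(f"residual factor under multiplicative assumption: {deeper_factor:.2f}x")
```

If the categories do not compose multiplicatively, the residual figure has no direct meaning, so it is best read as a rough sanity check rather than an attribution.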

Key takeaway

Machine learning engineers optimizing LLM pretraining should recognize that even shallow or moderate improvements can add up to substantial speedups: in the NanoGPT speedrun they account for roughly 21x of the 31x gain. AI agents are beginning to contribute, but their impact so far appears limited to shallower optimizations. Public challenges such as the NanoGPT speedrun are useful benchmarks for evaluating agent performance and for identifying where human ingenuity still drives breakthroughs.

Key insights

Cumulative progress on a public AI R&D challenge shows how human and agent contributions differ in depth and provenance, and how that mix shifts over time.

Principles

Shallow optimizations yield significant speedups.
New ideas emerge throughout R&D cycles.
AI agent contributions are currently shallow.

Method

The study classified 77 NanoGPT speedrun records by "Optimization depth" (Breakthrough, Deep, Moderate, Shallow) and "Provenance" (Invented, Adapted, Imported) using Claude Code to analyze PR diffs and descriptions.
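One way to picture the two-axis classification is as a small record type with an enum per axis. This is a minimal sketch, not the study's actual tooling: the class names, the `provenance_share` helper, and the three sample records are hypothetical and carry no real data.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Depth(Enum):
    BREAKTHROUGH = "Breakthrough"
    DEEP = "Deep"
    MODERATE = "Moderate"
    SHALLOW = "Shallow"


class Provenance(Enum):
    INVENTED = "Invented"
    ADAPTED = "Adapted"
    IMPORTED = "Imported"


@dataclass(frozen=True)
class Record:
    record_id: int          # hypothetical identifier, not a real record number
    depth: Depth
    provenance: Provenance


def provenance_share(records: list[Record], provenance: Provenance) -> float:
    """Fraction of records carrying the given provenance label."""
    if not records:
        return 0.0
    counts = Counter(r.provenance for r in records)
    return counts[provenance] / len(records)


# Made-up sample, purely to exercise the helper.
sample = [
    Record(1, Depth.SHALLOW, Provenance.IMPORTED),
    Record(2, Depth.MODERATE, Provenance.ADAPTED),
    Record(3, Depth.DEEP, Provenance.INVENTED),
]

invented_share = provenance_share(sample, Provenance.INVENTED)
print(f"invented share in sample: {invented_share:.0%}")
```

Restricting `records` to a date window before calling the helper would give a share comparable in form to the 33% figure quoted in the summary.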

In practice

Use public challenges for agent benchmarking.
Compare agent progress to human effort.
Identify gaps in agent scaffolding or tooling.

Topics

NanoGPT Speedrun
AI R&D Progress
LLM Pretraining
AI Agent Contributions
Optimization Benchmarking
Machine Learning Efficiency

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by METR.