Autoresearch in the Wild: A Survey of Real-World Applications

· Source: Machine Learning on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

Andrej Karpathy's autoresearch, released in January 2025, is an LLM-agent-driven loop that autonomously modifies code, runs it, checks whether a target metric improved, and then commits or reverts the change. This article surveys real-world applications of the pattern, drawing on the "awesome-autoresearch" repository. Early applications included optimizing nanoGPT training code, where the loop found 20 improvements overnight. Shopify used it to optimize its Liquid template engine, achieving 53% faster parse+render and 61% fewer memory allocations. Other documented uses span GPU kernel optimization (from 18 TFLOPS to 187 TFLOPS), voice agent prompt engineering (score from 0.728 to 0.969), and sports analytics (a baseball pitch speed model, R² from 0.44 to 0.78). The pattern also extends to self-play domains: "autoevolve" optimizes game bots through competitive evaluation and placed 6th out of 83 in the Game AI Cup.
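How autoevolve scores its bots is not described in this summary. The sketch below assumes one simple form of competitive evaluation: a modified bot (the challenger) replaces the current bot (the incumbent) only if it wins at least a threshold fraction of head-to-head games. `accept_challenger`, `make_toy_game`, the 0.55 threshold, and the Bradley-Terry toy game are all hypothetical stand-ins; a real setup would run the actual game engine.

```python
import random
from typing import Callable, Tuple

# Hypothetical game interface: play_game(rng, challenger_moves_first) returns the
# challenger's result for one game: 1.0 = win, 0.5 = draw, 0.0 = loss.
GameFn = Callable[[random.Random, bool], float]


def accept_challenger(
    play_game: GameFn, games: int = 200, threshold: float = 0.55, seed: int = 0
) -> Tuple[bool, float]:
    """Keep a code change only if the modified bot beats the incumbent often enough."""
    rng = random.Random(seed)
    total = 0.0
    for i in range(games):
        # Alternate who moves first so any first-move advantage cancels out.
        total += play_game(rng, i % 2 == 0)
    win_rate = total / games
    return win_rate >= threshold, win_rate


def make_toy_game(challenger: float, incumbent: float, first_move_bonus: float = 0.05) -> GameFn:
    """Toy stand-in for real matches: Bradley-Terry win probability from bot strengths."""
    base = challenger / (challenger + incumbent)

    def play(rng: random.Random, challenger_first: bool) -> float:
        p = base + (first_move_bonus if challenger_first else -first_move_bonus)
        return 1.0 if rng.random() < p else 0.0

    return play


if __name__ == "__main__":
    accepted, rate = accept_challenger(make_toy_game(3.0, 1.0), games=1000)
    print(accepted, round(rate, 3))
```

Requiring a margin above 0.5 rather than any win rate above it is a design choice meant to keep sampling noise from promoting a bot that is only equal to the incumbent.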

Key takeaway

For AI scientists and engineers seeking to improve system performance or discover novel solutions, consider implementing autoresearch for automated optimization. Teams can apply the pattern to tune LLM training code, optimize critical infrastructure such as template engines, or explore complex scientific models. Design the evaluation function carefully, as the tennis prediction case illustrates, so that metric gains reflect genuine improvement rather than reward hacking. For specific use cases, existing tooling such as pi-autoresearch or autoevolve is worth exploring.
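The details of the tennis case are not given here, so the following is only one common guard, assumed for illustration: score each candidate both on the data the agent optimizes against and on a held-out split it never sees, and commit only when the visible gain does not come at the held-out split's expense. `Scores`, `should_commit`, and the tolerance parameters are hypothetical names, not part of any autoresearch tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    search: float   # metric on data the agent optimizes against (higher is better)
    holdout: float  # metric on data the agent never sees during the loop


def should_commit(
    candidate: Scores,
    incumbent: Scores,
    min_gain: float = 0.0,
    max_holdout_drop: float = 0.0,
) -> bool:
    """Commit only if the visible metric improves by more than min_gain AND the
    held-out metric does not fall by more than max_holdout_drop."""
    improves = candidate.search - incumbent.search > min_gain
    generalizes = incumbent.holdout - candidate.holdout <= max_holdout_drop
    return improves and generalizes


incumbent = Scores(search=0.75, holdout=0.73)
print(should_commit(Scores(0.80, 0.74), incumbent))  # gain on both splits → True
print(should_commit(Scores(0.90, 0.60), incumbent))  # search gain, held-out collapse → False
```

The second call is the reward-hacking pattern: a large jump on the optimized metric paired with a drop on data the agent could not exploit.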

Key insights

Autoresearch leverages LLM agents to autonomously optimize code and configurations across diverse domains by iteratively modifying, running, and evaluating changes.

Principles

Method

The core autoresearch loop involves an LLM agent modifying code/config, executing it to measure a metric, deciding to commit or revert based on improvement, and repeating the process.
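A minimal sketch of that loop, assuming a single numeric metric where lower is better. The LLM agent is replaced by a hypothetical random-perturbation proposer (`perturb_one`), and running the code is replaced by a hypothetical toy loss (`toy_metric`); in a real setup the proposer would be an agent editing files, and commit/revert would be version-control operations. None of these names come from the autoresearch repository.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def autoresearch_loop(
    config: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[Config, float, List[str]]:
    """Greedy modify -> run -> evaluate -> commit/revert loop; lower metric is better."""
    rng = random.Random(seed)
    best = evaluate(config)  # baseline run before any change
    log: List[str] = []
    for step in range(iterations):
        candidate = propose(config, rng)  # stand-in for the LLM agent's edit
        score = evaluate(candidate)       # execute the change and measure the metric
        if score < best:
            config, best = candidate, score  # "commit": candidate becomes the new baseline
            log.append(f"{step}: commit {score:.4f}")
        else:
            log.append(f"{step}: revert {score:.4f}")  # "revert": discard the candidate
    return config, best, log


def perturb_one(config: Config, rng: random.Random) -> Config:
    """Hypothetical proposer: jitter one hyperparameter multiplicatively (keeps it positive)."""
    new = dict(config)  # never mutate the committed config in place
    key = rng.choice(sorted(new))
    new[key] *= math.exp(rng.gauss(0.0, 0.3))
    return new


def toy_metric(c: Config) -> float:
    """Hypothetical validation loss with its minimum at lr=3e-3, dropout=0.1."""
    return (math.log10(c["lr"]) - math.log10(3e-3)) ** 2 + (c["dropout"] - 0.1) ** 2


if __name__ == "__main__":
    start = {"lr": 1e-2, "dropout": 0.5}
    final, loss, history = autoresearch_loop(start, perturb_one, toy_metric, iterations=200)
    print(final, round(loss, 4), sum("commit" in line for line in history), "commits")
```

The strict `score < best` rule never accepts a worse result, which keeps the loop simple, but with a noisy metric it can lock in a lucky run; that is one reason the design of the evaluation step matters.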

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.