Autoresearch in the Wild: A Survey of Real-World Applications
Summary
Andrej Karpathy's autoresearch, released in January 2025, is a loop driven by an LLM agent that autonomously modifies code, executes it, checks whether a target metric improved, and then commits or reverts the change. This article surveys real-world applications of the pattern, drawing on the "awesome-autoresearch" repository. Initial applications included optimizing nanoGPT training code, where it found 20 improvements overnight. Shopify used it to optimize its Liquid template engine, achieving 53% faster parse+render and 61% fewer memory allocations. Other documented uses span GPU kernel optimization (18 TFLOPS to 187 TFLOPS), voice agent prompt engineering (score 0.728 to 0.969), and sports analytics (baseball pitch speed R-squared from 0.44 to 0.78). The pattern also extends to self-play domains through "autoevolve," which placed 6th out of 83 in the Game AI Cup by optimizing game bots through competitive evaluation.
Key takeaway
AI scientists and engineers who want to improve system performance or discover novel solutions should consider autoresearch for automated optimization. Teams can apply the pattern to tune LLM training, optimize critical infrastructure such as template engines, or explore complex scientific models. Design the evaluation function with care, as the tennis prediction case demonstrates, so that measured gains reflect true improvement rather than reward hacking. Existing tooling such as pi-autoresearch or autoevolve covers specific use cases.
Key insights
Autoresearch leverages LLM agents to autonomously optimize code and configurations across diverse domains by iteratively modifying, running, and evaluating changes.
Principles
- Iterative modification and metric-based decision-making drive autonomous optimization.
- LLM agents can effectively navigate complex search spaces like CUDA kernels.
- Hybrid approaches combine LLM structural search with classical parameter optimization.
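The hybrid principle above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the `STRUCTURES` dictionary stands in for structural variants that an LLM agent might propose, the data is synthetic, and a plain grid search plays the role of the classical parameter optimizer. It is not taken from any of the surveyed projects.

```python
from itertools import product

# Hypothetical structural candidates standing in for LLM-proposed model forms.
STRUCTURES = {
    "linear": lambda x, a, b: a * x + b,
    "quadratic": lambda x, a, b: a * x * x + b,
}

# Synthetic data generated from y = 2x^2 + 1 (illustrative only).
DATA = [(x, 2.0 * x * x + 1.0) for x in range(-3, 4)]


def loss(model, a, b):
    """Sum of squared errors of a candidate structure on the toy data."""
    return sum((model(x, a, b) - y) ** 2 for x, y in DATA)


def tune_parameters(model):
    """Classical inner step: exhaustive grid search over (a, b)."""
    grid = [i * 0.5 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
    return min((loss(model, a, b), a, b) for a, b in product(grid, grid))


def hybrid_search():
    """Outer structural search; each structure is scored after tuning."""
    results = {name: tune_parameters(model) for name, model in STRUCTURES.items()}
    best_name = min(results, key=lambda name: results[name][0])
    return best_name, results[best_name]


best_name, (best_loss, best_a, best_b) = hybrid_search()
print(best_name, best_loss, best_a, best_b)
```

The split keeps each search method on the part of the problem it suits: the outer step chooses among discrete structures, while the inner step handles continuous parameters without spending agent iterations on them.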
Method
The core autoresearch loop involves an LLM agent modifying code/config, executing it to measure a metric, deciding to commit or revert based on improvement, and repeating the process.
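The loop described above can be sketched as a simple accept-or-reject search. This is a minimal sketch under stated assumptions, not the actual autoresearch implementation: `propose` is a hypothetical stand-in for the LLM agent's edit, `evaluate` stands in for running the code and reading the metric, and "commit" and "revert" are modeled by keeping or discarding an in-memory candidate rather than by version control operations.

```python
import random


def autoresearch_loop(candidate, propose, evaluate, steps):
    """Keep a proposed change only if it improves the metric (higher is better)."""
    best_score = evaluate(candidate)
    history = []
    for _ in range(steps):
        trial = propose(candidate)   # stand-in for the agent modifying code/config
        score = evaluate(trial)      # stand-in for executing and measuring
        if score > best_score:
            candidate, best_score = trial, score  # "commit" the change
            history.append(("commit", score))
        else:
            history.append(("revert", score))     # "revert": discard the trial
    return candidate, best_score, history


# Toy problem (hypothetical): tune a single "lr" value toward 0.3.
rng = random.Random(0)


def toy_propose(config):
    changed = dict(config)
    changed["lr"] = config["lr"] + rng.uniform(-0.1, 0.1)
    return changed


def toy_evaluate(config):
    return -(config["lr"] - 0.3) ** 2


start = {"lr": 0.9}
best, best_score, history = autoresearch_loop(start, toy_propose, toy_evaluate, steps=200)
print(best, best_score, sum(1 for action, _ in history if action == "commit"))
```

In a real setup, the proposal step would call an LLM and the evaluation step would run a benchmark or training job. The strict `score > best_score` comparison is one design choice; noisy metrics may call for a minimum improvement threshold or repeated measurements before committing.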
In practice
- Apply autoresearch to optimize existing codebases for performance metrics.
- Use self-play variants for competitive or adversarial AI tasks.
- Design robust evaluation functions to prevent reward hacking.
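One common guard against reward hacking is to accept a change only when it improves both the metric the agent optimizes and a held-out metric the agent never sees. The sketch below assumes this two-split setup; the function name, the `"search"` and `"holdout"` keys, and the example scores are all hypothetical and are not drawn from the tennis prediction case or any other surveyed project.

```python
def accept_change(old_scores, new_scores, min_gain=0.0):
    """Accept only if both the search metric and the held-out metric improve.

    Scores are dicts with "search" and "holdout" keys (higher is better).
    """
    return all(
        new_scores[key] - old_scores[key] > min_gain
        for key in ("search", "holdout")
    )


baseline = {"search": 0.70, "holdout": 0.68}

# Large gain on the optimized metric but a drop on held-out data:
# a typical sign that the change exploits the evaluation rather than the task.
suspicious = {"search": 0.95, "holdout": 0.60}

# Modest gain that carries over to held-out data.
genuine = {"search": 0.75, "holdout": 0.72}

print(accept_change(baseline, suspicious))
print(accept_change(baseline, genuine))
```

A held-out check does not rule out every exploit, but it makes it harder for a loop that optimizes a single number to commit changes that only fit the evaluation data.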
Topics
- Autoresearch
- LLM Agents
- Code Optimization
- Prompt Engineering
- Self-Play Algorithms
Code references
- WecoAI/awesome-autoresearch
- karpathy/autoresearch
- Shopify/liquid
- davebcn87/pi-autoresearch
- RightNow-AI/autokernel
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.