Applying Karpathy's autoresearch to a 33M-token public transit dataset (14% improvement, replication notes) [P]
Summary
A transit industry professional applied Andrej Karpathy's autoresearch framework to a 33-million-token public transit dataset, training an 80M-parameter model from scratch on a single RTX 5080 GPU. The project tested whether the LLM-driven research loop, designed for web-scale data, could still deliver significant perplexity reductions on limited, specialized data. Key modifications included switching to SDPA-only attention, consolidating architectural controls into two knobs (TARGET_PARAMS_M, ASPECT_RATIO), and implementing a "hidden-gate Ladder protocol" that keeps held-out validation scores out of the agent's direct view. The experiment yielded a 14% improvement in language modeling performance, primarily from halving the batch size twice, which increased the number of training updates by 3.6x within the 5-minute budget. The 80M-parameter model size proved optimal, and the hidden-gate protocol caught two false positives that would otherwise have been accepted.
Key takeaway
AI scientists and machine learning engineers working with smaller, specialized datasets should consider adapting frameworks like autoresearch for autonomous experimentation. Focus on robust validation mechanisms, such as hidden-gate protocols and multi-seed replication, to distinguish genuine improvements from statistical noise, especially when optimizing domain-specific accuracy metrics.
Key insights
Autoresearch can improve language models on small, specialized datasets, but it requires careful validation to avoid false positives.
Principles
- Under a fixed time budget, smaller batch sizes allow more training updates, which can improve performance.
- Hidden validation gates keep agents from overfitting to the held-out metric.
- Baseline runs establish a noise floor for metric changes.
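The batch-size principle comes down to simple arithmetic, sketched below in Python. The starting batch size and throughput are hypothetical placeholders; only the 5-minute budget, the two halvings, and the 3.6x figure come from the summary. The implied throughput loss is an inference that assumes the extra updates came from the batch change alone.

```python
def updates_in_budget(tokens_per_second: float,
                      budget_seconds: float,
                      batch_tokens: int) -> float:
    """Optimizer updates that fit in a fixed wall-clock budget."""
    return tokens_per_second * budget_seconds / batch_tokens


BUDGET_SECONDS = 5 * 60        # 5-minute budget (from the summary)
BASE_BATCH_TOKENS = 524_288    # hypothetical starting batch size, in tokens
BASE_THROUGHPUT = 100_000.0    # hypothetical tokens processed per second

base_updates = updates_in_budget(BASE_THROUGHPUT, BUDGET_SECONDS, BASE_BATCH_TOKENS)

# Halving the batch twice divides it by 4. If throughput stayed constant,
# the number of updates in the same budget would be exactly 4x.
ideal_updates = updates_in_budget(BASE_THROUGHPUT, BUDGET_SECONDS, BASE_BATCH_TOKENS // 4)
ideal_ratio = ideal_updates / base_updates

# The summary reports 3.6x rather than 4x. Under the assumption above,
# that gap corresponds to running at this fraction of the original throughput.
reported_ratio = 3.6
implied_throughput_fraction = reported_ratio / ideal_ratio

print(f"ideal update ratio: {ideal_ratio:.1f}x")
print(f"implied throughput fraction: {implied_throughput_fraction:.2f}")
```

The takeaway is that smaller batches trade some hardware efficiency for more updates, and within a fixed budget that trade can still come out ahead.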
Method
In the autoresearch framework, an agent edits a training script, runs a 5-minute experiment, and commits or reverts the change based on a single scalar metric. This project modified that loop for small datasets.
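The commit-or-revert loop described here can be sketched in a few lines of Python. This is a toy illustration, not the framework's actual code: the LLM agent, the training run, and version control are replaced by a stub, the function and proposal names are hypothetical, the scores are made up, and a lower metric is assumed to be better.

```python
from typing import Callable, Dict, List, Tuple


def research_loop(baseline_metric: float,
                  proposals: List[str],
                  run_experiment: Callable[[str], float]) -> Tuple[float, List[str]]:
    """Keep a proposed change only if it improves the single scalar metric."""
    best = baseline_metric
    log: List[str] = []
    for name in proposals:
        metric = run_experiment(name)   # stands in for one 5-minute training run
        if metric < best:               # assumption: lower is better
            best = metric
            log.append(f"commit {name}")
        else:
            log.append(f"revert {name}")
    return best, log


# Made-up scores standing in for real training runs.
FAKE_SCORES: Dict[str, float] = {
    "halve_batch": 0.95,
    "bigger_model": 1.02,
    "halve_batch_again": 0.90,
}

best_metric, decisions = research_loop(
    baseline_metric=1.00,
    proposals=list(FAKE_SCORES),
    run_experiment=FAKE_SCORES.__getitem__,
)
print(best_metric)
print(decisions)
```

Because the loop accepts any change that beats the current best, a single lucky run is enough to pass, which is why the validation steps matter.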
In practice
- Run multiple baseline experiments with different random seeds.
- Re-run winning experiments with fresh random seeds for validation.
- Implement a hidden gate for held-out validation scores so the agent cannot game them.
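The three practices above can be combined in one minimal Python sketch. It is not a reconstruction of the author's hidden-gate Ladder protocol, whose details the summary does not give. The class name, the two-standard-deviation threshold, and all scores are assumptions for illustration, and a lower score is assumed to be better.

```python
import statistics
from typing import Sequence


def noise_floor(baseline_scores: Sequence[float], k: float = 2.0) -> float:
    """Smallest improvement treated as real: k sample standard deviations
    of the baseline metric across random seeds (k=2 is an assumption)."""
    return k * statistics.stdev(baseline_scores)


class HiddenGate:
    """Holds the held-out scores privately and exposes only pass/fail."""

    def __init__(self, best_heldout: float, floor: float) -> None:
        self._best = best_heldout
        self._floor = floor

    def accept(self, heldout_scores: Sequence[float]) -> bool:
        # Average over fresh-seed re-runs of the candidate change.
        mean_score = statistics.mean(heldout_scores)
        passed = (self._best - mean_score) > self._floor
        if passed:
            self._best = mean_score
        return passed  # the agent sees this boolean, never the score itself


# Made-up baseline runs with five different seeds.
baseline = [1.000, 1.004, 0.996, 1.002, 0.998]
gate = HiddenGate(best_heldout=statistics.mean(baseline),
                  floor=noise_floor(baseline))

# A small apparent win that sits inside the noise floor is rejected.
noisy_ok = gate.accept([0.997, 0.999, 0.998])
# A larger win that holds up across fresh seeds is accepted.
real_ok = gate.accept([0.950, 0.952, 0.949])
print(noisy_ok, real_ok)
```

Returning only a boolean limits how much information about the held-out set leaks back into the agent's search.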
Topics
- Karpathy's Autoresearch
- Public Transit Dataset
- Language Modeling
- Batch Size Optimization
- Small Data Training
Best for: AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.