Applying Karpathy's autoresearch to a 33M-token public transit dataset (14% improvement, replication notes) [P]

2026-04-30 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Intermediate, medium

Summary

A transit industry professional applied Andrej Karpathy's autoresearch framework to a 33-million-token public transit dataset, training an 80M-parameter model from scratch on a single RTX 5080 GPU. The project tested whether the LLM-driven research loop, designed for web-scale data, could still achieve significant perplexity reductions with limited, specialized data. Key modifications included switching to SDPA-only attention, consolidating architectural controls into two knobs (TARGET_PARAMS_M, ASPECT_RATIO), and implementing a "hidden-gate Ladder protocol" that keeps the agent from directly seeing held-out validation scores. The experiment yielded a 14% improvement in language modeling performance, primarily from halving the batch size twice, which increased the number of training updates by 3.6x within the 5-minute budget. The 80M-parameter model size performed best, and the hidden-gate protocol caught two false positives that would otherwise have been accepted.
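The batch-size arithmetic in the summary can be checked with a short sketch. Under a fixed wall-clock budget, halving the batch twice would give 4x as many updates if token throughput stayed constant; the reported 3.6x is consistent with throughput falling to roughly 90% at the smaller batch. That explanation is an assumption made for illustration (the summary does not say where the gap comes from), and the function names are hypothetical.

```python
def update_ratio(halvings: int, throughput_ratio: float) -> float:
    """Updates at the smaller batch relative to the original batch,
    for the same wall-clock budget.

    halvings: number of times the batch size was halved.
    throughput_ratio: tokens/sec at the smaller batch divided by
        tokens/sec at the original batch (assumed, not from the source).
    """
    return (2 ** halvings) * throughput_ratio


def implied_throughput_ratio(halvings: int, observed_update_ratio: float) -> float:
    """Throughput ratio that would explain an observed update ratio."""
    return observed_update_ratio / (2 ** halvings)


# Ideal case: two halvings with no throughput loss.
print(update_ratio(2, 1.0))
# Throughput ratio implied by the reported 3.6x increase in updates.
print(implied_throughput_ratio(2, 3.6))
```

If the shortfall instead comes from fixed per-run overhead (compilation, evaluation), the same 3.6x figure would imply a different breakdown, so the ratio is best treated as a rough diagnostic.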

Key takeaway

AI scientists and machine learning engineers working with small, specialized datasets should consider adapting frameworks like autoresearch for autonomous experimentation. Focus on robust validation mechanisms, such as hidden-gate protocols and multi-seed replication, to distinguish genuine improvements from statistical noise, especially when optimizing for domain-specific accuracy metrics.

Key insights

Autoresearch can improve language models on small, specialized datasets, but it requires careful validation to avoid false positives.

Principles

Under a fixed time budget, smaller batch sizes allow more training updates, which can improve performance.
Hiding held-out validation scores behind a gate keeps the agent from overfitting to the metric.
Repeated baseline runs establish a noise floor against which metric changes can be judged.
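The noise-floor principle can be made concrete with a minimal sketch. It assumes a lower-is-better metric (such as validation loss) and treats a change as real only if it beats the baseline mean by more than k sample standard deviations. The scores, the threshold k=2, and the function names are illustrative assumptions, not values from the original project.

```python
import statistics


def noise_floor(baseline_scores, k=2.0):
    """Return (baseline mean, k * sample standard deviation).

    baseline_scores: metric values from identical runs that differ
        only in random seed (needs at least two values).
    """
    mean = statistics.mean(baseline_scores)
    spread = statistics.stdev(baseline_scores)
    return mean, k * spread


def beats_noise_floor(candidate_score, baseline_scores, k=2.0):
    """True if the candidate improves on the baseline mean by more
    than the noise floor (lower score is better)."""
    mean, floor = noise_floor(baseline_scores, k)
    return (mean - candidate_score) > floor


# Hypothetical baseline losses from five seeds.
baseline = [3.50, 3.52, 3.48, 3.51, 3.49]
print(beats_noise_floor(3.49, baseline))  # small gain, inside the noise
print(beats_noise_floor(3.40, baseline))  # larger gain, outside the noise
```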

Method

In the autoresearch framework, an agent edits a training script, runs a 5-minute experiment, and commits or reverts the change based on a single scalar metric. This project kept that loop and modified the setup for a small dataset.
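The edit, run, commit-or-revert cycle described above can be sketched as a plain loop. This is a simplified illustration rather than the framework's actual code: the callables propose, run_experiment, commit, and revert are hypothetical stand-ins for the agent, the 5-minute training run, and the version-control steps, and the metric is assumed to be lower-is-better.

```python
def research_loop(propose, run_experiment, commit, revert, best_score, n_iters):
    """Keep a change only if it improves the single scalar metric.

    Returns the final best score and a history of
    (change, score, accepted) tuples.
    """
    history = []
    for _ in range(n_iters):
        change = propose()              # agent edits the training script
        score = run_experiment(change)  # fixed-budget training run
        accepted = score < best_score
        if accepted:
            commit(change)
            best_score = score
        else:
            revert(change)
        history.append((change, score, accepted))
    return best_score, history


# Stubbed demo with made-up changes and scores.
fake_scores = {"smaller batch": 3.40, "wider model": 3.60, "lr tweak": 3.30}
proposals = iter(fake_scores)
committed, reverted = [], []

best, log = research_loop(
    propose=lambda: next(proposals),
    run_experiment=lambda change: fake_scores[change],
    commit=committed.append,
    revert=reverted.append,
    best_score=3.50,
    n_iters=3,
)
print(best, committed, reverted)
```

Accepting on any improvement, as this loop does, is exactly what makes noise-driven false positives possible, which is why the validation steps matter.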

In practice

Run multiple baseline experiments with different random seeds.
Re-run winning experiments with fresh random seeds for validation.
Implement a hidden gate for validation scores so the agent cannot game them.
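One way a hidden gate could work is sketched below. The summary does not spell out the project's protocol, so this is an assumption-laden illustration: the gate evaluates each candidate on held-out data, keeps the score private, and returns only an accept or reject decision, accepting when the gain over the current best meets a minimum margin. The class name, the margin, and the scores are hypothetical.

```python
class HiddenGate:
    """Holds held-out scores privately and exposes only accept/reject."""

    def __init__(self, eval_heldout, initial_best, min_gain):
        self._eval_heldout = eval_heldout  # callable: candidate -> score
        self._best = initial_best          # lower is better
        self._min_gain = min_gain          # e.g. set from the noise floor

    def submit(self, candidate) -> bool:
        """Return True if the candidate clears the gate.

        The held-out score itself is never returned to the caller.
        """
        score = self._eval_heldout(candidate)
        if self._best - score >= self._min_gain:
            self._best = score
            return True
        return False


# Hypothetical held-out losses for three candidate changes.
heldout = {"x": 3.49, "y": 3.40, "z": 3.39}
gate = HiddenGate(heldout.get, initial_best=3.50, min_gain=0.03)
decisions = [gate.submit(name) for name in ("x", "y", "z")]
print(decisions)
```

Because the agent sees only one bit per submission, it has far less signal to overfit the held-out set than it would with raw scores. Re-running accepted candidates with fresh seeds before submitting them adds a second check.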

Topics

Karpathy's Autoresearch
Public Transit Dataset
Language Modeling
Batch Size Optimization
Small Data Training

Best for: AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.