karpathy / nanochat
Summary
nanochat is an experimental, minimal, and hackable harness for training Large Language Models (LLMs) on a single GPU node. It covers tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. A GPT-2-capability model can be trained for approximately $48 in about 2 hours on an 8xH100 GPU node, compared with the $43,000 it cost in 2019. Model configuration is reduced to a single `--depth` parameter, from which the other hyperparameters are set automatically to yield a compute-optimal model. A key development focus is optimizing the pretraining stage: a public leaderboard tracks "time to GPT-2", measured with the DCLM CORE score, where the target is to beat GPT-2's CORE score of 0.256525. The project emphasizes accessibility and cost-effectiveness for micro-models.
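The cost figures quoted above can be unpacked with a little arithmetic. The sketch below uses only the numbers from the summary; the function names are illustrative, and the per-GPU-hour rate is merely what those numbers imply, not a quoted price.

```python
# Figures quoted in the summary (all approximate).
NANOCHAT_COST_USD = 48.0        # cost of one GPT-2-capability run
RUN_HOURS = 2.0                 # wall-clock duration of that run
NUM_GPUS = 8                    # one 8xH100 node
GPT2_2019_COST_USD = 43_000.0   # quoted 2019 training cost


def implied_gpu_hour_rate(cost_usd: float, hours: float, num_gpus: int) -> float:
    """Dollar cost per GPU-hour implied by a total cost and run length."""
    return cost_usd / (hours * num_gpus)


def cost_reduction_factor(old_cost_usd: float, new_cost_usd: float) -> float:
    """How many times cheaper the new run is than the old one."""
    return old_cost_usd / new_cost_usd


rate = implied_gpu_hour_rate(NANOCHAT_COST_USD, RUN_HOURS, NUM_GPUS)
factor = cost_reduction_factor(GPT2_2019_COST_USD, NANOCHAT_COST_USD)

print(f"Implied rate: ${rate:.2f} per GPU-hour")
print(f"Cost reduction vs. 2019: about {factor:.0f}x")
```

Under these figures the run works out to about $3 per GPU-hour and is roughly 900 times cheaper than the quoted 2019 cost.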
Key takeaway
For AI Engineers and Research Scientists focused on efficient LLM development, nanochat provides a streamlined, cost-effective platform to train and experiment with models up to GPT-2 capability. Its single-parameter configuration and optimized pretraining lower training cost and shorten iteration cycles, making LLM research more accessible. Consider contributing to the "time to GPT-2" leaderboard to benchmark your optimizations.
Key insights
nanochat offers a minimal, cost-effective LLM training harness for rapid experimentation and GPT-2-level model development.
Principles
- Simplify LLM configuration via a single complexity dial.
- Optimize for compute-efficiency across model sizes.
- Gamify development with public performance leaderboards.
Method
Train LLMs by setting `--depth`; the remaining hyperparameters are then configured automatically for a compute-optimal model. Monitor `val_bpb`, `core_metric`, VRAM usage, `train/mfu`, and `train/tok_per_sec` when tuning performance.
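One way to read `core_metric` against the leaderboard target is sketched below. It assumes "time to GPT-2" means the elapsed time at the first evaluation whose CORE score beats 0.256525; that interpretation, the function name, and the sample log are all hypothetical and not taken from nanochat.

```python
from typing import Optional, Sequence, Tuple

GPT2_CORE_SCORE = 0.256525  # GPT-2 CORE score quoted in the summary


def time_to_gpt2(
    evals: Sequence[Tuple[float, float]],
    target: float = GPT2_CORE_SCORE,
) -> Optional[float]:
    """Return the elapsed hours at the first evaluation that beats `target`.

    `evals` holds (elapsed_hours, core_metric) pairs in chronological order.
    Returns None if the target is never beaten.
    """
    for elapsed_hours, core_metric in evals:
        if core_metric > target:  # strictly better than GPT-2
            return elapsed_hours
    return None


# Made-up evaluation log for illustration only (not real nanochat results).
example_log = [(0.5, 0.14), (1.0, 0.21), (1.5, 0.25), (2.0, 0.26)]
print(time_to_gpt2(example_log))
```

With the made-up log above, the first score to exceed the target is the one recorded at the 2.0-hour mark.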
In practice
- Train a GPT-2-capability model for ~$48 in about 2 hours on an 8xH100 GPU node.
- Reduce `--device_batch_size` for GPUs with less than 80GB VRAM.
- Use `NANOCHAT_DTYPE` to explicitly control precision (e.g., `bfloat16`, `float32`).
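Reducing `--device_batch_size` shrinks the number of tokens each GPU processes per forward/backward pass, so keeping the same effective batch per optimizer step requires more gradient accumulation. The sketch below shows only that arithmetic. The function name, the total batch size of 524,288 tokens, and the sequence length of 2,048 are hypothetical values chosen for illustration, and whether nanochat compensates this way is an assumption here.

```python
def grad_accum_steps(
    total_batch_tokens: int,
    device_batch_size: int,
    seq_len: int,
    num_gpus: int,
) -> int:
    """Micro-steps needed per optimizer step to reach `total_batch_tokens`."""
    # Tokens processed across all GPUs in one forward/backward pass.
    tokens_per_micro_step = device_batch_size * seq_len * num_gpus
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError(
            "total_batch_tokens must be a multiple of "
            "device_batch_size * seq_len * num_gpus"
        )
    return total_batch_tokens // tokens_per_micro_step


# Hypothetical settings: 524,288 tokens per step, 2,048-token sequences.
TOTAL_BATCH_TOKENS = 524_288
SEQ_LEN = 2_048

for device_batch_size in (32, 16, 8, 4):
    steps = grad_accum_steps(TOTAL_BATCH_TOKENS, device_batch_size, SEQ_LEN, num_gpus=8)
    print(f"device_batch_size={device_batch_size:>2} -> {steps} accumulation step(s)")
```

Halving the per-device batch size doubles the accumulation steps, which trades memory for wall-clock time while leaving the effective batch unchanged.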
Topics
- LLM Training
- GPT-2 Models
- GPU Acceleration
- Model Optimization
- Experimental LLM Harness
Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by GitHub Trending: All languages.