karpathy / nanochat

2025-10-13 · Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

nanochat is an experimental, minimal, and hackable harness for training Large Language Models (LLMs) on a single GPU node. It covers tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. A GPT-2-capability LLM can be trained for approximately $48 in about 2 hours on an 8xH100 GPU node, down from the $43,000 it cost in 2019. Model configuration is reduced to a single `--depth` parameter, from which the other hyperparameters are set automatically to yield a compute-optimal model. Development currently focuses on the pretraining stage: a public leaderboard tracks "time to GPT-2", measured by the DCLM CORE score, where the target to beat is the GPT-2 CORE score of 0.256525. The project emphasizes accessibility and cost-effectiveness for micro-models.
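The cost figures above imply an hourly rate and a reduction factor that are easy to check by hand. The following is a back-of-the-envelope sketch using only the approximate numbers quoted in this summary; the variable names are illustrative, and the implied rates are derived values, not prices stated by the project.

```python
# Approximate figures quoted in the summary above.
RUN_COST_USD = 48.0        # ~$48 per GPT-2-capability training run
RUN_HOURS = 2.0            # ~2 hours of wall-clock time
GPUS_PER_NODE = 8          # one 8xH100 node
COST_2019_USD = 43_000.0   # quoted 2019 training cost

# Implied rental rate for the whole node and per GPU.
node_rate = RUN_COST_USD / RUN_HOURS
gpu_rate = node_rate / GPUS_PER_NODE

# How many times cheaper the run is than the 2019 figure.
reduction = COST_2019_USD / RUN_COST_USD

print(f"implied node rate: ${node_rate:.2f} per hour")
print(f"implied GPU rate:  ${gpu_rate:.2f} per GPU-hour")
print(f"cost reduction:    about {reduction:.0f}x")
```

Under these numbers the run works out to about $24 per node-hour, or $3 per GPU-hour, and a reduction of roughly 900x.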

Key takeaway

For AI Engineers and Research Scientists focused on efficient LLM development, nanochat provides a streamlined, low-cost platform for training and experimenting with models up to GPT-2 capability. Its single-parameter configuration and optimized pretraining cut both training cost and iteration time, which makes LLM research at this scale more accessible. Consider contributing to the "time to GPT-2" leaderboard to benchmark your optimizations.

Key insights

nanochat offers a minimal, cost-effective LLM training harness for rapid experimentation and GPT-2 level model development.

Principles

Simplify LLM configuration via a single complexity dial.
Optimize for compute-efficiency across model sizes.
Gamify development with public performance leaderboards.
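To make the first principle concrete, here is a minimal sketch of how a single depth value could determine the rest of a model's shape. The mapping is hypothetical: `config_from_depth`, `aspect_ratio`, and `head_dim`, along with their default values, are assumptions chosen for illustration and are not taken from the nanochat source, whose actual rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    depth: int       # number of transformer layers (the single dial)
    model_dim: int   # width of the residual stream
    num_heads: int   # attention heads


def config_from_depth(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> ModelConfig:
    """Derive width and head count from depth (hypothetical rule, for illustration)."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    # Width grows linearly with depth, then is rounded up to a whole
    # number of heads so that model_dim is divisible by num_heads.
    raw_dim = depth * aspect_ratio
    num_heads = -(-raw_dim // head_dim)  # ceiling division
    model_dim = num_heads * head_dim
    return ModelConfig(depth=depth, model_dim=model_dim, num_heads=num_heads)


for d in (4, 12, 20):
    print(config_from_depth(d))
```

The benefit of this design is that the user picks one number and every dependent dimension stays mutually consistent, so a sweep over model sizes is a sweep over a single integer.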

Method

Train LLMs by setting `--depth`, which automatically configures the remaining hyperparameters for a compute-optimal model. For performance tuning, monitor `val_bpb`, `core_metric`, VRAM usage, `train/mfu`, and `train/tok_per_sec`.
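Two of the monitored quantities have standard definitions that are worth seeing in code. The sketch below assumes that `val_bpb` denotes validation bits per byte and that `train/mfu` denotes model FLOPs utilization. The formulas are the conventional ones; the function names and sample inputs are illustrative, and the summary does not state how nanochat itself computes these values.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) into bits per byte of text.

    Normalizing by bytes rather than tokens makes the number comparable
    across tokenizers with different vocabularies.
    """
    return total_loss_nats / (math.log(2) * total_bytes)


def model_flops_utilization(tokens_per_sec: float,
                            flops_per_token: float,
                            peak_flops_per_sec: float) -> float:
    """Fraction of the hardware's peak FLOP rate actually achieved."""
    return tokens_per_sec * flops_per_token / peak_flops_per_sec


# Illustrative inputs only, not measurements from a real run.
print(bits_per_byte(total_loss_nats=700.0, total_bytes=1000))
print(model_flops_utilization(tokens_per_sec=1_000.0,
                              flops_per_token=2.0,
                              peak_flops_per_sec=4_000.0))
```

Lower bits per byte means better compression of the validation text, while MFU closer to 1.0 means less of the accelerator's capacity is left idle.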

In practice

Train a GPT-2-capability model for ~$48 in about 2 hours on an 8xH100 GPU node.
Reduce `--device_batch_size` for GPUs with less than 80GB VRAM.
Use `NANOCHAT_DTYPE` to explicitly control precision (e.g., `bfloat16`, `float32`).
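The last two tips can be illustrated with a small sketch. It assumes that a smaller per-device batch is compensated by more gradient accumulation steps so the effective batch stays fixed, and that `NANOCHAT_DTYPE` is read from the environment. The helper names, the default value, the accepted set of dtype names, and the batch and sequence sizes are all hypothetical; only the variable name and the two example values come from the text above.

```python
import os


def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, world_size: int) -> int:
    """Micro-steps needed per optimizer step to reach the target batch size."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of one micro-step")
    return total_batch_tokens // tokens_per_micro_step


def resolve_dtype(env=None, default: str = "bfloat16") -> str:
    """Read NANOCHAT_DTYPE from a mapping (os.environ by default) and validate it."""
    env = os.environ if env is None else env
    name = env.get("NANOCHAT_DTYPE", default).strip().lower()
    allowed = {"bfloat16", "float32"}  # only the values named in the text
    if name not in allowed:
        raise ValueError(f"unsupported NANOCHAT_DTYPE: {name!r}")
    return name


# Halving the per-device batch doubles the accumulation steps,
# which lowers peak VRAM while keeping the effective batch unchanged.
for dbs in (32, 16, 8, 4):
    steps = grad_accum_steps(524_288, dbs, seq_len=2048, world_size=8)
    print(f"device_batch_size={dbs:>2} -> grad_accum_steps={steps}")

print(resolve_dtype({"NANOCHAT_DTYPE": "float32"}))
```

The trade-off is time for memory: each halving of the device batch roughly halves activation memory per step but serializes twice as many forward and backward passes.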

Topics

LLM Training
GPT-2 Models
GPU Acceleration
Model Optimization
Experimental LLM Harness

Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.