NanoChat – The best ChatGPT that $100 can buy
TL;DR Highlight
Andrej Karpathy's LLM training framework: train a GPT-2-level model from scratch in ~4 hours ($100 or less) on 8xH100 GPUs and chat with it via a ChatGPT-style web UI.
Who Should Read
ML engineers who want to understand LLM internals by building one hands-on, or developers wanting to train small domain-specific models from scratch.
Core Mechanics
- nanochat is an experimental framework covering the full LLM pipeline — tokenizer → pretraining → fine-tuning → evaluation → inference → chat UI — in a single codebase. Minimal code designed for easy hacking.
- Core philosophy: turn a single '--depth' knob (transformer layer count), and all other hyperparameters (width, head count, learning rate, weight decay) are derived automatically to stay compute-optimal. GPT-2 level is roughly depth 26.
- In 2019, training GPT-2 cost ~$43,000. With nanochat on 8xH100, it takes ~2 hours and $48. Spot instances can bring it to $15.
- Lineage: Karpathy's nanoGPT → Keller Jordan's modded-nanoGPT (extreme training speed optimization) → nanochat. Influenced by Muon optimizer (for linear layers, replacing AdamW) from modded-nanoGPT.
- The project runs a GPT-2 Speedrun Leaderboard where the community competes on how fast a GPT-2-level model can be trained; the evaluation metric is the DCLM CORE score.
- Karpathy himself disclosed writing nearly 100% of the code manually — he tried Claude/Codex a few times but found them unhelpful because 'the code is too far from existing data distributions.'
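The single-knob idea above can be sketched in a few lines. The constants and names here ('head_dim', 'aspect_ratio', the learning-rate rule) are illustrative assumptions, not nanochat's actual scaling formulas:

```python
# Hypothetical sketch: derive all hyperparameters from one 'depth' knob.
# The specific scaling constants are assumptions for illustration only.

def config_from_depth(depth: int, head_dim: int = 64, aspect_ratio: int = 64) -> dict:
    """Derive model width, head count, and learning rate from depth alone."""
    width = depth * aspect_ratio            # width grows linearly with depth
    n_heads = width // head_dim             # keep per-head dimension fixed
    lr = 0.02 / (width / 768) ** 0.5        # scale LR down as the model widens
    return {"depth": depth, "width": width, "n_heads": n_heads, "lr": lr}

cfg = config_from_depth(26)  # roughly GPT-2 scale, per the article
```

The point is the interface, not the constants: a user tunes one integer, and everything else follows deterministically from it.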
Evidence
- The '$100' title was called misleading — it actually means $100 for 8xH100 cloud node rental, not local execution. Some were disappointed expecting local capability.
- Karpathy's disclosure about AI coding tools being unhelpful went viral — cited as evidence that AI coding agents still struggle with original code far from training data distributions.
- A user ran the training live, sharing W&B links in real time and promising to release the model 4 hours later, showing the community actively participating in experiments.
- One user wanted to train on a personal CPU even if it took 3 months, but the consensus was that meaningful results are unrealistic without GPUs.
How to Apply
- To experience the full LLM training pipeline end-to-end, nanochat's speedrun.sh script runs everything from pretraining to chat UI. Rent 8xH100 spot instances from Lambda Labs or RunPod for $15-48.
- For in-house small domain-specific model experiments, vary nanochat's '--depth' parameter to quickly compare models of different sizes. Lower depth dramatically cuts cost, ideal for prototyping.
- If researching LLM training optimization, join the GPT-2 Speedrun Leaderboard to experiment with training speed improvement techniques (Muon optimizer, custom schedulers) and share with the community.
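Because spot instances can be reclaimed at any moment, a checkpoint-and-resume loop is the key pattern when applying the advice above. A minimal sketch, using a hypothetical JSON state file in place of real model weights:

```python
import json
import os

CKPT = "ckpt.json"  # hypothetical checkpoint path, stands in for real weights

def load_state() -> dict:
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0}

def save_state(state: dict) -> None:
    """Write atomically so a preemption mid-write can't corrupt the file."""
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)  # atomic rename on POSIX and Windows

state = load_state()
for step in range(state["step"], 100):
    # ... one training step would run here ...
    state["step"] = step + 1
    if state["step"] % 25 == 0:  # checkpoint every 25 steps
        save_state(state)
```

If the instance is reclaimed, rerunning the script picks up from the last saved step instead of restarting from zero.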
Code Example
# nanochat quick start (on 8xH100 node)
git clone https://github.com/karpathy/nanochat.git
cd nanochat
bash speedrun.sh # pre-training → inference → chat UI all at once
# adjust model size with depth parameter (GPT-2 scale = depth 26)
python nanochat/train.py --depth 26
Terminology
DCLM CORE score: A benchmark score measuring an LLM's language understanding. Used as the baseline for deciding whether a model is "GPT-2 level."
Muon optimizer: An optimization algorithm that replaces AdamW for transformer linear layers, significantly speeding up training.
BPB (Bits Per Byte): A metric of how well a model predicts text, normalized per byte. Lower numbers mean more accurate next-byte prediction.
compute-optimal: Balancing model size and training data so performance is maximized for a fixed compute budget (GPU time/cost). The idea comes from the Chinchilla paper.
spot instance: Cheaply rented cloud GPU capacity drawn from idle resources. It can be reclaimed at any time, so checkpoint saving is crucial.
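To make the BPB definition above concrete, here is a minimal sketch. It assumes the model's summed cross-entropy loss over a text span is available in nats, which is the usual convention but an assumption here:

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) over a text span
    into bits per byte of that span's UTF-8 encoding."""
    return total_loss_nats / math.log(2) / total_bytes

text = "hello world"
# Suppose the model assigned a total cross-entropy of 15.0 nats to this text
# (an invented number for illustration).
bpb = bits_per_byte(15.0, len(text.encode("utf-8")))
```

Dividing by ln(2) converts nats to bits, and dividing by the byte count normalizes across tokenizers, which is why BPB is comparable between models with different vocabularies.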