NanoChat – The best ChatGPT that $100 can buy
TL;DR Highlight
Andrej Karpathy's LLM training framework: train a GPT-2-level model from scratch in ~4 hours ($100 or less) on 8xH100 GPUs and chat with it via a ChatGPT-style web UI.
Who Should Read
ML engineers who want to understand LLM internals by building one hands-on, or developers wanting to train small domain-specific models from scratch.
Core Mechanics
- nanochat is an experimental framework covering the full LLM pipeline — tokenizer → pretraining → fine-tuning → evaluation → inference → chat UI — in a single codebase. Minimal code designed for easy hacking.
- Core philosophy: turn a single '--depth' knob (transformer layer count), and all other hyperparameters (width, head count, learning rate, weight decay) are derived automatically to stay compute-optimal. GPT-2 level is roughly depth 26.
- In 2019, training GPT-2 cost ~$43,000. With nanochat on 8xH100, it takes ~2 hours and $48. Spot instances can bring it to $15.
- Lineage: Karpathy's nanoGPT → Keller Jordan's modded-nanoGPT (extreme training speed optimization) → nanochat. Influenced by Muon optimizer (for linear layers, replacing AdamW) from modded-nanoGPT.
- The project runs a GPT-2 Speedrun Leaderboard where the community competes on how fast a GPT-2-level model can be trained; the evaluation metric is the DCLM CORE score.
- Karpathy himself disclosed writing nearly 100% of the code manually — he tried Claude/Codex a few times but found them unhelpful because 'the code is too far from existing data distributions.'
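The single-knob idea above can be sketched in a few lines. The constants and names here ('head_dim', 'aspect_ratio', the learning-rate rule) are illustrative assumptions, not nanochat's actual scaling formulas:

```python
# Hypothetical sketch: derive all hyperparameters from one 'depth' knob.
# The specific scaling constants are assumptions for illustration only.

def config_from_depth(depth: int, head_dim: int = 64, aspect_ratio: int = 64) -> dict:
    """Derive model width, head count, and learning rate from depth alone."""
    width = depth * aspect_ratio            # width grows linearly with depth
    n_heads = width // head_dim             # keep per-head dimension fixed
    lr = 0.02 / (width / 768) ** 0.5        # scale LR down as the model widens
    return {"depth": depth, "width": width, "n_heads": n_heads, "lr": lr}

cfg = config_from_depth(26)  # roughly GPT-2 scale, per the article
```

The point is the interface, not the constants: a user tunes one integer, and everything else follows deterministically from it.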
Evidence
- The '$100' title was called misleading — it actually means $100 for 8xH100 cloud node rental, not local execution. Some were disappointed expecting local capability.
- Karpathy's disclosure about AI coding tools being unhelpful went viral — cited as evidence that AI coding agents still struggle with original code far from training data distributions.
- A user ran the training live, sharing W&B links in real time and promising to release the model 4 hours later, showing the community actively participating in experiments.
- One user wanted to train on a personal CPU even if it took 3 months, but the consensus was that meaningful results are unrealistic without GPUs.
How to Apply
- To experience the full LLM training pipeline end-to-end, nanochat's speedrun.sh script runs everything from pretraining to chat UI. Rent 8xH100 spot instances from Lambda Labs or RunPod for $15-48.
- For in-house small domain-specific model experiments, vary nanochat's '--depth' parameter to quickly compare models of different sizes. Lower depth dramatically cuts cost, ideal for prototyping.
- If researching LLM training optimization, join the GPT-2 Speedrun Leaderboard to experiment with training speed improvement techniques (Muon optimizer, custom schedulers) and share with the community.
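Because spot instances can be reclaimed at any moment, a checkpoint-and-resume loop is the key pattern when applying the advice above. A minimal sketch, using a hypothetical JSON state file in place of real model weights:

```python
import json
import os

CKPT = "ckpt.json"  # hypothetical checkpoint path, stands in for real weights

def load_state() -> dict:
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0}

def save_state(state: dict) -> None:
    """Write atomically so a preemption mid-write can't corrupt the file."""
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)  # atomic rename on POSIX and Windows

state = load_state()
for step in range(state["step"], 100):
    # ... one training step would run here ...
    state["step"] = step + 1
    if state["step"] % 25 == 0:  # checkpoint every 25 steps
        save_state(state)
```

If the instance is reclaimed, rerunning the script picks up from the last saved step instead of restarting from zero.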
Code Example
# nanochat quick start (on 8xH100 node)
git clone https://github.com/karpathy/nanochat.git
cd nanochat
bash speedrun.sh # pre-training → inference → chat UI all at once
# adjust model size with depth parameter (GPT-2 scale = depth 26)
python nanochat/train.py --depth 26
Terminology
DCLM CORE score: A benchmark score measuring an LLM's language understanding. Used as the baseline for deciding whether a model is "GPT-2 level."
Muon optimizer: An optimization algorithm that replaces AdamW for transformer linear layers, significantly speeding up training.
BPB (Bits Per Byte): A metric of how well a model predicts text, normalized per byte. Lower numbers mean more accurate next-byte prediction.
compute-optimal: Balancing model size and training data so performance is maximized for a fixed compute budget (GPU time/cost). The idea comes from the Chinchilla paper.
spot instance: Cheaply rented cloud GPU capacity drawn from idle resources. It can be reclaimed at any time, so checkpoint saving is crucial.
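To make the BPB definition above concrete, here is a minimal sketch. It assumes the model's summed cross-entropy loss over a text span is available in nats, which is the usual convention but an assumption here:

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) over a text span
    into bits per byte of that span's UTF-8 encoding."""
    return total_loss_nats / math.log(2) / total_bytes

text = "hello world"
# Suppose the model assigned a total cross-entropy of 15.0 nats to this text
# (an invented number for illustration).
bpb = bits_per_byte(15.0, len(text.encode("utf-8")))
```

Dividing by ln(2) converts nats to bits, and dividing by the byte count normalizes across tokenizers, which is why BPB is comparable between models with different vocabularies.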