NanoChat – The best ChatGPT that $100 can buy
TL;DR Highlight
Andrej Karpathy's LLM training framework: train a GPT-2-level model from scratch in ~4 hours ($100 or less) on 8xH100 GPUs and chat with it via a ChatGPT-style web UI.
Who Should Read
ML engineers who want to understand LLM internals by building one hands-on, or developers wanting to train small domain-specific models from scratch.
Core Mechanics
- nanochat is an experimental framework that covers the full LLM pipeline (tokenizer → pretraining → fine-tuning → evaluation → inference → chat UI) in a single codebase. The code is kept minimal so that it is easy to hack on.
- Core philosophy: set a single '--depth' knob (the number of transformer layers) and all other hyperparameters (width, heads, learning rate, weight decay) are derived automatically to be compute-optimal. A GPT-2-level model is roughly depth 26.
- In 2019, training GPT-2 cost about $43,000. With nanochat on an 8xH100 node, a GPT-2-level run takes about 2 hours and costs about $48; spot instances can bring that down to about $15.
- Lineage: Karpathy's nanoGPT → Keller Jordan's modded-nanoGPT (extreme training speed optimization) → nanochat. nanochat is influenced by modded-nanoGPT's Muon optimizer, which replaces AdamW for the linear layers.
- The project runs a GPT-2 Speedrun Leaderboard where the community competes on how fast a GPT-2-level model can be trained. The evaluation metric is the DCLM CORE score.
- Karpathy disclosed that he wrote nearly 100% of the code by hand. He tried Claude/Codex a few times but found them unhelpful because 'the code is too far from existing data distributions.'
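The single-knob idea above can be sketched as a function that maps depth to the remaining shape hyperparameters. This is a minimal illustration, not nanochat's actual code: the constants `ASPECT_RATIO` and `HEAD_DIM`, the `ModelConfig` type, and `config_from_depth` are all hypothetical names and values chosen for the example, and the learning-rate and weight-decay rules are left out because the summary does not state them.

```python
from dataclasses import dataclass

# Hypothetical constants for illustration; not taken from the nanochat source.
ASPECT_RATIO = 64   # assumed: base width = depth * ASPECT_RATIO
HEAD_DIM = 128      # assumed: size of each attention head


@dataclass(frozen=True)
class ModelConfig:
    depth: int         # number of transformer layers (the single knob)
    width: int         # model (embedding) dimension
    heads: int         # number of attention heads
    block_params: int  # rough parameter count of the transformer blocks


def config_from_depth(depth: int) -> ModelConfig:
    """Derive width, head count, and a rough size estimate from depth alone."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    base_width = depth * ASPECT_RATIO
    # Round the width up to a multiple of HEAD_DIM so it splits evenly into heads.
    heads = -(-base_width // HEAD_DIM)  # ceiling division
    width = heads * HEAD_DIM
    # Common approximation: about 12 * width^2 weights per block
    # (4 * width^2 for attention plus 8 * width^2 for a 4x-wide MLP).
    block_params = 12 * depth * width * width
    return ModelConfig(depth, width, heads, block_params)


if __name__ == "__main__":
    for d in (4, 12, 20, 26):
        cfg = config_from_depth(d)
        print(f"depth={cfg.depth:2d} width={cfg.width:4d} heads={cfg.heads:2d} "
              f"block_params~{cfg.block_params / 1e6:.0f}M")
```

Under these assumed constants, every model in a depth sweep keeps the same proportions, so comparing two depths changes only the scale of the model and not its shape.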
Evidence
- The '$100' title was called misleading: it refers to about $100 of 8xH100 cloud node rental, not local execution. Some readers who expected to run it locally were disappointed.
- Karpathy's remark that AI coding tools were unhelpful went viral and was cited as evidence that AI coding agents still struggle with original code that lies far from their training data distribution.
- One user ran a training job live, sharing W&B links in real time and promising to release the model 4 hours later. The community is actively taking part in experiments.
- One commenter wanted to train on a personal CPU even if it took 3 months, but the consensus was that meaningful results without GPUs are unrealistic.
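The cost figures quoted in this summary can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given here ($43,000 in 2019, about $48 for about 2 hours, about $15 on spot, $100 in the title). The hourly rates are derived values rather than quoted prices, and the assumption that a spot run also takes about 2 hours is mine.

```python
# Figures quoted in this summary.
GPT2_2019_COST_USD = 43_000
NANOCHAT_COST_USD = 48
NANOCHAT_HOURS = 2
SPOT_COST_USD = 15
TITLE_BUDGET_USD = 100


def hourly_rate(total_usd: float, hours: float) -> float:
    """Implied node price per hour for a run of the given cost and length."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_usd / hours


def summarize() -> dict:
    on_demand = hourly_rate(NANOCHAT_COST_USD, NANOCHAT_HOURS)
    # Assumption: the spot run takes the same ~2 hours as the on-demand run.
    spot = hourly_rate(SPOT_COST_USD, NANOCHAT_HOURS)
    return {
        "on_demand_usd_per_hour": on_demand,
        "spot_usd_per_hour": spot,
        # How many node-hours the $100 in the title buys at the implied rate.
        "hours_for_title_budget": TITLE_BUDGET_USD / on_demand,
        # Cost reduction relative to the 2019 figure.
        "cost_reduction_factor": GPT2_2019_COST_USD / NANOCHAT_COST_USD,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value:.2f}")
```

At the implied on-demand rate, the $100 in the title corresponds to a little over 4 node-hours, which matches the "~4 hours" figure in the TL;DR.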
How to Apply
- To experience the full LLM training pipeline end to end, run nanochat's speedrun.sh script, which covers everything from pretraining to the chat UI. Renting an 8xH100 node from a provider such as Lambda Labs or RunPod costs roughly $15 (spot) to $48 for the run.
- For in-house experiments with small domain-specific models, vary nanochat's '--depth' parameter to quickly compare models of different sizes. A lower depth cuts cost sharply, which makes it well suited to prototyping.
- If you research LLM training optimization, join the GPT-2 Speedrun Leaderboard to experiment with training speed-up techniques (Muon optimizer, custom schedulers) and share the results with the community.
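For readers experimenting with the Muon optimizer mentioned above, the core bookkeeping is deciding which parameters it handles. The sketch below routes 2-D weight matrices of linear layers to Muon and everything else to AdamW. It is an illustration under stated assumptions: the parameter names, the `split_param_groups` helper, and the choice to keep embedding and output-head matrices on AdamW are hypothetical and are not taken from the nanochat or modded-nanoGPT code.

```python
from typing import Dict, List, Sequence, Tuple

Shape = Tuple[int, ...]


def split_param_groups(
    shapes: Dict[str, Shape],
    adamw_keywords: Sequence[str] = ("embed", "lm_head"),  # assumed exclusions
) -> Dict[str, List[str]]:
    """Assign each parameter name to the 'muon' or 'adamw' group."""
    groups: Dict[str, List[str]] = {"muon": [], "adamw": []}
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        is_excluded = any(keyword in name for keyword in adamw_keywords)
        target = "muon" if is_matrix and not is_excluded else "adamw"
        groups[target].append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical parameter names and shapes for a tiny model.
    example = {
        "embed.weight": (50304, 768),
        "blocks.0.attn.qkv.weight": (2304, 768),
        "blocks.0.mlp.fc.weight": (3072, 768),
        "blocks.0.norm.scale": (768,),
        "lm_head.weight": (50304, 768),
    }
    for group, names in split_param_groups(example).items():
        print(group, names)
```

In a real training script, the two name lists would be used to build two optimizer instances, one per group, that are stepped together on each iteration.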
Code Example
# nanochat quick start (on 8xH100 node)
git clone https://github.com/karpathy/nanochat.git
cd nanochat
bash runs/speedrun.sh # pre-training → inference → chat UI all at once
# adjust model size with depth parameter (GPT-2 scale = depth 26)
python nanochat/train.py --depth 26
Related Papers
Show HN: Neural Particle Automata
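An extension of Neural Cellular Automata that runs on moving particles instead of a fixed grid. It can learn self-organizing behavior across tasks such as morphogenesis, point-cloud classification, and texture synthesis.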
The annotated PyTorch training loop
An in-depth guide that explains, step by step, why each line of a PyTorch training loop has to be where it is and what goes wrong when lines are reordered or left out.
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
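In a VLM self-training loop, a verifier that does not fit a particular task makes performance worse the longer training runs, yet the DPO loss keeps going down as usual, so the regression is hard to notice.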
The Role of Feedback Alignment in Self-Distillation
When an LLM teaches itself, aligning the feedback step by step with the model's reasoning flow improves math reasoning performance by more than 16 points over GRPO.
Tiny hackable CUDA language model implementation
A minimal GPT (Generative Pretrained Transformer) implementation written in CUDA that can train on any byte stream, not just text. It is useful for developers who want to take apart the internals of an LLM themselves.
CS336: Language Modeling from Scratch
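A Stanford course on implementing the entire LLM pipeline, in which students code everything themselves, from the tokenizer through data collection, the transformer implementation, distributed training, and RL-based alignment. Because it is implementation-focused rather than theoretical, it is one of the most systematic curricula for developers who want a deep understanding of how LLMs actually work.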