NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute
TL;DR Highlight
The researchers achieved 10x data efficiency in a few weeks: an ensemble of 1.8B parameter models trained on only 100M tokens matched the performance of training on 1B tokens. The work is an approach to preparing for a future where compute is abundant but data is the bottleneck.
Who Should Read
ML researchers and engineers who run LLM pretraining experiments directly, and AI developers who need to build better models with limited data.
Core Mechanics
- Using an ensemble of 1.8B parameter models trained on different data subsets, then distilling into a single model, achieves performance comparable to a model trained on 10x more data.
- The key insight is that data diversity — training different ensemble members on different data slices — matters more than raw data quantity for matching larger-data baselines.
- The approach is particularly effective in the 'low data regime' (under 500M tokens), where the ensemble benefit is largest and diminishes at higher data volumes.
- The experiment was completed in weeks rather than months due to the small model size (1.8B) and data volume, making it accessible for smaller teams.
- The authors anticipate a 'compute-rich, data-poor' future, in which synthetic data and careful data curation matter more than scale; that outlook is what motivates exploring this direction.
Evidence
- The researchers shared training curves showing the ensemble approach's performance vs. a single model at various data scales — the gap narrows as data volume increases.
- Commenters noted the result is consistent with established findings in ensemble learning: the technique itself isn't new, but applying it to LLM pretraining at this level of data efficiency is.
- ML engineers appreciated the accessibility: the experiment runs on a modest GPU cluster, unlike most pretraining research that requires hundreds of GPUs.
- Some questioned whether the gains hold at larger scales (7B+, 70B+) or only apply to the 1.8B parameter range tested.
How to Apply
- If you're limited to a small dataset, train multiple smaller models on different random subsets of the data, then ensemble their outputs or distill into a single model.
- Prioritize data diversity over data quantity when curating your training set — covering different domains and writing styles matters more than having more of the same.
- Use this approach for domain-specific models where data is scarce: medical, legal, or niche technical domains where 100M tokens of high-quality data may be all you can get.
- Run ablations at small scale (1B parameters) before committing to larger experiments — the data efficiency gains should be visible even at smaller scales.
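The first step above, splitting a dataset into disjoint random subsets for the ensemble members, can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and round-robin scheme are assumptions.

```python
import random

def split_into_subsets(documents, num_models, seed=0):
    """Partition a list of documents into num_models disjoint random subsets.

    Each ensemble member trains on one subset, so members see different
    data slices (the diversity the approach relies on).
    """
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    # Round-robin slicing keeps the subset sizes balanced
    return [docs[i::num_models] for i in range(num_models)]

# Example: 100 documents across 4 ensemble members, 25 each
subsets = split_into_subsets(range(100), num_models=4)
```

Because the subsets are disjoint and cover the whole dataset, the ensemble collectively sees every document exactly once while each member sees a different quarter of it.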
Code Example
# Chain Distillation Ensemble training loop (pseudocode)
def train_chain_distillation_ensemble(data, num_models=8, alpha=0.5, T=1.0):
    models = []
    # First model: trained with the standard cross-entropy loss
    M1 = train_model(data, loss_fn='cross_entropy')
    models.append(M1)
    # Subsequent models: use the immediately preceding model as teacher,
    # so only one frozen teacher is resident at a time (memory efficient)
    for k in range(2, num_models + 1):
        teacher = models[-1]
        freeze(teacher)

        def distill_loss(student_logits, teacher_logits, labels):
            ce_loss = cross_entropy(student_logits, labels)
            # kl_divergence is assumed to softmax its temperature-scaled
            # inputs; the T**2 factor keeps gradient magnitudes comparable
            # across temperatures
            kl_loss = T**2 * kl_divergence(
                student_logits / T,
                teacher_logits / T,
            )
            return (1 - alpha) * ce_loss + alpha * kl_loss

        M_k = train_model(data, loss_fn=distill_loss, teacher=teacher)
        models.append(M_k)
        del teacher  # drop the local reference; the model itself stays in `models`

    return models

def ensemble_inference(models, input_tokens):
    # Average logits from all models to produce the final prediction
    all_logits = [model(input_tokens) for model in models]
    return sum(all_logits) / len(all_logits)
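The distill_loss in the pseudocode combines cross-entropy on hard labels with a temperature-scaled KL term. A minimal NumPy sketch makes the arithmetic concrete; the function names are illustrative and not from the original code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=1.0):
    """(1 - alpha) * CE(student, labels) + alpha * T^2 * KL(teacher || student).

    Logits have shape (batch, vocab); labels are integer class indices.
    """
    p_student = softmax(student_logits / T)
    p_teacher = softmax(teacher_logits / T)
    # Cross-entropy against the hard labels (no temperature)
    probs = softmax(student_logits)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    # KL divergence of the temperature-softened distributions
    kl = np.mean(np.sum(
        p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)),
        axis=-1,
    ))
    return (1 - alpha) * ce + alpha * T**2 * kl
```

When teacher and student logits are identical the KL term vanishes, so the loss reduces to (1 - alpha) times the plain cross-entropy; this is a quick sanity check for any implementation.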
# Looped Transformer configuration example (based on 30 layers)
# Layers 0-14: normal pass-through
# Layers 15-24: repeated 4 times
# Layers 25-29: normal pass-through (last layers are not repeated)
loop_config = {
    'total_layers': 30,
    'loop_start': 15,
    'loop_end': 24,
    'loop_count': 4,
}