What's the strongest AI model you can train on a laptop in five minutes?
TL;DR Highlight
Training a GPT-style transformer in just 5 minutes on a MacBook Pro — exploring optimal model size, dataset, and training configuration with measured results.
Who Should Read
Developers who want to train small language models hands-on or iterate quickly on ML experiments in a local environment. ML engineers trying PyTorch/MLX-based model training for the first time or looking for optimization sweet spots.
Core Mechanics
- Final result: a ~1.8M-parameter GPT-style transformer trained on ~20M tokens from TinyStories, reaching ~9.6 perplexity on a held-out split within the 5-minute budget.
- Under a 5-minute constraint, there is a sweet spot between model size and training tokens: too large and initialization plus slower steps eat the budget before enough tokens are seen; too small and the model hits its capacity limit (a rough parameter-count sketch follows this list).
- On Apple Silicon, simply enabling the MPS backend is the key optimization; torch.compile and float16 conversion can actually hurt, because kernel-launch overhead, not compute, is the real bottleneck.
- Domain-specific fine-tuning on tiny datasets can produce surprisingly coherent outputs even at 1.8M parameters.
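To make that size/token tradeoff concrete, here is a back-of-the-envelope parameter count for a GPT-style model; a minimal sketch with an illustrative configuration (4 layers, d_model=128, 8k vocab, 256-token context) that happens to land near 1.8M parameters. The post's exact architecture may differ.

```python
# Back-of-the-envelope GPT parameter count (biases and LayerNorms ignored,
# output head assumed weight-tied to the embedding).
def gpt_param_count(n_layer: int, d_model: int, vocab: int, n_ctx: int) -> int:
    embed = vocab * d_model + n_ctx * d_model   # token + positional embeddings
    attn = 4 * d_model * d_model                # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)           # up/down projections, 4x hidden
    return embed + n_layer * (attn + mlp)

# Illustrative (assumed) config: 4 layers, d_model=128, 8k vocab, 256 context.
print(gpt_param_count(n_layer=4, d_model=128, vocab=8_000, n_ctx=256))  # 1843200
```

Note that the embedding table dominates at this scale (over half the parameters), so vocabulary size is a first-order knob when squeezing a model into a tight time budget.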
Evidence
- Techniques from the GPT-2 speedrun project (modded-nanogpt) were suggested for further improvement: the Muon optimizer, better weight initialization, and learning-rate tuning could reach lower perplexity in the same time budget.
- Apple Silicon quirk: standard GPU optimizations (torch.compile, float16) actually degrade performance because of the MPS backend's kernel-launch overhead.
- ~9.6 perplexity on a TinyStories held-out split; the resulting model generates grammatically correct simple stories (a perplexity sketch follows this list).
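For context on that figure, perplexity is the exponentiated mean per-token cross-entropy on held-out data. A minimal sketch in PyTorch; the tensor shapes here are illustrative assumptions, not the post's code:

```python
import torch
import torch.nn.functional as F

# Perplexity = exp(mean next-token cross-entropy) on held-out data.
# Assumed shapes: logits (batch, seq, vocab), targets (batch, seq).
def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return loss.exp().item()

# A reported perplexity of ~9.6 corresponds to a mean loss of
# ln(9.6) ≈ 2.26 nats per token.
```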
How to Apply
- For quick local training experiments: don't blindly apply 'standard optimizations' like torch.compile or float16. On Apple Silicon, just enable MPS and measure a baseline with a simple training loop first (a minimal sketch follows this list); complex optimizations can backfire due to launch overhead.
- For small domain-specific models: use a curated dataset like TinyStories as a template — quality and domain match matter more than dataset size at small scale.
- Use this as a learning exercise: 5 minutes to a working transformer helps build intuition about model size/data/training dynamics tradeoffs.
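A minimal baseline loop in that spirit. This is a sketch, not the post's code: the model and data below are stand-ins (assumed values) so it runs as-is; swap in your own tiny GPT and TinyStories batches. Note what it deliberately omits: no torch.compile, no float16 casting.

```python
import time
import torch
import torch.nn.functional as F

# Prefer Apple's MPS backend when available; otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Stand-in model and random batches so the loop runs as-is (assumed values);
# replace with your own tiny GPT and real TinyStories batches.
vocab, seq_len = 8_000, 256
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, 128),
    torch.nn.Linear(128, vocab),
).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Deliberately no torch.compile and no .half(): per the post, both can
# slow the MPS backend down because launch overhead dominates.
deadline = time.time() + 5 * 60                 # the 5-minute budget
while time.time() < deadline:
    batch = torch.randint(0, vocab, (8, seq_len + 1), device=device)
    inputs, targets = batch[:, :-1], batch[:, 1:]   # next-token prediction
    loss = F.cross_entropy(model(inputs).flatten(0, 1), targets.flatten())
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
```

On a machine without MPS the same loop runs unchanged on CPU, which makes it easy to compare the two baselines before reaching for any further optimization.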
Related Papers
Shai-Hulud Themed Malware Found in the PyTorch Lightning AI Training Library
PyTorch Lightning packages 2.6.2 and 2.6.3 delivered credential-stealing malware via a supply chain attack.
Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs
Fine-tuning even safety-aligned LLMs can bypass safeguards and reproduce copyrighted text verbatim, revealing prompt filtering alone isn't enough to prevent copyright infringement.
Show HN: MacMind – A transformer neural network in HyperCard on a 1989 Macintosh
An educational project implementing a single-layer Transformer with 1,216 parameters in the scripting language HyperTalk (1987) and training it on a real Macintosh SE/30, demonstrating that the core mathematics of modern LLMs runs the same on hardware from over 30 years ago.
MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
MegaTrain is a system that uses CPU memory as primary storage and the GPU solely as a compute engine, enabling full-precision training of 120B-parameter models on a single H200 GPU.
Show HN: I built a tiny LLM to demystify how language models work
An educational project for building an 8.7M-parameter mini LLM from scratch in about 5 minutes using a single Colab notebook, trained on a Guppy fish character, with the goal of demystifying the black-box nature of LLMs.