What's the strongest AI model you can train on a laptop in five minutes?
TL;DR Highlight
Training a GPT-style transformer in just 5 minutes on a MacBook Pro — exploring optimal model size, dataset, and training configuration with measured results.
Who Should Read
Developers who want to train small language models hands-on or iterate quickly on ML experiments in a local environment. ML engineers trying PyTorch/MLX-based model training for the first time or looking for optimization sweet spots.
Core Mechanics
- Final result: ~1.8M parameter GPT-style transformer trained on ~20M tokens from TinyStories, achieving ~9.6 perplexity on held-out split in 5 minutes.
- Under a 5-minute constraint, there is a sweet spot between model size and training tokens: too large and the model sees too few tokens within the time budget; too small and it runs into capacity limits.
- On Apple Silicon, simply enabling the MPS backend is the key step; torch.compile and float16 conversion can actually hurt, because kernel-launch overhead, not raw compute, is the real bottleneck.
- Domain-specific fine-tuning on tiny datasets can produce surprisingly coherent outputs even at 1.8M parameters.
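The post doesn't spell out the exact architecture, but a back-of-the-envelope GPT-2-style parameter count shows how ~1.8M parameters could come about. The config below (tied embeddings, d_model=128, 4 layers, ~8K vocab) is a hypothetical illustration, not the post's actual settings:

```python
def gpt_param_estimate(vocab_size: int, d_model: int, n_layers: int) -> int:
    """Rough GPT-2-style parameter count with tied input/output embeddings.

    Each transformer block has ~12 * d_model^2 weights:
    4 * d^2 for the attention projections (Q, K, V, output)
    and 8 * d^2 for the MLP (up- and down-projections at 4x width).
    """
    embeddings = vocab_size * d_model
    blocks = n_layers * 12 * d_model * d_model
    return embeddings + blocks

# A hypothetical config that lands near the post's ~1.8M figure:
print(gpt_param_estimate(vocab_size=8192, d_model=128, n_layers=4))  # 1835008
```

The estimate ignores small terms (layer norms, biases, positional embeddings), which is why it is only a sanity check on the order of magnitude.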
Evidence
- Techniques from the GPT-2 speedrun project (modded-nanogpt) were suggested for further improvement: the Muon optimizer, better weight initialization, and learning-rate tuning could reach lower perplexity in the same time budget.
- Apple Silicon quirk: standard GPU optimization techniques (torch.compile, float16) actually degrade performance due to MPS backend's launch overhead characteristics.
- ~9.6 perplexity on the TinyStories held-out split — the model generates grammatically correct simple stories.
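Perplexity is the exponential of the mean cross-entropy loss (in nats), so the reported ~9.6 corresponds to a held-out loss of roughly 2.26. A minimal sketch of the conversion:

```python
import math

def perplexity(mean_ce_loss_nats: float) -> float:
    """Perplexity = exp(average cross-entropy loss, measured in nats)."""
    return math.exp(mean_ce_loss_nats)

# A held-out loss of ~2.26 nats corresponds to the reported ~9.6 perplexity:
print(round(perplexity(2.26), 2))  # 9.58, close to the reported ~9.6
```

This is a handy check when comparing runs: small differences in loss translate to multiplicative differences in perplexity.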
How to Apply
- For quick local training experiments: don't blindly apply 'standard optimizations' like torch.compile or float16. On Apple Silicon, just activate MPS and measure a baseline with a simple training loop first — complex optimizations can backfire due to launch overhead.
- For small domain-specific models: use a curated dataset like TinyStories as a template — quality and domain match matter more than dataset size at small scale.
- Use this as a learning exercise: 5 minutes to a working transformer helps build intuition about model size/data/training dynamics tradeoffs.
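For the "measure a baseline first" advice, a fixed wall-clock training loop is easy to sketch. The names here are illustrative (the post's actual loop isn't shown), and `step_fn` stands in for a real forward/backward pass:

```python
import time

def train_for_budget(step_fn, budget_s: float = 300.0) -> int:
    """Run training steps until the wall-clock budget (default 5 min) runs out.

    Returns the number of completed steps — the figure to compare when
    toggling an optimization like torch.compile or float16 on and off.
    """
    start = time.perf_counter()
    steps = 0
    while time.perf_counter() - start < budget_s:
        step_fn()  # one optimizer step in a real training loop
        steps += 1
    return steps

# Tiny smoke run: a no-op step with a 50 ms budget completes at least one step.
print(train_for_budget(lambda: None, budget_s=0.05) > 0)
```

Comparing steps-per-budget (or tokens-per-budget) across configurations is how a claim like "torch.compile is slower on MPS" would be verified on your own machine.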
Terminology
perplexity: A metric for how well a model predicts text. Lower is better; perfect prediction approaches 1. Intuitively, it measures how 'unsurprised' the model is by the next word.
MPS: Metal Performance Shaders, the GPU backend PyTorch uses on Apple Silicon (M1/M2/M3/M4). Apple's rough equivalent of NVIDIA's CUDA.
TinyStories: A dataset of simple children's stories designed for training small language models. Its limited vocabulary and simple grammar make it good for quick experiments.