Show HN: I built a tiny LLM to demystify how language models work
TL;DR Highlight
This educational project lets you build and train a mini LLM with 8.7 million parameters, which speaks in the voice of a guppy-fish character, from scratch in about 5 minutes using a single Colab notebook, with the goal of demystifying the black-box nature of LLMs.
Who Should Read
Developers who are curious about how LLMs work internally but have found large-model codebases too complex to approach. Particularly suitable for beginners who want to follow the entire process from tokenizer to training loop in code, without needing a PhD or a high-performance GPU.
Core Mechanics
- GuppyLM is an extremely small LLM with only 8.7 million parameters, generating short, lowercase sentences about water, food, light, and life in a fish tank, in the voice of its guppy character.
- The training data consists of 60,000 synthetic conversations across 60 topics; no internet-crawled data is used at all.
- The architecture is Transformer-based, with a very simple structure consisting of 6 layers, a hidden dimension of 384, 6 attention heads, and an FFN (feedforward network) dimension of 768.
- You can directly run the entire pipeline – data generation → tokenizer training → model architecture definition → training loop → inference – on a single GPU in Google Colab in about 5 minutes.
- Due to its small size, the model can run in a browser, and the code is kept as simple as possible to allow for end-to-end understanding of the entire LLM workflow without complex infrastructure.
- Because the training data is all lowercase, uppercase input (e.g., 'HELLO') produces nonsensical responses; this limitation is a good educational example of how directly the tokenizer and training data determine model behavior.
- The model is trained with next-token prediction alone; notably, it still holds consistent in-character conversations without additional techniques such as RLHF or fine-tuning.
- The repository includes separate Jupyter notebooks for training (train_guppylm.ipynb) and inference (use_guppylm.ipynb), allowing you to examine the training and execution steps separately.
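The quoted 8.7M parameter count can be sanity-checked from the listed hyperparameters. A minimal arithmetic sketch in plain Python, assuming a vocabulary of 4096 tokens, a 256-token context window, tied input/output embeddings, and biased linear layers (none of these are stated in the post):

```python
# Hypothetical parameter-count check for a GuppyLM-like Transformer.
# Assumed (not stated in the post): vocab=4096, context=256, tied embeddings.
n_layers, d_model, n_heads, d_ffn = 6, 384, 6, 768
vocab, context = 4096, 256  # assumptions

attn = 4 * (d_model * d_model + d_model)           # Wq, Wk, Wv, Wo with biases
ffn = d_model * d_ffn + d_ffn + d_ffn * d_model + d_model
norms = 2 * 2 * d_model                            # two LayerNorms (gain + bias)
per_layer = attn + ffn + norms

total = (n_layers * per_layer
         + vocab * d_model        # token embeddings (tied with LM head)
         + context * d_model      # learned positional embeddings
         + 2 * d_model)           # final LayerNorm

print(f"{total / 1e6:.2f}M parameters")  # → 8.78M parameters
```

Under these assumptions the count lands right around the quoted 8.7M, which suggests the reported architecture numbers are internally consistent.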
Evidence
- Questions arose comparing GuppyLM to Andrej Karpathy's microgpt and minGPT, with opinions that GuppyLM's lower code complexity and more concrete goal of character learning make it more accessible for beginners.
- There was feedback that developers unfamiliar with concepts like multi-head attention, ReLU FFN, LayerNorm, and learned positional embeddings would find the code difficult to understand without further documentation.
- An analogy was suggested that GuppyLM could become an LLM education tool the way Minix is for OS education.
- The site https://bbycroft.net/llm was recommended as a helpful resource for understanding internal operations by visualizing LLM layers in 3D.
- One commenter shared their experience of creating their own mini LLM using Milton's "Paradise Lost" as training data, and reported that this educational approach was effective for understanding LLMs.
- A commenter developing a system where multiple agents interact in a shared world shared that behavior changes dramatically with resource constraints, other agents, and persistent memory, suggesting a greater focus on environment design than model optimization.
- Some example responses appeared to simply reproduce the training data, raising questions about how the model responds to completely new questions not present in the training data.
- A humorous comment about Guppy's answer, "The meaning of life is food," receiving more support than responses from models 10,000 times larger drew a lot of positive feedback.
How to Apply
- If you're new to LLM internals, run the train_guppylm.ipynb notebook directly in Google Colab: you can experience the entire process from tokenizer configuration → model architecture definition → training loop → inference within 5 minutes, and immediately see the impact of modifying the code at each step.
- If you want to experiment with the potential of small, domain-specific models, refer to the 60K synthetic-conversation generation method: create your own domain data and train it with the same architecture to see firsthand how data quality and diversity directly affect model output.
- If you need to educate your team about LLM concepts, you can leverage GuppyLM's Minix-like philosophy (simple but functional implementation) to use it as study material to explain key concepts like multi-head attention and positional embedding at the code level.
- If you need to create a demo of an LLM running in a browser, keeping the model small, like GuppyLM's 8.7 million parameters, allows for client-side execution via WebAssembly or ONNX conversion, and you can use the project's architecture settings (6 layers, hidden 384, heads 6) as a starting point to adjust the size.
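The post doesn't show its data-generation code, but a template-based generator along these lines illustrates the idea behind synthetic conversation data; the topics, templates, and function name here are invented for illustration:

```python
import random

# Hypothetical template-based generator in the spirit of the post's
# 60K synthetic conversations (topics and templates invented here).
TOPICS = ["water", "food", "light", "plants", "bubbles", "sleep"]
QUESTION_TEMPLATES = [
    "what do you think about {topic}?",
    "do you like {topic}?",
    "tell me about {topic}.",
]
ANSWER_TEMPLATES = [
    "i love {topic}. it makes the tank feel like home.",
    "{topic} is the best part of my day.",
    "hmm, {topic}? a guppy can never have too much of it.",
]

def generate_conversations(n, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        topic = rng.choice(TOPICS)
        q = rng.choice(QUESTION_TEMPLATES).format(topic=topic)
        a = rng.choice(ANSWER_TEMPLATES).format(topic=topic)
        out.append({"user": q, "guppy": a})
    return out

convos = generate_conversations(100)
# Everything is lowercase by construction, which mirrors the post's point
# that casing in the training data directly shapes model behavior.
```

Swapping in your own topic list and templates (or generating answers with a larger model) is the quickest way to probe how data diversity changes the trained model's output.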
Terminology
next token prediction: A method of training a model to predict the next word (token) given the text so far. Repeating this billions of times results in the ability to generate natural language.
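Next-token prediction reduces to shifting the token sequence by one: the model sees tokens up to position t and is trained to output the token at position t+1. A toy illustration with made-up token IDs:

```python
# Toy illustration of next-token-prediction targets (token IDs invented).
tokens = [5, 12, 7, 9, 3]   # e.g. "i like warm water <eos>"

inputs  = tokens[:-1]       # what the model sees:   [5, 12, 7, 9]
targets = tokens[1:]        # what it must predict: [12, 7, 9, 3]

for x, y in zip(inputs, targets):
    print(f"given ...{x}, predict {y}")
```

The training loop simply minimizes the cross-entropy between the model's predicted distribution at each position and these shifted targets.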
multi-head attention: A core mechanism of the Transformer, which calculates how relevant the words in a sentence are to each other from multiple perspectives simultaneously. It can be likened to 'multiple readers reading a sentence at the same time, each marking the parts they find important'.
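A minimal NumPy sketch of causal multi-head self-attention using GuppyLM's reported dimensions (d_model=384, 6 heads, so 64 dims per head); the weights here are random placeholders, not the project's actual parameters:

```python
import numpy as np

# Minimal causal multi-head self-attention sketch (weights are random).
rng = np.random.default_rng(0)
d_model, n_heads = 384, 6
d_head = d_model // n_heads   # 64 dims per head
seq_len = 10

x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(t):
    # (seq, d_model) -> (heads, seq, d_head): each head sees its own slice.
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)         # (heads, seq, seq)
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores = np.where(mask, -1e9, scores)                       # causal: no peeking ahead

weights = softmax(scores)                                   # each row sums to 1
out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model) @ Wo
```

Each of the 6 heads attends over the sequence independently on its own 64-dim slice; the results are concatenated and mixed by the output projection Wo.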
FFN: Abbreviation for Feed-Forward Network. A simple 2-layer neural network within each Transformer layer that applies a non-linear transformation to the attention results, allowing the model to learn more complex patterns.
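With GuppyLM's reported dims (384 hidden, 768 FFN), the block is just an expand-activate-contract sandwich; a sketch with random placeholder weights:

```python
import numpy as np

# 2-layer ReLU FFN sketch with the post's dims: 384 -> 768 -> 384.
rng = np.random.default_rng(0)
d_model, d_ffn = 384, 768

W1 = rng.normal(size=(d_model, d_ffn)) * 0.02
b1 = np.zeros(d_ffn)
W2 = rng.normal(size=(d_ffn, d_model)) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(0, x @ W1 + b1)  # expand to 768 dims, ReLU
    return hidden @ W2 + b2              # contract back to 384 dims

x = rng.normal(size=(10, d_model))
y = ffn(x)   # same shape in and out; applied to each position independently
```

Unlike attention, the FFN mixes nothing across positions; it transforms each token's vector on its own.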
LayerNorm: Abbreviation for Layer Normalization. A technique that normalizes the output value distribution of each layer to help stabilize training. Similar to Batch Norm, but more suitable for sequence data.
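The operation itself is a few lines: normalize each token's feature vector to zero mean and unit variance, then apply a learnable gain and bias (initialized to identity in this sketch):

```python
import numpy as np

# LayerNorm sketch: per-token normalization over the feature dimension.
d_model = 384
gain, bias = np.ones(d_model), np.zeros(d_model)   # learnable in a real model

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(4, d_model))  # badly scaled input
y = layer_norm(x)
# Each row now has mean ~0 and std ~1, regardless of the input's scale.
```

Because it normalizes within each token rather than across the batch, it works the same at training and inference time, which is why it suits sequence models better than Batch Norm.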
learned positional embeddings: A method of informing the model about the order of words by attaching a learnable vector to each position (1st, 2nd, ...). Since the Transformer itself does not know the order, this information must be provided separately.
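Mechanically this is just a second lookup table added to the token embeddings; a sketch with assumed sizes (vocabulary and context length are not stated in the post):

```python
import numpy as np

# Learned positional embeddings sketch: one trainable vector per position,
# added to the token embeddings. Sizes are assumptions, not from the post.
rng = np.random.default_rng(0)
vocab, context, d_model = 4096, 256, 384

tok_emb = rng.normal(size=(vocab, d_model)) * 0.02    # one row per token ID
pos_emb = rng.normal(size=(context, d_model)) * 0.02  # one row per position

token_ids = np.array([5, 12, 7, 9])
x = tok_emb[token_ids] + pos_emb[: len(token_ids)]
# The same token ID at a different position now gets a different input
# vector, which is how the order-blind Transformer learns word order.
```

Both tables are ordinary trainable parameters, updated by backpropagation along with the rest of the model.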
synthetic conversations: Conversations that were never actually spoken by people, but artificially generated using rules or another AI. This has the advantage of quickly creating large amounts of data tailored to a specific character or domain.