Show HN: I built a tiny LLM to demystify how language models work
TL;DR Highlight
This educational project builds GuppyLM, a mini LLM with 8.7 million parameters that speaks as a guppy fish character. It trains from scratch in about 5 minutes in a single Colab notebook, with the goal of demystifying the black-box nature of LLMs.
Who Should Read
Developers who are curious about how LLMs work internally but have found large model codebases too complex to approach. Particularly suitable for beginners who want to follow the entire process in code, from tokenizer to training loop, without a PhD or a high-performance GPU.
Core Mechanics
- GuppyLM is an extremely small LLM with only 8.7 million parameters. It generates short, lowercase sentences about water, food, light, and life in a fish tank, in the voice of a guppy fish character.
- The training data consists of 60,000 synthetic conversations across 60 topics; no crawled internet data is used.
- The architecture is Transformer-based, with a very simple structure consisting of 6 layers, a hidden dimension of 384, 6 attention heads, and an FFN (feedforward network) dimension of 768.
- You can directly run the entire pipeline – data generation → tokenizer training → model architecture definition → training loop → inference – on a single GPU in Google Colab in about 5 minutes.
- Due to its small size, the model can run in a browser, and the code is kept as simple as possible to allow for end-to-end understanding of the entire LLM workflow without complex infrastructure.
- Because the training data is all lowercase, uppercase input (e.g., 'HELLO') produces nonsensical responses. This limitation is a useful educational example of how directly the tokenizer and training data determine model behavior.
- It is trained with next-token prediction only. Notably, it can hold consistent in-character conversations without additional techniques such as RLHF or fine-tuning.
- The repository includes separate Jupyter notebooks for training (train_guppylm.ipynb) and inference (use_guppylm.ipynb), allowing you to examine the training and execution steps separately.
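The uppercase limitation described above can be illustrated with a toy example. The sketch below is not GuppyLM's tokenizer (which this summary does not show); it assumes a hypothetical character-level vocabulary built from a made-up lowercase sentence, and shows that uppercase input falls entirely outside what the model has seen.

```python
# Toy character-level vocabulary built from lowercase-only training text.
# Illustration only: the sentence and the <unk> fallback are assumptions,
# not GuppyLM's actual tokenizer or data.
training_text = "i am a small fish. i like food and warm water."

UNK = "<unk>"
vocab = {UNK: 0}
for ch in sorted(set(training_text)):
    vocab[ch] = len(vocab)


def encode(text):
    """Map each character to its id, falling back to <unk> for unseen characters."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]


def unk_fraction(text):
    """Fraction of characters the vocabulary has never seen."""
    ids = encode(text)
    return sum(1 for i in ids if i == vocab[UNK]) / len(ids)


print(unk_fraction("hello"))  # all characters appear in the training text
print(unk_fraction("HELLO"))  # no uppercase character appears in the training text
```

With a subword tokenizer the failure may look different (for example, uppercase text splitting into rare, poorly trained tokens rather than an unknown token), but the underlying cause is the same: the model has no training signal for that input.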
Evidence
- Commenters compared GuppyLM with Andrej Karpathy's microgpt and minGPT. Some felt that GuppyLM has lower code complexity and a more concrete goal (learning a character), making it more accessible for beginners.
- There was feedback that developers unfamiliar with concepts like multi-head attention, ReLU FFN, LayerNorm, and learned positional embeddings would find the code difficult to understand without further documentation.
- One commenter suggested GuppyLM could become an LLM education tool, as Minix is for OS education.
- The site https://bbycroft.net/llm was recommended as a helpful resource for understanding internal operations, since it visualizes LLM layers in 3D.
- One commenter shared their experience of creating their own mini LLM using Milton's "Paradise Lost" as training data, and reported that this educational approach was effective for understanding LLMs.
- A commenter developing a system where multiple agents interact in a shared world reported that behavior changes dramatically with resource constraints, other agents, and persistent memory, and suggested focusing more on environment design than on model optimization.
- Some example responses appeared to simply reproduce the training data, raising questions about how the model responds to completely new questions not present in the training data.
- A humorous comment that Guppy's answer, "The meaning of life is food," earns more support than responses from models 10,000 times larger received a lot of positive feedback.
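Of the concepts commenters flagged as hard to follow without documentation, attention is probably the least intuitive. The following is a minimal sketch of single-head causal self-attention in plain Python, not code from the GuppyLM repository. The input vectors are made-up numbers; in a real model the queries, keys, and values come from learned linear projections of the token embeddings, and multi-head attention runs several such heads in parallel on slices of the hidden dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of per-token vectors (lists of floats).
    Token t may only attend to tokens 0..t.
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Score the current token against itself and earlier tokens only.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three tokens with 2-dimensional vectors (illustrative values only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(q, k, v)
for t, w in enumerate(weights):
    print(f"token {t} attends over {len(w)} position(s): {w}")
```

The causal mask is what makes next-token prediction possible: the first token can only see itself, so its output is just its own value vector, while later tokens blend information from everything before them.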
How to Apply
- If you're new to LLM internal structures, running the train_guppylm.ipynb notebook directly in Google Colab will allow you to experience the entire process from tokenizer configuration → model architecture definition → training loop → inference within 5 minutes, and you can immediately see the impact of modifying the code at each step.
- If you want to experiment with small, domain-specific models, you can use the 60K synthetic conversation data generation method as a reference, create your own domain data, and train it with the same architecture to see how data quality and diversity directly affect model output.
- If you need to teach your team LLM concepts, GuppyLM's Minix-like philosophy (a simple but functional implementation) makes it usable as study material for explaining key concepts like multi-head attention and positional embeddings at the code level.
- If you need to create a demo of an LLM running in a browser, keeping the model small, like GuppyLM's 8.7 million parameters, allows for client-side execution via WebAssembly or ONNX conversion, and you can use the project's architecture settings (6 layers, hidden 384, heads 6) as a starting point to adjust the size.
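When adjusting the architecture settings to hit a size target, a rough parameter estimate helps. The sketch below is a back-of-the-envelope calculation for a standard Transformer layout, not the project's own counting code. The layer count, hidden size, and FFN size come from the summary above; the vocabulary size (4096), context length (128), tied input/output embeddings, biases on every linear layer, and two LayerNorms per block are all assumptions chosen for illustration.

```python
def transformer_param_count(
    n_layers=6,
    d_model=384,
    d_ffn=768,
    vocab_size=4096,      # assumption: not stated in the summary
    max_seq_len=128,      # assumption: not stated in the summary
    tie_embeddings=True,  # assumption: output head shares the embedding matrix
):
    """Rough parameter count for a decoder-only Transformer."""
    # Q, K, V, and output projections, each d_model x d_model plus a bias.
    attn = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ffn and d_ffn -> d_model, with biases.
    ffn = (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a bias vector.
    norms = 2 * 2 * d_model
    per_layer = attn + ffn + norms

    # Token embeddings plus learned positional embeddings.
    embeddings = vocab_size * d_model + max_seq_len * d_model
    final_norm = 2 * d_model
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return n_layers * per_layer + embeddings + final_norm + lm_head


baseline = transformer_param_count()
smaller = transformer_param_count(n_layers=4, d_model=256, d_ffn=512)
print(f"baseline: {baseline / 1e6:.2f}M parameters")
print(f"smaller:  {smaller / 1e6:.2f}M parameters")
```

Under these assumptions the baseline lands close to the 8.7 million figure quoted for GuppyLM. The number of attention heads does not appear in the formula because heads split the hidden dimension rather than adding parameters. Note that the Transformer blocks dominate the total, so shrinking the hidden dimension reduces size faster than shrinking the vocabulary.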
Related Papers
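PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
A study that addresses "difficulty collapse," a chronic problem of single-model self-play, through the co-evolution of teacher-student LoRA populations, outperforming the baseline on many math and code benchmarks.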
Negation Neglect: When models fail to learn negations in training
Even when a document carries thousands of warnings that "this is fake," a model fine-tuned on that document ends up believing its content is true.
Conceptors for Semantic Steering
A method that inserts matrix-based "conceptors" into an LLM's hidden states to precisely steer concepts such as emotion, political leaning, and depression without retraining.
Shai-Hulud Themed Malware Found in the PyTorch Lightning AI Training Library
PyTorch Lightning packages 2.6.2 and 2.6.3 delivered credential-stealing malware via a supply chain attack.
Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs
Fine-tuning even safety-aligned LLMs can bypass safeguards and reproduce copyrighted text verbatim, revealing prompt filtering alone isn't enough to prevent copyright infringement.
Show HN: MacMind – A transformer neural network in HyperCard on a 1989 Macintosh
This is an educational project implementing a single-layer Transformer with 1,216 parameters in the scripting language HyperTalk (1987) and training it on a real Macintosh SE/30. It demonstrates that the core mathematics of modern LLMs works the same on hardware from 30 years ago.