The maths you need to start understanding LLMs
TL;DR Highlight
You only need high school-level vector and matrix math to understand how LLMs compute under the hood — this post walks through it step by step.
Who Should Read
Backend and full-stack developers who want to understand how LLMs work from the ground up. Especially suited for those without a deep learning background who want to grasp AI internals.
Core Mechanics
- Understanding LLM inference only requires high school-level vectors and matrix operations. Deeper math is needed for research/training, but this level suffices for understanding 'how it works.'
- Vectors aren't just arrays of numbers — they represent direction and distance in n-dimensional space. In LLMs, vectors are the core tool for expressing semantics numerically; similar concepts sit close together in vector space.
- Token embedding converts words or subwords into vectors of hundreds to thousands of dimensions. These embeddings are the LLM's input, followed by the transformer network's massive computations.
- Cosine similarity measures the angle between two vectors to judge semantic similarity. This is exactly what RAG uses to compare user queries with documents.
- An LLM's output is a logit vector: one raw score per vocabulary token indicating how likely that token is to come next. Applying softmax exponentiates each value and normalizes so they sum to 1, yielding a probability distribution.
- Softmax is based on the exponential function (exp) taught in high school. Its property of amplifying large values lets the LLM strongly prefer certain next-token candidates.
- LLM internals are essentially repeated addition and multiplication. The math itself is simple, but why networks with over 1 trillion parameters work so well remains not fully explained.
- Embeddings are just the input stage of the LLM. The actual 'intelligence' comes from transformer networks with 1.8T+ parameters, and what exactly happens inside remains a black box.
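The logit-to-probability step above can be sketched in a few lines of plain Python. The logit values here are made up for illustration; real models produce one logit per token in a vocabulary of tens of thousands:

```python
import math

def softmax(logits):
    """Convert raw logit scores into a probability distribution."""
    # Subtract the max before exponentiating for numerical stability;
    # this shifts every exponent equally and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three next-token candidates.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(probs)       # largest logit gets the largest probability mass
print(sum(probs))  # sums to 1.0
```

Note how exponentiation amplifies gaps: a logit just 1.0 higher receives roughly e times the probability mass, which is the "strongly prefer certain candidates" behavior described above.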
Evidence
- A physics MSc-turned-developer noted that vectors, linear algebra, and entropy concepts unused throughout their career came alive while studying LLMs. Backprop is tensor calculus, and everything is matrix multiplication — perfectly aligned with a physics background.
- Multiple people shared that actively coding along with Karpathy's LLM video series (not just watching) was decisive for understanding. Some said knowing how CPUs work is enough for practical purposes.
- Comments criticized the title as misleading — the article explains 'how LLMs compute internally,' not 'a mathematical explanation of why LLMs work.' LLM explainability research covers that domain and remains incomplete.
- The danger of multi-agent chains was discussed given that LLMs output logits. Uncertainty compounds with each call — one developer reported complete collapse after 3 chains and recommended single orchestrator + human-in-the-loop.
- A commenter claiming to have developed early math in this space at Google generated buzz — they researched behavior prediction and UI decision prediction beyond language vectors but hit limits without attention layers, and the entire team was laid off under Sundar Pichai.
How to Apply
- If building a RAG system, try implementing cosine similarity logic yourself based on this article. The basic pattern: embed document chunks, rank top-N by cosine similarity with the query vector, then apply reranking.
- When designing multi-LLM agent chain architectures, minimize call count and have a single orchestrator make all decisions. Uncertainty compounds per call, so carefully evaluate chains of 3+ calls.
- To study LLM math deeper, code along with Karpathy's YouTube series (don't just watch) and supplement with Sebastian Raschka's 'Build a Large Language Model (from Scratch).'
- For systematic linear algebra foundations, combine Deeplearning.AI's 'Mathematics for Machine Learning and Data Science Specialization' (Coursera, ~$50/month) with the book 'Math and Architectures of Deep Learning.'
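The embed-then-rank pattern from the first bullet above can be sketched with a hand-rolled cosine similarity. The 3-dimensional embedding vectors and chunk texts here are made-up placeholders, since real embeddings (hundreds to thousands of dimensions) come from an embedding model:

```python
import math

def cosine_similarity(a, b):
    """Similarity from the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_n(query_vec, chunks, n=2):
    """Rank embedded document chunks by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, reverse=True)[:n]

# Hypothetical pre-embedded document chunks.
chunks = [
    ("refund policy",  [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.2]),
    ("api reference",  [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # pretend this embeds "how do I get a refund?"
for score, text in top_n(query, chunks):
    print(f"{score:.3f}  {text}")
```

In a real RAG pipeline these top-N candidates would then go to a reranker, as the bullet above suggests; cosine similarity alone is a fast first-pass filter, not a final relevance judgment.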
Terminology
- logit: The raw score an LLM outputs when predicting the next token. It is the pre-probability value; higher scores mean the token is more likely to be selected.
- softmax: A function that converts a logit vector into a probability distribution. It exponentiates each value (e^x) and divides by the total sum so all items sum to 1.
- Embedding: Converting a word or token into a real-valued vector of hundreds to thousands of dimensions. Words with similar meaning sit close together in vector space.
- Cosine Similarity: A method of measuring similarity using the angle between two vectors. Values closer to 1 mean more similar; closer to 0 means unrelated. Used in RAG for query-document matching.
- Transformer: The core architecture of modern LLMs. It uses attention mechanisms to capture relationships between input tokens; GPT, BERT, and most modern LLMs are based on this structure.
- Inference: The process of using an already-trained LLM to make predictions (generate text), as distinguished from training, which builds the model from scratch.
Related Resources
- https://www.gilesthomas.com/2025/09/maths-for-llms
- https://www.youtube.com/watch?v=7xTGNNLPyMI
- https://www.manning.com/books/build-a-large-language-model-from-scratch
- https://github.com/stared/thinking-in-tensors-writing-in-pytorch
- https://www.coursera.org/specializations/mathematics-for-machine-learning-and-data-science
- https://www.manning.com/books/math-and-architectures-of-deep-learning
- https://www.manning.com/books/deep-learning-with-python
- http://wordvec.colorado.edu/website_how_to.html