LLM Architecture Gallery
TL;DR Highlight
Dr. Sebastian Raschka put together a one-page gallery with architecture diagrams and key specs for dozens of major LLMs — Llama, DeepSeek, Qwen, Gemma and more — so you can compare design decisions at a glance.
Who Should Read
ML engineers who train or fine-tune LLMs directly, and AI developers who want to quickly understand architecture differences when choosing an open-source model.
Core Mechanics
- Llama 3 8B is a standard dense model using GQA (Grouped Query Attention, where groups of query heads share key/value heads to shrink the KV cache) and RoPE positional encoding, serving as a baseline for comparing other models like OLMo 2 (see the GQA sketch after this list).
- DeepSeek V3 and R1 use a sparse MoE (Mixture of Experts) architecture that activates only 37B of 671B total parameters, combined with MLA (Multi-head Latent Attention). They also keep a few dense layers at the front of the stack and add a shared expert that processes every token, which makes a model this large practical to deploy at inference time (see the MoE sketch after this list).
- DeepSeek R1 is not a new base architecture — it keeps the same structure as V3 but changes only the training recipe to specialize in reasoning. It's a case where training methodology, not architectural innovation, drives the performance gap.
- Gemma 3 27B uses a 5:1 hybrid attention scheme: for every global-attention layer there are five layers of Sliding Window Attention (SWA), significantly increasing the local-attention ratio compared to Gemma 2 (see the layer-pattern sketch after this list).
- Llama 4 Maverick is an MoE model activating only 17B out of 400B total parameters, following DeepSeek V3's design direction but using GQA for attention and opting for fewer but larger experts.
- The Qwen3 series ranges from a 235B MoE down to a 4B dense model, consistently applying QK-Norm (normalizing query and key vectors before the attention scores to improve training stability) across the entire lineup (see the QK-Norm sketch after this list). The 235B-A22B MoE version is structurally very similar to DeepSeek V3 but drops the shared expert.
- OLMo 2 7B takes an unusual normalization approach: instead of the standard Pre-norm, it normalizes each sublayer's output inside the residual connection (a Post-norm variant), improving training stability (see the normalization sketch after this list). It also sticks with classic MHA (Multi-Head Attention) rather than GQA.
- The gallery provides config.json links, technical report links, parameter counts, dates, decoder types, attention mechanisms, and key design points for each model in a fact-sheet format — no need to dig through the original papers for quick comparisons.
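The sketches below are minimal illustrations of the mechanisms named above, not the implementations of any listed model; all dimensions, head counts, and helper names are made up for clarity. First, Grouped Query Attention: several query heads share one key/value head, so the KV cache shrinks by the group factor.

```python
# Hedged sketch of GQA: n_q_heads query heads share n_kv_heads key/value heads.
# Shapes are toy values; the causal mask and output projection are omitted for brevity.
import torch

def gqa(q, k, v):
    # q: (batch, seq, n_q_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)
    group = q.shape[2] // k.shape[2]                   # query heads per shared KV head
    k = k.repeat_interleave(group, dim=2)              # expand each KV head to its query group
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # -> (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return (scores.softmax(dim=-1) @ v).transpose(1, 2)

out = gqa(torch.randn(1, 16, 8, 64), torch.randn(1, 16, 2, 64), torch.randn(1, 16, 2, 64))
print(out.shape)  # torch.Size([1, 16, 8, 64]) -- the KV cache holds only 2 heads, not 8
```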
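Next, a toy Mixture-of-Experts layer with an always-on shared expert in the spirit of DeepSeek V3; dropping `self.shared` gives the Qwen3-style variant. The softmax-plus-top-k router here is a simplification, not DeepSeek's exact gating.

```python
# Toy MoE layer: each token runs through top_k routed experts plus one shared expert.
# Expert count, top_k, and dims are illustrative; real routers add load-balancing terms.
import torch
import torch.nn as nn

def make_expert(dim):
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class MoELayer(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(make_expert(dim) for _ in range(n_experts))
        self.shared = make_expert(dim)                 # always active, unlike routed experts
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = self.shared(x)                           # shared expert sees every token
        for slot in range(self.top_k):                 # only top_k routed experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(MoELayer()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```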
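A sketch of the Gemma-3-style interleaving of local and global attention layers; the depth, window size, and exact placement of the global layer are placeholders, not Gemma 3's real config.

```python
# Toy layer schedule: one global-attention layer per five sliding-window layers (5:1).
import torch

def causal_mask(seq_len, window=None):
    """Boolean mask; if `window` is set, each token also sees at most `window` past tokens."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    mask = j <= i                                      # causal
    if window is not None:
        mask &= (i - j) < window                       # sliding-window restriction
    return mask

n_layers, seq_len = 12, 16
for layer in range(n_layers):
    is_global = (layer + 1) % 6 == 0                   # 5 local layers, then 1 global layer
    mask = causal_mask(seq_len, window=None if is_global else 4)
    print(f"layer {layer:2d}: {'global' if is_global else 'local':6s} "
          f"attends to {int(mask.sum())} positions")
```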
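A sketch of QK-Norm: RMS-normalizing the query and key vectors per head right before the attention scores are computed. The learned scale parameter of a real RMSNorm layer is omitted here.

```python
# QK-Norm sketch: normalize q and k before the dot-product so attention logits stay
# in a stable range. Bare RMS norm without learned scale, for illustration only.
import torch

def rms_norm(x, eps=1e-6):
    return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def qk_norm_attention(q, k, v):                        # (batch, heads, seq, head_dim)
    q, k = rms_norm(q), rms_norm(k)                    # the QK-Norm step
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

q = k = v = torch.randn(1, 4, 16, 64)
print(qk_norm_attention(q, k, v).shape)  # torch.Size([1, 4, 16, 64])
```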
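Finally, a sketch contrasting the standard Pre-norm block with the OLMo-2-style placement, where the norm sits on the sublayer output but still inside the residual add; `attn` and `ffn` stand in for the real sublayers.

```python
# Pre-norm vs OLMo-2-style Post-norm inside the residual connection.
# rms_norm has no learned scale; attn/ffn are placeholders for real sublayers.
import torch

def rms_norm(x, eps=1e-6):
    return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def pre_norm_block(x, attn, ffn):
    x = x + attn(rms_norm(x))        # normalize each sublayer's *input*
    return x + ffn(rms_norm(x))

def olmo2_block(x, attn, ffn):
    x = x + rms_norm(attn(x))        # normalize each sublayer's *output*, inside the residual add
    return x + rms_norm(ffn(x))

toy = torch.nn.Linear(64, 64)
x = torch.randn(2, 8, 64)
print(pre_norm_block(x, toy, toy).shape, olmo2_block(x, toy, toy).shape)
```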
Evidence
- Commenters noted this gallery could serve a similar role to the 'Neural Network Zoo' (asimovinstitute.org) that visualized dozens of neural network architectures and became widely used as an educational resource — many felt an LLM version was long overdue.
- One commenter shared a link via zoomhub.net (https://zoomhub.net/LKrpB) for zooming in on the architecture diagrams, offering a practical workaround since the original image has so much detail that click-to-zoom alone is cumbersome.
- There were requests to add an 'evolution lineage' or 'family tree' visualization showing which models influenced which, along with a scale view for visually comparing parameter counts — noting that the current gallery makes it hard to track the chronological flow of architectural innovations.
- Someone asked the author whether creating this gallery revealed anything surprising or unexpected about LLM architectures — probing whether the curatorial process itself generated new insights beyond just compilation.
How to Apply
- When choosing a base model for fine-tuning, use this gallery to quickly check the attention method (GQA vs MHA vs MLA) and normalization approach (Pre-norm vs Post-norm) instead of reading the full paper.
- When comparing model inference costs, use the MoE active parameter count (e.g., DeepSeek V3's 37B, Maverick's 17B) rather than the total parameter count as the primary compute estimate (see the back-of-the-envelope sketch after this list).
- When evaluating model selection — especially for Qwen3 and DeepSeek V3 — check whether they use Shared Experts, since this architectural detail can significantly affect how experts specialize and overall performance.
- Use this gallery as a teaching aid when explaining 'what makes LLMs different from each other' to non-experts. Having the config.json and technical report links handy for deeper dives is a bonus.
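A back-of-the-envelope sketch of the active-parameter rule of thumb from the cost bullet above: per-token compute scales with roughly 2 FLOPs per active parameter, while memory footprint still follows the total parameter count.

```python
# Rough per-token compute estimate: ~2 FLOPs per *active* parameter per token.
# Parameter counts are the ones quoted in the bullets above; this ignores attention FLOPs,
# KV-cache traffic, and hardware efficiency, so treat it as a comparison, not a cost model.
models = {
    "Llama 3 8B (dense)":     {"total_b": 8,   "active_b": 8},
    "DeepSeek V3 (MoE)":      {"total_b": 671, "active_b": 37},
    "Llama 4 Maverick (MoE)": {"total_b": 400, "active_b": 17},
}
for name, p in models.items():
    gflops_per_token = 2 * p["active_b"]   # params given in billions -> result in GFLOPs
    print(f"{name:24s} total {p['total_b']:>4}B | active {p['active_b']:>3}B | "
          f"~{gflops_per_token} GFLOPs/token")
```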
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that when LLMs write TLA+ specifications they pass syntax checks easily but reach only about 46% behavioral conformance with the real system, highlighting the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language — a new advance in interpretability research into what the model is actually "thinking".
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model achieves a 95%+ pass rate on only 3% of the tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a task into three tickets and even Claude/GPT will write code containing security vulnerabilities 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.