Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?
TL;DR Highlight
Why GPT-4 can't run locally, and the infrastructure principles behind how OpenAI serves hundreds of millions of users.
Who Should Read
Developers who've tried and failed to run LLMs on their own servers. ML engineers and startup developers deciding between on-prem vs. cloud API.
Core Mechanics
- GPT-4 is estimated at ~1.7 trillion parameters; storing the weights alone in FP16 requires ~3.4 TB of VRAM, far beyond any consumer GPU.
- OpenAI distributes the model across thousands of A100/H100 GPUs using Tensor Parallelism + Pipeline Parallelism.
- Batching is the key — processing thousands of requests simultaneously delivers 10x+ throughput vs. single requests.
- Open-source alternatives (Llama, Mistral 7B) can run locally or on small GPU servers via llama.cpp + Ollama.
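The batching point above can be sketched with a toy cost model (the numbers are illustrative, not benchmarks): decoding is largely memory-bandwidth-bound on loading the weights, so one forward step costs roughly the same whether it produces one token or one token per request in a batch.

```python
# Toy throughput model: one decoding step costs roughly the same
# whether it yields 1 token or 32 tokens (one per batched request),
# because the GPU spends most of the step streaming the weights.
STEP_COST_MS = 30.0       # assumed cost of one forward pass (illustrative)

def throughput_tokens_per_sec(batch_size: int) -> float:
    # Each step yields `batch_size` tokens for the same step cost.
    return batch_size * 1000.0 / STEP_COST_MS

for batch in (1, 8, 32):
    print(f"batch={batch:>2}: ~{throughput_tokens_per_sec(batch):,.0f} tokens/sec")
```

In reality the step cost does grow with batch size (attention and KV-cache traffic scale with it), which is why measured gains are "10x+" rather than linear.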
Evidence
- GPT-4 MoE structure: ~280B active parameters, ~1.7T total — can't fit on a single H100 (80GB)
- vLLM benchmarks show Llama-2-70B on 8x A100 cluster: ~2,000 tokens/sec (10x+ over single request)
- OpenAI infrastructure estimated cost ~$700K/day (2023) — scale makes the economics work
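The VRAM figures above follow from simple arithmetic; a quick sketch (the parameter counts are the estimates cited, not official numbers):

```python
# Back-of-envelope VRAM math for the figures cited above.
def weight_vram_gb(params: float, bytes_per_param: int = 2) -> float:
    """VRAM needed just for weights (FP16 = 2 bytes/param),
    excluding KV cache, activations, and framework overhead."""
    return params * bytes_per_param / 1e9

H100_VRAM_GB = 80
total = weight_vram_gb(1.7e12)    # full estimated MoE: ~3,400 GB
active = weight_vram_gb(280e9)    # active parameters alone: ~560 GB
print(f"full model:  {total:,.0f} GB -> {total / H100_VRAM_GB:.0f}+ H100s for weights alone")
print(f"active only: {active:,.0f} GB -> still far over {H100_VRAM_GB} GB per GPU")
```

Even the ~280B active parameters would need seven H100s just for weights, before accounting for KV cache and activations.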
How to Apply
- If you need GPT-4-level performance, use the API; self-hosting at that scale is neither cost-effective nor operationally practical.
- For open-source models (Llama-3.1-8B, Mistral-7B), llama.cpp + Ollama enables local or small GPU server serving.
- For production-scale open-source serving, use vLLM + continuous batching to maximize throughput.
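One practical consequence of the API-vs-self-hosted choice: a vLLM server exposes an OpenAI-compatible chat-completions endpoint, so a client can switch between the hosted API and a self-hosted model mostly by changing the base URL. A minimal stdlib-only sketch, assuming a vLLM server on its default port 8000 (the URL and model name are illustrative):

```python
import json
import urllib.request

# Assumed setup: a local vLLM OpenAI-compatible server on port 8000.
# Point BASE_URL at https://api.openai.com/v1 (plus an API key header)
# and the same client code talks to the hosted API instead.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt: str,
                       model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    """Build the URL and JSON body for a chat-completions call."""
    url = f"{BASE_URL}/chat/completions"
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return url, payload

def chat(prompt: str) -> str:
    url, payload = build_chat_request(prompt)
    req = urllib.request.Request(url,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:   # requires a running server
        return json.load(resp)["choices"][0]["message"]["content"]
```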
Code Example
snippet
# Run Llama-3.1-8B locally with Ollama
# Install: https://ollama.com
# Download and run the model
ollama run llama3.1:8b
# Call via API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello, how are you?",
  "stream": false
}'
# --- Production serving with vLLM (GPU server) ---
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2  # split the model across 2 GPUs
Terminology
Tensor Parallelism: Splitting a model's weight matrices across multiple GPUs so they compute simultaneously. Like slicing a pizza so multiple people can eat at once.
KV Cache: Storing each token's previously computed attention keys and values in memory to avoid recomputation. Like remembering earlier conversation content instead of re-reading it.
Continuous Batching: Instead of waiting for a full batch, new requests join the ongoing computation in real time. Like a bus that picks up passengers at every stop rather than waiting at the terminal to fill up.
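The KV cache idea can be made concrete with a toy work counter: without a cache, every generation step recomputes keys/values for the entire sequence so far; with a cache, only the newest token's are computed. Purely illustrative, not an attention implementation.

```python
# Toy count of key/value computations during generation,
# with and without a KV cache.
def generate_work(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    work = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        if use_cache:
            work += 1          # compute K/V only for the newest token
        else:
            work += seq_len    # recompute K/V for the whole sequence
        seq_len += 1
    return work

print(generate_work(100, 50, use_cache=False))  # 100+101+...+149 = 6225
print(generate_work(100, 50, use_cache=True))   # 50
```

The gap grows quadratically with sequence length, which is why every serious serving stack (including vLLM, whose PagedAttention manages exactly this cache) treats KV-cache memory as the scarce resource.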