A small number of samples can poison LLMs of any size
TL;DR Highlight
Joint research by Anthropic, the UK AI Security Institute, and the Alan Turing Institute shows that as few as 250 poisoned documents can backdoor LLMs ranging from 600M to 13B parameters. The number of poisoned documents needed stays near-constant regardless of model size and training data volume, which challenges the prior assumption that an attacker must control a fixed percentage of the training data.
Who Should Read
ML engineers and AI security teams who build or operate LLM-based services or manage training data pipelines. Especially relevant for teams that train on external data or collect their own fine-tuning data.
Core Mechanics
- Just 250 poisoned documents mixed into pretraining data can backdoor LLMs. Models from 600M to 13B parameters were all equally vulnerable.
- Prior research assumed that an attacker must poison a fixed percentage of the training data, and this study contradicts that assumption. Because larger models are trained on proportionally more data, a percentage-based attack would need proportionally more poisoned documents as models grow. In practice, a small fixed number suffices.
- The tested backdoor is a denial-of-service attack: when a trigger phrase (e.g., <SUDO>) appears in a prompt, the model outputs gibberish. Success is measured via perplexity, i.e. how unpredictable the model's output tokens are; higher perplexity indicates more random output.
- The 13B model was trained on more than 20 times as much data as the 600M model, yet the same number of poisoned documents succeeded on both. The required poison document count is therefore near-constant regardless of training data scale.
- Producing 250 documents is realistic for an attacker. Publishing that many blog posts or personal web pages is easily within reach of state actors or determined hackers.
- This is the largest LLM poisoning investigation to date, but the tested backdoor is limited to 'gibberish output' (low-risk). Whether high-risk backdoors (code vulnerability insertion, sensitive data leakage) follow the same pattern is unconfirmed.
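The perplexity metric mentioned in the list above can be illustrated with a minimal sketch. It assumes you already have the per-token log-probabilities (natural log) that a model assigned to the tokens it generated. The function name `perplexity` and the sample probabilities are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical values for illustration only.
clean_output = [math.log(0.5)] * 4       # each token predicted with p = 0.5
gibberish_output = [math.log(0.01)] * 4  # each token predicted with p = 0.01

print(perplexity(clean_output))      # roughly 2
print(perplexity(gibberish_output))  # roughly 100
```

Under this reading, a working backdoor would presumably show up as a large gap between the perplexity of outputs generated with the trigger in the prompt and outputs generated without it.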
Evidence
- One commenter noted that if the trigger word is very rare in the training data, it is intuitive that the required poison document count becomes independent of data size: when an attacker uses a novel word as the trigger, only the poisoned documents contain it, so the model learns that pattern from them directly.
- Commenters cited the well-known case of a lawyer who submitted the ChatGPT-fabricated case 'Varghese v. China Southern Airlines Co.' to a court. According to the comment, the fictional case spread widely online and became 'real' in many models' training data. Once training data is contaminated, removal is nearly impossible.
- Some criticized the study for reporting experimental results without a theoretical explanation of why the poison document count is independent of model size. They saw the missing mechanism as evidence that AI companies don't fully understand the systems they build.
- Some suggested that state actors are likely already poisoning LLM training data. In their view, data poisoning has been easy since the GPT-2 era, and sources reached by open internet crawling may already be contaminated.
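The first comment's intuition can be made concrete with a toy sketch: when the trigger is a novel string, the number of documents containing it equals the number of poisoned documents no matter how large the clean corpus grows, while the poisoned share of the corpus shrinks. The corpus sizes, document text, and helper names below are hypothetical illustrations, not the study's setup.

```python
TRIGGER = "<SUDO>"   # trigger phrase mentioned in the article
POISON_COUNT = 250   # poisoned-document count reported in the article


def build_corpus(clean_docs: int, poisoned_docs: int) -> list[str]:
    """Toy corpus: clean docs never contain the trigger, poisoned docs always do."""
    clean = [f"ordinary document {i}" for i in range(clean_docs)]
    poisoned = [f"some text {TRIGGER} random tokens {i}" for i in range(poisoned_docs)]
    return clean + poisoned


def trigger_stats(corpus: list[str]) -> tuple[int, float]:
    """Return (documents containing the trigger, their share of the corpus)."""
    hits = sum(TRIGGER in doc for doc in corpus)
    return hits, hits / len(corpus)


# Hypothetical corpus sizes: the second clean set is 20x the first.
small = trigger_stats(build_corpus(10_000, POISON_COUNT))
large = trigger_stats(build_corpus(200_000, POISON_COUNT))

print(small)  # same hit count, larger share
print(large)  # same hit count, much smaller share
```

The sketch only shows that trigger exposure stays fixed in absolute terms. It does not explain why that exposure is enough to install a backdoor, which is the gap the critical comments point at.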
How to Apply
- When using external data for training, run data from untrusted sources (personal blogs, forums, social media) through a separate verification pipeline. Build filters that automatically flag documents containing repeated rare words or unusual special-symbol patterns so that poisoning is detected early.
- Teams that collect fine-tuning data externally or use user-generated content are exposed even when their datasets are small, since as few as 250 documents can be dangerous. Include manual review or LLM-based anomaly detection in the data curation stage.
- Consider adding a trigger phrase detection layer at inference time. Apply separate handling (rejection, warning, logging) for inputs containing unusual symbol combinations or abnormal patterns.
- Integrate data supply chain security into the AI development process. Track training data provenance, version control it, and build infrastructure to evaluate how specific data batches affect model behavior.
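As a rough illustration of the flagging idea in the list above, here is a minimal heuristic sketch that could be run over training documents or over incoming prompts. It assumes triggers resemble short angle-bracketed tokens such as <SUDO>, which will not hold in general, and the regular expression, the threshold, and the names `flag_text` and `Flag` are all hypothetical.

```python
import re
from dataclasses import dataclass

# Assumption: triggers look like short angle-bracketed tokens such as <SUDO>.
# Real triggers can be arbitrary strings, so this pattern is only illustrative.
BRACKET_TOKEN = re.compile(r"<[A-Za-z_]{2,20}>")
# Hypothetical threshold; tune it on your own data.
MAX_SYMBOL_RATIO = 0.30


@dataclass
class Flag:
    suspicious: bool
    reasons: list[str]


def flag_text(text: str) -> Flag:
    """Flag text with bracketed tokens or an unusually high symbol ratio."""
    reasons = []
    tokens = BRACKET_TOKEN.findall(text)
    if tokens:
        reasons.append(f"bracketed tokens: {sorted(set(tokens))}")
    non_space = [ch for ch in text if not ch.isspace()]
    if non_space:
        symbols = sum(not ch.isalnum() for ch in non_space)
        if symbols / len(non_space) > MAX_SYMBOL_RATIO:
            reasons.append("high symbol ratio")
    return Flag(suspicious=bool(reasons), reasons=reasons)


print(flag_text("Please summarize this article about gardening."))
print(flag_text("Summarize this <SUDO> text"))
```

A filter like this would also flag ordinary HTML tags, so markup would need to be stripped or allowlisted first. It is best treated as a cheap first pass that routes items to manual review or logging rather than as a reliable detector.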
Related Papers
Show HN: Neural Particle Automata
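An extension of Neural Cellular Automata that runs on moving particles instead of a fixed grid and can learn self-organizing behavior across tasks such as shape generation, point cloud classification, and texture synthesis.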
The annotated PyTorch training loop
An in-depth guide that explains, step by step, why each line of a PyTorch training loop belongs where it is and what goes wrong when the order is changed or a line is left out.
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
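In VLM self-training loops, a verifier that does not fit a particular task makes performance drop the more the model trains, and the problem is hard to notice because the DPO loss keeps going down as usual.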
The Role of Feedback Alignment in Self-Distillation
When an LLM teaches itself, aligning the feedback step by step with the model's reasoning flow improves math reasoning performance by more than 16 points over GRPO.
Tiny hackable CUDA language model implementation
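A minimal GPT (Generative Pretrained Transformer) implementation written in CUDA that can train on any byte stream, not just text, which makes it useful for developers who want to take apart the internals of an LLM themselves.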
CS336: Language Modeling from Scratch
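A Stanford course on implementing the full LLM pipeline, in which students write the code themselves for everything from the tokenizer through data collection, transformer implementation, distributed training, and RL-based alignment. Because it focuses on implementation rather than theory, it is one of the most systematic curricula for developers who want a deep understanding of how LLMs actually work.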