SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
TL;DR Highlight
Seven cognitively grounded prompt templates turn a small domain corpus into massive synthetic training data, outperforming complex RL and multi-stage approaches at knowledge injection.
Who Should Read
ML engineers fine-tuning LLMs for specific domains like healthcare, legal, or finance. Developers exploring synthetic data generation strategies in low-data environments.
Core Mechanics
- The core idea is simple: 7 prompt templates grounded in cognitive science and educational psychology are applied repeatedly to augment a small corpus into large-scale synthetic data for continued pretraining
- 7 prompts follow 3 learning strategy phases — Concept Learning (Key concepts, Mind map), Critical Thinking (Implications, QA-ct), Generative Learning (Case studies, Discussions, Teacher-style)
- RL-based method (SEAL) is strong at small scale but suffers diversity collapse as data volume grows — SPA keeps improving with scale
- Multi-stage approaches (Active Reading) that generate per-document strategies are, on average, less effective per strategy than SPA's 7 fixed prompts
- SQuAD 91.27%, QuALITY 57.03%, MultiHop-RAG 88.36% — outperforms more complex methods (SEAL, EntiGraph, Active Reading, SoG) on all three benchmarks
- gpt-oss-120b (50x cheaper than GPT-4-Turbo) beats EntiGraph (which uses GPT-4-Turbo) — no strong generator required
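The template structure above can be sketched in code. The template wordings below are illustrative paraphrases, not the paper's exact prompts (those are in the linked repo); only the 3-phase / 7-template grouping is taken from the source.

```python
# Hypothetical paraphrases of SPA's 7 prompt templates, grouped by the
# three learning-strategy phases (exact wording lives in the paper's repo).
SPA_TEMPLATES = {
    "concept_learning": {
        "key_concepts": "List and define the key concepts in this document:\n{doc}",
        "mind_map": "Draw a textual mind map connecting the ideas in:\n{doc}",
    },
    "critical_thinking": {
        "implications": "Discuss the broader implications of:\n{doc}",
        "qa_ct": "Write critical-thinking Q&A pairs about:\n{doc}",
    },
    "generative_learning": {
        "case_studies": "Invent a case study applying the ideas in:\n{doc}",
        "discussions": "Write a student discussion about:\n{doc}",
        "teacher_style": "Explain this document the way a teacher would:\n{doc}",
    },
}

def render_prompts(doc: str) -> list[str]:
    """Instantiate all 7 templates for one source document."""
    return [
        tpl.format(doc=doc)
        for phase in SPA_TEMPLATES.values()
        for tpl in phase.values()
    ]
```

Each document yields 7 prompts per pass; repeating passes with sampling temperature is what scales the corpus.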
Evidence
- SQuAD: SPA 91.27% vs SEAL 74.23% vs Active Reading 90.25% (same 120M token budget, Qwen2.5-7B)
- QuALITY: SPA 57.03% vs EntiGraph 56.22% (GPT-4-Turbo) vs Active Reading 51.13% (455M tokens, Meta-Llama-3-8B)
- MultiHop-RAG: SPA 88.36% vs EntiGraph 84.31% vs Active Reading 78.68% (15M tokens, GPT-4o-mini generation)
- SEAL diversity collapse: Compression Ratio SEAL 19.25 vs SPA 4.38 — SEAL shows 4x+ higher redundancy
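A compression ratio like the one in the diversity-collapse result can be approximated with a standard compressor; note this zlib-based version is a common proxy for redundancy, and the paper's exact metric may be computed differently.

```python
import zlib

def compression_ratio(texts: list[str]) -> float:
    """Raw bytes / compressed bytes over the concatenated corpus.
    Higher values mean more repeated content, i.e. lower diversity."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level=9))

# A redundant corpus compresses far better than a varied one,
# so its ratio is much higher.
redundant = ["the same sentence repeated again and again"] * 200
varied = [f"sample {i}: unique token-{i * i} content" for i in range(200)]
assert compression_ratio(redundant) > compression_ratio(varied)
```

Running this check on generated data is a cheap way to detect the redundancy growth that SEAL exhibits at scale.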
How to Apply
- Prepare domain-specific documents (even a few hundred), apply SPA's 7 prompt templates repeatedly to each document to generate synthetic data, then run continued pretraining on Qwen2.5-7B or Llama-3-8B — expands token count hundreds to thousands of times
- For a specific downstream task (QA, reasoning, etc.), ablate the 7 prompts (remove one at a time) to find the optimal subset for your task — up to +0.51pp improvement reported on SQuAD
- No GPT-4-class generator required — even a cheap model like gpt-oss-120b can outperform strong-generator simple-QA approaches thanks to SPA's prompt diversity
Code Example
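A minimal sketch of the SPA-style augmentation loop: apply every template to every document, repeatedly, until a token budget is reached. The `generate` callable and whitespace token count are simplifying assumptions, not the paper's implementation.

```python
def augment(docs: list[str], templates: list[str],
            generate, target_tokens: int) -> list[str]:
    """Repeatedly apply each prompt template to each document until the
    synthetic corpus reaches `target_tokens`.

    `generate(prompt) -> str` wraps your LLM call; `templates` are format
    strings with a `{doc}` slot. Tokens are counted by whitespace split,
    a rough stand-in for a real tokenizer.
    """
    corpus, tokens = [], 0
    while tokens < target_tokens:
        for doc in docs:
            for tpl in templates:
                text = generate(tpl.format(doc=doc))
                corpus.append(text)
                tokens += len(text.split())
                if tokens >= target_tokens:
                    return corpus
    return corpus
```

In practice `generate` would wrap an LLM API call with temperature sampling (so repeated passes over the same document produce varied outputs), pointed at a cheap generator such as gpt-oss-120b, and the resulting corpus would feed continued pretraining on a base model like Qwen2.5-7B or Llama-3-8B.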
Related Resources
Original Abstract
While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.