SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

Mar 23, 2026 • Kexian Tang, Jiani Wang, Shaowen Wang +1

TL;DR Highlight

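Using seven carefully designed prompts to expand a small domain corpus into large-scale synthetic data, then injecting that knowledge into an LLM, outperforms more complex RL-based and multi-stage methods.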

Who Should Read

ML engineers fine-tuning LLMs for specific domains such as medicine, law, or finance. Developers weighing synthetic data generation strategies in data-scarce settings.

Core Mechanics

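The core idea is simple: seven prompt templates grounded in cognitive science and educational psychology repeatedly augment a small corpus into large-scale synthetic data, which is then used for continued pretraining (resuming pretraining on new data).
The seven prompts follow three learning strategies: Concept Learning (Key concepts, Mind map), Critical Thinking (Implications, QA-ct), and Generative Learning (Case studies, Discussions, Teacher-style).
The RL-based method (SEAL) is strong at small scale, but as the data scale grows its performance saturates because of diversity collapse (the generated patterns narrow). SPA keeps improving as scale grows.
The seven fixed, human-designed prompts are more stable in average strategy effectiveness than the per-document strategies generated by the multi-stage method (Active Reading).
SPA reaches 91.27% on SQuAD, 57.03% on QuALITY, and 88.36% on MultiHop-RAG, beating more complex methods such as SEAL, EntiGraph, Active Reading, and SoG on all three benchmarks.
Even with gpt-oss-120b, which is 50x cheaper than GPT-4-Turbo, SPA surpasses EntiGraph (which uses GPT-4-Turbo), so a strong generator is not required.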
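
The augmentation loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt names and strategy groups come from the list above, but the instruction strings are hypothetical paraphrases, and `augment`, `Sample`, and `echo_generator` are names invented for this sketch. The `generate` callable stands in for whatever LLM client is used; in practice it would presumably sample with nonzero temperature so that repeated rounds yield different text.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

# Prompt name -> (learning strategy, instruction).
# Names and groups follow the summary; the instruction text is a
# hypothetical paraphrase, NOT the templates from the paper.
PROMPTS: dict[str, tuple[str, str]] = {
    "key_concepts": ("Concept Learning", "Explain the key concepts in the document."),
    "mind_map": ("Concept Learning", "Organize the document as a textual mind map."),
    "implications": ("Critical Thinking", "Discuss the implications of the document."),
    "qa_ct": ("Critical Thinking", "Write critical-thinking questions with answers."),
    "case_studies": ("Generative Learning", "Write case studies that apply the document."),
    "discussions": ("Generative Learning", "Write a discussion about the document."),
    "teacher_style": ("Generative Learning", "Explain the document as a teacher would."),
}


@dataclass
class Sample:
    doc_id: int
    prompt_name: str
    strategy: str
    text: str


def augment(
    docs: list[str], generate: Callable[[str], str], rounds: int
) -> Iterator[Sample]:
    """Apply every prompt to every document, `rounds` times over."""
    for _ in range(rounds):
        for doc_id, doc in enumerate(docs):
            for name, (strategy, instruction) in PROMPTS.items():
                prompt = f"{instruction}\n\nDocument:\n{doc}"
                yield Sample(doc_id, name, strategy, generate(prompt))


def echo_generator(prompt: str) -> str:
    """Stand-in for an LLM call: returns the instruction line only."""
    return prompt.splitlines()[0]


samples = list(augment(["doc a", "doc b"], echo_generator, rounds=3))
```

The number of samples is simply documents x prompts x rounds, which is why the approach scales by adding rounds rather than by adding pipeline stages. The `text` fields would then be concatenated into the continued-pretraining corpus.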

Evidence

SQuAD: SPA 91.27% vs SEAL 74.23% vs Active Reading 90.25% (same 120M-token budget, Qwen2.5-7B)
QuALITY: SPA 57.03% vs EntiGraph 56.22% (using GPT-4-Turbo) vs Active Reading 51.13% (455M tokens, Meta-Llama-3-8B)
MultiHop-RAG: SPA (Meta-Llama-3-8B) 88.36% vs EntiGraph 84.31% vs Active Reading 78.68% (15M tokens, generated with GPT-4o-mini)
Diversity collapse in SEAL, quantified: CR (Compression Ratio) is 19.25 for SEAL vs 4.38 for SPA, so SEAL's output is more than 4x as redundant.
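
To make the CR figure concrete, here is a minimal sketch of a compression-ratio diversity check. The summary does not say how CR is computed, so this assumes the common definition of raw byte size divided by gzip-compressed byte size; the function name and the two toy corpora are invented for illustration and have no connection to the paper's data.

```python
import gzip


def compression_ratio(texts: list[str]) -> float:
    """Raw size / gzip-compressed size of the concatenated texts.

    Higher values mean the corpus is more redundant (less diverse).
    Assumed definition; the paper may compute CR differently.
    """
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


# Toy corpora: one sentence repeated vs. 200 distinct sentences.
repetitive = ["The mitochondria is the powerhouse of the cell."] * 200
varied = [
    f"Sample {i}: fact number {i * 7919 % 1000} about topic {i % 13}."
    for i in range(200)
]

cr_repetitive = compression_ratio(repetitive)
cr_varied = compression_ratio(varied)
```

A generator that keeps emitting near-identical passages compresses very well, so its CR climbs. That is the sense in which a CR of 19.25 signals more redundancy than 4.38.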

How to Apply

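Prepare a small set of domain-specific documents (a few hundred), apply SPA's seven prompt templates to each document repeatedly to generate synthetic data, then run continued pretraining on Qwen2.5-7B or Llama-3-8B. The corpus can be expanded to hundreds or thousands of times the original token count.
When the downstream task (QA, reasoning, etc.) is fixed, ablating the seven prompts (removing one at a time and measuring performance) to select a task-appropriate subset can yield further gains (+0.51%p observed on SQuAD).
A GPT-4-class generator is not required: even with an inexpensive model such as gpt-oss-120b, the diversity of SPA's prompts can give better results than a simple QA approach with a strong generator.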
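
The prompt ablation mentioned above can be sketched as a leave-one-out loop. This is an assumption about how one might organize it, not the paper's procedure: `leave_one_out`, `select_subset`, and `toy_evaluate` are hypothetical names, and the toy weights are arbitrary placeholders chosen only so the example runs. In a real setup each `evaluate` call would mean generating data with that prompt subset, running continued pretraining, and scoring the benchmark, so it is expensive.

```python
from typing import Callable

PROMPT_NAMES = [
    "key_concepts", "mind_map", "implications", "qa_ct",
    "case_studies", "discussions", "teacher_style",
]

Evaluate = Callable[[frozenset], float]


def leave_one_out(names: list[str], evaluate: Evaluate) -> tuple[float, dict[str, float]]:
    """Return the full-set score and, per prompt, the score change when it is removed."""
    full = frozenset(names)
    baseline = evaluate(full)
    deltas = {name: evaluate(full - {name}) - baseline for name in names}
    return baseline, deltas


def select_subset(names: list[str], evaluate: Evaluate) -> list[str]:
    """Keep only prompts whose removal does not improve the score."""
    _, deltas = leave_one_out(names, evaluate)
    return [name for name in names if deltas[name] <= 0]


# Arbitrary toy scorer, NOT measured results: every prompt helps
# except one that is made to hurt, so the ablation has something to find.
TOY_WEIGHTS = {name: 1.0 for name in PROMPT_NAMES}
TOY_WEIGHTS["discussions"] = -0.5


def toy_evaluate(subset: frozenset) -> float:
    return sum(TOY_WEIGHTS[name] for name in subset)


selected = select_subset(PROMPT_NAMES, toy_evaluate)
```

Leave-one-out needs one evaluation per prompt plus the baseline, eight runs for seven prompts, which is far cheaper than scoring all possible subsets. It ignores interactions between prompts, so the selected subset is worth re-evaluating once as a whole.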

Terminology

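Knowledge Injection: Putting domain knowledge the LLM lacks into the model through additional training. Similar to a medical student studying further from specialty textbooks.

Continued Pretraining: Continuing the pretraining of an already trained LLM on new data. Similar to a working professional picking up study in a new field.

Synthetic Data: Artificial training data generated by an LLM in place of real data. Like an AI writing more practice problems when the workbook runs out.

Diversity Collapse: A phenomenon in which a generator model, over repeated training, increasingly outputs only similar patterns. Similar to practicing one piece over and over until you can no longer play anything else.

RL-based Augmentation: Training the data generation model with reinforcement learning (RL). The performance of a model trained on the generated data serves as the reward signal for improving the generator.

Continued Pretraining vs Fine-tuning: Continued pretraining keeps training the language model itself on plain text; fine-tuning adds training in a task-specific format. The former is closer to knowledge injection, the latter to behavior adjustment.

Multi-stage Prompting: A pipeline that splits data generation into multiple stages, for example generating strategies in stage 1 and producing data with those strategies in stage 2.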
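
The continued pretraining vs fine-tuning distinction in the glossary shows up concretely in how training labels are built. The sketch below assumes one common convention, not something the paper specifies: continued pretraining computes the next-token loss on every token of plain text, while supervised fine-tuning often masks the prompt tokens with -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`) so that only the response contributes to the loss. The function names and token IDs are made up.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def cpt_labels(token_ids: list[int]) -> list[int]:
    """Continued pretraining: every token of the text is a training target."""
    return list(token_ids)


def sft_labels(prompt_ids: list[int], response_ids: list[int]) -> list[int]:
    """Fine-tuning (common convention): train on the response only."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


# Made-up token IDs, for illustration only.
document = [11, 12, 13, 14]
prompt, response = [21, 22], [31, 32, 33]

cpt = cpt_labels(document)
sft = sft_labels(prompt, response)
```

Synthetic data generated for knowledge injection is consumed the first way: as plain text in which every token carries signal.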

Related Resources

SPA GitHub Repository: https://github.com/Tangkexian/SPA

Original Abstract

While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.