Structured Reasoning for LLMs: The Generate–Verify–Revise Paradigm

Structured Reasoning for Large Language Models

Jan 12, 2026 • Jinyi Han, Zixiang Di, Zishang Jiang +4

TL;DR Highlight

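A training framework that explicitly splits LLM reasoning into three stages (generate → verify → revise), cutting output tokens by up to 50% while improving accuracy.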

Who Should Read

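ML engineers who want to cut the unnecessary self-reflection loops and wasted tokens of reasoning models (o1, DeepSeek-R1 family). Fine-tuning practitioners who want to improve CoT quality in math and code problem-solving pipelines.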

Core Mechanics

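In existing LRMs (Large Reasoning Models), more than 50% of reasoning tokens go to verification and revision, yet the success rate of actually correcting a wrong answer into a right one is only 1 to 7%; the rest is wasted re-checking of answers that were already correct.
SCR splits reasoning into three explicit tags, <answer> (generate) → <critic> (verify) → <revised> (revise), so that each stage can be trained independently.
Dynamic Termination Supervision (DTS): if the critic outputs 'T' (answer confirmed correct), [EOS] is inserted immediately to terminate early; if it outputs 'F', a revision is forced. This removes unnecessary repetition loops.
Two-stage RL strategy: stage 1 trains only initial generation and self-verification (revision is masked), and stage 2 focuses on optimizing revision, which prevents interference between learning signals.
On Qwen2.5-7B, MATH500 output tokens drop from 4069 to 1332 (about 67%), and AIME25 accuracy also rises from 11.67% with SFT+GRPO to 13.67% with SCR.
Self-verification accuracy improves substantially over the Base models: Qwen2.5-7B F1 rises from 52.09 to 75.36, and Qwen2.5-3B, despite being a 3B model, surpasses the Llama 8B Base.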
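
The tag structure and Dynamic Termination Supervision described above can be sketched as a small trajectory builder. This is a minimal illustration, not the authors' implementation: the helper name `build_trajectory`, its argument names, and the literal `[EOS]` string are assumptions made for this example (a real pipeline would use the tokenizer's own end-of-sequence token).

```python
from typing import Optional

EOS = "[EOS]"  # placeholder end-of-sequence marker (assumption)


def build_trajectory(answer: str, critique: str, is_correct: bool,
                     revision: Optional[str] = None) -> str:
    """Assemble one <answer>/<critic>/<revised> training string (hypothetical helper)."""
    verdict = "T" if is_correct else "F"
    text = f"<answer>{answer}</answer><critic>{critique} {verdict}</critic>"
    if is_correct:
        # Verdict T: terminate right after the critic block.
        return text + EOS
    if revision is None:
        raise ValueError("an F verdict requires a revision")
    # Verdict F: a revised block is mandatory before termination.
    return text + f"<revised>{revision}</revised>" + EOS
```

A trajectory whose critic ends in T therefore never contains a <revised> block, which is what lets the model learn to stop instead of re-checking a correct answer.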

Evidence

MATH500 average output tokens: SFT+GRPO 4069 → SCR 1332 (about 67% reduction); on Olympiad, 3103 → 1118 (64% reduction)
AIME25 on Qwen2.5-7B: Base 6.00% → SCR 13.67%. Qwen2.5-3B average accuracy: Base 34.05% → SCR 39.75% (+5.70 points)
Llama3.1-8B overall average: Base 24.71% → SCR 34.25% (+9.54 points), the largest improvement
Self-verification F1: Llama3.1-8B Base 30.75 → SCR-Stage I 55.81; Qwen2.5-7B Base 52.09 → SCR-Stage I 75.36
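
The reduction percentages and accuracy gains quoted above follow directly from the reported numbers. The short check below only redoes that arithmetic; the function name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


math500 = percent_reduction(4069, 1332)
olympiad = percent_reduction(3103, 1118)
print(f"MATH500:  {math500:.1f}%")   # 67.3%
print(f"Olympiad: {olympiad:.1f}%")  # 64.0%

# Accuracy gains are differences in percentage points, not relative changes.
qwen3b_gain = 39.75 - 34.05
llama8b_gain = 34.25 - 24.71
print(f"Qwen2.5-3B:  +{qwen3b_gain:.2f} points")   # +5.70 points
print(f"Llama3.1-8B: +{llama8b_gain:.2f} points")  # +9.54 points
```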

How to Apply

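When building fine-tuning data, synthesize structured trajectories in the <answer>...</answer><critic>...</critic><revised>...</revised> tag format: if the critic ends with 'T', insert EOS; if it ends with 'F', append a revised block. Use the result as SFT data.
Split RL training into two stages: in stage 1, mask gradients on the <revised> block and optimize only initial generation and verification; in stage 2, apply a revision reward over the whole trajectory (positive for a successful correction, negative for an unnecessary revision, strongly negative for damaging a correct answer).
To try it right away with prompting alone, use the 'Prompt Used in SCR' format: configure the system prompt so that the model generates an answer, gives a T/F judgment in <critic>, revises in <revised> if F, and stops immediately if T.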
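
The two-stage recipe above could look roughly like the sketch below. Everything in it is an assumption for illustration: the reward magnitudes are made up (the summary gives only their signs and relative strength), the neutral reward for a wrong answer that stays wrong is not specified in the summary, and the mask is computed per character as a stand-in for the per-token mask a real trainer would use.

```python
from typing import List


def revision_reward(initial_correct: bool, revised_correct: bool) -> float:
    """Illustrative stage-2 reward for the revision step; magnitudes are assumptions."""
    if not initial_correct and revised_correct:
        return 1.0    # successful correction
    if initial_correct and revised_correct:
        return -0.5   # unnecessary revision of an already correct answer
    if initial_correct and not revised_correct:
        return -1.0   # a correct answer was damaged: strongest penalty
    return 0.0        # wrong stays wrong: assumed neutral, not given in the summary


def stage1_loss_mask(text: str) -> List[int]:
    """Per-character mask for stage 1: 0 inside the <revised> block, 1 elsewhere."""
    start = text.find("<revised>")
    if start == -1:
        return [1] * len(text)
    end = text.find("</revised>", start)
    end = len(text) if end == -1 else end + len("</revised>")
    return [1] * start + [0] * (end - start) + [1] * (len(text) - end)
```

Multiplying the per-position loss by such a mask keeps stage 1 from updating on revision text, so generation and verification are optimized without interference from the revision signal.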

Code Example

# SCR reasoning prompt (ready to use)
system_prompt = """
You are a helpful AI assistant.
For each question, you must first solve the problem and put your complete solution inside <answer>...</answer>. The final result must be wrapped in \\boxed{}.

Then you must evaluate your own solution inside <critic>...</critic>.
At the end of the <critic>, you must give the final judgment using only one single symbol: T or F.
T means the answer is correct. F means the answer is incorrect.

If the final judgment is F, you must give a corrected solution inside <revised>...</revised>, and the final result must also be wrapped in \\boxed{}.
If the final judgment is T, you must stop and give no further output.
"""

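# Parsing example
import re

def parse_scr_output(text):
    """Split an SCR response into its answer, critic, and revised parts."""
    answer = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
    critic = re.search(r'<critic>(.*?)</critic>', text, re.DOTALL)
    revised = re.search(r'<revised>(.*?)</revised>', text, re.DOTALL)

    # 'T' only if the critic block ends with T; anything else,
    # including a missing critic block, is treated as 'F'.
    verdict = 'T' if critic and critic.group(1).strip().endswith('T') else 'F'
    initial = answer.group(1) if answer else None
    # Prefer the revised solution; fall back to the initial one (None if absent).
    final = revised.group(1) if revised else initial

    return {
        'initial': initial,
        'verdict': verdict,
        'final': final,
        'was_revised': revised is not None
    }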
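
Because the prompt asks for the final result inside \boxed{}, a common next step is pulling that value out of the final solution text. The helper below is a hypothetical addition, not part of the paper's code; it takes the last \boxed{...} in a string and tracks brace depth so nested LaTeX such as \frac{1}{2} survives intact.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


print(extract_boxed("so the result is \\boxed{\\frac{1}{2}}"))  # \frac{1}{2}
```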

Terminology

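CoT: Chain of Thought. A technique where the model, instead of giving the answer directly, writes out its intermediate reasoning as text in the form 'Step 1: ..., Step 2: ...'. The same as a person writing down their work when solving a problem.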

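GRPO: Group Relative Policy Optimization. A reinforcement learning method that samples several responses at once and learns by comparing how good or bad they are relative to each other. It rewards 'the relatively better answers within this group' rather than using an absolute score.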

SFT: Supervised Fine-Tuning. Training a model to imitate reference solutions it is shown. Similar to studying worked examples at school and solving along; it establishes the basic output format before RL.

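RLVR: Reinforcement Learning with Verifiable Rewards. Reinforcement learning driven by a reward signal that can be checked automatically ('is the answer correct or not') instead of human feedback. Suited to tasks where right and wrong can be judged objectively, such as math answers or code execution results.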

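Dynamic Termination Supervision: A technique that trains the model to decide for itself when it can stop. If verification says the answer is correct, a termination signal is inserted immediately; if it says incorrect, a revision is forced, which removes unnecessary repetition.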

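Cognitive Idling: The phenomenon where a model keeps repeating the same verification even though it already has the correct answer. The answer is settled, but the model keeps checking, like an engine idling.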

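Test-time Scaling: A strategy that raises performance by spending more computation at inference time (longer CoT, multiple attempts, and so on) instead of increasing model size. In effect, paying more to let the model think longer.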
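
To make the GRPO entry above concrete, the sketch below computes group-relative advantages for one prompt's sampled responses by normalizing each reward against its own group. It is a simplified illustration only: the function name and the `eps` parameter are assumptions, the population standard deviation is one possible choice, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question, scored 1 if correct and 0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, and a group where every answer scores the same yields zero advantage for all of them.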

Original Abstract

Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. One typical behavior is that they often perform unnecessary verification and revisions even if they have reached the correct answers. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR using a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Besides, compared with existing reasoning paradigms, it reduces output token length by up to 50%.