LLM Post-Training Scaling Explained: A Complete Rundown of SFT, RLHF, and Test-time Compute | AI Paper Digest

TL;DR Highlight

A survey of post-training scaling lays out, in one place, three scaling methodologies that respond to the exhaustion of pre-training data.

Who Should Read

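ML engineers working on LLM fine-tuning or inference cost optimization, and developers designing the architecture of LLM-based services. Especially useful if you are curious why reasoning-focused models such as o1/o3 behave differently.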

Core Mechanics

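As high-quality internet text runs out and GPU costs hit their limits, the mainstream research direction has shifted from more pre-training to getting more out of models that are already trained.
Post-training scaling splits into three methods: SFT (imitating reference answers), RLxF (reinforcement learning from reward feedback), and TTC (spending more compute at inference time).
SFT is effective even with a small amount of high-quality data, but data quality and diversity are the key bottleneck to scaling it.
RLxF (which includes RLHF and RLAIF) aligns the model using human preferences or AI feedback, and the quality of the reward model sets the ceiling on performance.
TTC (Test-time Compute) makes the model think longer at inference time; o1-style extended Chain-of-Thought is the representative example.
Post-training accounts for only a small fraction of total training cost, yet it can open up large performance gaps between models built on the same base model.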
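
To make the reward-model point above concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss that is commonly used to train RLHF reward models. The digest does not say which loss the surveyed papers use, so treat this as an assumption for illustration; the function names are hypothetical.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the preferred response
    higher than the rejected one, and large when it ranks them the wrong way.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def mean_pairwise_loss(scored_pairs: list[tuple[float, float]]) -> float:
    """Average the loss over (reward_chosen, reward_rejected) pairs."""
    return sum(pairwise_reward_loss(c, r) for c, r in scored_pairs) / len(scored_pairs)
```

A reward model that cannot separate good answers from bad ones keeps this loss near log 2 (about 0.69), which is one way to see why its quality caps what the downstream RL step can learn.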

Evidence

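Because this is a survey paper, there are no headline numbers from a single experiment; it does collect many cited studies in which SFT alone, on a small amount of data, reaches performance competitive with RLHF.
TTC studies report a scaling law in which math and coding benchmark performance improves log-linearly as inference-time compute increases.
The compute spent on post-training is on the order of tens to hundreds of times smaller than pre-training, yet it yields meaningful gains on practical benchmarks (MMLU, HumanEval, etc.).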
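
"Log-linear" here means roughly score ≈ a + b · ln(compute), so each doubling of inference compute adds about the same fixed increment (b · ln 2). The sketch below fits that form by ordinary least squares. The data points are synthetic, generated from made-up coefficients purely to show the fit, and are not results from the survey.

```python
import math

def fit_log_linear(compute: list[float], score: list[float]) -> tuple[float, float]:
    """Least-squares fit of score ≈ a + b * ln(compute); returns (a, b)."""
    xs = [math.log(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(score) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, score))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic, noise-free points generated from a = 0.30, b = 0.05 (NOT measured data)
samples_per_question = [1, 2, 4, 8, 16, 32]
scores = [0.30 + 0.05 * math.log(n) for n in samples_per_question]

a, b = fit_log_linear(samples_per_question, scores)
gain_per_doubling = b * math.log(2)  # constant increment per 2x compute
```

With real benchmark numbers, a fit like this is a quick way to estimate whether doubling the sampling budget is worth the extra inference cost.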

How to Apply

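When fine-tuning a small model (e.g., Llama-3.1-8B) on domain-specific SFT data, focusing on quality (diversity plus accuracy) rather than volume can produce results that rival those of much larger models.
For tasks that need complex reasoning (math, coding, legal analysis), consider TTC: not simply raising the temperature, but building a pipeline that combines Chain-of-Thought with sampling multiple responses and then verifying them.
If building an RLHF pipeline is too heavy, RLAIF (feedback from an AI) can substitute: preference data can be generated automatically, without human labelers, by using Claude or GPT-4 as the judge.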
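
The RLAIF idea in the last item can be sketched as follows: score several candidate responses with a judge, then turn clearly separated scores into (chosen, rejected) pairs. Everything here is an illustrative assumption. `build_preference_pairs`, `min_gap`, and `toy_judge` are hypothetical names, the prompt/chosen/rejected record layout is just one common convention, and in practice `score_fn` would wrap a call to an LLM judge rather than the word-count stand-in used here.

```python
from itertools import combinations
from typing import Callable

def build_preference_pairs(
    prompt: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    min_gap: float = 1.0,
) -> list[dict[str, str]]:
    """Turn judge scores into preference records, skipping near-ties."""
    scored = [(score_fn(prompt, c), c) for c in candidates]
    pairs = []
    for (s1, c1), (s2, c2) in combinations(scored, 2):
        if abs(s1 - s2) < min_gap:
            continue  # the judge cannot clearly separate these two; drop the pair
        chosen, rejected = (c1, c2) if s1 > s2 else (c2, c1)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

def toy_judge(prompt: str, response: str) -> float:
    """Stand-in judge (word count). Replace with an LLM-based scorer in practice."""
    return float(len(response.split()))

pairs = build_preference_pairs(
    "Explain what SFT is.",
    ["a", "a b c", "a b c d e"],
    toy_judge,
)
```

Dropping near-ties is a design choice: pairs where the judge is barely decided tend to add noise to the preference data rather than signal.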

Code Example

# TTC style: raise answer quality with Best-of-N sampling
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def best_of_n_inference(prompt: str, n: int = 8) -> str:
    """Generate n candidate responses, then let a judge call pick the best one (a simple TTC implementation)."""
    responses = []
    for _ in range(n):
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        responses.append(msg.content[0].text)

    # Pick the best candidate with a separate judge call (majority voting is an alternative)
    candidates = "\n---\n".join(f"{i+1}. {r}" for i, r in enumerate(responses))
    judge_prompt = (
        f"Select the most accurate and logical of the following {n} responses "
        "to the question and output only its content.\n\n"
        f"Question:\n{prompt}\n\n"
        f"Responses:\n{candidates}"
    )

    judge = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    return judge.content[0].text

# Usage example
result = best_of_n_inference("Write and explain Python code that computes the 10th Fibonacci number.", n=4)
print(result)
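
For tasks with a short, checkable final answer, majority voting over the sampled responses can replace the judge call. The sketch below works on responses that have already been sampled, so it needs no API access. It assumes each response ends with a line such as `Answer: 55`; that marker is a prompt convention you would have to ask for, and the helper names are hypothetical.

```python
import re
from collections import Counter
from typing import Optional

def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if it is missing."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None

def majority_vote(responses: list[str]) -> Optional[str]:
    """Pick the most frequent final answer among the sampled responses."""
    answers = [a for a in map(extract_final_answer, responses) if a is not None]
    if not answers:
        return None  # no response followed the expected format
    return Counter(answers).most_common(1)[0][0]

sampled = [
    "F(10) by iteration...\nAnswer: 55",
    "Using the recurrence...\nAnswer: 55",
    "Off-by-one attempt...\nAnswer: 34",
    "A response that forgot the marker",
]
winner = majority_vote(sampled)
```

Voting costs no extra model call, but it only works when answers can be compared as strings; open-ended outputs such as explanations still need a judge or a verifier.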

Terminology

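Post-Training Scaling: An approach that invests more resources in aligning and improving an already trained model instead of training a larger model from scratch. Similar to giving a finished building a high-end interior.

SFT: Supervised Fine-Tuning. A training method that shows the model reference-answer examples and has it imitate them. Like a teacher demonstrating a solution and the student working through it the same way.

RLxF: The family of reinforcement learning methods that covers RLHF (human feedback) and RLAIF (AI feedback). The model is rewarded for good answers and penalized for bad ones, so it is trained to answer progressively better.

TTC: Test-time Compute. Spending more computation at inference time to raise accuracy. Like double-checking your work several times during an exam.

Scaling law: The empirical rule that performance improves predictably as model size, data volume, and compute increase. Similar to how doubling the ingredients in a recipe changes the result in a predictable way.

RLAIF: Reinforcement learning in which an AI provides the feedback. Instead of expensive human labelers, a model such as GPT-4 or Claude evaluates another model's answers, and those evaluations serve as the training signal.

Chain-of-Thought: A prompting technique that has the model write out its reasoning step by step instead of answering immediately. The same principle as making fewer mistakes on a math problem when you write out the working.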

Original Abstract

Large language models (LLMs) have achieved remarkable proficiency in understanding and generating human natural languages, mainly owing to the "scaling law" that optimizes relationships among language modeling loss, model parameters, and pre-trained tokens. However, with the exhaustion of high-quality internet corpora and increasing computational demands, the sustainability of pre-training scaling needs to be addressed. This paper presents a comprehensive survey of post-training scaling, an emergent paradigm aiming to relieve the limitations of traditional pre-training by focusing on the alignment phase, which traditionally accounts for a minor fraction of the total training computation. Our survey categorizes post-training scaling into three key methodologies: Supervised Fine-tuning (SFT), Reinforcement Learning from Feedback (RLxF), and Test-time Compute (TTC). We provide an in-depth analysis of the motivation behind post-training scaling, the scalable variants of these methodologies, and a comparative discussion against traditional approaches. By examining the latest advancements, identifying promising application scenarios, and highlighting unresolved issues, we seek a coherent understanding and map future research trajectories in the landscape of post-training scaling for LLMs.