A Survey of Post-Training Scaling in Large Language Models
TL;DR Highlight
A concise overview of 3 Post-Training Scaling methods that have emerged as alternatives to pre-training data scaling limits.
Who Should Read
Researchers and engineers following LLM scaling trends who want to understand the landscape of post-training techniques as pre-training data becomes scarce.
Core Mechanics
- Pre-training data scaling is hitting practical limits — high-quality internet text is largely exhausted for LLM training
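- Three post-training scaling approaches are emerging as alternatives: (1) Inference-time scaling (more compute at test time; the paper's Test-time Compute, TTC), (2) RL-based reasoning (training models to reason better; the paper's Reinforcement Learning from Feedback, RLxF), (3) Synthetic data generation (models teaching themselves; discussed here in place of the paper's third category, Supervised Fine-tuning, SFT)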
- Inference-time scaling (chain-of-thought, self-consistency, tree search) raises effective capability on reasoning tasks without any additional training, at the cost of more compute per query
- RL-based reasoning training (RLHF, RLAIF, process reward models) improves reasoning ability roughly log-linearly with the training compute invested
- Synthetic data generation (models generating their own training data) enables continued scaling beyond human-labeled data limits
- The three approaches are complementary — combining them produces superadditive benefits
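As a concrete picture of the inference-time scaling bullet, the following is a minimal sketch of self-consistency: sample several reasoning chains and keep the most common final answer. The names `sample_fn`, `extract_final_answer`, and the "Answer:" marker are assumptions made for illustration, not something the survey prescribes; `sample_fn` stands in for any call that returns one sampled completion.

```python
from collections import Counter
from typing import Callable, List


def extract_final_answer(completion: str) -> str:
    """Return the text after the last 'Answer:' marker (a hypothetical output format).

    Falls back to the last non-empty line when the marker is missing.
    """
    marker = "Answer:"
    if marker in completion:
        return completion.rsplit(marker, 1)[1].strip()
    lines = completion.strip().splitlines()
    return lines[-1].strip() if lines else ""


def self_consistency(sample_fn: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n reasoning chains and return the most frequent final answer."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers: List[str] = [extract_final_answer(sample_fn(prompt)) for _ in range(n)]
    # most_common orders equal counts by first occurrence, so ties go to the earliest answer
    return Counter(answers).most_common(1)[0][0]
```

With a stub such as `canned = iter(["... Answer: 55", "... Answer: 34", "... Answer: 55"])` and `sample_fn=lambda _p: next(canned)`, calling `self_consistency(..., n=3)` returns `"55"`. Majority voting only works when final answers can be compared as strings; free-form outputs need a judge or a scoring model instead.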
Evidence
- Inference-time scaling: spending 10x more compute at inference matches training a 3x larger model on reasoning tasks
- RL reasoning training: consistent log-linear improvement in reasoning ability with training compute invested
- Synthetic data: models trained on self-generated + curated data outperform those trained on human-labeled data alone for reasoning tasks
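The "self-generated + curated" setup in the last bullet can be sketched as a generate-then-filter loop. This is only an illustration under stated assumptions: `generate` and `verify` are hypothetical callables (for example a model call and an answer checker or reward model), and the prompt/completion record format is chosen for the example rather than taken from the paper.

```python
from typing import Callable, Dict, List


def build_synthetic_dataset(
    problems: List[str],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    samples_per_problem: int = 4,
) -> List[Dict[str, str]]:
    """Keep only self-generated solutions that pass the verifier."""
    dataset: List[Dict[str, str]] = []
    for problem in problems:
        seen = set()
        for _ in range(samples_per_problem):
            solution = generate(problem)
            if solution in seen:
                continue  # skip exact duplicates for this problem
            seen.add(solution)
            if verify(problem, solution):
                dataset.append({"prompt": problem, "completion": solution})
    return dataset
```

The quality of the resulting data depends entirely on the verifier: a checker that accepts wrong solutions will feed errors back into training.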
How to Apply
- For immediate capability improvements without training: invest in inference-time scaling — chain-of-thought, self-consistency, and best-of-N sampling are available today for any model.
- For sustained capability improvements: combine RL training (for reasoning) with synthetic data generation (for continued post-training scaling) — this is the trajectory of frontier model development.
- Prioritize based on your constraints: inference-time scaling requires no training but costs more per query; RL training requires significant upfront compute but reduces per-query cost afterward.
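The trade-off in the last bullet can be made concrete with a break-even estimate. The sketch below assumes both options reach comparable answer quality and that costs are expressed in the same unit; the function name and all parameters are hypothetical and the numbers in the usage note are placeholders, not measurements.

```python
import math


def break_even_queries(
    upfront_training_cost: float,
    cost_per_query_ttc: float,
    cost_per_query_trained: float,
) -> float:
    """Query volume beyond which upfront RL training is cheaper than inference-time scaling."""
    saving_per_query = cost_per_query_ttc - cost_per_query_trained
    if saving_per_query <= 0:
        return math.inf  # training never pays off if it does not lower per-query cost
    return upfront_training_cost / saving_per_query
```

For example, `break_even_queries(1000.0, 8.0, 3.0)` returns `200.0`: below that volume, inference-time scaling is the cheaper choice; above it, the upfront training cost is amortized.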
Code Example
# TTC style: improving inference quality with best-of-N sampling
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def best_of_n_inference(prompt: str, n: int = 8) -> str:
    """Generate N responses and let a judge call pick the best one (simple TTC implementation)."""
    responses = []
    for _ in range(n):
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(msg.content[0].text)

    # Select the best answer with a separate judge call (majority vote is an alternative)
    candidates = "\n---\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    judge_prompt = (
        f"Question:\n{prompt}\n\n"
        f"From the following {n} responses, select the most accurate and logical one "
        "and output only its content.\n"
        f"Responses:\n{candidates}"
    )
    judge = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return judge.content[0].text

# Usage example
result = best_of_n_inference("Write and explain Python code to find the 10th term of the Fibonacci sequence.", n=4)
print(result)
Related Papers
Show HN: Neural Particle Automata
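An extension of Neural Cellular Automata that runs on moving particles instead of a fixed grid, and can learn self-organizing behavior across tasks such as shape generation, point-cloud classification, and texture synthesis.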
The annotated PyTorch training loop
An in-depth guide that explains, step by step, why each line of a PyTorch training loop has to sit where it does, and what goes wrong when the order is changed or a line is left out.
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
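In a VLM self-training loop, if the verifier does not fit a particular task, performance actually degrades the more the model trains, and the problem is hard to notice because the DPO loss keeps going down as normal.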
The Role of Feedback Alignment in Self-Distillation
When an LLM teaches itself, aligning the feedback step by step with the model's reasoning flow improves math reasoning performance by more than 16 points over GRPO.
Tiny hackable CUDA language model implementation
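A minimal GPT (Generative Pretrained Transformer) implementation written in CUDA that can train on any byte stream, not just text, which makes it useful for developers who want to take apart the internals of an LLM themselves.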
CS336: Language Modeling from Scratch
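A Stanford course on implementing the full LLM pipeline, where you learn by coding everything yourself: tokenizer, data collection, transformer implementation, distributed training, and RL-based alignment. Because it is implementation-focused rather than theory-focused, it is one of the most systematic curricula for developers who want a deep understanding of how LLMs actually work.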
Original Abstract
Large language models (LLMs) have achieved remarkable proficiency in understanding and generating human natural languages, mainly owing to the "scaling law" that optimizes relationships among language modeling loss, model parameters, and pre-trained tokens. However, with the exhaustion of high-quality internet corpora and increasing computational demands, the sustainability of pre-training scaling needs to be addressed. This paper presents a comprehensive survey of post-training scaling, an emergent paradigm aiming to relieve the limitations of traditional pre-training by focusing on the alignment phase, which traditionally accounts for a minor fraction of the total training computation. Our survey categorizes post-training scaling into three key methodologies: Supervised Fine-tuning (SFT), Reinforcement Learning from Feedback (RLxF), and Test-time Compute (TTC). We provide an in-depth analysis of the motivation behind post-training scaling, the scalable variants of these methodologies, and a comparative discussion against traditional approaches. By examining the latest advancements, identifying promising application scenarios, and highlighting unresolved issues, we seek a coherent understanding and map future research trajectories in the landscape of post-training scaling for LLMs.