Long-form RewardBench: A Reward Model Evaluation Benchmark for Long-form Text Generation

Long-form RewardBench: Evaluating Reward Models for Long-form Generation

Mar 13, 2026 • Hui Huang, Yancheng He, Wei Liu +7

TL;DR Highlight

Existing reward model benchmarks cover only short texts. To address this, the authors built the first evaluation dataset dedicated to long-form generation.

Who Should Read

ML engineers who want to improve LLM output quality with RLHF or Best-of-N sampling, especially those who need to select or train a reward model for long-form generation (RAG, report writing, etc.).

Core Mechanics

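Long-form RewardBench is released with 1,816 samples across five categories (QA, RAG, Chat, Writing, Reasoning). Earlier benchmarks evaluated only short texts of tens to hundreds of tokens.
In experiments on more than 20 reward models, Sequence Classifiers (models with a score head attached to the last layer) perform better overall. Skywork-Reward-V2-Llama-3.2-3B ranks first with 74.9 points.
On the math Reasoning subset the picture reverses and generative models dominate: GPT-4.1 reaches 78.4% and Gemini-2.5-flash 90.2%, well ahead of the ~70% that classifiers reach. This is attributed to Chain-of-Thought.
Generative models do much better in Selection mode (picking the best response) than in Scoring mode (assigning a score): 73.6% vs. 40.8% for Gemini-2.5-flash, roughly a 1.8x gap.
A 'Long-form Needle-in-a-Haystack Test' plants an error inside a response and measures whether the reward model detects it. Generative models lose accuracy sharply when the error sits in the middle (the lost-in-the-middle effect); classifiers are robust to response length but sensitive to error position.
Training data length experiment: classifiers trained on short data keep their performance on long texts, while generative models show a clear pattern of improving as the training data gets longer.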
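
The needle test described above can be sketched as follows. This is only an illustration of the idea, not the paper's construction procedure, which this summary does not describe: the sentence splitter, the `depth` parameter, and the names `insert_needle` and `build_needle_pairs` are all assumptions made for the example.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def insert_needle(response, needle, depth):
    """Insert an erroneous sentence (`needle`) at a relative depth in [0, 1].

    depth=0.0 puts the error at the start, 0.5 in the middle, 1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = split_sentences(response)
    idx = round(depth * len(sentences))
    return " ".join(sentences[:idx] + [needle] + sentences[idx:])


def build_needle_pairs(response, needle, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # Each pair contrasts the clean response (chosen) with a corrupted copy
    # (rejected); a reward model passes if it prefers the clean one.
    return [
        {
            "depth": d,
            "chosen": response,
            "rejected": insert_needle(response, needle, d),
        }
        for d in depths
    ]
```

Plotting a reward model's accuracy against `depth` (and against response length) is what would expose a lost-in-the-middle dip for generative judges.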

Evidence

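The top-ranked Skywork-Reward-V2-Llama-3.2-3B (a 3B-parameter classifier) scores 74.9, higher than large generative models such as GPT-4.1 (74.6) and Claude Opus 4 (74.4).
Gemini-2.5-flash reaches 90.2% Reasoning accuracy, well above the ~70% level of most classifiers.
The fine-tuned generative judge RISE-Judge-Qwen2.5-32B scores 37.1% in Scoring mode on the Best-of-N evaluation with a 46.7% invalid response rate, which makes it of very little practical use.
The Needle-in-a-Haystack test comprises 7,200 generated samples (1K to 8K tokens × 8 lengths × 2 dimensions, Factuality and Safety). GPT-4o detects errors in the middle of a response markedly less accurately than errors at either end.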
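
To make the Scoring-mode numbers concrete, here is a rough sketch of how a Best-of-N accuracy and an invalid response rate could be computed from raw judge outputs. The parsing rule, the tie handling, and the sample layout (`outputs`, `best_index`) are assumptions for illustration; the paper's exact scoring protocol is not given in this summary.

```python
import re

# Matches a standalone integer from 1 to 10 (assumed scoring scale).
SCORE_RE = re.compile(r"\b(10|[1-9])\b")


def parse_score(raw):
    """Return the first 1-10 integer in the judge output, or None if absent."""
    match = SCORE_RE.search(raw)
    return int(match.group(1)) if match else None


def evaluate_scoring_mode(samples):
    """samples: list of {"outputs": [raw judge text per candidate], "best_index": int}."""
    correct = 0
    invalid = 0
    for sample in samples:
        scores = [parse_score(text) for text in sample["outputs"]]
        if any(score is None for score in scores):
            # Any unparseable score makes the whole sample invalid (assumption).
            invalid += 1
            continue
        top = max(scores)
        winners = [i for i, score in enumerate(scores) if score == top]
        # A tie at the top counts as a miss: the judge did not single out the best.
        if winners == [sample["best_index"]]:
            correct += 1
    total = len(samples)
    return {"accuracy": correct / total, "invalid_rate": invalid / total}
```

Under these assumptions, both unparseable outputs and tied top scores lower accuracy, which is one plausible reading of why Scoring mode trails Selection mode.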

How to Apply

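When using a reward model for Best-of-N sampling: run generative models in Selection mode (pick the best among the candidates) rather than Scoring mode. In the Gemini-2.5-flash results this raises accuracy by roughly 1.8x (40.8% to 73.6%).
When building a quality evaluation pipeline for long answers: prefer a generative model (Gemini-2.5 class) for math/coding reasoning tasks and a classifier-type model for other general generation tasks.
When fine-tuning a reward model yourself: classifiers generalize well to long texts even when trained on short data, so the classifier approach is the practical choice when training data is hard to obtain.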
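
The guidance above could be wired into a pipeline roughly like this. It is a minimal sketch under stated assumptions: the task labels in `GENERATIVE_TASKS` are hypothetical, and `score_fn` is a placeholder for whatever classifier reward model you deploy (it is not a real library call).

```python
from typing import Callable, Sequence

# Hypothetical task labels for which a generative judge is preferred.
GENERATIVE_TASKS = {"math", "coding"}


def choose_reward_model_type(task: str) -> str:
    """Route reasoning-heavy tasks to a generative judge, the rest to a classifier."""
    return "generative" if task.lower() in GENERATIVE_TASKS else "classifier"


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Best-of-N with a classifier-style scorer: return the top-scoring candidate.

    score_fn(prompt, candidate) stands in for a reward model forward pass
    that returns one scalar score per candidate.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    scores = [score_fn(prompt, c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```

A classifier returns a continuous score, so exact ties are rare and a plain argmax is usually enough. If ties do occur, `max` keeps the earliest candidate.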

Code Example

# Example prompt for using a generative reward model in Selection mode
# (much higher accuracy than Scoring mode)

from openai import OpenAI

# The candidate list and the allowed labels are filled in at call time,
# so the template works for any number of responses.
prompt_template = """
You are an expert evaluator. Given the following instruction and multiple responses,
select the BEST response that is most helpful, accurate, and comprehensive.

Instruction:
{instruction}

{responses}

Which response is the best? Answer with only one letter from: {labels}.
"""

# ❌ Scoring mode to avoid (causes tie-handling problems)
bad_prompt = """
Rate the following response on a scale of 1-10.
Response: {response}
Score:
"""

# ✅ Recommended Selection mode
def select_best_response(instruction, responses):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # Label candidates A, B, C, ... so the judge can answer with one letter
    labeled = {chr(65 + i): r for i, r in enumerate(responses)}
    prompt = prompt_template.format(
        instruction=instruction,
        responses="\n\n".join(f"Response {k}:\n{v}" for k, v in labeled.items()),
        labels=", ".join(labeled),
    )
    result = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
    )
    best_key = (result.choices[0].message.content or "").strip().upper()
    # Returns None when the judge's answer is not one of the labels
    return labeled.get(best_key)

Terminology

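Reward Model: A judge model that scores which of several LLM-generated answers is better. It learns human preferences and judges quality automatically.

RLHF: A method of training an LLM with reinforcement learning from human feedback. The reward model assigns a score, and that score is used as the reward to steer the model toward better answers.

Best-of-N (BoN) Sampling: Asking the LLM the same question N times and taking as the final output the single answer the reward model judges best. A representative technique for raising quality at inference time.

Sequence Classifier: A type of reward model. A score head is attached to the model's last layer so that it directly outputs a numeric score for the input text.

Generative Model (Reward): A reward model that generates its score as text. A model such as GPT-4o is asked 'What score does this answer get?' or is asked to choose 'Which one is better?'

Lost-in-the-Middle: The tendency of LLMs to pay relatively less attention to the middle portion of a long text. They retain the beginning and end well but tend to miss the middle.

LLM-as-a-Judge: Using a strong LLM such as GPT-4o as a judge to evaluate the output quality of another LLM. Used to reduce the cost of human labeling.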

Related Resources

Long-form RewardBench GitHub

Original Abstract

The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error's position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.