Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions

Mar 12, 2026 • Tae-Eun Song

TL;DR Highlight

When an LLM reviews its own output in the session that produced it, it misses errors. Reviewing in a fresh session raises F1 from 24.6% to 28.6%.

Who Should Read

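Backend and full-stack developers who generate code or documents with LLMs and are thinking about a quality verification step, especially those designing an automated review stage in an AI agent pipeline.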

Core Mechanics

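When an LLM reviews its own output in the same session, it anchors on the production context and rationalizes errors: it defends the output instead of critiquing it.
The fix is simple: open a new session, paste in only the artifact, and ask for a review. This is Cross-Context Review (CCR).
Reviewing twice in the same session (SR2) did not help (p=0.11), so the experiment points to context separation, not repetition, as the key factor.
CCR's detection rate for Critical errors is 11 percentage points higher than SR's; the more severe the error, the larger the effect.
CCR is most effective for code review (F1 +4.7 points), followed by documents, then scripts.
A production session runs to 50K+ tokens, while a CCR review session is around 5K tokens, so the review can actually be cheaper.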
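The difference between same-session review and CCR comes down to what ends up in the message list sent to the model. The sketch below is a minimal illustration under stated assumptions: `build_sr_messages`, `build_ccr_messages`, and `REVIEW_REQUEST` are hypothetical names, and the exact wording the paper used for its SR condition is not given here.

```python
from typing import Dict, List

Message = Dict[str, str]

# Hypothetical one-line review request; a real prompt would be more detailed.
REVIEW_REQUEST = "Review the artifact below for errors."


def build_sr_messages(history: List[Message], artifact: str) -> List[Message]:
    """Same-session review: the request is appended to the production history."""
    request = {"role": "user", "content": f"{REVIEW_REQUEST}\n\n{artifact}"}
    return history + [request]  # new list; the caller's history is not mutated


def build_ccr_messages(artifact: str) -> List[Message]:
    """Cross-context review: only the request and the artifact, no history."""
    return [{"role": "user", "content": f"{REVIEW_REQUEST}\n\n{artifact}"}]
```

In the SR case the reviewer carries every earlier turn (the 50K+ tokens of production context), while the CCR message list contains a single turn holding the artifact.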

Evidence

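CCR F1 28.6% vs SR 24.6% (p=0.008, d=0.52), SR2 21.7% (p<0.001, d=0.72), and SA 23.8% (p=0.004, d=0.57): significantly better than every baseline.
Repeated same-session review (SR2) showed no improvement over SR (p=0.11). SR2 reported more findings (5.5 vs 4.8), but its precision dropped to 21.0%.
Critical error detection rate: CCR 40% vs SR 29% (+11 percentage points). For Minor errors there was almost no difference (18% vs 19%).
The paper itself was checked with CCR: CCR-1 found 15 errors and CCR-2 found 10 more (including a fabricated reference author), the kind of error a same-session review would likely miss.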

How to Apply

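After generating code and before committing: open a new chat window, paste in only the generated code, and ask, 'Review this from five perspectives: factual accuracy, consistency, contextual fitness, audience perspective, and completeness.'
Add a CCR stage to your AI agent pipeline: run the artifact-generating agent and the review agent separately, and pass the review agent only the artifact text and the review prompt (exclude the production conversation history).
Apply two rounds of CCR to important documents (API specs, technical tutorials): catch logic and numeric errors in round 1, then focus round 2 on verifying references, citations, and facts.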
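The two-round advice above could be wired into a pipeline roughly as follows. This is a sketch, not the paper's procedure: `two_round_ccr`, `ROUND_FOCUS`, and the injected `review_fn` callable are hypothetical, and the source does not say whether the artifact is revised between rounds, so this version reviews the same text in both.

```python
from typing import Callable, List

# Hypothetical per-round focus instructions paraphrasing the two-round advice.
ROUND_FOCUS = [
    "Focus on logic errors and incorrect numbers.",
    "Focus on references, citations, and factual claims.",
]


def two_round_ccr(artifact: str, review_fn: Callable[[str], str]) -> List[str]:
    """Run each round as an independent call that sees only the artifact.

    review_fn is assumed to send one prompt to a fresh LLM session and
    return the review text. No round receives another round's report.
    """
    reports = []
    for focus in ROUND_FOCUS:
        prompt = (
            f"{focus}\n\n"
            f"--- ARTIFACT START ---\n{artifact}\n--- ARTIFACT END ---"
        )
        reports.append(review_fn(prompt))
    return reports
```

Injecting `review_fn` keeps the pipeline independent of any particular provider and makes it easy to test with a stub.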

Code Example

# CCR review prompt template (based on Appendix A of the paper)
CCR_REVIEW_PROMPT = """
Review the following {artifact_type} from a fresh perspective:

1. Factual accuracy: Are numbers, names, dates, and technical claims correct?
2. Internal consistency: Are there contradictions or terminology mismatches?
3. Contextual fitness: Would this work correctly in its intended environment?
4. Audience perspective: Could the target reader misinterpret any part?
5. Completeness: Is anything important missing?

For each issue found, provide:
- Location (line number or section)
- Description of the error
- Type (FACT/CONS/CTXT/RCVR/MISS)
- Severity (Critical/Major/Minor)
- Suggested fix

--- ARTIFACT START ---
{artifact_content}
--- ARTIFACT END ---
"""

# Example of applying CCR (OpenAI API)
from openai import OpenAI

def cross_context_review(artifact_content: str, artifact_type: str = "code") -> str:
    """
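    Review an artifact in a fresh context.

    The Chat Completions API is stateless: the model sees only what is in
    `messages`, so never include the production conversation history there.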
    """
    client = OpenAI()  # holds no conversation state; context comes only from `messages`
    
    prompt = CCR_REVIEW_PROMPT.format(
        artifact_type=artifact_type,
        artifact_content=artifact_content
    )
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Important: do not include the production session's message history.
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

# Usage
generated_code = """
def get_business_days(start, end):
    count = 0
    for d in range((end - start).days):
        if (start + timedelta(d)).weekday() >= 6:  # bug: should be >= 5 (Saturday is 5)
            continue
        count += 1
    return count
"""

review_result = cross_context_review(generated_code, artifact_type="Python function")
print(review_result)
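For comparison, here is what a corrected version of the sample artifact might look like, so you can check whether the reviewer's findings match. It is a sketch that keeps the original's apparent convention of a half-open range (the end date is excluded) and adds the `datetime` imports the artifact omits, which a reviewer could reasonably flag as a second issue.

```python
from datetime import date, timedelta


def get_business_days(start: date, end: date) -> int:
    """Count weekdays (Mon-Fri) in the half-open range [start, end)."""
    count = 0
    for offset in range((end - start).days):
        # weekday(): Monday is 0, Saturday is 5, Sunday is 6
        if (start + timedelta(days=offset)).weekday() >= 5:
            continue
        count += 1
    return count
```

For the week starting Monday 2026-03-09 and ending at 2026-03-16, this version returns 5, whereas the buggy artifact counts Saturday as well and returns 6.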

Terminology

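CCR (Cross-Context Review): A method in which the LLM reviews an artifact in a new session (an empty context), much like a code review done by a third party rather than the author. The reviewer sees only the artifact, without the author's bias.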

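Anchoring Bias: The tendency for judgment to fixate on the first information encountered. When an LLM looks at code it wrote itself, it starts the review from the premise that the code is probably correct.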

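Sycophancy: The tendency of an LLM to agree with the user (or with itself). Models trained with RLHF (reinforcement learning from human feedback) tend to agree more readily than they criticize.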

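RLHF: A training method in which humans give feedback such as 'this answer is better' to steer the model. It makes the model more agreeable, with the side effect of weakening its critical judgment.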

Context Window: The maximum amount of text an LLM can consider at once. A production session can grow past 50K tokens as the conversation accumulates, which lowers review quality.

Lost in the Middle: The phenomenon where an LLM recalls the beginning and end of a long text well but loses track of the middle. It shows up often in sessions with long conversations.

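F1 Score: A combined metric, the harmonic mean of precision (the share of reported errors that are real) and recall (the share of real errors that were found). Higher means the reviewer catches errors better.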
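To make the F1 definition concrete, the following sketch computes it from raw counts for a single review. The function name and the example counts are hypothetical and are not taken from the paper's data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing was found correctly."""
    reported = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical review: 6 findings reported, 2 of them real, 5 errors injected.
example = f1_score(true_positives=2, false_positives=4, false_negatives=3)
```

Here precision is 2/6 and recall is 2/5, so a reviewer that reports many spurious findings is penalized even if its recall holds steady, which is the pattern described for SR2.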

Original Abstract

Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions -- same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.