ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning | AI Paper Digest

TL;DR Highlight

On 128K-token contexts, extracting only the key evidence with the model's internal attention signals and re-injecting it raises reasoning accuracy by a relative 24.6%.

Who Should Read

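Backend developers building question answering over long documents (128K tokens or more) that are fed directly to an LLM, and ML engineers who process documents full-context without RAG and are not satisfied with the answer quality.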

Core Mechanics

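In a 128K-token context, the top 0.1% of tokens (about 128 tokens) account for 50 to 80% of the total question-related relevance score. In other words, the key evidence is concentrated in a very small number of tokens.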
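ReContext is training-free: it runs without modifying model parameters. It uses the model's internal attention scores as a relevance signal to extract question-relevant sentence spans, then replays that evidence in the prompt, with the original context kept intact, before generating the final answer.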
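It works recursively. The evidence selected in round 1 is inserted into the prompt, attention is read again in that state to add round-2 evidence, and so on for R rounds. Because each round is conditioned on the previous round's result, the method is also effective for multi-hop reasoning.
The original context is never deleted or compressed. The extracted evidence is inserted right before the question for emphasis, and the original stays in place, so information that the selection missed can still be consulted.
ReContext achieves the best average rank across 8 benchmarks on all three models: Qwen3-4B, Qwen3-8B, and Llama3.1-8B. The gains hold with thinking mode enabled and in a shorter 64K-context setting.
There is theoretical support as well: the paper proves mathematically that with each replay step the hidden embedding moves monotonically closer, in cosine similarity, to the answer embedding (monotonic improvement).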
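
The concentration claim in the first point can be checked on any vector of per-token relevance scores. The sketch below is a minimal illustration, not the paper's measurement code: the helper name `top_fraction_coverage` and the toy score distribution are assumptions made up for this example.

```python
def top_fraction_coverage(scores, fraction=0.001):
    """Share of the total relevance mass held by the top `fraction` of tokens."""
    if not scores:
        raise ValueError("scores must be non-empty")
    k = max(1, round(len(scores) * fraction))
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / sum(scores)


# Hypothetical toy distribution: 128 "evidence" tokens with a high score,
# and every other token of a 128,000-token context with a small uniform score.
n_tokens = 128_000
scores = [1.0] * 128 + [0.001] * (n_tokens - 128)

coverage = top_fraction_coverage(scores, fraction=0.001)
print(f"top 0.1% = {round(n_tokens * 0.001)} tokens, coverage = {coverage:.2f}")
```

With real attention-derived scores, a coverage value in the 50 to 80% range would match the concentration described above.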

Evidence

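Across eight 128K benchmarks and three backbone models, average accuracy rises from 0.24 (Vanilla) to 0.30 (ReContext), a 24.6% relative improvement. ReContext has the best average rank on every backbone (Qwen3-4B: 1.00, Qwen3-8B: 1.46, Llama3-8B: 1.29).
In the shorter 64K-context setting, the macro-average rises from 0.21 (Vanilla) to 0.28, a 35.0% relative improvement, and ReContext is within the top 2 on every metric.
With thinking mode (Qwen3-4B), the macro-average rises from 28.0 to 32.6, a 16.7% relative improvement, with the best results on NQ Acc/F1, PopQA Acc, and InfMC Acc.
Runtime: ReContext takes 62 minutes versus 44 minutes for Vanilla, about 41% slower, but it is much faster than DySCO (2 hours 13 minutes). GPU memory stays at the Vanilla level because fewer than 128 tokens are added.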
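
The relative figures above follow from (new - baseline) / baseline. The short calculation below recomputes them from the values quoted in this section; `relative_change` is a helper name introduced only for this illustration. Because the quoted averages are already rounded, the recomputed accuracy gains (25.0%, 33.3%, 16.4%) differ slightly from the reported 24.6%, 35.0%, and 16.7%. That the reported numbers come from unrounded scores is an assumption, not something stated in the source.

```python
def relative_change(baseline, new):
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Values as quoted in the Evidence section (already rounded).
quoted = {
    "128K average accuracy": (0.24, 0.30),
    "64K macro-average": (0.21, 0.28),
    "thinking-mode macro-average": (28.0, 32.6),
    "runtime in minutes": (44, 62),
}
for name, (baseline, new) in quoted.items():
    print(f"{name}: {relative_change(baseline, new) * 100:+.1f}%")

# DySCO runtime of 2 hours 13 minutes, compared with Vanilla's 44 minutes.
dysco_minutes = 2 * 60 + 13
print(f"DySCO vs Vanilla: {dysco_minutes / 44:.2f}x")
```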

How to Apply

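If your long-document QA pipeline uses full-context prompts, read the model's attention scores, extract the sentences that contain the top-K tokens, insert those sentences after the original context and before the question, and generate again. This can be applied directly to open-source models (Qwen3, Llama3 families) whose attention weights are accessible.
For multi-hop reasoning (where the answer requires connecting several documents), repeat recursively for R = 2 to 4 rounds. Reading attention again in round 2, with the round-1 evidence inserted, surfaces additional related sentences.
If you store a KV cache snapshot at the end of the original context, the replay step only has to process the evidence and question tokens, which greatly reduces recomputation. The paper's implementation (GitHub) already includes this optimization.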
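
To see why the KV cache snapshot matters, it helps to count how many tokens each strategy has to push through the model. The sketch below is back-of-the-envelope accounting, not the paper's implementation: `prefill_tokens` and all sizes are hypothetical, and token count is only a rough proxy for cost, since attention over the cached context still scales with its length.

```python
def prefill_tokens(context_len, suffix_lens, reuse_context_cache):
    """Count the tokens that must be run through the model across all passes.

    `suffix_lens` holds, for each forward pass, the number of tokens that
    follow the original context (replayed evidence plus question).
    """
    if reuse_context_cache:
        # The context is encoded once; each pass only processes its suffix.
        return context_len + sum(suffix_lens)
    # Without a cache, every pass re-encodes the full context plus its suffix.
    return sum(context_len + suffix for suffix in suffix_lens)


# Hypothetical sizes: a 128,000-token context, a 30-token question, and
# 100 more evidence tokens per round, for R = 2 rounds plus the final pass.
context_len = 128_000
suffix_lens = [30, 30 + 100, 30 + 200]  # round 1, round 2, final answer pass

without_cache = prefill_tokens(context_len, suffix_lens, reuse_context_cache=False)
with_cache = prefill_tokens(context_len, suffix_lens, reuse_context_cache=True)
print(without_cache, with_cache, f"{without_cache / with_cache:.2f}x fewer tokens")
```

Under these assumed sizes, the cached variant processes roughly one third of the tokens, because the context is encoded once instead of once per pass.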

Code Example

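# ReContext core logic: an illustrative sketch (Hugging Face Transformers), not the paper's implementation.
# Assumptions: a fast tokenizer (needed for offset mappings) and a model that can return
# attention weights (recent transformers versions may need attn_implementation="eager").
# Full attention maps at 128K tokens will not fit in memory this way; try short inputs first.
import re
import torch

def get_relevance_scores(model, input_ids, cue_width=8, decay=0.75):
    """
    Compute a relevance score for every prompt token from the model's internal attention.
    """
    with torch.no_grad():
        outputs = model(input_ids, output_attentions=True)

    # Use the attention of the question-suffix tokens (the last cue_width tokens) as the cue
    seq_len = input_ids.shape[1]
    cue_positions = list(range(max(0, seq_len - cue_width), seq_len))

    running_score = None
    for attn in outputs.attentions:
        # attn shape: (batch, heads, seq, seq); average over heads and cue positions
        cue_attn = attn[0][:, cue_positions, :].float().mean(dim=(0, 1))  # (seq,)
        if running_score is None:
            running_score = cue_attn
        else:
            # Accumulate across layers with decay, then renormalize
            running_score = cue_attn + decay * running_score
            running_score = running_score / running_score.sum()

    return running_score

def split_sentences(context):
    """Naive sentence split that returns (start_char, end_char, text) spans."""
    pattern = r"[^.!?]+[.!?]*\s*"
    return [(m.start(), m.end(), m.group().strip())
            for m in re.finditer(pattern, context) if m.group().strip()]

def recontext_inference(model, tokenizer, context, question, R=2, K=16):
    """
    ReContext: inference with recursive evidence replay.
    """
    sentence_spans = split_sentences(context)
    evidence_pool = []
    current_prompt = f"{context}\n\n{question}"

    for round_idx in range(R):
        # Tokenize the current prompt, keeping each token's character offsets
        enc = tokenizer(current_prompt, return_tensors='pt', return_offsets_mapping=True)
        input_ids = enc['input_ids'].to(model.device)
        offsets = enc['offset_mapping'][0]  # (seq, 2) character span per token

        # Compute relevance scores (attention-based)
        scores = get_relevance_scores(model, input_ids).cpu()

        # Select Top-K only among tokens of the original context, which occupies the
        # first len(context) characters of the prompt; special tokens have empty spans
        in_context = (offsets[:, 1] > offsets[:, 0]) & (offsets[:, 0] < len(context))
        scores = scores.masked_fill(~in_context, float('-inf'))
        top_k_indices = scores.topk(min(K, int(in_context.sum()))).indices

        # Map selected tokens to their sentences; deduplicate into the evidence pool
        for idx in sorted(top_k_indices.tolist()):
            char_pos = int(offsets[idx, 0])
            sent = next((t for s, e, t in sentence_spans if s <= char_pos < e), None)
            if sent is not None and sent not in evidence_pool:
                evidence_pool.append(sent)

        # Rebuild the replay prompt: [original context; evidence; question]
        evidence_text = '\n'.join(evidence_pool)
        current_prompt = f"{context}\n\n[Evidence]\n{evidence_text}\n\n{question}"

    # Generate the final answer
    input_ids = tokenizer(current_prompt, return_tensors='pt').input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=100)
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

# Usage example
# answer = recontext_inference(model, tokenizer, long_context, question, R=2, K=16)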
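
The layer-wise accumulation inside `get_relevance_scores` can be tried in isolation without loading a model. The toy version below uses plain lists; `accumulate_layers` and the attention values are hypothetical, and unlike the snippet above it renormalizes after every layer, including the first.

```python
def accumulate_layers(per_layer_attn, decay=0.75):
    """Fold per-layer cue attention into one normalized relevance vector."""
    running = None
    for layer in per_layer_attn:
        if running is None:
            running = list(layer)
        else:
            # New layer's attention plus a decayed share of the running score.
            running = [a + decay * r for a, r in zip(layer, running)]
        total = sum(running)
        running = [x / total for x in running]
    return running


# Hypothetical cue attention over four context tokens at three layers.
layers = [
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.80, 0.10, 0.05],
]
scores = accumulate_layers(layers)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Because each layer's contribution is added on top of a decayed running score, later layers weigh more than earlier ones, while a token that is attended to consistently (index 1 here) still ends up with the highest score.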

Terminology

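KV cache: Memory in which a Transformer stores the Key-Value tensors of tokens it has already processed. It acts as a cache that avoids recomputing the same content, and it has a major effect on long-document processing speed.

attention: The mechanism by which a Transformer decides which input tokens to focus on. It resembles how a reader focuses on the important words in a text; here, this signal is used for evidence extraction.

multi-hop reasoning: Reasoning that must connect several documents or sentences to reach a single answer. Example: finding 'the founding year of the university attended by the CEO of A' requires linking two pieces of information.

training-free: A method that leaves model parameters entirely unmodified and operates only at inference time. It can be applied right away, with no GPU retraining.

relevance score: A score for how related each token is to the question. Here it is computed by aggregating attention weights.

associative memory: A memory structure that recalls an associated memory (trace) when given a particular cue. It resembles memory mechanisms in the brain; the paper interprets Transformer attention as playing this role.

backbone: The underlying LLM. Here, Qwen3-4B, Qwen3-8B, and Llama3.1-8B are used as backbones.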

Related Resources

ReContext official code on GitHub: https://github.com/Yanjun-Zhao/ReContext

Original Abstract

Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer generation without training, external memory, or context pruning. We also provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context length show that RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones. Code is available at https://github.com/Yanjun-Zhao/ReContext.