Self-Compacting Language Model Agents: Rubric-Based Adaptive Context Compression

TL;DR Highlight

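Adding a rubric that lets an LLM agent judge for itself whether now is a safe moment to summarize matches or beats fixed-interval summarization on accuracy, with no fine-tuning, while cutting per-question cost by 30-70% relative to running with no compaction.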

Who Should Read

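Backend and ML engineers dealing with context-window overflow in LLM agents or deep-research systems. Especially relevant for developers who run long reasoning traces or multi-turn search agents and want to control token cost and performance degradation at the same time.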

Core Mechanics

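Long agent traces accumulate stale content such as outdated faulty reasoning and discarded search results. This "context rot" contaminates later generations and degrades performance.
Existing systems use fixed-interval compaction, which summarizes unconditionally once the token count crosses a threshold. When the summary fires mid-derivation, it can throw away facts the agent has already verified.
SELFCOMPACT pairs a compaction tool with a rubric (a short document of criteria for when to summarize). The model itself decides COMPRESS or CONTINUE according to the rubric, and when it decides to compress, the same model also writes the summary.
The rubric checks structural conditions such as "has a sub-task been resolved?" and "is the reasoning converging?", and it suppresses summarization while the model is mid-derivation or stuck.
Thanks to KV-cache reuse, the rubric probe itself is nearly free. The prompt is appended to the end of the existing context, so the cached prefix is reused as-is instead of re-encoding everything.
Exposing the tool without the rubric makes invocation timing erratic from model to model, and accuracy falls back to roughly the fixed-interval level. The rubric is the key ingredient.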
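The difference between the two trigger policies described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `TrajectoryState` fields, the function names, and the 16384-token threshold are all hypothetical, and in SELFCOMPACT the structural judgments come from the model answering the rubric rather than from precomputed flags.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryState:
    """Hypothetical snapshot of where the agent is in its trajectory."""
    tokens_used: int
    subtask_resolved: bool
    converging: bool
    mid_derivation: bool
    stuck: bool


def fixed_interval_trigger(state: TrajectoryState, threshold: int = 16384) -> bool:
    """Baseline: fire whenever the token count crosses the threshold."""
    return state.tokens_used >= threshold


def rubric_trigger(state: TrajectoryState) -> bool:
    """Rubric-style: suppress mid-derivation or when stuck, else fire on structure."""
    if state.mid_derivation or state.stuck:
        return False
    return state.subtask_resolved or state.converging


# A long trace that is still in the middle of a derivation.
mid_proof = TrajectoryState(20000, False, False, True, False)
# A short trace whose sub-task has just been resolved.
resolved = TrajectoryState(5000, True, False, False, False)

for name, state in [("mid_proof", mid_proof), ("resolved", resolved)]:
    print(name, fixed_interval_trigger(state), rubric_trigger(state))
```

The fixed-interval policy looks only at length, so it fires on the mid-derivation trace and ignores the resolved one; the rubric-style policy does the opposite.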

Evidence

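On competition-math benchmarks (IMO-Answerbench, HMMT Nov/Feb), gains reach up to 18.1 points over no compaction with Qwen3.5-9B, and SELFCOMPACT beats fixed-interval compaction in 11 of 12 benchmark/model combinations.
On agentic search (BrowseComp-Plus), gains are +8.5 points for GLM-4.7-Flash, +9.2 for MiniMax-M2.5, and +5.3 for MiMo-V2-Flash, with cost reduced by 67%, 63%, and 33% respectively relative to no compaction.
Rubric-removal ablation: on GLM-4.7-Flash, full SELFCOMPACT scores 46.4% versus 41.0% for the rubric-free variant, a 5.4-point gap. Without the rubric, accuracy drops to about the fixed-interval level (41.5%).
Oracle analysis (suppressing summarization only when the current answer is correct) reaches 52.9% on IMO-Answerbench with Qwen3-4B-Instruct-2507, which is 11.5 points above fixed-interval (41.4%) and indicates how much headroom an adaptive policy has.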
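As a quick sanity check, the gaps quoted above can be recomputed from the reported scores. The snippet only restates numbers already given in this section; the variable names are illustrative.

```python
# GLM-4.7-Flash ablation scores (percent), as reported above.
ablation = {"selfcompact": 46.4, "no_rubric": 41.0, "fixed_interval": 41.5}

rubric_gap = round(ablation["selfcompact"] - ablation["no_rubric"], 1)
no_rubric_vs_fixed = round(ablation["fixed_interval"] - ablation["no_rubric"], 1)

# Oracle analysis on IMO-Answerbench with Qwen3-4B-Instruct-2507.
oracle_gap = round(52.9 - 41.4, 1)

# Cost savings versus no compaction (percent) -> remaining cost (percent).
savings = {"GLM-4.7-Flash": 67, "MiniMax-M2.5": 63, "MiMo-V2-Flash": 33}
remaining_cost = {model: 100 - pct for model, pct in savings.items()}

print(rubric_gap, no_rubric_vs_fixed, oracle_gap, remaining_cost)
```

Read this way, a 67% saving means the GLM-4.7-Flash run costs about a third of the no-compaction baseline per question.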

How to Apply

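In a long multi-turn agent loop, modify the scaffold so that every N steps (measured in tokens or tool calls) it appends the rubric prompt to the end of the context, obtains a COMPRESS/CONTINUE verdict, and runs summarization only on COMPRESS. Appending preserves the KV cache, so the probe adds almost no extra cost.
For a math-solving agent, use a rubric that checks "Q1: has a final answer been stated? / Q2: have the last 2 rounds added no new facts? / Q3: is the next step clear?", and trigger summarization only when an answer has been stated (Q1=Y) or the model is stuck but knows its next step (Q2=Y ∧ Q3=Y).
For a web-search agent, summarize only when all four conditions hold: "C1: is the current turn a closed unit? / C2: can it be summarized in 3-5 key facts? / C3: has there been progress since the last compaction? / N1: is the search free of repetitive loops?". If any one of them is N, let the agent keep exploring.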
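The web-search rule above (compress only when C1, C2, C3, and N1 are all Y) could be applied to the model's rubric output roughly like this. It is a sketch under assumptions: the "C1: Y -- evidence" line format is borrowed from the math rubric's output format rather than confirmed for the search rubric, and `parse_checks` and `should_compress_search` are hypothetical helper names.

```python
SEARCH_CHECKS = ("C1", "C2", "C3", "N1")


def parse_checks(verdict_text: str) -> dict[str, bool]:
    """Read lines such as 'C1: Y -- evidence' into {'C1': True, ...}."""
    checks: dict[str, bool] = {}
    for line in verdict_text.splitlines():
        key, sep, rest = line.strip().partition(":")
        if sep and key.strip() in SEARCH_CHECKS:
            checks[key.strip()] = rest.strip().upper().startswith("Y")
    return checks


def should_compress_search(verdict_text: str) -> bool:
    """Fire only when all four checks are Y; a missing line counts as N."""
    checks = parse_checks(verdict_text)
    return all(checks.get(name, False) for name in SEARCH_CHECKS)


all_yes = "C1: Y -- turn closed\nC2: Y -- 4 facts\nC3: Y -- new source\nN1: Y -- no loop"
no_progress = "C1: Y -- turn closed\nC2: Y -- 4 facts\nC3: N -- nothing new\nN1: Y -- no loop"

print(should_compress_search(all_yes), should_compress_search(no_progress))
```

Treating a missing or unparseable line as N is a deliberately conservative choice here: when the verdict is unclear, the agent keeps exploring rather than discarding context.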

Code Example

# Core loop of the SELFCOMPACT scaffold (illustrative sketch)

RUBRIC_PROMPT_MATH = """
Judge your math-solving state from the conversation above. Answer Q1...Q3 with Y or N.
Q1 ANSWER: The latest round states a specific final answer (\\boxed{} or 'Final Answer:...')
Q2 STUCK: Your last 2 rounds added no non-trivial fact
Q3 HAS-NEXT: You can write the exact next step

Output: exactly 3 lines
Q1: Y/N -- <evidence>
Q2: Y/N -- <evidence>
Q3: Y/N -- <evidence>

Fire rule: COMPRESS iff Q1=Y OR (Q2=Y AND Q3=Y)
"""

SUMMARIZER_PROMPT = """
Create a compressed summary for another model to continue solving from.
RULES:
1. If a final answer was found, PRESERVE IT at the end.
2. Keep key insights, important calculations, and the reasoning path.
3. Remove redundant text, false starts, and unnecessary repetition.
"""

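# NOTE: the three helpers below are hypothetical placeholders added so the
# sketch runs end to end; they are not taken from the paper.
def is_final_answer(text):
    """Placeholder stop check: the round states a boxed or labeled final answer."""
    return "\\boxed{" in text or "Final Answer:" in text


def len_tokens(messages):
    """Crude whitespace-based token estimate; swap in the model's real tokenizer."""
    return sum(len(m.split()) for m in messages)


def parse_verdict(text):
    """Apply the fire rule to the rubric's 'Qn: Y/N -- <evidence>' lines."""
    answers = {}
    for line in text.splitlines():
        key, sep, rest = line.strip().partition(":")
        if sep and key in ("Q1", "Q2", "Q3"):
            answers[key] = rest.strip().upper().startswith("Y")
    q1, q2, q3 = (answers.get(k, False) for k in ("Q1", "Q2", "Q3"))
    return "COMPRESS" if q1 or (q2 and q3) else "CONTINUE"


def selfcompact_loop(prompt, model, probe_interval=16384, max_rounds=12):
    # `model` is assumed to expose generate(list_of_str) -> str.
    context = [prompt]  # full context; its prefix stays in the KV cache
    tokens_at_last_probe = len_tokens(context)

    for _ in range(max_rounds):
        # 1. Ordinary generation
        response = model.generate(context)
        context.append(response)

        if is_final_answer(response):
            return response

        # 2. Probe the rubric once roughly probe_interval tokens have accumulated.
        #    The rubric prompt is appended to a copy, so the cached prefix is
        #    reused and the working context is left unchanged.
        if len_tokens(context) - tokens_at_last_probe >= probe_interval:
            verdict = model.generate(context + [RUBRIC_PROMPT_MATH])

            if parse_verdict(verdict) == "COMPRESS":
                # 3. Summarize with the same model (no external model needed)
                summary = model.generate(context + [SUMMARIZER_PROMPT])

                # 4. Reset the context to the original question plus the summary
                tokens_before = len_tokens(context)
                context = [prompt, summary]
                print(f"[COMPRESS] {tokens_before}tok → {len_tokens(context)}tok")
            # On CONTINUE the probe output was never added to context, so just proceed.
            tokens_at_last_probe = len_tokens(context)

    return context[-1]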

Terminology

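context rot: The phenomenon where stale errors and abandoned ideas pile up over a long conversation or reasoning process and contaminate new generations. Like old food left in the fridge affecting how fresh ingredients smell.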

compaction: Compressing a long context into a short summary to reduce the token count. Like condensing a document and copying only the key points into a notepad.

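rubric: In this paper, a document of criteria for deciding when to summarize. Like a grading rubric for an exam, it is a prompt that has the model check specific conditions and decide COMPRESS or CONTINUE.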

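KV cache: A cache in which a transformer model stores the keys and values of earlier tokens so they do not have to be recomputed. Similar to a calculator keeping intermediate results in memory for reuse.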

scaffold: An execution framework wrapped around an LLM. Code that controls prompt flow, tool calls, and loops from the outside without changing the model itself.

ReAct loop: An agent pattern that alternates reasoning (Reason) and acting (Act), repeating a "think → search → check results → think again" cycle.

fixed-interval compaction: Summarizing unconditionally once the token count crosses a threshold (e.g., 30% of the context window). Like an alarm clock that rings at the set time regardless of what is going on.

open-weight model: A model whose weights are public and can be downloaded and run directly. Unlike API-only models such as GPT-4, it can run on your own servers.

Original Abstract

Long agent traces composed of chains of thought and tool calls accumulate stale content that anchor subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate it with fixed-interval compaction triggered at a token threshold. Such triggers pay no heed to trajectory structure, risking discard of partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that allows the model itself to decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is unevenly used across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. Our results show that SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap, reframing when to compact as a capability that scaffolds can supply without training.