Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Mar 13, 2026 • Sydney Lewis

TL;DR Highlight

Compresses conversation logs with AI coding agents 11x into searchable memory, with almost no retrieval-quality loss under vector search.

Who Should Read

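Developers who use AI coding agents such as Claude Code or Cursor every day and keep asking themselves "how did I fix that bug last time?" Also backend and infrastructure developers who want to make long-running conversation histories searchable via RAG.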

Core Mechanics

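Each conversation exchange is compressed into a four-field object: a core summary (exchange_core), a technical detail (specific_context), thematic room classifications (room_assignments), and referenced file paths (files_touched). Average length drops from 371 tokens to 38 tokens.
Under vector search (HNSW and exact), quality does not degrade measurably after compression: of 40 comparisons, all 20 vector-search comparisons show no statistically significant difference after Bonferroni correction.
BM25 keyword search does degrade after compression: all 20 BM25 comparisons show a statistically significant drop, with effect sizes ranging from negligible to medium-large (|d| = 0.031 to 0.756).
The best-performing combination is cross-layer search, verbatim BM25 plus distilled HNSW, with MRR 0.759, slightly above the best pure verbatim result (0.745).
1,000 exchanges fit into context at about 39,000 tokens; the verbatim text would need about 407,000 tokens.
All 4,182 conversations (14,340 exchanges) were distilled with Claude Haiku 4.5, and 214,519 query-result pairs were graded by five local LLM graders (Qwen3-8B, Phi-3.5-Mini, Mistral-7B, Yi-1.5-9B, InternLM2.5-7B).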
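The four-field object described above can be sketched as a small data structure plus a validating parser. This is a minimal sketch, not the authors' implementation: the class names (Room, DistilledExchange), the parse_distilled helper, and the choice to treat exchange_core plus specific_context as the searchable text are all assumptions made for illustration. The 1-3 room limit and the file/concept/workflow room types are taken from the distillation prompt in this summary.

```python
import json
from dataclasses import dataclass, field

# Room types listed in the distillation prompt.
ALLOWED_ROOM_TYPES = {"file", "concept", "workflow"}


@dataclass
class Room:
    room_type: str
    room_key: str
    room_label: str
    relevance: float


@dataclass
class DistilledExchange:
    exchange_core: str
    specific_context: str
    room_assignments: list = field(default_factory=list)
    files_touched: list = field(default_factory=list)

    def searchable_text(self) -> str:
        # Assumption: the text to embed is core + context, one per line.
        return f"{self.exchange_core}\n{self.specific_context}"


def parse_distilled(raw_json: str, files_touched: list) -> DistilledExchange:
    """Validate the LLM's JSON reply and attach the regex-extracted file paths."""
    data = json.loads(raw_json)
    rooms = [Room(**r) for r in data.get("room_assignments", [])]
    if not 1 <= len(rooms) <= 3:
        raise ValueError("expected 1-3 room assignments")
    for room in rooms:
        if room.room_type not in ALLOWED_ROOM_TYPES:
            raise ValueError(f"unknown room_type: {room.room_type}")
        if not 0.0 <= room.relevance <= 1.0:
            raise ValueError("relevance must lie in [0.0, 1.0]")
    return DistilledExchange(
        exchange_core=data["exchange_core"].strip(),
        specific_context=data["specific_context"].strip(),
        room_assignments=rooms,
        files_touched=list(files_touched),
    )
```

Rejecting malformed replies at parse time keeps a bad LLM output from silently entering the index; whether to retry or skip such exchanges is left open here.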

Evidence

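The best pure distilled configuration (Distill Core+Rooms / Exact / Weighted) reaches MRR 0.717, 96% of the best verbatim MRR of 0.745.
The best cross-layer configuration (BM25 on verbatim + HNSW on distilled) reaches MRR 0.759, 102% of the best pure verbatim result.
All 20 vector-search comparisons are non-significant (p > 0.00125, the Bonferroni-corrected threshold of 0.05/40); all 20 BM25 comparisons show significant degradation (at the p < 2e-15 level).
96.8% of the query vocabulary is preserved in the distilled corpus, so nearly all the words users actually search for survive distillation.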
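The arithmetic behind these figures is easy to reproduce. The sketch below computes the Bonferroni-corrected threshold, the relative MRR percentages, and a simple vocabulary-retention ratio. The function names are hypothetical, and the tokenizer (lowercased runs of letters, digits, and underscores) is an assumption; the paper's exact tokenization is not given in this summary, so this will not necessarily reproduce the 96.8% figure.

```python
import re


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


def relative_mrr(candidate, baseline):
    """Candidate MRR expressed as a fraction of the baseline MRR."""
    return candidate / baseline


def vocabulary_retention(queries, distilled_texts):
    """Fraction of distinct query words that also occur in the distilled corpus."""
    def vocab(texts):
        words = set()
        for text in texts:
            # Assumed tokenizer: lowercase alphanumeric/underscore runs.
            words.update(re.findall(r"[a-z0-9_]+", text.lower()))
        return words

    query_vocab = vocab(queries)
    if not query_vocab:
        return 0.0
    return len(query_vocab & vocab(distilled_texts)) / len(query_vocab)


threshold = bonferroni_alpha(0.05, 40)      # 40 comparisons in total
pure_ratio = relative_mrr(0.717, 0.745)     # best pure distilled vs best verbatim
cross_ratio = relative_mrr(0.759, 0.745)    # best cross-layer vs best verbatim
```

With 40 comparisons the threshold works out to 0.00125, and the two ratios round to 96% and 102%, matching the numbers quoted above.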

How to Apply

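Split Claude Code or Cursor conversation logs into exchanges, then batch-convert each one with Claude Haiku into a JSON object of the form {exchange_core, specific_context, room_assignments, files_touched} (files_touched comes from a regex pass rather than the LLM). Embed the result with all-MiniLM-L6-v2 and store it in FAISS to get a searchable index at roughly 1/11 the token size.
At query time, run BM25 over the original verbatim text and vector search over the distilled text, then merge the results with RRF; this yields a higher MRR than either method alone. If the verbatim text is already on disk, cross-layer search comes at no extra cost.
Use the distilled index for routing only, and always drill down to the original verbatim exchange for what is shown to the user. Attaching conversation_id and ply_start/ply_end back-references to each distilled object makes this possible.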
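The drill-down step can be sketched as a lookup from search hits back to verbatim messages. This is an illustrative sketch under stated assumptions: the BackRef class and both helper functions are hypothetical names, plies are assumed to be 0-based message positions with ply_end inclusive, and conversations are assumed to be held as an in-memory dict of message lists. Only the field names conversation_id, ply_start, and ply_end come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackRef:
    conversation_id: str
    ply_start: int  # assumed 0-based index of the first message
    ply_end: int    # assumed inclusive index of the last message


def drill_down(ref, conversations):
    """Return the verbatim messages that a distilled object points back to."""
    messages = conversations[ref.conversation_id]
    return messages[ref.ply_start : ref.ply_end + 1]


def resolve_hits(hit_ids, backrefs, conversations):
    """Map ranked distilled-index ids to the verbatim exchanges to display."""
    return [drill_down(backrefs[i], conversations) for i in hit_ids]


# Tiny worked example with made-up data.
conversations = {
    "conv-1": ["user: tests fail", "agent: fixed import", "user: thanks", "agent: done"],
}
backrefs = [BackRef("conv-1", 0, 1), BackRef("conv-1", 2, 3)]
shown = resolve_hits([1, 0], backrefs, conversations)
```

Keeping the back-reference alongside each distilled object means the compressed layer never has to be shown to the user; it only decides which verbatim span to fetch.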

Code Example

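# Distillation prompt (based on Appendix B; used with Claude Haiku 4.5)
# The template contains literal JSON braces, so fill the placeholders with
# str.replace rather than str.format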
prompt = """
Distill this conversation exchange into JSON:
- "exchange_core": 1-2 sentences. What was accomplished or decided?
  Use the specific terms from the exchange. Do not invent details
  not present in the text.
- "specific_context": One concrete detail from the text: a number,
  error message, parameter name, or file path. Copy it exactly.
- "room_assignments": 1-3 rooms. Each room is a topic this exchange
  belongs to. {"room_type": "<file|concept|workflow>",
  "room_key": "<identifier>", "room_label": "<short label>",
  "relevance": <0.0-1.0>}

Project: {project_id}
Exchange (messages {ply_start}-{ply_end}):
{messages_text}

Respond with ONLY valid JSON.
"""

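# files_touched is not generated by the LLM; it is extracted with a regex
import re
def extract_files_touched(exchange_text):
    # The trailing \b stops "js" from matching the start of "json" (e.g. config.json)
    pattern = r'[\w./\-]+\.(?:py|ts|js|go|rs|yaml|json|toml|md)\b'
    return sorted(set(re.findall(pattern, exchange_text)))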

# Embedding + indexing
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer('all-MiniLM-L6-v2')  # 22M params, CPU OK

def build_distill_index(palace_objects):
    texts = [f"{obj['exchange_core']}\n{obj['specific_context']}" 
             for obj in palace_objects]
    embeddings = model.encode(texts, show_progress_bar=True)
    
    index = faiss.IndexFlatL2(384)  # Exact search
    index.add(embeddings)
    return index, texts

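# Cross-layer search: BM25 on verbatim + vector search on distilled
# (the index built above is exact; faiss.IndexHNSWFlat would give the HNSW variant)
from rank_bm25 import BM25Okapi

def cross_layer_search(query, verbatim_texts, distilled_texts,
                        distill_index, top_k=10):
    # Assumes verbatim_texts[i] and distilled_texts[i] describe the same exchange.
    # BM25 on verbatim (rebuilt per call for brevity; build once and reuse in practice)
    tokenized = [t.split() for t in verbatim_texts]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(query.split())

    # Vector search on distilled
    q_emb = model.encode([query])
    _, vec_indices = distill_index.search(q_emb, top_k)

    # RRF fusion (k=60) with 1-based ranks; FAISS pads missing hits with -1
    bm25_top = bm25_scores.argsort()[::-1][:top_k]
    bm25_ranks = {int(i): r + 1 for r, i in enumerate(bm25_top)}
    vec_ranks = {int(i): r + 1 for r, i in enumerate(vec_indices[0]) if i >= 0}

    # Items missing from one list get the fallback rank top_k + 1
    all_ids = set(bm25_ranks) | set(vec_ranks)
    rrf_scores = {i: 1 / (60 + bm25_ranks.get(i, top_k + 1)) +
                     1 / (60 + vec_ranks.get(i, top_k + 1))
                  for i in all_ids}

    return sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]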

Terminology

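MRR: Mean Reciprocal Rank. The average, over queries, of the reciprocal of the rank at which the first fully relevant result appears. An MRR of 0.75 means the answer typically shows up in the first or second position.

BM25: A keyword search algorithm based on term frequency. It is a long-standing ranking function in traditional keyword search engines and works well only when the exact words are present.

HNSW: Hierarchical Navigable Small World. An approximate nearest neighbor search algorithm that quickly finds semantically similar content via vector embeddings.

Bonferroni correction: A statistical technique that tightens the significance level when many hypotheses are tested at once, to reduce the chance of a result looking significant by luck.

RRF: Reciprocal Rank Fusion. A way to merge several ranked result lists into one: each result's rank is converted to a reciprocal, and the reciprocals are summed.

RAG: Retrieval-Augmented Generation. Before the LLM answers, relevant documents are retrieved and placed in its context. It is the standard pattern for injecting external knowledge into an LLM.

Cohen's d: A measure of the practical size of the difference between two groups. Conventionally, 0.2 is read as a small difference, 0.5 as medium, and 0.8 as large.

nDCG: Normalized Discounted Cumulative Gain. A quality metric that takes result ordering into account: the earlier the highly relevant results appear, the higher the score.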
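The three metrics in this glossary reduce to short formulas. The sketch below uses their textbook definitions and is not taken from the paper's analysis pipeline: the function names are hypothetical, nDCG here normalizes against the ideal ordering of the returned list only, and Cohen's d uses the pooled sample standard deviation (a paired variant would differ).

```python
import math
from statistics import mean, variance


def mean_reciprocal_rank(first_relevant_ranks):
    """Ranks are 1-based; None means no relevant result was returned."""
    reciprocal = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)


def dcg(gains):
    """Discounted cumulative gain for relevance grades listed in ranked order."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

For example, if the first relevant result lands at ranks 1, 2, 1, and 1 across four queries, the MRR is (1 + 0.5 + 1 + 1) / 4 = 0.875.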

Related Resources

Searchat: Semantic search for AI coding agent conversations (GitHub)

Original Abstract

Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.