From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

TL;DR Highlight

Storing memory as schema-defined structured records, instead of relying on RAG-style text retrieval, yields markedly higher accuracy on exact fact lookup, state tracking, and aggregation queries.

Who Should Read

Backend and ML engineers adding long-term memory to AI agents or chatbots, especially developers whose RAG-based memory returns stale state or gets aggregation queries wrong.

Core Mechanics

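RAG-style memory (text storage plus embedding search) works for thematic recall, but it is structurally limited for exact value lookup, state tracking, aggregation, relational queries, and absence checks.
Defining a schema as an explicit contract for what must be remembered turns a missing field from a silent failure into a detectable error.
One-shot structured output suffers from compounding errors: even at 97% per-field accuracy, a 20-field record is fully correct only 0.97^20 ≈ 54% of the time, assuming field errors are independent.
xmemory splits extraction into three stages, Object Detection → Field Detection → Field Value Extraction, and attaches a validation gate and local retries to each stage to address this.
It separates memory context into three layers, Request (a single request), Session (the whole session), and Main (the permanent store), which narrows the scope of the decisions each layer is responsible for.
The write path becomes more complex, but in exchange reads become simple queries over verified records, which cuts token consumption roughly threefold and lowers decision latency.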
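
The compounding-error arithmetic above can be checked in a few lines. This is a minimal sketch that assumes field errors are independent, which is the assumption behind the 0.97^20 figure; the function names are illustrative and do not come from the paper.

```python
def record_accuracy(field_accuracy: float, num_fields: int) -> float:
    """Probability that every field in a record is correct, assuming independent errors."""
    return field_accuracy ** num_fields


def required_field_accuracy(target_record_accuracy: float, num_fields: int) -> float:
    """Per-field accuracy needed to reach a target record accuracy, same assumption."""
    return target_record_accuracy ** (1.0 / num_fields)


if __name__ == "__main__":
    # 97% per field over 20 fields: about 0.544
    print(f"{record_accuracy(0.97, 20):.3f}")
    # Per-field accuracy needed for 95% of 20-field records to be fully correct: about 0.9974
    print(f"{required_field_accuracy(0.95, 20):.4f}")
```

The second number shows why single-pass extraction is hard to fix by model quality alone: under this simple model, each field would need to be right more than 99.7% of the time before 95% of 20-field records come out fully correct.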

Evidence

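End-to-end memory benchmark: xmemory reaches 97.10% F1 versus 80.16% to 87.24% for the compared systems (Cognee, Mem0, Supermemory, Zep), a gap of about 10 percentage points over the best competitor.
Structured extraction (insurance claims dataset), object-level accuracy: xmemory + LLM judge 90.42% versus GPT-5.5 high reasoning 83.98% and Gemini 3.1 Pro 89.24%, which is higher than either frontier model used on its own.
Output-level accuracy (the share of claims in which every object is fully correct): xmemory + judge 62.67% versus 61.67% for the best competing model, Gemini 3.1 Pro, and 44% for GPT-5.5 high reasoning.
Splitwise application task (natural-language meal events → aggregation queries): xmemory 95.2% versus Supermemory 73.75%, Cognee 68%, Mem0 (graph) 59.1%, and Zep 25.7%.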

How to Apply

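When your current RAG memory gets queries such as 'Which DB are we on now?', 'What is the current status?', or 'How many times has it failed in total?' wrong: instead of embedding conversation logs as-is, define a schema first (YAML or JSON Schema) and add a write pipeline that converts the conversation into structured records conforming to that schema.
When extracting complex JSON in a single LLM call leaves entire records wrong: split extraction into three stages (decide whether the object exists → decide which fields are present → extract each field value), and add type/format validation to each stage with a retry limited to the field that failed. This substantially raises record-level accuracy.
When an agent has to keep state across sessions: replace the single context with three layers, Request, Session, and Main. Temporary extraction results for the current request go in Request, partially completed objects go in Session, and finalized facts go in Main together with version history.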
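
The three-layer split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not xmemory's implementation: the class names `MainStore` and `SessionContext`, the `merge_request` method, and the promotion rule (promote once every required field is known) are all hypothetical. The Request layer is represented simply by the dictionary extracted from a single request.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Hypothetical required fields, mirroring the ServiceConfig example in this document.
REQUIRED_FIELDS = ("component", "database", "status")


@dataclass
class MainStore:
    """Main layer: permanent store that keeps every confirmed version per entity key."""
    versions: Dict[str, List[Dict[str, Any]]] = field(default_factory=dict)

    def commit(self, key: str, record: Dict[str, Any]) -> int:
        history = self.versions.setdefault(key, [])
        history.append(dict(record))  # store a copy so later edits cannot mutate history
        return len(history)  # version number, starting at 1

    def current(self, key: str) -> Optional[Dict[str, Any]]:
        history = self.versions.get(key)
        return history[-1] if history else None


@dataclass
class SessionContext:
    """Session layer: holds partially completed objects until they are complete."""
    main: MainStore
    partial: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def merge_request(self, key: str, extracted: Dict[str, Any]) -> Optional[int]:
        """Merge one request's extraction; promote to Main once all required fields are known."""
        obj = self.partial.setdefault(key, {})
        # None means "not mentioned in this request", so it never overwrites a known value.
        obj.update({k: v for k, v in extracted.items() if v is not None})
        if all(obj.get(name) is not None for name in REQUIRED_FIELDS):
            version = self.main.commit(key, obj)
            del self.partial[key]
            return version
        return None


if __name__ == "__main__":
    main = MainStore()
    session = SessionContext(main)
    # First request mentions the component and database but no status: stays in Session.
    print(session.merge_request("session store", {"component": "session store", "database": "Redis"}))
    # Second request supplies the status: the object is complete and is promoted to Main.
    print(session.merge_request("session store", {"status": "active"}))
    print(main.current("session store"))
```

Keeping incomplete objects out of Main means a read never sees a half-filled record, and the append-only version list preserves earlier states when a fact changes later.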

Code Example

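# Schema definition example (YAML)
# The 'ServiceConfig' entity: remembers which component uses which database
ServiceConfig:
  required:
    - component       # string: 'session store', 'cache', etc.
    - database        # string: 'Redis', 'Postgres', etc.
    - status          # enum: active | rejected | unknown
  optional:
    - reason          # string: why the status changed
    - changed_at      # datetime

# Write path: extract a structured record from conversation text (three-stage decomposition)
import json
from typing import Optional

# llm_call(prompt) -> str and validate_field(field, value, schema) -> bool are
# placeholders for your own LLM client and type/format checks; they are not defined here.
# `schema` is the parsed ServiceConfig mapping above (keys: 'required', 'optional').

def extract_memory(text: str, schema: dict) -> Optional[dict]:
    all_fields = schema['required'] + schema['optional']

    # Stage 1: decide whether the object is present at all
    obj_detected = llm_call(
        f"Does this text mention a ServiceConfig entity? Answer yes/no.\n{text}"
    )
    if obj_detected.strip().lower() != 'yes':
        return None  # no object detected, so nothing is written

    # Stage 2: decide which fields are mentioned
    raw_fields = llm_call(
        f"Which fields from {all_fields} "
        f"are mentioned in the text? Return JSON list.\n{text}"
    )
    try:
        # Parse the JSON answer and drop any field name that is not in the schema
        fields_present = [f for f in json.loads(raw_fields) if f in all_fields]
    except (json.JSONDecodeError, TypeError):
        fields_present = []  # unparseable answer: treat as no fields detected

    # Stage 3: extract each field value, validate it, and retry only that field
    record = {}
    for field in fields_present:
        for attempt in range(3):  # up to 3 attempts per field
            value = llm_call(
                f"Extract '{field}' from the text. "
                f"If not mentioned, return null. Do NOT infer or approximate.\n{text}"
            )
            if validate_field(field, value, schema):
                record[field] = value
                break
        else:
            record[field] = None  # every attempt failed validation: explicit unknown

    return record

# Read path: structured query (direct lookup, no inference)
# SELECT database FROM ServiceConfig WHERE component = 'session store' AND status = 'active'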
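
The read path in the comment above can be tried end to end with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the table name `service_config`, the helper functions, and the sample rows (including the reason text and dates) are made up for illustration, and the paper does not say which storage engine xmemory uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE service_config (
        component  TEXT NOT NULL,
        "database" TEXT NOT NULL,
        status     TEXT NOT NULL CHECK (status IN ('active', 'rejected', 'unknown')),
        reason     TEXT,
        changed_at TEXT
    )
    """
)
# Made-up sample rows standing in for records produced by the write path.
conn.executemany(
    "INSERT INTO service_config VALUES (?, ?, ?, ?, ?)",
    [
        ("session store", "Postgres", "rejected", "too slow for session reads", "2024-01-10"),
        ("session store", "Redis", "active", None, "2024-01-12"),
        ("cache", "Redis", "active", None, "2024-01-12"),
    ],
)


def current_database(component: str):
    """Exact lookup of current state; returns None when no active record exists."""
    row = conn.execute(
        """SELECT "database" FROM service_config WHERE component = ? AND status = 'active'""",
        (component,),
    ).fetchone()
    return row[0] if row else None


def count_by_status(status: str) -> int:
    """Aggregation computed by the database rather than inferred from retrieved prose."""
    return conn.execute(
        "SELECT COUNT(*) FROM service_config WHERE status = ?", (status,)
    ).fetchone()[0]


if __name__ == "__main__":
    print(current_database("session store"))  # the active record, not the rejected one
    print(current_database("queue"))          # absence is an explicit None
    print(count_by_status("active"))
```

The rejected Postgres row stays in the table, yet the lookup returns only the active record, and a component that was never recorded yields an explicit `None` rather than a plausible-sounding guess.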

Terminology

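Schema: Like a database table design, a contract defined in advance for what information must be stored and in what format. Without one, the AI stores whatever it chooses to summarize or guess.

RAG: Short for Retrieval-Augmented Generation. When a question arrives, relevant chunks of text are retrieved and added to the LLM context. Good for thematic recall, weak for exact fact lookup.

Object-level accuracy: The share of extracted records in which every field is correct. A single wrong field scores the record as zero. Even with 97% per-field accuracy, the probability that the whole record is correct can be far lower.

Validation gate: A check that runs after each extraction stage and asks whether the value has the right type, format, and range. On failure, only the affected field is retried rather than the whole extraction.

Write path / Read path: The write path is the processing that data goes through when it is stored; the read path is the processing it goes through when it is retrieved. The core idea of this paper is to do the heavy processing at write time rather than at read time.

Graph RAG: Ordinary RAG extended with a knowledge graph (relations between concepts). Strong at multi-hop retrieval (following A→B→C links), but the nodes themselves are still text, so exact fact lookup remains limited.

Data processing inequality: A result from information theory stating that processing data (transforming or compressing it) cannot increase the information it carries about the original; it can only preserve or lose it. This is why summarizing text can discard details that are needed later.

LLM judge: A separate LLM that evaluates another LLM's output. Here it is part of the write pipeline, reviewing whether extracted records are correct and giving feedback.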
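
The difference between object-level accuracy and the stricter output-level accuracy quoted in the Evidence section can be shown with a small scorer. This is a sketch of the definitions as stated in this document, assuming predicted and expected objects are aligned by position; the paper's actual matching and scoring procedure may differ, and the function names and sample data are illustrative.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]
# One output = (predicted objects, expected objects) for a single input document.
Output = Tuple[List[Record], List[Record]]


def object_level_accuracy(outputs: List[Output]) -> float:
    """Share of expected objects whose every field matches the prediction exactly."""
    total = sum(len(expected) for _, expected in outputs)
    correct = sum(
        pred == exp
        for predicted, expected in outputs
        for pred, exp in zip(predicted, expected)  # missing predictions count as wrong
    )
    return correct / total if total else 0.0


def output_level_accuracy(outputs: List[Output]) -> float:
    """Share of outputs in which all objects are fully correct."""
    if not outputs:
        return 0.0
    perfect = sum(predicted == expected for predicted, expected in outputs)
    return perfect / len(outputs)


if __name__ == "__main__":
    good = {"component": "cache", "database": "Redis", "status": "active"}
    wrong = {"component": "cache", "database": "Redis", "status": "rejected"}
    outputs = [
        ([good, good], [good, good]),   # both objects correct
        ([good, wrong], [good, good]),  # one field wrong in the second object
    ]
    print(object_level_accuracy(outputs))  # 3 of 4 objects correct
    print(output_level_accuracy(outputs))  # 1 of 2 outputs fully correct
```

A single wrong field costs one object at the object level but invalidates the whole output at the output level, which is why output-level numbers sit well below object-level numbers for the same system.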

Original Abstract

Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic recall, but it is mismatched to the kinds of memory that agents need in production: exact facts, current state, updates and deletions, aggregation, relations, negative queries, and explicit unknowns. These operations require memory to behave less like search and more like a system of record. This paper argues that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. We evaluate this design on structured extraction and end-to-end memory benchmarks. On the extraction benchmark, the judge-in-the-loop configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines. On our end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines. On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses. The results show that, for memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone.