UNO: A Continual Learning Framework for LLM Systems Based on User Logs

Improve Large Language Model Systems with User Logs

Feb 6, 2026 • Changyue Wang, Weihang Su, Qingyao Ai +1

TL;DR Highlight

A framework that filters noise out of user logs and automatically builds LoRA adapters to keep improving an LLM that is already deployed.

Who Should Read

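ML engineers who want to use user feedback from a production LLM service to improve the model, especially teams that have hit the limits of RAG or memory systems and are considering fine-tuning.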

Core Mechanics

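Automatically distills the natural-language feedback in user logs into a 'rule set' → turns unstructured logs into a trainable form
Groups similar feedback with hierarchical clustering over combined query and rule vectors, which makes each LoRA easier to train
Uses a 'Cognitive Gap' metric to judge whether a cluster covers something the model already knows or something it does not → blocks noise risk up front
Applies an Expert LoRA (trained with DPO, generates the answer directly) to Low Gap clusters and a Critic LoRA (critiques a draft, then regenerates) to High Gap clusters
Checks LoRA quality automatically before deployment with an LLM-as-Judge simulation validator → switches to the Reflective path when the Win Rate falls short
Confirms in an online-evolution setting that the system can keep updating on new logs after deployment, giving it a lifelong learning structure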
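
The clustering step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the query and rule embeddings are simply concatenated (the summary only says they are "combined"), uses average-linkage cosine distance, and stops merging at a hypothetical `distance_threshold`. The names `joint_vector`, `rule_weight`, and `agglomerative_cluster` are invented for this example, and the toy vectors stand in for real embeddings.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def joint_vector(query_vec, rule_vec, rule_weight=1.0):
    """Concatenate query and rule embeddings (assumed combination scheme)."""
    return list(query_vec) + [rule_weight * x for x in rule_vec]


def agglomerative_cluster(vectors, distance_threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per item and repeatedly merges the two closest
    clusters until the closest pair is farther apart than the threshold.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > distance_threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


# Toy (query_embedding, rule_embedding) pairs standing in for real log entries.
logs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.9, 0.1], [1.0, 0.1]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([0.1, 0.9], [0.1, 1.0]),
]
vectors = [joint_vector(q, r) for q, r in logs]
clusters = agglomerative_cluster(vectors, distance_threshold=0.3)
```

With real embeddings, a library implementation such as scikit-learn's `AgglomerativeClustering` with a distance threshold would likely be the more practical choice; the pure-Python version is only meant to show the idea.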

Evidence

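Short-Long task on Qwen3-8B: UNO reaches a Norm-Score of 77.09, above both RAG-Embedding (73.22) and the best memory-based baseline, MemoryOS (74.62)
Long-Short task on phi-4: UNO scores 52.60 vs. 46.46 for the strongest baseline, which is the base model itself (Base); every other memory/RAG method lands below or roughly level with the base model
Cognitive Gap pre-filtering cuts DPO training cost by up to 78% (on the Long-Short task)
UNO-Single (the variant with the Reflective path removed) also beats every RAG and memory method while adding zero extra context tokens, so it is far more token-efficient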

How to Apply

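If you have conversation logs with accumulated user feedback (revision requests, complaints, and so on), a reasonable first step is a pipeline that uses an LLM to extract a 'rule set' → generate an improved response → build a preference pair
Measure the Cognitive Gap per cluster, then train an Expert LoRA with DPO for Low Gap clusters and a Critic LoRA as a critique model for High Gap clusters; this can be operated much like an MoE (mixture-of-experts) setup with a separate specialist adapter per domain
The simulation-validator pattern, which automatically checks Win Rate with an LLM-as-Judge before deployment, carries over directly to automating fine-tuning checkpoint selection (reported to be meaningfully better than perplexity)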
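
The routing and validation steps above can be sketched as follows. This is a hedged illustration under stated assumptions, not UNO's actual code: `GAP_THRESHOLD` and `MIN_WIN_RATE` are hypothetical values, the judge verdicts are assumed to arrive as `"win"`, `"tie"`, or `"loss"` strings comparing the adapter against the base model, and counting a tie as half a win is a choice made for this example.

```python
from dataclasses import dataclass

GAP_THRESHOLD = 0.5  # hypothetical cut-off between Low Gap and High Gap
MIN_WIN_RATE = 0.5   # hypothetical bar an Expert LoRA must clear before deployment


@dataclass
class ClusterPlan:
    cluster_id: str
    cognitive_gap: float
    module: str  # "expert" (answer directly) or "critic" (critique, then regenerate)


def plan_module(cluster_id, cognitive_gap, gap_threshold=GAP_THRESHOLD):
    """Route Low Gap clusters to an Expert LoRA and High Gap ones to a Critic LoRA."""
    module = "expert" if cognitive_gap <= gap_threshold else "critic"
    return ClusterPlan(cluster_id, cognitive_gap, module)


def win_rate(verdicts):
    """Fraction of judge verdicts won by the adapter; a tie counts as half a win."""
    if not verdicts:
        return 0.0
    score = sum(1.0 if v == "win" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return score / len(verdicts)


def validate(plan, verdicts, min_win_rate=MIN_WIN_RATE):
    """Keep an Expert plan only if it clears the win-rate bar.

    Otherwise fall back to the critic (Reflective) path for that cluster.
    """
    if plan.module == "expert" and win_rate(verdicts) < min_win_rate:
        return ClusterPlan(plan.cluster_id, plan.cognitive_gap, "critic")
    return plan


# Example: a Low Gap cluster whose Expert LoRA loses most judged comparisons.
plan = plan_module("billing-questions", cognitive_gap=0.2)
final_plan = validate(plan, ["loss", "loss", "win"])
```

The same `win_rate` function could be used to rank fine-tuning checkpoints: score each checkpoint on a held-out set of judged comparisons and keep the one with the highest rate.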

Code Example


# Core UNO logic: example of rule extraction and preference-pair construction

from transformers import AutoTokenizer, AutoModelForCausalLM

DISTILL_PROMPT = """
You are analyzing a user's feedback on an AI response.
Given the dialogue below, extract actionable editing rules.

Dialogue:
User query: {query}
AI response: {response}
User feedback: {feedback}

Output a numbered list of specific, actionable rules to improve the response.
If no meaningful feedback exists, output: EMPTY
"""

REVISE_PROMPT = """
Revise the following response according to the given rules.

Original query: {query}
Original response: {response}
Rules to apply:
{rules}

Revised response:
"""

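def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    """Generate a completion and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens. The prompt itself contains the word "EMPTY",
    # so decoding the full sequence would make every result look empty.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

def distill_rules(model, tokenizer, query, response, feedback):
    """Extract editing rules from one logged turn; return None if there is no usable feedback."""
    prompt = DISTILL_PROMPT.format(
        query=query, response=response, feedback=feedback
    )
    rules = generate_text(model, tokenizer, prompt)
    if not rules or rules.upper().startswith("EMPTY"):
        return None
    return rules

def build_preference_pair(model, tokenizer, query, orig_response, rules):
    """chosen = response revised according to the rules, rejected = original response"""
    revised = generate_text(
        model,
        tokenizer,
        REVISE_PROMPT.format(
            query=query, response=orig_response, rules=rules
        ),
    )
    return {
        "prompt": query,
        "chosen": revised,   # y_w
        "rejected": orig_response  # y_l
    }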

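# Measure the Cognitive Gap (using a reranker)
def compute_cognitive_gap(reranker, query, user_rules, llm_predicted_rules):
    """
    user_rules: rules actually extracted from user feedback
    llm_predicted_rules: rules the LLM predicts without seeing the user log
    Lower gap: the model already knows this area, so an Expert LoRA is safe
    Higher gap: the area is unfamiliar, so switch to a Critic LoRA
    """
    # compute_relevance is a placeholder interface, not a real library call;
    # it is assumed to return relevance scores in [0, 1]. query is unused here.
    scores = reranker.compute_relevance(user_rules, llm_predicted_rules)
    return 1 - min(scores)  # gap based on the minimum score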
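
To make the gap computation above concrete, here is a small self-contained run with a toy stand-in reranker. Everything in it is an assumption for illustration: `ToyReranker` scores each user rule by its best Jaccard token overlap with any predicted rule, whereas a real system would use a trained reranker, and `cognitive_gap` is a local copy of the idea that returns the maximum gap when there are no scores.

```python
class ToyReranker:
    """Stand-in for a real reranker: Jaccard token overlap in [0, 1]."""

    def compute_relevance(self, user_rules, predicted_rules):
        scores = []
        for rule in user_rules:
            rule_tokens = set(rule.lower().split())
            best = 0.0
            for pred in predicted_rules:
                pred_tokens = set(pred.lower().split())
                union = rule_tokens | pred_tokens
                overlap = len(rule_tokens & pred_tokens) / len(union) if union else 0.0
                best = max(best, overlap)
            scores.append(best)  # best match for this user rule
        return scores


def cognitive_gap(reranker, user_rules, predicted_rules):
    """1 - worst per-rule relevance; 1.0 (maximum gap) if there is nothing to score."""
    scores = reranker.compute_relevance(user_rules, predicted_rules)
    return 1.0 - min(scores) if scores else 1.0


reranker = ToyReranker()

# The model anticipated every rule the user asked for: no gap.
low_gap = cognitive_gap(
    reranker,
    user_rules=["use metric units"],
    predicted_rules=["use metric units", "keep answers short"],
)

# One user rule was not anticipated at all: maximum gap.
high_gap = cognitive_gap(
    reranker,
    user_rules=["use metric units", "cite the source"],
    predicted_rules=["use metric units", "keep answers short"],
)
```

Taking the minimum means a single unanticipated rule is enough to mark the whole cluster as High Gap, which matches the conservative reading of the snippet above.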

Terminology

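LoRA: A technique that trains only two small added matrices instead of retraining the whole model. It is like sewing on a patch rather than making a whole new garment.

DPO: A method that teaches preferences by showing pairs of good and bad answers. It is simpler to implement than RLHF and generally more stable.

Cognitive Gap: The distance that measures how much of the user feedback the model 'already knows'. When the gap is small, fine-tuning is safe; when it is large, the noise risk is high, so the system routes through a critique model instead.

LLM-as-Judge: An approach in which one LLM evaluates the output of another LLM. It can assign quality scores automatically, without human labelers.

Agglomerative Clustering: A hierarchical clustering method that progressively merges small clusters. The number of clusters does not have to be fixed in advance.

Off-policy Optimization: The distribution-mismatch problem that arises when training on data collected under different conditions (for example, a past version) from the model that is actually deployed.

Continual Learning: A methodology for continuing to learn new knowledge without forgetting what was learned before. For LLMs, catastrophic forgetting is the main challenge.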
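
For readers who want the LoRA and DPO entries above in symbols, the standard textbook formulations are given below. These come from the general literature rather than from this summary, so whether UNO uses any variant of them is not stated here. The symbols $y_w$ and $y_l$ correspond to the `chosen` and `rejected` fields of a preference pair, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ is the usual DPO temperature.

```latex
% LoRA: the frozen weight W_0 is augmented with a trainable low-rank product BA.
W = W_0 + BA, \qquad
B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

% DPO: raise the likelihood of y_w relative to y_l, measured against the reference model.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```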

Related Resources

Original Abstract

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .