Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models

Mar 1, 2026 • Yichao Wu, Penghao Liang, Yafei Xiang, and 5 others

TL;DR Highlight

Delivers GPT-4o-mini-level RAG noise filtering with a 1.7B small model, cutting cost by 98% and latency by 94.6%.

Who Should Read

Backend and AI engineers who attach an LLM-as-a-judge to a RAG pipeline and are worried about API cost and latency. Especially relevant for teams whose ReAct agents waste tokens because of bad retrieval results.

Core Mechanics

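When bad retrieval results reach the agent, the damage goes beyond hallucination: unnecessary multi-hop reasoning loops and redundant tool calls sharply inflate TTFT (Time-to-First-Token) and cost.
Qwen3-1.7B is fine-tuned with LoRA (a technique that trains only a small number of parameters) into a tiny router, Tiny-Critic, whose only job is to decide whether a retrieval result is usable or not.
Constrained Decoding (restricting the output to two tokens, pass/fail) is combined with Non-Thinking Mode (skipping CoT) so that routing itself finishes in 42 ms.
If the verdict is fail, a fallback tool such as Tavily Search is called through MCP (Model Context Protocol) to fetch clean context again.
Zero-shot Qwen3-1.7B had an FPR (false positive rate) of 38.2%, which dropped to 4.1% after LoRA training. Without fine-tuning, the model is not usable as a router.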
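The control flow described above (retrieve, ask the critic, fall back on a fail) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `retrieve_with_fallback`, `toy_retrieve`, `toy_critic`, and `toy_fallback` are hypothetical names, and the toy functions stand in for a real retriever, the LoRA router, and an MCP-backed search tool.

```python
from typing import Callable

Docs = list[str]


def retrieve_with_fallback(
    query: str,
    retrieve: Callable[[str], Docs],
    critic: Callable[[str, Docs], str],
    fallback_search: Callable[[str], Docs],
) -> tuple[Docs, str]:
    """Return (context, source); source is "retriever" or "fallback"."""
    docs = retrieve(query)
    # The critic is expected to answer exactly "pass" or "fail".
    if critic(query, docs) == "pass":
        return docs, "retriever"
    # Anything other than "pass" is treated as a failure and triggers fallback.
    return fallback_search(query), "fallback"


# Toy stand-ins (assumptions) so the sketch runs without a model or search API.
def toy_retrieve(query: str) -> Docs:
    return ["Shakespeare wrote Othello in 1603."]


def toy_critic(query: str, docs: Docs) -> str:
    # Placeholder rule: pass only if some document mentions "Hamlet".
    return "pass" if any("Hamlet" in d for d in docs) else "fail"


def toy_fallback(query: str) -> Docs:
    return ["Hamlet is a tragedy written by William Shakespeare."]


context, source = retrieve_with_fallback(
    "Who wrote Hamlet?", toy_retrieve, toy_critic, toy_fallback
)
print(source, context)
```

Passing the critic and the fallback as callables keeps the routing logic independent of which model or search backend is actually used.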

Evidence

Routing F1-Score: Tiny-Critic 0.912 vs. GPT-4o-mini (Heavy-CRAG) 0.934, described as statistically equivalent.
Routing TTFT: Heavy-CRAG 785 ms → Tiny-Critic 42 ms, a 94.6% reduction.
Explicit cost per 10k queries: Heavy-CRAG $3.00 → Tiny-Critic $0.06, a 98% reduction.
Faithfulness under 45% noise: Naive RAG 0.44 vs. Tiny-Critic 0.86 (holding close to the 0.89 level seen without noise).
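The two reduction percentages follow directly from the before/after figures listed above. A short check, where `relative_reduction` is a helper name introduced only for this illustration:

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Figures taken from the Evidence list above.
latency_cut = relative_reduction(785.0, 42.0)  # routing TTFT in ms
cost_cut = relative_reduction(3.00, 0.06)      # USD per 10k queries

print(f"latency: {latency_cut:.1f}%  cost: {cost_cut:.1f}%")
# prints "latency: 94.6%  cost: 98.0%"
```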

How to Apply

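Replace the LLM evaluator in an existing RAG pipeline (GPT-4o-mini, etc.) with a Qwen3-1.7B + LoRA router. Concatenate the retrieval results with the query and configure constrained decoding so the model returns a single 'pass' or 'fail'.
On a fail verdict, wire in fallback search through an MCP tool call. Wrapping Tavily or an in-house search API as an MCP server lets it slot naturally into the agent loop.
The training data is 5,000 queries drawn from Natural Questions and HotpotQA. The same noise-injection protocol, which uses BGE-M3 to sample Hard Negatives from ranks 10 to 20, can be applied as-is to build a domain-specific router.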
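One way the rank 10 to 20 hard-negative sampling could look is sketched below. The summary does not spell out the protocol's details, so several things here are assumptions: the function name `make_example`, treating ranks as 1-indexed and inclusive, labelling the top-k context as "pass" (which presumes it contains the gold passage), and the input being a list already ranked by a dense retriever such as BGE-M3.

```python
import random


def make_example(
    ranked_docs: list[str],
    k: int,
    noisy: bool,
    rng: random.Random,
    lo: int = 10,
    hi: int = 20,
) -> tuple[list[str], str]:
    """Build one (context, label) training pair for the router."""
    if not noisy:
        # Clean case: keep the top-k documents and label the context usable.
        return ranked_docs[:k], "pass"
    # Noisy case: draw hard negatives from ranks lo..hi (1-indexed, inclusive).
    pool = ranked_docs[lo - 1:hi]
    if len(pool) < k:
        raise ValueError("not enough documents in the hard-negative rank window")
    return rng.sample(pool, k), "fail"


# Placeholder ranking; in practice this would come from the retriever.
ranked = [f"doc{i}" for i in range(1, 31)]
rng = random.Random(0)
noisy_ctx, noisy_label = make_example(ranked, k=3, noisy=True, rng=rng)
clean_ctx, clean_label = make_example(ranked, k=3, noisy=False, rng=rng)
print(noisy_label, noisy_ctx)
print(clean_label, clean_ctx)
```

Sampling from the middle of the ranking, rather than from random documents, yields negatives that look relevant to the query, which is what makes them hard.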

Code Example

# LoRA router inference example (constrained decoding)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

model_id = "Qwen/Qwen3-1.7B"
lora_path = "./tiny-critic-lora"  # fine-tuned adapter

tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, lora_path)
model.eval()  # inference only

# First token id of each verdict word. This assumes the adapter was trained
# to emit exactly these tokens right after "Verdict:".
PASS_TOKEN = tokenizer.encode("pass", add_special_tokens=False)[0]
FAIL_TOKEN = tokenizer.encode("fail", add_special_tokens=False)[0]

def route(query: str, docs: list[str]) -> str:
    context = "\n".join(docs)
    prompt = f"Query: {query}\nContext: {context}\nVerdict:"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Constrained decoding: allow only the pass/fail tokens
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]  # logits for the next token only
        mask = torch.full_like(logits, float("-inf"))
        mask[:, PASS_TOKEN] = 0
        mask[:, FAIL_TOKEN] = 0
        masked_logits = logits + mask
        token_id = masked_logits.argmax(dim=-1).item()

    return "pass" if token_id == PASS_TOKEN else "fail"

# Usage
result = route("Who wrote Hamlet?", ["Shakespeare wrote Othello in 1603."])
# A well-trained router should return "fail" here, which triggers the MCP fallback

Terminology

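LoRA: A technique that trains two small additional matrices instead of retraining the whole model. The original weights stay untouched and only the adapter is swapped, so it needs far less memory and time.

Constrained Decoding: A technique that restricts the tokens an LLM may output to a predefined set. Here the output is forced to be one of two words, 'pass' or 'fail', which cuts off any unnecessary reasoning.

TTFT: Time-to-First-Token. The time from sending a request until the first token comes back. A key metric for the response speed a user perceives.

Agentic RAG: RAG that runs as an autonomous loop: rather than searching once and stopping, it evaluates the results itself and, if they fall short, searches again or calls a tool.

RAGAS Faithfulness: A score (0 to 1) measuring how faithful the generated answer is to the retrieved context. Higher means fewer hallucinations.

Hard Negative: A document that looks relevant on the surface but is unrelated to the actual answer. A 'trap document' that easily fools a retrieval system.

MCP: Model Context Protocol. A protocol created by Anthropic that lets AI models call external tools (search, databases, etc.) in a standard way.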

Original Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing. Evaluations on noise-injected datasets demonstrate Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.