Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models
TL;DR Highlight
Implementing GPT-4o-mini-level RAG noise filtering with a 1.7B small model — 98% cost reduction, 94.6% latency reduction.
Who Should Read
Backend/AI engineers who use LLM-as-a-judge in RAG pipelines but are worried about API costs and latency. Especially teams experiencing token waste from bad search results in ReAct agents.
Core Mechanics
- When bad search results enter an agent, they cause not just hallucination but also unnecessary multi-hop reasoning loops and duplicate tool calls, inflating TTFT (time-to-first-token) and cost
- Fine-tuning Qwen3-1.7B with LoRA creates a tiny router (Tiny-Critic) that only judges "can this search result be used / not used"
- Combining Constrained Decoding (limiting output tokens to just pass/fail) with Non-Thinking Mode (skipping CoT) reduces routing itself to 42ms
- When the judgment is fail, a fallback tool like Tavily Search is called via MCP (Model Context Protocol) to retrieve clean context
- Zero-shot Qwen3-1.7B had 38.2% FPR (false positive rate), which dropped to 4.1% after LoRA training — unusable as a router without fine-tuning
Evidence
- Routing F1-Score: Tiny-Critic 0.912 vs GPT-4o-mini (Heavy-CRAG) 0.934 — a statistically comparable level
- Routing TTFT: Heavy-CRAG 785ms → Tiny-Critic 42ms, 94.6% reduction
- Cost per 10k queries: Heavy-CRAG $3.00 → Tiny-Critic $0.06, a 98% reduction
- Faithfulness in 45% noise environment: Naive RAG 0.44 vs Tiny-Critic 0.86 (maintained near no-noise level of 0.89)
How to Apply
- Replace the existing RAG pipeline's LLM evaluator (e.g., GPT-4o-mini) with a Qwen3-1.7B + LoRA router. Concatenate the search results and query, then configure constrained decoding so the model can emit only "pass"/"fail".
- On a fail verdict, invoke fallback search via an MCP tool call. Wrapping Tavily or your own search API as an MCP server lets you insert it naturally into the agent loop.
- For training data, use 5,000 queries drawn from Natural Questions + HotpotQA; apply the same noise-injection protocol (sampling hard negatives from ranks 10-20 of a BGE-M3 retrieval run) to build a domain-specific router.
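The noise-injection protocol in the last bullet can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names and the assumption that the BGE-M3 results arrive as a score-sorted list are mine.

```python
import random

def sample_hard_negatives(ranked_docs: list[str], lo: int = 10, hi: int = 20, k: int = 2) -> list[str]:
    """Sample hard negatives from ranks lo..hi (1-based) of a retrieval run.

    `ranked_docs` is assumed to be the BGE-M3 result list sorted by score;
    ranks 10-20 are close enough to the query to be plausible, but wrong.
    """
    band = ranked_docs[lo - 1:hi]
    return random.sample(band, min(k, len(band)))

def inject_noise(gold_docs: list[str], ranked_docs: list[str], noise_ratio: float = 0.45) -> list[str]:
    # Replace a fraction of the gold passages with hard negatives so the
    # router sees realistically misleading context at training time.
    n_noise = round(len(gold_docs) * noise_ratio)
    negatives = sample_hard_negatives(ranked_docs, k=n_noise)
    kept = gold_docs[:len(gold_docs) - len(negatives)]
    return kept + negatives
```

Sampling from a mid-rank band (rather than random passages) is what makes the negatives "hard": the router must learn to reject topically related but unusable context.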
Code Example
# LoRA Router Inference Example (Constrained Decoding)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
model_id = "Qwen/Qwen3-1.7B"
lora_path = "./tiny-critic-lora"  # Fine-tuned adapter
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, lora_path)
PASS_TOKEN = tokenizer.encode("pass", add_special_tokens=False)[0]
FAIL_TOKEN = tokenizer.encode("fail", add_special_tokens=False)[0]
def route(query: str, docs: list[str]) -> str:
    context = "\n".join(docs)
    prompt = f"Query: {query}\nContext: {context}\nVerdict:"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Constrained decoding: only pass/fail tokens allowed
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]  # logits at the last position only
    mask = torch.full_like(logits, float("-inf"))
    mask[:, PASS_TOKEN] = 0
    mask[:, FAIL_TOKEN] = 0
    masked_logits = logits + mask
    token_id = masked_logits.argmax(dim=-1).item()
    return "pass" if token_id == PASS_TOKEN else "fail"
# Usage
result = route("Who wrote Hamlet?", ["Shakespeare wrote Othello in 1603."])
# → "fail" → triggers the MCP fallback
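On a fail verdict, the agent loop swaps in a fallback retrieval tool. A minimal orchestration sketch with the MCP call stubbed out (the `call_mcp_tool` helper and the `tavily_search` tool name are assumptions for illustration, not a real MCP client API):

```python
from typing import Callable

def call_mcp_tool(tool: str, **kwargs) -> list[str]:
    # Stub for an MCP client call; a real agent would invoke the
    # registered MCP server (e.g. a Tavily wrapper) here.
    return [f"[{tool}] fallback result for {kwargs.get('query')}"]

def answer_with_fallback(query: str,
                         retrieve: Callable[[str], list[str]],
                         route: Callable[[str, list[str]], str]) -> list[str]:
    docs = retrieve(query)
    if route(query, docs) == "pass":
        return docs  # local retrieval is usable as-is
    # Tiny-Critic said "fail": fetch clean context via the MCP tool instead.
    return call_mcp_tool("tavily_search", query=query)
```

Keeping the router and the fallback tool behind plain callables makes it easy to drop this gate into an existing ReAct loop without touching the generator.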
Original Abstract
Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing. Evaluations on noise-injected datasets demonstrate Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.