Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models
TL;DR Highlight
Implementing GPT-4o-mini-level RAG noise filtering with a 1.7B small model — 98% cost reduction, 94.6% latency reduction.
Who Should Read
Backend/AI engineers who use LLM-as-a-judge in RAG pipelines but are worried about API costs and latency. Especially teams experiencing token waste from bad search results in ReAct agents.
Core Mechanics
- When bad search results enter an agent, they cause not just hallucination but also unnecessary multi-hop reasoning loops and duplicate tool calls, inflating TTFT (time-to-first-token) and cost
- Fine-tuning Qwen3-1.7B with LoRA creates a tiny router (Tiny-Critic) that only judges "can this search result be used / not used"
- Combining Constrained Decoding (limiting output tokens to just pass/fail) with Non-Thinking Mode (skipping CoT) reduces routing itself to 42ms
- When the judgment is fail, a fallback tool like Tavily Search is called via MCP (Model Context Protocol) to retrieve clean context
- Zero-shot Qwen3-1.7B had 38.2% FPR (false positive rate), which dropped to 4.1% after LoRA training — unusable as a router without fine-tuning
Evidence
- Routing F1-Score: Tiny-Critic 0.912 vs GPT-4o-mini (Heavy-CRAG) 0.934 — a statistically comparable level
- Routing TTFT: Heavy-CRAG 785ms → Tiny-Critic 42ms, 94.6% reduction
- Cost per 10k queries: Heavy-CRAG $3.00 → Tiny-Critic $0.06, a 98% reduction
- Faithfulness in 45% noise environment: Naive RAG 0.44 vs Tiny-Critic 0.86 (maintained near no-noise level of 0.89)
How to Apply
- Replace the existing RAG pipeline's LLM evaluator (e.g., GPT-4o-mini) with a Qwen3-1.7B + LoRA router. Concatenate the search results and query, then configure constrained decoding so the model can emit only "pass"/"fail".
- On a fail verdict, invoke fallback search via an MCP tool call. Wrapping Tavily or your own search API as an MCP server lets you insert it naturally into the agent loop.
- For training data, use 5,000 queries drawn from Natural Questions + HotpotQA; apply the same noise-injection protocol (sampling hard negatives from ranks 10-20 of a BGE-M3 retrieval run) to build a domain-specific router.
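The noise-injection protocol in the last bullet can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names and the assumption that the BGE-M3 results arrive as a score-sorted list are mine.

```python
import random

def sample_hard_negatives(ranked_docs: list[str], lo: int = 10, hi: int = 20, k: int = 2) -> list[str]:
    """Sample hard negatives from ranks lo..hi (1-based) of a retrieval run.

    `ranked_docs` is assumed to be the BGE-M3 result list sorted by score;
    ranks 10-20 are close enough to the query to be plausible, but wrong.
    """
    band = ranked_docs[lo - 1:hi]
    return random.sample(band, min(k, len(band)))

def inject_noise(gold_docs: list[str], ranked_docs: list[str], noise_ratio: float = 0.45) -> list[str]:
    # Replace a fraction of the gold passages with hard negatives so the
    # router sees realistically misleading context at training time.
    n_noise = round(len(gold_docs) * noise_ratio)
    negatives = sample_hard_negatives(ranked_docs, k=n_noise)
    kept = gold_docs[:len(gold_docs) - len(negatives)]
    return kept + negatives
```

Sampling from a mid-rank band (rather than random passages) is what makes the negatives "hard": the router must learn to reject topically related but unusable context.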
Code Example
# LoRA Router Inference Example (Constrained Decoding)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
model_id = "Qwen/Qwen3-1.7B"
lora_path = "./tiny-critic-lora"  # Fine-tuned adapter
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, lora_path)
PASS_TOKEN = tokenizer.encode("pass", add_special_tokens=False)[0]
FAIL_TOKEN = tokenizer.encode("fail", add_special_tokens=False)[0]
def route(query: str, docs: list[str]) -> str:
    context = "\n".join(docs)
    prompt = f"Query: {query}\nContext: {context}\nVerdict:"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Constrained decoding: only pass/fail tokens allowed
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]  # logits at the last position only
    mask = torch.full_like(logits, float("-inf"))
    mask[:, PASS_TOKEN] = 0
    mask[:, FAIL_TOKEN] = 0
    masked_logits = logits + mask
    token_id = masked_logits.argmax(dim=-1).item()
    return "pass" if token_id == PASS_TOKEN else "fail"
# Usage
result = route("Who wrote Hamlet?", ["Shakespeare wrote Othello in 1603."])
# → "fail" → triggers the MCP fallback
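On a fail verdict, the agent loop swaps in a fallback retrieval tool. A minimal orchestration sketch with the MCP call stubbed out (the `call_mcp_tool` helper and the `tavily_search` tool name are assumptions for illustration, not a real MCP client API):

```python
from typing import Callable

def call_mcp_tool(tool: str, **kwargs) -> list[str]:
    # Stub for an MCP client call; a real agent would invoke the
    # registered MCP server (e.g. a Tavily wrapper) here.
    return [f"[{tool}] fallback result for {kwargs.get('query')}"]

def answer_with_fallback(query: str,
                         retrieve: Callable[[str], list[str]],
                         route: Callable[[str, list[str]], str]) -> list[str]:
    docs = retrieve(query)
    if route(query, docs) == "pass":
        return docs  # local retrieval is usable as-is
    # Tiny-Critic said "fail": fetch clean context via the MCP tool instead.
    return call_mcp_tool("tavily_search", query=query)
```

Keeping the router and the fallback tool behind plain callables makes it easy to drop this gate into an existing ReAct loop without touching the generator.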
Original Abstract
Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing. Evaluations on noise-injected datasets demonstrate Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.