DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning
TL;DR Highlight
A 3B model trained with RL to optimize its own search queries achieves roughly 2.6x the recall of the prior supervised SOTA and outperforms GPT-4o and Claude-3.5-Sonnet on 11 of 13 datasets.
Who Should Read
Backend/ML developers wanting to improve retrieval quality in RAG pipelines, or engineers building medical literature search or SQL automation systems.
Core Mechanics
- Optimizes queries using only RL without labeled training data — uses retrieval metrics like Recall and NDCG directly as rewards for trial-and-error learning
- 3B parameter model (Qwen2.5-3B-Instruct based) outperforms GPT-4o and Claude-3.5-Sonnet on literature retrieval and SQL generation
- RL is more effective than SFT — achieves 2.6x the recall of LEADS, an SFT baseline trained with GPT-4o distillation plus human annotation
- <think> reasoning process dramatically improves performance — without reasoning in training, queries get stuck in a local minimum of infinite repetitive expansion
- BM25 + DeepRetrieval matches or beats dense retrieval on some datasets while running 34x faster
- RL also outperforms SFT in SQL generation — surpasses GPT-4o on Spider benchmark using execution accuracy as reward without ground truth SQL
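The trial-and-error loop behind these results can be sketched in a few lines: a generated query is scored purely by a retrieval metric, with no reference query needed. The toy corpus, `toy_retrieve`, and `recall_reward` below are illustrative stand-ins, not the paper's implementation:

```python
# Minimal sketch of DeepRetrieval's reward signal: score a generated query by
# Recall@k against known-relevant documents, with no gold query required.
# The corpus and retriever here are toy stand-ins for BM25 or a search API.

def toy_retrieve(query: str, corpus: dict, k: int) -> list:
    """Rank documents by term overlap with the query (a crude BM25 stand-in)."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_terms & set(kv[1].lower().split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

def recall_reward(query: str, relevant: set, corpus: dict, k: int = 2) -> float:
    """Recall@k used directly as the scalar RL reward."""
    retrieved = toy_retrieve(query, corpus, k)
    return len(set(retrieved) & relevant) / len(relevant)

corpus = {
    "d1": "desmopressin reduces blood transfusion in surgery",
    "d2": "aspirin for cardiovascular prevention",
    "d3": "perioperative desmopressin randomized trial",
}
# An augmented query naming the right terms retrieves both relevant docs.
print(recall_reward("desmopressin perioperative transfusion", {"d1", "d3"}, corpus))
```

During RL training (PPO in the paper), this scalar feeds back into the policy model, so query strategies that raise recall are reinforced without any labeled query data.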
Evidence
- PubMed literature search Recall@3K: DeepRetrieval 65.07% vs prior SOTA (LEADS) 24.68% — 2.6x improvement
- ClinicalTrials.gov clinical trial search Recall@3K: DeepRetrieval 63.18% vs prior SOTA 32.11% — ~2x improvement
- SQL generation (Spider): DeepRetrieval3B-Coder 74.85% vs GPT-4o 73.50%, Claude-3.5-Sonnet 66.05%
- Speed: BM25 352s vs dense retrieval 12,232s (5.42M docs, 13,332 queries) — 34x faster
How to Apply
- For query optimization in RAG pipelines: apply RL training (PPO/GRPO) on a small model, using NDCG or Recall from the actual retriever (BM25/dense) as reward signal — improves retrieval quality without labels.
- For medical literature search / PubMed API integration: structure prompts with PICO format input and generate augmented queries using Boolean operators in <think>...</think><answer>...</answer> format.
- For Text-to-SQL pipelines considering switching from SFT to RL: design reward using execution accuracy alone without ground truth SQL — RL explores more diverse SQL strategies.
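For the Text-to-SQL case, the reward can be computed by executing the predicted SQL and comparing result rows, with no gold SQL string involved. This is a minimal sketch using `sqlite3`; the tier values, table schema, and helper name are illustrative assumptions, not the paper's exact setup:

```python
import sqlite3

def execution_accuracy_reward(pred_sql: str, gold_rows: list,
                              conn: sqlite3.Connection) -> float:
    """Reward from execution results alone: +1 if the predicted query returns
    the reference rows, -1 if it fails to execute, 0 otherwise.
    (Illustrative tier values, not the paper's exact reward table.)"""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return -1.0  # unexecutable SQL is penalized
    return 1.0 if sorted(pred_rows) == sorted(gold_rows) else 0.0

# Toy schema standing in for a Spider-style database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Adele", "UK"), ("Drake", "Canada")])

gold = [("Adele",)]
print(execution_accuracy_reward("SELECT name FROM singer WHERE country = 'UK'", gold, conn))
print(execution_accuracy_reward("SELECT nam FROM singer", gold, conn))
```

Because only execution results are compared, the policy is free to discover SQL formulations that differ syntactically from any reference query, which is where RL's exploration advantage over SFT shows up.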
Code Example
# DeepRetrieval-style query generation prompt example (PubMed literature search)
SYSTEM_PROMPT = """
A conversation between User and Assistant. The Assistant is a clinical specialist
conducting a medical literature review. The task is to create query terms for PubMed.
The Assistant shows thinking in <think></think> tags and returns the final answer
in <answer></answer> tags as JSON.
The query must use Boolean operators (AND, OR) and parentheses for grouping.
"""
USER_TEMPLATE = """
The research is defined by the following PICO:
P: {population}
I: {intervention}
C: {comparison}
O: {outcome}
Please generate an optimized PubMed search query.
"""
# Model output example (after DeepRetrieval training)
# <think>
# The PICO describes desmopressin use in perioperative settings to minimize blood transfusion.
# Key terms: DDAVP (synonym for desmopressin), perioperative, blood transfusion, RCT
# </think>
# <answer>
# {
# "query": "((DDAVP OR desmopressin) AND (perioperative OR surgery) AND (blood transfusion OR allogeneic transfusion) AND (randomized controlled trial))"
# }
# </answer>
# RL reward function example (Recall-based)
def compute_retrieval_reward(recall_at_k: float) -> float:
    """Tiered reward: higher Recall@k earns a larger reward; very low recall is penalized."""
    if recall_at_k >= 0.7:
        return 5.0
    elif recall_at_k >= 0.5:
        return 4.0
    elif recall_at_k >= 0.4:
        return 3.0
    elif recall_at_k >= 0.3:
        return 1.0
    elif recall_at_k >= 0.1:
        return 0.5
    elif recall_at_k >= 0.05:
        return 0.1
    else:
        return -3.5
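To connect the reward tiers to actual retrieval output, Recall@k is first computed from retrieved vs. relevant document IDs and then mapped through the tier table. The snippet below repeats the tiers in compact form so it runs standalone; the document IDs are made up:

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def compute_retrieval_reward(recall: float) -> float:
    """Same tier table as above, repeated so this snippet is self-contained."""
    tiers = [(0.7, 5.0), (0.5, 4.0), (0.4, 3.0), (0.3, 1.0), (0.1, 0.5), (0.05, 0.1)]
    for threshold, reward in tiers:
        if recall >= threshold:
            return reward
    return -3.5

# Hypothetical outcome: 2 of 4 relevant PMIDs appear in the top-4 results.
r = recall_at_k(["p1", "p9", "p3", "p7"], {"p1", "p3", "p5", "p8"}, k=4)
print(r)                            # 0.5
print(compute_retrieval_reward(r))  # 4.0
```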
Original Abstract
Information retrieval systems are crucial for enabling effective access to large document collections. Recent approaches have leveraged Large Language Models (LLMs) to enhance retrieval performance through query augmentation, but often rely on expensive supervised learning or distillation techniques that require significant computational resources and hand-labeled data. We introduce DeepRetrieval, a reinforcement learning (RL) approach that trains LLMs for query generation through trial and error without supervised data (reference query). Using retrieval metrics as rewards, our system generates queries that maximize retrieval performance. DeepRetrieval outperforms leading methods on literature search with 65.07% (vs. previous SOTA 24.68%) recall for publication search and 63.18% (vs. previous SOTA 32.11%) recall for trial search using real-world search engines. DeepRetrieval also dominates in evidence-seeking retrieval, classic information retrieval and SQL database search. With only 3B parameters, it outperforms industry-leading models like GPT-4o and Claude-3.5-Sonnet on 11/13 datasets. These results demonstrate that our RL approach offers a more efficient and effective paradigm for information retrieval. Our data and code are available at: https://github.com/pat-jj/DeepRetrieval.