DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning
TL;DR Highlight
A 3B model trained with RL to optimize its own search queries achieves roughly 2.6x the recall of the prior supervised SOTA and outperforms GPT-4o and Claude-3.5-Sonnet on 11 of 13 datasets.
Who Should Read
Backend/ML developers wanting to improve retrieval quality in RAG pipelines, or engineers building medical literature search or SQL automation systems.
Core Mechanics
- Optimizes queries using only RL without labeled training data — uses retrieval metrics like Recall and NDCG directly as rewards for trial-and-error learning
- 3B parameter model (Qwen2.5-3B-Instruct based) outperforms GPT-4o and Claude-3.5-Sonnet on literature retrieval and SQL generation
- RL is more effective than SFT — achieves 2.6x the recall of LEADS, an SFT baseline trained with GPT-4o distillation plus human annotation
- <think> reasoning process dramatically improves performance — without reasoning in training, queries get stuck in a local minimum of infinite repetitive expansion
- BM25 + DeepRetrieval matches or beats dense retrieval on some datasets while running 34x faster
- RL also outperforms SFT in SQL generation — surpasses GPT-4o on Spider benchmark using execution accuracy as reward without ground truth SQL
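The trial-and-error loop behind these results can be sketched in a few lines: a generated query is scored purely by a retrieval metric, with no reference query needed. The toy corpus, `toy_retrieve`, and `recall_reward` below are illustrative stand-ins, not the paper's implementation:

```python
# Minimal sketch of DeepRetrieval's reward signal: score a generated query by
# Recall@k against known-relevant documents, with no gold query required.
# The corpus and retriever here are toy stand-ins for BM25 or a search API.

def toy_retrieve(query: str, corpus: dict, k: int) -> list:
    """Rank documents by term overlap with the query (a crude BM25 stand-in)."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_terms & set(kv[1].lower().split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

def recall_reward(query: str, relevant: set, corpus: dict, k: int = 2) -> float:
    """Recall@k used directly as the scalar RL reward."""
    retrieved = toy_retrieve(query, corpus, k)
    return len(set(retrieved) & relevant) / len(relevant)

corpus = {
    "d1": "desmopressin reduces blood transfusion in surgery",
    "d2": "aspirin for cardiovascular prevention",
    "d3": "perioperative desmopressin randomized trial",
}
# An augmented query naming the right terms retrieves both relevant docs.
print(recall_reward("desmopressin perioperative transfusion", {"d1", "d3"}, corpus))
```

During RL training (PPO in the paper), this scalar feeds back into the policy model, so query strategies that raise recall are reinforced without any labeled query data.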
Evidence
- PubMed literature search Recall@3K: DeepRetrieval 65.07% vs prior SOTA (LEADS) 24.68% — 2.6x improvement
- ClinicalTrials.gov clinical trial search Recall@3K: DeepRetrieval 63.18% vs prior SOTA 32.11% — ~2x improvement
- SQL generation (Spider): DeepRetrieval3B-Coder 74.85% vs GPT-4o 73.50%, Claude-3.5-Sonnet 66.05%
- Speed: BM25 352s vs dense retrieval 12,232s (5.42M docs, 13,332 queries) — 34x faster
How to Apply
- For query optimization in RAG pipelines: apply RL training (PPO/GRPO) on a small model, using NDCG or Recall from the actual retriever (BM25/dense) as reward signal — improves retrieval quality without labels.
- For medical literature search / PubMed API integration: structure prompts with PICO format input and generate augmented queries using Boolean operators in <think>...</think><answer>...</answer> format.
- For Text-to-SQL pipelines considering switching from SFT to RL: design reward using execution accuracy alone without ground truth SQL — RL explores more diverse SQL strategies.
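For the Text-to-SQL case, the reward can be computed by executing the predicted SQL and comparing result rows, with no gold SQL string involved. This is a minimal sketch using `sqlite3`; the tier values, table schema, and helper name are illustrative assumptions, not the paper's exact setup:

```python
import sqlite3

def execution_accuracy_reward(pred_sql: str, gold_rows: list,
                              conn: sqlite3.Connection) -> float:
    """Reward from execution results alone: +1 if the predicted query returns
    the reference rows, -1 if it fails to execute, 0 otherwise.
    (Illustrative tier values, not the paper's exact reward table.)"""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return -1.0  # unexecutable SQL is penalized
    return 1.0 if sorted(pred_rows) == sorted(gold_rows) else 0.0

# Toy schema standing in for a Spider-style database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Adele", "UK"), ("Drake", "Canada")])

gold = [("Adele",)]
print(execution_accuracy_reward("SELECT name FROM singer WHERE country = 'UK'", gold, conn))
print(execution_accuracy_reward("SELECT nam FROM singer", gold, conn))
```

Because only execution results are compared, the policy is free to discover SQL formulations that differ syntactically from any reference query, which is where RL's exploration advantage over SFT shows up.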
Code Example
# DeepRetrieval-style query generation prompt example (PubMed literature search)
SYSTEM_PROMPT = """
A conversation between User and Assistant. The Assistant is a clinical specialist
conducting a medical literature review. The task is to create query terms for PubMed.
The Assistant shows thinking in <think></think> tags and returns the final answer
in <answer></answer> tags as JSON.
The query must use Boolean operators (AND, OR) and parentheses for grouping.
"""
USER_TEMPLATE = """
The research is defined by the following PICO:
P: {population}
I: {intervention}
C: {comparison}
O: {outcome}
Please generate an optimized PubMed search query.
"""
# Model output example (after DeepRetrieval training)
# <think>
# The PICO describes desmopressin use in perioperative settings to minimize blood transfusion.
# Key terms: DDAVP (synonym for desmopressin), perioperative, blood transfusion, RCT
# </think>
# <answer>
# {
# "query": "((DDAVP OR desmopressin) AND (perioperative OR surgery) AND (blood transfusion OR allogeneic transfusion) AND (randomized controlled trial))"
# }
# </answer>
# RL reward function example (Recall-based)
def compute_retrieval_reward(recall_at_k: float) -> float:
    """Tiered reward: higher Recall@k earns a larger reward; very low recall is penalized."""
    if recall_at_k >= 0.7:
        return 5.0
    elif recall_at_k >= 0.5:
        return 4.0
    elif recall_at_k >= 0.4:
        return 3.0
    elif recall_at_k >= 0.3:
        return 1.0
    elif recall_at_k >= 0.1:
        return 0.5
    elif recall_at_k >= 0.05:
        return 0.1
    else:
        return -3.5
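To connect the reward tiers to actual retrieval output, Recall@k is first computed from retrieved vs. relevant document IDs and then mapped through the tier table. The snippet below repeats the tiers in compact form so it runs standalone; the document IDs are made up:

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def compute_retrieval_reward(recall: float) -> float:
    """Same tier table as above, repeated so this snippet is self-contained."""
    tiers = [(0.7, 5.0), (0.5, 4.0), (0.4, 3.0), (0.3, 1.0), (0.1, 0.5), (0.05, 0.1)]
    for threshold, reward in tiers:
        if recall >= threshold:
            return reward
    return -3.5

# Hypothetical outcome: 2 of 4 relevant PMIDs appear in the top-4 results.
r = recall_at_k(["p1", "p9", "p3", "p7"], {"p1", "p3", "p5", "p8"}, k=4)
print(r)                            # 0.5
print(compute_retrieval_reward(r))  # 4.0
```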
Original Abstract
Information retrieval systems are crucial for enabling effective access to large document collections. Recent approaches have leveraged Large Language Models (LLMs) to enhance retrieval performance through query augmentation, but often rely on expensive supervised learning or distillation techniques that require significant computational resources and hand-labeled data. We introduce DeepRetrieval, a reinforcement learning (RL) approach that trains LLMs for query generation through trial and error without supervised data (reference query). Using retrieval metrics as rewards, our system generates queries that maximize retrieval performance. DeepRetrieval outperforms leading methods on literature search with 65.07% (vs. previous SOTA 24.68%) recall for publication search and 63.18% (vs. previous SOTA 32.11%) recall for trial search using real-world search engines. DeepRetrieval also dominates in evidence-seeking retrieval, classic information retrieval and SQL database search. With only 3B parameters, it outperforms industry-leading models like GPT-4o and Claude-3.5-Sonnet on 11/13 datasets. These results demonstrate that our RL approach offers a more efficient and effective paradigm for information retrieval. Our data and code are available at: https://github.com/pat-jj/DeepRetrieval.