A Survey of Large Language Model Agents for Question Answering
TL;DR Highlight
RAG, tool use, multi-turn conversation, and more: a single survey covering the major design patterns for LLM-based QA agents.
Who Should Read
Backend/AI developers building or operating LLM-based QA systems or RAG pipelines. Especially teams figuring out which techniques to use at each stage — retrieval, planning, and answer generation.
Core Mechanics
- LLM Agent QA can be structured into 5 stages: Planning → Question Understanding → Information Retrieval → Answer Generation → Follow-up Interaction
- Planning splits into prompt-based (ReAct, Active Retriever) and fine-tuning-based (FireAct, Learning from Failure) — training on failure trajectories actually improves performance
- Query Expansion techniques (HyQE, Step-back reformulation) expand user queries to boost retrieval accuracy, and can be plugged into a RAG pipeline with just a few lines of code
- The trend is toward Self-RAG-style approaches where the LLM self-evaluates document usefulness and dynamically decides whether to answer from internal knowledge or external docs
- Prompt compression techniques like LLMLingua effectively reduce token costs by condensing long retrieved documents while preserving key information
- Major open challenges: reducing hallucination, autonomous tool selection/creation, and the lack of benchmarks for evaluating CoT reasoning processes themselves
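The Self-RAG-style self-evaluation described above reduces to a simple gating step: score each retrieved document, keep only the useful ones, and fall back to the model's internal knowledge when nothing survives. Below is an illustrative sketch, not the actual Self-RAG implementation; the `judge`, `answer_with_docs`, and `answer_from_memory` callables are hypothetical stand-ins for LLM calls, and the 0.5 threshold is arbitrary.

```python
from typing import Callable

def self_rag_answer(
    question: str,
    documents: list[str],
    judge: Callable[[str, str], float],            # stand-in for an LLM critic: relevance in [0, 1]
    answer_with_docs: Callable[[str, list[str]], str],
    answer_from_memory: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Self-RAG-style gating: keep only documents the critic judges useful;
    fall back to internal knowledge when none survive."""
    useful = [doc for doc in documents if judge(question, doc) >= threshold]
    if useful:
        return answer_with_docs(question, useful)
    return answer_from_memory(question)

# Toy stubs standing in for real LLM calls
judge = lambda q, d: 1.0 if "Paris" in d else 0.0
grounded = lambda q, docs: f"grounded answer from {len(docs)} doc(s)"
memory = lambda q: "answer from internal knowledge"

docs = ["Paris is the capital of France.", "Bananas are yellow."]
out = self_rag_answer("What is the capital of France?", docs, judge, grounded, memory)
# out == "grounded answer from 1 doc(s)"
```

In a real system the critic and both answer paths would each be separate LLM calls; the point is that the routing decision is ordinary control flow around them.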
Evidence
- Maps which techniques work on which datasets across diverse benchmarks (MMLU 50+ domains, GSM8k, MATH, HotpotQA, FinQA, etc.), categorized by pipeline stage
- Self-RAG evaluates document relevance and each document's contribution to the final answer simultaneously, and reports better filtering of noisy documents than vanilla retrieval
- FireAct fine-tunes smaller models on GPT-4-generated action trajectories, achieving improved planning ability on multi-hop QA tasks
- LLMLingua's coarse-to-fine 2-stage compression drastically reduces prompt length while maintaining model performance (exact numbers in original paper)
How to Apply
- Adding Query Expansion to your RAG pipeline: Instead of searching user queries as-is, add a prompt asking the LLM to 'generate 3 hypothetical documents related to this question (HyQE style)' or 'rephrase this question at a higher abstraction level (Step-back)' to boost retrieval recall.
- Using LLM as a re-ranker after retrieval: Pull top-20 with BM25 or dense retrieval, then have the LLM score each document's relevance (Haystack-style) to improve top-3 quality.
- For QA requiring numerical computation (finance, math): Instead of generating answers directly, switch to a Program-of-Thought (PoT) flow — generate Python code → execute → feed results into the answer to reduce calculation errors.
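The Program-of-Thought flow in the last bullet boils down to: ask the model for code instead of a final number, execute that code, and feed the result back into the answer. A minimal sketch of the execute-and-feed step, assuming the model-generated snippet assigns its result to a variable named `answer` (that convention, and the sample `generated_code`, are illustrative, not from the paper):

```python
def run_pot_snippet(generated_code: str) -> float:
    """Execute model-generated calculation code in a bare namespace and
    return the value it binds to `answer` (PoT convention assumed here)."""
    namespace: dict = {}
    # Empty __builtins__ is a light guardrail for the sketch, not a real sandbox
    exec(generated_code, {"__builtins__": {}}, namespace)
    return namespace["answer"]

# A snippet as an LLM might produce it for a compound-interest question
generated_code = """
principal = 1000
rate = 0.05
years = 3
answer = principal * (1 + rate) ** years
"""

result = run_pot_snippet(generated_code)   # 1000 * 1.05**3
final_answer = f"After 3 years the balance is {result:.2f}."
```

Production use would need real sandboxing (subprocess, timeout, resource limits); the gain is that arithmetic is done by the interpreter rather than by token prediction.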
Code Example
# Step-back Query Reformulation example (directly applicable to a RAG pipeline)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stepback_query(original_query: str) -> str:
    """Step-back: rewrite a specific question into a broader conceptual one."""
    prompt = f"""You are an expert at improving search queries.
Please rewrite the specific question below into a more general and broader conceptual question.
This will allow retrieval of more relevant documents.
Original question: {original_query}
More general question (output only one question):"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def hyqe_expansion(original_query: str, n: int = 3) -> list[str]:
    """HyQE: generate hypothetical documents for query expansion."""
    prompt = f"""Please write {n} hypothetical document paragraphs that could answer the following question.
Write each paragraph as if it were a real document, and number them for distinction.
Question: {original_query}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    raw = response.choices[0].message.content
    # Embed the hypothetical documents as vectors and use them for actual retrieval
    return [doc.strip() for doc in raw.split("\n\n") if doc.strip()]

# ReAct-style planning prompt template
REACT_PROMPT = """
You are an agent that thinks and acts step by step to answer questions.
Available tools:
- search(query): Search for information from the web/DB
- calculate(expression): Evaluate a mathematical expression
- finish(answer): Submit the final answer
Format:
Thought: Analyze the current situation and decide the next action
Action: tool_name(argument)
Observation: Result of the tool execution
(Repeat Thought/Action/Observation)
Thought: Ready to provide the final answer
Action: finish(final answer)
Question: {question}
Thought:"""

# Usage example
if __name__ == "__main__":
    query = "What school did Estella Leopold attend between August and November 1954?"

    stepback = stepback_query(query)
    print(f"Step-back query: {stepback}")
    # Example output: "What is Estella Leopold's educational history?"

    hypothetical_docs = hyqe_expansion(query)
    print(f"Number of hypothetical documents: {len(hypothetical_docs)}")
    # Embed these documents and use them for the actual document retrieval
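A prompt template like REACT_PROMPT still needs a driver loop that parses each Action line, runs the named tool, and feeds the Observation back into the next turn. Here is a minimal sketch of that loop; the scripted `llm` and the toy `calculate` tool are stand-ins for a real chat call and real tools, and the `Action: tool(arg)` parsing is deliberately simplistic.

```python
import re

def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    """Drive a ReAct loop: parse `Action: tool(arg)` from each model turn,
    run the tool, append the Observation, and stop at finish()."""
    transcript = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step
        match = re.search(r"Action:\s*(\w+)\((.*)\)", step)
        if match is None:
            continue  # no parseable action; let the model think again
        name, arg = match.group(1), match.group(2)
        if name == "finish":
            return arg
        observation = tools[name](arg)
        transcript += f"\nObservation: {observation}\nThought:"
    return "No answer within the step budget"

# Scripted model turns standing in for real chat completions
turns = iter([
    " I need to compute this.\nAction: calculate(2 + 3 * 4)",
    " The result is 14.\nAction: finish(14)",
])
llm = lambda prompt: next(turns)
tools = {"calculate": lambda expr: str(eval(expr))}  # eval is acceptable for this toy only

answer = react_loop("What is 2 + 3 * 4?", llm, tools)
# answer == "14"
```

A production loop would also handle unknown tools, malformed actions, and nested parentheses in arguments, which this regex does not.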
Original Abstract
This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.