A Survey of Large Language Model Agents for Question Answering
TL;DR Highlight
RAG, tool use, multi-turn conversation, and more: a single survey covering the major design patterns for LLM-based QA agents.
Who Should Read
Backend/AI developers building or operating LLM-based QA systems or RAG pipelines. Especially teams figuring out which techniques to use at each stage — retrieval, planning, and answer generation.
Core Mechanics
- LLM Agent QA can be structured into 5 stages: Planning → Question Understanding → Information Retrieval → Answer Generation → Follow-up Interaction
- Planning splits into prompt-based (ReAct, Active Retriever) and fine-tuning-based (FireAct, Learning from Failure) — training on failure trajectories actually improves performance
- Query Expansion techniques (HyQE, Step-back reformulation) expand user queries to boost retrieval accuracy, and can be plugged into a RAG pipeline with just a few lines of code
- The trend is toward Self-RAG-style approaches where the LLM self-evaluates document usefulness and dynamically decides whether to answer from internal knowledge or external docs
- Prompt compression techniques like LLMLingua effectively reduce token costs by condensing long retrieved documents while preserving key information
- Major open challenges: reducing hallucination, autonomous tool selection/creation, and the lack of benchmarks for evaluating CoT reasoning processes themselves
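The Self-RAG-style self-evaluation described above reduces to a simple gating step: score each retrieved document, keep only the useful ones, and fall back to the model's internal knowledge when nothing survives. Below is an illustrative sketch, not the actual Self-RAG implementation; the `judge`, `answer_with_docs`, and `answer_from_memory` callables are hypothetical stand-ins for LLM calls, and the 0.5 threshold is arbitrary.

```python
from typing import Callable

def self_rag_answer(
    question: str,
    documents: list[str],
    judge: Callable[[str, str], float],            # stand-in for an LLM critic: relevance in [0, 1]
    answer_with_docs: Callable[[str, list[str]], str],
    answer_from_memory: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Self-RAG-style gating: keep only documents the critic judges useful;
    fall back to internal knowledge when none survive."""
    useful = [doc for doc in documents if judge(question, doc) >= threshold]
    if useful:
        return answer_with_docs(question, useful)
    return answer_from_memory(question)

# Toy stubs standing in for real LLM calls
judge = lambda q, d: 1.0 if "Paris" in d else 0.0
grounded = lambda q, docs: f"grounded answer from {len(docs)} doc(s)"
memory = lambda q: "answer from internal knowledge"

docs = ["Paris is the capital of France.", "Bananas are yellow."]
out = self_rag_answer("What is the capital of France?", docs, judge, grounded, memory)
# out == "grounded answer from 1 doc(s)"
```

In a real system the critic and both answer paths would each be separate LLM calls; the point is that the routing decision is ordinary control flow around them.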
Evidence
- Maps which techniques work on which datasets across diverse benchmarks (MMLU 50+ domains, GSM8k, MATH, HotpotQA, FinQA, etc.), categorized by pipeline stage
- Self-RAG evaluates document relevance and each document's contribution to the final answer simultaneously, and reports better filtering of noisy documents than vanilla retrieval
- FireAct fine-tunes smaller models on GPT-4-generated action trajectories, achieving improved planning ability on multi-hop QA tasks
- LLMLingua's coarse-to-fine 2-stage compression drastically reduces prompt length while maintaining model performance (exact numbers in original paper)
How to Apply
- Adding Query Expansion to your RAG pipeline: Instead of searching user queries as-is, add a prompt asking the LLM to 'generate 3 hypothetical documents related to this question (HyQE style)' or 'rephrase this question at a higher abstraction level (Step-back)' to boost retrieval recall.
- Using LLM as a re-ranker after retrieval: Pull top-20 with BM25 or dense retrieval, then have the LLM score each document's relevance (Haystack-style) to improve top-3 quality.
- For QA requiring numerical computation (finance, math): Instead of generating answers directly, switch to a Program-of-Thought (PoT) flow — generate Python code → execute → feed results into the answer to reduce calculation errors.
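The Program-of-Thought flow in the last bullet boils down to: ask the model for code instead of a final number, execute that code, and feed the result back into the answer. A minimal sketch of the execute-and-feed step, assuming the model-generated snippet assigns its result to a variable named `answer` (that convention, and the sample `generated_code`, are illustrative, not from the paper):

```python
def run_pot_snippet(generated_code: str) -> float:
    """Execute model-generated calculation code in a bare namespace and
    return the value it binds to `answer` (PoT convention assumed here)."""
    namespace: dict = {}
    # Empty __builtins__ is a light guardrail for the sketch, not a real sandbox
    exec(generated_code, {"__builtins__": {}}, namespace)
    return namespace["answer"]

# A snippet as an LLM might produce it for a compound-interest question
generated_code = """
principal = 1000
rate = 0.05
years = 3
answer = principal * (1 + rate) ** years
"""

result = run_pot_snippet(generated_code)   # 1000 * 1.05**3
final_answer = f"After 3 years the balance is {result:.2f}."
```

Production use would need real sandboxing (subprocess, timeout, resource limits); the gain is that arithmetic is done by the interpreter rather than by token prediction.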
Code Example
# Step-back Query Reformulation example (directly applicable to a RAG pipeline)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stepback_query(original_query: str) -> str:
    """Step-back: rewrite a specific question into a broader conceptual one."""
    prompt = f"""You are an expert at improving search queries.
Please rewrite the specific question below into a more general and broader conceptual question.
This will allow retrieval of more relevant documents.
Original question: {original_query}
More general question (output only one question):"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def hyqe_expansion(original_query: str, n: int = 3) -> list[str]:
    """HyQE: generate hypothetical documents for query expansion."""
    prompt = f"""Please write {n} hypothetical document paragraphs that could answer the following question.
Write each paragraph as if it were a real document, and number them for distinction.
Question: {original_query}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    raw = response.choices[0].message.content
    # Embed the hypothetical documents as vectors and use them for actual retrieval
    return [doc.strip() for doc in raw.split("\n\n") if doc.strip()]

# ReAct-style planning prompt template
REACT_PROMPT = """
You are an agent that thinks and acts step by step to answer questions.
Available tools:
- search(query): Search for information from the web/DB
- calculate(expression): Evaluate a mathematical expression
- finish(answer): Submit the final answer
Format:
Thought: Analyze the current situation and decide the next action
Action: tool_name(argument)
Observation: Result of the tool execution
(Repeat Thought/Action/Observation)
Thought: Ready to provide the final answer
Action: finish(final answer)
Question: {question}
Thought:"""

# Usage example
if __name__ == "__main__":
    query = "What school did Estella Leopold attend between August and November 1954?"

    stepback = stepback_query(query)
    print(f"Step-back query: {stepback}")
    # Example output: "What is Estella Leopold's educational history?"

    hypothetical_docs = hyqe_expansion(query)
    print(f"Number of hypothetical documents: {len(hypothetical_docs)}")
    # Embed these documents and use them for the actual document retrieval
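A prompt template like REACT_PROMPT still needs a driver loop that parses each Action line, runs the named tool, and feeds the Observation back into the next turn. Here is a minimal sketch of that loop; the scripted `llm` and the toy `calculate` tool are stand-ins for a real chat call and real tools, and the `Action: tool(arg)` parsing is deliberately simplistic.

```python
import re

def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    """Drive a ReAct loop: parse `Action: tool(arg)` from each model turn,
    run the tool, append the Observation, and stop at finish()."""
    transcript = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step
        match = re.search(r"Action:\s*(\w+)\((.*)\)", step)
        if match is None:
            continue  # no parseable action; let the model think again
        name, arg = match.group(1), match.group(2)
        if name == "finish":
            return arg
        observation = tools[name](arg)
        transcript += f"\nObservation: {observation}\nThought:"
    return "No answer within the step budget"

# Scripted model turns standing in for real chat completions
turns = iter([
    " I need to compute this.\nAction: calculate(2 + 3 * 4)",
    " The result is 14.\nAction: finish(14)",
])
llm = lambda prompt: next(turns)
tools = {"calculate": lambda expr: str(eval(expr))}  # eval is acceptable for this toy only

answer = react_loop("What is 2 + 3 * 4?", llm, tools)
# answer == "14"
```

A production loop would also handle unknown tools, malformed actions, and nested parentheses in arguments, which this regex does not.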
Original Abstract
This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.