The rise and potential of large language model based agents: a survey
TL;DR Highlight
A comprehensive survey condensing LLM-based AI agent architecture, capabilities, applications, and limitations into one paper.
Who Should Read
Researchers and engineers building or evaluating AI agent systems who need a systematic overview of the current agent landscape.
Core Mechanics
- LLM-based agents consist of four core components: Planning (task decomposition), Memory (short/long-term), Action (tool use, code execution), and Perception (multimodal input)
- Current agents excel at: code generation and debugging, information retrieval and synthesis, structured task execution with clear success criteria
- Current agents struggle with: long-horizon planning, causal reasoning, novel tool composition, and graceful failure handling
- Multi-agent systems (multiple specialized agents collaborating) consistently outperform single-agent systems on complex tasks — but coordination overhead is significant
- Trust and safety are the critical open problems: agents that can take real-world actions (web browsing, code execution, API calls) require robust sandboxing and permission management
- The paper provides a unified taxonomy of agent architectures (ReAct, Reflexion, AutoGPT-style, etc.) and their tradeoffs
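The four-component decomposition above can be sketched as a minimal agent skeleton. All class and method names here are illustrative assumptions, not from the paper, and the Planner stub stands in for an LLM call; Perception is omitted for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    short_term: list = field(default_factory=list)   # recent steps / scratchpad
    long_term: dict = field(default_factory=dict)    # persistent key-value store

    def remember(self, item):
        self.short_term.append(item)

class Planner:
    def decompose(self, task: str) -> list[str]:
        # A real agent would prompt an LLM here; this stub splits on sentences.
        return [s.strip() for s in task.split(".") if s.strip()]

class ActionSpace:
    def __init__(self, tools: dict):
        self.tools = tools  # tool name -> callable

    def act(self, name: str, arg: str) -> str:
        return self.tools[name](arg)

class Agent:
    def __init__(self, planner, memory, actions):
        self.planner, self.memory, self.actions = planner, memory, actions

    def run(self, task: str) -> list[str]:
        results = []
        for step in self.planner.decompose(task):
            out = self.actions.act("echo", step)     # Perception omitted for brevity
            self.memory.remember((step, out))
            results.append(out)
        return results

agent = Agent(Planner(), Memory(), ActionSpace({"echo": lambda s: f"done: {s}"}))
print(agent.run("Fetch the data. Summarize it"))
# → ['done: Fetch the data', 'done: Summarize it']
```

Designing each component as a separate object, as the "How to Apply" section suggests, lets you swap the Planner or Memory implementation without touching the rest of the loop.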
Evidence
- Comprehensive survey of 200+ agent papers with capability categorization and benchmark comparison
- Multi-agent vs. single-agent: on complex coding tasks (SWE-bench), multi-agent achieves 45% vs. 28% single-agent resolution rate
- Identified 12 distinct agent failure modes with frequency analysis from production agent deployments
How to Apply
- Use this paper's taxonomy to select your agent architecture: ReAct for tool-heavy tasks, Reflexion for tasks with clear success criteria and iteration potential, tree-of-thought for complex planning.
- For production agents: implement the 4-component framework explicitly — design your memory system, action space, and planning module separately before integrating.
- Prioritize sandboxing and permission management before capability expansion — agent safety failures are harder to recover from than capability gaps.
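The permission-management advice can be made concrete with a small gate that checks every tool call against an allowlist before execution and keeps an audit log. The policy format and all names here are assumptions for illustration, not an API from the paper.

```python
class PermissionGate:
    """Deny-by-default allowlist wrapper around agent tool calls (illustrative)."""

    def __init__(self, policy: dict[str, bool]):
        self.policy = policy      # tool name -> allowed?
        self.audit_log = []       # record of every attempted call

    def call(self, name: str, fn, arg: str) -> str:
        allowed = self.policy.get(name, False)   # unknown tools are denied
        self.audit_log.append((name, arg, allowed))
        if not allowed:
            # Return a denial string instead of raising, so the agent loop
            # sees the refusal as an Observation and can replan.
            return f"DENIED: tool '{name}' is not permitted"
        return fn(arg)

gate = PermissionGate({"Calculator": True, "CodeExecutor": False})
print(gate.call("Calculator", lambda x: str(eval(x)), "2 + 3"))    # → 5
print(gate.call("CodeExecutor", lambda c: "ran", "print('hi')"))   # → DENIED: ...
```

Gating at the tool boundary is cheaper to retrofit than sandboxing the model itself, and the audit log gives you the failure-mode data the Evidence section describes.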
Code Example
# ReAct-pattern agent prompt example (core pattern covered in the survey)
SYSTEM_PROMPT = """
You are an agent. For each step, follow this format:
Thought: [Analyze the current situation and plan the next action]
Action: [Name of the tool to use]
Action Input: [Input to pass to the tool]
Observation: [Tool execution result — filled in by the system]
Repeat the cycle above until you know the final answer, then respond with:
Final Answer: [Final answer]
"""

# Simple implementation with LangChain (assumes a legacy LangChain release;
# initialize_agent is deprecated in newer versions in favor of LangGraph)
from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)

# search_fn, calc_fn, and exec_fn are user-supplied callables (str -> str)
tools = [
    Tool(name="Search", func=search_fn, description="Use when an internet search is needed"),
    Tool(name="Calculator", func=calc_fn, description="Use when a mathematical calculation is needed"),
    Tool(name="CodeExecutor", func=exec_fn, description="Use when Python code execution is needed"),
]

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)
result = agent.run(
    "Research the number of AI agent-related papers in 2024 and "
    "calculate the growth rate compared to the previous year"
)
Terminology
ReAct: Reasoning + Acting. An agent framework alternating between reasoning traces and tool-use actions.
Reflexion: An agent framework where the agent reflects on its failures and explicitly updates its approach — enables iterative improvement without weight updates.
SWE-bench: Software Engineering Benchmark. Evaluates agents on real GitHub issues requiring code changes.
Multi-Agent System: Multiple AI agents collaborating on a task, each with specialized roles — analogous to a software engineering team.
Long-Horizon Planning: Planning and executing tasks that require many sequential steps over extended time — a current weakness of LLM agents.