The Rise and Potential of Large Language Model Based Agents: A Survey

TL;DR Highlight

A comprehensive survey condensing LLM-based AI agent architecture, capabilities, applications, and limitations into one paper.

Who Should Read

Researchers and engineers building or evaluating AI agent systems who need a systematic overview of the current agent landscape.

Core Mechanics

LLM-based agents consist of four core components: Planning (task decomposition), Memory (short- and long-term), Action (tool use, code execution), and Perception (multimodal input).
Current agents excel at code generation and debugging, information retrieval and synthesis, and structured task execution with clear success criteria.
Current agents struggle with long-horizon planning, causal reasoning, novel tool composition, and graceful failure handling.
Multi-agent systems (multiple specialized agents collaborating) consistently outperform single-agent systems on complex tasks, but coordination overhead is significant.
Trust and safety are the critical open problems: agents that can take real-world actions (web browsing, code execution, API calls) require robust sandboxing and permission management.
The paper provides a unified taxonomy of agent architectures (ReAct, Reflexion, AutoGPT-style, etc.) and their tradeoffs.
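
The four components listed above can be wired together in a few lines. What follows is a minimal sketch, not the paper's implementation: the class name `MiniAgent`, the tool registry, and the rule-based `plan` method (standing in for an LLM call) are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MiniAgent:
    """Hypothetical skeleton: one method per component named in the text."""
    tools: Dict[str, Callable[[str], str]]           # Action space
    memory: List[str] = field(default_factory=list)  # short-term memory

    def perceive(self, raw_input: str) -> str:
        # Perception: normalize the incoming observation (text only here).
        return raw_input.strip()

    def plan(self, goal: str) -> List[Tuple[str, str]]:
        # Planning: decompose the goal into (tool, argument) steps.
        # A real agent would ask an LLM; this stand-in splits on " then ".
        steps = []
        for part in goal.split(" then "):
            tool, _, arg = part.partition(" ")
            steps.append((tool, arg))
        return steps

    def act(self, tool: str, arg: str) -> str:
        # Action: dispatch to a registered tool, failing softly if unknown.
        if tool not in self.tools:
            return f"error: unknown tool {tool!r}"
        return self.tools[tool](arg)

    def run(self, raw_input: str) -> List[str]:
        goal = self.perceive(raw_input)
        results = []
        for tool, arg in self.plan(goal):
            result = self.act(tool, arg)
            self.memory.append(f"{tool}({arg}) -> {result}")  # Memory write
            results.append(result)
        return results


agent = MiniAgent(tools={"upper": str.upper, "reverse": lambda s: s[::-1]})
outputs = agent.run("  upper hello then reverse abc then fly away ")
```

Keeping each component behind its own method makes it possible to swap the planner or the memory store without touching the rest of the loop.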

Evidence

Comprehensive survey of 200+ agent papers, with capability categorization and benchmark comparison.
Multi-agent vs. single-agent: on complex coding tasks (SWE-bench), multi-agent systems reach a 45% resolution rate versus 28% for single-agent systems.
Identifies 12 distinct agent failure modes, with frequency analysis from production agent deployments.

How to Apply

Use this paper's taxonomy to select your agent architecture: ReAct for tool-heavy tasks, Reflexion for tasks with clear success criteria and iteration potential, tree-of-thought for complex planning.
For production agents, implement the four-component framework explicitly: design your memory system, action space, and planning module separately before integrating them.
Prioritize sandboxing and permission management before capability expansion, because agent safety failures are harder to recover from than capability gaps.
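
One way to put permission management ahead of capability expansion is to route every tool call through a deny-by-default allowlist. The sketch below is illustrative only: `PermissionGate`, `PermissionDenied`, and the example tools are hypothetical names, and the gate is not a sandbox in itself (real isolation would also need process or container boundaries).

```python
from typing import Callable, Dict, List, Set


class PermissionDenied(Exception):
    """Raised when the agent requests a tool it has not been granted."""


class PermissionGate:
    """Hypothetical wrapper: deny-by-default dispatch with an audit log."""

    def __init__(self, tools: Dict[str, Callable[[str], str]], allowed: Set[str]):
        self._tools = tools
        self._allowed = set(allowed)
        self.audit_log: List[str] = []

    def call(self, tool: str, arg: str) -> str:
        # Refuse anything not explicitly granted, and record the decision.
        if tool not in self._allowed or tool not in self._tools:
            self.audit_log.append(f"DENY {tool}")
            raise PermissionDenied(tool)
        self.audit_log.append(f"ALLOW {tool}")
        return self._tools[tool](arg)


tools = {
    "read_file": lambda path: f"<contents of {path}>",  # stand-in, no real I/O
    "delete_file": lambda path: f"deleted {path}",      # stand-in, no real I/O
}
gate = PermissionGate(tools, allowed={"read_file"})

read_result = gate.call("read_file", "notes.txt")
try:
    gate.call("delete_file", "notes.txt")
    denied = False
except PermissionDenied:
    denied = True
```

New capabilities are then added by widening `allowed`, which keeps every grant explicit and leaves a log entry for each decision.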

Code Example

# ReAct-style agent prompt example (a core pattern covered in the paper)
SYSTEM_PROMPT = """
You are an agent. For each step, follow this format:

Thought: [Analyze current situation and plan next action]
Action: [Tool name to use]
Action Input: [Input value to pass to the tool]
Observation: [Tool execution result — filled in by the system]

Repeat the above cycle until you know the final answer:
Final Answer: [Final answer]
"""

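# Simple implementation with LangChain. Assumes the legacy agent API
# (langchain 0.1 to 0.3), where initialize_agent still exists but is deprecated.
# The built-in ZERO_SHOT_REACT_DESCRIPTION agent ships its own ReAct prompt,
# so SYSTEM_PROMPT above only illustrates the format and is not passed in.
from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool
from langchain_openai import ChatOpenAI  # needs langchain-openai and OPENAI_API_KEY

llm = ChatOpenAI(model="gpt-4", temperature=0)

# Placeholder tool functions (hypothetical): replace the bodies with real
# backends. Each receives the agent's string input and returns a string.
def search_fn(query: str) -> str:
    return f"(no search backend configured for: {query})"

def calc_fn(expression: str) -> str:
    return f"(no calculator backend configured for: {expression})"

def exec_fn(code: str) -> str:
    # Run untrusted code only inside a sandbox; this stub executes nothing.
    return "(code execution disabled in this example)"

tools = [
    Tool(name="Search", func=search_fn, description="When internet search is needed"),
    Tool(name="Calculator", func=calc_fn, description="When mathematical calculation is needed"),
    Tool(name="CodeExecutor", func=exec_fn, description="When Python code execution is needed"),
]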

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

result = agent.run("Research the number of AI agent-related papers in 2024 and calculate the growth rate compared to the previous year")
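
To make the Thought / Action / Observation cycle concrete without any framework, the loop can also be written by hand. This is a minimal sketch under stated assumptions: `make_scripted_llm` is a hypothetical stand-in that replays canned model outputs, and `react_loop`, the regular expressions, and the toy `Calculator` tool are illustrative rather than taken from the paper or from LangChain.

```python
import re
from typing import Callable, Dict, List

ACTION_RE = re.compile(r"Action: (.+)\nAction Input: (.+)")
FINAL_RE = re.compile(r"Final Answer: (.+)")


def react_loop(llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               question: str,
               max_steps: int = 5) -> str:
    """Alternate model output and tool observations until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(transcript)
        transcript += output + "\n"
        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(output)
        if action is None:
            transcript += "Observation: could not parse an action\n"
            continue
        name, arg = action.group(1).strip(), action.group(2).strip()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        # The observation is appended by the system, never by the model.
        transcript += f"Observation: {observation}\n"
    return "stopped: step limit reached"


def make_scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: returns canned replies in order."""
    remaining = iter(replies)
    return lambda _prompt: next(remaining)


def add_tool(expr: str) -> str:
    # Toy calculator: sums "a+b+..." without using eval.
    return str(sum(int(part) for part in expr.split("+")))


llm = make_scripted_llm([
    "Thought: I need to add the numbers.\nAction: Calculator\nAction Input: 2+3",
    "Thought: I have the result.\nFinal Answer: 5",
])
answer = react_loop(llm, {"Calculator": add_tool}, "What is 2 + 3?")
```

The `max_steps` cap is a simple guard against the long-horizon drift noted earlier: the loop stops rather than running indefinitely when the model never emits a final answer.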

Terminology

ReAct: Reasoning + Acting. An agent framework that alternates between reasoning traces and tool-use actions.

Reflexion: An agent framework in which the agent reflects on its failures and explicitly updates its approach, enabling iterative improvement without weight updates.

SWE-bench: Software Engineering Benchmark. Evaluates agents on real GitHub issues that require code changes.

Multi-Agent System: Multiple AI agents collaborating on a task, each with a specialized role, analogous to a software engineering team.

Long-Horizon Planning: Planning and executing tasks that require many sequential steps over an extended time, a current weakness of LLM agents.
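
The Reflexion entry above describes a retry loop in which verbal feedback, rather than weight updates, carries lessons from one attempt to the next. A minimal sketch of that idea follows, assuming a hypothetical `toy_attempt` function in place of an LLM and a simple rule-based evaluator; none of these names come from the Reflexion authors' code.

```python
from typing import Callable, List, Optional, Tuple


def reflexion_loop(attempt: Callable[[str, List[str]], str],
                   evaluate: Callable[[str], Optional[str]],
                   task: str,
                   max_trials: int = 3) -> Tuple[Optional[str], List[str]]:
    """Retry a task, feeding accumulated reflections into each new attempt.

    `evaluate` returns None on success, or a short failure description.
    """
    reflections: List[str] = []
    for trial in range(1, max_trials + 1):
        candidate = attempt(task, reflections)
        failure = evaluate(candidate)
        if failure is None:
            return candidate, reflections
        # Reflection step: store a verbal lesson instead of updating weights.
        reflections.append(f"Trial {trial} failed: {failure}")
    return None, reflections


def toy_attempt(task: str, reflections: List[str]) -> str:
    # Hypothetical stand-in for an LLM: fixes its answer once it has feedback.
    return "hello world" if not reflections else "Hello world."


def toy_evaluate(candidate: str) -> Optional[str]:
    if not candidate[0].isupper():
        return "sentence must start with a capital letter"
    return None


result, lessons = reflexion_loop(toy_attempt, toy_evaluate, "write a greeting")
```

This only works when success can be checked automatically, which is why the pattern suits tasks with clear success criteria.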