A survey on large language model based autonomous agents
TL;DR Highlight
A comprehensive survey systematically covering LLM-based autonomous agent architecture design, applications, and evaluation.
Who Should Read
Backend/AI developers designing LLM agent systems for the first time or introducing agent patterns to existing pipelines. Engineers wanting to understand the internals of frameworks like AutoGPT or LangChain more deeply.
Core Mechanics
- Unifies agent architecture into 4 modules: Profile (role setting) → Memory → Planning → Action — most existing research fits this framework
- Hybrid Memory combining short-term (context window) and long-term (vector DB) is most effective; memory retrieval is formalized as a weighted sum of 3 criteria: recency + relevance + importance
- Planning divides into feedback-free (CoT, ToT, etc.) and feedback-based (ReAct, Reflexion, etc.) — feedback from environment/humans/models is far more powerful for complex tasks
- Capability acquisition strategies: fine-tuning (open-source like LLaMA) vs non-fine-tuning (prompt engineering + mechanism engineering) — closed models can only use the latter
- Applications span social simulation, chemistry lab automation (ChemCrow), software development (ChatDev, MetaGPT), to robotics (SayCan, Voyager)
- Key challenges identified: hallucination, prompt vulnerability, knowledge boundary issues (the model knowing more than its simulated role realistically should, which makes simulations inaccurate), and reasoning efficiency
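The Profile → Memory → Planning → Action decomposition above can be sketched as a minimal agent skeleton. The survey describes these modules conceptually, not as a concrete API, so every class and method name here is illustrative:

```python
# Minimal sketch of the survey's 4-module agent architecture.
# All names are illustrative stand-ins, not from the paper.

class Agent:
    def __init__(self, profile):
        self.profile = profile     # Profile: role/persona injected into prompts
        self.memory = []           # Memory: record of past actions and outcomes

    def plan(self, task):
        # Planning: decompose the task into steps (a trivial stand-in for
        # CoT/ToT-style planning, which an LLM would do in practice).
        return [f"step {i + 1} of {task}" for i in range(3)]

    def act(self, step):
        # Action: execute one step via a tool or environment call.
        result = f"executed: {step}"
        self.memory.append(result)  # write the outcome back to memory
        return result

    def run(self, task):
        return [self.act(step) for step in self.plan(task)]

agent = Agent(profile="helpful research assistant")
results = agent.run("summarize paper")
```

In a real system, `plan` and `act` would each wrap an LLM call, and `memory` would be the hybrid short/long-term store described above.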
Evidence
- ToolBench dataset: collected 16,464 real APIs from 49 categories on RapidAPI Hub, fine-tuning LLaMA on this significantly improves tool use ability
- WebShop benchmark: evaluates agent's product search/purchase ability using 1.18M real Amazon products, quantitatively measuring LLM agent performance vs 13 human workers
- AgentBench: the first systematic framework for evaluating LLMs as agents across diverse real-world environments, providing performance comparisons across multiple domains
- MIND2WEB: collected 2,000+ open-ended tasks from 137 real websites across 31 domains for web agent fine-tuning
How to Apply
- When adding a memory module to an agent: relying on the context window alone (short-term memory) eventually overflows it, so store important information as embeddings in a vector DB and apply Hybrid Memory retrieval with the recency + relevance + importance weighted sum
- For complex multi-step task automation: the ReAct pattern (think→act→observe loop) or Reflexion (retry with verbal feedback after failure) recovers from failures more easily than generating the entire plan upfront (CoT-style planning)
- For role-based multi-agent systems: separate PM, architect, developer roles like ChatDev/MetaGPT and clearly define each agent's profile in the system prompt to improve collaboration quality and output consistency
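For the role-separation pattern, each agent's profile can be pinned in its system prompt. The roles below mirror the ChatDev/MetaGPT split, but the prompt wording and helper function are illustrative, not taken from either project:

```python
# Illustrative role profiles for a ChatDev/MetaGPT-style pipeline.
ROLE_PROFILES = {
    "pm": "You are a product manager. Turn the user request into a concise requirements list.",
    "architect": "You are a software architect. Design module boundaries and interfaces from the requirements.",
    "developer": "You are a developer. Implement the design exactly as specified, one file at a time.",
}

def build_messages(role, task):
    """Compose a chat request for one agent role (OpenAI-style message format)."""
    return [
        {"role": "system", "content": ROLE_PROFILES[role]},
        {"role": "user", "content": task},
    ]

msgs = build_messages("architect", "Build a URL shortener")
```

Chaining the roles (PM output → architect input → developer input) reproduces the pipeline structure these frameworks use.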
Code Example
# ReAct pattern-based agent prompt example
SYSTEM_PROMPT = """
You are an autonomous agent. Follow the Thought-Action-Observation loop.
Format:
Thought: [reasoning about what to do next]
Action: [tool_name(param1, param2)]
Observation: [result from tool]
... (repeat as needed)
Final Answer: [your final response]
Available tools:
- search(query): Search the web
- calculator(expression): Evaluate math
- memory_read(query): Retrieve from memory
- memory_write(content): Store to memory
"""
# Hybrid Memory retrieval scoring example
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_memory(query_embedding, memory_item, current_time):
    """
    Paper formula: m* = argmax_m (alpha * s_rec + beta * s_rel + gamma * s_imp)
    """
    alpha, beta, gamma = 0.3, 0.5, 0.2  # weights are adjustable
    # Recency: decays as the memory ages
    time_diff = current_time - memory_item['timestamp']
    s_recency = 1.0 / (1.0 + time_diff.total_seconds() / 3600)
    # Relevance: cosine similarity between query and memory embeddings
    s_relevance = cosine_similarity(query_embedding, memory_item['embedding'])
    # Importance: pre-rated by an LLM on a 1-10 scale and stored with the memory
    s_importance = memory_item['importance'] / 10.0
    return alpha * s_recency + beta * s_relevance + gamma * s_importance
Original Abstract
Autonomous agents have long been a research focus in academic and industry communities. Previous research often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, making it hard for the agents to achieve human-like decisions. Recently, through the acquisition of vast amounts of Web knowledge, large language models (LLMs) have shown potential in human-level intelligence, leading to a surge in research on LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of LLM-based autonomous agents from a holistic perspective. We first discuss the construction of LLM-based autonomous agents, proposing a unified framework that encompasses much of previous work. Then, we present an overview of the diverse applications of LLM-based autonomous agents in social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field.