A survey on large language model based autonomous agents
TL;DR Highlight
A comprehensive survey systematically covering LLM-based autonomous agent architecture design, applications, and evaluation.
Who Should Read
Backend/AI developers designing LLM agent systems for the first time or introducing agent patterns to existing pipelines. Engineers wanting to understand the internals of frameworks like AutoGPT or LangChain more deeply.
Core Mechanics
- Unifies agent architecture into 4 modules: Profile (role setting) → Memory → Planning → Action — most existing research fits this framework
- Hybrid Memory combining short-term (context window) and long-term (vector DB) is most effective; memory retrieval is formalized as a weighted sum of 3 criteria: recency + relevance + importance
- Planning divides into feedback-free (CoT, ToT, etc.) and feedback-based (ReAct, Reflexion, etc.) — feedback from environment/humans/models is far more powerful for complex tasks
- Capability acquisition strategies: fine-tuning (open-source like LLaMA) vs non-fine-tuning (prompt engineering + mechanism engineering) — closed models can only use the latter
- Applications span social simulation, chemistry lab automation (ChemCrow), software development (ChatDev, MetaGPT), to robotics (SayCan, Voyager)
- Key challenges identified: hallucination, prompt vulnerability, knowledge boundary issues (the model knowing more than its simulated role realistically should, which makes simulations inaccurate), and reasoning efficiency
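The Profile → Memory → Planning → Action decomposition above can be sketched as a minimal agent skeleton. The survey describes these modules conceptually, not as a concrete API, so every class and method name here is illustrative:

```python
# Minimal sketch of the survey's 4-module agent architecture.
# All names are illustrative stand-ins, not from the paper.

class Agent:
    def __init__(self, profile):
        self.profile = profile     # Profile: role/persona injected into prompts
        self.memory = []           # Memory: record of past actions and outcomes

    def plan(self, task):
        # Planning: decompose the task into steps (a trivial stand-in for
        # CoT/ToT-style planning, which an LLM would do in practice).
        return [f"step {i + 1} of {task}" for i in range(3)]

    def act(self, step):
        # Action: execute one step via a tool or environment call.
        result = f"executed: {step}"
        self.memory.append(result)  # write the outcome back to memory
        return result

    def run(self, task):
        return [self.act(step) for step in self.plan(task)]

agent = Agent(profile="helpful research assistant")
results = agent.run("summarize paper")
```

In a real system, `plan` and `act` would each wrap an LLM call, and `memory` would be the hybrid short/long-term store described above.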
Evidence
- ToolBench dataset: collected 16,464 real APIs from 49 categories on RapidAPI Hub, fine-tuning LLaMA on this significantly improves tool use ability
- WebShop benchmark: evaluates agent's product search/purchase ability using 1.18M real Amazon products, quantitatively measuring LLM agent performance vs 13 human workers
- AgentBench: the first systematic framework for evaluating LLMs as agents across diverse real-world environments, providing performance comparisons across multiple domains
- MIND2WEB: collected 2,000+ open-ended tasks from 137 real websites across 31 domains for web agent fine-tuning
How to Apply
- When adding a memory module to an agent: relying on the context window alone (short-term memory) eventually overflows it, so store important information as embeddings in a vector DB and apply Hybrid Memory retrieval with the recency + relevance + importance weighted sum
- For complex multi-step task automation: the ReAct pattern (think→act→observe loop) or Reflexion (retry with verbal feedback after failure) recovers from failures more easily than generating the entire plan upfront (CoT-style planning)
- For role-based multi-agent systems: separate PM, architect, developer roles like ChatDev/MetaGPT and clearly define each agent's profile in the system prompt to improve collaboration quality and output consistency
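For the role-separation pattern, each agent's profile can be pinned in its system prompt. The roles below mirror the ChatDev/MetaGPT split, but the prompt wording and helper function are illustrative, not taken from either project:

```python
# Illustrative role profiles for a ChatDev/MetaGPT-style pipeline.
ROLE_PROFILES = {
    "pm": "You are a product manager. Turn the user request into a concise requirements list.",
    "architect": "You are a software architect. Design module boundaries and interfaces from the requirements.",
    "developer": "You are a developer. Implement the design exactly as specified, one file at a time.",
}

def build_messages(role, task):
    """Compose a chat request for one agent role (OpenAI-style message format)."""
    return [
        {"role": "system", "content": ROLE_PROFILES[role]},
        {"role": "user", "content": task},
    ]

msgs = build_messages("architect", "Build a URL shortener")
```

Chaining the roles (PM output → architect input → developer input) reproduces the pipeline structure these frameworks use.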
Code Example
# ReAct pattern-based agent prompt example
SYSTEM_PROMPT = """
You are an autonomous agent. Follow the Thought-Action-Observation loop.
Format:
Thought: [reasoning about what to do next]
Action: [tool_name(param1, param2)]
Observation: [result from tool]
... (repeat as needed)
Final Answer: [your final response]
Available tools:
- search(query): Search the web
- calculator(expression): Evaluate math
- memory_read(query): Retrieve from memory
- memory_write(content): Store to memory
"""
# Hybrid Memory retrieval scoring example
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_memory(query_embedding, memory_item, current_time):
    """
    Paper formula: m* = argmax_m (alpha * s_rec + beta * s_rel + gamma * s_imp)
    """
    alpha, beta, gamma = 0.3, 0.5, 0.2  # weights are adjustable
    # Recency: decays as the memory ages
    time_diff = current_time - memory_item['timestamp']
    s_recency = 1.0 / (1.0 + time_diff.total_seconds() / 3600)
    # Relevance: cosine similarity between query and memory embeddings
    s_relevance = cosine_similarity(query_embedding, memory_item['embedding'])
    # Importance: pre-rated by an LLM on a 1-10 scale and stored with the memory
    s_importance = memory_item['importance'] / 10.0
    return alpha * s_recency + beta * s_relevance + gamma * s_importance
Original Abstract
Autonomous agents have long been a research focus in academic and industry communities. Previous research often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, making it hard for the agents to achieve human-like decisions. Recently, through the acquisition of vast amounts of Web knowledge, large language models (LLMs) have shown potential in human-level intelligence, leading to a surge in research on LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of LLM-based autonomous agents from a holistic perspective. We first discuss the construction of LLM-based autonomous agents, proposing a unified framework that encompasses much of previous work. Then, we present an overview of the diverse applications of LLM-based autonomous agents in social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field.