A Survey on the Optimization of Large Language Model-based Agents
TL;DR Highlight
A survey paper that organizes the major optimization techniques for building better LLM agents — from fine-tuning and RL to prompt engineering — into a single taxonomy.
Who Should Read
ML engineers and researchers developing or improving LLM-based agent systems. Especially useful for developers weighing which optimization strategy to choose: fine-tuning vs. RL vs. prompting.
Core Mechanics
- Agent optimization methods fall into two broad categories: 'Parameter-driven' approaches that directly modify model parameters (fine-tuning, RL, hybrid), and 'Parameter-free' approaches that leave parameters unchanged (prompt engineering, RAG, multi-agent collaboration).
- Four ways to construct trajectory data for fine-tuning: expert human annotation, generation by a strong model like GPT-4, agent self-exploration, and multi-agent collaboration — each with different quality/cost/scalability trade-offs.
- RL-based optimization is divided into environment-based rewards, model-based rewards, and custom reward functions; PPO and DPO are the most widely used — with Llama-3, Mistral-7B, and Qwen2.5 family models as the dominant base models.
- Sequential hybrid training (SFT → RL) outperforms SFT alone: warm up with high-quality trajectories via SFT, then refine the policy with PPO/DPO — the same paradigm as OpenAI's RFT (Reinforcement Fine-Tuning).
- Five Parameter-free methods: past-experience-based (ExpeL, Reflexion), feedback-driven self-reflection, meta-prompt optimization (OPRO), tool-use optimization, and RAG-based external knowledge integration.
- Key evaluation benchmarks vary by domain: math (GSM8K, MATH), code (SWE-bench, HumanEval), web navigation (WebArena, Mind2Web), tool use (T-Eval, ToolEval), and multi-task (AgentBench, GAIA).
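The trajectory-construction step described above (generate with a strong model, then filter on environment feedback) can be sketched as a simple reward-threshold filter. This is a minimal illustration; the `Trajectory` structure and the 0.8 threshold are assumptions for the sketch, not details from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)  # ReAct-style (thought, action, observation) tuples
    env_reward: float = 0.0                    # score assigned by the environment, 0.0-1.0

def filter_trajectories(trajs, threshold=0.8):
    """Keep only trajectories the environment judged successful enough for SFT."""
    return [t for t in trajs if t.env_reward >= threshold]

raw = [
    Trajectory("book a flight", [("plan", "search", "ok")], env_reward=0.95),
    Trajectory("book a flight", [("plan", "search", "error")], env_reward=0.30),
]
good = filter_trajectories(raw)  # only the successful trajectory survives
```

The same filter works regardless of who produced the trajectories (GPT-4, self-exploration, or multi-agent collaboration); only the reward source changes.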
Evidence
- AgentBank dataset: covers 16 tasks with 51,287 filtered trajectories — one of the largest agent tuning datasets available.
- SMART-Trajectory dataset: covers 17 tasks with 142,507 trajectories — the largest agent-tuning dataset covered in the survey.
- Agent Q based on Llama-3-70B achieved performance gains on web agent tasks with DPO; the xLAM-v0.1-r model was improved using the same approach.
- Hybrid strategy (SFT+RL) in practice: more than eight surveyed papers — including ETO (BC+DPO), AGILE (BC+PPO), SaySelf (SFT+PPO), and OPTIMA (SFT+DPO) — report improvements over SFT alone.
How to Apply
- To optimize an open-source agent (e.g., Llama-3-8B, Mistral-7B) for a specific domain: generate ReAct-style trajectories with GPT-4 → filter using environment feedback → fine-tune with LoRA → then apply DPO for preference alignment.
- If fine-tuning costs are a concern, try Parameter-free methods first: add a Reflexion-style pattern to your prompt (self-reflection on failure before retrying), or store past successes and failures in memory like ExpeL so the agent can reference them in future runs.
- When building multi-agent systems, choose based on task characteristics: LangGraph (workflow-centric), AutoGen (conversational collaboration), or CrewAI (role-based teams) — for complex software engineering tasks, refer to MetaGPT/ChatDev patterns.
Code Example
# Reflexion-style self-reflection prompt pattern (Parameter-free optimization)
system_prompt = """
You are an autonomous agent. Follow this process:
1. THINK: Analyze the task and plan your approach
2. ACT: Execute an action using available tools
3. OBSERVE: Check the result
4. REFLECT: If failed, analyze why and adjust strategy
5. Repeat until task is complete
If you encounter an error:
- Identify what went wrong
- Generate a corrected plan
- Do NOT repeat the same mistake
"""
# Example preference data structure for DPO
preference_data = {
    "instruction": "Search the web for today's weather and tell me.",
    "chosen": [
        {"role": "assistant", "content": "[SEARCH] today's weather\n[RESULT] Seoul 18°C, clear\nFinal answer: The current temperature in Seoul is 18°C and it is clear."}
    ],
    "rejected": [
        {"role": "assistant", "content": "I cannot access weather information. Please check it yourself."}
    ]
}
# LoRA-based agent fine-tuning configuration (HuggingFace PEFT)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the trainable fraction — well under 1% of parameters with this config
Original Abstract
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness in complex agent-related environments. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective remains lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the evaluation for agents, review key applications of LLM-based agents, and discuss the major challenges and promising future directions. A curated collection of the surveyed works is provided at https://github.com/YoungDubbyDu/LLM-Agent-Optimization.