A Survey on the Optimization of Large Language Model-based Agents
TL;DR Highlight
A survey paper that organizes the major optimization techniques for building better LLM agents — from fine-tuning and RL to prompt engineering — into a single taxonomy.
Who Should Read
ML engineers and researchers developing or improving LLM-based agent systems. Especially useful for developers weighing which optimization strategy to choose: fine-tuning vs. RL vs. prompting.
Core Mechanics
- Agent optimization methods fall into two broad categories: 'Parameter-driven' approaches that directly modify model parameters (fine-tuning, RL, hybrid), and 'Parameter-free' approaches that leave parameters unchanged (prompt engineering, RAG, multi-agent collaboration).
- Four ways to construct trajectory data for fine-tuning: expert human annotation, generation by a strong model like GPT-4, agent self-exploration, and multi-agent collaboration — each with different quality/cost/scalability trade-offs.
- RL-based optimization is divided into environment-based rewards, model-based rewards, and custom reward functions; PPO and DPO are the most widely used — with Llama-3, Mistral-7B, and Qwen2.5 family models as the dominant base models.
- Sequential hybrid training (SFT → RL) outperforms SFT alone: warm up with high-quality trajectories via SFT, then refine the policy with PPO/DPO — the same paradigm as OpenAI's RFT (Reinforcement Fine-Tuning).
- Five Parameter-free methods: past-experience-based (ExpeL, Reflexion), feedback-driven self-reflection, meta-prompt optimization (OPRO), tool-use optimization, and RAG-based external knowledge integration.
- Key evaluation benchmarks vary by domain: math (GSM8K, MATH), code (SWE-bench, HumanEval), web navigation (WebArena, Mind2Web), tool use (T-Eval, ToolEval), and multi-task (AgentBench, GAIA).
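The trajectory-construction step described above (generate with a strong model, then filter on environment feedback) can be sketched as a simple reward-threshold filter. This is a minimal illustration; the `Trajectory` structure and the 0.8 threshold are assumptions for the sketch, not details from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)  # ReAct-style (thought, action, observation) tuples
    env_reward: float = 0.0                    # score assigned by the environment, 0.0-1.0

def filter_trajectories(trajs, threshold=0.8):
    """Keep only trajectories the environment judged successful enough for SFT."""
    return [t for t in trajs if t.env_reward >= threshold]

raw = [
    Trajectory("book a flight", [("plan", "search", "ok")], env_reward=0.95),
    Trajectory("book a flight", [("plan", "search", "error")], env_reward=0.30),
]
good = filter_trajectories(raw)  # only the successful trajectory survives
```

The same filter works regardless of who produced the trajectories (GPT-4, self-exploration, or multi-agent collaboration); only the reward source changes.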
Evidence
- AgentBank dataset: covers 16 tasks with 51,287 filtered trajectories — one of the largest agent tuning datasets available.
- SMART-Trajectory dataset: covers 17 tasks with 142,507 trajectories — the largest agent-tuning dataset covered in the survey.
- Agent Q based on Llama-3-70B achieved performance gains on web agent tasks with DPO; the xLAM-v0.1-r model was improved using the same approach.
- Hybrid strategy (SFT+RL) in practice: more than eight surveyed papers — including ETO (BC+DPO), AGILE (BC+PPO), SaySelf (SFT+PPO), and OPTIMA (SFT+DPO) — report improvements over SFT alone.
How to Apply
- To optimize an open-source agent (e.g., Llama-3-8B, Mistral-7B) for a specific domain: generate ReAct-style trajectories with GPT-4 → filter using environment feedback → fine-tune with LoRA → then apply DPO for preference alignment.
- If fine-tuning costs are a concern, try Parameter-free methods first: add a Reflexion-style pattern to your prompt (self-reflection on failure before retrying), or store past successes and failures in memory like ExpeL so the agent can reference them in future runs.
- When building multi-agent systems, choose based on task characteristics: LangGraph (workflow-centric), AutoGen (conversational collaboration), or CrewAI (role-based teams) — for complex software engineering tasks, refer to MetaGPT/ChatDev patterns.
Code Example
# Reflexion-style self-reflection prompt pattern (Parameter-free optimization)
system_prompt = """
You are an autonomous agent. Follow this process:
1. THINK: Analyze the task and plan your approach
2. ACT: Execute an action using available tools
3. OBSERVE: Check the result
4. REFLECT: If failed, analyze why and adjust strategy
5. Repeat until task is complete
If you encounter an error:
- Identify what went wrong
- Generate a corrected plan
- Do NOT repeat the same mistake
"""
# Example preference data structure for DPO
preference_data = {
    "instruction": "Search the web for today's weather and tell me.",
    "chosen": [
        {"role": "assistant", "content": "[SEARCH] today's weather\n[RESULT] Seoul 18°C, clear\nFinal answer: The current temperature in Seoul is 18°C and it is clear."}
    ],
    "rejected": [
        {"role": "assistant", "content": "I cannot access weather information. Please check it yourself."}
    ]
}
# LoRA-based agent fine-tuning configuration (HuggingFace PEFT)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the trainable fraction — well under 1% of parameters with this config
Original Abstract
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness in complex agent-related environments. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective remains lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the evaluation for agents, review key applications of LLM-based agents, and discuss the major challenges and promising future directions. A curated collection of the surveyed works is provided at https://github.com/YoungDubbyDu/LLM-Agent-Optimization.