Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
TL;DR Highlight
A pipeline that automatically generates multi-turn tool-use conversation data for LLM agent training from plain text like wikis and blogs, without requiring API specs
Who Should Read
ML engineers and AI researchers trying to fine-tune LLM-based agents but blocked by lack of high-quality multi-turn tool-use training data. Especially teams building domain-specific agents where data collection costs are prohibitive.
Core Mechanics
- Previous methods required pre-defined API sets, but this paper introduces a new paradigm for extracting multi-turn tool-use data directly from plain text like wikis and blogs
- About 14% of text segments contain multi-step workflows, making large text corpora viable data sources
- GEM pipeline has 4 stages: text filtering → workflow/tool definition extraction → dialogue generation with GLM-4.6 → complexity refinement
- Refinement stage is critical: average messages 30→46, tool types 5→8.6, tool calls 7.8→16.3 — dramatically increasing data complexity
- Qwen3-32B-GEM achieved 44.88% on BFCL V3, surpassing GPT-4.1 (38.88%) and DeepSeek-V3.2-Exp (37.38%)
- Distilled the GEM pipeline itself into a Trajectory Synthesizer trained on Qwen3-8B for low-cost mass data generation
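The four stages above can be sketched as a simple orchestration skeleton. This is our own illustrative sketch, not the paper's code: `call_llm` is a stub standing in for the generator model (GLM-4.6 in the paper), and the stage bodies return placeholder values.

```python
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Stub for the generator model (the paper uses GLM-4.6)."""
    return "<multi_step>True</multi_step>"  # placeholder response


@dataclass
class Trajectory:
    tools: list = field(default_factory=list)
    messages: list = field(default_factory=list)


def stage1_filter(text: str) -> bool:
    """Relevance filtering: keep texts describing multi-step workflows (~14%)."""
    return "<multi_step>True</multi_step>" in call_llm("Filter: " + text)


def stage2_extract_tools(text: str) -> list:
    """Workflow & tool extraction: execution graph + OpenAI-schema tool defs."""
    call_llm("Extract tools: " + text)
    return [{"name": "login"}, {"name": "search_query"}]  # placeholder output


def stage3_generate_dialogue(text: str, tools: list) -> Trajectory:
    """Trajectory grounding: generate a multi-turn dialogue over the tools."""
    call_llm("Generate dialogue: " + text)
    return Trajectory(tools=tools, messages=[{"role": "user", "content": text}])


def stage4_refine(traj: Trajectory) -> Trajectory:
    """Complexity refinement: grow turns, tool types, and tool calls."""
    call_llm("Refine trajectory")
    return traj


def gem_pipeline(text: str) -> "Trajectory | None":
    if not stage1_filter(text):
        return None  # roughly 86% of segments stop here
    tools = stage2_extract_tools(text)
    return stage4_refine(stage3_generate_dialogue(text, tools))
```

Each stage is a separate LLM call, so the stages can be batched and cached independently — which is also what makes distilling the whole chain into a single end-to-end synthesizer attractive.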
Evidence
- Qwen3-32B-GEM BFCL V3 Overall 44.88% — surpassing proprietary models GPT-4.1 (38.88%) and DeepSeek-V3.2-Exp (37.38%) using only out-of-domain data
- τ2-bench Retail Pass@4: Qwen3-32B-GEM 86.84% vs 80.70% for MUA, which was trained on in-domain data
- Refinement ablation: Qwen3-32B Overall 32.50% without refinement → 44.88% with it, a 12.38-point gain
- Trajectory Synthesizer (Qwen3-8B based) achieves 28.38% on BFCL vs 30.25% for the full GLM-4.6 pipeline — comparable quality at drastically reduced cost
How to Apply
- Acquire public text corpora like WikiHow or Ultra-FineWeb, filter for documents containing multi-step procedures (about 14% according to the paper), and use them as agent training data sources
- Implement the 4-stage pipeline (filtering → tool extraction → dialogue generation → refinement) using a strong model (GPT-4o, Claude, etc.) to generate custom domain tool-use SFT data — refinement is essential, as skipping it drops performance by over 12 points
- First generate ~10K high-quality data points, then SFT-train a small model (8B) as a Trajectory Synthesizer for subsequent low-cost mass production
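For the distillation step, each pipeline output can be packed into a standard SFT record where the raw text is the prompt and the full trajectory is the target. A minimal sketch, assuming a chat-style `messages` training format (the field names here are our assumptions, not the paper's):

```python
import json


def to_sft_record(source_text: str, trajectory: dict) -> dict:
    """One supervised example for the Trajectory Synthesizer:
    raw text in, complete multi-turn trajectory (tools + messages) out."""
    return {
        "messages": [
            {
                "role": "user",
                "content": "Synthesize a tool-use trajectory from this text:\n"
                + source_text,
            },
            {
                "role": "assistant",
                "content": json.dumps(trajectory, ensure_ascii=False),
            },
        ]
    }


record = to_sft_record(
    "To reset your router, open the admin page, log in, then click Reset.",
    {"tools": [{"name": "open_admin_page"}], "messages": []},
)
```

After fine-tuning an 8B model on ~10K such records, new trajectories come from a single forward pass instead of four chained calls to a large model.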
Code Example
# GEM Pipeline Stage 1: Prompt to determine whether a multi-step workflow is present
prompt_filter = """
Determine whether the following text contains multi-step operations involving
the use of an APP, website, computer, or other machine.
If it contains, generate one sentence summary and identify:
- platform: operator / computer / phone / machine / other
- domain: computers_and_electronics / health / shopping / ...
- task_category: customer_support / developer_tools / databases / ...
Output:
<multi_step>False</multi_step>
or
<multi_step>True</multi_step>
<summary>...</summary>
<domain>...</domain>
<platform>...</platform>
<task>...</task>
Text: {text}
"""
# Stage 2: Workflow & Tool Extraction (OpenAI schema format)
prompt_tool_extract = """
You are a program design expert.
Given a workflow description, design functions to translate it into a program.
1. Extract all intermediate steps
2. Convert every step to a function and represent as execution graph
e.g., (login)->(search_query)->(update_item)
3. Generate API tool definitions in OpenAI JSON schema format
- Each tool: single, coherent capability
- Parameters: self-explanatory names, explicit types
- Include both read and write tools (get_*, update_*)
Workflow Description: {text}
Output format:
<workflow>
<steps>Step1: ...\nStep2: ...</steps>
<execution_graph>(api1)->(api2, api3)->...</execution_graph>
<tools>[{"name": "api_name", "description": "", "inputSchema": {...}}]</tools>
</workflow>
"""Terminology
Original Abstract
Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ-bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.