Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
TL;DR Highlight
A pipeline that automatically generates multi-turn tool-use conversation data for LLM agent training from plain text like wikis and blogs, without requiring API specs
Who Should Read
ML engineers and AI researchers trying to fine-tune LLM-based agents but blocked by lack of high-quality multi-turn tool-use training data. Especially teams building domain-specific agents where data collection costs are prohibitive.
Core Mechanics
- Previous methods required pre-defined API sets, but this paper introduces a new paradigm for extracting multi-turn tool-use data directly from plain text like wikis and blogs
- About 14% of text segments contain multi-step workflows, making large text corpora viable data sources
- GEM pipeline has 4 stages: text filtering → workflow/tool definition extraction → dialogue generation with GLM-4.6 → complexity refinement
- Refinement stage is critical: average messages 30→46, tool types 5→8.6, tool calls 7.8→16.3 — dramatically increasing data complexity
- Qwen3-32B-GEM achieved 44.88% on BFCL V3, surpassing GPT-4.1 (38.88%) and DeepSeek-V3.2-Exp (37.38%)
- Distilled the GEM pipeline itself into a Trajectory Synthesizer trained on Qwen3-8B for low-cost mass data generation
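The four stages above can be sketched as a simple orchestration skeleton. This is our own illustrative sketch, not the paper's code: `call_llm` is a stub standing in for the generator model (GLM-4.6 in the paper), and the stage bodies return placeholder values.

```python
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Stub for the generator model (the paper uses GLM-4.6)."""
    return "<multi_step>True</multi_step>"  # placeholder response


@dataclass
class Trajectory:
    tools: list = field(default_factory=list)
    messages: list = field(default_factory=list)


def stage1_filter(text: str) -> bool:
    """Relevance filtering: keep texts describing multi-step workflows (~14%)."""
    return "<multi_step>True</multi_step>" in call_llm("Filter: " + text)


def stage2_extract_tools(text: str) -> list:
    """Workflow & tool extraction: execution graph + OpenAI-schema tool defs."""
    call_llm("Extract tools: " + text)
    return [{"name": "login"}, {"name": "search_query"}]  # placeholder output


def stage3_generate_dialogue(text: str, tools: list) -> Trajectory:
    """Trajectory grounding: generate a multi-turn dialogue over the tools."""
    call_llm("Generate dialogue: " + text)
    return Trajectory(tools=tools, messages=[{"role": "user", "content": text}])


def stage4_refine(traj: Trajectory) -> Trajectory:
    """Complexity refinement: grow turns, tool types, and tool calls."""
    call_llm("Refine trajectory")
    return traj


def gem_pipeline(text: str) -> "Trajectory | None":
    if not stage1_filter(text):
        return None  # roughly 86% of segments stop here
    tools = stage2_extract_tools(text)
    return stage4_refine(stage3_generate_dialogue(text, tools))
```

Each stage is a separate LLM call, so the stages can be batched and cached independently — which is also what makes distilling the whole chain into a single end-to-end synthesizer attractive.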
Evidence
- Qwen3-32B-GEM BFCL V3 Overall 44.88% — surpassing proprietary models GPT-4.1 (38.88%) and DeepSeek-V3.2-Exp (37.38%) using only out-of-domain data
- τ2-bench Retail Pass@4: Qwen3-32B-GEM 86.84% vs 80.70% for MUA, which was trained on in-domain data
- Refinement ablation: Qwen3-32B Overall 32.50% without refinement → 44.88% with it, a 12.38-point gain
- Trajectory Synthesizer (Qwen3-8B based) achieves 28.38% on BFCL vs 30.25% for the full GLM-4.6 pipeline — comparable quality at drastically reduced cost
How to Apply
- Acquire public text corpora like WikiHow or Ultra-FineWeb, filter for documents containing multi-step procedures (about 14% according to the paper), and use them as agent training data sources
- Implement the 4-stage pipeline (filtering → tool extraction → dialogue generation → refinement) using a strong model (GPT-4o, Claude, etc.) to generate custom domain tool-use SFT data — refinement is essential, as skipping it drops performance by over 12 points
- First generate ~10K high-quality data points, then SFT-train a small model (8B) as a Trajectory Synthesizer for subsequent low-cost mass production
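For the distillation step, each pipeline output can be packed into a standard SFT record where the raw text is the prompt and the full trajectory is the target. A minimal sketch, assuming a chat-style `messages` training format (the field names here are our assumptions, not the paper's):

```python
import json


def to_sft_record(source_text: str, trajectory: dict) -> dict:
    """One supervised example for the Trajectory Synthesizer:
    raw text in, complete multi-turn trajectory (tools + messages) out."""
    return {
        "messages": [
            {
                "role": "user",
                "content": "Synthesize a tool-use trajectory from this text:\n"
                + source_text,
            },
            {
                "role": "assistant",
                "content": json.dumps(trajectory, ensure_ascii=False),
            },
        ]
    }


record = to_sft_record(
    "To reset your router, open the admin page, log in, then click Reset.",
    {"tools": [{"name": "open_admin_page"}], "messages": []},
)
```

After fine-tuning an 8B model on ~10K such records, new trajectories come from a single forward pass instead of four chained calls to a large model.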
Code Example
# GEM Pipeline Stage 1: Prompt to determine whether a multi-step workflow is present
prompt_filter = """
Determine whether the following text contains multi-step operations involving
the use of an APP, website, computer, or other machine.
If it contains, generate one sentence summary and identify:
- platform: operator / computer / phone / machine / other
- domain: computers_and_electronics / health / shopping / ...
- task_category: customer_support / developer_tools / databases / ...
Output:
<multi_step>False</multi_step>
or
<multi_step>True</multi_step>
<summary>...</summary>
<domain>...</domain>
<platform>...</platform>
<task>...</task>
Text: {text}
"""
# Stage 2: Workflow & Tool Extraction (OpenAI schema format)
prompt_tool_extract = """
You are a program design expert.
Given a workflow description, design functions to translate it into a program.
1. Extract all intermediate steps
2. Convert every step to a function and represent as execution graph
e.g., (login)->(search_query)->(update_item)
3. Generate API tool definitions in OpenAI JSON schema format
- Each tool: single, coherent capability
- Parameters: self-explanatory names, explicit types
- Include both read and write tools (get_*, update_*)
Workflow Description: {text}
Output format:
<workflow>
<steps>Step1: ...\nStep2: ...</steps>
<execution_graph>(api1)->(api2, api3)->...</execution_graph>
<tools>[{"name": "api_name", "description": "", "inputSchema": {...}}]</tools>
</workflow>
"""Terminology
Original Abstract
Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ-bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.