Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
TL;DR Highlight
In function-calling agents, a brief CoT budget of just 8–32 tokens yields peak accuracy, while 256 tokens of CoT actually performs worse than no reasoning at all.
Who Should Read
Backend/AI developers building LLM-based function-calling agents or optimizing prompts — especially those configuring inference token budgets or CoT strategies.
Core Mechanics
- For Qwen2.5-1.5B-Instruct, answering directly without CoT (budget d=0) yields 44.0% accuracy, while just 32 tokens of CoT (d=32) raises it to 64.0%, a 45% relative gain. A 256-token budget collapses to 25.0%, below the no-CoT baseline.
- A finer-grained sweep shows the true optimum lies at 8–16 tokens: d=16 reaches 69.0% and d=8 reaches 68.0%, both above d=32 (63–64%).
- A three-way error decomposition explains the mechanism: without CoT, 'wrong function selection' dominates at 30.5% of tasks, and 32-token CoT cuts it to 1.5%. Brief CoT thus primarily acts as a function-routing step (deciding which function to call).
- With long CoT (256 tokens), wrong function selection spikes back to 28.0%, and hallucination (generating non-existent function names) rises to 18.0%. The longer the model thinks, the more it misleads itself.
- FR-CoT (Function-Routing CoT) templates the start of reasoning as 'Function: [name] / Key args: [...]', anchoring the function name as the first generated token. With no fine-tuning, only a prompt-template change, it drives function hallucination down to 0.0%.
- Qwen2.5-7B shows the same non-monotonic pattern (d=32: 82.5%, d=256: 18.0%). Phi-3-mini-4k-instruct shows monotonic degradation but never drops below baseline — because Phi-3 self-terminates early with EOS 68% of the time under long budgets, naturally keeping outputs short.
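The three-way decomposition above can be reproduced with a simple classifier over the model's predicted call. This is an illustrative sketch, not the paper's code: `classify_error` and its return labels are assumed names, and "argument_error" as the third bucket is an assumption (the summary explicitly names only wrong selection and hallucination).

```python
def classify_error(predicted_fn, gold_fn, available_fns):
    """Bucket a failed function call into three error families (hypothetical
    reconstruction of the paper's decomposition).

    predicted_fn: function name the model emitted
    gold_fn: the correct function name for the task
    available_fns: set of function names in the provided schema
    """
    if predicted_fn not in available_fns:
        # Model invented a name outside the candidate set.
        return "hallucinated_function"
    if predicted_fn != gold_fn:
        # Valid name from the schema, but the wrong choice.
        return "wrong_function_selection"
    # Name is correct; a failed task must then be failing on arguments
    # (assumed third category).
    return "argument_error"
```

Aggregating these labels over a task set at each budget d would recover the trend described above (routing errors dominant at d=0, near zero at d=32, resurgent at d=256).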
Evidence
- "Qwen2.5-1.5B: d=32 → 64.0%, d=256 → 25.0% (McNemar p<0.001); d=256 is -39pp vs. d=32. Qwen2.5-7B: d=32 → 82.5%, d=256 → 18.0% (McNemar p<0.001); collapse is even more severe in the 7B model. FR-CoT vs. constrained decoding (forcing function name via log-probabilities): on 7B, FR-CoT 83.0% vs. constrained d=32 63.5%, a +19.5pp difference (non-overlapping 95% CI). 88.6% of all solvable tasks can be optimally resolved with d*≤32 tokens; the oracle average reasoning token count is just 27.6."
How to Apply
- "If your function-calling agent prompt currently allocates a generous CoT budget of 512 tokens, cut it to 8–32 tokens immediately. Explicitly state a hard cap such as 'Think step by step (use at most 16 tokens)'. For production environments where hallucination is critical, use the FR-CoT template: end your prompt with 'Function: ' so the model is forced to generate a valid function name as its first token — achieves 0% hallucination with a single prompt line change, no fine-tuning required. If you are using constrained decoding (forcing function name selection via logits), FR-CoT outperforms it by +19.5pp on 7B+ models, delivering better results through prompt alone without logit access or additional forward passes."
Code Example
# FR-CoT prompt template example
# End the prompt with 'Function: ' to force the model to generate the function name as the first token
FR_COT_PROMPT_TEMPLATE = """
You are a function-calling assistant. Given the user query and available functions, call the correct function.
Available functions:
{function_schemas}
User query: {user_query}
Step 1 -- Identify:
Function: """
# Model starts generating here → the first token becomes the function name
# Reasoning continues:
# Key args: [arg=value, ...]
# [Based on the above, the JSON function call is:]
# JSON: ...
# Brief CoT example (d=16 or d=32)
BRIEF_COT_PROMPT_TEMPLATE = """
You are a function-calling assistant.
Available functions:
{function_schemas}
User query: {user_query}
Think step by step (use at most 16 tokens).
Reasoning: """
# Cap phase-1 reasoning with max_new_tokens=16, then append the following
# suffix in a second pass to elicit the final answer:
# "\n\nBased on the above reasoning, the JSON function call is:\nJSON:"
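The two-phase flow described in the comments (capped reasoning, then an appended answer suffix) can be wired up model-agnostically. In this sketch, `generate` is any completion function you supply, such as a wrapper around transformers' `model.generate` with `max_new_tokens`; the driver itself and its names are illustrative, not code from the paper:

```python
ANSWER_SUFFIX = "\n\nBased on the above reasoning, the JSON function call is:\nJSON:"

def brief_cot_call(prompt, generate, budget_tokens=16, answer_tokens=128):
    """Two-phase brief-CoT function call (hypothetical driver).

    generate(prompt, max_new_tokens=...) -> completion string; plug in any
    backend. Phase 1 caps reasoning at `budget_tokens`; phase 2 appends the
    answer suffix and decodes the final JSON call.
    """
    # Phase 1: brief reasoning, hard-capped by the backend's token limit.
    reasoning = generate(prompt, max_new_tokens=budget_tokens)
    # Phase 2: append the suffix and let the model emit the JSON call.
    answer_prompt = prompt + reasoning + ANSWER_SUFFIX
    return generate(answer_prompt, max_new_tokens=answer_tokens)
```

Budget enforcement is delegated entirely to the backend's `max_new_tokens`, so the same driver serves both the d=16 and d=32 settings discussed above.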
Original Abstract
How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.