Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
TL;DR Highlight
In function-calling agents, a brief CoT budget of just 8–32 tokens yields peak accuracy, while 256 tokens of CoT actually performs worse than no reasoning at all.
Who Should Read
Backend/AI developers building LLM-based function-calling agents or optimizing prompts — especially those configuring inference token budgets or CoT strategies.
Core Mechanics
- For Qwen2.5-1.5B-Instruct, answering directly without CoT (budget d=0) yields 44.0% accuracy, while just 32 tokens of CoT (d=32) raises it to 64.0%, a 45% relative gain. A 256-token budget collapses to 25.0%, below the no-CoT baseline.
- A finer-grained sweep shows the true optimum lies at 8–16 tokens: d=16 reaches 69.0% and d=8 reaches 68.0%, both above d=32 (63–64%).
- A three-way error decomposition explains the mechanism: without CoT, 'wrong function selection' dominates at 30.5% of tasks, and 32-token CoT cuts it to 1.5%. Brief CoT thus primarily acts as a function-routing step (deciding which function to call).
- With long CoT (256 tokens), wrong function selection spikes back to 28.0%, and hallucination (generating non-existent function names) rises to 18.0%. The longer the model thinks, the more it misleads itself.
- FR-CoT (Function-Routing CoT) templates the start of reasoning as 'Function: [name] / Key args: [...]', anchoring the function name as the first generated token. With no fine-tuning, only a prompt-template change, it drives function hallucination down to 0.0%.
- Qwen2.5-7B shows the same non-monotonic pattern (d=32: 82.5%, d=256: 18.0%). Phi-3-mini-4k-instruct shows monotonic degradation but never drops below baseline — because Phi-3 self-terminates early with EOS 68% of the time under long budgets, naturally keeping outputs short.
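The three-way decomposition above can be reproduced with a simple classifier over the model's predicted call. This is an illustrative sketch, not the paper's code: `classify_error` and its return labels are assumed names, and "argument_error" as the third bucket is an assumption (the summary explicitly names only wrong selection and hallucination).

```python
def classify_error(predicted_fn, gold_fn, available_fns):
    """Bucket a failed function call into three error families (hypothetical
    reconstruction of the paper's decomposition).

    predicted_fn: function name the model emitted
    gold_fn: the correct function name for the task
    available_fns: set of function names in the provided schema
    """
    if predicted_fn not in available_fns:
        # Model invented a name outside the candidate set.
        return "hallucinated_function"
    if predicted_fn != gold_fn:
        # Valid name from the schema, but the wrong choice.
        return "wrong_function_selection"
    # Name is correct; a failed task must then be failing on arguments
    # (assumed third category).
    return "argument_error"
```

Aggregating these labels over a task set at each budget d would recover the trend described above (routing errors dominant at d=0, near zero at d=32, resurgent at d=256).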
Evidence
- "Qwen2.5-1.5B: d=32 → 64.0%, d=256 → 25.0% (McNemar p<0.001); d=256 is -39pp vs. d=32. Qwen2.5-7B: d=32 → 82.5%, d=256 → 18.0% (McNemar p<0.001); collapse is even more severe in the 7B model. FR-CoT vs. constrained decoding (forcing function name via log-probabilities): on 7B, FR-CoT 83.0% vs. constrained d=32 63.5%, a +19.5pp difference (non-overlapping 95% CI). 88.6% of all solvable tasks can be optimally resolved with d*≤32 tokens; the oracle average reasoning token count is just 27.6."
How to Apply
- "If your function-calling agent prompt currently allocates a generous CoT budget of 512 tokens, cut it to 8–32 tokens immediately. Explicitly state a hard cap such as 'Think step by step (use at most 16 tokens)'. For production environments where hallucination is critical, use the FR-CoT template: end your prompt with 'Function: ' so the model is forced to generate a valid function name as its first token — achieves 0% hallucination with a single prompt line change, no fine-tuning required. If you are using constrained decoding (forcing function name selection via logits), FR-CoT outperforms it by +19.5pp on 7B+ models, delivering better results through prompt alone without logit access or additional forward passes."
Code Example
# FR-CoT prompt template example
# End the prompt with 'Function: ' to force the model to generate the function name as the first token
FR_COT_PROMPT_TEMPLATE = """
You are a function-calling assistant. Given the user query and available functions, call the correct function.
Available functions:
{function_schemas}
User query: {user_query}
Step 1 -- Identify:
Function: """
# Model starts generating here → the first token becomes the function name
# Reasoning continues:
# Key args: [arg=value, ...]
# [Based on the above, the JSON function call is:]
# JSON: ...
# Brief CoT example (d=16 or d=32)
BRIEF_COT_PROMPT_TEMPLATE = """
You are a function-calling assistant.
Available functions:
{function_schemas}
User query: {user_query}
Think step by step (use at most 16 tokens).
Reasoning: """
# Cap phase-1 reasoning with max_new_tokens=16, then append the following
# suffix in a second pass to elicit the final answer:
# "\n\nBased on the above reasoning, the JSON function call is:\nJSON:"
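The two-phase flow described in the comments (capped reasoning, then an appended answer suffix) can be wired up model-agnostically. In this sketch, `generate` is any completion function you supply, such as a wrapper around transformers' `model.generate` with `max_new_tokens`; the driver itself and its names are illustrative, not code from the paper:

```python
ANSWER_SUFFIX = "\n\nBased on the above reasoning, the JSON function call is:\nJSON:"

def brief_cot_call(prompt, generate, budget_tokens=16, answer_tokens=128):
    """Two-phase brief-CoT function call (hypothetical driver).

    generate(prompt, max_new_tokens=...) -> completion string; plug in any
    backend. Phase 1 caps reasoning at `budget_tokens`; phase 2 appends the
    answer suffix and decodes the final JSON call.
    """
    # Phase 1: brief reasoning, hard-capped by the backend's token limit.
    reasoning = generate(prompt, max_new_tokens=budget_tokens)
    # Phase 2: append the suffix and let the model emit the JSON call.
    answer_prompt = prompt + reasoning + ANSWER_SUFFIX
    return generate(answer_prompt, max_new_tokens=answer_tokens)
```

Budget enforcement is delegated entirely to the backend's `max_new_tokens`, so the same driver serves both the d=16 and d=32 settings discussed above.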
Original Abstract
How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.