The Shorter the Better: The Non-Monotonic Effect of Chain-of-Thought Token Budgets in Function-Calling Agents

Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

Apr 2, 2026 • Xuan Qi

TL;DR Highlight

Function-calling agents do best with a very short CoT (32 tokens or fewer); with 256 tokens they do worse than with no reasoning at all.

Who Should Read

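Backend and AI developers who build LLM-based tool-calling (function calling) agents or are working on prompt optimization, especially anyone currently choosing a reasoning token budget or a CoT strategy.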

Core Mechanics

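On Qwen2.5-1.5B-Instruct, answering directly without CoT gives 44.0% accuracy; adding a 32-token CoT raises it to 64.0% (+20pp, a 45% relative gain). A 256-token CoT instead drops accuracy to 25.0%, below the no-CoT baseline.
A finer-grained sweep puts the true optimum at 8 to 16 tokens, where d is the reasoning token budget: 69.0% at d=16 and 68.0% at d=8, both above d=32 (63 to 64%).
Decomposing the errors into three types: without CoT, wrong function selection dominates at 30.5% of tasks. A 32-token CoT cuts this to 1.5%, so brief CoT is effectively acting as function routing (deciding which function to call).
With long CoT (256 tokens), wrong function selection climbs back to 28.0%, and hallucination, where the model invents a function that is not among the candidates, rises to 18.0%. The longer the model reasons, the more it steers itself off course.
FR-CoT (Function-Routing CoT) forces the reasoning to open with the format 'Function: [name] / Key args: [...]', so the function name is the first thing the model generates. With only a prompt template and no fine-tuning, it reduced function hallucination to 0.0% in the paper's experiments.
Qwen2.5-7B shows the same non-monotonic pattern (d=32: 82.5%, d=256: 18.0%). Phi-3-mini-4k-instruct declines monotonically but never drops below the baseline, because at long budgets Phi-3 emits EOS (the end-of-generation token) on its own at a 68% rate and so stops early.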
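
The error decomposition described above can be sketched roughly as follows. This is a minimal illustration, not the paper's evaluation code: the summary names only two of the three error types (wrong function selection and hallucinated functions), so the third bucket, `other_error`, is an assumption, and the function names, field names, and toy records are all hypothetical.

```python
from collections import Counter


def classify_outcome(predicted_name, gold_name, candidate_names, args_match):
    """Assign one outcome label to a single predicted function call."""
    if predicted_name not in candidate_names:
        return "hallucinated_function"  # name is not among the offered candidates
    if predicted_name != gold_name:
        return "wrong_function"  # a real candidate, but not the right one
    if not args_match:
        return "other_error"  # assumed third bucket, e.g. wrong arguments
    return "correct"


def outcome_rates(records):
    """Return each outcome's share of all tasks, in percent."""
    counts = Counter(classify_outcome(**record) for record in records)
    return {label: 100.0 * n / len(records) for label, n in counts.items()}


# Hypothetical toy records, not data from the paper.
CANDIDATES = {"get_weather", "get_forecast", "send_email"}
records = [
    {"predicted_name": "get_weather", "gold_name": "get_weather",
     "candidate_names": CANDIDATES, "args_match": True},
    {"predicted_name": "get_forecast", "gold_name": "get_weather",
     "candidate_names": CANDIDATES, "args_match": True},
    {"predicted_name": "fetch_weather_data", "gold_name": "get_weather",
     "candidate_names": CANDIDATES, "args_match": True},
    {"predicted_name": "get_weather", "gold_name": "get_weather",
     "candidate_names": CANDIDATES, "args_match": False},
]
print(outcome_rates(records))
```

Running the same tally once per budget d would show whether wrong selections and hallucinations move the way the numbers above describe.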

Evidence

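Qwen2.5-1.5B: d=32 → 64.0%, d=256 → 25.0% (McNemar p<0.001), a 39pp drop from d=32 to d=256.
Qwen2.5-7B: d=32 → 82.5%, d=256 → 18.0% (McNemar p<0.001). The collapse is more severe on the 7B model (a 64.5pp drop).
FR-CoT vs. constrained decoding (forcing the function name via log-probabilities): on the 7B model, FR-CoT reaches 83.0% vs. 63.5% for constrained decoding at d=32, a +19.5pp gap (non-overlapping 95% CIs).
88.6% of all solvable tasks need at most 32 reasoning tokens (d*≤32), and the oracle averages only 27.6 reasoning tokens.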
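
To make the numbers above easier to check, here is a small sketch of the arithmetic behind the percentage-point and relative changes, plus an exact two-sided McNemar test. The accuracy figures come from the summary; the discordant pair counts (`b=90`, `c=12`) are invented for illustration, chosen only so that b - c = 78 matches 128 vs. 50 correct tasks out of 200. The paper's actual counts are not given here, and the function names are hypothetical.

```python
from math import comb


def pp_and_relative(before, after):
    """Return (change in percentage points, relative change in percent)."""
    return after - before, 100.0 * (after - before) / before


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks solved under condition A only.
    c: tasks solved under condition B only.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Accuracy figures reported in the summary.
print(pp_and_relative(44.0, 64.0))  # no CoT -> d=32 on Qwen2.5-1.5B
print(pp_and_relative(64.0, 25.0))  # d=32 -> d=256 on Qwen2.5-1.5B
print(pp_and_relative(82.5, 18.0))  # d=32 -> d=256 on Qwen2.5-7B

# Hypothetical discordant counts, NOT taken from the paper.
p_value = mcnemar_exact(b=90, c=12)
print(p_value < 0.001)
```

McNemar's test looks only at tasks where the two budgets disagree, which is why it suits comparing two conditions evaluated on the same task set.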

How to Apply

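If your function-calling agent prompt currently allows a generous CoT budget such as 512 tokens, cut it to 8 to 32 tokens. State a hard cap explicitly, for example 'Think step by step (use at most 16 tokens)', and back it with a generation limit such as max_new_tokens=16.
In production environments where hallucination is costly, use the FR-CoT template. End the prompt with 'Function: ' so that the first thing the model generates is the function name. This is a prompt-only change with no fine-tuning, and it brought function hallucination to 0.0% in the paper's experiments.
If you are using constrained decoding (forcing the function name via logits), note that FR-CoT was 19.5pp more accurate on the 7B model. It needs no logit access and no extra forward pass, only a prompt change.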
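
One way to apply the hard cap is a two-stage call: generate the reasoning under a small token limit, then append an answer cue and generate the JSON. The sketch below assumes a hypothetical `generate(prompt, max_new_tokens)` wrapper around whatever inference API you use; `brief_cot_call`, `fake_generate`, and the `answer_budget` default of 128 are illustrative assumptions, not values from the paper.

```python
from typing import Callable

# Cue appended after the capped reasoning to elicit the final call.
ANSWER_SUFFIX = "\n\nBased on the above reasoning, the JSON function call is:\nJSON:"


def brief_cot_call(
    prompt: str,
    generate: Callable[[str, int], str],
    reasoning_budget: int = 16,
    answer_budget: int = 128,  # assumed value, not from the paper
) -> str:
    """Two-stage decoding: capped reasoning first, then the final JSON call."""
    reasoning = ""
    if reasoning_budget > 0:
        # Stage 1: the reasoning is cut off at reasoning_budget new tokens.
        reasoning = generate(prompt, reasoning_budget)
    # Stage 2: the answer is generated conditioned on the truncated reasoning.
    return generate(prompt + reasoning + ANSWER_SUFFIX, answer_budget)


# Stand-in for a real model so the sketch runs without any dependencies.
calls = []


def fake_generate(prompt: str, max_new_tokens: int) -> str:
    calls.append(max_new_tokens)
    if prompt.endswith("JSON:"):
        return ' {"name": "get_weather", "arguments": {"city": "Seoul"}}'
    return " weather lookup, city is Seoul"


result = brief_cot_call("User query: weather in Seoul?\nReasoning:", fake_generate)
print(result)
```

Setting `reasoning_budget=0` skips stage 1 entirely, which gives the no-CoT baseline with the same answer cue.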

Code Example

# FR-CoT prompt template example
# End the prompt with 'Function: ' so the model is forced to generate the function name first

FR_COT_PROMPT_TEMPLATE = """
You are a function-calling assistant. Given the user query and available functions, call the correct function.

Available functions:
{function_schemas}

User query: {user_query}

Step 1 -- Identify:
Function: """
# The model starts generating here, so the function name comes first.
# The reasoning then continues with:
# Key args: [arg=value, ...]
# [Based on the above, the JSON function call is:]
# JSON: ...

# Plain brief CoT example (d=16 or d=32)
BRIEF_COT_PROMPT_TEMPLATE = """
You are a function-calling assistant.

Available functions:
{function_schemas}

User query: {user_query}

Think step by step (use at most 16 tokens).
Reasoning: """
# Hard-cap the reasoning with max_new_tokens=16.
# Then append the following and generate the final answer:
# "\n\nBased on the above reasoning, the JSON function call is:\nJSON:"
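
Because an FR-CoT prompt ends with 'Function: ', the model's continuation should begin with the function name, followed by a 'Key args:' line. The sketch below shows one way to parse that continuation and check the name against the candidate list before acting on it. The validation step is an extra safeguard assumed here, not something the paper describes, and `parse_fr_cot` and the sample strings are hypothetical.

```python
import re


def parse_fr_cot(continuation: str, valid_names: set[str]) -> tuple[str, str]:
    """Split an FR-CoT continuation into (function_name, key_args_text).

    Raises ValueError if the first line is not one of the candidate names.
    """
    lines = continuation.strip().splitlines()
    name = lines[0].strip() if lines else ""
    if name not in valid_names:
        raise ValueError(f"unknown function name: {name!r}")
    key_args = ""
    for line in lines[1:]:
        match = re.match(r"\s*Key args:\s*(.*)", line)
        if match:
            key_args = match.group(1).strip()
            break
    return name, key_args


# Hypothetical model continuation and candidate list.
candidates = {"get_weather", "get_forecast"}
sample = " get_weather\nKey args: [city=Seoul, unit=celsius]\n"
print(parse_fr_cot(sample, candidates))
```

Rejecting an unknown name at this point lets the caller retry or fall back instead of dispatching a call to a function that does not exist.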

Terminology

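Chain-of-Thought (CoT): A technique that has the model write out its reasoning before giving the final answer, like showing your work on a math problem.

Function Calling: The ability of an LLM to call predefined APIs or tools by outputting a function name and arguments as JSON. A chatbot calling a weather API directly is the typical example.

Hallucination: The model making up something that does not exist. Here, it means producing a function name that is not in the candidate function list.

Non-monotonic: A pattern where performance does not keep improving as a value increases and instead gets worse past some point. Here it contradicts the common assumption that more tokens always help.

McNemar p<0.001: McNemar's test is a paired significance test that checks whether two conditions evaluated on the same tasks differ in accuracy by more than chance. p<0.001 means a difference this large would be very unlikely if the two conditions were equally accurate.

EOS (End of Sequence): The special token with which the model stops generating on its own. A high EOS rate means the model stops naturally before using up its budget.

Constrained Decoding: A technique that restricts model output to a fixed set of values. Here, the function name is forced to be the most probable one in the candidate list.

BFCL (Berkeley Function Calling Leaderboard): A standard benchmark for evaluating LLM function calling, made up of tasks that require picking the correct function and arguments from several candidates.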

Original Abstract

How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.