Division-of-Thoughts: A Hybrid SLM-LLM Collaborative Reasoning Framework for On-Device Agents

Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents

Feb 6, 2025 • Chenyang Shao, Xinyuan Hu, Yutang Lin +1

TL;DR Highlight

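An agent framework that splits complex queries into small sub-tasks, sends the easy ones to a local SLM and the hard ones to a cloud LLM, and cuts API cost by about 83% while keeping accuracy comparable.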

Who Should Read

Developers running LLM agents in mobile or edge environments who want to reduce API cost and latency, especially teams building on-device AI assistants or smartphone agents.

Core Mechanics

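Decomposing a query into sub-tasks reveals that even complex requests contain many easy sub-tasks a local SLM can handle; DoT exploits this to reduce cloud calls.
Dependencies between sub-tasks are modeled as a DAG (directed acyclic graph) so that independent sub-tasks run in parallel, which contributes to the reported 66% reduction in reasoning time.
An MLP adapter attached to Llama 3-8B classifies sub-task difficulty; the base model's parameters are left untouched and the adapter is detachable.
Adapter training data is generated automatically without human labeling: the α-Tree algorithm uses the uncertainty in token probabilities as a difficulty signal.
Across 7 benchmarks with GPT-4o as the cloud LLM and Llama 3-8B as the local SLM, API cost drops by 83.57% and processing time by 66.12% on average, while accuracy stays about 2% below the best baseline on average.
The adapter generalizes: transfer to other benchmarks within the math category gives nearly identical performance, and cross-category transfer still clearly beats no training.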
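
The DAG-based parallelism described above can be sketched as follows. This is a minimal illustration, not the paper's scheduler: the task names, the dependency map, and the per-task durations are all hypothetical, and the sketch assumes tasks are executed in synchronized waves (every task in a wave starts together and the next wave starts when the slowest one finishes).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_waves(deps):
    """Group sub-tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependency map: task -> set of tasks it depends on.
deps = {"s1": set(), "s2": set(), "s3": {"s1", "s2"}, "s4": {"s3"}}
# Hypothetical per-task durations in seconds.
durations = {"s1": 2, "s2": 3, "s3": 1, "s4": 2}

waves = parallel_waves(deps)
sequential_time = sum(durations.values())
parallel_time = sum(max(durations[t] for t in wave) for wave in waves)

print(waves)
print(sequential_time, parallel_time)
```

Only `s1` and `s2` are independent here, so the saving comes entirely from overlapping those two; graphs with more independent branches benefit more.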

Evidence

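Across 7 benchmarks, average API cost falls by 83.57% and processing time by 66.12% (relative to the highest-accuracy baseline).
On the DROP benchmark, DoT reaches 85% accuracy versus 80% for CoT (GPT-4o), at a cost of 0.32¢ versus 1.30¢: cheaper and more accurate.
α-Tree (n=1) achieves an SLM utilization rate of 85.53% and a success rate of 99.44%, both better than zero-shot LLM evaluation (53.11%, 92.78%).
Task decomposition independence: 91.3% for DoT versus 74.2% for vanilla prompting (verified on 50 manually labeled MATH samples).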
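
To put the DROP numbers above in relative terms, the short calculation below derives the cost saving and the accuracy gap from the four figures quoted in this section. The variable names are illustrative; no other data is assumed.

```python
# Figures quoted above for the DROP benchmark.
dot_cost_cents, cot_cost_cents = 0.32, 1.30
dot_accuracy, cot_accuracy = 85, 80

# Relative cost saving of DoT versus CoT (GPT-4o).
cost_saving = 1 - dot_cost_cents / cot_cost_cents
# Accuracy difference in percentage points.
accuracy_gain_points = dot_accuracy - cot_accuracy

print(f"cost saving: {cost_saving:.1%}, accuracy gain: {accuracy_gain_points} points")
```

On this single benchmark the saving works out to roughly three quarters of the CoT cost, which is below the 83.57% average reported across all 7 benchmarks.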

How to Apply

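In an LLM agent pipeline, first decompose the user query into sub-tasks, then inspect the local model's token probability distribution for each sub-task and route only the high-uncertainty ones to the cloud API.
Identify sub-task dependencies and run the parallelizable steps concurrently with async, with a time reduction of 30% or more reported for complex workflows with 8 or more sub-tasks.
Add only a small MLP head to an on-device SLM (Llama 3-8B class) and train it as a difficulty classifier, with training data generated automatically from actual inference feedback, so no separate labeling cost is incurred.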
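
One way the routing and async steps above might fit together is sketched below. It is a sketch under stated assumptions: `run_local` and `run_cloud` are hypothetical stand-ins for an on-device SLM call and a cloud LLM API call, and `is_hard` is a placeholder predicate for whatever difficulty signal is used (the adapter or a token-probability rule). None of these names come from the paper.

```python
import asyncio


async def run_local(task):
    """Hypothetical stand-in for an on-device SLM call."""
    await asyncio.sleep(0)
    return f"local:{task}"


async def run_cloud(task):
    """Hypothetical stand-in for a cloud LLM API call."""
    await asyncio.sleep(0)
    return f"cloud:{task}"


async def run_dag(deps, is_hard):
    """Run sub-tasks wave by wave, routing each one by difficulty."""
    results = {}
    pending = dict(deps)
    while pending:
        # A task is ready once all of its dependencies have results.
        ready = [t for t, d in pending.items() if d.issubset(results)]
        if not ready:
            raise ValueError("dependency cycle detected")
        outputs = await asyncio.gather(
            *(run_cloud(t) if is_hard(t) else run_local(t) for t in ready)
        )
        results.update(zip(ready, outputs))
        for t in ready:
            del pending[t]
    return results


# Hypothetical example: s3 depends on s1 and s2 and is the only hard task.
deps = {"s1": set(), "s2": set(), "s3": {"s1", "s2"}}
hard_tasks = {"s3"}
results = asyncio.run(run_dag(deps, lambda t: t in hard_tasks))
print(results)
```

In a real pipeline the dependent task would also receive the outputs of its predecessors as context; that is omitted here to keep the control flow visible.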

Code Example

# Task Decomposition prompt pattern (based on Appendix C.1 of the paper)
decompose_prompt = """
I will now give you a [problem type] question.
Please break this problem down into several easy-to-solve steps.

Examples:
[8 hand-crafted few-shot examples]

Now the question is: {user_query}
Please decompose it into easy-to-solve steps.

Answer Format:
To solve the question "{user_query}", we need to know:
"1. {step1}", "2. {step2}", "3. {step3}"...
"""

# Dependency Construction prompt pattern (based on Appendix C.2)
dependency_prompt = """
Given the following subtasks for: [{original_question}]

Subtasks:
{subtask_list}

Please list the dependencies in the format:
'Subproblem A [xxx] -> Subproblem B [xxx]'
indicating that Subproblem A must be completed before Subproblem B.

Answer format (strictly follow, no explanation):
Step_i [sub-problem i] -> Step_j [sub-problem j]
"""

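# Sentence-embedding prompt for difficulty estimation (based on Section 4.4)
# Nudges the SLM to compress the sub-task's meaning into a single token
embedding_prompt = 'This sentence: "{subtask_text}" means in one word: '
# → feed the hidden state of the last generated token into the difficulty-classification MLP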
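
The reply to the dependency prompt above still has to be turned into a graph before anything can be scheduled. A minimal parsing sketch follows. It assumes the model follows the `Step_i [...] -> Step_j [...]` answer format literally with integer step indices; the regular expression, the function name `parse_dependencies`, and the sample reply are illustrative, not taken from the paper.

```python
import re

# Matches lines such as: Step_1 [find the radius] -> Step_3 [compute the area]
EDGE = re.compile(r"Step_(\d+)\s*\[[^\]]*\]\s*->\s*Step_(\d+)\s*\[[^\]]*\]")


def parse_dependencies(reply, n_steps):
    """Map each step index to the set of step indices it depends on."""
    deps = {i: set() for i in range(1, n_steps + 1)}
    for line in reply.splitlines():
        match = EDGE.search(line)
        if not match:
            continue  # skip explanations or malformed lines
        src, dst = int(match.group(1)), int(match.group(2))
        # Ignore out-of-range indices and self-loops.
        if src in deps and dst in deps and src != dst:
            deps[dst].add(src)
    return deps


# Hypothetical model reply for a three-step decomposition.
reply = """Step_1 [find the radius] -> Step_3 [compute the area]
Step_2 [find the height] -> Step_3 [compute the area]"""
deps = parse_dependencies(reply, 3)
print(deps)
```

Skipping lines that do not match keeps the parser tolerant of stray explanations, at the price of silently dropping malformed edges; a stricter variant could raise instead.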

Terminology

SLM: Short for Small Language Model. A small language model, such as Llama 3-8B, that can run on a smartphone or PC. It is less capable than cloud-hosted GPT-4o but incurs no API cost and responds quickly.

DAG: Directed Acyclic Graph. A graph that expresses ordering between tasks with arrows and contains no cycles (loops). Used to express dependencies such as 'step 2 can start only after step 1 finishes'.

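CoT: Short for Chain-of-Thought. A prompting technique that has the LLM work through intermediate steps before answering, in the style of 'think step by step'. It often improves accuracy substantially on math problems and logical reasoning.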

ToT: Short for Tree-of-Thoughts. Extends CoT by generating several candidates at each step and selecting the good ones. More accurate, but costs several times as much as CoT.

α-quantile: A chosen percentile of the token generation probabilities. Because token probabilities drop when the model is less confident, a low value marks the sub-task as hard.

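In-Context Learning: Learning a new pattern from just a few examples placed in the prompt, without changing model parameters. Similar to skimming a few worked examples right before an exam to get a feel for the question type.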

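Plug-and-Play Adapter: A small add-on module that attaches to and detaches from an existing model. The base model's weights are left untouched, so its existing capabilities are preserved, and swapping only the adapter repurposes the model.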
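
The α-quantile entry above can be made concrete with a small sketch. It uses a nearest-rank quantile over per-token probabilities and a fixed cutoff. The values `alpha=0.25` and `threshold=0.5`, the function names, and the sample probabilities are all hypothetical, and the sketch does not reproduce the paper's α-Tree procedure.

```python
import math


def alpha_quantile(token_probs, alpha):
    """Nearest-rank alpha-quantile of per-token generation probabilities."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    ordered = sorted(token_probs)
    rank = max(1, math.ceil(alpha * len(ordered)))
    return ordered[rank - 1]


def is_hard(token_probs, alpha=0.25, threshold=0.5):
    """Flag a sub-task as hard when its low-end token confidence is below the cutoff."""
    return alpha_quantile(token_probs, alpha) < threshold


# Hypothetical per-token probabilities from a local SLM's answer.
confident_answer = [0.9, 0.95, 0.8, 0.99]
uncertain_answer = [0.2, 0.9, 0.4, 0.95]

print(is_hard(confident_answer), is_hard(uncertain_answer))
```

Looking at a low quantile rather than the mean makes the signal sensitive to a few very uncertain tokens, which a handful of high-probability tokens would otherwise average away.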

Original Abstract

The rapid expansion of web content has made on-device AI assistants indispensable for helping users manage the increasing complexity of online tasks. The emergent reasoning ability in large language models offer a promising path for next-generation on-device AI agents. However, deploying full-scale Large Language Models (LLMs) on resource-limited local devices is challenging. In this paper, we propose Division-of-Thoughts (DoT), a collaborative reasoning framework leveraging the synergy between locally deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT leverages a Task Decomposer to elicit the inherent planning abilities in language models to decompose user queries into smaller sub-tasks, which allows hybrid language models to fully exploit their respective strengths. Besides, DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks and create a dependency graph, facilitating parallel reasoning of sub-tasks and the identification of key steps. To allocate the appropriate model based on the difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an additional task head attached to the SLM that does not alter the SLM's parameters. To boost adapter's task allocation capability, we propose a self-reinforced training method that relies solely on task execution feedback. Extensive experiments on various benchmarks demonstrate that our DoT significantly reduces LLM costs while maintaining competitive reasoning accuracy. Specifically, DoT reduces the average reasoning time and API costs by 66.12% and 83.57%, while achieving comparable reasoning accuracy with the best baseline methods.