Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
TL;DR Highlight
A prompting technique that cuts tokens by up to 84% compared to CoT while maintaining accuracy, achieved just by changing the system prompt
Who Should Read
AI/backend developers looking to reduce LLM API costs or response latency, especially those using CoT prompting with GPT-4o or Claude and feeling the burden of token costs.
Core Mechanics
- Reasoning is divided into 3 paradigms: Conceptual Chaining, Chunked Symbolism, and Expert Lexicons — each specialized for commonsense/math/medical-type problems respectively
- A lightweight 67M-parameter DistilBERT-based router automatically selects which paradigm to use by classifying the question (96.4% routing accuracy, 0.012s inference time)
- Average 74% token reduction across 18 datasets, with an average accuracy loss of only 0.83% — statistically insignificant
- On math problems, accuracy actually improves — on Qwen-2.5-32B, CoT 84.17% → SoT 86.94%, while tokens drop from 222 → 88
- Replacing CoT with SoT in ensemble pipelines like Self-Consistency, Self-Refine, and Multi-Agent Debate maintains or improves performance
- Multilingual experiments in Korean, Italian, and German also show 80–84% token reduction, with accuracy dropping by at most 1.33%
Evidence
- GPT-4o: CoT 84.64% vs SoT 84.55% — 0.09% accuracy difference with 76.2% token reduction
- Claude 3.5 Sonnet: CoT 85.01% vs SoT 84.50% — 0.51% accuracy difference with 68.99% token reduction
- In Multi-Agent Debate, SoT improved accuracy by 0.57% over CoT while reducing tokens by 68.9%
- Korean MMMLU: CoT 74.20% vs SoT 73.40% — 0.80% accuracy drop, tokens reduced from 308 → 49 (84.09% reduction)
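The reported reductions are easy to sanity-check from the raw token counts quoted above. A minimal helper (values taken directly from the numbers in this post):

```python
def token_reduction(cot_tokens: int, sot_tokens: int) -> float:
    """Percentage of tokens saved by SoT relative to CoT."""
    return (cot_tokens - sot_tokens) / cot_tokens * 100

# Korean MMMLU counts quoted above: 308 -> 49 tokens
print(round(token_reduction(308, 49), 2))  # 84.09
# Qwen-2.5-32B math counts quoted above: 222 -> 88 tokens
print(round(token_reduction(222, 88), 1))  # 60.4
```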
How to Apply
- Choose a paradigm based on the question type and swap in the corresponding system prompt: Chunked Symbolism for equations/calculations, Conceptual Chaining for commonsense/multi-hop, and Expert Lexicons for medical/legal/engineering domains
- Plug in the public router model from HuggingFace (saytes/SoT_DistilBERT) to automate paradigm selection — a single classification call determines which prompt to use
- For existing Self-Consistency or multi-agent pipelines built on CoT, simply replace the system prompt with the SoT version to apply it immediately
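The Self-Consistency swap in the last bullet can be sketched as follows. Here `generate` is a placeholder for whatever LLM call your pipeline already makes (not a real API); the only change SoT requires is the system prompt passed to it:

```python
import re
from collections import Counter
from typing import Callable

def self_consistency(generate: Callable[[str, str], str],
                     system_prompt: str, question: str, k: int = 5) -> str:
    """Sample k reasoning chains and majority-vote on the boxed answers.
    Swapping a CoT system prompt for an SoT one is the entire change."""
    answers = []
    for _ in range(k):
        output = generate(system_prompt, question)  # one sampled chain
        match = re.search(r"\\boxed\{(.+?)\}", output)
        if match:
            answers.append(match.group(1).strip())
    return Counter(answers).most_common(1)[0][0]

# Stub "model" for illustration: returns canned SoT-style outputs.
canned = iter([r"<think>v = 15 + 25</think> \boxed{40}",
               r"<think>v = 15 + 25</think> \boxed{40}",
               r"<think>v = 15 + 2.5*10</think> \boxed{40 m/s}"])
print(self_consistency(lambda s, q: next(canned), "SoT prompt", "Final velocity?", k=3))  # 40
```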
Code Example
# Conceptual Chaining paradigm system prompt (for commonsense/multi-hop reasoning)
SYSTEM_PROMPT_CC = """
You are a reasoning expert specializing in structured concept linking.
Extract key terms and present reasoning as stepwise chains using arrows (→).
Do NOT use full sentences. Keep each step minimal.
Format:
<think>
#concept_A → #concept_B → answer
</think>
\\boxed{Final Answer}
"""
# Chunked Symbolism paradigm system prompt (for math/calculation problems)
SYSTEM_PROMPT_CS = """
You are a reasoning expert using Chunked Symbolism.
Define variables, write equations, solve step-by-step with minimal text.
Format:
<think>
var1 = value, var2 = value
result = var1 op var2 = number
</think>
\\boxed{Final Answer}
"""
# Expert Lexicons paradigm system prompt (for medical/engineering/legal domains)
SYSTEM_PROMPT_EL = """
You are a reasoning expert using Expert Lexicons.
Use domain-specific abbreviations, symbols, and jargon to compress reasoning.
No full sentences. Maximize information density.
Format:
<think>
TERM → definition, ACRONYM ∈ {components}, ∴ conclusion
</think>
\\boxed{Final Answer}
"""
# Automatic paradigm selection using the router model (using public HuggingFace model)
from transformers import pipeline
router = pipeline("text-classification", model="saytes/SoT_DistilBERT")
def get_sot_prompt(question: str) -> str:
    result = router(question)[0]["label"]
    mapping = {
        "conceptual_chaining": SYSTEM_PROMPT_CC,
        "chunked_symbolism": SYSTEM_PROMPT_CS,
        "expert_lexicons": SYSTEM_PROMPT_EL,
    }
    return mapping.get(result, SYSTEM_PROMPT_CC)
# Usage example
question = "A car accelerates at 2.5 m/s² for 10 seconds from 15 m/s. Final velocity?"
system_prompt = get_sot_prompt(question)
print(f"Selected paradigm prompt: {system_prompt[:50]}...")
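For reference, the Chunked Symbolism sketch one would expect the router to elicit for this question boils down to a single kinematics step (v_final = v0 + a*t). A quick check of the expected boxed answer, independent of any model call:

```python
# v_final = v0 + a * t  (uniform acceleration; values from the example question)
v0, a, t = 15.0, 2.5, 10.0
v_final = v0 + a * t
print(v_final)  # 40.0
```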
Original Abstract
Recent advances in large language models (LLMs) have enabled strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs, leading to increased computational overhead. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints to reduce token usage while preserving reasoning accuracy. SoT is designed as a flexible, modular approach and is instantiated with three paradigms--Conceptual Chaining, Chunked Symbolism, and Expert Lexicons--each tailored to distinct reasoning tasks and selected dynamically at test-time by a lightweight routing model. Across 18 reasoning datasets spanning multiple domains, languages, and modalities, SoT achieves token reductions of up to 84% with minimal accuracy loss. In tasks such as mathematical and multi-hop reasoning, it even improves accuracy while shortening outputs.