Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
TL;DR Highlight
A prompting technique that cuts tokens by up to 84% compared to CoT while maintaining accuracy, achieved just by changing the system prompt
Who Should Read
AI/backend developers looking to reduce LLM API costs or response latency, especially those using CoT prompting with GPT-4o or Claude and feeling the burden of token costs.
Core Mechanics
- Reasoning is divided into 3 paradigms: Conceptual Chaining, Chunked Symbolism, and Expert Lexicons — each specialized for commonsense/math/medical-type problems respectively
- A lightweight 67M-parameter DistilBERT-based router automatically selects which paradigm to use by classifying the question (96.4% routing accuracy, 0.012s inference time)
- Average 74% token reduction across 18 datasets, with an average accuracy loss of only 0.83% — statistically insignificant
- On math problems, accuracy actually improves — on Qwen-2.5-32B, CoT 84.17% → SoT 86.94%, while tokens drop from 222 → 88
- Replacing CoT with SoT in ensemble pipelines like Self-Consistency, Self-Refine, and Multi-Agent Debate maintains or improves performance
- Multilingual experiments in Korean, Italian, and German also show 80–84% token reduction, with accuracy dropping by at most 1.33%
Evidence
- GPT-4o: CoT 84.64% vs SoT 84.55% — 0.09% accuracy difference with 76.2% token reduction
- Claude 3.5 Sonnet: CoT 85.01% vs SoT 84.50% — 0.51% accuracy difference with 68.99% token reduction
- In Multi-Agent Debate, SoT improved accuracy by 0.57% over CoT while reducing tokens by 68.9%
- Korean MMMLU: CoT 74.20% vs SoT 73.40% — 0.80% accuracy drop, tokens reduced from 308 → 49 (84.09% reduction)
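The reported reductions are easy to sanity-check from the raw token counts quoted above. A minimal helper (values taken directly from the numbers in this post):

```python
def token_reduction(cot_tokens: int, sot_tokens: int) -> float:
    """Percentage of tokens saved by SoT relative to CoT."""
    return (cot_tokens - sot_tokens) / cot_tokens * 100

# Korean MMMLU counts quoted above: 308 -> 49 tokens
print(round(token_reduction(308, 49), 2))  # 84.09
# Qwen-2.5-32B math counts quoted above: 222 -> 88 tokens
print(round(token_reduction(222, 88), 1))  # 60.4
```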
How to Apply
- Choose a paradigm based on the question type and swap in the corresponding system prompt: Chunked Symbolism for equations/calculations, Conceptual Chaining for commonsense/multi-hop, and Expert Lexicons for medical/legal/engineering domains
- Plug in the public router model from HuggingFace (saytes/SoT_DistilBERT) to automate paradigm selection — a single classification call determines which prompt to use
- For existing Self-Consistency or multi-agent pipelines built on CoT, simply replace the system prompt with the SoT version to apply it immediately
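The Self-Consistency swap in the last bullet can be sketched as follows. Here `generate` is a placeholder for whatever LLM call your pipeline already makes (not a real API); the only change SoT requires is the system prompt passed to it:

```python
import re
from collections import Counter
from typing import Callable

def self_consistency(generate: Callable[[str, str], str],
                     system_prompt: str, question: str, k: int = 5) -> str:
    """Sample k reasoning chains and majority-vote on the boxed answers.
    Swapping a CoT system prompt for an SoT one is the entire change."""
    answers = []
    for _ in range(k):
        output = generate(system_prompt, question)  # one sampled chain
        match = re.search(r"\\boxed\{(.+?)\}", output)
        if match:
            answers.append(match.group(1).strip())
    return Counter(answers).most_common(1)[0][0]

# Stub "model" for illustration: returns canned SoT-style outputs.
canned = iter([r"<think>v = 15 + 25</think> \boxed{40}",
               r"<think>v = 15 + 25</think> \boxed{40}",
               r"<think>v = 15 + 2.5*10</think> \boxed{40 m/s}"])
print(self_consistency(lambda s, q: next(canned), "SoT prompt", "Final velocity?", k=3))  # 40
```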
Code Example
# Conceptual Chaining paradigm system prompt (for commonsense/multi-hop reasoning)
SYSTEM_PROMPT_CC = """
You are a reasoning expert specializing in structured concept linking.
Extract key terms and present reasoning as stepwise chains using arrows (→).
Do NOT use full sentences. Keep each step minimal.
Format:
<think>
#concept_A → #concept_B → answer
</think>
\\boxed{Final Answer}
"""
# Chunked Symbolism paradigm system prompt (for math/calculation problems)
SYSTEM_PROMPT_CS = """
You are a reasoning expert using Chunked Symbolism.
Define variables, write equations, solve step-by-step with minimal text.
Format:
<think>
var1 = value, var2 = value
result = var1 op var2 = number
</think>
\\boxed{Final Answer}
"""
# Expert Lexicons paradigm system prompt (for medical/engineering/legal domains)
SYSTEM_PROMPT_EL = """
You are a reasoning expert using Expert Lexicons.
Use domain-specific abbreviations, symbols, and jargon to compress reasoning.
No full sentences. Maximize information density.
Format:
<think>
TERM → definition, ACRONYM ∈ {components}, ∴ conclusion
</think>
\\boxed{Final Answer}
"""
# Automatic paradigm selection using the router model (using public HuggingFace model)
from transformers import pipeline
router = pipeline("text-classification", model="saytes/SoT_DistilBERT")
def get_sot_prompt(question: str) -> str:
    result = router(question)[0]["label"]
    mapping = {
        "conceptual_chaining": SYSTEM_PROMPT_CC,
        "chunked_symbolism": SYSTEM_PROMPT_CS,
        "expert_lexicons": SYSTEM_PROMPT_EL,
    }
    return mapping.get(result, SYSTEM_PROMPT_CC)
# Usage example
question = "A car accelerates at 2.5 m/s² for 10 seconds from 15 m/s. Final velocity?"
system_prompt = get_sot_prompt(question)
print(f"Selected paradigm prompt: {system_prompt[:50]}...")
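For reference, the Chunked Symbolism sketch one would expect the router to elicit for this question boils down to a single kinematics step (v_final = v0 + a*t). A quick check of the expected boxed answer, independent of any model call:

```python
# v_final = v0 + a * t  (uniform acceleration; values from the example question)
v0, a, t = 15.0, 2.5, 10.0
v_final = v0 + a * t
print(v_final)  # 40.0
```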
Original Abstract
Recent advances in large language models (LLMs) have enabled strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs, leading to increased computational overhead. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints to reduce token usage while preserving reasoning accuracy. SoT is designed as a flexible, modular approach and is instantiated with three paradigms--Conceptual Chaining, Chunked Symbolism, and Expert Lexicons--each tailored to distinct reasoning tasks and selected dynamically at test-time by a lightweight routing model. Across 18 reasoning datasets spanning multiple domains, languages, and modalities, SoT achieves token reductions of up to 84% with minimal accuracy loss. In tasks such as mathematical and multi-hop reasoning, it even improves accuracy while shortening outputs.