Chain of Thought Prompting Elicits Reasoning in Large Language Models
TL;DR Highlight
Showing LLMs few-shot examples that include the worked 'thinking process' dramatically improves math, commonsense, and symbolic reasoning — demonstrated experimentally.
Who Should Read
Developers building complex reasoning features with GPT-4, Claude, or similar LLMs. AI engineers who want to improve model performance through prompt engineering alone.
Core Mechanics
- Adding few-shot examples that demonstrate a step-by-step reasoning process (chain of thought) to the prompt lets models handle complex reasoning much better
- PaLM 540B with just 8 chain-of-thought examples hits 56.9% on GSM8K (math benchmark), surpassing fine-tuned GPT-3
- This effect is 'large model only' — on models under roughly 100B parameters it either has no effect or hurts performance; the paper frames it as an emergent ability of scale
- Works beyond math: commonsense reasoning (StrategyQA 75.6% vs previous best 69.4%), sports knowledge (95.4% vs human expert 84%), symbolic manipulation, and more
- The order matters: reasoning process first, then answer — ablations that place the explanation after the answer, use only equations, or substitute dots for the extra computation don't help
- Achieved without fine-tuning — one model handles various reasoning tasks across domains using only prompts
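The formatting contrast at the heart of the method can be sketched as two exemplar styles (minimal illustrative strings, not the paper's exact exemplars; only the chain-of-thought version puts reasoning before the final answer):

```python
# Standard few-shot exemplar: question -> answer only.
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: The answer is 11."
)

# Chain-of-thought exemplar: the reasoning steps come FIRST,
# then the final answer in a fixed "The answer is N." format.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)

def build_prompt(exemplars: list[str], question: str) -> str:
    """Concatenate exemplars and append the new question with an empty answer slot."""
    return "\n".join(exemplars) + f"\nQ: {question}\nA:"

prompt = build_prompt(
    [cot_exemplar],
    "There are 4 boxes of 12 pencils. How many pencils are there?",
)
```

The fixed "The answer is N." suffix also makes the final answer easy to extract with a simple string search at evaluation time.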
Evidence
- PaLM 540B + chain-of-thought: GSM8K 56.9% (vs standard prompting 17.9%, +39 pp; beats fine-tuned GPT-3 at 55%)
- Codex + chain-of-thought: GSM8K 63.1% (+43.4 pp), SVAMP 76.4%, ASDiv 80.4%
- StrategyQA: PaLM 540B chain-of-thought 75.6% — surpasses supervised SOTA 69.4%
- Sports understanding: PaLM 540B chain-of-thought 95.4% — beats human expert level of 84%
How to Apply
- For prompts requiring complex calculation or reasoning, don't just write the answer in few-shot examples — format them as 'step-by-step solution → final answer.' With GPT-4-class models you'll see immediate results.
- If arithmetic errors are frequent, add post-processing that re-evaluates the equation parts generated by chain-of-thought using Python eval for extra accuracy (LaMDA 137B baseline: GSM8K 14.3% → 17.8%).
- For zero-shot use, try adding 'Let's think step by step.' and if that's not enough, write 8 or so reasoning examples and switch to few-shot.
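The eval-based post-processing from the second tip can be sketched like this (a hypothetical `recompute_equations` helper, not the paper's code: it finds simple `a op b = c` patterns in the generated chain of thought and replaces the stated result with a recomputed one):

```python
import re

# Matches simple binary arithmetic like "23 - 20 = 3" inside generated reasoning.
EQUATION = re.compile(r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)")

def recompute_equations(cot_text: str) -> str:
    """Re-evaluate each arithmetic step so the chain of thought carries
    the correct result even when the model miscalculates."""
    def fix(m: re.Match) -> str:
        a, op, b = m.group(1), m.group(2), m.group(3)
        result = eval(f"{a} {op} {b}")  # safe here: operands are regex-validated numbers
        # Render whole numbers without a trailing ".0".
        if isinstance(result, float) and result.is_integer():
            result = int(result)
        return f"{a} {op} {b} = {result}"
    return EQUATION.sub(fix, cot_text)

# A model output with a deliberate arithmetic slip (23 - 20 is not 4):
print(recompute_equations("so 23 - 20 = 4 remained"))
# -> "so 23 - 20 = 3 remained"
```

Note this corrects each equation in isolation; it does not propagate a corrected result into later steps that reused the wrong number, so it helps most when the reasoning chain is right but individual calculations slip.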
Code Example
# Chain-of-Thought prompt example (OpenAI API)
from openai import OpenAI
client = OpenAI()
cot_prompt = """
Q: There were 23 apples. 20 were used at lunch and 6 more were bought. How many apples are there?
A: There were 23 to start. 20 were used at lunch, so 23 - 20 = 3 remained. 6 more were bought, so 3 + 6 = 9. The answer is 9.
Q: Roger has 5 tennis balls. He bought 2 cans of 3 balls each. How many tennis balls does he have now?
A: Roger started with 5. 2 cans × 3 balls = 6 balls were added. 5 + 6 = 11. The answer is 11.
Q: There were 15 cars in a parking lot. 8 left in the morning and 12 arrived in the afternoon. How many cars are there now?
A:"""
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
# Expected output (model wording may vary): Started with 15. 8 left, so 15 - 8 = 7 remained. 12 arrived, so 7 + 12 = 19. The answer is 19.
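The zero-shot variant mentioned under How to Apply needs no exemplars — appending the trigger phrase is enough. A minimal sketch (the message-building helper is ours, not from the paper):

```python
# Zero-shot chain of thought: no exemplars, just the trigger phrase
# popularized by Kojima et al.'s follow-up work.
TRIGGER = "Let's think step by step."

def zero_shot_cot_messages(question: str) -> list[dict]:
    """Build a chat message list that appends the zero-shot CoT trigger."""
    return [{"role": "user", "content": f"{question}\n\n{TRIGGER}"}]

msgs = zero_shot_cot_messages(
    "There were 15 cars in a parking lot. 8 left and 12 arrived. "
    "How many cars are there now?"
)
# Pass `msgs` to client.chat.completions.create(model=..., messages=msgs) as above.
```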
Original Abstract
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.