Chain of Thought Prompting Elicits Reasoning in Large Language Models
TL;DR Highlight
Showing LLMs few-shot examples that include the worked 'thinking process' dramatically improves math, commonsense, and symbolic reasoning — demonstrated experimentally.
Who Should Read
Developers building complex reasoning features with GPT-4, Claude, or similar LLMs. AI engineers who want to improve model performance through prompt engineering alone.
Core Mechanics
- Adding few-shot examples that demonstrate a step-by-step reasoning process (chain of thought) to the prompt lets models handle complex reasoning much better
- PaLM 540B with just 8 chain-of-thought examples hits 56.9% on GSM8K (math benchmark), surpassing fine-tuned GPT-3
- This effect is 'large model only' — on models under roughly 100B parameters it either has no effect or hurts performance; the paper frames it as an emergent ability of scale
- Works beyond math: commonsense reasoning (StrategyQA 75.6% vs previous best 69.4%), sports knowledge (95.4% vs human expert 84%), symbolic manipulation, and more
- The order matters: reasoning process first, then answer — ablations that place the explanation after the answer, use only equations, or substitute dots for the extra computation don't help
- Achieved without fine-tuning — one model handles various reasoning tasks across domains using only prompts
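The formatting contrast at the heart of the method can be sketched as two exemplar styles (minimal illustrative strings, not the paper's exact exemplars; only the chain-of-thought version puts reasoning before the final answer):

```python
# Standard few-shot exemplar: question -> answer only.
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: The answer is 11."
)

# Chain-of-thought exemplar: the reasoning steps come FIRST,
# then the final answer in a fixed "The answer is N." format.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)

def build_prompt(exemplars: list[str], question: str) -> str:
    """Concatenate exemplars and append the new question with an empty answer slot."""
    return "\n".join(exemplars) + f"\nQ: {question}\nA:"

prompt = build_prompt(
    [cot_exemplar],
    "There are 4 boxes of 12 pencils. How many pencils are there?",
)
```

The fixed "The answer is N." suffix also makes the final answer easy to extract with a simple string search at evaluation time.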
Evidence
- PaLM 540B + chain-of-thought: GSM8K 56.9% (vs standard prompting 17.9%, +39 pp; beats fine-tuned GPT-3 at 55%)
- Codex + chain-of-thought: GSM8K 63.1% (+43.4 pp), SVAMP 76.4%, ASDiv 80.4%
- StrategyQA: PaLM 540B chain-of-thought 75.6% — surpasses supervised SOTA 69.4%
- Sports understanding: PaLM 540B chain-of-thought 95.4% — beats human expert level of 84%
How to Apply
- For prompts requiring complex calculation or reasoning, don't just write the answer in few-shot examples — format them as 'step-by-step solution → final answer.' With GPT-4-class models you'll see immediate results.
- If arithmetic errors are frequent, add post-processing that re-evaluates the equation parts generated by chain-of-thought using Python eval for extra accuracy (LaMDA 137B baseline: GSM8K 14.3% → 17.8%).
- For zero-shot use, try adding 'Let's think step by step.' and if that's not enough, write 8 or so reasoning examples and switch to few-shot.
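The eval-based post-processing from the second tip can be sketched like this (a hypothetical `recompute_equations` helper, not the paper's code: it finds simple `a op b = c` patterns in the generated chain of thought and replaces the stated result with a recomputed one):

```python
import re

# Matches simple binary arithmetic like "23 - 20 = 3" inside generated reasoning.
EQUATION = re.compile(r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)")

def recompute_equations(cot_text: str) -> str:
    """Re-evaluate each arithmetic step so the chain of thought carries
    the correct result even when the model miscalculates."""
    def fix(m: re.Match) -> str:
        a, op, b = m.group(1), m.group(2), m.group(3)
        result = eval(f"{a} {op} {b}")  # safe here: operands are regex-validated numbers
        # Render whole numbers without a trailing ".0".
        if isinstance(result, float) and result.is_integer():
            result = int(result)
        return f"{a} {op} {b} = {result}"
    return EQUATION.sub(fix, cot_text)

# A model output with a deliberate arithmetic slip (23 - 20 is not 4):
print(recompute_equations("so 23 - 20 = 4 remained"))
# -> "so 23 - 20 = 3 remained"
```

Note this corrects each equation in isolation; it does not propagate a corrected result into later steps that reused the wrong number, so it helps most when the reasoning chain is right but individual calculations slip.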
Code Example
# Chain-of-Thought prompt example (OpenAI API)
from openai import OpenAI
client = OpenAI()
cot_prompt = """
Q: There were 23 apples. 20 were used at lunch and 6 more were bought. How many apples are there?
A: There were 23 to start. 20 were used at lunch, so 23 - 20 = 3 remained. 6 more were bought, so 3 + 6 = 9. The answer is 9.
Q: Roger has 5 tennis balls. He bought 2 cans of 3 balls each. How many tennis balls does he have now?
A: Roger started with 5. 2 cans × 3 balls = 6 balls were added. 5 + 6 = 11. The answer is 11.
Q: There were 15 cars in a parking lot. 8 left in the morning and 12 arrived in the afternoon. How many cars are there now?
A:"""
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
# Expected output (model wording may vary): Started with 15. 8 left, so 15 - 8 = 7 remained. 12 arrived, so 7 + 12 = 19. The answer is 19.
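The zero-shot variant mentioned under How to Apply needs no exemplars — appending the trigger phrase is enough. A minimal sketch (the message-building helper is ours, not from the paper):

```python
# Zero-shot chain of thought: no exemplars, just the trigger phrase
# popularized by Kojima et al.'s follow-up work.
TRIGGER = "Let's think step by step."

def zero_shot_cot_messages(question: str) -> list[dict]:
    """Build a chat message list that appends the zero-shot CoT trigger."""
    return [{"role": "user", "content": f"{question}\n\n{TRIGGER}"}]

msgs = zero_shot_cot_messages(
    "There were 15 cars in a parking lot. 8 left and 12 arrived. "
    "How many cars are there now?"
)
# Pass `msgs` to client.chat.completions.create(model=..., messages=msgs) as above.
```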
Original Abstract
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.