Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
TL;DR Highlight
A prompting technique that breaks complex problems into simpler sub-problems solved sequentially — crushes Chain-of-Thought on hard problems.
Who Should Read
Backend/AI developers using LLMs to handle complex multi-step reasoning tasks. Especially developers experiencing accuracy drops with Chain-of-Thought on long inputs or complex problems.
Core Mechanics
- Two-stage processing: first ask 'what sub-problems can this be broken into?' then solve sub-problems in order, using each previous answer for the next
- Chain-of-Thought (CoT) degrades on problems harder than prompt examples (e.g., longer inputs) — Least-to-Most solves this easy-to-hard generalization problem
- On SCAN benchmark (natural language command → action sequence), GPT-3 code-davinci-002 + Least-to-Most hits 99.7% with just 14 examples. CoT gets only 16.2% under same conditions
- Specialized neuro-symbolic SCAN models needed 15,000+ training examples — this approach gets equal or better performance with a few examples and no fine-tuning
- On GSM8K problems requiring 5+ steps, CoT gets 39.07% vs Least-to-Most 45.23% — the gap grows on harder problems
- Limitation: requires designing new decomposition prompts per domain. Math decomposition prompts don't transfer to commonsense reasoning
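The sequential dependence in the second stage can be seen in a model-free sketch of the last-letter-concatenation task used in the paper: each sub-problem's answer is the previous answer extended by one step. The loop below stands in for the model (no API calls); it only illustrates the chaining pattern, not the prompting itself.

```python
def last_letter_concat(words):
    """Solve last-letter concatenation the least-to-most way:
    the answer for words[:i+1] reuses the answer for words[:i]."""
    answer = ""
    for word in words:
        # Sub-problem i: take the previous answer and append one more
        # last letter, just as the prompt chains 'A, B' into 'A, B, C'.
        answer = answer + word[-1]
    return answer

print(last_letter_concat(["think", "machine", "learning"]))  # keg
```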
Evidence
- SCAN length split accuracy: standard prompting 16.7%, CoT 16.2%, Least-to-Most 99.7% (code-davinci-002)
- Last-letter-concatenation 12-word list: CoT 31.8% vs Least-to-Most 74.0% (2-shot). Even 8-shot CoT (38.4%) loses to 2-shot Least-to-Most (74.0%)
- DROP benchmark (paragraph reading + numerical reasoning): CoT 74.77% vs Least-to-Most 82.45% on non-football subset
- GSM8K 5+ step problems: CoT 39.07% → Least-to-Most 45.23%, roughly a 6 percentage-point improvement
How to Apply
- For multi-step reasoning tasks (e.g., complex query analysis, multi-hop QA), split your prompt in 2: ① 'What sub-questions can this be broken into?' ② Solve each sub-question in order, appending previous answers as context for the next prompt
- When designing prompt examples, use dependent examples in 'smaller case → larger case that includes it' order — the model learns the recursive pattern. E.g., [solve 'A, B'] → [use 'A, B' result to solve 'A, B, C']
- For outputs requiring long sequences, use intermediate representations like Python expressions to improve token efficiency, then expand with a post-processing script for the final output
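As a sketch of that last tip: suppose the model emits a compact Python-expression intermediate such as `'"JUMP " * 3'` instead of the full action sequence (the exact output grammar here is an assumption, in the spirit of the paper's SCAN prompts). A small post-processing step can then expand it locally, saving output tokens:

```python
def expand_intermediate(expr: str) -> str:
    """Expand a compact Python-expression intermediate (e.g. '"JUMP " * 3')
    into the full action sequence. eval is acceptable only in this demo,
    where we control the grammar; a production system should use a parser."""
    return eval(expr, {"__builtins__": {}}).strip()

print(expand_intermediate('"JUMP " * 3'))        # JUMP JUMP JUMP
print(expand_intermediate('"LTURN WALK " * 2'))  # LTURN WALK LTURN WALK
```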
Code Example
# Least-to-Most Prompting implementation example (Python + OpenAI API)
import openai
client = openai.OpenAI()
# Stage 1: Problem decomposition prompt
decomposition_prompt = """Break down a complex problem step by step.
Q: Elsa has 5 apples. Anna has 2 more than Elsa. How many apples do they have together?
A: Sub-questions to solve this problem:
1. How many apples does Anna have?
2. How many apples do they have together?
Q: {user_question}
A:"""
# Stage 2: Solve sub-problems sequentially, feeding each answer into the next prompt
def solve_subproblems(subproblems, context=""):
    results = []
    for subproblem in subproblems:
        solving_prompt = f"""{context}
Sub-question: {subproblem}
Answer:"""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": solving_prompt}],
        )
        answer = response.choices[0].message.content
        # Append this Q/A pair so later sub-questions can build on it
        context += f"\nQ: {subproblem}\nA: {answer}"
        results.append({"subproblem": subproblem, "answer": answer})
    return results, context
# Execution
question = "Tom earns $3 a day, and Jerry earns twice as much as Tom. How much will they have combined after one week?"
# Step 1: Decompose
decomp_response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": decomposition_prompt.format(user_question=question)}]
)
# In practice, parse the sub-questions from decomp_response.choices[0].message.content;
# they are hardcoded here to keep the example deterministic
subproblems = [
    "How much does Jerry earn per day?",
    "How much does Tom earn in a week?",
    "How much does Jerry earn in a week?",
    "How much do they earn combined in a week?",
]
# Step 2: Solve sequentially (using previous answers as context)
results, final_context = solve_subproblems(subproblems)
print(final_context)
Original Abstract
Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.