Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
TL;DR Highlight
A prompting technique that breaks complex problems into simpler sub-problems solved sequentially — crushes Chain-of-Thought on hard problems.
Who Should Read
Backend/AI developers using LLMs to handle complex multi-step reasoning tasks. Especially developers experiencing accuracy drops with Chain-of-Thought on long inputs or complex problems.
Core Mechanics
- Two-stage processing: first ask 'what sub-problems can this be broken into?' then solve sub-problems in order, using each previous answer for the next
- Chain-of-Thought (CoT) degrades on problems harder than prompt examples (e.g., longer inputs) — Least-to-Most solves this easy-to-hard generalization problem
- On SCAN benchmark (natural language command → action sequence), GPT-3 code-davinci-002 + Least-to-Most hits 99.7% with just 14 examples. CoT gets only 16.2% under same conditions
- Specialized neuro-symbolic SCAN models needed 15,000+ training examples — this approach gets equal or better performance with a few examples and no fine-tuning
- On GSM8K problems requiring 5+ steps, CoT gets 39.07% vs Least-to-Most 45.23% — the gap grows on harder problems
- Limitation: requires designing new decomposition prompts per domain. Math decomposition prompts don't transfer to commonsense reasoning
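The sequential dependence in the second stage can be seen in a model-free sketch of the last-letter-concatenation task used in the paper: each sub-problem's answer is the previous answer extended by one step. The loop below stands in for the model (no API calls); it only illustrates the chaining pattern, not the prompting itself.

```python
def last_letter_concat(words):
    """Solve last-letter concatenation the least-to-most way:
    the answer for words[:i+1] reuses the answer for words[:i]."""
    answer = ""
    for word in words:
        # Sub-problem i: take the previous answer and append one more
        # last letter, just as the prompt chains 'A, B' into 'A, B, C'.
        answer = answer + word[-1]
    return answer

print(last_letter_concat(["think", "machine", "learning"]))  # keg
```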
Evidence
- SCAN length split accuracy: standard prompting 16.7%, CoT 16.2%, Least-to-Most 99.7% (code-davinci-002)
- Last-letter-concatenation 12-word list: CoT 31.8% vs Least-to-Most 74.0% (2-shot). Even 8-shot CoT (38.4%) loses to 2-shot Least-to-Most (74.0%)
- DROP benchmark (paragraph reading + numerical reasoning): CoT 74.77% vs Least-to-Most 82.45% on non-football subset
- GSM8K 5+ step problems: CoT 39.07% → Least-to-Most 45.23%, roughly a 6 percentage-point improvement
How to Apply
- For multi-step reasoning tasks (e.g., complex query analysis, multi-hop QA), split your prompt in 2: ① 'What sub-questions can this be broken into?' ② Solve each sub-question in order, appending previous answers as context for the next prompt
- When designing prompt examples, use dependent examples in 'smaller case → larger case that includes it' order — the model learns the recursive pattern. E.g., [solve 'A, B'] → [use 'A, B' result to solve 'A, B, C']
- For outputs requiring long sequences, use intermediate representations like Python expressions to improve token efficiency, then expand with a post-processing script for the final output
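As a sketch of that last tip: suppose the model emits a compact Python-expression intermediate such as `'"JUMP " * 3'` instead of the full action sequence (the exact output grammar here is an assumption, in the spirit of the paper's SCAN prompts). A small post-processing step can then expand it locally, saving output tokens:

```python
def expand_intermediate(expr: str) -> str:
    """Expand a compact Python-expression intermediate (e.g. '"JUMP " * 3')
    into the full action sequence. eval is acceptable only in this demo,
    where we control the grammar; a production system should use a parser."""
    return eval(expr, {"__builtins__": {}}).strip()

print(expand_intermediate('"JUMP " * 3'))        # JUMP JUMP JUMP
print(expand_intermediate('"LTURN WALK " * 2'))  # LTURN WALK LTURN WALK
```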
Code Example
# Least-to-Most Prompting implementation example (Python + OpenAI API)
import openai
client = openai.OpenAI()
# Stage 1: Problem decomposition prompt
decomposition_prompt = """Break down a complex problem step by step.
Q: Elsa has 5 apples. Anna has 2 more than Elsa. How many apples do they have together?
A: Sub-questions to solve this problem:
1. How many apples does Anna have?
2. How many apples do they have together?
Q: {user_question}
A:"""
# Stage 2: Solve sub-problems sequentially, feeding each answer into the next prompt
def solve_subproblems(subproblems, context=""):
    results = []
    for subproblem in subproblems:
        solving_prompt = f"""{context}
Sub-question: {subproblem}
Answer:"""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": solving_prompt}],
        )
        answer = response.choices[0].message.content
        # Append this Q/A pair so later sub-questions can build on it
        context += f"\nQ: {subproblem}\nA: {answer}"
        results.append({"subproblem": subproblem, "answer": answer})
    return results, context
# Execution
question = "Tom earns $3 a day, and Jerry earns twice as much as Tom. How much will they have combined after one week?"
# Step 1: Decompose
decomp_response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": decomposition_prompt.format(user_question=question)}]
)
# In practice, parse the sub-questions from decomp_response.choices[0].message.content;
# they are hardcoded here to keep the example deterministic
subproblems = [
    "How much does Jerry earn per day?",
    "How much does Tom earn in a week?",
    "How much does Jerry earn in a week?",
    "How much do they earn combined in a week?",
]
# Step 2: Solve sequentially (using previous answers as context)
results, final_context = solve_subproblems(subproblems)
print(final_context)
Original Abstract
Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.