Large Language Models Are Human-Level Prompt Engineers
TL;DR Highlight
The APE algorithm automatically generates and selects prompts from input-output examples, reaching human-level or better performance on all 24 Instruction Induction tasks.
Who Should Read
ML engineers or LLM app developers who spend too much time writing prompts by hand. Directly applicable when you want to optimize system prompts for specific tasks.
Core Mechanics
- Proposes APE (Automatic Prompt Engineer) — given just a few input-output examples, an LLM generates dozens to hundreds of prompt candidates and picks the best-scoring one
- On InstructGPT (text-davinci-002), achieves equal or better performance than human-crafted prompts on all 24/24 Instruction Induction tasks (IQM 0.810 vs human 0.749)
- Zero-Shot Chain-of-Thought prompts can also be auto-optimized — APE found 'Let's work this out in a step by step way to be sure we have the right answer.' improves over 'Let's think step by step.' (MultiArith 78.7→82.0, GSM8K 40.7→43.0)
- Prepending generated prompts to few-shot in-context learning improves performance on 21 of 24 tasks — prompts can be up to 5x more token-efficient
- APE-found prompts on TruthfulQA raise simultaneous truthfulness+informativeness rate from human prompts (30%) to over 40%
- Prompts are optimized for the generation model — using InstructGPT-generated prompts with GPT-3 drops performance significantly. Generation model = execution model for best results
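At its core, the propose-then-select loop above is an argmax over a pool of LLM-generated candidates under a task-specific score function. A minimal offline sketch of that selection step (the toy scores below are invented; in APE the score is the execution accuracy of another LLM following each instruction):

```python
def ape_select(candidates, score):
    """APE's selection step: keep the candidate instruction maximizing the score."""
    return max(candidates, key=score)

# Toy stand-in for execution accuracy on validation examples (values are invented).
toy_scores = {
    "Write the first letter of the word.": 1.0,
    "Repeat the word.": 0.0,
}
best = ape_select(list(toy_scores), toy_scores.get)
```

Any callable scorer slots in here, which is what makes the same loop reusable for zero-shot prompts, CoT prefixes, and few-shot prefixes alike.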
Evidence
- 24/24 Instruction Induction tasks: APE (InstructGPT) IQM 0.810 exceeds human 0.749
- 17 of 21 BIG-Bench tasks: equal or better zero-shot performance vs human-written prompts
- MultiArith: existing CoT 78.7 → APE CoT 82.0 / GSM8K: 40.7 → 43.0
- Sampling 64 prompt candidates achieves human-level performance; performance increases monotonically with more samples
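As a back-of-envelope illustration of why sampling more candidates helps (assuming, purely for illustration and not as a claim from the paper, that candidate scores were i.i.d. Uniform(0,1)), the expected best-of-n score is n/(n+1), which grows monotonically in n:

```python
def expected_best_of(n: int) -> float:
    """Expected maximum of n i.i.d. Uniform(0,1) scores: n / (n + 1)."""
    return n / (n + 1)

# Diminishing but monotone returns as the candidate pool grows.
for n in (4, 16, 64, 256):
    print(n, round(expected_best_of(n), 3))
```

Real score distributions are neither uniform nor independent, but the qualitative shape (monotone improvement with diminishing returns) matches the paper's observation.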
How to Apply
- To optimize a system prompt for a specific task: prepare 5-10 input-output examples, have GPT-4 or InstructGPT generate 50 prompt candidates using 'The instruction was <COMPLETE>' template → run each candidate on a validation set → rank by accuracy → adopt the top prompt
- To improve Chain-of-Thought prompts: filter problems where 'Let's think step by step.' gives correct answers to build a CoT dataset, then use APE to explore various 'Let's'-prefixed variations and find the best-performing prefix
- If prompt candidates are insufficient, apply Iterative Monte Carlo Search: take high-scoring prompts and loop 3-5 times using 'Generate a variation of the following instruction while keeping the semantic meaning.' template to mutate and re-sample
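The Iterative Monte Carlo Search step above can be sketched offline. In this sketch `mutate` and `score` are mocks: the real `mutate` asks the LLM to rewrite the prompt with the variation template quoted above, and the real `score` measures validation accuracy.

```python
import random

random.seed(0)

def mutate(prompt: str) -> str:
    """Mock resampling step: the real version asks the LLM for a semantic-preserving
    variation of the prompt. Here we just append a tag so the loop is runnable."""
    return f"{prompt} [v{random.randint(0, 999)}]"

def score(prompt: str) -> float:
    """Mock scorer standing in for execution accuracy on a validation set."""
    return random.random()

def iterative_mc_search(seed_prompts, rounds=3, samples_per_prompt=4, top_k=2):
    pool = {p: score(p) for p in seed_prompts}
    for _ in range(rounds):
        # Keep the top-k prompts, then resample variants around each of them.
        top = sorted(pool, key=pool.get, reverse=True)[:top_k]
        for p in top:
            for _ in range(samples_per_prompt):
                v = mutate(p)
                pool[v] = score(v)
    return max(pool, key=pool.get)

best = iterative_mc_search(["Output the first letter of the input."])
```

The explore/exploit balance comes from `top_k` (how many survivors are mutated) and `samples_per_prompt` (how widely each survivor is perturbed); 3-5 rounds is the range suggested above.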
Code Example
```python
# APE core flow — Python sketch (legacy openai<1.0 Completion API, matching text-davinci-002)
import openai

def generate_prompt_candidates(demos, n=50):
    """demos: input-output examples in [(input, output), ...] format"""
    demo_str = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in demos)
    meta_prompt = f"""I gave a friend an instruction and five inputs.
The friend read the instruction and wrote an output for every one of the inputs.
Here are the input-output pairs:
{demo_str}
The instruction was"""
    candidates = []
    for _ in range(n):
        resp = openai.Completion.create(
            model="text-davinci-002",
            prompt=meta_prompt,
            max_tokens=50,
            temperature=0.9,  # high temperature: diverse candidates
        )
        candidates.append(resp.choices[0].text.strip())
    return candidates

def score_candidate(instruction, val_demos, model="text-davinci-002"):
    """Evaluate prompt quality by execution accuracy on held-out examples."""
    correct = 0
    for q, a in val_demos:
        prompt = f"Instruction: {instruction}\nInput: {q}\nOutput:"
        resp = openai.Completion.create(
            model=model, prompt=prompt, max_tokens=20, temperature=0
        )
        if resp.choices[0].text.strip() == a:
            correct += 1
    return correct / len(val_demos)

# Main APE loop: propose, score, select
train_demos = [("cat", "c"), ("dog", "d"), ("apple", "a")]  # examples
val_demos = [("banana", "b"), ("orange", "o")]

candidates = generate_prompt_candidates(train_demos, n=50)
scores = [(c, score_candidate(c, val_demos)) for c in candidates]
best_prompt, best_score = max(scores, key=lambda x: x[1])
print(f"Best prompt: {best_prompt} (score: {best_score:.2f})")

# Template for Zero-shot CoT optimization: APE searches over the <INSERT> slot
cot_meta_prompt = """
Instruction: Answer the following question.
Q: {question}
A: Let's <INSERT>. {reasoning}
"""
```
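Putting the CoT template to work means first filtering a dataset to problems the baseline prefix already solves, then scoring candidate fillings of the `<INSERT>` slot on that set. A hypothetical offline sketch; `solve` below is a mock that evaluates the arithmetic directly instead of calling a model with the prefixed prompt:

```python
problems = [("2 + 2", "4"), ("3 * 3", "9"), ("10 - 7", "3")]

def solve(question: str, cot_prefix: str) -> str:
    """Mock model call: the real version prepends cot_prefix to the prompt and
    runs the LLM. Here we evaluate the toy arithmetic directly."""
    return str(eval(question))

baseline = "Let's think step by step."
# Step 1: keep only problems the baseline prefix answers correctly.
cot_dataset = [(q, a) for q, a in problems if solve(q, baseline) == a]

# Step 2: score each candidate prefix by accuracy on the filtered set.
candidates = [
    "Let's think step by step.",
    "Let's work this out in a step by step way to be sure we have the right answer.",
]
def accuracy(prefix: str) -> float:
    return sum(solve(q, prefix) == a for q, a in cot_dataset) / len(cot_dataset)

best_prefix = max(candidates, key=accuracy)
```

With a real model in `solve`, step 2 is where the two prefixes separate; the mock scores them identically, so the sketch only shows the pipeline shape.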
Original Abstract
By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.