Large Language Models Are Human-Level Prompt Engineers
TL;DR Highlight
The APE algorithm automatically generates and selects prompts from input-output examples, reaching human-level or better performance on all 24 Instruction Induction tasks.
Who Should Read
ML engineers or LLM app developers who spend too much time writing prompts by hand. Directly applicable when you want to optimize system prompts for specific tasks.
Core Mechanics
- Proposes APE (Automatic Prompt Engineer) — given just a few input-output examples, an LLM generates dozens to hundreds of prompt candidates and picks the best-scoring one
- On InstructGPT (text-davinci-002), achieves equal or better performance than human-crafted prompts on all 24/24 Instruction Induction tasks (IQM 0.810 vs human 0.749)
- Zero-Shot Chain-of-Thought prompts can also be auto-optimized — APE found 'Let's work this out in a step by step way to be sure we have the right answer.' improves over 'Let's think step by step.' (MultiArith 78.7→82.0, GSM8K 40.7→43.0)
- Prepending generated prompts to few-shot in-context learning improves performance on 21 of 24 tasks — prompts can be up to 5x more token-efficient
- APE-found prompts on TruthfulQA raise simultaneous truthfulness+informativeness rate from human prompts (30%) to over 40%
- Prompts are optimized for the generation model — using InstructGPT-generated prompts with GPT-3 drops performance significantly. Generation model = execution model for best results
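At its core, the propose-then-select loop above is an argmax over a pool of LLM-generated candidates under a task-specific score function. A minimal offline sketch of that selection step (the toy scores below are invented; in APE the score is the execution accuracy of another LLM following each instruction):

```python
def ape_select(candidates, score):
    """APE's selection step: keep the candidate instruction maximizing the score."""
    return max(candidates, key=score)

# Toy stand-in for execution accuracy on validation examples (values are invented).
toy_scores = {
    "Write the first letter of the word.": 1.0,
    "Repeat the word.": 0.0,
}
best = ape_select(list(toy_scores), toy_scores.get)
```

Any callable scorer slots in here, which is what makes the same loop reusable for zero-shot prompts, CoT prefixes, and few-shot prefixes alike.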
Evidence
- 24/24 Instruction Induction tasks: APE (InstructGPT) IQM 0.810 exceeds human 0.749
- 17 of 21 BIG-Bench tasks: equal or better zero-shot performance vs human-written prompts
- MultiArith: existing CoT 78.7 → APE CoT 82.0 / GSM8K: 40.7 → 43.0
- Sampling 64 prompt candidates achieves human-level performance; performance increases monotonically with more samples
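As a back-of-envelope illustration of why sampling more candidates helps (assuming, purely for illustration and not as a claim from the paper, that candidate scores were i.i.d. Uniform(0,1)), the expected best-of-n score is n/(n+1), which grows monotonically in n:

```python
def expected_best_of(n: int) -> float:
    """Expected maximum of n i.i.d. Uniform(0,1) scores: n / (n + 1)."""
    return n / (n + 1)

# Diminishing but monotone returns as the candidate pool grows.
for n in (4, 16, 64, 256):
    print(n, round(expected_best_of(n), 3))
```

Real score distributions are neither uniform nor independent, but the qualitative shape (monotone improvement with diminishing returns) matches the paper's observation.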
How to Apply
- To optimize a system prompt for a specific task: prepare 5-10 input-output examples, have GPT-4 or InstructGPT generate 50 prompt candidates using 'The instruction was <COMPLETE>' template → run each candidate on a validation set → rank by accuracy → adopt the top prompt
- To improve Chain-of-Thought prompts: filter problems where 'Let's think step by step.' gives correct answers to build a CoT dataset, then use APE to explore various 'Let's'-prefixed variations and find the best-performing prefix
- If prompt candidates are insufficient, apply Iterative Monte Carlo Search: take high-scoring prompts and loop 3-5 times using 'Generate a variation of the following instruction while keeping the semantic meaning.' template to mutate and re-sample
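The Iterative Monte Carlo Search step above can be sketched offline. In this sketch `mutate` and `score` are mocks: the real `mutate` asks the LLM to rewrite the prompt with the variation template quoted above, and the real `score` measures validation accuracy.

```python
import random

random.seed(0)

def mutate(prompt: str) -> str:
    """Mock resampling step: the real version asks the LLM for a semantic-preserving
    variation of the prompt. Here we just append a tag so the loop is runnable."""
    return f"{prompt} [v{random.randint(0, 999)}]"

def score(prompt: str) -> float:
    """Mock scorer standing in for execution accuracy on a validation set."""
    return random.random()

def iterative_mc_search(seed_prompts, rounds=3, samples_per_prompt=4, top_k=2):
    pool = {p: score(p) for p in seed_prompts}
    for _ in range(rounds):
        # Keep the top-k prompts, then resample variants around each of them.
        top = sorted(pool, key=pool.get, reverse=True)[:top_k]
        for p in top:
            for _ in range(samples_per_prompt):
                v = mutate(p)
                pool[v] = score(v)
    return max(pool, key=pool.get)

best = iterative_mc_search(["Output the first letter of the input."])
```

The explore/exploit balance comes from `top_k` (how many survivors are mutated) and `samples_per_prompt` (how widely each survivor is perturbed); 3-5 rounds is the range suggested above.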
Code Example
```python
# APE core flow — Python sketch (legacy openai<1.0 Completion API, matching text-davinci-002)
import openai

def generate_prompt_candidates(demos, n=50):
    """demos: input-output examples in [(input, output), ...] format"""
    demo_str = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in demos)
    meta_prompt = f"""I gave a friend an instruction and five inputs.
The friend read the instruction and wrote an output for every one of the inputs.
Here are the input-output pairs:
{demo_str}
The instruction was"""
    candidates = []
    for _ in range(n):
        resp = openai.Completion.create(
            model="text-davinci-002",
            prompt=meta_prompt,
            max_tokens=50,
            temperature=0.9,  # high temperature: diverse candidates
        )
        candidates.append(resp.choices[0].text.strip())
    return candidates

def score_candidate(instruction, val_demos, model="text-davinci-002"):
    """Evaluate prompt quality by execution accuracy on held-out examples."""
    correct = 0
    for q, a in val_demos:
        prompt = f"Instruction: {instruction}\nInput: {q}\nOutput:"
        resp = openai.Completion.create(
            model=model, prompt=prompt, max_tokens=20, temperature=0
        )
        if resp.choices[0].text.strip() == a:
            correct += 1
    return correct / len(val_demos)

# Main APE loop: propose, score, select
train_demos = [("cat", "c"), ("dog", "d"), ("apple", "a")]  # examples
val_demos = [("banana", "b"), ("orange", "o")]

candidates = generate_prompt_candidates(train_demos, n=50)
scores = [(c, score_candidate(c, val_demos)) for c in candidates]
best_prompt, best_score = max(scores, key=lambda x: x[1])
print(f"Best prompt: {best_prompt} (score: {best_score:.2f})")

# Template for Zero-shot CoT optimization: APE searches over the <INSERT> slot
cot_meta_prompt = """
Instruction: Answer the following question.
Q: {question}
A: Let's <INSERT>. {reasoning}
"""
```
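Putting the CoT template to work means first filtering a dataset to problems the baseline prefix already solves, then scoring candidate fillings of the `<INSERT>` slot on that set. A hypothetical offline sketch; `solve` below is a mock that evaluates the arithmetic directly instead of calling a model with the prefixed prompt:

```python
problems = [("2 + 2", "4"), ("3 * 3", "9"), ("10 - 7", "3")]

def solve(question: str, cot_prefix: str) -> str:
    """Mock model call: the real version prepends cot_prefix to the prompt and
    runs the LLM. Here we evaluate the toy arithmetic directly."""
    return str(eval(question))

baseline = "Let's think step by step."
# Step 1: keep only problems the baseline prefix answers correctly.
cot_dataset = [(q, a) for q, a in problems if solve(q, baseline) == a]

# Step 2: score each candidate prefix by accuracy on the filtered set.
candidates = [
    "Let's think step by step.",
    "Let's work this out in a step by step way to be sure we have the right answer.",
]
def accuracy(prefix: str) -> float:
    return sum(solve(q, prefix) == a for q, a in cot_dataset) / len(cot_dataset)

best_prefix = max(candidates, key=accuracy)
```

With a real model in `solve`, step 2 is where the two prefixes separate; the mock scores them identically, so the sketch only shows the pipeline shape.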
Original Abstract
By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.