Optimizing Prompts for Large Language Models: A Causal Approach
TL;DR Highlight
An automatic prompt optimization framework that causally separates prompt effects from query difficulty, performing especially well on harder queries.
Who Should Read
Backend/ML engineers struggling with inconsistent prompt quality across queries in LLM-based services. Particularly useful for teams running tasks with high difficulty variance, such as math problem solving, code generation, and data analysis.
Core Mechanics
- The core problem with existing APO (Automatic Prompt Optimization): reward models can't distinguish whether high performance is due to a good prompt or an inherently easy query — this creates spurious correlations
- CPO uses DML (Double Machine Learning, a causal inference ML technique) to remove the influence of query characteristics and build a reward model that estimates only the causal effect of the prompt itself
- Prompts and queries are embedded with nomic-embed-text-v1.5 → dimensionality reduced with PCA → causal estimation performed in a continuous vector space
- Two-stage structure: Stage 1 (offline causal reward learning) + Stage 2 (tree search-based prompt optimization using the reward model)
- Particularly strong on difficult queries (MATH Level 5: 82%, DABench Hard: 50%) compared to existing methods — which see sharp performance drops on hard queries
- Causal model performance improves steadily as more data accumulates, and per-query prompt optimization requires only 7 LLM calls, keeping real-world operational costs low
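The bullets above hinge on DML's partialling-out step: residualize both the outcome and the treatment on the confounders, then regress residual on residual. A minimal toy sketch (sklearn only, simulated one-dimensional data; not the paper's CausalForestDML setup) showing how this removes the "easy query" confound:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
# Confounder X: query difficulty. Treatment T: a 1-D "prompt quality" score
# correlated with difficulty (harder queries got better prompts in the logs).
X = rng.normal(size=(n, 1))
T = 0.8 * X[:, 0] + rng.normal(size=n)
# Outcome: the true causal effect of the prompt is 0.5; difficulty hurts the score.
Y = 0.5 * T - 1.0 * X[:, 0] + 0.1 * rng.normal(size=n)

# Naive correlational reward: regress Y on T directly -> badly confounded.
naive = LinearRegression().fit(T.reshape(-1, 1), Y).coef_[0]

# DML partialling-out with cross-fitting: residualize Y and T on X,
# then regress residual on residual to recover the causal effect.
y_res = Y - cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
t_res = T - cross_val_predict(GradientBoostingRegressor(), X, T, cv=5)
theta = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]

print(f"naive estimate: {naive:.3f}, DML estimate: {theta:.3f} (truth: 0.5)")
```

The naive estimate collapses toward zero because high prompt quality co-occurs with high difficulty, while the DML estimate recovers the true effect.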
Evidence
- MATH overall accuracy: CPO 90.00% vs. 88.67% for the best baseline (PromptBreeder); at Level 5 (hardest difficulty), 82% vs. 79-80%
- VisEval overall accuracy: CPO 54.75% vs. 53.25% for APE/OPRO; at Extra Hard difficulty, 34% vs. 26-36% for baselines
- DABench Hard difficulty: CPO 50% vs. 42% for the best baseline (PromptAgent); other methods range from 25-39%
- Causal reward model Kendall's tau-b (prompt ranking predictability): MATH non-causal ML 0.0441 → CPO 0.0608 (+38%), VisEval 0.0980 → 0.1283 (+31%)
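Kendall's tau-b measures how well the reward model's predicted scores rank prompts against their actual outcomes (1.0 = perfect agreement, 0 = no association), which is why even the small absolute values above are informative relative gains. A toy illustration with scipy (hypothetical numbers, not the paper's data):

```python
from scipy.stats import kendalltau

# Hypothetical: predicted rewards for 5 candidate prompts vs. their actual scores.
predicted_reward = [0.31, 0.55, 0.12, 0.90, 0.47]
actual_score     = [2,    3,    1,    5,    4]

tau, p_value = kendalltau(predicted_reward, actual_score)  # tau-b by default
print(f"Kendall tau-b: {tau:.3f}")  # 9 concordant pairs, 1 discordant -> 0.8
```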
How to Apply
- If your service has accumulated tens of thousands of (query, prompt, score) logs, you can train a reusable causal reward model with DML once, then search for an optimal prompt per new query with just 7 LLM calls
- For tasks with high difficulty variance (math solving, code generation, data analysis), if prompt performance swings widely on hard cases, try removing the difficulty effect via query embedding + PCA + DML
- Teams using static prompts (same prompt for all queries) should consider referencing CPO's Stage 2 tree search (B=5 variant generation, K=3 selection, R=3 rounds) to transition to a dynamic per-query prompt selection pipeline
Code Example
A sketch of both stages (Python pseudocode: queries, prompts, scores, and llm_generate_variants are placeholders for your own logged data and LLM call, not library functions):

# CPO two-stage sketch (Python pseudocode)
# Models in the paper: Qwen2.5-14B (task/prompt LLM), nomic-embed-text-v1.5 (embeddings)
import numpy as np
from econml.dml import CausalForestDML
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sentence_transformers import SentenceTransformer

# Stage 1: train the offline causal reward model
embedder = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)

# Embed queries/prompts, then reduce dimensions with PCA.
# Fit the PCA projections once so they can be reused at inference time.
X_emb = embedder.encode(queries)              # query embeddings (confounders X)
T_emb = embedder.encode(prompts)              # prompt embeddings (treatment T)
pca_query = PCA(n_components=40).fit(X_emb)   # queries: 40 dimensions
pca_prompt = PCA(n_components=15).fit(T_emb)  # prompts: 15 dimensions
X_pca = pca_query.transform(X_emb)
T_pca = pca_prompt.transform(T_emb)

# Estimate the causal effect with DML (partials out confounding from query characteristics)
model = CausalForestDML(
    model_y=GradientBoostingClassifier(n_estimators=100, max_depth=3),
    model_t=MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100, max_depth=3)),
    discrete_outcome=True,  # scores are binary correctness labels
    cv=5,                   # cross-fitting folds
)
model.fit(Y=scores, T=T_pca, X=X_pca)

# Stage 2: tree search for the best prompt for a new query.
# With B=5, K=3, R=3 and one generation call per surviving prompt per round,
# this costs 1 + 3 + 3 = 7 LLM calls per query.
def optimize_prompt_for_query(query, seed_prompt, causal_model, B=5, K=3, R=3):
    query_emb = pca_query.transform(embedder.encode([query]))
    baseline = pca_prompt.transform(embedder.encode([seed_prompt]))
    surviving = [seed_prompt]
    best_prompt, best_score = seed_prompt, -np.inf
    for _ in range(R):
        candidates = []
        for prompt in surviving:
            # Generate B variants with the LLM (one call per parent prompt)
            candidates.extend(llm_generate_variants(prompt, n=B))
        # Score with the causal reward model -- no task-LLM calls needed
        cand_embs = pca_prompt.transform(embedder.encode(candidates))
        X_rep = np.repeat(query_emb, len(candidates), axis=0)
        T0_rep = np.repeat(baseline, len(candidates), axis=0)
        causal_scores = causal_model.effect(X_rep, T0=T0_rep, T1=cand_embs)
        # Keep the top-K candidates for the next round
        top = np.argsort(causal_scores)[-K:]
        surviving = [candidates[i] for i in top]
        if causal_scores[top[-1]] > best_score:
            best_score = causal_scores[top[-1]]
            best_prompt = surviving[-1]
    return best_prompt  # final optimal prompt
Original Abstract
Large Language Models (LLMs) are increasingly embedded in enterprise workflows, yet their performance remains highly sensitive to prompt design. Automatic Prompt Optimization (APO) seeks to mitigate this instability, but existing approaches face two persistent challenges. First, commonly used prompt strategies rely on static instructions that perform well on average but fail to adapt to heterogeneous queries. Second, more dynamic approaches depend on offline reward models that are fundamentally correlational, confounding prompt effectiveness with query characteristics. We propose Causal Prompt Optimization (CPO), a framework that reframes prompt design as a problem of causal estimation. CPO operates in two stages. First, it learns an offline causal reward model by applying Double Machine Learning (DML) to semantic embeddings of prompts and queries, isolating the causal effect of prompt variations from confounding query attributes. Second, it utilizes this unbiased reward signal to guide a resource-efficient search for query-specific prompts without relying on costly online evaluation. We evaluate CPO across benchmarks in mathematical reasoning, visualization, and data analytics. CPO consistently outperforms human-engineered prompts and state-of-the-art automated optimizers. The gains are driven primarily by improved robustness on hard queries, where existing methods tend to deteriorate. Beyond performance, CPO fundamentally reshapes the economics of prompt optimization: by shifting evaluation from real-time model execution to an offline causal model, it enables high-precision, per-query customization at a fraction of the inference cost required by online methods. Together, these results establish causal inference as a scalable foundation for reliable and cost-efficient prompt optimization in enterprise LLM deployments.