Optimizing Prompts for Large Language Models: A Causal Approach
TL;DR Highlight
An automatic prompt optimization framework that causally separates prompt effects from query difficulty, performing especially well on harder queries.
Who Should Read
Backend/ML engineers struggling with inconsistent prompt quality across queries in LLM-based services. Particularly useful for teams running tasks with high difficulty variance, such as math problem solving, code generation, and data analysis.
Core Mechanics
- The core problem with existing APO (Automatic Prompt Optimization): reward models can't distinguish whether high performance is due to a good prompt or an inherently easy query — this creates spurious correlations
- CPO uses DML (Double Machine Learning, a causal inference ML technique) to remove the influence of query characteristics and build a reward model that estimates only the causal effect of the prompt itself
- Prompts and queries are embedded with nomic-embed-text-v1.5 → dimensionality reduced with PCA → causal estimation performed in a continuous vector space
- Two-stage structure: Stage 1 (offline causal reward learning) + Stage 2 (tree search-based prompt optimization using the reward model)
- Particularly strong on difficult queries (MATH Level 5: 82%, DABench Hard: 50%) compared to existing methods — which see sharp performance drops on hard queries
- Causal model performance improves steadily as more data accumulates, and per-query prompt optimization requires only 7 LLM calls, keeping real-world operational costs low
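The bullets above hinge on DML's partialling-out step: residualize both the outcome and the treatment on the confounders, then regress residual on residual. A minimal toy sketch (sklearn only, simulated one-dimensional data; not the paper's CausalForestDML setup) showing how this removes the "easy query" confound:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
# Confounder X: query difficulty. Treatment T: a 1-D "prompt quality" score
# correlated with difficulty (harder queries got better prompts in the logs).
X = rng.normal(size=(n, 1))
T = 0.8 * X[:, 0] + rng.normal(size=n)
# Outcome: the true causal effect of the prompt is 0.5; difficulty hurts the score.
Y = 0.5 * T - 1.0 * X[:, 0] + 0.1 * rng.normal(size=n)

# Naive correlational reward: regress Y on T directly -> badly confounded.
naive = LinearRegression().fit(T.reshape(-1, 1), Y).coef_[0]

# DML partialling-out with cross-fitting: residualize Y and T on X,
# then regress residual on residual to recover the causal effect.
y_res = Y - cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
t_res = T - cross_val_predict(GradientBoostingRegressor(), X, T, cv=5)
theta = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]

print(f"naive estimate: {naive:.3f}, DML estimate: {theta:.3f} (truth: 0.5)")
```

The naive estimate collapses toward zero because high prompt quality co-occurs with high difficulty, while the DML estimate recovers the true effect.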
Evidence
- MATH overall accuracy: CPO 90.00% vs. 88.67% for the best baseline (PromptBreeder); at Level 5 (hardest difficulty), 82% vs. 79-80%
- VisEval overall accuracy: CPO 54.75% vs. 53.25% for APE/OPRO; at Extra Hard difficulty, 34% vs. 26-36% for baselines
- DABench Hard difficulty: CPO 50% vs. 42% for the best baseline (PromptAgent); other methods range from 25-39%
- Causal reward model Kendall's tau-b (prompt ranking predictability): MATH non-causal ML 0.0441 → CPO 0.0608 (+38%), VisEval 0.0980 → 0.1283 (+31%)
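Kendall's tau-b measures how well the reward model's predicted scores rank prompts against their actual outcomes (1.0 = perfect agreement, 0 = no association), which is why even the small absolute values above are informative relative gains. A toy illustration with scipy (hypothetical numbers, not the paper's data):

```python
from scipy.stats import kendalltau

# Hypothetical: predicted rewards for 5 candidate prompts vs. their actual scores.
predicted_reward = [0.31, 0.55, 0.12, 0.90, 0.47]
actual_score     = [2,    3,    1,    5,    4]

tau, p_value = kendalltau(predicted_reward, actual_score)  # tau-b by default
print(f"Kendall tau-b: {tau:.3f}")  # 9 concordant pairs, 1 discordant -> 0.8
```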
How to Apply
- If your service has accumulated tens of thousands of (query, prompt, score) logs, you can train a reusable causal reward model with DML once, then search for an optimal prompt per new query with just 7 LLM calls
- For tasks with high difficulty variance (math solving, code generation, data analysis), if prompt performance swings widely on hard cases, try removing the difficulty effect via query embedding + PCA + DML
- Teams using static prompts (same prompt for all queries) should consider referencing CPO's Stage 2 tree search (B=5 variant generation, K=3 selection, R=3 rounds) to transition to a dynamic per-query prompt selection pipeline
Code Example
A sketch of both stages (Python pseudocode: queries, prompts, scores, and llm_generate_variants are placeholders for your own logged data and LLM call, not library functions):

# CPO two-stage sketch (Python pseudocode)
# Models in the paper: Qwen2.5-14B (task/prompt LLM), nomic-embed-text-v1.5 (embeddings)
import numpy as np
from econml.dml import CausalForestDML
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sentence_transformers import SentenceTransformer

# Stage 1: train the offline causal reward model
embedder = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)

# Embed queries/prompts, then reduce dimensions with PCA.
# Fit the PCA projections once so they can be reused at inference time.
X_emb = embedder.encode(queries)              # query embeddings (confounders X)
T_emb = embedder.encode(prompts)              # prompt embeddings (treatment T)
pca_query = PCA(n_components=40).fit(X_emb)   # queries: 40 dimensions
pca_prompt = PCA(n_components=15).fit(T_emb)  # prompts: 15 dimensions
X_pca = pca_query.transform(X_emb)
T_pca = pca_prompt.transform(T_emb)

# Estimate the causal effect with DML (partials out confounding from query characteristics)
model = CausalForestDML(
    model_y=GradientBoostingClassifier(n_estimators=100, max_depth=3),
    model_t=MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100, max_depth=3)),
    discrete_outcome=True,  # scores are binary correctness labels
    cv=5,                   # cross-fitting folds
)
model.fit(Y=scores, T=T_pca, X=X_pca)

# Stage 2: tree search for the best prompt for a new query.
# With B=5, K=3, R=3 and one generation call per surviving prompt per round,
# this costs 1 + 3 + 3 = 7 LLM calls per query.
def optimize_prompt_for_query(query, seed_prompt, causal_model, B=5, K=3, R=3):
    query_emb = pca_query.transform(embedder.encode([query]))
    baseline = pca_prompt.transform(embedder.encode([seed_prompt]))
    surviving = [seed_prompt]
    best_prompt, best_score = seed_prompt, -np.inf
    for _ in range(R):
        candidates = []
        for prompt in surviving:
            # Generate B variants with the LLM (one call per parent prompt)
            candidates.extend(llm_generate_variants(prompt, n=B))
        # Score with the causal reward model -- no task-LLM calls needed
        cand_embs = pca_prompt.transform(embedder.encode(candidates))
        X_rep = np.repeat(query_emb, len(candidates), axis=0)
        T0_rep = np.repeat(baseline, len(candidates), axis=0)
        causal_scores = causal_model.effect(X_rep, T0=T0_rep, T1=cand_embs)
        # Keep the top-K candidates for the next round
        top = np.argsort(causal_scores)[-K:]
        surviving = [candidates[i] for i in top]
        if causal_scores[top[-1]] > best_score:
            best_score = causal_scores[top[-1]]
            best_prompt = surviving[-1]
    return best_prompt  # final optimal prompt
Original Abstract
Large Language Models (LLMs) are increasingly embedded in enterprise workflows, yet their performance remains highly sensitive to prompt design. Automatic Prompt Optimization (APO) seeks to mitigate this instability, but existing approaches face two persistent challenges. First, commonly used prompt strategies rely on static instructions that perform well on average but fail to adapt to heterogeneous queries. Second, more dynamic approaches depend on offline reward models that are fundamentally correlational, confounding prompt effectiveness with query characteristics. We propose Causal Prompt Optimization (CPO), a framework that reframes prompt design as a problem of causal estimation. CPO operates in two stages. First, it learns an offline causal reward model by applying Double Machine Learning (DML) to semantic embeddings of prompts and queries, isolating the causal effect of prompt variations from confounding query attributes. Second, it utilizes this unbiased reward signal to guide a resource-efficient search for query-specific prompts without relying on costly online evaluation. We evaluate CPO across benchmarks in mathematical reasoning, visualization, and data analytics. CPO consistently outperforms human-engineered prompts and state-of-the-art automated optimizers. The gains are driven primarily by improved robustness on hard queries, where existing methods tend to deteriorate. Beyond performance, CPO fundamentally reshapes the economics of prompt optimization: by shifting evaluation from real-time model execution to an offline causal model, it enables high-precision, per-query customization at a fraction of the inference cost required by online methods. Together, these results establish causal inference as a scalable foundation for reliable and cost-efficient prompt optimization in enterprise LLM deployments.