UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization
TL;DR Highlight
Defining prompt objectives as mathematical formulas instead of natural language lets LLMs optimize multiple conditions simultaneously with higher precision.
Who Should Read
Backend/ML engineers designing LLM tasks that must satisfy multiple conditions at once, such as recommendation systems, filtering, or trading agents. Also anyone frustrated that no amount of prompt tweaking delivers the performance they need.
Core Mechanics
- Natural language prompts have fundamental ambiguity — expressions like "moderate risk" or "only comedy and romance movies" are interpreted differently by each model
- Defining objectives as formulas like O(a) = E[X1|A=a] × E[X2|A=a] lets the LLM estimate each condition independently, then multiply the estimates to select the optimal answer
- Consistent performance improvement over natural language-based prompts confirmed on all three models: Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro
- "Harsh prompts" with stronger language actually score lower than Basic on both metrics in GPT-5.4 — increasing natural language intensity doesn't resolve ambiguity
- Works zero-shot without additional training data, labels, or examples; no separate scoring function needed
- When conditions are hierarchically dependent (e.g., genre must pass before rating is meaningful), can be extended with a binary gating structure
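The multiplicative objective described above can be sketched in a few lines. The candidate movies and their estimates below are made up purely for illustration; in the actual framework the LLM produces these estimates internally:

```python
# Toy illustration of O(a) = E[X1|A=a] x E[X2|A=a] x ... : the product
# rewards candidates that satisfy every condition at once and zeroes out
# any candidate that clearly fails one.
candidates = {
    # movie: (E[rating], P(comedy), P(romance)) -- illustrative values
    "Notting Hill":  (4.2, 0.9, 0.9),  # strong on all three conditions
    "The Godfather": (4.8, 0.0, 0.1),  # top rating, wrong genres
    "Scary Movie":   (3.1, 1.0, 0.0),  # comedy but not romance
}

def objective(estimates):
    expected_rating, p_comedy, p_romance = estimates
    return expected_rating * p_comedy * p_romance

best = max(candidates, key=lambda movie: objective(candidates[movie]))
print(best)  # → Notting Hill
```

Note how the highest-rated movie loses: a near-zero genre probability collapses the whole product, which is exactly the behavior a multi-condition task wants.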
Evidence
- On Claude Sonnet 4.6: Precision@10 +12.7%, NDCG@10 +16.5% (vs Basic prompt)
- Statistically significant performance improvement confirmed on all three models — Wilcoxon signed-rank test p < 0.01
- On GPT-5.4, Harsh prompt actually lower than Basic: Precision 0.518 vs 0.532, NDCG 0.712 vs 0.739
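For readers who want to run the same kind of paired test on their own per-query scores, here is a minimal pure-Python exact Wilcoxon signed-rank sketch. The two score arrays are invented stand-ins for per-query NDCG@10 values, not the paper's data:

```python
from itertools import product

def wilcoxon_one_sided_p(a, b):
    """Exact one-sided Wilcoxon signed-rank p-value for H1: a > b pairwise."""
    diffs = [round(x - y, 9) for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):  # assign mid-ranks to tied absolute differences
        j = i
        while j < len(diffs) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Under H0 every sign pattern over the ranks is equally likely.
    hits = sum(1 for signs in product([0, 1], repeat=len(diffs))
               if sum(r for s, r in zip(signs, ranks) if s) >= w_plus)
    return hits / 2 ** len(diffs)

# Invented per-query NDCG@10 scores, purely to show the call pattern.
basic = [0.70, 0.68, 0.74, 0.71, 0.69, 0.73, 0.70, 0.72]
umax  = [0.78, 0.75, 0.80, 0.77, 0.76, 0.81, 0.79, 0.78]
print(wilcoxon_one_sided_p(umax, basic) < 0.01)  # → True
```

The exhaustive 2^n enumeration is only practical for small query sets; for real evaluation sizes, `scipy.stats.wilcoxon` computes the same test efficiently.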
How to Apply
- For multi-condition tasks, formulate the objective as O(a) = E[X1|A=a] × E[X2|A=a], then instruct the LLM with this template: (1) generate candidates → (2) independently estimate expected value for each variable → (3) select the maximum
- For order-dependent conditions (e.g., a rating only matters once the item is in the right category), use binary gating — if the parent node is 0, the child's contribution also converges to 0
- Variable selection is key: too many variables inflate compute cost, while too few leave you optimizing a proxy — extract only the core conditions of your task as variables
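The binary-gating idea above can be sketched as follows. The function name and numbers are ours, for illustration; the gate probability and conditional rating would come from the LLM's own estimates:

```python
# Binary gating sketch: the child term is conditioned on the parent gate,
# so O(a) = P(G=1 | A=a) * E[S | A=a, G=1]. If the parent gate fails
# (P(G=1) = 0), the score collapses to 0 no matter how high the rating.
def gated_objective(p_gate, expected_rating_given_gate):
    return p_gate * expected_rating_given_gate

in_genre  = gated_objective(1.0, 4.5)  # gate passes: rating counts fully
off_genre = gated_objective(0.0, 4.9)  # gate fails: rating is moot
print(in_genre > off_genre)  # → True
```

The design point: conditioning the rating estimate on the gate (E[S | A=a, G=1] rather than E[S | A=a]) keeps the hierarchy explicit — the model is never asked to rate an item that cannot pass the parent condition.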
Code Example
# UtilityMax Prompt Template (Movie Recommendation Example)
prompt = """
I want you to solve the following task: Recommend the top 10 movies for this user.
Formally, let K represent your knowledge. This includes all your internal knowledge
stored through your parameters as well as any external knowledge provided in this prompt.
Let P(A | K) represent your probability distribution over answers given K. Let a be an
answer in A.
Let S | A=a be a random variable representing the predicted user rating score (1-5)
for movie a given the user's watch history.
Let G1 | A=a be a binary random variable representing whether movie a belongs to the
comedy genre.
Let G2 | A=a be a binary random variable representing whether movie a belongs to the
romance genre.
Your task is to find the optimal answer a* that maximises:
O(a) = E[S | A=a] x P(G1=1 | A=a) x P(G2=1 | A=a)
To do this you must:
1. Generate a set of candidate movies.
2. For each candidate, estimate E[S | A=a], P(G1=1 | A=a), and P(G2=1 | A=a)
individually using your internal knowledge, then compute O(a).
3. Return the top 10 movies a* that maximise O.
User's watch history: {watch_history}
"""Terminology
Original Abstract
The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.