CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems
TL;DR Highlight
A novel uncertainty metric for multi-LLM collaboration that simultaneously measures 'how confident each model is' and 'how much the models disagree with each other'
Who Should Read
Developers building agentic systems by combining multiple LLMs. Particularly ML engineers who need to evaluate the reliability of multi-model ensembles in high-stakes domains like healthcare or legal, where incorrect answers carry serious risk.
Core Mechanics
- Existing uncertainty measures (e.g., Semantic Entropy) only capture a single model's internal confidence, failing to detect cases where multiple models are each highly confident but give conflicting answers
- CoE decomposes uncertainty into two components — 'intra-model uncertainty (UA)' and 'inter-model disagreement (UE)' — enabling diagnosis of *why* a system is uncertain
- High UA calls for prompt improvement or sampling diversification, while high UE calls for model alignment — collapsing the two into a single undifferentiated score would erase this diagnostic distinction
- Experiments with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE's advantage grows as the number of models increases (AUROC 0.683 → 0.772)
- A training-free post-processing weight adjustment heuristic based on CoE improves accuracy by up to +39.0% (using asymmetric KL divergence)
- Asymmetric KL divergence significantly outperforms symmetric methods like JS/Wasserstein/Hellinger — because it directionally measures how far each model deviates from the ensemble mean
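The directional property behind that last point can be seen in a small sketch. The distributions below are illustrative assumptions, not numbers from the paper:

```python
import numpy as np
from scipy.special import rel_entr

# Illustrative distributions over 3 semantic clusters (made up for this demo):
mean = np.array([0.55, 0.35, 0.10])   # hypothetical ensemble mean
model = np.array([0.10, 0.80, 0.10])  # a dissenting model's distribution

# KL is asymmetric: KL(model || mean) weights the log-ratios by the *model's*
# probabilities, so it scores how far this particular model strays from the
# consensus — a direction that symmetric measures (JS, Hellinger) average away.
kl_fwd = rel_entr(model, mean).sum()  # KL(model || mean)
kl_rev = rel_entr(mean, model).sum()  # KL(mean || model)
print(f"KL(model||mean) = {kl_fwd:.3f}")
print(f"KL(mean||model) = {kl_rev:.3f}")
```

The two directions give different numbers, which is exactly why the per-model direction can be singled out when diagnosing disagreement.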
Evidence
- On TriviaQA with a 3-model setup, CoE achieves AUROC 0.772, surpassing the previous best baselines UE (0.716) and Semantic Entropy (0.687)
- On SQuAD with a 3-model setup, AUROC reaches 0.878; with 6 models it remains at 0.811, up to 20% above baselines
- CoE-based adjustment heuristic yields +39.0% accuracy gain (KL), vs. JS at +27.5%, Hellinger at +23.0%, and Wasserstein at +20.5%, with asymmetric KL dominating by a wide margin
- Increasing sample count from 4 to 8 improves ensemble accuracy from 81.9% to 96.0%; ensemble accuracy remains stable at 92–95% across temperature range 0.8–1.0
How to Apply
- In a multi-LLM pipeline, sample responses from each model, cluster semantically equivalent answers using bidirectional entailment, compute CoE = UA + UE, and use the score as a filter to flag low-confidence queries for further review
- When CoE is high: if UA is high, lower each model's temperature or augment few-shot prompts; if UE is high, implement branching logic that re-weights models or inserts an additional verification step
- Because it attaches as post-processing to already-generated outputs, no model retraining is needed. It can be added as a plug-in to existing multi-LLM ensembles, making it immediately applicable for building selective prediction systems that route high-uncertainty cases to human review
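The branching logic above can be sketched as a minimal router. The threshold value and the accept/resample/verify action names are illustrative assumptions, not values from the paper:

```python
# Minimal selective-prediction router driven by CoE and its components.
# The threshold (1.0) and the action labels are illustrative assumptions.
def route(coe, ua, ue, coe_thresh=1.0):
    if coe < coe_thresh:
        return "accept"      # low total uncertainty: keep the ensemble answer
    # diagnose *why* the system is uncertain
    if ua >= ue:
        return "resample"    # intra-model: lower temperature / enrich prompts
    return "verify"          # inter-model: re-weight models or add a check

print(route(coe=0.3, ua=0.2, ue=0.1))  # accept
print(route(coe=1.4, ua=1.1, ue=0.3))  # resample
print(route(coe=1.4, ua=0.3, ue=1.1))  # verify
```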
Code Example
import numpy as np
from scipy.special import rel_entr

def collaborative_entropy(cluster_probs_list, weights=None):
    """
    cluster_probs_list: list of cluster probability distributions, one per model
        e.g.: [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
    weights: per-model weights (uniform if None)
    """
    K = len(cluster_probs_list)
    if weights is None:
        weights = [1.0 / K] * K
    probs = [np.array(p) + 1e-10 for p in cluster_probs_list]  # avoid zeros
    probs = [p / p.sum() for p in probs]
    # ensemble mean distribution
    ensemble_mean = sum(w * p for w, p in zip(weights, probs))

    # UA: average semantic entropy across models
    def shannon_entropy(p):
        return -np.sum(p * np.log(p + 1e-10))
    UA = np.mean([shannon_entropy(p) for p in probs])
    # UE: weighted sum of KL divergences from each model's distribution
    # to the ensemble mean
    UE = sum(w * np.sum(rel_entr(p, ensemble_mean))
             for w, p in zip(weights, probs))
    CoE = UA + UE
    return {"CoE": CoE, "UA": UA, "UE": UE}
# Example: probability distributions over 3 semantic clusters for 3 models
model_outputs = [
    [0.8, 0.1, 0.1],  # LLaMA: confident in cluster 0
    [0.1, 0.8, 0.1],  # Qwen: confident in cluster 1 (inter-model disagreement!)
    [0.7, 0.2, 0.1],  # Mistral: confident in cluster 0
]
result = collaborative_entropy(model_outputs)
print(f"CoE: {result['CoE']:.4f}")
print(f"UA (intra-model): {result['UA']:.4f}")
print(f"UE (inter-model): {result['UE']:.4f}")
# Low UA and high UE -> each model is confident, but they disagree with each other
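The CoE-guided weight-adjustment heuristic mentioned above can be sketched in the same spirit. The exponential down-weighting rule and the `beta` parameter here are assumptions for illustration — the paper's exact update rule may differ — but the core idea is the same: penalize each model by its asymmetric KL deviation from the ensemble mean:

```python
import numpy as np
from scipy.special import rel_entr

# Hedged sketch of a training-free, CoE-guided re-weighting step.
# `beta` and the exponential form are illustrative assumptions.
def kl_reweight(cluster_probs_list, beta=1.0):
    probs = [np.asarray(p, dtype=float) for p in cluster_probs_list]
    mean = np.mean(probs, axis=0)
    # directional deviation of each model from the ensemble mean
    devs = np.array([rel_entr(p, mean).sum() for p in probs])
    weights = np.exp(-beta * devs)  # larger deviation -> smaller weight
    return weights / weights.sum()

w = kl_reweight([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])
print(w)  # the dissenting second model receives the smallest weight
```

With the example distributions, the second (dissenting) model sits farthest from the mean in the KL sense and is therefore down-weighted the most, which is the directional behavior that symmetric divergences cannot attribute to a single model.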
Original Abstract
Uncertainty estimation in multi-LLM systems remains largely single-model-centric: existing methods quantify uncertainty within each model but do not adequately capture semantic disagreement across models. To address this gap, we propose Collaborative Entropy (CoE), a unified information-theoretic metric for semantic uncertainty in multi-LLM collaboration. CoE is defined on a shared semantic cluster space and combines two components: intra-model semantic entropy and inter-model divergence to the ensemble mean. CoE is not a weighted ensemble predictor; it is a system-level uncertainty measure that characterizes collaborative confidence and disagreement. We analyze several core properties of CoE, including non-negativity, zero-value certainty under perfect semantic consensus, and the behavior of CoE when individual models collapse to delta distributions. These results clarify when reducing per-model uncertainty is sufficient and when residual inter-model disagreement remains. We also present a simple CoE-guided, training-free post-hoc coordination heuristic as a practical application of the metric. Experiments on TriviaQA and SQuAD with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE provides stronger uncertainty estimation than standard entropy- and divergence-based baselines, with gains becoming larger as additional heterogeneous models are introduced. Overall, CoE offers a useful uncertainty-aware perspective on multi-LLM collaboration.