FUSE: Ensembling Verifiers with Zero Labeled Data
TL;DR Highlight
FUSE automatically ensembles multiple LLM verification models without ground truth labels, achieving Best-of-N performance comparable to semi-supervised learning.
Who Should Read
ML engineers deploying test-time scaling or Best-of-N sampling in production, and developers seeking to improve response quality by combining multiple reward models without access to labeled data.
Core Mechanics
- FUSE automatically weights and ensembles scores from multiple verifiers (LLM judgment models, reward models, etc.) without ground truth labels, outperforming simple averaging or majority voting in accurately selecting the best response.
- The core idea is to automatically transform verifier scores so that they better satisfy Triplet Conditional Independence (TCI: the assumption that the outputs of any three verifiers are independent given the ground truth), and then estimate each verifier's accuracy with moment-based statistical techniques.
- Because real-world LLM verifiers often violate TCI, directly applying the existing Jaffe et al. (2015) algorithm performs worse than naive ensembling in 7 of 10 settings; FUSE resolves this with its score-transformation step.
- Pseudo-labels are created from the estimated verifier accuracies and used to train an ensemble function like logistic regression to select the final response, eliminating the need for the stronger independence assumption of Joint Conditional Independence (JCI).
- FUSE's query-conditional mode operates independently on each query, so it outperforms the semi-supervised WEAVER even in heterogeneous, mixed-domain settings; its advantage grows when labels are available only for a specific domain.
- FUSE works even on state-of-the-art models like Gemini 3 Pro and GPT-5 on the still-unsolved Humanity's Last Exam benchmark, with potential applications in data filtering, benchmark auditing, and unsupervised model ranking.
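The moment-based accuracy estimate behind the second bullet can be sketched as follows. This is a minimal illustration of a Jaffe et al.-style triplet estimator on synthetic binary votes where TCI holds by construction; it is not the authors' FUSE implementation, and the verifier accuracies and variable names are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_acc = np.array([0.9, 0.8, 0.7])  # hypothetical per-verifier accuracies

# Hidden ground-truth labels in {-1, +1} (never shown to the estimator).
y = rng.choice([-1, 1], size=n)

# Conditionally independent binary verifier votes (TCI holds by construction):
# verifier i agrees with y with probability true_acc[i].
votes = np.stack([np.where(rng.random(n) < a, y, -y) for a in true_acc])

# Under TCI, E[v_i v_j] = rho_i * rho_j with rho_i = E[v_i y] = 2*acc_i - 1,
# so each rho_i is recoverable from pairwise second moments alone.
M = (votes @ votes.T) / n
rho = np.sqrt(np.array([
    M[0, 1] * M[0, 2] / M[1, 2],
    M[0, 1] * M[1, 2] / M[0, 2],
    M[0, 2] * M[1, 2] / M[0, 1],
]))
est_acc = (1 + rho) / 2  # estimated accuracies, recovered without any labels
```

When conditional independence is violated, as with real LLM verifiers, the pairwise moments no longer factor cleanly into `rho_i * rho_j`, which is exactly why FUSE's score-transformation step matters.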
Evidence
- On GPQA Diamond (70B generator, Best-of-100), FUSE reached 64.4% vs. 64.1% for WEAVER using 5% of labels, matching semi-supervised performance with zero labels and outperforming the baseline in 27 of 40 comparisons.
- On Humanity's Last Exam (649 questions, Best-of-50), FUSE scored 54.3%, surpassing Pass@1 (52.1%), WEAVER (51.2%), and naive ensembling (51.4%); this was the only benchmark where naive ensembling fell below random selection.
- On IMO Shortlist (123 questions, Best-of-50), FUSE achieved 63.8%, outperforming WEAVER (62.1%), semi-supervised logistic regression (60.2%), and the oracle best single verifier (59.7%).
- On the Saad-Falcon et al. datasets (8B/70B, 10 settings), FUSE improved over naive ensembling by at least +2.3 and up to +12.3 percentage points, and by up to 17.0 percentage points over majority vote (MMLU Pro, 70B).
How to Apply
- If you run a Best-of-N pipeline with multiple reward models, min-max normalize each model's scores to the range [-1, 1] and then apply FUSE to estimate ensemble weights automatically, without collecting labels.
- For mixed-domain query sets (e.g., math + coding + commonsense), use FUSE's query-conditional mode so that verifier weights can vary by domain; this outperforms semi-supervised methods trained on a single label set.
- To select high-quality responses without ground truth labels, as in synthetic data selection or RLHF data filtering, score candidates with multiple LLM judges, ensemble the scores with FUSE, and keep only the top responses as training data.
Code Example
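No official snippet is reproduced here, so below is a hypothetical sketch of the Best-of-N selection step described under How to Apply: min-max normalize each verifier's scores to [-1, 1], then pick the candidate with the highest weighted ensemble score. In FUSE the weights would come from the unsupervised accuracy estimates; here they are passed in directly, and all function names are ours, not the paper's API:

```python
import numpy as np

def minmax_to_pm1(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize one verifier's raw scores to [-1, 1]."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:  # constant scores carry no signal
        return np.zeros_like(scores, dtype=float)
    return 2.0 * (scores - lo) / (hi - lo) - 1.0

def best_of_n(candidate_scores: np.ndarray, weights: np.ndarray) -> int:
    """Select a response from N candidates.

    candidate_scores: (num_verifiers, N) raw scores for the N candidates.
    weights: per-verifier weights, e.g. from an unsupervised accuracy estimate.
    Returns the index of the candidate with the highest weighted ensemble score.
    """
    normed = np.stack([minmax_to_pm1(s) for s in candidate_scores])
    ensemble = weights @ normed  # (N,) weighted sum across verifiers
    return int(np.argmax(ensemble))
```

A naive ensemble corresponds to uniform weights; FUSE's gain comes from replacing them with label-free, accuracy-derived weights (query-conditional in the mixed-domain mode).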
Original Abstract
Verification of model outputs is rapidly emerging as a key primitive for both training and real-world deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models since ground truth acquisition can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method on both conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity's Last Exam and IMO Shortlist questions.