Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models
TL;DR Highlight
A cRLHF framework that improves LLM code generation quality by aggregating per-line code evaluations from multiple annotators using Bayesian inference — no separate Reward Model required.
Who Should Read
ML engineering teams looking to apply RLHF to code generation models without the cost of training a Reward Model, by leveraging crowdsourced feedback instead. Particularly relevant for teams building internal GitHub Copilot-style tools or fine-tuning domain-specific code LLMs.
Core Mechanics
- cRLHF: multiple annotators label each line of code as correct/wrong, and Bayesian inference automatically aggregates these labels into a reward score — eliminating the need for a separate Reward Model training step
- Honeypot questions (problems where the ground truth is known in advance) are used to automatically measure each annotator's reliability score pi, providing built-in quality filtering that down-weights inaccurate feedback
- Applied to 17 models (up to 15B parameters) including WizardCoder-15B, StarCoder2-15B, CodeLlama-13B/7B, and DeepSeek-Coder-6.7B using TRL + LoRA + PPO fine-tuning
- The benefits of cRLHF are more pronounced for larger models, while smaller models (e.g., PolyCoder-160M) show minimal improvement
- L1 regularization (sparse regularization) is used to automatically filter out noisy annotators and prevent overfitting
- Annotator reliability updates can be handled via regularized logistic regression optimization instead of iterative updates, enabling one-shot global estimation
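The one-shot estimation mentioned in the last bullet can be sketched as a small L1-regularized logistic regression over honeypot outcomes. This is an illustrative sketch, not the paper's exact formulation: the one-hot data layout, the λ value, and the proximal-gradient solver are my own assumptions.

```python
import numpy as np

def estimate_reliabilities(X, mu, lam=0.1, lr=0.1, steps=2000):
    """One-shot estimate of per-annotator reliabilities from honeypot results.

    X:  (n_labels, n_annotators) one-hot matrix; row i marks which annotator
        produced label i
    mu: (n_labels,) +1 if label i matched the honeypot ground truth, else -1

    Minimizes  sum_i log(1 + exp(-mu_i * x_i @ beta)) + lam * ||beta||_1
    with proximal gradient descent (ISTA), where beta_j = logit(pi_j);
    the L1 term shrinks uninformative (noisy) annotators toward pi = 0.5.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(steps):
        z = mu * (X @ beta)
        # gradient of the logistic loss, averaged over labels
        grad = -(X * (mu / (1.0 + np.exp(z)))[:, None]).sum(axis=0) / n
        beta -= lr * grad
        # soft-thresholding = proximal step for the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam / n, 0.0)
    return 1.0 / (1.0 + np.exp(-beta))  # pi_j = sigmoid(beta_j)

# Toy run: annotator 0 answers honeypots ~90% correctly, annotator 1 guesses
rng = np.random.default_rng(0)
X = np.zeros((100, 2))
X[:50, 0] = 1
X[50:, 1] = 1
mu = np.concatenate([
    rng.choice([1, -1], 50, p=[0.9, 0.1]),  # mostly right
    rng.choice([1, -1], 50, p=[0.5, 0.5]),  # coin flip
])
pi = estimate_reliabilities(X, mu)
print(pi)  # pi[0] well above 0.5, pi[1] near 0.5
```

Because all honeypot outcomes enter one convex objective, the whole annotator pool is scored in a single optimization pass rather than by streaming per-task updates.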
Evidence
- Average across 17 models on HumanEval benchmark: Pass@1 +0.2%, Pass@10 +0.3%, Pass@100 +1.2%
- Average across models on MBPP benchmark: Pass@1 +0.2%, Pass@10 +0.6%, Pass@100 +0.6%
- StarCoder2-15B on MBPP — Pass@100: 84.2% → 84.6%, Pass@10: 43.3% → 44.1%
- On HumanEval, 12 out of 17 models showed Pass@10 improvement; on MBPP, 10 models showed simultaneous improvement on both Pass@10 and Pass@100
How to Apply
- When building a code annotation platform, initialize annotator pi values by setting the first N tasks as Honeypot questions (problems where you already know the answer, without revealing it) — this automatically filters out low-quality participants
- When fine-tuning an existing code LLM using TRL's PPO Trainer + LoRA, you can directly inject cRLHF's aligned score (s = number of correct lines / total lines) as the reward value instead of using a reward model
- If you have internal code review data or team member evaluation results, you can apply the same Bayesian aggregation logic to convert multiple reviewers' opinions into a single reward score and reuse it for fine-tuning
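The aligned score in the second bullet reduces to a single scalar per completion. A minimal sketch of that reduction follows; the 0.5 decision threshold and the function name are my own choices, and the TRL hookup is shown only as a comment since it requires a loaded model.

```python
import numpy as np

def line_level_reward(line_probs, threshold=0.5):
    """Aligned score s = (# lines judged correct) / (total lines).

    line_probs: per-line P(correct) values produced by the Bayesian
    aggregation of annotator labels; a line counts as correct when its
    aggregated probability clears the threshold (0.5 here, an assumption).
    """
    probs = np.asarray(line_probs, dtype=float)
    if probs.size == 0:
        return 0.0
    return float((probs > threshold).mean())

# Example: a 4-line completion where 3 lines look correct
reward = line_level_reward([0.91, 0.78, 0.40, 0.66])
print(reward)  # 0.75

# With TRL's PPOTrainer, this scalar stands in for a reward-model score, e.g.:
#   stats = ppo_trainer.step(query_tensors, response_tensors,
#                            [torch.tensor(reward)])
```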
Code Example
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def logit_inv(x):
    return np.exp(x) / (1 + np.exp(x))

def bayesian_aggregate(annotations, pi_values):
    """
    annotations: list of +1 (correct) or -1 (wrong) from each annotator
    pi_values: list of each annotator's reliability score (0~1)
    returns: P(line is correct | all annotations)
    """
    score = sum(eps * logit(pi) for eps, pi in zip(annotations, pi_values))
    return logit_inv(score)

def update_pi(pi, correctness, p_bar=0.99, lam=1.0):
    """
    pi: current annotator reliability
    correctness: mu = +1 if annotation was right, -1 if wrong
    p_bar: system confidence in the ground truth; for honeypots use a value
           close to, but below, 1.0 so that logit(p_bar) stays finite
    lam: scaling factor controlling the size of each update
    """
    return logit_inv(logit(pi) + lam * correctness * logit(p_bar))

# Example: 3 annotators evaluating a single line of code
annotations = [1, 1, -1]     # annotators 1 and 2 say correct, annotator 3 says wrong
pi_values = [0.8, 0.7, 0.3]  # reliability score of each annotator
prob_correct = bayesian_aggregate(annotations, pi_values)
print(f"Probability this line is correct: {prob_correct:.3f}")

# reward score = number of correct lines / total number of lines
# this value is used directly as the PPO reward
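To bootstrap reliabilities the way the How to Apply section suggests, the same update rule can be iterated over an annotator's first N honeypot answers. This standalone sketch redefines the helpers so it runs on its own; the 0.5 starting prior and the `p_bar` / `lam` values are illustrative assumptions.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def logit_inv(x):
    return np.exp(x) / (1 + np.exp(x))

def update_pi(pi, correctness, p_bar=0.99, lam=1.0):
    # redefined here so the snippet runs standalone; p_bar < 1 keeps logit finite
    return logit_inv(logit(pi) + lam * correctness * logit(p_bar))

def init_pi_from_honeypots(honeypot_results, pi0=0.5, p_bar=0.99, lam=1.0):
    """Initialize an annotator's reliability from the first N honeypot tasks.

    honeypot_results: list of +1/-1 indicating whether each honeypot answer
    matched the known ground truth. Updates are additive in logit space,
    so their order does not matter.
    """
    pi = pi0
    for mu in honeypot_results:
        pi = update_pi(pi, mu, p_bar=p_bar, lam=lam)
    return pi

# An annotator right on 8 of 10 honeypots is pushed toward 1;
# one right on only 2 of 10 is pushed toward 0.
good = init_pi_from_honeypots([1] * 8 + [-1] * 2)
bad = init_pi_from_honeypots([1] * 2 + [-1] * 8)
print(good, bad)
```

In a live platform these initialized values would then seed `bayesian_aggregate` for the real (non-honeypot) tasks, down-weighting participants whose honeypot record is poor.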
Original Abstract
This paper studies how AI-assisted programming with large language models (LLMs) improves software developers' productivity via tools such as GitHub Copilot and Amazon CodeWhisperer, and how reinforcement learning from human feedback (RLHF) can be enhanced with crowd-sourced computation for text-to-code generation. Additionally, we demonstrate that our Bayesian optimization framework supports AI alignment in code generation by distributing the feedback-collection burden, highlighting the value of collecting human feedback of good quality. Our empirical evaluations demonstrate the efficacy of this approach, showcasing how LLM agents can be effectively trained for improved text-to-code generation. Our Bayesian optimization framework can be designed for general domain-specific languages, promoting the alignment of large language model capabilities with human feedback in AI-assisted programming for code generation.