Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models
TL;DR Highlight
A cRLHF framework that improves LLM code generation quality by aggregating per-line code evaluations from multiple annotators using Bayesian inference — no separate Reward Model required.
Who Should Read
ML engineering teams looking to apply RLHF to code generation models without the cost of training a Reward Model, by leveraging crowdsourced feedback instead. Particularly relevant for teams building internal GitHub Copilot-style tools or fine-tuning domain-specific code LLMs.
Core Mechanics
- cRLHF: multiple annotators label each line of code as correct/wrong, and Bayesian inference automatically aggregates these labels into a reward score — eliminating the need for a separate Reward Model training step
- Honeypot questions (problems where the ground truth is known in advance) are used to automatically measure each annotator's reliability score pi, providing built-in quality filtering that down-weights inaccurate feedback
- Applied to 17 models (up to 15B parameters) including WizardCoder-15B, StarCoder2-15B, CodeLlama-13B/7B, and DeepSeek-Coder-6.7B using TRL + LoRA + PPO fine-tuning
- The benefits of cRLHF are more pronounced for larger models, while smaller models (e.g., PolyCoder-160M) show minimal improvement
- L1 regularization (sparse regularization) is used to automatically filter out noisy annotators and prevent overfitting
- Annotator reliability updates can be handled via regularized logistic regression optimization instead of iterative updates, enabling one-shot global estimation
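The one-shot estimation mentioned in the last bullet can be sketched as a small L1-regularized logistic regression over honeypot outcomes. This is an illustrative sketch, not the paper's exact formulation: the one-hot data layout, the λ value, and the proximal-gradient solver are my own assumptions.

```python
import numpy as np

def estimate_reliabilities(X, mu, lam=0.1, lr=0.1, steps=2000):
    """One-shot estimate of per-annotator reliabilities from honeypot results.

    X:  (n_labels, n_annotators) one-hot matrix; row i marks which annotator
        produced label i
    mu: (n_labels,) +1 if label i matched the honeypot ground truth, else -1

    Minimizes  sum_i log(1 + exp(-mu_i * x_i @ beta)) + lam * ||beta||_1
    with proximal gradient descent (ISTA), where beta_j = logit(pi_j);
    the L1 term shrinks uninformative (noisy) annotators toward pi = 0.5.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(steps):
        z = mu * (X @ beta)
        # gradient of the logistic loss, averaged over labels
        grad = -(X * (mu / (1.0 + np.exp(z)))[:, None]).sum(axis=0) / n
        beta -= lr * grad
        # soft-thresholding = proximal step for the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam / n, 0.0)
    return 1.0 / (1.0 + np.exp(-beta))  # pi_j = sigmoid(beta_j)

# Toy run: annotator 0 answers honeypots ~90% correctly, annotator 1 guesses
rng = np.random.default_rng(0)
X = np.zeros((100, 2))
X[:50, 0] = 1
X[50:, 1] = 1
mu = np.concatenate([
    rng.choice([1, -1], 50, p=[0.9, 0.1]),  # mostly right
    rng.choice([1, -1], 50, p=[0.5, 0.5]),  # coin flip
])
pi = estimate_reliabilities(X, mu)
print(pi)  # pi[0] well above 0.5, pi[1] near 0.5
```

Because all honeypot outcomes enter one convex objective, the whole annotator pool is scored in a single optimization pass rather than by streaming per-task updates.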
Evidence
- Average across 17 models on HumanEval benchmark: Pass@1 +0.2%, Pass@10 +0.3%, Pass@100 +1.2%
- Average across models on MBPP benchmark: Pass@1 +0.2%, Pass@10 +0.6%, Pass@100 +0.6%
- StarCoder2-15B on MBPP — Pass@100: 84.2% → 84.6%, Pass@10: 43.3% → 44.1%
- On HumanEval, 12 out of 17 models showed Pass@10 improvement; on MBPP, 10 models showed simultaneous improvement on both Pass@10 and Pass@100
How to Apply
- When building a code annotation platform, initialize annotator pi values by setting the first N tasks as Honeypot questions (problems where you already know the answer, without revealing it) — this automatically filters out low-quality participants
- When fine-tuning an existing code LLM using TRL's PPO Trainer + LoRA, you can directly inject cRLHF's aligned score (s = number of correct lines / total lines) as the reward value instead of using a reward model
- If you have internal code review data or team member evaluation results, you can apply the same Bayesian aggregation logic to convert multiple reviewers' opinions into a single reward score and reuse it for fine-tuning
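The aligned score in the second bullet reduces to a single scalar per completion. A minimal sketch of that reduction follows; the 0.5 decision threshold and the function name are my own choices, and the TRL hookup is shown only as a comment since it requires a loaded model.

```python
import numpy as np

def line_level_reward(line_probs, threshold=0.5):
    """Aligned score s = (# lines judged correct) / (total lines).

    line_probs: per-line P(correct) values produced by the Bayesian
    aggregation of annotator labels; a line counts as correct when its
    aggregated probability clears the threshold (0.5 here, an assumption).
    """
    probs = np.asarray(line_probs, dtype=float)
    if probs.size == 0:
        return 0.0
    return float((probs > threshold).mean())

# Example: a 4-line completion where 3 lines look correct
reward = line_level_reward([0.91, 0.78, 0.40, 0.66])
print(reward)  # 0.75

# With TRL's PPOTrainer, this scalar stands in for a reward-model score, e.g.:
#   stats = ppo_trainer.step(query_tensors, response_tensors,
#                            [torch.tensor(reward)])
```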
Code Example
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def logit_inv(x):
    return np.exp(x) / (1 + np.exp(x))

def bayesian_aggregate(annotations, pi_values):
    """
    annotations: list of +1 (correct) or -1 (wrong) from each annotator
    pi_values: list of each annotator's reliability score (0~1)
    returns: P(line is correct | all annotations)
    """
    score = sum(eps * logit(pi) for eps, pi in zip(annotations, pi_values))
    return logit_inv(score)

def update_pi(pi, correctness, p_bar=0.99, lam=1.0):
    """
    pi: current annotator reliability
    correctness: mu = +1 if annotation was right, -1 if wrong
    p_bar: system confidence in the ground truth; for honeypots use a value
           close to, but below, 1.0 so that logit(p_bar) stays finite
    lam: scaling factor controlling the size of each update
    """
    return logit_inv(logit(pi) + lam * correctness * logit(p_bar))

# Example: 3 annotators evaluating a single line of code
annotations = [1, 1, -1]     # annotators 1 and 2 say correct, annotator 3 says wrong
pi_values = [0.8, 0.7, 0.3]  # reliability score of each annotator
prob_correct = bayesian_aggregate(annotations, pi_values)
print(f"Probability this line is correct: {prob_correct:.3f}")

# reward score = number of correct lines / total number of lines
# this value is used directly as the PPO reward
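To bootstrap reliabilities the way the How to Apply section suggests, the same update rule can be iterated over an annotator's first N honeypot answers. This standalone sketch redefines the helpers so it runs on its own; the 0.5 starting prior and the `p_bar` / `lam` values are illustrative assumptions.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def logit_inv(x):
    return np.exp(x) / (1 + np.exp(x))

def update_pi(pi, correctness, p_bar=0.99, lam=1.0):
    # redefined here so the snippet runs standalone; p_bar < 1 keeps logit finite
    return logit_inv(logit(pi) + lam * correctness * logit(p_bar))

def init_pi_from_honeypots(honeypot_results, pi0=0.5, p_bar=0.99, lam=1.0):
    """Initialize an annotator's reliability from the first N honeypot tasks.

    honeypot_results: list of +1/-1 indicating whether each honeypot answer
    matched the known ground truth. Updates are additive in logit space,
    so their order does not matter.
    """
    pi = pi0
    for mu in honeypot_results:
        pi = update_pi(pi, mu, p_bar=p_bar, lam=lam)
    return pi

# An annotator right on 8 of 10 honeypots is pushed toward 1;
# one right on only 2 of 10 is pushed toward 0.
good = init_pi_from_honeypots([1] * 8 + [-1] * 2)
bad = init_pi_from_honeypots([1] * 2 + [-1] * 8)
print(good, bad)
```

In a live platform these initialized values would then seed `bayesian_aggregate` for the real (non-honeypot) tasks, down-weighting participants whose honeypot record is poor.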
Original Abstract
This paper studies how AI-assisted programming with large language models (LLMs) improves software developers' productivity via tools such as GitHub Copilot and Amazon CodeWhisperer, and how reinforcement learning from human feedback (RLHF) can be enhanced with crowd-sourced computation for text-to-code generation. Additionally, we demonstrate that our Bayesian optimization framework supports AI alignment in code generation by distributing the feedback-collection burden, highlighting the value of collecting human feedback of good quality. Our empirical evaluations demonstrate the efficacy of this approach, showcasing how LLM agents can be effectively trained for improved text-to-code generation. Our Bayesian optimization framework can be designed for general domain-specific languages, promoting the alignment of large language model capabilities with human feedback in AI-assisted programming for code generation.