On the Consistency of Automatic Scoring with Large Language Models
TL;DR Highlight
When auto-grading answers with LLMs, variance within the same model is low but cross-model variance is high — use multi-LLM majority voting for reliability.
Who Should Read
Researchers and engineers building LLM-based evaluation systems who need to understand and manage variance in automated scoring.
Core Mechanics
- Intra-model variance (same model, same prompt, multiple runs) is low for LLM graders — outputs are reasonably consistent
- Inter-model variance (different models grading same answer) is high — different LLMs can disagree substantially on grades
- This means single-model auto-grading may be reliable run-to-run but systematically biased in ways that differ between models
- Multi-LLM ensemble grading (majority vote across 3+ different models) significantly reduces systematic bias compared to any single model
- The ensemble approach is particularly important for contested or subjective answers where human raters also disagree
- Practical recommendation: use 3 different LLM graders (e.g., GPT-4o, Claude, Gemini) and take majority vote — reduces model-specific bias at 3x cost
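The voting mechanic described above reduces to a few lines. A minimal sketch with hypothetical grader outputs (note that `Counter.most_common` breaks exact ties by insertion order, so a tie falls back to the first score seen):

```python
from collections import Counter

def majority_vote(scores):
    """Return the most common score and the fraction of graders that chose it."""
    counts = Counter(scores)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(scores)

# Two of three hypothetical graders agree on a score of 2.
score, agreement = majority_vote([2, 2, 1])
```

The agreement fraction doubles as a confidence signal: with three graders it can only be 1/3, 2/3, or 1, so anything below unanimity is worth logging.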
Evidence
- Intra-model variance (GPT-4o across 10 runs): standard deviation 0.12 grade points on 1-5 scale
- Inter-model variance (GPT-4o vs Claude vs Gemini): standard deviation 0.67 grade points — 5x higher
- 3-model ensemble accuracy vs. human ground truth: 84% agreement vs. 71% for best single model
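The two variance figures are easy to reproduce on your own grading logs. A minimal sketch with hypothetical 1-5 scale grades (not the study's raw data), using the population standard deviation from the standard library:

```python
import statistics

# Hypothetical grades on a 1-5 scale -- illustrative, not the study's raw data.
same_model_runs = [3, 3, 3, 3, 4, 3, 3, 3, 3, 3]  # one model, ten runs
cross_model_scores = [3, 4, 2]                     # three models, one run each

intra_sd = statistics.pstdev(same_model_runs)
inter_sd = statistics.pstdev(cross_model_scores)
print(f"intra-model SD: {intra_sd:.2f}, inter-model SD: {inter_sd:.2f}")
# -> intra-model SD: 0.30, inter-model SD: 0.82
```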
How to Apply
- For high-stakes automated grading: use at least 3 different LLM providers and take majority vote — the 3x cost is justified by the significant accuracy improvement.
- If cost is a constraint: use a cheap model (GPT-4o-mini) for initial filtering and only escalate to the 3-model ensemble for borderline cases (within 1 grade level of pass/fail threshold).
- Track inter-model disagreement as a quality signal: high disagreement on a specific question type indicates that question is poorly suited for automated grading.
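The tiered workflow above can be sketched with grader callables standing in for real LLM calls; the helper name and the returned fields are illustrative, not from the paper:

```python
from collections import Counter

def grade_with_escalation(prompt, cheap_grader, ensemble_graders, pass_threshold=2):
    """Cheap single-model pass first; escalate borderline cases to the ensemble.

    cheap_grader and ensemble_graders are callables returning an int score --
    placeholders here for real LLM calls.
    """
    first_pass = cheap_grader(prompt)
    if abs(first_pass - pass_threshold) > 1:  # clearly above or below: keep cheap score
        return {"score": first_pass, "escalated": False, "disagreement": 0.0}
    # Borderline (within 1 grade level of the threshold): full ensemble vote.
    scores = [grade(prompt) for grade in ensemble_graders]
    winner, votes = Counter(scores).most_common(1)[0]
    return {
        "score": winner,
        "escalated": True,
        "disagreement": 1 - votes / len(scores),  # track as a per-question quality signal
    }

# Stub graders for illustration; real ones would wrap score_response-style API calls.
cheap = lambda p: 2
ensemble = [lambda p: 2, lambda p: 2, lambda p: 1]
print(grade_with_escalation("Explain photosynthesis.", cheap, ensemble))
```

Aggregating the `disagreement` field per question type gives the quality signal from the last bullet: question types with persistently high disagreement are candidates for human-only grading.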
Code Example
```python
import openai
import anthropic
import google.generativeai as genai
from collections import Counter

def score_response(question, student_answer, rubric, models=("gpt", "claude", "gemini")):
    """Score with multiple LLMs, then apply majority voting."""
    prompt = f"""Score the following question and student answer based on the rubric.
Question: {question}
Student Answer: {student_answer}
Rubric: {rubric}
Output the score as a number only (e.g., 2)."""
    scores = []
    # GPT
    if "gpt" in models:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # ensure intra-LLM consistency
        )
        scores.append(int(response.choices[0].message.content.strip()))
    # Claude
    if "claude" in models:
        client = anthropic.Anthropic()
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=10,
            messages=[{"role": "user", "content": prompt}],
        )
        scores.append(int(response.content[0].text.strip()))
    # Gemini
    if "gemini" in models:
        model = genai.GenerativeModel("gemini-1.5-pro")
        response = model.generate_content(prompt)
        scores.append(int(response.text.strip()))
    # Majority voting
    vote_counts = Counter(scores)
    final_score = vote_counts.most_common(1)[0][0]
    confidence = vote_counts[final_score] / len(scores)
    return {
        "final_score": final_score,
        "confidence": confidence,
        "all_scores": scores,
        "needs_review": confidence < 0.6,  # flag for human review when models disagree
    }

# Usage example
result = score_response(
    question="Explain the role of light energy in the process of photosynthesis.",
    student_answer="Light energy is used to break down water molecules.",
    rubric="0 points: irrelevant answer, 1 point: partially correct, 2 points: complete answer",
)
print(result)
```
Original Abstract
Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same response multiple times. This study investigates the intra-LLM and inter-LLM consistency in scoring with five LLMs (i.e., Claude, DeepSeek, Gemini, GPT, and Qwen), variability under different temperatures, and their relationship with scoring accuracy. Moreover, a voting strategy that assembles information from different LLMs was proposed to address inconsistent scoring. Using constructed-response items from a science education assessment and open-source data from the Automated Student Assessment Prize (ASAP), we find that: (a) LLMs generally exhibited almost perfect intra-LLM consistency regardless of temperature; (b) inter-LLM consistency was moderate, with higher agreement observed for items that were easier to score; (c) intra-LLM consistency consistently exceeded inter-LLM consistency, supporting the expectation that within-model consistency represents an upper bound for cross-model agreement; (d) intra-LLM consistency was not associated with scoring accuracy, whereas inter-LLM consistency showed a strong positive relationship with accuracy; and (e) majority voting across LLMs improved scoring accuracy by leveraging complementary strengths of different models.