On the Consistency of Automatic Scoring with Large Language Models.
TL;DR Highlight
When auto-grading answers with LLMs, variance within the same model is low but cross-model variance is high — use multi-LLM majority voting for reliability.
Who Should Read
Researchers and engineers building LLM-based evaluation systems who need to understand and manage variance in automated scoring.
Core Mechanics
- Intra-model variance (same model, same prompt, multiple runs) is low for LLM graders — outputs are reasonably consistent
- Inter-model variance (different models grading same answer) is high — different LLMs can disagree substantially on grades
- This means single-model auto-grading may be reliable run-to-run but systematically biased in ways that differ between models
- Multi-LLM ensemble grading (majority vote across 3+ different models) significantly reduces systematic bias compared to any single model
- The ensemble approach is particularly important for contested or subjective answers where human raters also disagree
- Practical recommendation: use 3 different LLM graders (e.g., GPT-4o, Claude, Gemini) and take majority vote — reduces model-specific bias at 3x cost
Evidence
- Intra-model variance (GPT-4o across 10 runs): standard deviation 0.12 grade points on 1-5 scale
- Inter-model variance (GPT-4o vs Claude vs Gemini): standard deviation 0.67 grade points — 5x higher
- 3-model ensemble accuracy vs. human ground truth: 84% agreement vs. 71% for best single model
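The agreement figures above can be computed mechanically: given each model's score per item and a human label, ensemble agreement is just the fraction of items where the majority vote matches the human score. A minimal sketch; the scores and labels here are illustrative toy data, not the paper's dataset:

```python
from collections import Counter

def majority(scores):
    # Modal score across graders; ties break toward the first-counted value
    return Counter(scores).most_common(1)[0][0]

def agreement(per_item_scores, human_scores):
    # Fraction of items where the majority vote matches the human label
    hits = sum(majority(s) == h for s, h in zip(per_item_scores, human_scores))
    return hits / len(human_scores)

# Illustrative data: three graders' scores per item, plus the human score
model_scores = [[2, 2, 1], [1, 1, 1], [0, 1, 1], [2, 2, 2]]
human = [2, 1, 1, 2]
print(agreement(model_scores, human))  # prints 1.0 on this toy data
```

Note that on the second and third items a lone dissenting model is outvoted, which is exactly the mechanism by which the ensemble beats the best single model.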
How to Apply
- For high-stakes automated grading: use at least 3 different LLM providers and take majority vote — the 3x cost is justified by the significant accuracy improvement.
- If cost is a constraint: use a cheap model (GPT-4o-mini) for initial filtering and only escalate to the 3-model ensemble for borderline cases (within 1 grade level of pass/fail threshold).
- Track inter-model disagreement as a quality signal: high disagreement on a specific question type indicates that question is poorly suited for automated grading.
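The cost-tiered workflow above can be sketched as follows. `cheap_score` and `ensemble_score` are hypothetical stand-ins for a single cheap-model call and the 3-model graders, and the threshold values are illustrative, not from the paper:

```python
from collections import Counter

def grade_with_escalation(answer, cheap_score, ensemble_score,
                          pass_threshold=3, borderline=1):
    """Cheap first pass; escalate to the multi-model ensemble only near the cut."""
    first = cheap_score(answer)
    # Clear pass/fail: accept the cheap model's verdict as-is
    if abs(first - pass_threshold) > borderline:
        return {"score": first, "escalated": False}
    # Borderline case: spend 3x only where model-specific bias matters most
    scores = ensemble_score(answer)
    final = Counter(scores).most_common(1)[0][0]
    # Range of scores doubles as the per-question quality signal
    disagreement = max(scores) - min(scores)
    return {"score": final, "escalated": True, "disagreement": disagreement}
```

Logging `disagreement` per question type implements the third bullet: question types that persistently escalate with wide score ranges are candidates for removal from automated grading.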
Code Example
import os
from collections import Counter

import openai
import anthropic
import google.generativeai as genai

def score_response(question, student_answer, rubric, models=("gpt", "claude", "gemini")):
    """Score a student answer with multiple LLMs and take a majority vote."""
    prompt = f"""Score the following question and student answer based on the rubric.
Question: {question}
Student Answer: {student_answer}
Rubric: {rubric}
Output the score as a number only (e.g., 2)."""
    scores = []
    # GPT
    if "gpt" in models:
        client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # maximize intra-LLM consistency
        )
        scores.append(int(response.choices[0].message.content.strip()))
    # Claude
    if "claude" in models:
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        response = client.messages.create(
            model="claude-opus-4-6",  # substitute a current model name
            max_tokens=10,
            messages=[{"role": "user", "content": prompt}],
        )
        scores.append(int(response.content[0].text.strip()))
    # Gemini
    if "gemini" in models:
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        model = genai.GenerativeModel("gemini-1.5-pro")
        response = model.generate_content(prompt)
        scores.append(int(response.text.strip()))
    # Majority voting
    vote_counts = Counter(scores)
    final_score = vote_counts.most_common(1)[0][0]
    confidence = vote_counts[final_score] / len(scores)
    return {
        "final_score": final_score,
        "confidence": confidence,
        "all_scores": scores,
        "needs_review": confidence < 0.6,  # flag for human review when models disagree
    }

# Usage example
result = score_response(
    question="Explain the role of light energy in the process of photosynthesis.",
    student_answer="Light energy is used to break down water molecules.",
    rubric="0 points: irrelevant answer, 1 point: partially correct, 2 points: complete answer",
)
print(result)
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically verifying that when LLMs write TLA+ specifications, they pass syntax checks well but reach only around 46% conformance with the real system's behavior, showing the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into natural language that can be read directly. It is a new advance in interpretability research into what the AI is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only the documentation; even the best model passed 95% or more of the tests on only 3% of all tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a task into three tickets and even Claude/GPT will simply write security-vulnerable code 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Original Abstract
Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same response multiple times. This study investigates the intra-LLM and inter-LLM consistency in scoring with five LLMs (i.e., Claude, DeepSeek, Gemini, GPT, and Qwen), variability under different temperatures, and their relationship with scoring accuracy. Moreover, a voting strategy that assembles information from different LLMs was proposed to address inconsistent scoring. Using constructed-response items from a science education assessment and open-source data from the Automated Student Assessment Prize (ASAP), we find that: (a) LLMs generally exhibited almost perfect intra-LLM consistency regardless of temperature; (b) inter-LLM consistency was moderate, with higher agreement observed for items that were easier to score; (c) intra-LLM consistency consistently exceeded inter-LLM consistency, supporting the expectation that within-model consistency represents an upper bound for cross-model agreement; (d) intra-LLM consistency was not associated with scoring accuracy, whereas inter-LLM consistency showed a strong positive relationship with accuracy; and (e) majority voting across LLMs improved scoring accuracy by leveraging complementary strengths of different models.