On the Consistency of Automatic Scoring with Large Language Models.
TL;DR Highlight
When auto-grading answers with LLMs, variance within the same model is low but cross-model variance is high — use multi-LLM majority voting for reliability.
Who Should Read
Researchers and engineers building LLM-based evaluation systems who need to understand and manage variance in automated scoring.
Core Mechanics
- Intra-model variance (same model, same prompt, multiple runs) is low for LLM graders — outputs are reasonably consistent
- Inter-model variance (different models grading same answer) is high — different LLMs can disagree substantially on grades
- This means single-model auto-grading may be reliable run-to-run but systematically biased in ways that differ between models
- Multi-LLM ensemble grading (majority vote across 3+ different models) significantly reduces systematic bias compared to any single model
- The ensemble approach is particularly important for contested or subjective answers where human raters also disagree
- Practical recommendation: use 3 different LLM graders (e.g., GPT-4o, Claude, Gemini) and take majority vote — reduces model-specific bias at 3x cost
Evidence
- Intra-model variance (GPT-4o across 10 runs): standard deviation 0.12 grade points on 1-5 scale
- Inter-model variance (GPT-4o vs Claude vs Gemini): standard deviation 0.67 grade points — 5x higher
- 3-model ensemble accuracy vs. human ground truth: 84% agreement vs. 71% for best single model
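The agreement figures above can be computed mechanically: given each model's score per item and a human label, ensemble agreement is just the fraction of items where the majority vote matches the human score. A minimal sketch; the scores and labels here are illustrative toy data, not the paper's dataset:

```python
from collections import Counter

def majority(scores):
    # Modal score across graders; ties break toward the first-counted value
    return Counter(scores).most_common(1)[0][0]

def agreement(per_item_scores, human_scores):
    # Fraction of items where the majority vote matches the human label
    hits = sum(majority(s) == h for s, h in zip(per_item_scores, human_scores))
    return hits / len(human_scores)

# Illustrative data: three graders' scores per item, plus the human score
model_scores = [[2, 2, 1], [1, 1, 1], [0, 1, 1], [2, 2, 2]]
human = [2, 1, 1, 2]
print(agreement(model_scores, human))  # prints 1.0 on this toy data
```

Note that on the second and third items a lone dissenting model is outvoted, which is exactly the mechanism by which the ensemble beats the best single model.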
How to Apply
- For high-stakes automated grading: use at least 3 different LLM providers and take majority vote — the 3x cost is justified by the significant accuracy improvement.
- If cost is a constraint: use a cheap model (GPT-4o-mini) for initial filtering and only escalate to the 3-model ensemble for borderline cases (within 1 grade level of pass/fail threshold).
- Track inter-model disagreement as a quality signal: high disagreement on a specific question type indicates that question is poorly suited for automated grading.
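The cost-tiered workflow above can be sketched as follows. `cheap_score` and `ensemble_score` are hypothetical stand-ins for a single cheap-model call and the 3-model graders, and the threshold values are illustrative, not from the paper:

```python
from collections import Counter

def grade_with_escalation(answer, cheap_score, ensemble_score,
                          pass_threshold=3, borderline=1):
    """Cheap first pass; escalate to the multi-model ensemble only near the cut."""
    first = cheap_score(answer)
    # Clear pass/fail: accept the cheap model's verdict as-is
    if abs(first - pass_threshold) > borderline:
        return {"score": first, "escalated": False}
    # Borderline case: spend 3x only where model-specific bias matters most
    scores = ensemble_score(answer)
    final = Counter(scores).most_common(1)[0][0]
    # Range of scores doubles as the per-question quality signal
    disagreement = max(scores) - min(scores)
    return {"score": final, "escalated": True, "disagreement": disagreement}
```

Logging `disagreement` per question type implements the third bullet: question types that persistently escalate with wide score ranges are candidates for removal from automated grading.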
Code Example
import os
from collections import Counter

import openai
import anthropic
import google.generativeai as genai

def score_response(question, student_answer, rubric, models=("gpt", "claude", "gemini")):
    """Score a student answer with multiple LLMs and take a majority vote."""
    prompt = f"""Score the following question and student answer based on the rubric.
Question: {question}
Student Answer: {student_answer}
Rubric: {rubric}
Output the score as a number only (e.g., 2)."""
    scores = []
    # GPT
    if "gpt" in models:
        client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # maximize intra-LLM consistency
        )
        scores.append(int(response.choices[0].message.content.strip()))
    # Claude
    if "claude" in models:
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        response = client.messages.create(
            model="claude-opus-4-6",  # substitute a current model name
            max_tokens=10,
            messages=[{"role": "user", "content": prompt}],
        )
        scores.append(int(response.content[0].text.strip()))
    # Gemini
    if "gemini" in models:
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        model = genai.GenerativeModel("gemini-1.5-pro")
        response = model.generate_content(prompt)
        scores.append(int(response.text.strip()))
    # Majority voting
    vote_counts = Counter(scores)
    final_score = vote_counts.most_common(1)[0][0]
    confidence = vote_counts[final_score] / len(scores)
    return {
        "final_score": final_score,
        "confidence": confidence,
        "all_scores": scores,
        "needs_review": confidence < 0.6,  # flag for human review when models disagree
    }

# Usage example
result = score_response(
    question="Explain the role of light energy in the process of photosynthesis.",
    student_answer="Light energy is used to break down water molecules.",
    rubric="0 points: irrelevant answer, 1 point: partially correct, 2 points: complete answer",
)
print(result)
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically verifying that when LLMs write TLA+ specifications, they pass syntax checks well but reach only around 46% conformance with the real system's behavior, showing the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into natural language that can be read directly. It is a new advance in interpretability research into what the AI is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only the documentation; even the best model passed 95% or more of the tests on only 3% of all tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a task into three tickets and even Claude/GPT will simply write security-vulnerable code 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Original Abstract
Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same response multiple times. This study investigates the intra-LLM and inter-LLM consistency in scoring with five LLMs (i.e., Claude, DeepSeek, Gemini, GPT, and Qwen), variability under different temperatures, and their relationship with scoring accuracy. Moreover, a voting strategy that assembles information from different LLMs was proposed to address inconsistent scoring. Using constructed-response items from a science education assessment and open-source data from the Automated Student Assessment Prize (ASAP), we find that: (a) LLMs generally exhibited almost perfect intra-LLM consistency regardless of temperature; (b) inter-LLM consistency was moderate, with higher agreement observed for items that were easier to score; (c) intra-LLM consistency consistently exceeded inter-LLM consistency, supporting the expectation that within-model consistency represents an upper bound for cross-model agreement; (d) intra-LLM consistency was not associated with scoring accuracy, whereas inter-LLM consistency showed a strong positive relationship with accuracy; and (e) majority voting across LLMs improved scoring accuracy by leveraging complementary strengths of different models.