Semantic Invariance in Agentic AI
TL;DR Highlight
Systematic measurement of LLM answer consistency across differently-worded versions of the same problem shows larger models are actually less stable.
Who Should Read
ML engineers evaluating LLM reliability in production, and researchers studying robustness and consistency of language model reasoning.
Core Mechanics
- Measured answer consistency by creating multiple semantically equivalent but surface-form-different versions of the same problems
- Found that LLMs frequently give different answers to the same question asked differently
- Counterintuitively, larger models show more inconsistency than smaller models on this metric
- Inconsistency is highest on tasks requiring multi-step reasoning
- The inconsistency patterns reveal that models are sensitive to specific phrasing cues rather than understanding the underlying problem
- This surface-level sensitivity is a form of brittleness that's invisible to standard accuracy benchmarks
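The agreement measurement described above can be sketched as follows; the answers list is hypothetical, and in practice each entry would come from one LLM call per paraphrase:

```python
from collections import Counter

def consistency_rate(answers):
    """Fraction of paraphrase answers that match the most common answer."""
    if not answers:
        return 0.0
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)

# Hypothetical answers to five paraphrases of the same problem
answers = ["120 km", "120 km", "130 km", "120 km", "120 km"]
print(consistency_rate(answers))  # 0.8
```

A rate of 1.0 means the model answered every paraphrase identically; anything below that is exactly the surface-form sensitivity the study measures.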
Evidence
- Larger models (70B+) show higher inconsistency rates than smaller models (20B-30B) on same-problem paraphrase tests
- Inconsistency rates of 20-40% observed on multi-step reasoning tasks across tested models
- Statistical analysis confirms that scaling up does not improve consistency; across the tested models, larger ones trend toward higher inconsistency
- Human performance on the same paraphrase pairs shows >95% consistency
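One way to sanity-check a size-vs-inconsistency comparison like the one above is a two-proportion z-test; the counts below are hypothetical, not the paper's data:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for the difference between two inconsistency rates,
    where x is the count of inconsistent answers out of n paraphrase trials."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical: large model inconsistent on 60/150 trials, small on 30/150
z = two_proportion_z(60, 150, 30, 150)
print(round(z, 2))  # 3.78 (well past the ~1.96 threshold for p < 0.05)
```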
How to Apply
- Add consistency testing to your LLM evaluation suite: test the same questions with multiple phrasings and measure agreement rate
- For high-stakes applications, use ensemble/majority voting across paraphrased inputs to get more reliable answers
- If model size doesn't help consistency, consider fine-tuning for robustness rather than simply scaling up
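The majority-voting recommendation can be sketched as follows; `ask_fn` stands in for your actual LLM call, and the stub below is purely illustrative:

```python
from collections import Counter

def majority_vote(ask_fn, paraphrases):
    """Query the model once per paraphrased input and return the modal
    answer plus the agreement rate (a confidence signal for the vote)."""
    answers = [ask_fn(p) for p in paraphrases]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Stubbed "LLM" for illustration: maps a prompt to a fixed answer
stub = {"v1": "120 km", "v2": "120 km", "v3": "130 km"}.get
answer, agreement = majority_vote(stub, ["v1", "v2", "v3"])
print(answer, round(agreement, 2))  # 120 km 0.67
```

A low agreement rate is itself a useful signal: it flags inputs where the model is phrasing-sensitive and the voted answer deserves extra scrutiny.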
Code Example
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine
model_st = SentenceTransformer('all-MiniLM-L6-v2')
def semantic_invariance_score(original_output, transformed_output, reference):
    """
    Measure output robustness between the original and transformed inputs.
    The closer score_delta is to 0, the higher the semantic invariance.
    """
    emb_orig = model_st.encode(original_output)
    emb_trans = model_st.encode(transformed_output)
    emb_ref = model_st.encode(reference)
    score_original = 1 - cosine(emb_orig, emb_ref)
    score_transformed = 1 - cosine(emb_trans, emb_ref)
    score_delta = score_transformed - score_original
    trace_sim = 1 - cosine(emb_orig, emb_trans)
    return {
        "score_original": round(score_original, 4),
        "score_transformed": round(score_transformed, 4),
        "score_delta": round(score_delta, 4),  # closer to 0 means more robust
        "trace_similarity": round(trace_sim, 4),  # closer to 1 means more consistent reasoning
        "is_invariant": abs(score_delta) < 0.05,  # threshold from the paper
    }
# Usage example
original_problem = "A car travels 60 km/h for 2 hours. Find the distance."
paraphrased_problem = "If a vehicle moves at 60 kilometers per hour for a duration of 2 hours, what total distance does it cover?"
# Compare outputs after LLM call
original_answer = "The distance is 120 km."
transformed_answer = "The car covers 120 kilometers."
reference = "Distance = speed × time = 60 × 2 = 120 km"
result = semantic_invariance_score(original_answer, transformed_answer, reference)
print(result)
# Example output (exact values depend on the embedding model):
# {'score_original': 0.87, 'score_transformed': 0.86, 'score_delta': -0.01, 'trace_similarity': 0.94, 'is_invariant': True}
Original Abstract
Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.