Semantic Invariance in Agentic AI
TL;DR Highlight
Systematic measurement of LLM answer consistency across differently-worded versions of the same problem shows larger models are actually less stable.
Who Should Read
ML engineers evaluating LLM reliability in production, and researchers studying robustness and consistency of language model reasoning.
Core Mechanics
- Measured answer consistency by creating multiple semantically equivalent but surface-form-different versions of the same problems
- Found that LLMs frequently give different answers to the same question asked differently
- Counterintuitively, larger models show more inconsistency than smaller models on this metric
- Inconsistency is highest on tasks requiring multi-step reasoning
- The inconsistency patterns reveal that models are sensitive to specific phrasing cues rather than understanding the underlying problem
- This surface-level sensitivity is a form of brittleness that's invisible to standard accuracy benchmarks
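The agreement measurement described above can be sketched as follows; the answers list is hypothetical, and in practice each entry would come from one LLM call per paraphrase:

```python
from collections import Counter

def consistency_rate(answers):
    """Fraction of paraphrase answers that match the most common answer."""
    if not answers:
        return 0.0
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)

# Hypothetical answers to five paraphrases of the same problem
answers = ["120 km", "120 km", "130 km", "120 km", "120 km"]
print(consistency_rate(answers))  # 0.8
```

A rate of 1.0 means the model answered every paraphrase identically; anything below that is exactly the surface-form sensitivity the study measures.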
Evidence
- Larger models (70B+) show higher inconsistency rates than smaller models (20B-30B) on same-problem paraphrase tests
- Inconsistency rates of 20-40% observed on multi-step reasoning tasks across tested models
- Statistical analysis confirms that scaling up does not improve consistency; across the tested models, larger ones trend toward higher inconsistency
- Human performance on the same paraphrase pairs shows >95% consistency
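One way to sanity-check a size-vs-inconsistency comparison like the one above is a two-proportion z-test; the counts below are hypothetical, not the paper's data:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for the difference between two inconsistency rates,
    where x is the count of inconsistent answers out of n paraphrase trials."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical: large model inconsistent on 60/150 trials, small on 30/150
z = two_proportion_z(60, 150, 30, 150)
print(round(z, 2))  # 3.78 (well past the ~1.96 threshold for p < 0.05)
```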
How to Apply
- Add consistency testing to your LLM evaluation suite: test the same questions with multiple phrasings and measure agreement rate
- For high-stakes applications, use ensemble/majority voting across paraphrased inputs to get more reliable answers
- If model size doesn't help consistency, consider fine-tuning for robustness rather than simply scaling up
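The majority-voting recommendation can be sketched as follows; `ask_fn` stands in for your actual LLM call, and the stub below is purely illustrative:

```python
from collections import Counter

def majority_vote(ask_fn, paraphrases):
    """Query the model once per paraphrased input and return the modal
    answer plus the agreement rate (a confidence signal for the vote)."""
    answers = [ask_fn(p) for p in paraphrases]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Stubbed "LLM" for illustration: maps a prompt to a fixed answer
stub = {"v1": "120 km", "v2": "120 km", "v3": "130 km"}.get
answer, agreement = majority_vote(stub, ["v1", "v2", "v3"])
print(answer, round(agreement, 2))  # 120 km 0.67
```

A low agreement rate is itself a useful signal: it flags inputs where the model is phrasing-sensitive and the voted answer deserves extra scrutiny.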
Code Example
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine
model_st = SentenceTransformer('all-MiniLM-L6-v2')
def semantic_invariance_score(original_output, transformed_output, reference):
    """
    Measure output robustness between the original and transformed inputs.
    The closer score_delta is to 0, the higher the semantic invariance.
    """
    emb_orig = model_st.encode(original_output)
    emb_trans = model_st.encode(transformed_output)
    emb_ref = model_st.encode(reference)
    score_original = 1 - cosine(emb_orig, emb_ref)
    score_transformed = 1 - cosine(emb_trans, emb_ref)
    score_delta = score_transformed - score_original
    trace_sim = 1 - cosine(emb_orig, emb_trans)
    return {
        "score_original": round(score_original, 4),
        "score_transformed": round(score_transformed, 4),
        "score_delta": round(score_delta, 4),  # closer to 0 means more robust
        "trace_similarity": round(trace_sim, 4),  # closer to 1 means more consistent reasoning
        "is_invariant": abs(score_delta) < 0.05,  # threshold from the paper
    }
# Usage example
original_problem = "A car travels 60 km/h for 2 hours. Find the distance."
paraphrased_problem = "If a vehicle moves at 60 kilometers per hour for a duration of 2 hours, what total distance does it cover?"
# Compare outputs after LLM call
original_answer = "The distance is 120 km."
transformed_answer = "The car covers 120 kilometers."
reference = "Distance = speed × time = 60 × 2 = 120 km"
result = semantic_invariance_score(original_answer, transformed_answer, reference)
print(result)
# Example output (exact values depend on the embedding model):
# {'score_original': 0.87, 'score_transformed': 0.86, 'score_delta': -0.01, 'trace_similarity': 0.94, 'is_invariant': True}
Original Abstract
Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.