Semantic Invariance in Agentic AI: Evaluating the Robustness of LLM Reasoning Agents

Semantic Invariance in Agentic AI

Mar 13, 2026 • I. de Zarzà, J. de Curtò, Jordi Cabot +2

TL;DR Highlight

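The paper systematically measures how consistently LLMs answer when the same problem is phrased in different ways, and finds that larger models were, if anything, less stable.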

Who Should Read

AI engineers bringing LLMs into decision-making systems or multi-agent pipelines. Developers who need a basis for judging which models behave reliably in real deployments.

Core Mechanics

The common assumption that larger models are more stable does not hold: Qwen3-30B-A3B (only 3B active parameters) ranks first in robustness, ahead of much larger models.
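Each problem is rewritten with eight transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, contrastive formulation), and the resulting change in model output is measured.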
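The contrastive transformation is the one weakness shared by every model; for gpt-oss-120b the score delta reaches -0.449.
Vulnerability patterns differ by model family: Hermes is weak against contrastive phrasing, DeepSeek-R1 is sensitive to fact reordering, and gpt-oss is unstable across the board.
Qwen3-30B-A3B shows the most consistent reasoning, with a stability rate of 79.6% and a semantic similarity of 0.914.
Under the expansion transformation, Qwen3 improves slightly while gpt-oss and DeepSeek get worse, so prompt-length strategy should be chosen per model.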
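
As a rough illustration of the transformations listed above, the sketch below builds a few variants of one problem from plain string templates. The function name `build_variants`, the template sentences, and the distractor text are all assumptions made for this example, not the paper's actual prompts. Paraphrase, expansion, and contraction are left out because they generally need a human or an LLM to rewrite the text.

```python
import random


def build_variants(facts, question, distractor, seed=0):
    """Return a dict of prompt variants for one problem (hypothetical templates).

    facts: list of sentences stating the givens
    question: the sentence asking for the result
    distractor: a sentence describing a contrasting scenario or misconception
    """
    base = " ".join(facts + [question])

    # Fact reordering: same givens in a shuffled order, question kept last.
    shuffled = facts[:]
    random.Random(seed).shuffle(shuffled)
    reordered = " ".join(shuffled + [question])

    return {
        "identity": base,
        "reorder_facts": reordered,
        # Context framings are made-up wrappers around the unchanged problem.
        "academic_context": "The following appears in a university problem set. " + base,
        "business_context": "A client has asked for a quick estimate. " + base,
        # Contrastive: the unchanged problem plus a potentially confusing aside.
        "contrastive": base + " " + distractor,
    }


variants = build_variants(
    facts=["A car travels at 60 km/h.", "It drives for 2 hours."],
    question="Find the distance.",
    distractor="Note that a slower car would cover less distance.",
)
for name, prompt in variants.items():
    print(f"{name}: {prompt}")
```

Every variant keeps the same givens and the same question, so a semantically invariant model should return the same answer for all of them.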

Evidence

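Qwen3-30B-A3B: MAD (mean absolute delta, the average magnitude of score change) 0.049, stability rate 79.6%, semantic similarity 0.914; ranked first in robustness among the 7 models.
gpt-oss-120b: score delta of -0.449 under the contrastive transformation, stability rate 27.0%; the least stable model.
DeepSeek-R1-0528: delta of -0.171 under the fact-reordering transformation; particularly vulnerable to structural transformations.
The robustness difference between Qwen3-30B and the gpt-oss models is statistically significant (Mann-Whitney U, p < 0.001).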
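
To make the metrics above concrete, here is a minimal sketch of how MAD, stability rate, and the Mann-Whitney U statistic could be computed from per-test score deltas. The function names and the sample deltas are invented for illustration and are not data from the paper; the 0.05 threshold follows the stability-rate definition used in this summary. The p-value is not computed here; a statistics library would be needed for that.

```python
def mean_absolute_delta(deltas):
    """Average magnitude of the score change across transformed inputs."""
    return sum(abs(d) for d in deltas) / len(deltas)


def stability_rate(deltas, threshold=0.05):
    """Fraction of tests whose absolute score change stays below the threshold."""
    stable = sum(1 for d in deltas if abs(d) < threshold)
    return stable / len(deltas)


def mann_whitney_u(xs, ys):
    """U statistic for xs against ys: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


# Made-up score deltas for two hypothetical models.
model_a = [0.01, -0.02, 0.03, -0.01]
model_b = [0.01, -0.02, 0.10, -0.30]

print(mean_absolute_delta(model_b))
print(stability_rate(model_b))  # 2 of 4 deltas are below 0.05 in magnitude
print(mann_whitney_u([abs(d) for d in model_a], [abs(d) for d in model_b]))
```

A lower MAD and a higher stability rate both indicate a model whose score moves less when the input is rephrased.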

How to Apply

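When choosing a model, do not rely on benchmark accuracy alone. Write a few paraphrased, reordered, and expanded versions of the same prompt and test output consistency directly. Hermes models in particular drop sharply on inputs that contain contrastive phrasing.
When designing a multi-agent system, split roles according to each model's weak spots: for example, use a Qwen3 model on critical reasoning paths and assign Hermes to simple tasks with no contrastive context.
In services that accept free-form user input, contrastive or misleading context can arrive naturally. A practical defense is to strip unnecessary contrastive phrasing in an input preprocessing step, or to state in the system prompt that it should be ignored.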
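
The consistency spot-check suggested above might look something like the following sketch. It assumes the model is wrapped in a plain callable that maps a prompt to a reply; `fake_model`, `extract_number`, and `consistency_report` are hypothetical names, and `fake_model` is a stand-in that deliberately breaks on the contrastive variant so the report has something to show. Taking the first number in the reply is a crude way to compare answers and would need replacing for non-numeric tasks.

```python
import re


def extract_number(text):
    """Pull the first number out of a model reply, or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def consistency_report(ask, variants, extract=extract_number):
    """Run `ask` on every variant and compare each answer with the identity answer.

    variants must contain an "identity" key holding the unmodified prompt.
    """
    answers = {name: extract(ask(prompt)) for name, prompt in variants.items()}
    baseline = answers["identity"]
    flipped = sorted(name for name, a in answers.items() if a != baseline)
    return {
        "answers": answers,
        "flipped": flipped,
        "agreement": 1 - len(flipped) / len(answers),
    }


def fake_model(prompt):
    # Stand-in for a real LLM call (assumption): fails when a distractor is present.
    return "About 90 km." if "slower" in prompt else "The distance is 120 km."


demo_variants = {
    "identity": "A car travels 60 km/h for 2 hours. Find the distance.",
    "paraphrase": "A vehicle moves at 60 km/h for 2 hours. How far does it go?",
    "contrastive": "A car travels 60 km/h for 2 hours. Find the distance. "
                   "A slower car would cover less distance.",
}

report = consistency_report(fake_model, demo_variants)
print(report["flipped"])    # variants whose answer differs from the identity answer
print(report["agreement"])  # share of variants that match the identity answer
```

Swapping `fake_model` for a real client call and running the same report across candidate models gives a quick, task-specific view of which transformations each model struggles with.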

Code Example

from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model_st = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_invariance_score(original_output, transformed_output, reference):
    """
    Compare the model's output on the original input with its output on a
    transformed input, scoring both against a reference answer.
    - The closer score_delta is to 0, the higher the semantic invariance.
    """
    emb_orig = model_st.encode(original_output)
    emb_trans = model_st.encode(transformed_output)
    emb_ref = model_st.encode(reference)

    # scipy's cosine() is a distance, so 1 - distance gives cosine similarity.
    score_original = 1 - cosine(emb_orig, emb_ref)
    score_transformed = 1 - cosine(emb_trans, emb_ref)
    score_delta = score_transformed - score_original
    trace_sim = 1 - cosine(emb_orig, emb_trans)

    # Cast NumPy scalars to built-in types so the result prints cleanly.
    return {
        "score_original": round(float(score_original), 4),
        "score_transformed": round(float(score_transformed), 4),
        "score_delta": round(float(score_delta), 4),    # closer to 0 means more robust
        "trace_similarity": round(float(trace_sim), 4),  # closer to 1 means more consistent reasoning
        "is_invariant": bool(abs(score_delta) < 0.05),   # threshold per the paper
    }

# Usage example
original_problem = "A car travels 60 km/h for 2 hours. Find the distance."
paraphrased_problem = "If a vehicle moves at 60 kilometers per hour for a duration of 2 hours, what total distance does it cover?"

# In practice, send both problems to the LLM and compare its two answers;
# the answers below are hard-coded stand-ins for those replies.
original_answer = "The distance is 120 km."
transformed_answer = "The car covers 120 kilometers."
reference = "Distance = speed × time = 60 × 2 = 120 km"

result = semantic_invariance_score(original_answer, transformed_answer, reference)
print(result)
# Example output (illustrative; exact values depend on the embedding model):
# {'score_original': 0.87, 'score_transformed': 0.86, 'score_delta': -0.01, 'trace_similarity': 0.94, 'is_invariant': True}

Terminology

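Semantic Invariance: The property that a model gives the same answer when a statement with the same meaning is phrased differently, just as a person computes the same distance whether told "drove at 60 km/h for two hours" or "traveled for 2 hours at 60 kilometers per hour".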

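Metamorphic Testing: A testing technique that catches bugs even when the correct answer is unknown, by defining relations of the form "if the input changes this way, the output must change that way". Example: if you add twice the salt, the dish should taste twice as salty.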

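MoE (Mixture of Experts): An architecture in which the whole model is not always active; only the expert sub-networks needed for a given input are selected and activated. It is like a restaurant with several chefs where only the chef responsible for the ordered dish steps in.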

MAD (Mean Absolute Delta): The average magnitude of the change in output score under transformed inputs. Lower values mean the answer holds steady when the input changes slightly.

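Contrastive Transformation: A transformation that deliberately adds confusion by attaching an "opposite scenario" or a "common misconception" to the original problem. Every model was vulnerable to it.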

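Stability Rate: The share of all tests in which the output change is below 0.05. A rate of 79.6% means the model gave a consistent answer in roughly 8 out of 10 cases despite the rephrasing.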

Semantic Similarity Score: A number between 0 and 1 indicating how close two texts are in meaning, computed by converting the sentences to vectors and taking their cosine similarity.
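
The metamorphic testing idea defined above can be shown on a toy function, without any LLM involved. In this sketch the relation is "doubling the travel time must double the distance"; both `distance_km` and its deliberately buggy counterpart are invented for illustration. The check never needs to know the correct distance, only how two outputs should relate.

```python
def distance_km(speed_kmh, hours):
    """Correct implementation: distance = speed * time."""
    return speed_kmh * hours


def buggy_distance_km(speed_kmh, hours):
    """Deliberately wrong implementation with a constant offset."""
    return speed_kmh * hours + 1


def holds_doubling_relation(fn, speed_kmh, hours):
    """Metamorphic relation: doubling the time must double the distance."""
    return fn(speed_kmh, 2 * hours) == 2 * fn(speed_kmh, hours)


print(holds_doubling_relation(distance_km, 60, 2))        # True
print(holds_doubling_relation(buggy_distance_km, 60, 2))  # False
```

Semantic invariance testing uses the simplest such relation: when the input changes only in wording, the output should not change at all.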

Original Abstract

Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.