Insurance Claim Automation Using Large Language Models

Claim Automation using Large Language Model

Feb 18, 2026 • Zhengda Mo, Zhiyu Quan, Eli O'Donohue +1

TL;DR Highlight

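An 8B model smaller than GPT-5, adapted only with LoRA fine-tuning, reached 92% accuracy on warranty claim processing (high-quality subset; 81.5% overall) and outperformed every commercial LLM it was compared against.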

Who Should Read

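ML and backend engineers who want to bring LLMs into production pipelines in regulated domains such as insurance and finance. Also useful for developers deciding between prompt engineering and fine-tuning for domain-specific text generation.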

Core Mechanics

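LoRA fine-tuning DeepSeek-R1-Distill-Llama-8B on 2 million automotive warranty claims beats commercial LLMs, including GPT-5.2, GPT-4.1, GPT-4o-mini, Claude Haiku 4.5, and Gemini-2.5-Flash, on BERT similarity.
With prompt engineering alone, the model followed the required output format only 6.5% of the time; after LoRA fine-tuning, compliance rose to 100%, so the output can feed a structured automation pipeline directly.
Human-evaluated accuracy: non-fine-tuned models plateau at 56 to 64%, while the fine-tuned model reaches 81.5% (92% on the high-quality data subset), a qualitatively different performance range.
Of 2,953 tire sidewall damage claims, the non-fine-tuned DeepSeek+Prompt model wrongly predicted 'repair' 262 times versus 1 time for the fine-tuned model, evidence that fine-tuning internalized the domain's operating rules.
Among evaluation metrics, BERT cosine similarity and LLM-as-a-Judge (GPT-4o-mini) correlate best with human judgment (Spearman ρ of 0.733 and 0.724, respectively), ahead of surface-similarity metrics such as BLEU and edit distance.
Instead of end-to-end automation, the modular design assigns the LLM only an intermediate task (generating the corrective action for a claim), which preserves auditability and regulatory fit.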
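
A narrowly scoped LLM step is easy to gate before anything downstream consumes its output. The following is a minimal sketch of such a gate, not the paper's implementation: the format rules (a single line, no `<think>` tags, a 300-character cap) and the names `check_format`, `GateResult`, and `compliance_rate` are assumptions made for illustration, since the required format is not spelled out here.

```python
import re
from dataclasses import dataclass

# Hypothetical format rules (assumptions, not taken from the paper).
MAX_CHARS = 300
THINK_TAG = re.compile(r"</?think>", re.IGNORECASE)


@dataclass
class GateResult:
    ok: bool
    reason: str


def check_format(output: str) -> GateResult:
    """Accept only a single, non-empty, bounded-length line without reasoning tags."""
    text = output.strip()
    if not text:
        return GateResult(False, "empty output")
    if THINK_TAG.search(text):
        return GateResult(False, "contains reasoning tags")
    if "\n" in text:
        return GateResult(False, "more than one line")
    if len(text) > MAX_CHARS:
        return GateResult(False, "too long")
    return GateResult(True, "ok")


def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass the gate (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(check_format(o).ok for o in outputs) / len(outputs)
```

Outputs that fail the gate can be routed to a human adjuster with the recorded `reason`, which keeps every rejected case auditable.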

Evidence

Format compliance: fine-tuned (M4) 100% vs Qwen-Instruct+Prompt (M3) 86.5% vs DeepSeek+Prompt (M2) 6.5%
Accuracy (Acc Valid): fine-tuned 81.5% vs best non-fine-tuned (DeepSeek+Prompt) 64.4%; on the HQ subset, 92.0% vs 71.2%
Mean BERT cosine similarity (all 1,500 cases): fine-tuned 0.869 > Gemini-2.5-Flash 0.799 > GPT-4o-mini 0.787 > Claude Haiku 4.5 0.757 > GPT-4.1 0.749 > GPT-5.2 0.719
Share of high-quality predictions at the κ=0.77 threshold: fine-tuned 79.1% vs Gemini-2.5-Flash 68.5% vs GPT-5.2 47.2%
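
The threshold-based figure in the last line can be read as a simple fraction: the share of cases whose similarity score clears κ. Below is a small sketch of that arithmetic. The function name `high_quality_share` and the choice of an inclusive comparison (`>=`) are assumptions, and the scores in the usage note are made up for illustration.

```python
def high_quality_share(similarities: list[float], threshold: float = 0.77) -> float:
    """Fraction of per-case similarity scores at or above the threshold."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    hits = sum(1 for s in similarities if s >= threshold)
    return hits / len(similarities)
```

For example, `high_quality_share([0.9, 0.8, 0.5, 0.77])` counts three of the four made-up scores as high quality.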

How to Apply

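In regulated domains, consider LoRA fine-tuning a local open-source model (Llama or DeepSeek family) with HuggingFace PEFT and the TRL SFTTrainer instead of calling an external LLM API. The paper's default settings are r=32, α=32, lr=6e-5, 1 epoch, AdamW, FP16.
Define the LLM's role narrowly as one structured intermediate output (for example, generating the corrective-action text) rather than the whole pipeline. This makes validation and auditing easier and puts 100% format compliance within reach.
When evaluating LLM output quality, use BERT cosine similarity (all-mpnet-base-v2) or LLM-as-a-Judge as the default metric instead of BLEU; both correlate much more strongly with human judgment.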
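
Before adopting a metric as the default, it is worth checking how well it tracks human ratings on your own data. The sketch below computes Spearman's rank correlation in plain Python (average ranks for ties, then Pearson correlation on the ranks). The function names are illustrative, and this is not the paper's evaluation code.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(metric_scores: list[float], human_scores: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

Pass the automated metric's per-case scores and the matching human ratings; a value near 1 means the metric orders cases the way human raters do.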

Code Example

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# 1. Load the base model
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="float16")

# 2. LoRA configuration (the paper's defaults)
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# 3. Training data. The file path is a placeholder (assumption); each record
#    needs 'complaint', 'cause', and 'correction' fields.
train_dataset = load_dataset("json", data_files="claims.jsonl", split="train")

# Prompt = instruction + Complaint + Cause; completion = Correction.
def format_claim(example):
    return {
        "prompt": (
            "You are given a warranty claim description.\n"
            "Your task: Output ONLY the corrective action.\n"
            f"Claim description: {example['complaint']} {example['cause']}\n"
            "corrective action include:"
        ),
        "completion": f" {example['correction']}",
    }

train_dataset = train_dataset.map(format_claim, remove_columns=train_dataset.column_names)

# 4. Train with SFTTrainer. With prompt/completion columns, recent TRL releases
#    compute the loss on completion tokens only by default.
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    args=SFTConfig(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=6e-5,
        num_train_epochs=1,
        fp16=True,
        max_length=2048,  # named max_seq_length in older TRL releases
    ),
)
trainer.train()

Terminology

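LoRA: A technique that avoids retraining the whole model (billions of parameters) and instead trains only two small inserted matrices, A and B. Like correcting vision by swapping only the lenses of a pair of glasses, the original model stays untouched and only the adapter changes.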
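
As a compact restatement in the standard LoRA formulation (the notation is the usual one from the LoRA literature, not copied from this paper), the frozen weight $W_0$ is augmented with a scaled low-rank product, and only $A$ and $B$ receive gradients:

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

With the settings quoted in this summary (r=32, α=32), the scaling factor α/r equals 1.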

SFT (Supervised Fine-Tuning): Training the model to imitate worked examples (input → output pairs), much like learning at school by following solved examples.
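
When SFT computes the loss on response tokens only, the usual trick is to mask the prompt positions in the label sequence so they do not contribute to the loss. The sketch below shows that masking on plain lists of token ids. The helper name `build_labels` and the token ids in the usage note are made up for illustration; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids: list[int], completion_ids: list[int]):
    """Concatenate prompt and completion; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels
```

For a three-token prompt and a two-token completion, `build_labels([11, 12, 13], [21, 22])` keeps all five ids as input but leaves only the last two positions visible to the loss.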

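BERTScore: A metric that measures semantic similarity between predicted text and the reference using embeddings from a BERT-style model. It gives a high score when the meaning matches even if the words differ.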
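
The core computation behind an embedding-based similarity score is a cosine between two vectors. The sketch below keeps the embedding model pluggable so it runs without downloads. `toy_embed` is a stand-in invented for this example (bag-of-words counts over a four-word vocabulary) and does not capture meaning the way a real sentence encoder does.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def semantic_similarity(
    prediction: str,
    reference: str,
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Embed both texts with the supplied function and compare them."""
    return cosine_similarity(embed(prediction), embed(reference))


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder (assumption): word counts over a tiny fixed vocabulary."""
    vocab = ["replace", "repair", "tire", "battery"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]
```

In practice `embed` would wrap a real encoder; with the sentence-transformers library, one option is `SentenceTransformer("all-mpnet-base-v2").encode`, which is not executed here.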

LLM-as-a-Judge: An evaluation approach in which one LLM grades another LLM's output, with a model such as GPT or Claude acting as the grader in place of a human.

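LoRA rank (r): The inner dimension of the low-rank matrices LoRA trains. With r=32, each adapted weight gets a pair of matrices that share an inner dimension of 32. A smaller r means fewer parameters and faster training, but less expressive capacity.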
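
To make the parameter savings concrete, the sketch below compares the trainable parameters of one LoRA adapter with those of the full weight matrix it adapts. The 4096×4096 projection size is an illustrative assumption, and the function names are hypothetical.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the LoRA pair: B is d_out x r and A is r x d_in."""
    return r * (d_in + d_out)


d = 4096  # illustrative projection size (assumption)
r = 32    # rank quoted in this summary

full = full_param_count(d, d)     # 16,777,216
lora = lora_param_count(d, d, r)  # 262,144
ratio = lora / full               # 0.015625, about 1.6% of the full matrix
```

Halving r halves the adapter size, which is the trade-off between parameter count and expressive capacity described above.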

RoPE (Rotary Position Embedding): A technique that encodes position into the embeddings through a rotation, so the transformer can track token order. Used by the LLaMA and DeepSeek families.

BLEURT: A learned evaluation model trained on human ratings. Given predicted text and a reference, it predicts the score a human would likely assign.

Original Abstract

While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.