Comparing Strategies for Labeling Text Spans with LLMs, and LOGITMATCH

Strategies for Span Labeling with Large Language Models

Jan 23, 2026 • D. Semin, Ondrej Dusek, Zdenek Kasner

TL;DR Highlight

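A paper that experimentally compares XML tagging, indexing, and JSON matching for LLM-based span labeling tasks such as NER and grammatical error detection, and introduces LOGITMATCH, which eliminates matching errors at the source.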

Who Should Read

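Backend and ML engineers building NER, error detection, or information extraction pipelines with LLMs. It is directly applicable to developers who parse LLM output to locate specific spans in text and are deciding which prompt format to use.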

Core Mechanics

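LLM-based span labeling strategies fall into three families: Tagging, which wraps spans in XML tags; Indexing, which outputs character positions as numbers; and Matching, which outputs the span text verbatim in JSON.
LLMs cannot copy text reliably: they fix typos, change capitalization, and so on, so Matching output can fail to match the source. Tagging has the same problem, but heuristic post-processing covers it to some extent.
LLMs cannot compute character indices directly: the Indexing approach often produces wrong indices that ignore word boundaries. Inserting numbers directly into the input text (INDEX-ENRICHED) helps, but it can also hurt model performance.
LOGITMATCH: constrained decoding (restricting which tokens may be generated) that limits the vocabulary to input tokens while the 'text' field is generated during JSON decoding. It needs no fine-tuning and can be implemented as a vLLM LogitsProcessor.
XML tagging is the most stable method overall, and it is consistently ahead on GEC (grammatical error correction). Its drawback is that it uses more tokens than Matching.
Forcing structured output is not always beneficial: enforcing a format blocks the chain-of-thought the model would otherwise produce spontaneously, which can lower performance.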
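
The LOGITMATCH idea above can be illustrated with a small, framework-free sketch. It is not the authors' implementation and does not use the vLLM API. It assumes that, while the 'text' field is being generated, the next token must extend a contiguous slice of the input token sequence; `allowed_next_tokens`, `mask_logits`, and the toy token ids are hypothetical names and values chosen for illustration.

```python
import math

def allowed_next_tokens(input_ids, prefix_ids, close_id):
    """Token ids that keep the generated span a contiguous slice of the input.

    input_ids:  token ids of the input text
    prefix_ids: token ids generated so far inside the 'text' field
    close_id:   hypothetical id of the token that ends the field (e.g. a quote)
    """
    n, k = len(input_ids), len(prefix_ids)
    allowed = set()
    for start in range(n - k + 1):
        # Does the prefix occur in the input at this position?
        if input_ids[start:start + k] == prefix_ids and start + k < n:
            allowed.add(input_ids[start + k])  # the token that follows it
    if k > 0:
        allowed.add(close_id)  # a non-empty span may be closed
    return allowed

def mask_logits(logits, allowed):
    """Set the logit of every disallowed token id to -inf."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

# Toy input: token 7 appears twice, followed by 9 and by 11.
input_ids = [5, 7, 9, 7, 11]
CLOSE = 0
print(sorted(allowed_next_tokens(input_ids, [], CLOSE)))       # [5, 7, 9, 11]
print(sorted(allowed_next_tokens(input_ids, [7], CLOSE)))      # [0, 9, 11]
print(sorted(allowed_next_tokens(input_ids, [7, 11], CLOSE)))  # [0]
```

In a real decoder the mask would be applied at every step of the 'text' field, so the model can never emit a span that is absent from the input.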

Evidence

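Direct index prediction (INDEX) scores 24% F1 or lower on every task with open LLMs, the lowest of all strategies (Qwen3-8B NER: 17.8%, Llama-3.3-70B NER: 23.3%).
Inserting numbers into the input (INDEX-ENRICHED) improves NER by 21 to 45 percentage points (Qwen3-8B: 17.8 → 39.6, Llama-3.3-70B: 23.3 → 59.3).
On the CPL task (the same span appears multiple times), adding occurrence_index improves matching performance by roughly 30 to 40 percentage points (e.g., Qwen3-8B: MATCH 30.6 → MATCH-OCC 73.4, a gain of 42.8 points).
With reasoning enabled (Think mode) on Qwen3-8B, LOGITMATCH hard F1 rises sharply: NER 71.4 → 84.2 and GEC 15.8 → 35.8.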
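
For readers who want to reproduce scores of this kind, here is a minimal sketch of a span-level F1. It assumes that "hard" F1 means a prediction counts only when start, end, and label all match a gold span exactly; the paper's exact scoring rules may differ, and `span_f1` is a hypothetical helper.

```python
def span_f1(predicted, gold):
    """Exact-match F1 over (start, end, label) tuples."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Offsets refer to "Turing was born in London."
gold = [(0, 6, "PER"), (19, 25, "LOC")]
pred = [(0, 6, "PER"), (19, 25, "ORG"), (7, 10, "LOC")]

# One of three predictions is correct and one of two gold spans is found.
print(round(span_f1(pred, gold), 3))
```

Under this definition a span with the right boundaries but the wrong label earns no credit, which is why boundary and copying errors are so costly for Indexing and Matching.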

How to Apply

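If you need basic span labeling, start with XML tagging: it is the most stable option and is especially strong on tasks like GEC where exact span boundaries matter. The prompt must state explicitly: 'copy the entire input and wrap the relevant spans in tags'.
If you use JSON Matching and the same word can appear several times (e.g., log parsing, repeated-pattern detection), add an occurrence field: the CPL task showed a 30 to 40 percentage point gain.
If you run a local LLM with vLLM, consider adopting the LOGITMATCH LogitsProcessor: on text with non-standard tokenization (such as input preprocessed by NLP tools), it can eliminate the misalignment problem of Matching at the source.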
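
To act on the XML tagging advice, the tagged output has to be turned back into character offsets. The sketch below is one way to do that. It assumes the `<entity type="...">` tag format and no nested tags; `parse_tagged` is a hypothetical helper, and it only detects an unfaithful copy rather than repairing it with the heuristic post-processing the paper mentions.

```python
import re

# Assumed tag format: <entity type="LABEL">span text</entity>, no nesting.
TAG = re.compile(r'<entity type="([A-Z]+)">(.*?)</entity>', re.DOTALL)

def parse_tagged(output):
    """Strip tags and return (plain_text, [(start, end, label), ...])."""
    pieces, spans = [], []
    cursor = 0   # position in the tagged output
    offset = 0   # position in the tag-free text
    for match in TAG.finditer(output):
        before = output[cursor:match.start()]
        pieces.append(before)
        offset += len(before)
        label, content = match.group(1), match.group(2)
        spans.append((offset, offset + len(content), label))
        pieces.append(content)
        offset += len(content)
        cursor = match.end()
    pieces.append(output[cursor:])
    return "".join(pieces), spans

source = "Turing was born in London."
output = ('<entity type="PER">Turing</entity> was born in '
          '<entity type="LOC">London</entity>.')
text, spans = parse_tagged(output)
print(spans)           # [(0, 6, 'PER'), (19, 25, 'LOC')]
print(text == source)  # True: offsets are valid for the original input
```

When the tag-free text differs from the input, the offsets describe the model's copy rather than the source, so the output needs realignment or a retry.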

Code Example

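# Example prompts for the three strategies.
# Fill {input_text} with str.replace, not str.format: the JSON examples contain literal braces.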

# 1. XML Tagging: the most stable
tagging_prompt = """
Extract named entities (PER, ORG, LOC) from the text.
Surround spans with XML tags. Copy the ENTIRE input text including non-tagged parts.

Example:
Input: Turing was born in London.
Output: <entity type="PER">Turing</entity> was born in <entity type="LOC">London</entity>.

Input: {input_text}
Output:"""

# 2. JSON Matching: token-efficient
matching_prompt = """
Extract named entities (PER, ORG, LOC) from the text.
Return a valid JSON array only. Use exact text from input.

Example:
Input: Turing was born in London.
Output: [{"text": "Turing", "label": "PER"}, {"text": "London", "label": "LOC"}]

Input: {input_text}
Output:"""

# 3. JSON Matching + occurrence: handles repeated spans
matching_occ_prompt = """
Extract named entities. Include occurrence index to disambiguate repeated spans.

Example:
Input: The Paris agreement was signed in Paris.
Output: [{"text": "Paris", "label": "ORG", "occurrence": 1},
         {"text": "Paris", "label": "LOC", "occurrence": 2}]

Input: {input_text}
Output:"""

# LOGITMATCH: implemented as a vLLM LogitsProcessor (local LLMs only)
# See https://github.com/semindan/span_labeling
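
The prompts above still need code around them: one step to fill in the input and one to map the JSON output back to character offsets. The following is a minimal sketch under stated assumptions. `fill_prompt`, `find_nth`, and `resolve_spans` are hypothetical helpers, the `occurrence` field is taken to be 1-based as in the prompt example, and unmatched spans are simply dropped.

```python
import json

def fill_prompt(template, input_text):
    # str.format would choke on the literal braces of the JSON examples.
    return template.replace("{input_text}", input_text)

def find_nth(text, needle, n):
    """Start offset of the n-th (1-based) occurrence of needle, or -1."""
    start = -1
    for _ in range(n):
        start = text.find(needle, start + 1)
        if start == -1:
            return -1
    return start

def resolve_spans(input_text, model_output):
    """Map [{"text", "label", "occurrence"}] items to (start, end, label)."""
    spans = []
    for item in json.loads(model_output):
        start = find_nth(input_text, item["text"], item.get("occurrence", 1))
        if start == -1:
            continue  # the model altered the text: the Matching failure mode
        spans.append((start, start + len(item["text"]), item["label"]))
    return spans

template = 'Input: {input_text}\nOutput: [{"text": "...", "label": "..."}]'
print(fill_prompt(template, "Turing was born in London."))

text = "The Paris agreement was signed in Paris."
output = ('[{"text": "Paris", "label": "ORG", "occurrence": 1},'
          ' {"text": "Paris", "label": "LOC", "occurrence": 2}]')
print(resolve_spans(text, output))  # [(4, 9, 'ORG'), (34, 39, 'LOC')]
```

Without the `occurrence` field, both items would resolve to the first "Paris", which is the ambiguity the MATCH-OCC variant addresses.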

Terminology

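Span Labeling: Finding specific spans in a text (e.g., 'Seoul', 'the part with a grammatical error') and attaching labels to them. NER and error detection are typical examples.

Constrained Decoding: A technique that restricts the LLM to choosing its next token from a list of allowed tokens. It is like a multiple-choice question where you cannot write an answer that is not among the options.

LogitsProcessor: A component that modifies the score (logit) of each candidate token as the LLM generates. Used to forbid or force specific tokens.

BIO Tags: The classic scheme that tags each word as B (beginning of a span), I (inside a span), or O (outside any span). Mostly used with encoder models such as BERT.

NER: Named Entity Recognition. The task of finding proper nouns such as person names, organizations, and locations in text.

structured output: A feature that forces the LLM to emit only results conforming to a specific JSON schema or format. Supported by frameworks such as vLLM and Guidance.

vLLM: An open-source framework for fast LLM serving. It manages GPU memory efficiently using the PagedAttention technique.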

Related Resources

https://github.com/semindan/span_labeling

Original Abstract

Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model's output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.