FMBench: A Benchmark for Markdown Output Format Accuracy in LLMs

FMBench: Adaptive Large Language Model Output Formatting

Feb 6, 2026 • Yaoting Wang, Yun Zhou, Henghui Ding

TL;DR Highlight

A benchmark that measures how well LLMs follow Markdown formatting, plus an SFT→GRPO fine-tuning recipe that raises format compliance.

Who Should Read

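Backend and AI engineers who run services that parse or render LLM responses as Markdown, especially developers struggling with format errors in chatbots, document automation, or tool-integration pipelines.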

Core Mechanics

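LLMs often get Markdown formatting wrong even when the content is right: broken lists, malformed tables, inconsistent headings, and unclosed code blocks are typical.
FMBench: a Markdown-formatting benchmark of 1,100 samples across 8 domains (academic, legal, business, and others), released publicly with an 800 train / 300 test split.
Two evaluation metrics: semantic preservation (BERTScore-F1) and structural compliance (structural reward). Experiments confirm that the two trade off against each other.
SFT alone (LoRA, rank=8) already raises the semantic score substantially; adding GRPO (reinforcement learning) on top improves the structural score as well.
Across four models (OpenPangu-1B/7B, Qwen3-1.7B/8B), the larger the model, the clearer the benefit of optimizing semantics and structure together.
Pure GRPO (without SFT) can actually lower the semantic score on small models, so the SFT→GRPO order matters.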
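How the two metrics could feed a GRPO update can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weights `w_sem` and `w_struct`, the function names, and the example scores are all assumptions. Only the group-relative normalization (reward minus the group mean, divided by the group standard deviation) follows the usual GRPO formulation.

```python
from statistics import mean, pstdev

def composite_reward(semantic: float, structure: float,
                     w_sem: float = 0.5, w_struct: float = 0.5) -> float:
    """Weighted sum of a semantic score and a structural score (weights are assumed)."""
    return w_sem * semantic + w_struct * structure

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical (semantic, structure) scores for four responses sampled from one prompt
group = [(0.95, 0.90), (0.93, 0.99), (0.80, 0.97), (0.96, 0.98)]
rewards = [composite_reward(s, t) for s, t in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive a positive advantage and the rest a negative one, so shifting the weights changes which side of the semantic/structural trade-off is reinforced.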

Evidence

Qwen3-8B: SFT+GRPO moves Semantic 0.9347→0.9507 and Structure 0.9700→0.9708
OpenPangu-1B: SFT+GRPO moves Semantic 0.9300→0.9482 and Structure 0.9535→0.9603
In the ablation, SFT alone barely changes the structure score (0.9535→0.9602), and adding GRPO improves structure further
Pure GRPO (without SFT) barely raises OpenPangu-1B's Semantic score (0.9300→0.9308), whereas GRPO after SFT lifts it to 0.9482
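The size of these gains is easier to compare as absolute differences. The snippet below only does arithmetic on the SFT+GRPO numbers quoted above; the dictionary layout and the function name are illustrative choices.

```python
# (before, after) scores as quoted in the Evidence list
reported = {
    ("Qwen3-8B", "semantic"): (0.9347, 0.9507),
    ("Qwen3-8B", "structure"): (0.9700, 0.9708),
    ("OpenPangu-1B", "semantic"): (0.9300, 0.9482),
    ("OpenPangu-1B", "structure"): (0.9535, 0.9603),
}

def gains(scores: dict) -> dict:
    """Absolute gain (after - before) per (model, metric), rounded to 4 decimals."""
    return {key: round(after - before, 4) for key, (before, after) in scores.items()}

deltas = gains(reported)
for (model, metric), delta in deltas.items():
    print(f"{model:13s} {metric:9s} +{delta:.4f}")
```

On these four pairs the semantic gains (0.0160 and 0.0182) are larger than the structural gains (0.0008 and 0.0068).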

How to Apply

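If your LLM output keeps breaking format, start by using the FMBench dataset (GitHub) to measure your own model's Markdown compliance.
If fine-tuning is an option, apply SFT (LoRA rank=8, lr=2e-5, 1 epoch) and then GRPO, in that order. Keep the division of labor in mind: SFT handles semantic quality, GRPO handles structural stability.
If you rely on prompt engineering alone without fine-tuning, you can compensate by running a validator in a post-processing step that checks the items the paper points out: code-fence balance, list-nesting validity, and table column consistency.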
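The table column consistency item on that checklist could be checked along these lines. This is a simplified sketch under stated assumptions: every line containing a pipe is treated as a table row, and escaped pipes, pipes inside code spans, and pipes inside fenced code are not handled. The function names are hypothetical and do not come from the FMBench code.

```python
def split_row(line: str) -> list[str]:
    """Split a pipe-table row into cells, ignoring optional outer pipes."""
    stripped = line.strip()
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return [cell.strip() for cell in stripped.split("|")]

def check_table_columns(text: str) -> list[str]:
    """Report rows whose cell count differs from the first row of the same table."""
    issues = []
    expected = None  # column count of the current table, or None outside a table
    for number, line in enumerate(text.splitlines(), start=1):
        if "|" not in line:
            expected = None  # a line without a pipe ends the current table
            continue
        count = len(split_row(line))
        if expected is None:
            expected = count
        elif count != expected:
            issues.append(f"line {number}: expected {expected} columns, found {count}")
    return issues

good_table = "| a | b |\n|---|---|\n| 1 | 2 |"
bad_table = "| a | b |\n|---|---|\n| 1 | 2 | 3 |"
good_issues = check_table_columns(good_table)
bad_issues = check_table_columns(bad_table)
```

A production validator would more likely walk the output of a real Markdown parser than match on raw lines; the line-based version is only meant to show the shape of the check.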

Code Example


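# FMBench-style Markdown structure check (Python); an illustrative validator, not the paper's reward
import re

FENCE = '`' * 3  # Markdown code fence marker

def check_markdown_structure(text: str) -> dict:
    issues = []
    heading_levels = []
    list_indents = []
    in_code_block = False

    for line in text.splitlines():
        # A line starting with the fence marker opens or closes a fenced code block
        if line.lstrip().startswith(FENCE):
            in_code_block = not in_code_block
            continue
        if in_code_block:
            continue  # skip code so '#' comments are not mistaken for headings

        heading = re.match(r'(#{1,6})\s', line)
        if heading:
            heading_levels.append(len(heading.group(1)))

        list_item = re.match(r'([ \t]*)[-*+]\s', line)
        if list_item:
            list_indents.append(len(list_item.group(1)))

    # Code fence balance check: an odd number of fences leaves a block open
    if in_code_block:
        issues.append('Unbalanced code fences')

    # Heading hierarchy check: a heading may go at most one level deeper than the previous one
    for prev, curr in zip(heading_levels, heading_levels[1:]):
        if curr - prev > 1:
            issues.append(f'Heading level jump: h{prev} -> h{curr}')

    # List indentation check (basic): flags jumps of more than 2 spaces between consecutive items
    for prev, curr in zip(list_indents, list_indents[1:]):
        if curr - prev > 2:
            issues.append('Excessive list indent jump')

    return {'valid': len(issues) == 0, 'issues': issues}

# Usage example; llm_output is a placeholder for a real model response
llm_output = '# Title\n\n### Skipped level\n\n- item\n'
result = check_markdown_structure(llm_output)
if not result['valid']:
    print('Format issues:', result['issues'])
    # Trigger regeneration or post-processing here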
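The regeneration step mentioned in the last comment could be wired up roughly like this. It is a minimal sketch: `generate_with_validation`, the stub generator, and `fence_validator` are hypothetical names introduced for illustration, and the validator is assumed to return a dict with `valid` and `issues` keys as in the example above.

```python
from typing import Callable

def generate_with_validation(
    generate: Callable[[], str],
    validate: Callable[[str], dict],
    max_attempts: int = 3,
) -> tuple[str, dict]:
    """Call generate() until validate() reports a valid result or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        output = generate()
        report = validate(output)
        if report["valid"]:
            break
    return output, report

FENCE = "`" * 3  # Markdown code fence marker

def fence_validator(text: str) -> dict:
    """Tiny stand-in validator that checks only code fence balance."""
    balanced = text.count(FENCE) % 2 == 0
    return {"valid": balanced, "issues": [] if balanced else ["Unbalanced code fences"]}

# Stub generator: the first draft leaves a fence open, the second closes it
drafts = iter([FENCE + "python\nprint('hi')\n", FENCE + "python\nprint('hi')\n" + FENCE + "\n"])
final_output, final_report = generate_with_validation(lambda: next(drafts), fence_validator)
```

If every attempt fails, the helper returns the last output together with its report, so the caller can decide whether to repair the text or surface the error.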

Terminology

SFT: Supervised Fine-Tuning, a method that trains a model by showing it reference answers to imitate. It is like a teacher demonstrating a worked solution and the student following along.

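GRPO: Group Relative Policy Optimization, a reinforcement learning technique that generates several candidate responses at once and assigns rewards by their relative quality. It is like comparing several exam answers against one another and giving the most credit to the best-written one.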

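LoRA: a parameter-efficient fine-tuning technique that trains only small added adapter matrices instead of retraining the whole model. Like correcting eyesight by swapping glasses, the model body stays untouched and only the adapter is swapped.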
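To give a sense of how small the adapter is, the sketch below counts the trainable parameters of one LoRA adapter against the frozen weight matrix it modifies. The rank of 8 comes from the summary above; the 4096 x 4096 layer shape is an assumed example, not a figure from the paper.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)

def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix W (d_out x d_in)."""
    return d_out * d_in

# Hypothetical 4096 x 4096 projection with rank=8
d = 4096
adapter = lora_param_count(d, d, rank=8)
full = full_param_count(d, d)
ratio = adapter / full
```

For this assumed shape the adapter holds 65,536 parameters against 16,777,216 in the full matrix, which is 1/256 of the layer.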

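BERTScore: an evaluation metric that measures the semantic similarity of two texts using BERT embeddings. Rather than counting exact word matches, it scores how close the meanings are via cosine similarity between token embeddings.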
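The matching step behind the F1 variant can be sketched on toy vectors. This is an illustration only: real BERTScore uses contextual embeddings from a pretrained model and supports options such as IDF weighting that are omitted here, and the function names and 2-D vectors are made up for the example.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_match_f1(candidate: list[list[float]], reference: list[list[float]]) -> float:
    """BERTScore-style F1 over token embeddings using greedy max-cosine matching."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy 2-D "token embeddings"
reference = [[1.0, 0.0], [0.0, 1.0]]
identical = [[1.0, 0.0], [0.0, 1.0]]
partial = [[1.0, 0.0]]  # covers only the first reference token

f1_identical = greedy_match_f1(identical, reference)
f1_partial = greedy_match_f1(partial, reference)
```

The partial candidate keeps perfect precision but only half the recall, so its F1 falls below that of the identical candidate.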

RLHF: Reinforcement Learning from Human Feedback, which trains a model using human preference feedback as the reward signal. It is a key technique for making assistant models such as ChatGPT give answers people prefer.

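constrained decoding: an inference technique that forces token generation to pick only tokens that conform to grammar rules. It guarantees compliance with a JSON schema or grammar, but it can slow generation and suppress natural phrasing.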
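The core of the idea is a mask over next-token scores. The sketch below is a toy illustration with a three-token vocabulary; the function name, the scores, and the grammar state are all hypothetical, and real systems apply the mask to model logits at every decoding step.

```python
import math

def constrained_argmax(logits: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token among those the grammar currently allows."""
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token in vocabulary")
    return best

# Hypothetical next-token scores at a point where the grammar requires a table row
logits = {"|": 0.4, "Sure": 2.5, "-": 1.0}
allowed = {"|"}  # assumed grammar state: a table row must start with a pipe

unconstrained = max(logits, key=logits.get)
constrained = constrained_argmax(logits, allowed)
```

Here the unconstrained choice is "Sure" while the constrained choice is "|", which shows how the mask enforces format at the cost of overriding the model's preferred continuation.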

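structural reward: a reward signal that converts the structural correctness of generated text (heading hierarchy, list nesting, table format, and so on) into a number. In reinforcement learning it acts as the scoreboard for whether the format was followed.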

Related Resources

https://github.com/FudanCVL/FMBench

Original Abstract

Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.