DS²-INSTRUCT: A Framework for Automatically Generating Domain-Specific LLM Instruction-Tuning Data

DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Mar 13, 2026 • Ruiyao Xu, Noelle I. Samia, Han Liu

TL;DR Highlight

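A framework that takes only a task definition and automatically generates fine-tuning data for specialized domains such as finance, medicine, and mathematics, with no human involvement.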

Who Should Read

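ML engineers who need to fine-tune an LLM for a specific domain such as law, healthcare, or finance, and teams building domain-specific chatbots. Also developers who lack training data or cannot easily find experts to annotate it.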

Core Mechanics

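Given only a task description (task definition), the framework generates an instruction dataset zero-shot, with no seed data or domain documents.
Keywords are expanded in two directions, toward prerequisite concepts and toward advanced concepts, and the keyword pool is augmented by retrieving documents from The Pile with BM25.
Bloom's Taxonomy (a six-level classification of cognitive skills: remembering → understanding → applying → analyzing → evaluating → creating) is applied to automatically generate questions across difficulty levels, from simple recall to creative problems.
Self-consistency filtering (a quality filter that samples several answers to the same question and keeps only items whose answers agree) automatically removes ambiguous or incorrect instruction-response pairs.
Data is generated with Qwen2.5-72B-Instruct, then Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B are fine-tuned with LoRA (a lightweight fine-tuning technique that trains only a small set of parameters); results improve consistently over existing methods across seven domain benchmarks.
The performance curve has not saturated at 6,000 examples, so more data may yield further gains.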
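
The BM25 retrieval step mentioned above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the toy corpus stands in for The Pile, the parameters `k1=1.5` and `b=0.75` are common defaults rather than values from the paper, and how keywords are then extracted from the retrieved documents is not covered here.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+1" variant) keeps the weight non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy stand-in for corpus passages (assumption: whitespace tokenization).
corpus = [
    "discounted cash flow is a common asset valuation method".split(),
    "the recipe calls for two cups of flour".split(),
    "bond duration measures interest rate risk".split(),
]
query = "asset valuation".split()
scores = bm25_scores(query, corpus)
best_doc_index = max(range(len(corpus)), key=scores.__getitem__)
```

Only the first passage shares terms with the query, so it is the one that would be retrieved to enrich the keyword pool.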

Evidence

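Llama-3.1-8B: zero-shot average 14.34% → 42.83% after DS²-INSTRUCT fine-tuning, roughly a 3x improvement, and +16.89 percentage points over the best prior method (InstructMix, 25.94%).
Mistral-7B: zero-shot average 9.70% → 37.83% after DS²-INSTRUCT fine-tuning, +7.54 percentage points over the best prior method (InstructMix, 30.29%).
Qwen2.5-7B: DS²-INSTRUCT averages 54.95%, +1.88 percentage points over the best prior method (InstructMix, 53.07%), ranking first in all seven domains.
Quality of generated data (manual review of 500 samples): on GSM8K, 96.83% valid instructions, 91.26% valid responses, and 94.91% domain relevance.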
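
The gains quoted above follow directly from the reported averages. The short check below recomputes them; the dictionary layout and function names are illustrative, and the numbers are copied from this section only.

```python
# Average scores (%) as reported in the Evidence list.
reported = {
    "Llama-3.1-8B": {"zero_shot": 14.34, "best_baseline": 25.94, "ds2": 42.83},
    "Mistral-7B": {"zero_shot": 9.70, "best_baseline": 30.29, "ds2": 37.83},
    "Qwen2.5-7B": {"best_baseline": 53.07, "ds2": 54.95},
}

def gain_over_baseline(row):
    """Percentage-point gain of DS2-INSTRUCT over the best prior method."""
    return round(row["ds2"] - row["best_baseline"], 2)

def ratio_over_zero_shot(row):
    """Multiplicative improvement over the zero-shot average."""
    return round(row["ds2"] / row["zero_shot"], 2)

gains = {name: gain_over_baseline(row) for name, row in reported.items()}
llama_ratio = ratio_over_zero_shot(reported["Llama-3.1-8B"])
```

The differences are percentage points (absolute), not relative percentages, which is why the Llama-3.1-8B result is described both as about 3x and as +16.89 points.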

How to Apply

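To build a domain-specific LLM, write a single task-description text, run the DS²-INSTRUCT pipeline to generate thousands of instruction-response pairs, then fine-tune with LoRA (r=8, α=16). No annotators are needed.
In the keyword generation stage, proceed in this order: 50 seed keywords → 100 rounds of bidirectional expansion → BM25 retrieval augmentation. This substantially broadens domain coverage. For a small domain, fewer iterations are fine.
Setting the self-consistency filtering threshold to τ=3/5 (at least 3 of 5 sampled answers agree) removes noisy data automatically. In accuracy-critical domains such as medicine and finance, raising the threshold trades data volume for quality: quality goes up, quantity goes down.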
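
The seed → expansion loop described above might look roughly like the sketch below. It is an illustration under assumptions, not the released pipeline: `expand_fn` is a hypothetical callable wrapping the LLM call, `sample_size` is an invented parameter (this summary does not say how many keywords are sampled per round), and `fake_expand` is a stand-in so the example runs without a model.

```python
import random

def expand_keyword_pool(seed_keywords, expand_fn, iterations=100, sample_size=5, rng=None):
    """Grow a keyword pool by repeatedly sampling keywords and expanding them.

    expand_fn(sampled) is a hypothetical wrapper around the LLM that returns
    (prerequisite_concepts, advanced_concepts) as two lists of strings.
    """
    rng = rng or random.Random(0)
    pool = list(dict.fromkeys(seed_keywords))  # de-duplicate, keep order
    seen = set(pool)
    for _ in range(iterations):
        sampled = rng.sample(pool, min(sample_size, len(pool)))
        prerequisites, advanced = expand_fn(sampled)
        for keyword in [*prerequisites, *advanced]:
            # Normalize to the underscore style requested in the prompts.
            normalized = keyword.strip().lower().replace(" ", "_")
            if normalized and normalized not in seen:
                seen.add(normalized)
                pool.append(normalized)
    return pool

def fake_expand(sampled):
    """Stand-in for the LLM: derive one concept in each direction."""
    first = sampled[0]
    return [f"basic {first}"], [f"advanced {first}"]

seeds = ["asset_valuation", "bond_pricing", "asset_valuation"]
pool = expand_keyword_pool(seeds, fake_expand, iterations=10, sample_size=2)
```

Lowering `iterations` is the knob the text refers to for small domains: fewer rounds mean fewer LLM calls and a smaller, tighter pool.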

Code Example


# Example of the core DS²-INSTRUCT prompt patterns

# Step 1: Generate initial keywords
initial_keyword_prompt = """
Task Context: You are an expert in {domain}.
Task Description: {task_description}

Instructions: Generate 50 core keywords that represent the most essential concepts for this task.
Requirements:
- List exactly 50 core concepts separated by commas
- Use underscores for multi-word concepts (e.g., asset_valuation)
- Provide only the comma-separated list without any other text
Core Keywords:
"""

# Step 2: Bidirectional keyword expansion
bidirectional_expansion_prompt = """
Task Context: You are an expert in the domain related to: {task_description}
Sample Keywords: {sampled_keywords}

Instructions: Based on the sample keywords, generate new concepts in two directions:
1. Prerequisite Concepts: fundamental concepts learners must understand BEFORE the sample keywords
2. Advanced Concepts: specialized topics that BUILD UPON the sample keywords

Requirements:
- Generate 5 concepts for each direction
- Use underscores for multi-word concepts
- Provide comma-separated lists
"""

# Step 3: Bloom's Taxonomy-based instruction generation
bloom_levels = {
    "Remembering": "recall of factual knowledge, definitions, basic concepts",
    "Understanding": "conceptual understanding, explanation of relationships",
    "Applying": "practical use of methods, real-world application",
    "Analyzing": "breaking down complex ideas, identifying patterns",
    "Evaluating": "critical judgment, validation, justification of decisions",
    "Creating": "original thinking, synthesis of ideas, novel applications"
}

instruction_gen_prompt = """
Task Description: {task_description}
Keyword: {keyword}
Question Type: {cognitive_level} - {cognitive_level_description}

Generate a high-quality question that precisely targets the keyword and question type.
Directly output the question only.
Generated Question:
"""

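# Step 4: Self-consistency filtering
# Sample N=5 responses and keep the item only if at least τ=3/5 of the answers agree.
# NOTE: `model.generate` and `extract_answer` are not defined in this snippet; they
# stand for an LLM sampling call and a task-specific final-answer parser.
from collections import Counter

def self_consistency_filter(instruction, model, N=5, tau=0.6):
    responses = [model.generate(instruction) for _ in range(N)]
    answers = [extract_answer(r) for r in responses]
    most_common_answer, count = Counter(answers).most_common(1)[0]
    vote_ratio = count / N
    if vote_ratio >= tau:
        return most_common_answer  # keep as high-quality data
    return None  # discard: answers were too inconsistent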
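
To show how steps 3 and 4 fit together without a live model, here is a small self-contained sketch. It makes several assumptions: pairing every keyword with every cognitive level (a full cross product) is one plausible reading of "pairing keywords with cognitive levels", only two levels are included for brevity, and the sampled answers are hard-coded in place of real generations. `build_prompts` and `majority_vote` are illustrative names, not part of any released code.

```python
import itertools
from collections import Counter

BLOOM_LEVELS = {
    "Remembering": "recall of factual knowledge, definitions, basic concepts",
    "Applying": "practical use of methods, real-world application",
}

PROMPT_TEMPLATE = (
    "Task Description: {task_description}\n"
    "Keyword: {keyword}\n"
    "Question Type: {cognitive_level} - {cognitive_level_description}\n"
    "Generated Question:"
)

def build_prompts(task_description, keywords, levels=BLOOM_LEVELS):
    """Return one instruction-generation prompt per (keyword, level) pair."""
    return [
        PROMPT_TEMPLATE.format(
            task_description=task_description,
            keyword=keyword,
            cognitive_level=level,
            cognitive_level_description=description,
        )
        for keyword, (level, description) in itertools.product(keywords, levels.items())
    ]

def majority_vote(answers, tau=0.6):
    """Return the most common answer if its vote share reaches tau, else None."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= tau else None

prompts = build_prompts(
    "Solve grade-school math word problems.", ["unit_rate", "percent_change"]
)

# Hard-coded answers stand in for N=5 sampled model generations.
sampled_answers = ["12", "12", "12", "15", "9"]
kept = majority_vote(sampled_answers, tau=0.6)     # 3 of 5 agree: kept
dropped = majority_vote(sampled_answers, tau=0.8)  # would need 4 of 5: dropped
```

The same five answers pass at τ=0.6 and fail at τ=0.8, which is the quality-versus-volume trade-off described under How to Apply.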

Terminology

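Instruction Tuning: Additional training that shows an LLM a large number of question-answer style examples so that it follows instructions well. It is the process that turns a base language model into something that behaves like a chatbot.

LoRA: A fine-tuning technique that trains only small adapter layers inserted into the model instead of retraining the whole model. It greatly reduces training cost.

Bloom's Taxonomy: A classification from education research that divides cognitive skills into six levels (remembering → understanding → applying → analyzing → evaluating → creating). The paper uses it to automatically generate questions of varying difficulty.

Self-Consistency: A technique that asks the model the same question several times and treats consistent answers as more reliable. It rests on the idea that wrong answers are likely to differ from sample to sample.

BM25: A classic ranking algorithm for document retrieval that computes relevance from keyword frequency and document length. It is simple to implement and practical, so it is still widely used.

Zero-Shot: Performing a task directly, with no task-specific examples or training data. In human terms, solving a problem you have never seen with no hints.

Self-Instruct: A method in which an LLM generates instruction data itself in order to improve itself. It starts from a small number of human-written seed examples, from which the LLM produces many more.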
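
For readers who want the LoRA entry in symbols, the standard low-rank update can be written as below. This is the general LoRA formulation, not something restated in this summary; $d$ and $k$ are generic weight-matrix dimensions, and $r=8$, $\alpha=16$ are the values quoted under How to Apply.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

With $r=8$ and $\alpha=16$ the update is scaled by $\alpha/r = 2$, and each adapted matrix trains $r(d+k)$ parameters instead of $dk$, while $W_0$ stays frozen.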

Related Resources

DS²-INSTRUCT GitHub

Original Abstract

Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.