Type-Constrained Decoding for Language Model Code Generation

Type-Constrained Code Generation with Language Models

Apr 12, 2025 • Niels Mündler, Jingxuan He, Hao Wang +3

TL;DR Highlight

A technique that enforces type-system rules token by token while an LLM generates TypeScript code, cutting compilation errors by more than half.

Who Should Read

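Developers building LLM-based code completion, code translation, or code repair tools, especially engineers who want to lower the compilation error rate of a TypeScript code generation pipeline.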

Core Mechanics

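94% of the compilation errors in LLM-generated code are type errors rather than syntax errors, so existing syntax-only constrained decoding can address only the remaining 6%.
Prefix automata (automata that keep a partial input in a state from which it can always be completed) are combined with type inhabitation search (a search for which types can be constructed) to block, during generation, any token that would violate the typing rules.
A completion engine is inserted into the token sampling loop and masks type-violating tokens to probability 0; in 99.4% of cases the first sampled token is already valid.
Consistent gains are confirmed across six open-weight models: Gemma 2 (2B/9B/27B), DeepSeek Coder 33B, CodeLlama 34B, and Qwen2.5 32B.
The effect is especially strong for code translation: with Gemma 2 9B, compilation errors on HumanEval translation drop by 78.7%.
Tricky TypeScript type inference issues (an empty array inferred as never[], a missing return statement, and so on) are also corrected automatically.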
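
The type inhabitation search mentioned above can be pictured as a reachability search over types. The following is a minimal sketch under stated assumptions, not the paper's implementation: the `MEMBERS` table, the type names, and the function `reachable_via` are hypothetical and cover only a handful of TypeScript members.

```python
from collections import deque

# Hypothetical, heavily simplified member table:
# type -> {member-access suffix: resulting type}
MEMBERS = {
    "number": {".toString()": "string", ".toFixed()": "string"},
    "string": {".length": "number", ".split('')": "string[]"},
    "string[]": {".length": "number", ".join('')": "string"},
}

def reachable_via(start, target, max_depth=3):
    """Breadth-first search for a shortest chain of member accesses that
    turns an expression of type `start` into one of type `target`.
    Returns the list of suffixes, or None if no chain exists within max_depth."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        ty, path = queue.popleft()
        if ty == target:
            return path
        if len(path) >= max_depth:
            continue
        for suffix, result in MEMBERS.get(ty, {}).items():
            if result not in seen:
                seen.add(result)
                queue.append((result, path + [suffix]))
    return None
```

With this toy table, `reachable_via("number", "string")` finds a one-step chain, which is why a decoder may still allow a `number` variable where a `string` is required: the expression can be extended into the required type. A type with no chain (for example `"boolean"` here) yields `None`, and the corresponding token would be rejected.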

Evidence

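On HumanEval code synthesis, compilation errors fall by 75.3% on average, and on MBPP by 52.1% (syntax-only constraining achieves only 9% and 4.8%, respectively).
On the code repair task, pass@1 improves by 37% on average; Gemma 2 2B improves by 79.4% and CodeLlama 34B by 56.9%.
pass@1 improves by 3.5% on average for code synthesis and 5.0% for translation, with an average runtime overhead of roughly 39 to 52%.
Even strong models are not free of type errors: external studies report a Rust compilation error rate of 27% for Claude 3.5 Sonnet and about 39% for DeepSeek R1.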

How to Apply

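Integrate by inserting a completion engine into the token sampling step of the LLM inference loop and setting the probability of type-violating tokens to 0; no additional LLM calls are required.
Applied to a code translation pipeline (for example Python→TypeScript), it blocks type errors that arise during translation, such as API signature mismatches and missing arguments, at generation time.
The open-source implementation (https://github.com/eth-sri/type-constrained-code-generation) can be connected as a plug-in to existing transformers-based inference code.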
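
As a rough illustration of the integration point, the sketch below masks rejected tokens at the logit level before sampling. It is dependency-free and does not use the API of the linked repository or of `transformers`; `softmax`, `mask_logits`, and the `toy_check` callback are hypothetical names. It also checks every token of a tiny vocabulary up front, whereas the loop described in the paper checks only tokens that are actually sampled.

```python
import math

def softmax(logits):
    """Convert {token: logit} to probabilities; -inf logits get probability 0."""
    finite = [x for x in logits.values() if x != -math.inf]
    if not finite:
        raise ValueError("all tokens are masked")
    m = max(finite)  # subtract the max for numerical stability
    exps = {t: (math.exp(x - m) if x != -math.inf else 0.0)
            for t, x in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def mask_logits(logits, prefix, is_valid_prefix):
    """Set the logit of every token that fails the prefix check to -inf."""
    return {t: (x if is_valid_prefix(prefix + t) else -math.inf)
            for t, x in logits.items()}

# Toy usage: only a numeric literal may follow `let n: number = `.
logits = {"42": 1.0, "'a'": 2.0, "true": 0.5}
toy_check = lambda code: code.endswith("42")  # stand-in for a real completion engine
probs = softmax(mask_logits(logits, "let n: number = ", toy_check))
```

Setting a logit to negative infinity is the logit-space equivalent of setting the token's probability to 0 and renormalizing, so the model's preferred but ill-typed token `'a'` can no longer be sampled.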

Code Example

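# Conceptual sketch of the core type-constrained decoding loop
# (based on Algorithm 1 in the paper)
import random

EOS = "<eos>"  # assumed end-of-sequence token

def constrained_generate(llm, prompt, completion_engine, in_target_language):
    # Assumptions: llm(text) returns a dict {token: probability};
    # in_target_language(code) is True for a complete, well-typed program.
    s = ""  # code generated so far
    while True:
        probs = dict(llm(prompt + s))  # next-token probability distribution
        while True:
            tokens = list(probs)
            # random.choices uses relative weights, so masking a token to 0
            # renormalizes the distribution implicitly
            token = random.choices(tokens, weights=[probs[t] for t in tokens])[0]
            if token == EOS:
                if in_target_language(s):  # stop only on a complete program
                    break
            elif completion_engine(s + token):  # is s + token a well-typed prefix?
                break
            probs[token] = 0.0  # mask the violating token, then resample
        if token == EOS:
            break
        s = s + token
    return s

# Core of completion_engine: decide whether the current partial code can
# still be completed into a well-typed program
# → True when the prefix automaton returns a non-empty state set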

Terminology

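Constrained Decoding: A technique that, as the LLM generates tokens, blocks in real time any token that would break a given rule so that it cannot be selected. In code completion, this means rejecting tokens that would lead to a syntax or type error before they are emitted.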

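Type Inhabitation: The problem of deciding whether an expression of a given type exists. Example: when a string is required and the current expression is a number variable, discovering that appending .toString() produces a string.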

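Prefix Automaton: An automaton that tracks whether the partial string entered so far can still lead to a valid completion. It has no dead states, so a valid completion is guaranteed to exist from every reachable state.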

pass@1: The fraction of cases in which code generated once by the LLM passes the tests. It measures the correctness of the first output, without multiple attempts.

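Type Environment (Γ): The mapping from variable names declared in the current scope to their types, for example {x: number, name: string}. The type checker consults it to decide whether a variable is used correctly.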

Prefix language: The set of all prefixes of the complete strings that belong to a language L. A type checker judges only complete code, whereas constrained decoding must decide membership in the prefix language in real time.
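
To make the terms prefix language and prefix automaton concrete, here is a minimal sketch for a toy language of balanced parentheses rather than TypeScript. The function names `step`, `run`, `is_prefix`, and `is_complete` are hypothetical; the automaton state is simply the current nesting depth.

```python
def step(depth, ch):
    """Transition function: return the next state, or None when the
    character would make a valid completion impossible."""
    if ch == "(":
        return depth + 1
    if ch == ")" and depth > 0:
        return depth - 1
    return None  # rejected, so the automaton never enters a dead state

def run(s):
    """Feed s through the automaton; return the final state or None."""
    depth = 0
    for ch in s:
        depth = step(depth, ch)
        if depth is None:
            return None
    return depth

def is_prefix(s):
    """True if s is in the prefix language (it can still be completed)."""
    return run(s) is not None

def is_complete(s):
    """True if s is in the language L itself (all parentheses closed)."""
    return run(s) == 0
```

Here `"(()"` is a valid prefix but not a complete string, while `"())"` is rejected at its last character because no continuation can repair it. A constrained decoder consults the equivalent of `is_prefix` after every token and the equivalent of `is_complete` only when the model wants to stop.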

Original Abstract

Large language models (LLMs) have achieved notable success in code generation. However, they still frequently produce uncompilable output because their next-token inference procedure does not model formal aspects of code. Although constrained decoding is a promising approach to alleviate this issue, it has only been applied to handle either domain-specific languages or syntactic features of general-purpose programming languages. However, LLMs frequently generate code with typing errors, which are beyond the domain of syntax and generally hard to adequately constrain. To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. We formalize our approach on a foundational simply-typed language and extend it to TypeScript to demonstrate practicality. Our evaluation on the HumanEval and MBPP datasets shows that our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks across LLMs of various sizes and model families, including state-of-the-art open-weight models with more than 30B parameters. The results demonstrate the generality and effectiveness of our approach in constraining LLM code generation with formal rules of type systems.