Type-Constrained Code Generation with Language Models
TL;DR Highlight
Enforcing TypeScript type system rules token-by-token during LLM code generation, cutting compilation errors by more than half.
Who Should Read
Developers building LLM-based code autocomplete, code translation, or code repair tools. Engineers specifically looking to reduce compile error rates in TypeScript code generation pipelines.
Core Mechanics
- 94% of LLM code generation compile errors are type errors, not syntax errors — existing syntax-based constrained decoding only catches 6%
- Combining prefix automata (ensuring the generated prefix always remains completable into a well-typed program) with type inhabitation search (finding which expressions can produce a required type) enables real-time type-aware token filtering
- Compilation errors reduced by 75.3% on HumanEval and 52.1% on MBPP (syntax-only approaches: 9% and 4.8% respectively)
- Works across code synthesis, translation, and repair tasks with consistent improvements
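The type inhabitation search above can be illustrated with a toy sketch. This is an illustrative simplification over a simply-typed toy language, not the paper's algorithm (which handles TypeScript's full type system); the `inhabitants` function and type encoding are invented for the example.

```python
# Toy type-inhabitation search: enumerate expressions of a target type.
# Types are "int", "bool", or ("->", arg_type, ret_type).

def inhabitants(env, target, depth=2):
    """Yield expression strings of type `target` using bindings in `env`."""
    # Variables whose declared type matches the target directly.
    for name, ty in env.items():
        if ty == target:
            yield name
    if depth == 0:
        return
    # Function application: any f of type a -> target applied to an
    # inhabitant of a also inhabits the target type.
    for name, ty in env.items():
        if isinstance(ty, tuple) and ty[0] == "->" and ty[2] == target:
            for arg in inhabitants(env, ty[1], depth - 1):
                yield f"{name}({arg})"

env = {"x": "int", "f": ("->", "int", "bool")}
print(list(inhabitants(env, "bool")))  # ['f(x)']
```

During decoding, a non-empty result for some reachable type means the current prefix can still be extended into a well-typed program, so generation may continue down that path.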
Evidence
- HumanEval code synthesis: average 75.3% compile error reduction; MBPP: 52.1% reduction (syntax-only: 9% and 4.8%)
- Code repair task pass@1 improved 37% on average — Gemma 2 2B improved 79.4%, CodeLlama 34B improved 56.9%
- Code synthesis pass@1 improved 3.5% on average, translation pass@1 improved across all tested models
How to Apply
- Integrate a completion engine into the token sampling step of your LLM inference loop that sets type-violating token probabilities to zero — no additional LLM calls needed.
- Apply to code translation pipelines (e.g., Python→TypeScript) to block type errors like API signature mismatches and missing arguments at generation time.
- Open-source implementation available for integration with existing LLM serving infrastructure.
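The masking step in the first bullet can be sketched as follows. The shape mirrors a LogitsProcessor-style interface (as in Hugging Face transformers), but `is_valid_prefix` is a hypothetical stand-in for the completion engine, and the toy validity rule is invented for the example:

```python
# Sketch: zero out (set to -inf) logits of tokens that would make the
# generated prefix violate the type rules. `is_valid_prefix` stands in
# for the completion engine.
import math

def mask_type_violations(generated_text, vocab, logits, is_valid_prefix):
    """Return logits with type-violating continuations set to -inf."""
    masked = list(logits)
    for token_id, token_text in enumerate(vocab):
        if not is_valid_prefix(generated_text + token_text):
            masked[token_id] = -math.inf  # excluded from softmax/sampling
    return masked

# Toy validity rule: forbid any prefix containing "!!"
vocab = ["a", "!", "!!"]
logits = [0.0, 0.0, 0.0]
ok = lambda s: "!!" not in s
print(mask_type_violations("x!", vocab, logits, ok))  # [0.0, -inf, -inf]
```

Note that scanning the whole vocabulary per step is expensive; the approach described above instead checks only sampled tokens lazily, masking and resampling on rejection, which avoids additional LLM calls.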
Code Example
# Conceptual pseudocode: type-constrained decoding core loop
# (based on the paper's Algorithm 1)
def constrained_generate(llm, prompt, completion_engine):
    s = ""  # code generated so far
    while True:
        logits = llm(prompt + s)  # next-token probability distribution
        while True:
            token = sample(logits)  # token sampling
            if completion_engine(s + token):  # prefix still completable as well-typed
                break
            elif token == EOS and s in target_language:  # s is already a complete program
                break
            else:
                logits[token] = 0  # mask the violating token and resample
                normalize(logits)
        if token == EOS:
            break
        s = s + token
    return s
# completion_engine core: determines whether the current partial code can be completed as well-typed
# → returns True if the prefix automaton yields a non-empty state set
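The "non-empty state set" criterion can be made concrete with a minimal sketch. The transition table below is a made-up toy (a fragment of `let x = 1`); the actual automaton is derived from TypeScript's grammar and typing rules:

```python
# Minimal prefix automaton sketch: a partial program is accepted iff the
# set of states reachable after consuming it is non-empty.
TRANSITIONS = {
    # (state, token) -> set of next states
    ("start", "let"): {"need_name"},
    ("need_name", "x"): {"need_eq"},
    ("need_eq", "="): {"need_expr"},
    ("need_expr", "1"): {"done"},
}

def reachable(tokens):
    states = {"start"}
    for tok in tokens:
        states = set().union(*(TRANSITIONS.get((s, tok), set()) for s in states))
    return states

def can_complete(tokens):
    # Non-empty state set => the prefix can still be completed.
    return bool(reachable(tokens))

print(can_complete(["let", "x", "="]))  # True: still completable
print(can_complete(["let", "="]))       # False: dead prefix, would be masked
```

Feeding each candidate token through such an automaton before accepting it is what lets the decoder reject dead prefixes without ever calling the LLM again.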
Related Resources
Original Abstract
Large language models (LLMs) have achieved notable success in code generation. However, they still frequently produce uncompilable output because their next-token inference procedure does not model formal aspects of code. Although constrained decoding is a promising approach to alleviate this issue, it has only been applied to handle either domain-specific languages or syntactic features of general-purpose programming languages. However, LLMs frequently generate code with typing errors, which are beyond the domain of syntax and generally hard to adequately constrain. To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. We formalize our approach on a foundational simply-typed language and extend it to TypeScript to demonstrate practicality. Our evaluation on the HumanEval and MBPP datasets shows that our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks across LLMs of various sizes and model families, including state-of-the-art open-weight models with more than 30B parameters. The results demonstrate the generality and effectiveness of our approach in constraining LLM code generation with formal rules of type systems.