Diagnosing CFG Interpretation in LLMs
TL;DR Highlight
When exposed to novel grammar rules, LLMs frequently produce syntactically correct output that nonetheless loses the intended semantics.
Who Should Read
Backend/ML engineers curious about how reliably LLMs can handle structured outputs like JSON schemas, function signatures, and DSLs. Specifically, developers concerned with the stability of tool-calling or code generation pipelines.
Core Mechanics
- LLM performance degrades hierarchically: Syntax → Behavior → Semantics. While LLMs can somewhat follow grammar rules, they frequently fail to maintain logical consistency.
- Performance plummets non-linearly with increasing nesting depth. With Mimo-V2-flash, semantic accuracy (SCR) dropped from 39% at depth=2 to 0.5% at depth=20.
- Model size doesn't guarantee better performance. GPT-5-mini outperformed the larger GPT-5.2 (90% vs. 60% BER) on Goal-Conditioned Generation Tasks.
- Using 'Alien' lexicons (random tokens like v_xkqm instead of meaningful keywords like loop) drastically reduces performance, indicating LLMs rely on keyword semantics rather than pure grammar inference.
- Chain-of-Thought (CoT) reasoning is essential but not a panacea. Without CoT, SVR falls below 10% even at depth=2, but even with CoT, SCR collapses to 0.5% at depth=20.
- Increasing few-shot examples doesn't help with deep recursion and can even be detrimental. 1-shot performance is often worse than 0-shot, and 5-shot frequently fails to surpass 0-shot.
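The recursion-depth failure mode above can be illustrated in miniature: wrap a single action in a controlled number of loop levels and measure the resulting brace nesting. This is our own hypothetical sketch; the keyword surface forms ("repeat", "exec", "stop") follow the prompt example later in this post, not necessarily the paper's exact lexicon.

```python
# Hypothetical sketch: build a ROBOGRID-style program at a chosen nesting
# depth, mirroring the recursion-depth stress test described above.

def nested_program(depth: int) -> str:
    """Wrap one action statement in `depth` levels of loop nesting."""
    body = "exec move forward stop"
    for _ in range(depth):
        body = f"repeat 2 times {{ {body} }}"
    return body

def nesting_depth(program: str) -> int:
    """Maximum brace-nesting depth of a generated program."""
    depth = max_depth = 0
    for ch in program:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth

print(nested_program(1))
print(nesting_depth(nested_program(5)))  # 5
```

Programs generated this way let you probe exactly where a model's SVR/SCR collapses as depth grows.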
Evidence
- DeepSeek-V3.2, the strongest open-source model, achieved only 39.5% Instruction-to-Code SCR with an 'Alien' lexicon and depth=10. Qwen3-8B scored 0% SCR.
- GPT-5-mini exhibited 65% SVR and 9.23% CSCR (the proportion of syntactically and semantically correct outputs) on Task 3, indicating poor semantic alignment.
- Recursion depth ablation: as depth increased from 2 to 20, SVR dropped from 42% to 10% and SCR from 39% to 0.5% (Mimo-V2-flash).
- Adding an else branch to every if statement (raising Else-branch probability to 1.0) reduced SVR from 21.5% to 13.0%. Switching to S-expr style halved SCR from 9% to 4.5%.
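To make the metric relationship concrete: CSCR counts outputs that are both syntactically and semantically correct, so it can sit far below SVR, as in the 65% SVR vs. 9.23% CSCR result above. A hypothetical sketch with made-up per-sample flags:

```python
# Illustrative only: flag values are invented, not the paper's data.
samples = [
    {"syntax_ok": True,  "semantics_ok": True},
    {"syntax_ok": True,  "semantics_ok": False},
    {"syntax_ok": True,  "semantics_ok": False},
    {"syntax_ok": False, "semantics_ok": False},
]

n = len(samples)
svr = sum(s["syntax_ok"] for s in samples) / n
cscr = sum(s["syntax_ok"] and s["semantics_ok"] for s in samples) / n
print(svr, cscr)  # 0.75 0.25
```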
How to Apply
- When designing LLM interactions with dynamically defined JSON schemas or function signatures, minimize nesting depth. Structural error probability increases non-linearly with depth, making several shallow, multi-stage calls preferable to one deeply nested schema.
- Always enable Chain-of-Thought (CoT) when prompting LLM tool-calling agents with custom DSLs or new grammars. Without CoT, SVR drops below 10% even at shallow depths; include 'think step by step' instructions for structured output requests.
- When defining keywords for new grammars, prioritize words the model already knows (loop, if, return). Using entirely new tokens can lead to performance degradation as the model fails to infer keyword semantics (Natural vs Alien: SVR 24.5% vs 21.5%, BER 21.5% vs 17.5%).
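One way to operationalize the first recommendation is to measure a schema's nesting depth before handing it to an LLM and flag schemas likely to hit the non-linear error growth. This is our own hypothetical guard; the depth limit of 5 echoes the guidance above, and the function names are illustrative.

```python
# Hypothetical pre-flight check for LLM tool-calling payloads:
# reject (or split) JSON-like schemas that nest too deeply.

def schema_depth(schema) -> int:
    """Maximum nesting depth of dicts/lists in a JSON-like object."""
    if isinstance(schema, dict):
        return 1 + max((schema_depth(v) for v in schema.values()), default=0)
    if isinstance(schema, list):
        return 1 + max((schema_depth(v) for v in schema), default=0)
    return 0

def too_deep(schema, limit: int = 5) -> bool:
    return schema_depth(schema) > limit

deep = {"a": {"b": {"c": {"d": {"e": {"f": 1}}}}}}
print(schema_depth(deep))  # 6
print(too_deep(deep))      # True
```

A schema that fails the check is a candidate for splitting into multiple shallower tool calls.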
Code Example
# Prompt pattern for making LLMs follow a new grammar in ROBOGRID style
# CoT activation + explicit EBNF provision are key
prompt = """
You are a strict code generator. Think step by step before generating code.
You are given a NEW programming language definition (EBNF):
start: stmt+
stmt: action_stmt | loop | if_stmt
action_stmt: DO action END
loop: LOOP INT TIMES LBR stmt+ RBR
if_stmt: IF cond THEN LBR stmt+ RBR (ELSE LBR stmt+ RBR)?
action: MOVE MOVE_DIR INT? | TURN TURN_DIR | GRAB ITEM
cond: HOLDING ITEM
Terminal mappings:
DO: "exec" END: "stop"
LOOP: "repeat" TIMES: "times"
IF: "when" THEN: "then"
LBR: "{" RBR: "}"
Instructions:
{natural_language_instruction}
First, break down the instruction into a tree structure (AST).
Then, translate each node into the grammar terminals above.
Finally, output the complete valid program.
"""
# Key design principles:
# 1. Explicitly state EBNF at the beginning of the prompt
# 2. Induce CoT ('Think step by step' or step-by-step instructions)
# 3. Limit nesting depth to a maximum of 5 (depth>10 results in almost 0 SCR)
# 4. Keep keywords close to natural language (avoid Alien tokens)
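A generated program can then be checked for surface syntax (the SVR criterion) before execution. The recursive-descent validator below is our own sketch against the grammar in the prompt above, not the paper's evaluator: it treats everything between "exec" and "stop" as an opaque action and assumes the condition surface form spans two tokens.

```python
# Hypothetical SVR-style syntax check for the grammar defined in the prompt.
# Tokens are assumed to be whitespace-separated.

def check_syntax(program: str) -> bool:
    toks = program.split()
    pos = 0

    def stmt_list() -> bool:  # stmt+
        nonlocal pos
        if not stmt():
            return False
        while pos < len(toks) and toks[pos] in ("exec", "repeat", "when"):
            if not stmt():
                return False
        return True

    def block() -> bool:  # LBR stmt+ RBR
        nonlocal pos
        if pos >= len(toks) or toks[pos] != "{":
            return False
        pos += 1
        if not stmt_list():
            return False
        if pos >= len(toks) or toks[pos] != "}":
            return False
        pos += 1
        return True

    def stmt() -> bool:
        nonlocal pos
        if pos >= len(toks):
            return False
        t = toks[pos]
        if t == "exec":  # action_stmt: DO action END
            pos += 1
            while pos < len(toks) and toks[pos] != "stop":
                pos += 1  # skip the opaque action body
            if pos >= len(toks):
                return False
            pos += 1  # consume "stop"
            return True
        if t == "repeat":  # loop: LOOP INT TIMES block
            pos += 1
            if pos >= len(toks) or not toks[pos].isdigit():
                return False
            pos += 1
            if pos >= len(toks) or toks[pos] != "times":
                return False
            pos += 1
            return block()
        if t == "when":  # if_stmt: IF cond THEN block (ELSE block)?
            pos += 1
            pos += 2  # cond assumed to be two tokens (HOLDING ITEM)
            if pos >= len(toks) or toks[pos] != "then":
                return False
            pos += 1
            if not block():
                return False
            if pos < len(toks) and toks[pos] == "else":
                pos += 1
                return block()
            return True
        return False

    return stmt_list() and pos == len(toks)

print(check_syntax("repeat 3 times { exec move forward stop }"))  # True
print(check_syntax("repeat 3 times { }"))                         # False
```

Rejected outputs can be retried or repaired before they reach the execution stage.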
Original Abstract
As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.