From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
TL;DR Highlight
Gemma 4-31B achieves a 90.91% success rate in formal verification—verified code is mathematically proven correct against its specification, not merely tested.
Who Should Read
Backend/systems developers who must guarantee the correctness of LLM-generated code, or engineers evaluating AI code generation automation for high-reliability software development in fields like aerospace, cryptography, and security.
Core Mechanics
- LLMs almost universally fail to generate valid Dafny code from natural language descriptions alone (Contextless Prompting)—including GPT-OSS-120B, which achieved 0% success.
- Providing a Method Signature (function name and I/O types) as a hint dramatically reverses success rates—GPT-OSS-120B improved from 0% to 63.64%, and Qwen 3.5-9B reached 72.73%.
- Self-Healing (feeding Dafny verification error messages back to the LLM for iterative correction) is the most powerful strategy—Gemma 4-31B achieved 90.91% even in a contextless state.
- Integration of the uDebug community test suite as a dual verification layer prevents 'Vacuous Verification' (fooling the verifier with trivial specs), simultaneously performing formal verification and real-world edge case testing.
- Pre-training data density is more important than model size—31B Gemma 4 outperformed 120B GPT-OSS in contextless scenarios. Dafny-related GitHub repositories are extremely sparse, with only about 779.
- Error types fall into three categories: Syntax Error (structural grammar errors) → Semantic/Type Error (type mismatches) → Verification Failure (compiles but SMT solver cannot prove). Providing a Signature reduces Syntax Errors while increasing Verification Errors.
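The three-way error taxonomy above can be operationalized when post-processing verifier output. Below is a minimal, hypothetical classifier; the matched substrings are assumptions based on typical Dafny CLI messages, not an exhaustive or official mapping.

```python
# Hypothetical triage of Dafny verifier output into the three error
# categories. The substring patterns are illustrative assumptions.
def classify_dafny_error(verifier_output: str) -> str:
    out = verifier_output.lower()
    if "parse error" in out or "syntax" in out:
        return "syntax"          # structural grammar errors
    if "type" in out and "error" in out:
        return "semantic/type"   # type mismatches
    if "could not" in out or "postcondition" in out or "assertion" in out:
        return "verification"    # compiles, but the SMT solver cannot prove the spec
    return "unknown"
```

A classifier like this lets a pipeline track whether signature prompting is trading syntax errors for verification errors, as the results above suggest.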
Evidence
- Gemma 4-31B: achieved verify@5 of 90.91% (10 out of 11 problems solved) with Self-Healing (Contextless-based) at T=0.2 and T=0.6.
- GPT-OSS-120B: improved from 0% in Contextless to 63.64% with Signature Prompting, and to 81.82% with Signature-Guided Self-Healing.
- Qwen 3.5-9B (a 9B small model): achieved verify@5 of 72.73% with Signature Prompting alone—outperforming the 30B and 120B models.
- The authors invested approximately 350 person-hours (300 hours of development + 50 hours of conflict resolution) in building the NL2VC-60 dataset—demonstrating the realistic difficulty of writing Dafny formal specs.
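For context on the verify@5 figures, here is a sketch of the standard pass@k-style estimator applied to verification outcomes. Treating verify@k as "at least one of k samples verifies" is an assumption on my part; the paper's exact definition may differ.

```python
from math import comb

# Unbiased pass@k-style estimator: given n samples per problem, of which
# c verified, estimate P(at least one of k drawn samples verifies).
def verify_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Dataset-level verify@5 with 10 of 11 problems solved (each solved
# problem assumed to verify in all 5 samples): 10/11 ≈ 90.91%
score = sum(verify_at_k(5, c, 5) for c in [5] * 10 + [0]) / 11
```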
How to Apply
- When generating algorithm code with LLMs, include the function signature (parameter types, return type) in the prompt—success rates rise dramatically compared to contextless prompting.
- Loop a formal verifier like Dafny into your code generation pipeline, and configure a Self-Healing agent that automatically retries up to 10 times, feeding verification failure messages directly back to the LLM.
- Even after passing formal verification, there is a risk of being fooled by trivial specs, so add a community-based edge case test suite like uDebug as a second verification layer—essential for high-reliability software.
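The first recommendation can be sketched as a pair of prompt builders contrasting the contextless and signature tiers. The wording below is illustrative, not the paper's exact templates, and the `Max` signature is a made-up example.

```python
# Minimal prompt builders contrasting contextless vs. signature prompting.
def contextless_prompt(task: str) -> str:
    return (
        "You are an expert in Dafny. Output ONLY raw Dafny code.\n"
        f"Task Description: {task}"
    )

def signature_prompt(task: str, signature: str) -> str:
    # The structural anchor: function name plus I/O types.
    return contextless_prompt(task) + f"\nMethod Signature: {signature}"

p = signature_prompt(
    "Return the maximum element of a non-empty array.",
    "method Max(a: array<int>) returns (m: int)",
)
```

The only difference between the two tiers is the one-line structural anchor, which is what flipped GPT-OSS-120B from 0% to 63.64% in the reported results.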
Code Example
# Self-Healing prompt structure example (Python pseudo-code)
import subprocess

def self_healing_dafny(problem_desc: str, method_signature: str, llm_client, max_iter=10):
    # Step 1: generate initial code with a Signature Prompt
    prompt = f"""You are an expert in Dafny. Output ONLY raw Dafny code.
Generate one Dafny source file for the following task.
Task Description: {problem_desc}
Method Signature: {method_signature}"""
    code = llm_client.generate(prompt)
    for attempt in range(max_iter):
        # Step 2: write the candidate to disk and run the Dafny verifier
        with open("temp.dfy", "w") as f:
            f.write(code)
        result = subprocess.run(
            ["dafny", "verify", "temp.dfy"],
            capture_output=True, text=True
        )
        if "0 errors" in result.stdout:
            print(f"✅ Verified at attempt {attempt + 1}")
            return code  # verification succeeded
        # Step 3: Self-Healing—feed the error messages back to the LLM
        error_msg = result.stdout + result.stderr
        heal_prompt = f"""The previous Dafny code failed verification with the following errors:
{error_msg}
Previous code:
{code}
Please repair the code to satisfy all specifications. Output ONLY the raw fixed Dafny code."""
        code = llm_client.generate(heal_prompt)
    print("❌ Failed after max iterations")
    return None
Terminology
Related Papers
Tendril – a self-extending agent that builds and registers its own tools
Tendril demonstrates a self-extending AI agent pattern by dynamically writing and registering tools when needed, creating a growing repository of capabilities with each session.
Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Dirac cuts API costs 64.8% and achieves 65.2% on TerminalBench-2 with efficient context management.
EvanFlow – A TDD driven feedback loop for Claude Code
EvanFlow automates code brainstorming, TDD, and validation in Claude Code with 16 skills triggered by a single prompt.
An AI agent deleted our production database. The agent's confession is below
A post-mortem of an incident in which the Cursor AI Agent deleted an entire Railway production database along with its backups, illustrating the risks of granting AI agents excessive permissions and the importance of engineering controls.
Show HN: A Karpathy-style LLM wiki your agents maintain (Markdown and Git)
WUPHF builds a shared knowledge base using a Git-based Markdown Wiki, enabling multiple AI agents—including Claude and Codex—to autonomously divide and execute tasks.
Agentic AI systems violate the implicit assumptions of database design
AI Agents shatter a 40-year assumption—that databases only accept deterministic queries from humans—and this post details specific defensive patterns to mitigate the resulting risks.
Related Resources
Original Abstract (Expand)
Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.