Leanstral: Open-source agent for trustworthy coding and formal proof engineering
TL;DR Highlight
Mistral released Leanstral, an Apache 2.0-licensed AI agent for formal mathematical proofs in Lean 4, achieving performance comparable to or better than Claude Sonnet at roughly 1/15th the cost.
Who Should Read
Developers and researchers applying AI code generation to mission-critical software or mathematical research where accuracy verification is a bottleneck. Researchers in math/CS interested in Lean 4 or formal verification.
Core Mechanics
- Human review of AI-generated code is the biggest bottleneck on engineering speed. Leanstral addresses this by producing machine-checked proofs of mathematically rigorous specifications alongside the generated code.
- Lean 4 is a proof assistant that can express both complex mathematical structures (like perfectoid spaces) and software specifications (like properties of Rust code). Leanstral is the first open-source coding agent designed specifically for Lean 4.
- MoE (Mixture of Experts) architecture with 120B total parameters but only 6B activated during inference for high efficiency. Weights are fully released under Apache 2.0 license.
- Published FLTEval, a new benchmark built from real PRs in the Fermat's Last Theorem formalization project: completing formal proofs and correctly defining new mathematical concepts, moving beyond competition-style problem solving.
- Outstanding efficiency vs open-source models. GLM5-744B-A40B and Kimi-K2.5-1T-32B score only 16.6 and 20.1 on FLTEval; Leanstral surpasses both with just pass@1. Strongest OSS competitor Qwen3.5-397B-A17B achieves 25.4 at pass@4, while Leanstral hits 26.3 at pass@2.
- Dominant cost efficiency vs Claude family. Leanstral pass@2 ($36) scores 2.6 points higher than Claude Sonnet ($549), and pass@16 ($290) scores 8 points higher. Claude Opus 4.6 leads at 39.6 but costs $1,650 — 92x Leanstral's cost.
- Trained for maximum performance with lean-lsp-mcp (the Lean language server wrapped as an MCP server), and can attach arbitrary MCP servers using Mistral Vibe as the scaffold.
- Real-world case: Leanstral solved a compilation issue in Lean 4.29.0-rc6 by self-diagnosing a definitional equality problem where a `def` type alias blocked `rw` tactic pattern matching, and proposed changing `def` to `abbrev`.
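The `def`-vs-`abbrev` failure mode described in the last point can be reproduced in a few lines. This is a hypothetical minimal sketch of the general Lean 4 behavior, not the actual code from the Lean 4.29.0-rc6 incident:

```lean
-- A `def` alias is semi-opaque: by default, instance search and `rw`'s
-- syntactic pattern matching will not unfold it.
def N : Type := Nat

-- This fails to elaborate: `OfNat N 0` and `Add N` instances are not
-- found, because instance search does not see through the `def`.
-- example (x : N) : x + 0 = x := Nat.add_zero x

-- An `abbrev` is marked reducible, so everything known about `Nat`
-- applies transparently and `Nat` lemmas rewrite as expected.
abbrev M : Type := Nat

example (x : M) : x + 0 = x := Nat.add_zero x
```

Changing the alias from `def` to `abbrev`, as Leanstral reportedly proposed, makes the alias transparent to elaboration without touching any downstream proofs.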
Evidence
- Formal verification was praised as a structural solution for AI coding: verification suites accumulate as executable documentation of how code should behave, consuming zero context tokens when the code is correct, which makes them stronger than markdown specs.
- Counter-argument that formal verification doesn't solve AI code's fundamental problems: a function can be proven to match its spec, but security requirements nobody writes into specs ('don't hardcode database credentials', 'don't leave CORS open', 'add auth to admin routes') are outside formal verification's scope.
- The cost-performance interpretation was questioned: Leanstral is 10x cheaper than Haiku but also lower-performing, and in accuracy-critical tasks 'cheaper but worse' may not be a useful trade. However, Opus's unremarkable showing on this benchmark was read as hopeful, suggesting a scaled-up Leanstral could potentially beat Opus.
- Skepticism about practicality: in real software shops even property-based testing is hard to adopt, let alone formal proofs. Commenters also noted that AI-generated proofs tend to be hard to read and inelegant, and that humans must still verify the proofs are specifying the right thing.
- Criticism of Mistral's strategic direction: focusing on non-mainstream academic areas while falling behind on frontier models. Others countered that companies like Mistral matter for diversity in model alignment, and confirmed the weights are genuinely Apache 2.0 open source.
How to Apply
- If repetitive proof writing is a bottleneck in Lean 4 math libraries or formal verification projects, connect Leanstral to lean-lsp-mcp in agent mode; at pass@2-4 it delivers comparable or better proof-completion rates at roughly 1/15th of Claude Sonnet's cost.
- When a Lean version upgrade causes sudden compilation failures, have Leanstral first write test code reproducing the failing environment, then diagnose the cause; this is useful for tracking breaking changes without deep domain expertise.
- For mission-critical software needing accuracy guarantees with AI code generation, write core business logic specs in Lean 4 and build a pipeline where Leanstral proves the implementation satisfies the spec, reducing the human-review bottleneck. Note: security requirements are only verifiable if they are explicitly written into the specs.
- Currently accessible via Mistral Vibe and free API endpoints, so you can experiment with Lean 4 at zero cost. The community also suggested testing whether ensemble passes across models (e.g., Leanstral → Qwen → Leanstral) outperform repeated same-model passes.
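The spec-and-prove pipeline suggested above boils down to stating the desired property as a Lean theorem and asking the agent to close it. A toy sketch, where the round-trip property is illustrative and not taken from the article:

```lean
-- A stand-in for "core business logic": a serialize/deserialize pair.
def encode (l : List Nat) : List Nat := l.reverse
def decode (l : List Nat) : List Nat := l.reverse

-- The spec, stated as a theorem. In the proposed pipeline, producing
-- the proof below is exactly the job delegated to the agent; the human
-- only has to review that the statement says the right thing.
theorem decode_encode (l : List Nat) : decode (encode l) = l := by
  simp [encode, decode]
```

The point made in Evidence still applies: the theorem statement is the trust boundary, so a human must confirm it specifies the right property.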
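The community's ensemble idea can be sketched as rotating models across attempts instead of repeating one model. Everything here is a hypothetical stand-in: `try_leanstral`/`try_qwen` model the yes/no outcome of a real proof attempt (a proof compiles or it doesn't), not any actual API:

```python
# Sketch of ensemble pass@k: rotate different models across attempts
# rather than sampling the same model k times. Each "model" is a
# callable goal -> bool, standing in for "the attempted proof compiled".
from typing import Callable, Iterable


def ensemble_pass(goal: str, rotation: Iterable[Callable[[str], bool]]) -> bool:
    """Return True if any model in the rotation closes the goal."""
    return any(model(goal) for model in rotation)


# Toy stand-ins with fixed, obviously fake success criteria:
try_leanstral = lambda goal: goal.endswith("rfl")
try_qwen = lambda goal: "simp" in goal

# pass@4 as Leanstral -> Qwen -> Leanstral -> Qwen, per the suggestion.
rotation = [try_leanstral, try_qwen, try_leanstral, try_qwen]
print(ensemble_pass("by simp", rotation))  # prints True
```

Whether such cross-model rotation actually beats repeated same-model passes at equal cost is exactly the open question the community raised; this sketch only fixes the shape of the experiment.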
Terminology
Related Papers
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
An open-source plugin for Claude Code in which up to 7 parallel sub-agents each review a PR from a different perspective and even apply automatic fixes. It claims to catch more real bugs than the built-in /review or CodeRabbit, but the community raised skepticism about its complexity and practical value.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
An experiment that had Claude Code parse raw IP packets and construct ICMP echo replies so that it actually answers pings; a fun case that pushes the idea that 'Markdown is the code and the LLM is the processor' all the way down to the network-stack level.
Show HN: Git for AI Agents
A version-control tool that automatically tracks every tool call made by AI coding agents (such as Claude Code) and supports blame down to which prompt wrote which line of code.
Principles for agent-native CLIs
A post laying out principles for designing CLI tools that AI agents can use well; as agents invoke CLIs as tools more and more often, this design approach is becoming practically important.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
A scaffolding tool that coordinates multiple AI agents collaborating with divided roles, letting you assemble a multi-agent pipeline quickly with zero configuration, much like Vite.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
A tool providing an isolated sandbox where AI agents can touch real production data and still be rolled back, unifying GitHub/S3/Google Drive into a single version-controlled filesystem.