Google Engineers Launch "Sashiko" for Agentic AI Code Review of the Linux Kernel
TL;DR Highlight
Google's Linux kernel team open-sourced 'Sashiko,' a Gemini 3.1 Pro-based AI code review agent that claims to detect 53% of bugs missed by human reviewers.
Who Should Read
Linux kernel contributors or large open-source project maintainers considering automated code review pipelines. Backend/systems developers curious about real-world cases of applying AI agents to code quality verification.
Core Mechanics
- Roman Gushchin from Google's Linux kernel team released Sashiko. A system used internally at Google for months is now being extended to all Linux kernel mailing list patch submissions.
- Sashiko detected 53% of bugs when tested against 1,000 recent upstream Linux kernel issues with 'Fixes:' tags. The presenter emphasized 'this 53% are issues that human reviewers missed 100% of the time.'
- Designed to use Gemini 3.1 Pro by default, but built to work with Claude and other LLMs. Interestingly, the system itself is written in Rust and was co-authored with Claude.
- Google is covering Sashiko's token costs and infrastructure, with project hosting planned to transfer to the Linux Foundation. Code is open-sourced on GitHub (github.com/sashiko-dev/sashiko).
- A web interface (sashiko.dev) shows patchsets currently under review and results. Review results include findings tagged with severity like 'Critical' and 'High.'
- A key design principle: Sashiko does not post comments directly to the mailing list. Review results are viewable only on a separate web interface, so the kernel community's existing workflow is not disrupted.
- Unlike simple static analysis, Sashiko is an agentic AI reviewer: the LLM reads each patch in context and judges how likely it is to contain a bug. One commenter noted 'separating the model that writes code from the model that reviews code is the key insight.'
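The severity tags mentioned above lend themselves to a simple triage step. The following is a minimal sketch, not Sashiko's actual data model: the `Finding` fields, the `triage` helper, and the 'Medium' and 'Low' levels are assumptions made for illustration (only 'Critical' and 'High' appear in the source).

```python
from dataclasses import dataclass

# Hypothetical severity scale: only 'Critical' and 'High' are named in the
# source; 'Medium' and 'Low' are assumed here for illustration.
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


@dataclass(frozen=True)
class Finding:
    patch: str      # patch subject line (hypothetical field)
    severity: str   # one of the keys of SEVERITY_RANK
    summary: str    # one-line description from the reviewer


def triage(findings, threshold="High"):
    """Return findings at or above `threshold`, most severe first."""
    cutoff = SEVERITY_RANK[threshold]
    kept = [f for f in findings if SEVERITY_RANK[f.severity] <= cutoff]
    return sorted(kept, key=lambda f: SEVERITY_RANK[f.severity])


# Made-up findings, purely for demonstration.
findings = [
    Finding("mm: fix foo", "Low", "comment typo"),
    Finding("net: add bar", "Critical", "use-after-free on error path"),
    Finding("fs: tidy baz", "High", "missing lock around list update"),
]
urgent = triage(findings)
for f in urgent:
    print(f"[{f.severity}] {f.patch}: {f.summary}")
```

Sorting by rank rather than by the label string keeps the order stable if more levels are added later.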
Evidence
- The 53% detection rate was criticized for not disclosing the false positive rate. The objection: 'Flag all code as buggy and you get 100% detection rate.' Without precision alongside recall, actual usefulness is hard to judge. Commenters also worried that human reviewers overwhelmed by AI false positive reports could lose trust in the entire system.
- The claim '100% of issues were missed by human reviewers' sparked debate over its interpretation. One commenter noted 'missed at the initial code review stage doesn't mean developers didn't find these bugs later in development.' Code gets reviewed continuously, so the framing was considered somewhat exaggerated.
- UX feedback on the web UI (sashiko.dev): the Status column shows internal pipeline states like 'Pending' and 'In Review,' while actually important findings are buried on the far right. No filtering or highlighting for Critical/High severity findings, reducing practical utility.
- Commenters raised concerns about auto-submitting patches for style or structural changes. One shared a link to an actual Sashiko review result, noting that automated style changes applied at scale to the kernel codebase could burden the existing development flow. The worry was that the tool seemed to focus more on style cleanup than on bug detection.
- Positive reactions to separating the writing model from the reviewing model. One commenter shared 'I use the same approach at small scale — for the same reason you don't self-review your own PRs, self-review misses things.' The system being written in Rust and co-developed with Claude was also noted as interesting.
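The precision-versus-recall objection above can be made concrete with a little arithmetic. The sketch below uses a made-up corpus (1,000 patches, 50 of them buggy); none of these numbers come from Sashiko's evaluation, and `precision_recall` is a helper written for this example.

```python
def precision_recall(flagged, buggy):
    """Compute (precision, recall) for a set of flagged patches.

    flagged: set of patch ids the reviewer reported as buggy
    buggy:   set of patch ids known to contain a bug
    """
    true_positives = len(flagged & buggy)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(buggy) if buggy else 0.0
    return precision, recall


# Made-up corpus: 1,000 patches, 50 of which are known to be buggy.
all_patches = set(range(1000))
buggy = set(range(50))

# Reviewer A flags every patch: perfect recall, very poor precision.
flag_all = precision_recall(all_patches, buggy)

# Reviewer B flags 40 patches, 25 of which are truly buggy.
selective = precision_recall(set(range(25)) | set(range(500, 515)), buggy)

print("flag everything:", flag_all)
print("selective:", selective)
```

In this toy setup the flag-everything reviewer reaches a recall of 1.0 with a precision of only 0.05, while the selective reviewer catches half the bugs at a precision of 0.625. A detection rate reported on its own cannot distinguish the two.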
How to Apply
- If you submit patches to the Linux kernel, check your patchset's review results on sashiko.dev once it has been picked up from the mailing list. Fixing Critical/High findings before sending the next revision can shorten review cycles.
- To build a similar AI code review pipeline for your own large codebase, reference Sashiko's open-source code (github.com/sashiko-dev/sashiko) and apply the 'writing model != reviewing model' principle. Sashiko is built to work with Claude as well, which makes it a convenient starting point for teams already using Claude.
- When considering AI code review system adoption, follow Sashiko's approach: separate review results into a dashboard rather than spamming existing communication channels (mailing lists, PR comments). When the false positive rate is high, a separate UI maintains team trust better than direct notifications.
- Rather than trusting the 53% bug detection metric at face value, measure false positive rate as well before actual adoption. Pull 100-200 recent 'Fixes:' commits from your own codebase and compare against AI review results to measure precision/recall yourself.
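One way to start the self-measurement suggested in the last item is to build an evaluation set from 'Fixes:' trailers. The following is a minimal sketch under stated assumptions: the regex follows the usual kernel trailer convention (`Fixes: <sha prefix> ("subject")`), and `recall_against_fixes` and its `flagged_shas` input are hypothetical names for however your own reviewer's results are stored. Commit messages could be collected with something like `git log --grep='^Fixes:' -n 200 --format=%B`.

```python
import re

# Kernel convention: Fixes: followed by 12 or more hex chars and ("subject")
FIXES_RE = re.compile(
    r'^Fixes:\s+([0-9a-f]{12,40})\s+\("(.*)"\)[ \t]*$', re.MULTILINE
)


def extract_fixes(commit_message):
    """Return (sha_prefix, subject) pairs for each Fixes: trailer."""
    return FIXES_RE.findall(commit_message)


def recall_against_fixes(fix_messages, flagged_shas):
    """Fraction of later-fixed commits that the reviewer had flagged.

    fix_messages: commit messages of fix commits
    flagged_shas: full SHAs the AI reviewer flagged (hypothetical input)
    """
    culprits = {sha for msg in fix_messages for sha, _ in extract_fixes(msg)}
    if not culprits:
        return 0.0
    caught = {
        c for c in culprits
        if any(full.startswith(c) for full in flagged_shas)
    }
    return len(caught) / len(culprits)


# Made-up commit messages and SHAs, for demonstration only.
sample_a = """net: fix null deref in foo_xmit()

The error path dereferences skb after it is freed.

Fixes: 1a2b3c4d5e6f ("net: add foo driver")
"""
sample_b = """fs: take lock before list update

Fixes: ffffffffffff ("fs: add baz cache")
"""
flagged = {"1a2b3c4d5e6f" + "0" * 28}

print(extract_fixes(sample_a))
print(recall_against_fixes([sample_a, sample_b], flagged))
```

This only measures recall. Precision still requires someone to hand-label the commits the reviewer flagged that were never followed by a fix, since a missing 'Fixes:' tag does not prove the flagged commit was bug-free.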
Related Papers
Show HN: OpenKnowledge – open source AI-first alternative to Obsidian/Notion
A local-first Markdown editor with built-in Git-based sync and Claude/Codex/Cursor integration; an open-source tool that can serve as a second brain (LLM Wiki) for AI agents.
The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems
An architecture that places a mathematically proven enforcement gate outside the agent process so that an AI agent cannot bypass its own safeguards.
RubyLLM: A Ruby framework for all major AI providers
A Ruby framework that unifies major AI providers such as OpenAI, Claude, and Gemini behind a single interface, with Rails integration and agent features, so Ruby developers can add AI functionality quickly.
Qwen-AgentWorld: Language World Models for General Agents
Alibaba's Qwen team released a 'Language World Model' that lets AI agents simulate the outcomes of their actions in advance. The research proposes a new paradigm for agent training and execution-path verification.
SHERLOC: Structured Diagnostic Localization for Code Repair Agents
A training-free framework that improves the performance of code repair agents by generating a diagnostic report covering not just where the bug is but also why and how it should be fixed.
Show HN: peerd – AI agent harness that runs entirely in your browser
An AI agent runtime that works purely as a Chrome/Firefox extension with no backend server; it can drive browser tabs directly and even run a WASM Linux VM, offering both privacy and security.