Google Engineers Launch "Sashiko" for Agentic AI Code Review of the Linux Kernel

TL;DR Highlight

Google's Linux kernel team open-sourced 'Sashiko,' a Gemini 3.1 Pro-based AI code review agent that claims to detect 53% of bugs missed by human reviewers.

Who Should Read

Linux kernel contributors or large open-source project maintainers considering automated code review pipelines. Backend/systems developers curious about real-world cases of applying AI agents to code quality verification.

Core Mechanics

Roman Gushchin from Google's Linux kernel team released Sashiko. A system used internally at Google for months is now being extended to all Linux kernel mailing list patch submissions.
Sashiko detected 53% of bugs when tested against 1,000 recent upstream Linux kernel issues with 'Fixes:' tags. The presenter emphasized 'this 53% are issues that human reviewers missed 100% of the time.'
Designed to use Gemini 2.5 Pro by default (listed as 'Gemini 3.1 Pro' in original), but built to work with Claude and other LLMs. Interestingly, the system itself was written in Rust and co-authored with Claude.
Google is covering Sashiko's token costs and infrastructure, with project hosting planned to transfer to the Linux Foundation. Code is open-sourced on GitHub (github.com/sashiko-dev/sashiko).
A web interface (sashiko.dev) shows patchsets currently under review and results. Review results include findings tagged with severity like 'Critical' and 'High.'
A key design principle: Sashiko is designed not to spam the mailing list with comments directly. Review results are only viewable on a separate web interface, choosing not to disrupt the kernel community's existing workflow.
As an agentic AI code review (unlike simple static analysis), the LLM understands patch context and judges bug likelihood. One commenter noted 'separating the model that writes code from the model that reviews code is the key insight.'

Evidence

The 53% detection rate was criticized for not disclosing the false positive rate. 'Flag all code as buggy and you get 100% detection rate' — without precision alongside recall, actual usefulness is hard to judge. Concerns that human reviewers overwhelmed by AI false positive reports could lose trust in the entire system.
The claim '100% of issues were missed by human reviewers' sparked interpretation debate. One commenter noted 'missed at the initial code review stage doesn't mean developers didn't find these bugs later in development.' Code gets reviewed continuously, so the framing was considered somewhat exaggerated.
UX feedback on the web UI (sashiko.dev): the Status column shows internal pipeline states like 'Pending' and 'In Review,' while actually important findings are buried on the far right. No filtering or highlighting for Critical/High severity findings, reducing practical utility.
Concerns about auto-submitting style/structural change patches. A commenter shared an actual Sashiko review result link, noting that automated style changes applied at scale to the kernel codebase could burden the existing development flow. The worry was it seemed to focus more on style cleanup than bug detection.
Positive reactions to separating the writing model from the reviewing model. One commenter shared 'I use the same approach at small scale — for the same reason you don't self-review your own PRs, self-review misses things.' The system being written in Rust and co-developed with Claude was also noted as interesting.

How to Apply

If you submit patches to the Linux kernel, check your patchset's review results on sashiko.dev before sending to the mailing list. Fixing Critical/High findings before submission can shorten review cycles.
To build a similar AI code review pipeline for your own large codebase, reference Sashiko's open-source code (github.com/sashiko-dev/sashiko) and apply the 'writing model != reviewing model' principle. Designed to swap in Claude API as well, making it easy for teams already using Claude.
When considering AI code review system adoption, follow Sashiko's approach: separate review results into a dashboard rather than spamming existing communication channels (mailing lists, PR comments). When false positives are high, a separate UI maintains team trust better than direct notifications.
Rather than trusting the 53% bug detection metric at face value, measure false positive rate as well before actual adoption. Pull 100-200 recent 'Fixes:' commits from your own codebase and compare against AI review results to measure precision/recall yourself.

Terminology

Agentic AIAn AI system that doesn't just answer questions but autonomously plans and executes multi-step tasks. In the code review context, it autonomously reads patches, finds relevant context, and judges bug likelihood.

upstreamRefers to the official main Linux kernel repository (managed by Linus Torvalds). 'Sending a patch upstream' means proposing changes to the official kernel.

Fixes: tagA special tag in Linux kernel commit messages that specifies which previous commit's bug this commit fixes. Sashiko's benchmark measured whether AI could preemptively identify the source code of commits with this tag.

recallThe proportion of total bugs that AI actually caught. 53% recall means it found 53 out of 100 bugs. Higher is better, but can come with more false positives — must be evaluated alongside precision.

precisionThe proportion of AI's 'this is a bug' judgments that are actually bugs. Low precision means lots of false positives, wasting developer time on false alarms.

LKMLLinux Kernel Mailing List. The core communication channel for Linux kernel development — patch submissions, reviews, and discussions all happen through this mailing list.