N-Day-Bench – Can LLMs find real vulnerabilities in real codebases? | AI Paper Digest

TL;DR Highlight

This benchmark measures whether current LLMs can find real, publicly disclosed security vulnerabilities (N-Days) in actual code. GPT-5.4 ranks first, but the community questions the reliability of the evaluation method.

Who Should Read

Security engineers and developers who want to use LLMs for vulnerability detection or code auditing, and researchers who need to compare the cybersecurity capabilities of AI models.

Core Mechanics

N-Day-Bench is a benchmark that measures the vulnerability detection ability of LLMs. It tests whether each model can find publicly disclosed security vulnerabilities (N-Days, vulnerabilities that are already publicly known) in real code.
The evaluation uses three agents: a Curator builds the answer key by reading security advisories; a Finder (the model under evaluation) gets up to 24 shell-command executions to explore the code and then writes a vulnerability report; and a Judge (a separate judging model) scores the report.
The Finder cannot see the patched code. Starting only from 'sink hints' (information about the final point that the vulnerable flow reaches), it must work backward through the actual code to find the root cause of the vulnerability.
The latest benchmark results (as of April 2026): 1,000 advisories were scanned and 47 cases were accepted. GPT-5.4 ranked first with an average score of 83.93, followed by GLM-5.1 (80.13), Claude Opus 4.6 (79.95), Kimi K2 (77.18), and Gemini 3.1 Pro Preview (68.50).
The scoring criteria consist of five dimensions: target alignment (30%), source-to-sink reasoning (30%), impact and vulnerability (20%), evidence quality (10%), and exaggeration control (10%), and the Judge LLM generates the entire score object at once.
The benchmark adopts an adaptive design that updates test cases monthly and updates models to the latest versions, and all evaluation traces (execution logs) are publicly available.
The Judge LLM generates the entire score directly, rather than having the total computed by formula afterward. The operators describe this as an 'intentional trade-off,' but the community strongly doubts the reasoning behind it.
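
To make the scoring debate concrete, here is a minimal sketch of what computing the total by formula after the fact could look like. The weights come from the rubric described above; the dimension key names, the assumed 0-100 scale per dimension, and the `check_reported_total` helper are illustrative assumptions, not part of the benchmark's unpublished harness.

```python
# Weights taken from the five-dimension rubric; key names are hypothetical.
WEIGHTS = {
    "target_alignment": 0.30,
    "source_to_sink_reasoning": 0.30,
    "impact_and_vulnerability": 0.20,
    "evidence_quality": 0.10,
    "exaggeration_control": 0.10,
}


def weighted_total(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed to be 0-100) into one total."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def check_reported_total(
    dimension_scores: dict[str, float],
    reported_total: float,
    tolerance: float = 0.5,
) -> bool:
    """Return True when a Judge's self-reported total matches the formula."""
    return abs(weighted_total(dimension_scores) - reported_total) <= tolerance


example_scores = {
    "target_alignment": 90,
    "source_to_sink_reasoning": 80,
    "impact_and_vulnerability": 70,
    "evidence_quality": 60,
    "exaggeration_control": 100,
}
example_total = weighted_total(example_scores)
```

A pipeline could ask the Judge only for per-dimension scores and derive the total this way, or keep the Judge's own total and flag any case where the two disagree.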

Evidence

Serious concerns have been raised about the reliability of the benchmark results. One commenter analyzed a specific case and found that GPT-5.4 could not find the specified file and gave up after 9 exploration steps, which counted as a failure. Claude Opus 4.6 also failed to find the file, exhausted all 24 tool calls, and then submitted a report that hallucinated a similar vulnerability from its training data. That report was rated 'excellent'.

The commenter argued that the Judge model should review the entire bug-finding process, not just the final summary, to keep such hallucinated reports from passing.

The scoring method also drew criticism. The explanation given for having the Judge LLM generate the score in one pass, instead of computing it by formula afterward, is that 'post-calculation is vulnerable'. One commenter said they could not see why post-calculation would be vulnerable at all and criticized this as a typical way for LLMs to rationalize laziness.

Real-world experience was shared as well. One commenter said that in January 2025 they used Gemini for black-box and white-box testing of their legacy system, found hidden SQL injection vulnerabilities, and even extracted password hashes. They rated its skill as 'about a mid-level cybersecurity expert'.

Commenters also pointed out that the benchmark harness (the test execution environment) is not open source. They argued that publishing trace logs alone is not enough and that the full harness code should be released to establish trust, and that cases without vulnerabilities should be included to measure the false positive rate. Questions were also raised about whether the model can access the internet and how results are filtered after vulnerabilities are disclosed.

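The suggestion to judge the whole trace rather than only the final report can be approximated with a cheap pre-check, sketched below. It flags any file path cited in a report that never appears in a successful tool call. The `ToolCall` record and its fields are a hypothetical schema, since the benchmark's trace format is not described here, and substring matching is a crude heuristic meant to route reports to human review, not to prove hallucination.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One exploration step (hypothetical trace schema)."""
    command: str    # shell command the Finder executed
    output: str     # captured stdout/stderr
    exit_code: int  # 0 means the command succeeded


def ungrounded_paths(cited_paths: list[str], trace: list[ToolCall]) -> list[str]:
    """Return cited paths that never appear in any successful tool call."""
    # Failed commands are skipped so that `cat missing_file` does not count
    # as evidence that the Finder actually read the file.
    successful_text = "\n".join(
        f"{call.command}\n{call.output}" for call in trace if call.exit_code == 0
    )
    return [path for path in cited_paths if path not in successful_text]


example_trace = [
    ToolCall("cat src/db/query.py", "cat: src/db/query.py: No such file or directory", 1),
    ToolCall("ls src", "app.py\nutils.py", 0),
    ToolCall("cat src/app.py", "import os", 0),
]
flagged = ungrounded_paths(["src/db/query.py", "src/app.py"], example_trace)
```

In this made-up trace, `flagged` contains only `src/db/query.py`, because the single command that mentions it failed. A report built on that file would deserve a closer look before any score is accepted.
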
How to Apply

For a low-cost security audit of a legacy codebase, you can try giving sink hints (the points a vulnerability reaches) to top models such as GPT-5.4 or Claude Opus 4.6 and having them explore the source code. Because a model may produce a hallucinated report when it cannot find the file, a person must review the exploration trace.

When building a pipeline that automatically scores vulnerability detection results with LLM-as-a-Judge, note that scoring only the final report can let hallucinated reports pass. Design the Judge model to review the tool call records of each exploration step.

When evaluating AI-based vulnerability detection tools internally, include negative cases (cases without vulnerabilities) in the test set so that you can measure the false positive rate and assess reliability under real operating conditions.

If you use the N-Day-Bench leaderboard as a reference for model selection, keep in mind that the harness code is not currently public and that there are reliability concerns about the Judge design. Review the individual trace logs yourself to confirm that the model actually went through the reasoning process.

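The advice about negative cases can be sketched as a small metric calculation. This is a minimal illustration assuming each test case carries a ground-truth label and a yes/no verdict from the tool; the `CaseResult` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one test case (hypothetical schema)."""
    has_vulnerability: bool       # ground truth for the case
    reported_vulnerability: bool  # whether the tool flagged a vulnerability


def false_positive_rate(results: list[CaseResult]) -> float:
    """Share of vulnerability-free cases that the tool flagged anyway."""
    negatives = [r for r in results if not r.has_vulnerability]
    if not negatives:
        # Without negative cases the rate is undefined, which is the gap
        # commenters pointed out in a positives-only test set.
        raise ValueError("test set contains no negative cases")
    false_positives = sum(r.reported_vulnerability for r in negatives)
    return false_positives / len(negatives)


example_results = [
    CaseResult(has_vulnerability=True, reported_vulnerability=True),
    CaseResult(has_vulnerability=False, reported_vulnerability=True),
    CaseResult(has_vulnerability=False, reported_vulnerability=False),
    CaseResult(has_vulnerability=False, reported_vulnerability=False),
    CaseResult(has_vulnerability=False, reported_vulnerability=False),
]
example_fpr = false_positive_rate(example_results)
```

Positive cases do not enter this calculation at all, so a test set drawn only from real advisories cannot yield a false positive rate, however large it is.
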
Terminology

N-Day: A security vulnerability that is already publicly known. N-Days persist in systems that have not yet been patched, in contrast to Zero-Days (vulnerabilities that no one knows about yet).

sink hint: A hint about the 'final reach point' in the vulnerable code flow. For example, a function that executes an SQL query is a sink, and the model must trace the input path backward from this point to find the cause of the vulnerability.

LLM-as-a-Judge: A method in which another LLM, instead of a human, evaluates the output of an LLM. It is cost-effective and fast, but the Judge LLM's own bias or hallucination can affect the quality of the evaluation.

harness: The execution environment code in a benchmark that runs the model, supplies test cases, and collects results. Reproducibility and trustworthiness are difficult to verify if it is not disclosed.

false positive: A false alarm in which a vulnerability is reported where none exists. It is an important metric in security tool evaluation, because a high false positive rate makes a tool difficult to use in actual operations.

SQL injection: A vulnerability in which user input is inserted directly into an SQL query. Attackers can inject malicious SQL syntax to query or modify the database arbitrarily.
