N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?
TL;DR Highlight
This benchmark measures whether the latest LLMs can find real, publicly disclosed security vulnerabilities (N-days) directly in code. GPT-5.4 ranks first, but the community questions the reliability of the evaluation method.
Who Should Read
Security engineers or developers who want to utilize LLMs for vulnerability detection or code auditing, or researchers who need to comparatively evaluate the cybersecurity capabilities of AI models.
Core Mechanics
- N-Day-Bench is a benchmark for the vulnerability detection ability of LLMs. It tests whether each model can find N-day vulnerabilities (vulnerabilities that have already been publicly disclosed, as opposed to zero-days) directly in real code.
- The evaluation uses three agents: a Curator builds the answer key by reading security advisories, a Finder (the model being evaluated) explores the code with a budget of 24 shell commands and then writes a vulnerability report, and a Judge (a separate judging model) scores the report.
- The Finder cannot see the patched code. It is given only a 'sink hint' (information about the final point that the vulnerable data flow reaches) and must trace backward through the actual code to find the root cause of the vulnerability.
- The latest benchmark results (as of April 2026): 1,000 advisories were scanned and 47 were accepted as test cases. GPT-5.4 ranked first with an average score of 83.93, followed by GLM-5.1 (80.13), Claude Opus 4.6 (79.95), Kimi K2 (77.18), and Gemini 3.1 Pro Preview (68.50).
- The scoring criteria consist of five dimensions: target alignment (30%), source-to-sink reasoning (30%), impact and vulnerability (20%), evidence quality (10%), and exaggeration control (10%), and the Judge LLM generates the entire score object at once.
- The benchmark adopts an adaptive design that updates test cases monthly and updates models to the latest versions, and all evaluation traces (execution logs) are publicly available.
- The Judge LLM generates the entire score directly instead of the score being calculated by formula afterward. The operators describe this as an 'intentional trade-off,' but the community strongly doubts the basis for this decision.
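For comparison, the formula-based alternative that commenters asked about can be sketched in a few lines. This is not the benchmark's code: the harness is not public, so the dimension keys, the 0-100 scale per dimension, and the function name `weighted_score` are assumptions made for illustration. Only the five weights come from the rubric above.

```python
# Weights in percent, taken from the rubric described above.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS_PERCENT = {
    "target_alignment": 30,
    "source_to_sink_reasoning": 30,
    "impact_and_vulnerability": 20,
    "evidence_quality": 10,
    "exaggeration_control": 10,
}


def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed to be 0-100) into one total."""
    missing = WEIGHTS_PERCENT.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS_PERCENT:
        value = dimension_scores[name]
        if not 0 <= value <= 100:
            raise ValueError(f"{name} out of range: {value}")
    total = sum(WEIGHTS_PERCENT[name] * dimension_scores[name] for name in WEIGHTS_PERCENT)
    return total / 100


example = {
    "target_alignment": 90,
    "source_to_sink_reasoning": 80,
    "impact_and_vulnerability": 70,
    "evidence_quality": 100,
    "exaggeration_control": 60,
}
print(weighted_score(example))  # 81.0
```

With this approach the Judge would still assign the five per-dimension scores, but the total becomes reproducible and can be checked against the published traces.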
Evidence
- Serious concerns have been raised about the reliability of the benchmark results. One commenter analyzed a specific case and found that GPT-5.4 gave up after 9 exploration steps because it could not find the specified file, and failed the case. Claude Opus 4.6 also failed to find the file, exhausted all 24 tool calls, and then submitted a report describing a similar vulnerability hallucinated from its training data. The Judge rated that report 'excellent'.
- The same commenter argued that the Judge model should review the entire bug-finding process, not just the final summary, so that hallucinated reports do not pass.
- The scoring method was also criticized. The stated reason for having the Judge LLM generate the score in one pass, instead of calculating it by formula afterward, is that 'post-calculation is vulnerable'. One commenter said they do not understand why post-calculation would be vulnerable at all, and criticized the explanation as a typical way for LLMs to rationalize laziness.
- Real-world experience was also shared. One commenter said that in January 2025 they used Gemini for black-box and white-box testing of their legacy system, found hidden SQL injection vulnerabilities, and even extracted password hashes. They rated its level as 'about a mid-level cybersecurity expert'.
- Commenters pointed out that the benchmark harness (the test execution environment) is not open source. They argued that publishing trace logs alone is not enough and that the full harness code should be released to establish trust, and that cases without vulnerabilities should be included to measure the false positive rate. They also asked whether the model can access the internet and how results are filtered once vulnerabilities have been disclosed.
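One way to catch the kind of hallucinated report described above is a grounding check that runs before the Judge: flag any file path cited in the report that never appears in the Finder's commands or their output. The sketch below is illustrative only. The trace format (a list of steps with `command` and `output` strings), the path regex, and the function name `ungrounded_paths` are assumptions, not the format of the published traces.

```python
import re

# Path-like tokens with at least one directory separator and a file extension,
# for example "src/auth/login.py". Deliberately simple; it will miss bare filenames.
PATH_PATTERN = re.compile(r"(?:[\w.-]+/)+[\w.-]+\.\w+")


def ungrounded_paths(report: str, trace: list[dict[str, str]]) -> list[str]:
    """Return file paths cited in the report that never appear in the trace.

    Each trace step is assumed to be a dict with "command" and "output" keys.
    """
    seen = "\n".join(step["command"] + "\n" + step["output"] for step in trace)
    cited = sorted(set(PATH_PATTERN.findall(report)))
    # Plain substring matching: crude, but enough to flag files never touched.
    return [path for path in cited if path not in seen]


trace = [
    {"command": "ls src/db", "output": "query.py\nmodels.py"},
    {"command": "cat src/db/query.py", "output": "def run(sql): ..."},
]
report = (
    "User input reaches cursor.execute in src/db/query.py "
    "after passing through src/api/handler.py."
)
print(ungrounded_paths(report, trace))  # ['src/api/handler.py']
```

A non-empty result does not prove the report is wrong, but it marks the case for human review or for a Judge that reads the full trace.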
How to Apply
- If you want a low-cost security audit of a legacy codebase, you can try giving sink hints (the points a vulnerability ultimately reaches) to top models such as GPT-5.4 or Claude Opus 4.6 and having them explore the source code. Because a model may produce a hallucinated report when it cannot find the file, a person must review the exploration trace.
- When building a pipeline that scores vulnerability detection results with LLM-as-a-Judge, be aware that scoring only the final report lets hallucinated reports pass. Design the Judge model to review the tool call records of each exploration step.
- When evaluating AI-based vulnerability detection tools internally, include negative cases (cases without vulnerabilities) in the test set so that you can measure the false positive rate and assess reliability under real operating conditions.
- If you use the N-Day-Bench leaderboard as a reference for model selection, keep in mind that the harness code is currently not public and that there are reliability concerns about the Judge design. Review the individual trace logs yourself to confirm that the model actually went through the reasoning process.
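The negative-case recommendation above comes down to a simple calculation once each test case carries a ground-truth label. The following is a minimal sketch for an internal evaluation. The `CaseResult` fields and the sample data are hypothetical and do not reflect N-Day-Bench's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    has_vulnerability: bool  # ground truth for the test case
    flagged: bool            # whether the tool reported a vulnerability


def false_positive_rate(results: list[CaseResult]) -> float:
    """Share of vulnerability-free cases that the tool flagged anyway."""
    negatives = [r for r in results if not r.has_vulnerability]
    if not negatives:
        raise ValueError("no negative cases; false positive rate is undefined")
    return sum(r.flagged for r in negatives) / len(negatives)


results = [
    CaseResult("case-01", has_vulnerability=True, flagged=True),
    CaseResult("case-02", has_vulnerability=True, flagged=False),
    CaseResult("case-03", has_vulnerability=False, flagged=True),
    CaseResult("case-04", has_vulnerability=False, flagged=False),
    CaseResult("case-05", has_vulnerability=False, flagged=False),
    CaseResult("case-06", has_vulnerability=False, flagged=False),
]
print(false_positive_rate(results))  # 0.25
```

A test set made up only of known-vulnerable cases raises the `ValueError` here, which is the point: without negatives, the false positive rate cannot be measured at all.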