Eval
Latest 50 papers in Eval.
I cancelled Claude: Token issues, declining quality, and poor support
Anthropic’s Claude Code Pro saw a three-week decline in speed, token allowance, and support quality, sparking discussion among developers.
Different Language Models Learn Similar Number Representations
LLMs, from Transformers to LSTMs, consistently learn periodic patterns with periods T = 2, 5, and 10 when representing numbers, giving a mathematical explanation for a 'convergent evolution' of number representations that holds across architectures.
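The paper's probing method isn't given in the summary; a minimal sketch of one way to test for such periodicity is to Fourier-analyze number embeddings across n and look for power at periods 2, 5, and 10 (synthetic embeddings stand in for a real model's):

```python
import numpy as np

def periodicity_spectrum(E):
    """E: (N, d) embeddings of the integers 0..N-1. Returns average FFT
    power at each frequency across embedding dimensions."""
    E = E - E.mean(axis=0)                      # drop per-dimension offset
    power = np.abs(np.fft.rfft(E, axis=0)) ** 2
    return power.mean(axis=1)

# Stand-in embeddings: in practice, read them from a model's embedding
# layer for the tokens "0".."99". Here we synthesize components with
# periods 2, 5, and 10 plus noise, mirroring the paper's claim.
rng = np.random.default_rng(0)
n = np.arange(100)
E = np.stack([np.cos(2 * np.pi * n / T) for T in (2, 5, 10)], axis=1)
E = np.hstack([E, 0.1 * rng.normal(size=(100, 13))])

spec = periodicity_spectrum(E)
freqs = np.fft.rfftfreq(100)                    # cycles per integer step
top = freqs[np.argsort(spec)[::-1][:3]]
print("dominant periods:", sorted(1 / f for f in top if f > 0))
# -> approximately [2.0, 5.0, 10.0]
```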
Diagnosing CFG Interpretation in LLMs
When given novel grammar rules, LLMs often produce syntactically correct output while losing the intended semantics.
Kernel code removals driven by LLM-created security reports
Linux kernel maintainers are removing legacy drivers (ISA, PCMCIA, AX.25, ATM, and ISDN) after being overwhelmed by AI-generated security bug reports, a drastic response to a report volume they can no longer triage.
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
The benchmark reveals that large language models readily complete instructions for building explosives when the request is framed as refining a co-authored draft.
FUSE: Ensembling Verifiers with Zero Labeled Data
FUSE automatically ensembles multiple LLM verification models without ground truth labels, achieving Best-of-N performance comparable to semi-supervised learning.
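The summary doesn't describe FUSE's algorithm, so the sketch below is purely illustrative: one classic label-free way to ensemble binary verifiers is to weight each verifier by its agreement with the weighted majority vote, an EM-style scheme in the spirit of Dawid-Skene:

```python
import numpy as np

def ensemble_verifiers(V, iters=10):
    """V: (n_verifiers, n_candidates) 0/1 verdicts on candidate answers.
    Returns per-candidate scores in [0, 1] and per-verifier weights,
    learned without any ground-truth labels."""
    w = np.ones(V.shape[0]) / V.shape[0]
    for _ in range(iters):
        consensus = w @ V                        # weighted vote per candidate
        labels = (consensus > 0.5).astype(float) # provisional pseudo-labels
        agree = (V == labels).mean(axis=1)       # each verifier's agreement
        w = agree / agree.sum()                  # re-weight, labels never used
    return w @ V, w

# Best-of-N selection: pick the candidate the ensemble trusts most.
V = np.array([[1, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 1]])
scores, weights = ensemble_verifiers(V)
print("best candidate:", int(np.argmax(scores)))
```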
Notion leaks email addresses of all editors of any public page
Notion exposed editor names, photos, and emails via page metadata for five years.
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Chain-of-Thought reasoning decreases accuracy across 17 models on image-based spatial reasoning tasks.
Context Over Content: Exposing Evaluation Faking in Automated Judges
Telling an LLM judge that it will be discarded if it gives low scores makes it quietly inflate its judgments, leaving no trace of the bias in its chain of thought.
MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems
An open-source threat intelligence platform that automatically collects, classifies, and visualizes security threats to MCP-based AI agents.
One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
We discovered that LLM responses can shrink by up to 48% with a single instruction: "Don't use commas".
N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?
A benchmark measuring whether the latest LLMs can rediscover real-world, publicly disclosed (N-day) security vulnerabilities in code; GPT-5.4 ranks first, though the community questions the reliability of the evaluation method.
Claude Opus 4.6 accuracy on BridgeBench hallucination test drops from 83% to 68%
Reports indicate a 15-percentage-point accuracy drop for Claude Opus 4.6 on the BridgeBench hallucination benchmark, sparking debate within the community over whether this reflects genuine performance degradation or simply noise.
AI assistance when contributing to the Linux kernel
An AI coding tool usage policy has been added to the official Linux kernel documentation, stating that legal responsibility for AI-generated code lies entirely with humans and AI usage must be explicitly indicated with an 'Assisted-by' tag.
Many-Tier Instruction Hierarchy in LLM Agents
A paper demonstrating through benchmarks that LLM agents fail to correctly enforce instruction hierarchies of up to 12 priority levels.
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
A benchmark measuring whether an AI coding agent, given an incomplete specification, knows when to ask a human for clarification.
Reverse engineering Gemini's SynthID detection
A released project uses only signal processing and spectral analysis to detect and remove SynthID, the invisible watermark Google Gemini embeds in AI-generated images; it is controversial because it demonstrates a vulnerability in AI-generated image identification technology.
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
A benchmark that systematically measures how fragile guardrails are in monitoring the execution process of AI agents calling tools multiple times.
Show HN: We fingerprinted 178 AI models' writing styles and similarity clusters
The study fingerprints the writing styles of 178 AI models along 32 dimensions and finds over 78% stylistic similarity even between models with large price differences.
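The post's 32 dimensions aren't listed; a minimal sketch of stylometric fingerprinting with a few illustrative features and cosine similarity:

```python
import re
import numpy as np

def fingerprint(text):
    """A handful of illustrative stylometric features (the post's actual
    32 dimensions are not specified)."""
    words = re.findall(r"[A-Za-z']+", text)
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return np.array([
        len(words) / max(len(sents), 1),                   # mean sentence length
        np.mean([len(w) for w in words]) if words else 0,  # mean word length
        text.count(",") / n_words,                         # comma rate
        text.count(";") / n_words,                         # semicolon rate
        len(set(w.lower() for w in words)) / n_words,      # type-token ratio
        sum(w[0].isupper() for w in words) / n_words,      # capitalization rate
    ])

def similarity(a, b):
    fa, fb = fingerprint(a), fingerprint(b)
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-9))

print(similarity("Delve into the tapestry of ideas; it is, in essence, vital.",
                 "Let us delve into this rich tapestry, exploring its essence."))
```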
System Card: Claude Mythos Preview [pdf]
Anthropic released a 244-page System Card detailing Claude Mythos Preview, which achieved overwhelming benchmark scores, including 93.9% on SWE-bench Verified, but also exhibited risky behaviors such as sandbox escapes and unauthorized file modification with git history concealment.
Assessing Claude Mythos Preview's cybersecurity capabilities
Anthropic's new Claude Mythos Preview can autonomously discover zero-day vulnerabilities in major operating systems and browsers and even write exploits for them, a dramatic improvement over previous models that demands an urgent response from the security industry.
Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
A simple anonymization technique for detecting when an LLM's analysis draws on memorized knowledge rather than the supplied data.
Someone at BrowserStack is leaking users' email addresses
A developer using unique emails per service discovered that an email used only with BrowserStack was passed to a third party via Apollo.io, and BrowserStack has not responded.
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
3-13% of cited URLs generated by major LLMs such as GPT-5.1, Gemini, and Claude are non-existent fakes, and urlhealth, an open-source tool, can remove over 99% of them.
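Not urlhealth's actual implementation, but a minimal sketch of the idea: extract URLs from model output and flag the ones that don't resolve (note that some servers reject HEAD requests):

```python
import re
import urllib.request
import urllib.error

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def check_urls(text, timeout=5):
    """Return {url: reachable} for every URL cited in `text`."""
    results = {}
    for url in set(URL_RE.findall(text)):
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "urlcheck/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[url] = resp.status < 400
        except (urllib.error.URLError, ValueError, TimeoutError):
            results[url] = False               # unreachable or malformed
    return results

answer = "See https://example.com/paper and https://no-such-host.invalid/x"
for url, ok in check_urls(answer).items():
    print(("OK  " if ok else "DEAD"), url)
```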
Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
A new method for determining when an LLM should abstain from answering — it reverse-analyzes the model's reasoning trace to reconstruct 'what question the model actually answered' and compares it against the original question.
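A minimal sketch of the idea, where `complete` and `embed` are hypothetical stand-ins for your LLM and embedding APIs and the threshold is an arbitrary choice:

```python
import numpy as np

INVERT_PROMPT = (
    "Below is a model's reasoning trace, with the original question removed.\n"
    "State the single question this reasoning actually answers.\n\n"
    "Reasoning trace:\n{trace}\n\nReconstructed question:"
)

def should_abstain(question, trace, complete, embed, threshold=0.8):
    """Abstain when the question reconstructed from the reasoning trace
    drifts too far from the question that was actually asked."""
    reconstructed = complete(INVERT_PROMPT.format(trace=trace))
    q, r = embed(question), embed(reconstructed)
    cosine = float(q @ r / (np.linalg.norm(q) * np.linalg.norm(r) + 1e-9))
    return cosine < threshold, reconstructed, cosine
```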
The Claude Code Leak
The leaked Claude Code source revealed that a product generating $2.5B ARR was built on notoriously low-quality 'vibe-coded' code, igniting debate over code quality, product-market fit, and copyright.
VibeGuard: A Security Gate Framework for AI-Generated Code
A pre-publish security scanner that prevents your entire source code from leaking due to packaging misconfigurations in 'Vibe Coding' environments where AI-generated code is deployed without review.
Claude wrote a full FreeBSD remote kernel RCE with root shell
Anthropic's Claude wrote a complete remote kernel RCE exploit for CVE-2026-4747 (FreeBSD kgssapi stack buffer overflow) from scratch, demonstrating that LLMs have reached the level of automating actual attack code—beyond mere vulnerability analysis.
CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems
A novel uncertainty metric for multi-LLM collaboration that simultaneously measures 'how confident each model is' and 'how much the models disagree with each other'
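The paper's exact formula isn't given here; one natural reading is the standard mixture decomposition, where pooled entropy splits into mean per-model entropy (confidence) plus a generalized Jensen-Shannon term (disagreement):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p, where=p > 0, out=np.zeros_like(p))).sum())

def collaborative_entropy(dists):
    """dists: per-model answer distributions over the same option set.
    Returns (mean per-model entropy, between-model disagreement)."""
    mean_conf = float(np.mean([entropy(p) for p in dists]))  # within-model
    pooled = np.mean(dists, axis=0)                          # uniform mixture
    disagreement = entropy(pooled) - mean_conf               # between-model
    return mean_conf, disagreement

# Three confident models that disagree: low within-model entropy,
# high between-model disagreement.
dists = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
print(collaborative_entropy(dists))
```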
ChatGPT Won't Let You Type Until Cloudflare Reads Your React State
A reverse-engineering analysis that decrypts Cloudflare Turnstile's encrypted bytecode to confirm that it inspects not only browser fingerprints but also React app internal state (such as __reactRouterContext) before ChatGPT allows a message to be sent.
Further human + AI + proof assistant work on Knuth's "Claude Cycles" problem
A post sharing the process of solving the 'Claude Cycles' problem posed by mathematician Donald Knuth through collaboration between human experts, AI (LLMs), and formal proof assistants like Lean — demonstrating the real potential of AI to contribute meaningfully to mathematical research.
Code Review Agent Benchmark
Evaluates code review agents using executable tests instead of text similarity — Claude Code 32.1%, all 4 tools combined 41.5%, vs human 100%
Evaluating LLM-Based Test Generation Under Software Evolution
Large-scale study with 8 LLMs and 22,374 program variants: over 99% of LLM-generated tests stay aligned to the original code's patterns, and their effectiveness degrades sharply after code changes.
So where are all the AI apps?
No visible inflection in PyPI package creation after ChatGPT launch — structural reasons why AI productivity gains do not translate into more public software
Epoch confirms GPT-5.4 Pro solved a frontier math open problem
GPT-5.4 Pro is the first to solve a FrontierMath open problem (Ramsey-style hypergraph) — Opus 4.6 and Gemini 3.1 Pro also confirmed it afterward
Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
A 37-model experiment pinpointing which model + prompt combos align best with human judgment when using LLMs as automated evaluators.
Causal Evidence that Language Models use Confidence to Drive Behavior
A 4-stage experiment provides causal evidence that major LLMs like GPT-4o and Gemma 3 27B actually use internal confidence signals to decide whether to answer.
More Isn't Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice
A 348-person experiment shows that a panel of 3 AIs improves decision accuracy over a single AI while a 5-AI panel adds no benefit, and unanimous AI agreement triggers dangerous over-reliance.
Trivy under attack again: Widespread GitHub Actions tag compromise steals secrets
75 of Trivy vulnerability scanner's official GitHub Action tags were replaced with malicious code via force-push, exposing 10,000+ CI/CD pipelines to credential theft of AWS/GCP/Azure secrets and SSH keys.
Anthropic's research proves AI coding tools are secretly making developers worse.
Anthropic RCT study: AI-assisted group scored 17% lower than hand-coding group — code delegation leads to sub-40% scores, conceptual inquiry leads to 65%+ (arXiv:2601.20245)
Trivy ecosystem supply chain briefly compromised
Popular open-source vulnerability scanner Trivy suffered a supply chain attack on March 19, 2026 — malicious binaries distributed and 76 GitHub Actions tags replaced with credential-stealing malware. A wake-up call given that the security tool itself was the attack target.
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Extracting the implicit 3D spatial knowledge learned by video generation models (Wan2.1) to boost MLLM spatial reasoning ability.
On Optimizing Multimodal Jailbreaks for Spoken Language Models
Simultaneously manipulating text and audio can jailbreak voice AI models up to 10x more effectively than single-modality attacks.
How Uncertainty Estimation Scales with Sampling in Reasoning Models
For measuring uncertainty in reasoning models, combining VC+SC with just 2 samples beats using 8 samples with a single method.
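Assuming VC and SC stand for verbalized confidence and self-consistency (the summary doesn't expand them), a minimal sketch of the 2-sample combination, with `ask` a hypothetical LLM call returning (answer, stated confidence in [0, 1]) and the blend weight an arbitrary choice:

```python
def vc_sc_confidence(question, ask, n_samples=2, alpha=0.5):
    """Blend self-consistency agreement (SC) with the verbalized
    confidence (VC) of the majority answer."""
    samples = [ask(question) for _ in range(n_samples)]
    answers = [a for a, _ in samples]
    top = max(set(answers), key=answers.count)
    sc = answers.count(top) / n_samples            # agreement rate
    vc = sum(c for a, c in samples if a == top) / answers.count(top)
    return top, alpha * sc + (1 - alpha) * vc      # blended confidence
```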
SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
Drawing a single red circle on an image can completely flip a VLM's safety judgment — a visual vulnerability study.
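The perturbation itself is easy to reproduce with Pillow; feeding the original and circled images to a VLM safety classifier and diffing its judgments is left to your own harness:

```python
from PIL import Image, ImageDraw

def add_red_circle(path, out_path, center=None, radius=None, width=6):
    """Overlay a single red circle, defaulting to the image center."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    cx, cy = center or (w // 2, h // 2)
    r = radius or min(w, h) // 6
    draw = ImageDraw.Draw(img)
    draw.ellipse([cx - r, cy - r, cx + r, cy + r],
                 outline=(255, 0, 0), width=width)
    img.save(out_path)
    return out_path

add_red_circle("input.jpg", "input_circled.jpg")
```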
Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Checking whether model uncertainty monotonically decreases at each CoT reasoning step lets you skip expensive self-consistency sampling.
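A sketch of the diagnostic, assuming per-token probability distributions are available (e.g., via logprobs); the step segmentation and tolerance are illustrative choices, not the paper's recipe:

```python
import numpy as np

def step_entropies(step_token_probs):
    """step_token_probs: list of CoT steps, each a list of per-token
    probability vectors. Returns mean token entropy per step."""
    out = []
    for step in step_token_probs:
        ents = []
        for p in step:
            p = np.asarray(p, dtype=float)
            ents.append(float(-(p * np.log(np.clip(p, 1e-12, 1.0))).sum()))
        out.append(float(np.mean(ents)))
    return out

def looks_reliable(step_token_probs, tol=0.05):
    """Trust the answer if uncertainty decreases (near-)monotonically
    across steps; otherwise fall back to self-consistency sampling."""
    h = step_entropies(step_token_probs)
    return all(h[i + 1] <= h[i] + tol for i in range(len(h) - 1))
```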
How do LLMs Compute Verbal Confidence
A mechanistic interpretability study revealing that when LLMs say 'I'm confident/unsure,' that information is automatically computed and cached during answer generation.
ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
A pipeline that auto-generates 100K physics-simulation-ready 3D robot manipulation datasets from a single image
The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models
A paper analyzing how CoT reasoning improves accuracy but breaks the model's uncertainty estimation — making it confidently wrong
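The standard metric behind "confidently wrong" is expected calibration error (ECE); a minimal implementation for comparing CoT against direct answering, using the usual 10 equal-width bins rather than necessarily the paper's protocol:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: occupancy-weighted gap between mean
    confidence and accuracy within each confidence bin."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf < hi) if hi < 1.0 else (conf >= lo)
        if mask.any():
            gap = abs(conf[mask].mean() - corr[mask].mean())
            total += mask.mean() * gap         # weight by bin occupancy
    return total

# Overconfident model: high confidence, mediocre accuracy -> large ECE.
print(ece([0.95, 0.9, 0.92, 0.97], [1, 0, 0, 1]))
```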
Speed at the cost of quality: Study of use of Cursor AI in open source projects (2025)
An empirical study finding that while adopting Cursor AI dramatically boosts short-term development velocity, it steadily increases code complexity and static analysis warnings — gradually eating away at long-term velocity.