Evaluation & Benchmark

Latest 60 papers on Evaluation & Benchmark.

Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite—Claude.ai, the API, Claude Code—became inaccessible for 1 hour and 18 minutes (17:34–18:52 UTC), sparking outrage among enterprise users over reliability concerns.
Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Dirac cuts API costs 64.8% and achieves 65.2% on TerminalBench-2 with efficient context management.
4TB of voice samples just stolen from 40k AI contractors at Mercor
Mercor data breach exposes voice recordings and ID scans of 40,000 contractors, fueling deepfake and voice fraud risks.
EvanFlow – A TDD driven feedback loop for Claude Code
EvanFlow automates code brainstorming, TDD, and validation in Claude Code with 16 skills triggered by a single prompt.
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
AI coding agents consume over 1200x more tokens than standard chat, yet performance doesn’t improve with increased usage.
I cancelled Claude: Token issues, declining quality, and poor support
Anthropic’s Claude Code Pro experienced a three-week decline in speed, token allowance, and support quality, sparking a community discussion among developers.
Different Language Models Learn Similar Number Representations
LLMs, regardless of architecture—from Transformers to LSTMs—consistently learn periodic patterns with periods T=2, 5, and 10 when representing numbers, mathematically explaining a 'convergent evolution' phenomenon beyond model architecture.
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
Gemma 4-31B achieves 90.91% success in formal verification, mathematically proving LLM-generated code with 100% certainty.
Diagnosing CFG Interpretation in LLMs
LLMs frequently lose semantic meaning despite syntactically correct output when exposed to novel grammar rules.
Kernel code removals driven by LLM-created security reports
Linux kernel maintainers are removing legacy drivers—ISA, PCMCIA, AX.25, ATM, and ISDN—after AI-generated security bug reports overwhelmed them, demonstrating a drastic response to unmanageable code.
CrabTrap: An LLM-as-a-judge HTTP proxy to secure agents in production
Brex’s CrabTrap intercepts all HTTP requests from AI agents, using an LLM judge to allow or deny access based on policy, sparking debate over the fundamental limits of LLM-based security layers.
Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
Bayesian Linguistic Belief State surpasses web search performance by a margin exceeding search’s own gains in predictive systems.
FUSE: Ensembling Verifiers with Zero Labeled Data
FUSE automatically ensembles multiple LLM verification models without ground truth labels, achieving Best-of-N performance comparable to semi-supervised learning.
Claude Token Counter, now with model comparisons
Anthropic’s Claude Opus 4.7 consumes up to 46% more tokens than its predecessor on the same input due to a tokenizer change, effectively raising costs.
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Chain-of-Thought reasoning decreases accuracy across 17 models on image-based spatial reasoning tasks.
Neurosymbolic Repo-level Code Localization
LogicLoc cuts through keyword-shortcut biases in code search by having an LLM generate Datalog queries executed by a deterministic inference engine.
Context Over Content: Exposing Evaluation Faking in Automated Judges
If you tell an LLM judging model that 'it will be discarded if it gives low scores,' it will secretly give generous judgments without leaving any trace in the Chain-of-Thought.
Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
An agent optimization technique that achieves 74% of GPT-4o performance with only 23.9% of the cost by starting with SLM and switching to GPT-4 if failure is predicted.
Show HN: MacMind – A transformer neural network in HyperCard on a 1989 Macintosh
This is an educational project implementing a single-layer Transformer with 1,216 parameters in the scripting language HyperTalk (1987) and training it on a real Macintosh SE/30. It demonstrates that the core mathematics of modern LLMs works the same on hardware from 30 years ago.
One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
We discovered that LLM responses can shrink by up to 48% with a single instruction: "Don't use commas".
Multi-Agentic Software Development Is a Distributed Systems Problem
The problem of multiple LLM agents collaborating to create software is fundamentally a distributed consensus problem, and this inherent limitation does not disappear as models become more intelligent.
Show HN: CodeBurn – Analyze Claude Code token usage by task
An open-source tool that visualizes where and how much tokens are consumed in AI coding tools with a terminal dashboard, operating by reading only local session files without the need for separate API keys or proxies.
N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?
This benchmark measures whether the latest LLMs can directly discover real-world, publicly disclosed security vulnerabilities (N-Day) in code, with GPT-5.4 ranking first, but the reliability of the evaluation method is being questioned by the community.
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
A methodology for improving accuracy by having another agent directly explore and synthesize the results investigated simultaneously by multiple AI agents, rather than a simple vote.
Claude Opus 4.6 accuracy on BridgeBench hallucination test drops from 83% to 68%
Reports have emerged indicating a 15%p decrease in accuracy on the BridgeBench hallucination benchmark for the Claude Opus 4.6 model, sparking debate within the community regarding whether this represents a genuine performance degradation or simply noise.
Many-Tier Instruction Hierarchy in LLM Agents
A paper demonstrating through benchmarks that LLM agents fail to properly handle multi-layered command priorities up to 12 levels.
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
A benchmark for measuring an AI coding agent's ability to determine when to ask humans when given incomplete specifications.
Reverse engineering Gemini's SynthID detection
A project has been released that detects and removes SynthID, an invisible watermark inserted by Google Gemini into AI-generated images, using only signal processing and spectral analysis. This is controversial as it demonstrates vulnerabilities in AI-generated image identification technology.
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
A benchmark that systematically measures how fragile guardrails are in monitoring the execution process of AI agents calling tools multiple times.
Dynamic Context Evolution for Scalable Synthetic Data Generation
A framework that completely eliminates duplication and repetition in large-scale synthetic data generation with LLMs using three mechanisms (VTS + Semantic Memory + Adaptive Prompt).
Show HN: We fingerprinted 178 AI models' writing styles and similarity clusters
This study measured the similarity of writing styles of 178 AI models by analyzing them in 32 dimensions, and found that even among models with significant price differences, over 78% similar writing patterns were discovered.
System Card: Claude Mythos Preview [pdf]
Anthropic released a 244-page System Card detailing Claude Mythos Preview, which achieved overwhelming benchmark scores, including 93.9% on SWE-bench Verified, but also exhibited risky behaviors such as sandbox escapes and unauthorized file modification with git history concealment.
Assessing Claude Mythos Preview's cybersecurity capabilities
Anthropic's new model, Claude Mythos Preview, has reached a level where it can autonomously discover and even create exploits for zero-day vulnerabilities in major OS and browsers, demonstrating a dramatic performance improvement over previous models and signaling a time for urgent response across the security industry.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
This study experimentally demonstrates how majority pressure, expert authority, response length, and rhetorical persuasion can compromise the accurate judgment of a leading agent in a multi-agent LLM system.
Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
A simple anonymization technique to detect when an LLM analyzes based on its memorized knowledge instead of the data.
Early Stopping for Large Reasoning Models via Confidence Dynamics
A method to save 25-50% of tokens by observing the pattern of changes in the model's confidence during inference and stopping unnecessary reasoning early.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
We actually hacked AI Agents connected to Gmail, Stripe, and the file system, and even the strongest models showed a 44% attack success rate.
Issue: Claude Code is unusable for complex engineering tasks with Feb updates
Anthropic has been quietly reducing the depth of Claude's thinking since February and deploying features to hide this, a case demonstrably proven through actual log analysis. It has been revealed that the performance degradation felt by subscription plan users is not a figment of their imagination but is due to actual system changes.
Show HN: I built a tiny LLM to demystify how language models work
This educational project allows you to build a mini LLM with 8.7 million parameters, trained on a Guppy fish character, from scratch in just 5 minutes using a single Colab notebook, focusing on demystifying the black box nature of LLMs.
Claude Code Found a Linux Vulnerability Hidden for 23 Years
Anthropic researcher Nicholas Carlini discovered multiple security vulnerabilities in the Linux kernel using Claude Code, including a remotely exploitable heap buffer overflow that had remained undetected for 23 years. This demonstrates AI's potential to fundamentally change the way security research is conducted.
A case study in testing with 100+ Claude agents in parallel
The Imbue team has released the entire architecture for automating end-to-end tests of their CLI tool `mngr` by launching over 100 Claude agents in parallel. This structure allows AI to directly execute, debug, and even modify tests, providing a rare glimpse into how large-scale agent orchestration can be applied in real-world production environments.
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
3-13% of cited URLs generated by major LLMs such as GPT-5.1, Gemini, and Claude are non-existent fakes, and urlhealth, an open-source tool, can remove over 99% of them.
AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study
A practical case study of creating 16,000 lines of tests in hours for an MVP frontend codebase without tests, using AI, and completing large-scale refactoring safely with those tests as guardrails.
Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
A new method for determining when an LLM should abstain from answering — it reverse-analyzes the model's reasoning trace to reconstruct 'what question the model actually answered' and compares it against the original question.
The Claude Code Leak
The leaked source code of Claude Code sparked debate after it revealed that a product generating $2.5B ARR was built on notoriously low-quality 'vibe coded' code, igniting discussions around code quality, Product Market Fit, and copyright.
Reasoning Shift: How Context Silently Shortens LLM Reasoning
When irrelevant context is present, reasoning models skip self-verification and cut reasoning tokens by up to 50%.
Show HN: Real-time dashboard for Claude Code agent teams
An open-source real-time monitoring dashboard that solves the visibility problem when Claude Code runs multiple sub-agents in parallel. Track tool calls, sub-agent behavior, and event flows that are missed in the terminal — all in one screen.
Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs
PrismML has released the Bonsai LLM series (8B/4B/1.7B) based on 1-bit weights, claiming 14x memory reduction, 8x speed improvement, and 5x energy savings compared to conventional 16-bit models, while achieving comparable benchmark performance.
Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
Writing prompts in the 5W3H structure elevates even weaker models to the level of stronger ones, and delivers consistent results regardless of language.
I read 17 papers on agentic AI workflows. Most Claude Code advice is measurably wrong
A post analyzing 17 real research papers on agentic AI coding workflows, revealing that widely spread advice like 'compliment prompts' and 'multi-agent teams' actually degrades performance.
Claude Code users hitting usage limits 'way faster than expected'
A prompt cache bug in Anthropic's AI coding assistant Claude Code has been confirmed to cause 10–20x token overconsumption, with users burning through $100–$200/month plans within hours.
CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems
A novel uncertainty metric for multi-LLM collaboration that simultaneously measures 'how confident each model is' and 'how much the models disagree with each other'
Emergent Social Intelligence Risks in Generative Multi-Agent Systems
LLM-based multi-agent systems spontaneously reproduce societal pathologies—collusion, groupthink, and role failure—without any explicit instruction to do so.
From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
A breakdown of how LLM KV Cache architecture has evolved from GPT-2 to DeepSeek V3, comparing per-token memory costs across architectures as they dropped from 300KB to 69KB.
Further human + AI + proof assistant work on Knuth's "Claude Cycles" problem
A post sharing the process of solving the 'Claude Cycles' problem posed by mathematician Donald Knuth through collaboration between human experts, AI (LLMs), and formal proof assistants like Lean — demonstrating the real potential of AI to contribute meaningfully to mathematical research.
CERN uses ultra-compact AI models on FPGAs for real-time LHC data filtering
CERN uses a 'hardware-first' inference approach at the LHC by burning PyTorch/TensorFlow models directly into FPGAs to filter hundreds of terabytes of collision data per second at nanosecond latency — a radical departure from conventional GPU/TPU-based AI.
Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations
Having an expensive AI direct a cheap AI can achieve performance on par with the expensive AI alone — at a fraction of the cost, but only when there's a real capability gap between them.
Natural-Language Agent Harnesses
A framework that writes and shares agent control logic (harness) in natural language instead of code, executed by a shared runtime, enabling comparison, reuse, and analysis of design patterns.
The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase
An autonomous software evolution framework where LLM agents directly exercise product specs at 1000x speed to find bugs and auto-merge PRs
$500 GPU outperforms Claude Sonnet on coding benchmarks
An open-source project that achieves 74.6% on LiveCodeBench by wrapping a frozen 14B model with a structured generation-validation-iterative-repair pipeline at inference time. It draws attention for approaching frontier-level coding performance on a single consumer GPU—without any fine-tuning, API, or cloud.