Reasoning
Latest 60 papers on Reasoning.
Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Dirac cuts API costs by 64.8% and scores 65.2% on TerminalBench-2 through efficient context management.
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
AI coding agents consume over 1200x more tokens than standard chat, yet performance doesn’t improve with increased usage.
I cancelled Claude: Token issues, declining quality, and poor support
Anthropic’s Claude Code Pro has seen a three-week decline in speed, token allowance, and support quality, sparking discussion among developers.
Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
Tool Attention cuts token usage by 95% in MCP agents by dynamically filtering tool schemas based on user intent.
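The summary names the mechanism but not the code, so here is a minimal, hedged sketch of intent-based tool gating: score every registered tool schema against the user's message and expose only the top matches to the model instead of the full MCP tool list. The keyword_score heuristic and all names below are illustrative stand-ins, not the paper's API.

```python
# Hedged sketch of dynamic tool gating (assumed mechanics, not the paper's code):
# rank tools by relevance to the user's intent and send only the top-k schemas.

def keyword_score(intent: str, tool: dict) -> int:
    """Toy relevance score: count intent words appearing in the tool's
    name or description. A real system would likely use embeddings."""
    words = set(intent.lower().split())
    haystack = (tool["name"] + " " + tool["description"]).lower()
    return sum(1 for w in words if w in haystack)

def gate_tools(intent: str, tools: list[dict], k: int = 3) -> list[dict]:
    """Return only the k most relevant tool schemas for this turn."""
    ranked = sorted(tools, key=lambda t: keyword_score(intent, t), reverse=True)
    return [t for t in ranked[:k] if keyword_score(intent, t) > 0]

tools = [
    {"name": "read_file", "description": "Read a file from disk"},
    {"name": "run_sql", "description": "Execute a SQL query"},
    {"name": "send_email", "description": "Send an email message"},
]
print(gate_tools("read the config file on disk", tools, k=1))
# -> only read_file's schema is sent; the other schemas never cost tokens
```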
Diagnosing CFG Interpretation in LLMs
When exposed to novel grammar rules, LLMs frequently produce syntactically correct output while losing the intended semantics.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.
Kuri – Zig based agent-browser alternative
Kuri, a 464KB browser automation tool built with Zig, cuts token costs in AI agent loops by eliminating Node.js dependencies.
Show HN: GoModel – an open-source AI gateway in Go
GoModel unifies access to OpenAI, Anthropic, Gemini, and other AI providers through a single, OpenAI-compatible API, offering a compiled-language alternative to LiteLLM.
Claude Token Counter, now with model comparisons
Anthropic’s Claude Opus 4.7 consumes up to 46% more tokens than its predecessor on the same input due to a tokenizer change, effectively raising costs.
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Chain-of-Thought reasoning decreases accuracy across 17 models on image-based spatial reasoning tasks.
Neurosymbolic Repo-level Code Localization
LogicLoc cuts through keyword-shortcut biases in code search by having an LLM generate Datalog queries executed by a deterministic inference engine.
Context Over Content: Exposing Evaluation Faking in Automated Judges
Telling an LLM judge that 'it will be discarded if it gives low scores' makes it quietly hand out generous judgments, leaving no trace in the chain of thought.
Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
An agent optimization technique that achieves 74% of GPT-4o's performance at only 23.9% of the cost by starting with an SLM and hotswapping to GPT-4 when failure is predicted.
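A plausible minimal sketch of the described control flow, assuming the usual draft-then-escalate pattern: call_slm, call_llm, and predict_failure below are hypothetical placeholders, not the paper's interfaces.

```python
# Hedged sketch of early termination + model hotswap (assumed control flow).

def call_slm(task: str) -> str:
    # Placeholder for a small-language-model call.
    return f"slm-answer({task})"

def call_llm(task: str) -> str:
    # Placeholder for the expensive-model call.
    return f"llm-answer({task})"

def predict_failure(task: str, draft: str) -> float:
    # Placeholder failure predictor; a real one might be a classifier over
    # the draft's self-consistency votes or logprobs.
    return 0.9 if "hard" in task else 0.1

def solve(task: str, threshold: float = 0.5) -> str:
    draft = call_slm(task)                 # cheap first attempt
    if predict_failure(task, draft) < threshold:
        return draft                       # early termination: keep the SLM answer
    return call_llm(task)                  # hotswap to the expensive model

print(solve("easy arithmetic"))   # stays on the SLM
print(solve("hard proof"))        # escalates to the big model
```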
Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference
Google's open-source Gemma 4 model can now run on iPhone with fully local inference and no cloud, a sign that on-device AI has moved past the experimental stage into practical use.
Parallax: Why AI Agents That Think Must Never Act
Prompt guardrails are useless once an agent is compromised; this post proposes a security architecture that fully separates inference from execution at the OS process level.
Show HN: CodeBurn – Analyze Claude Code token usage by task
An open-source terminal dashboard that visualizes where and how many tokens AI coding tools consume, working purely from local session files with no API keys or proxies required.
GAIA – Open-source framework for building AI agents that run on local hardware
AMD has released GAIA, a Python/C++ framework for running AI agents on local PCs without the cloud. This approach addresses privacy and latency concerns, but critics point to the practical limitations of the ROCm ecosystem.
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
A method that improves accuracy by having an aggregator agent actively explore and synthesize the findings of multiple parallel agents, rather than taking a simple vote.
Reallocating $100/Month Claude Code Spend to Zed and OpenRouter
A developer tired of the usage limits on the Claude Code Max plan ($100/month) explains switching to the Zed editor ($10/month) plus OpenRouter (pay-as-you-go), gaining credit rollover and free choice of models.
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
A benchmark that systematically measures how fragile guardrails are when monitoring AI agents across multi-step tool-calling trajectories.
MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
Introducing MegaTrain, a system that uses CPU memory as primary storage and the GPU solely as a compute engine, enabling full-precision training of 120B-parameter models on a single H200 GPU.
We moved Railway's frontend off Next.js. Builds went from 10+ mins to under 2
Railway's firsthand account of migrating its production frontend from Next.js to Vite + TanStack Start, cutting build times from over 10 minutes to under 2. Teams that deploy several times a day will recognize how directly build time shapes development speed.
Tailslayer: Library for reducing tail latency in RAM reads
This C++ library implements hedged reads: data is replicated across independent DRAM channels and the result is taken from whichever channel responds first, reducing the worst-case (tail) latency of RAM reads caused by DRAM refresh timing conflicts.
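The library itself is C++; the Python sketch below only illustrates the hedged-read pattern the summary describes (replicate writes, race reads, first responder wins), using an in-memory stand-in for the DRAM channels rather than Tailslayer's actual API.

```python
# Hedged-read pattern: issue the same read against two replicas and take
# whichever answers first, so one stalled path can't set your tail latency.
import concurrent.futures as cf
import random
import time

REPLICAS = {"channel_a": {}, "channel_b": {}}
pool = cf.ThreadPoolExecutor(max_workers=2)

def write(key: str, value: bytes) -> None:
    for replica in REPLICAS.values():       # replicate to both channels
        replica[key] = value

def read_from(name: str, key: str) -> bytes:
    # Simulate occasional refresh-conflict stalls with a random delay.
    time.sleep(random.choice([0.001, 0.001, 0.050]))
    return REPLICAS[name][key]

def hedged_read(key: str) -> bytes:
    futures = [pool.submit(read_from, name, key) for name in REPLICAS]
    done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    return done.pop().result()              # first responder wins

write("x", b"payload")
print(hedged_read("x"))
```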
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
This study experimentally demonstrates how majority pressure, expert authority, response length, and rhetorical persuasion can compromise the accurate judgment of a leading agent in a multi-agent LLM system.
Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
A simple anonymization technique for detecting when an LLM's analysis draws on memorized knowledge instead of the supplied data.
Early Stopping for Large Reasoning Models via Confidence Dynamics
A method to save 25-50% of tokens by observing the pattern of changes in the model's confidence during inference and stopping unnecessary reasoning early.
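One hedged reading of the mechanism: watch a per-chunk confidence signal (e.g., derived from answer-token logprobs) and cut generation once it plateaus high. The sketch below assumes such a signal is available; nothing in it is the paper's actual code.

```python
# Hedged sketch of confidence-based early stopping (assumed mechanics).

def confidence_early_stop(chunks, threshold=0.9, patience=2):
    """chunks yields (text, confidence) pairs. Stop once confidence stays
    above threshold for `patience` consecutive chunks."""
    kept, streak = [], 0
    for text, conf in chunks:
        kept.append(text)
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            break                 # remaining reasoning tokens are never generated
    return "".join(kept)

trace = [("Step 1 ...", 0.40), ("Step 2 ...", 0.92),
         ("Step 3 ...", 0.95), ("Step 4 ...", 0.96), ("Step 5 ...", 0.97)]
print(confidence_early_stop(iter(trace)))   # stops after step 3 of 5
```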
Issue: Claude Code is unusable for complex engineering tasks with Feb updates
A user's log analysis argues that Anthropic has been quietly reducing the depth of Claude's thinking since February and shipping features that mask the change, suggesting the performance degradation subscribers have felt reflects real system changes rather than imagination.
Show HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloud
A Chrome extension that runs Google's Gemma 4 model fully locally in the browser via WebGPU, letting it read web pages and perform DOM actions such as clicks and text input with no API key or server required.
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
An open-sourced real-time multimodal conversation system (speech and video in, voice out) that runs fully locally on an Apple Silicon M3 Pro with no internet connection, handling speech recognition, video understanding, and TTS simultaneously at zero cloud cost.
Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code
This article explains how to run the Google Gemma 4 26B-A4B model locally on macOS using LM Studio 0.4.0's lms CLI and integrate it with Claude Code. Thanks to the MoE architecture, it can run at 51 tok/s on a 48GB MacBook Pro, enabling coding tasks without API costs.
This new technique saves 60% of my token expenses
You can cut LLM response tokens by 60% with a telegraphic style that keeps only nouns and verbs, dropping articles, conjunctions, and auxiliary verbs.
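A minimal way to try the trick yourself, assuming a standard chat-completion setup; the prompt wording below is illustrative, not the author's.

```python
# One hedged way to apply the telegraphic-style trick from the post:
# a system prompt instructing the model to drop function words.
TELEGRAPHIC_SYSTEM_PROMPT = (
    "Answer in telegraphic style: nouns and verbs only. "
    "Omit articles, conjunctions, and auxiliary verbs. "
    "Example: instead of 'You should first install the package and then "
    "run the tests', write 'Install package, run tests'."
)
# Pass this as the system message to any chat-completion API. The savings
# come from shorter responses, so output-token pricing is where the 60% lands.
```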
I reverse-engineered why Claude Code burns through your usage so fast. 7 bugs that stack on top of each other — and the worst one activates when Extra Usage kicks in
A Max 20x subscriber reverse-engineered the Claude Code CLI source and discovered 7 bugs that drain usage abnormally fast. The core issue is a 'death spiral' where switching to Extra Usage demotes cache TTL from 1 hour to 5 minutes, causing costs to spike 2.8x.
Taught Claude to talk like a caveman to use 75% less tokens.
This post details a prompt technique that drastically compresses Claude's response style, cutting token usage by 75%, useful for developers looking to reduce API costs.
Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
A new method for deciding when an LLM should abstain: it inverts the model's reasoning trace to reconstruct the question the model actually answered, then compares that against the original question.
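A hedged sketch of the idea: ask a model to reconstruct the question a reasoning trace actually answers, then abstain if it diverges from the original. The llm stub and the Jaccard similarity below are toy stand-ins for whatever the paper actually uses.

```python
# Hedged sketch of reasoning-trace inversion for abstention (toy stand-ins).

def llm(prompt: str) -> str:
    # Placeholder for any chat-completion call; hardcoded for the demo below.
    return "What is the population of Paris, Texas?"

def similarity(a: str, b: str) -> float:
    # Toy word-level Jaccard similarity; the paper presumably uses something stronger.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def should_abstain(question: str, trace: str, cutoff: float = 0.8) -> bool:
    reconstructed = llm(
        "State, as a single question, what question this reasoning trace "
        f"actually answers:\n{trace}"
    )
    return similarity(question, reconstructed) < cutoff

# The trace drifted to Paris, Texas while the user asked about Paris, France:
print(should_abstain("What is the population of Paris, France?", "..."))  # True
```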
Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
In Function-Calling agents, using only 32 tokens of CoT yields peak performance — using 256 tokens actually performs worse than no reasoning at all.
I built a tool that saves ~50K tokens per Claude Code conversation by pre-indexing your codebase
A tool that pre-indexes your codebase so Claude Code doesn't reload it for every conversation, saving roughly 50K tokens per session.
Reasoning Shift: How Context Silently Shortens LLM Reasoning
When irrelevant context is present, reasoning models skip self-verification and cut reasoning tokens by up to 50%.
Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs
PrismML has released the Bonsai LLM series (8B/4B/1.7B) based on 1-bit weights, claiming 14x memory reduction, 8x speed improvement, and 5x energy savings compared to conventional 16-bit models, while achieving comparable benchmark performance.
I read 17 papers on agentic AI workflows. Most Claude Code advice is measurably wrong
A post analyzing 17 real research papers on agentic AI coding workflows, revealing that widely spread advice like 'compliment prompts' and 'multi-agent teams' actually degrades performance.
Claude Code users hitting usage limits 'way faster than expected'
A prompt cache bug in Anthropic's AI coding assistant Claude Code has been confirmed to cause 10–20x token overconsumption, with users burning through $100–$200/month plans within hours.
Claude Code bug can silently 10-20x API costs
A warning post about two cache-related bugs in Claude Code that can silently spike API costs by up to 10–20x. Users on the $200/month plan are reportedly burning through their limits far faster than expected.
Ollama is now powered by MLX on Apple Silicon in preview
Ollama has switched its inference backend on Apple Silicon from llama.cpp to Apple's MLX framework, delivering up to nearly 2x faster inference speeds. On M5 chips, it also leverages the GPU Neural Accelerator, bringing meaningful performance gains to coding agent workflows.
Universal Claude.md – cut Claude output tokens
A project claiming that adding a single CLAUDE.md file to your project root can strip Claude's unnecessary verbosity (sycophancy, filler openers and closers, unsolicited suggestions) and cut output tokens by up to 63%, though the community has raised strong doubts about the benchmark's reliability and real-world effectiveness.
PSA: Claude Code has two cache bugs that can silently 10-20x your API costs — here's the root cause and workarounds
A warning about two cache-related bugs in Claude Code that can silently raise API costs by up to 10-20x; the original post is inaccessible, so the details cannot be confirmed.
Hamilton-Jacobi-Bellman Equation: Reinforcement Learning and Diffusion Models
A math blog post showing how 1840s physics equations connect modern RL and Diffusion Models, explaining that continuous-time RL and generative model training are two faces of the same optimal control problem.
From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
A breakdown of how LLM KV Cache architecture has evolved from GPT-2 to DeepSeek V3, comparing per-token memory costs across architectures as they dropped from 300KB to 69KB.
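For intuition, the per-token KV cache cost under standard attention is 2 (K and V) x layers x kv_heads x head_dim x bytes per element; grouped-query attention shrinks kv_heads, and latent-compression schemes like DeepSeek's MLA shrink the per-layer vector further. A quick calculation with published Llama configs (note the article's 300KB/69KB figures are for its own model lineup):

```python
# Per-token KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Configs below are well-known published values, used here only for intuition.

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

mha = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)  # Llama-2-7B (MHA)
gqa = kv_bytes_per_token(layers=32, kv_heads=8,  head_dim=128)  # Llama-3-8B (GQA)
print(mha // 1024, "KB vs", gqa // 1024, "KB per token")         # 512 KB vs 128 KB
```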
CERN uses ultra-compact AI models on FPGAs for real-time LHC data filtering
CERN uses a 'hardware-first' inference approach at the LHC by burning PyTorch/TensorFlow models directly into FPGAs to filter hundreds of terabytes of collision data per second at nanosecond latency — a radical departure from conventional GPU/TPU-based AI.
Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations
Having an expensive AI direct a cheap AI can match the expensive AI's solo performance at a fraction of the cost, but only when there is a real capability gap between them.
Show HN: I put an AI agent on a $7/month VPS with IRC as its transport layer
A developer shares how they built an AI agent for their portfolio site using IRC as the transport layer, running on a $7/month VPS and handling direct GitHub code analysis and visitor Q&A. Unlike the typical 'AI chatbot portfolio' that simply feeds a resume into an LLM, it answers from the actual codebase, making it a notable practical example of AI agent architecture.
We rewrote JSONata with AI in a day, saved $500k/year
Reco saved $500K annually by rewriting their Node.js-based JSONata evaluation pipeline in Go using Claude AI — but the HN community fired back with criticism: 'Why did you let this linger so long?' and 'Why didn't you use an existing Go library?'
Chroma Context-1: Training a Self-Editing Search Agent
Chroma's newly released 20B parameter agentic search model claims frontier-LLM-level retrieval performance at 1/10 the cost and 10x the speed — though a significant controversy over failure to cite prior work has emerged in the community.
$500 GPU outperforms Claude Sonnet on coding benchmarks
An open-source project that hits 74.6% on LiveCodeBench by wrapping a frozen 14B model in a structured generate-validate-repair pipeline at inference time, approaching frontier-level coding performance on a single consumer GPU with no fine-tuning, API, or cloud.
Running Claude Code fully offline on a MacBook — no API key, no cloud, 17s per task
A post sharing how to run Claude Code fully offline on a MacBook by connecting it to a local LLM without an API key or cloud, useful for developers who want to use an AI coding assistant at no cost.
Saying 'hey' cost me 22% of my usage limits
A post sharing the experience that sending a short greeting like 'hey' to Claude first can consume a significant portion of your total usage limit, raising awareness about prompt-writing habits for token conservation.
Your Claude Code Limits Didn't Shrink — I Think the 1M Context Window Is Eating Them Alive
An analysis post arguing that the perceived sudden reduction in Claude Code limits is not an actual limit decrease, but rather a spike in token consumption driven by the 1M context window.
TurboQuant: Redefining AI efficiency with extreme compression
Google Research's two-stage vector compression (PolarQuant + QJL) achieves a 6x KV cache reduction with zero accuracy loss and an 8x attention speedup on H100 GPUs.
Evaluating LLM-Based Test Generation Under Software Evolution
A large-scale study across 8 LLMs and 22,374 program variants finds that over 99% of LLM-generated tests stay aligned to the original code's patterns and degrade sharply after code changes.
Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
A Rust-based open-source project that intelligently distributes LLM models across GPU, RAM, and NVMe when they exceed your Mac's physical memory, enabling models that crash llama.cpp with OOM errors to actually run.
LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?
A training-free technique (RYS) that duplicates Transformer layers works across all modern LLMs — and reveals that internal representations converge toward a "universal language" independent of human language.
[D] The "serverless GPU" market is getting crowded — a breakdown of how different platforms actually differ
"Serverless GPU" means four different things depending on the provider — breakdown of Vast.ai, RunPod, and Yotta Labs architectural differences