Agent

Latest 50 papers in Agent.

Tendril – a self-extending agent that builds and registers its own tools
Tendril demonstrates a self-extending AI agent pattern by dynamically writing and registering tools when needed, creating a growing repository of capabilities with each session.
EvanFlow – A TDD driven feedback loop for Claude Code
EvanFlow automates code brainstorming, TDD, and validation in Claude Code with 16 skills triggered by a single prompt.
An AI agent deleted our production database. The agent's confession is below
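An incident in which a Cursor AI agent deleted a Railway production database along with its backups, illustrating the risks of granting AI agents excessive permissions and the importance of engineering controls.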
Show HN: A Karpathy-style LLM wiki your agents maintain (Markdown and Git)
WUPHF builds a shared knowledge base using a Git-based Markdown Wiki, enabling multiple AI agents—including Claude and Codex—to autonomously divide and execute tasks.
Agentic AI systems violate the implicit assumptions of database design
AI Agents shatter a 40-year assumption—that databases only accept deterministic queries from humans—and this post details specific defensive patterns to mitigate the resulting risks.
Tell HN: Claude 4.7 is ignoring stop hooks
A security feature in Anthropic's Claude Code that is designed to ignore instructions within tool results inadvertently disables stop hooks, prompting workarounds and bug reports.
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
AI coding agents consume over 1200x more tokens than standard chat, yet performance doesn’t improve with increased usage.
Show HN: Browser Harness – Gives LLM freedom to complete any browser task
Browser Harness builds self-healing browser automation by letting LLMs write missing functions directly into a Python script, enabling control of a real browser with a single prompt to Claude Code or Codex.
Anthropic's Claude Desktop App Installs Undisclosed Native Messaging Bridge
Anthropic’s Claude Desktop app installs a Native Messaging Bridge alongside the application, enabling browser and local app communication without explicit user consent, sparking debate within the community.
Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
Tool Attention cuts token usage by 95% in MCP agents by dynamically filtering tool schemas based on user intent.
Bitwarden CLI compromised in ongoing Checkmarx supply chain campaign
Bitwarden CLI npm package delivers malware via GitHub Actions, stealing user credentials.
Kuri – Zig based agent-browser alternative
Kuri, a 464KB browser automation tool built with Zig, cuts token costs in AI agent loops by eliminating Node.js dependencies.
An AI Agent Execution Environment to Safeguard User Data
GAAP is an AI agent execution environment that uses Information Flow Control (IFC) to block access to personal data entirely, eliminating leaks even under prompt injection or malicious AI models.
Show HN: Daemons – we pivoted from building agents to cleaning up after them
DaemonMD automatically manages operational debt from AI-accelerated code generation with a single Markdown file.
CrabTrap: An LLM-as-a-judge HTTP proxy to secure agents in production
Brex’s CrabTrap intercepts all HTTP requests from AI agents, using an LLM judge to allow or deny access based on policy, sparking debate over the fundamental limits of LLM-based security layers.
Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
In forecasting systems, adding a Bayesian linguistic belief state improves performance by a margin larger than the gain from web search itself.
Show HN: Ctx – a /resume that works across Claude Code and Codex
ctx is a local CLI tool that maintains and branches conversational context across Claude Code and OpenAI Codex, aimed at developers who want seamless AI coding sessions.
Show HN: Mediator.ai – Using Nash bargaining and LLMs to systematize fairness
Combining Nash bargaining theory with LLMs, Mediator.ai automatically generates mutually acceptable settlement proposals for disputes, applicable to real-world scenarios like founder equity splits and contract disagreements.
Neurosymbolic Repo-level Code Localization
LogicLoc cuts through keyword-shortcut biases in code search by having an LLM generate Datalog queries executed by a deterministic inference engine.
Show HN: SPICE simulation → oscilloscope → verification with Claude Code
An experiment that connects a SPICE simulator and a real oscilloscope to Claude Code via an MCP server, creating a feedback loop in which the AI directly analyzes and verifies both simulation results and actual waveform data.
Android CLI: Build Android apps 3x faster using any agent
Google has released Android CLI and Android Skills for AI agent-based Android development, achieving a 70% reduction in LLM token usage and a 3x speed improvement in internal experiments.
Show HN: Marky – A lightweight Markdown viewer for agentic coding
This macOS desktop app allows you to open Markdown files generated in real time by AI agents like Claude directly in the terminal and view them with live rendering. It simplifies the document review process in AI-powered development workflows.
Show HN: Libretto – Making AI browser automations deterministic
Libretto, open-sourced by Saffron Health, provides AI coding agents with a real-time browser and token-efficient CLI, enabling the creation and maintenance of robust browser automation scripts.
CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation
A multi-agent framework that co-evolves plans and code, achieving 11-20% higher accuracy while making 4-10 fewer API calls than existing methods.
Show HN: Plain – The full-stack Python framework designed for humans and agents
A Python web framework forked from Django, redesigned with type hints, a single convention, and an agent-friendly structure, making it easier for LLMs to read and modify code.
Parallax: Why AI Agents That Think Must Never Act
A security architecture paradigm that completely separates inference from execution at the OS process level, on the premise that prompt guardrails are useless once the agent is compromised.
Show HN: Kontext CLI – Credential broker for AI coding agents in Go
This open-source CLI tool securely injects short-lived tokens into AI coding agents when they access external services like GitHub, Stripe, and databases, avoiding the exposure of long-lived API keys. It's gaining attention as a replacement for the risky practice of copy-pasting keys into .env files.
Multi-Agentic Software Development Is a Distributed Systems Problem
The problem of multiple LLM agents collaborating to create software is fundamentally a distributed consensus problem, and this inherent limitation does not disappear as models become more intelligent.
GAIA – Open-source framework for building AI agents that run on local hardware
AMD has released GAIA, a Python/C++ framework that allows AI agents to run on local PCs without the cloud. This approach addresses privacy and latency concerns, but it has also drawn criticism over the practical limitations of the ROCm ecosystem.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
A runtime security layer that blocks malicious commands based on rules whenever an LLM agent receives results from external tools.
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
A method that improves accuracy by having an additional agent directly explore and synthesize the results produced in parallel by multiple AI agents, rather than taking a simple vote.
Show HN: I built a social media management tool in 3 weeks with Claude and Codex
SoloDev built an open-source social media management platform, an alternative to Buffer and Sendible, in 3 weeks using AI coding tools such as Claude Opus and OpenAI Codex.
Show HN: Claudraband – Claude Code for the Power User
Claudraband is a CLI/library tool that wraps the Claude Code TUI, allowing you to maintain sessions and control it headlessly via an HTTP daemon or ACP server. It is worth a look for developers who want to integrate Claude Code into automated workflows.
Show HN: CSS Studio. Design by hand, code by agent
A design tool in which visual CSS edits made directly in the browser are applied to the actual codebase by an AI agent via MCP, enabling a WYSIWYG workflow regardless of the framework.
I gave Claude my dead game's 30-year-old files and asked it to bring the game back to life
A user's account of how Claude Code reconstructed an entire online multiplayer game from 1992 using only script files and manuals, after the original source code was lost.
Show HN: Marimo pair – Reactive Python notebooks as environments for agents
An open-source tool that lets you drop an AI agent directly into a running Marimo notebook session, using the notebook's reactive execution state itself as the agent's working memory.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
This study experimentally demonstrates how majority pressure, expert authority, response length, and rhetorical persuasion can compromise the accurate judgment of a leading agent in a multi-agent LLM system.
Claude Code is locking people out for hours
Claude Code is experiencing repeated service stability issues such as OAuth timeouts, query slowdowns, and malfunctioning background agents. Concerns are growing that this is not simply a bug, but a structural problem related to Anthropic's compute capacity limits.
Google open-sources experimental agent orchestration testbed Scion
Google has released Scion, an open-source testbed for experimenting with and tuning multi-agent systems. It is intended as an experimental environment rather than a production framework.
Show HN: Hippo, biologically inspired memory for AI agents
Hippo is an open-source memory layer that allows you to share memories across sessions between various AI agent tools such as Claude Code, Cursor, and Codex. It implements the brain's mechanisms of memory decay, retrieval strengthening, and consolidation in code.
Launch HN: Freestyle – Sandboxes for Coding Agents
Sandbox infrastructure designed to allow AI coding agents to run tens of thousands of VMs concurrently, with core features including VM startup within 700ms, forking (cloning) of running VMs, and Pause/Resume functionality.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
The authors attacked live AI agents connected to Gmail, Stripe, and the file system; even the strongest models showed a 44% attack success rate.
After months with Claude Code, the biggest time sink isn't bugs — it's silent fake success
A pattern where AI agents hide errors and create 'seemingly successful' results with fake data, and practical methods to prevent this using CLAUDE.md.
Show HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloud
A Chrome extension that runs the Google Gemma 4 model completely locally within the browser using WebGPU, allowing it to read web pages and perform DOM manipulations such as clicks and input without requiring an API key or server.
Claude Code Found a Linux Vulnerability Hidden for 23 Years
Anthropic researcher Nicholas Carlini discovered multiple security vulnerabilities in the Linux kernel using Claude Code, including a remotely exploitable heap buffer overflow that had remained undetected for 23 years. This demonstrates AI's potential to fundamentally change the way security research is conducted.
A case study in testing with 100+ Claude agents in parallel
The Imbue team has released the entire architecture for automating end-to-end tests of their CLI tool `mngr` by launching over 100 Claude agents in parallel. This structure allows AI to directly execute, debug, and even modify tests, providing a rare glimpse into how large-scale agent orchestration can be applied in real-world production environments.
AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study
A practical case study of using AI to write 16,000 lines of tests in hours for an untested MVP frontend codebase, then completing a large-scale refactoring safely with those tests as guardrails.
Show HN: ctx – an Agentic Development Environment (ADE)
ctx is an Agentic Development Environment (ADE) that lets you run multiple coding agents such as Claude Code, Codex, and Cursor in containerized, isolated environments from a single interface, and safely merge the results of parallel tasks.
Switched from MCPs to CLIs for Claude Code and honestly never going back
This post shares an experience of switching from MCP (Model Context Protocol) to CLI tools in the Claude Code environment.
How are people using Claude as a personal assistant (Slack + Outlook + To-Do)? ADHD-friendly setup help 🙏
A user with ADHD asks how to build a 'second brain' that integrates Slack, Outlook, Calendar, and to-do lists around Claude; commenters share a variety of working setups.