Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
TL;DR Highlight
Tool Attention cuts token usage by 95% in MCP agents by dynamically filtering tool schemas based on user intent.
Who Should Read
Backend/AI engineers experiencing token cost explosions or performance degradation from context pollution when deploying LangGraph or MCP-based AI agents in production, especially teams operating multi-tool agents connected to 10+ MCP servers.
Core Mechanics
- MCP's architecture re-injects the entire JSON schema of every connected tool on each conversation turn, wasting 15k-55k tokens per turn even with just 4-6 servers connected, an overhead dubbed the 'Tools Tax'.
- As the 'Tools Tax' accumulates, LLM inference quality collapses sharply once context utilization exceeds 70%, leading to hallucinated tool parameters, confusion between similar tools, and loss of multi-step plan memory.
- Tool Attention is a middleware composed of three components: (1) an ISO (Intent-Schema Overlap) score that measures similarity between sentence-transformers embeddings of the user intent and each tool description, (2) a state-based gating function that checks preconditions (authentication status, presence of a prior tool's output, etc.), and (3) a two-phase Lazy Schema Loader that loads full schemas only for the selected top-k tools.
- Phase-1 maintains only short summaries (≤60 tokens) of all tools in context, while Phase-2 injects the full JSON schema only for the top-k tools that pass the gating function. The summary pool achieves an 84% cache hit rate.
- If the model calls a tool mistakenly dropped by the gating function, an 'after_model' hook returns a deterministic 'tool_not_available' error; on 78% of turns where this gate triggers, the model recovers normally in the next turn.
- As a security byproduct, Tool Attention defends against Tool Poisoning Attacks (hijacking agent control with malicious tool descriptions) by automatically gating out tools with low cosine similarity between intent and description.
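The ISO gate described above reduces to a cosine similarity between the embedded intent and each tool's embedded summary, combined with a precondition check and a top-k cut. A minimal sketch under assumed interfaces (the `gate_tools`/`iso_scores` helper names are illustrative, not the project's API; embeddings are passed in precomputed so any encoder can supply them):

```python
import numpy as np

def iso_scores(query_vec, summary_vecs):
    """Cosine similarity between the intent embedding and each tool-summary embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    s = summary_vecs / np.linalg.norm(summary_vecs, axis=1, keepdims=True)
    return s @ q

def gate_tools(query_vec, tools, summary_vecs, threshold=0.28, top_k=10,
               precondition=lambda tool_id: True):
    """Keep at most top_k tools whose ISO score clears the threshold and whose
    preconditions (auth, prior output, ...) hold; all others stay summary-only."""
    scores = iso_scores(query_vec, summary_vecs)
    ranked = sorted(range(len(tools)), key=lambda i: -scores[i])
    selected = [tools[i]["id"] for i in ranked
                if scores[i] >= threshold and precondition(tools[i]["id"])]
    return selected[:top_k]
```

Because a poisoned tool description that is unrelated to the user's intent scores a low cosine similarity, it is gated out before its schema ever reaches the context, which is the security byproduct noted above.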
Evidence
- "Benchmarking with 120 tools and 6 servers directly measured a 95.0% reduction in tool tokens from 47,312 to 2,368. Effective context utilization increased 3.8x from 0.24 to 0.91."
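A quick sanity check that the headline numbers are mutually consistent (pure arithmetic on the reported figures, nothing measured here):

```python
# Reported per-turn tool token counts before/after gating
before, after = 47_312, 2_368
reduction = 1 - after / before
print(f"{reduction:.1%}")   # prints 95.0%

# Reported effective context utilization before/after
util_gain = 0.91 / 0.24
print(f"{util_gain:.1f}x")  # prints 3.8x
```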
How to Apply
- "Integrate IntentRouter as a before_model hook in LangGraph agents. Index tool summaries with FAISS, embed the user query each turn, select the top-k tools, and inject only those schemas. Calibrate the threshold θ on 100-200 (query, correct tool) pairs to maximize F1; θ typically lands in 0.22-0.32."
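The θ calibration step can be sketched as a simple sweep over candidate thresholds on the labeled (query, correct tool) pairs, keeping the value with the best F1. The `score_fn` callable here is a hypothetical stand-in for the real ISO scorer:

```python
def calibrate_threshold(pairs, score_fn, candidates=None):
    """Pick the gating threshold that maximizes F1 over labeled (query, gold_tool) pairs.

    pairs:    list of (query, gold_tool_id)
    score_fn: query -> {tool_id: iso_score}; stand-in for the real ISO scorer
    """
    if candidates is None:
        candidates = [i / 100 for i in range(10, 51)]  # sweep theta over [0.10, 0.50]
    best_theta, best_f1 = candidates[0], -1.0
    for theta in candidates:
        tp = fp = fn = 0
        for query, gold in pairs:
            selected = {t for t, s in score_fn(query).items() if s >= theta}
            tp += int(gold in selected)
            fp += len(selected - {gold})
            fn += int(gold not in selected)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_theta, best_f1 = theta, f1
    return best_theta, best_f1
```

A higher θ trims more schemas (saving tokens) at the risk of gating out a needed tool, which the after_model fallback then has to absorb; maximizing F1 on held-out pairs balances the two.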
Code Example
# Quick Start: Tool Attention middleware setup
from sentence_transformers import SentenceTransformer
from vector_store import ToolVectorStore
from lazy_loader import LazySchemaLoader
from intent_router import IntentRouter
from tool_attention import ToolAttention
import tiktoken

# 1. Define the tool catalog (summaries written with user intent in mind)
tools = [
    {"id": "github_search_issues", "summary": "Search GitHub issues by label, assignee, and status"},
    {"id": "slack_post_message", "summary": "Send a message to a Slack channel"},
    {"id": "db_query", "summary": "Query the database with a SQL query"},
    # ... 120 tools total
]

# 2. Initialize components
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
store = ToolVectorStore(dim=384)
store.add_tools(tools, encoder)
loader = LazySchemaLoader(registry_path="./schemas")  # one <tool_id>.json per tool
router = IntentRouter(
    store=store,
    encoder=encoder,
    threshold=0.28,  # calibrate to the F1-maximizing point
    top_k=10,
)
enc = tiktoken.get_encoding("cl100k_base")
ta = ToolAttention(
    store=store,
    loader=loader,
    router=router,
    token_counter=lambda s: len(enc.encode(s)),
)

# 3. Run each turn (before_model hook)
user_query = "Find Slack messages related to the CSAT drop last week and create a Jira ticket"

# Precondition check (e.g., authentication status); agent_state comes from the agent runtime
def precondition_check(tool_id):
    if "github_write" in tool_id:
        return agent_state.get("github_token") is not None
    return True

result = ta.before_model(user_query, precondition_check=precondition_check)
print(f"Phase-1 tokens (summary pool): {result.phase1_tokens}")
print(f"Phase-2 tokens (full schema): {result.phase2_tokens}")
print(f"Selected tools: {result.active_ids}")
# → injects ~2.4k tokens instead of the full 47k

# 4. Hallucination gate after the model responds; model_response comes from the agent runtime
requested_tool = model_response.get("tool_call")
error = ta.after_model(result.active_ids, requested_tool)
if error:
    # Surface a structured error to the model → 78% of such turns recover automatically next turn
    tool_result = {"error": "tool_not_available", "available": result.active_ids}
Original Abstract
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent-Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is available at https://github.com/asadani/tool-attention