Anthropic's Claude Adds Three Beta Tools to Cut Context and Boost Accuracy

Anthropic's Claude Developer Platform introduces three beta features that let Claude discover, orchestrate, and learn tool usage on demand. Together they reduce context bloat, cut token costs, and boost accuracy in multi-step workflows.


TL;DR

  • New beta features: Tool Search Tool, Programmatic Tool Calling, Tool Use Examples — on-demand discovery, code-based orchestration, and in-tool input examples.
  • Tool Search Tool: mark tools with defer_loading: true to exclude them from the initial prompt; search via regex, BM25, or embeddings/custom strategies; example context drops from ~77K to ~8.7K tokens (~85% token reduction); Opus 4 accuracy 49% → 74%, Opus 4.5 79.5% → 88.1%.
  • Programmatic Tool Calling: Claude emits Python that runs in a sandboxed Code Execution environment; intermediate tool responses are handled in the runtime rather than injected back into context; token usage drops from ~43,588 to ~27,297 (~37% reduction); internal knowledge 25.6% → 28.5%, GIA 46.5% → 51.2%; latency falls by avoiding multiple inference round-trips.
  • Tool Use Examples: include input_examples alongside input_schema to teach formats and parameter correlations; complex-parameter accuracy 72% → 90%; guidance: 1–5 realistic examples mixing minimal and full.
  • Adoption guidance: use Tool Search Tool when tool definitions exceed ~10K tokens or multi-server selection errors occur; use Programmatic Tool Calling for large datasets, multi-step or parallel workflows and to hide intermediates; use Tool Use Examples when schemas leave conventions or parameter combos ambiguous.

Anthropic adds three beta features for dynamic, efficient tool use on the Claude Developer Platform

Agents that coordinate many services face two recurring bottlenecks: large tool libraries that bloat model context, and multi-step workflows that push intermediate results into the model’s context window. Anthropic’s Claude Developer Platform now includes three beta capabilities designed to address these problems by letting Claude discover, orchestrate, and learn tool usage more efficiently: Tool Search Tool, Programmatic Tool Calling, and Tool Use Examples.

Why this matters

Large MCP deployments can easily produce tens of thousands of tokens of tool definitions before an agent starts work. In one illustrative setup, five servers produced roughly 55K tokens; Anthropic has observed tool definitions reaching 134K tokens in some cases. Aside from token cost, common failure modes include wrong tool selection and incorrect parameter usage when many similarly named tools exist. The new features approach these issues separately but are intended to be used together when appropriate.

Tool Search Tool: discover tools on demand

The Tool Search Tool avoids loading full tool libraries into Claude’s context up front. Tools can be submitted to the API with defer_loading: true, keeping them out of the initial prompt. Claude initially sees only the Tool Search Tool plus any core tools with defer_loading: false. When a capability is required, Claude searches (regex, BM25, or a custom search), and matches are expanded into full tool definitions.
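To make the discovery flow concrete, here is a minimal, self-contained sketch of the idea behind on-demand lookup: a regex search over deferred tool definitions that expands only the matches. The tool names and search logic are illustrative, not the platform's internal implementation.

```python
import re

# Hypothetical deferred tool library: only name/description are searched;
# full definitions stay out of the model's context until a match is expanded.
DEFERRED_TOOLS = [
    {"name": "jira_create_issue", "description": "Create a Jira issue", "defer_loading": True},
    {"name": "jira_search_issues", "description": "Search Jira issues by JQL", "defer_loading": True},
    {"name": "slack_post_message", "description": "Post a message to a Slack channel", "defer_loading": True},
]

def search_tools(pattern: str) -> list[dict]:
    """Return full definitions for deferred tools whose name or
    description matches the regex -- the on-demand expansion step."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [t for t in DEFERRED_TOOLS
            if rx.search(t["name"]) or rx.search(t["description"])]

matches = search_tools(r"jira")
print([t["name"] for t in matches])  # -> ['jira_create_issue', 'jira_search_issues']
```

Only the expanded matches would then be injected as full tool definitions, which is why the context stays small even with a large library.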

Key outcomes and numbers from Anthropic’s tests:

  • Example context reduction from ~77K tokens (traditional) to ~8.7K tokens with Tool Search Tool.
  • Reported overall token usage reduction of ~85% in the tested scenario.
  • Accuracy improvements in MCP evaluations: Opus 4 rose from 49% → 74%; Opus 4.5 from 79.5% → 88.1%.

Implementation notes (as in the platform):

  • Mark tools for on-demand discovery with defer_loading: true.
  • Built-in search options include regex and BM25; embeddings or custom strategies are supported.
  • Prompt caching remains compatible because deferred tools are excluded from the initial prompt.
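Putting the notes above together, a request mixing core and deferred tools might be shaped roughly as follows. This is a hedged sketch: the model name and the search tool's `type` string are placeholders, and only the defer_loading field comes directly from the description above.

```python
# Sketch of a request payload using deferred loading; field values are
# illustrative, and the tool-search `type` string is a placeholder.
request = {
    "model": "claude-opus-4-5",  # model name is illustrative
    "max_tokens": 1024,
    "tools": [
        # The search tool itself is always visible to the model.
        {"type": "tool_search_tool_regex", "name": "tool_search"},  # placeholder type
        {
            "name": "get_time",
            "description": "Get the current time in a timezone",
            "input_schema": {"type": "object", "properties": {"tz": {"type": "string"}}},
            "defer_loading": False,  # core tool: stays in the initial prompt
        },
        {
            "name": "jira_create_issue",
            "description": "Create a Jira issue",
            "input_schema": {"type": "object", "properties": {"summary": {"type": "string"}}},
            "defer_loading": True,  # discovered via tool search when needed
        },
    ],
    "messages": [{"role": "user", "content": "File a Jira ticket about the outage"}],
}

deferred = [t["name"] for t in request["tools"] if t.get("defer_loading")]
print(deferred)  # only these stay out of the initial context
```

Because the deferred definitions never enter the initial prompt, the cached prefix stays stable, which is what keeps prompt caching compatible.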

Programmatic Tool Calling: orchestration via code

Programmatic Tool Calling (PTC) moves orchestration into a code execution environment, letting Claude generate Python that calls tools, handles loops, parallelizes requests, and filters results before anything is returned to the model. Intermediate tool responses are processed in the Code Execution environment; only the final script output enters Claude’s context.

Benefits and benchmark highlights:

  • Substantial token savings: average usage dropped from 43,588 → 27,297 tokens (≈37% reduction) on complex research tasks.
  • Reduced latency by eliminating multiple inference round-trips: a multi-step workflow can be executed in a single code run rather than many sequential prompts.
  • Improved accuracy: internal knowledge retrieval rose from 25.6% → 28.5%; GIA benchmarks improved from 46.5% → 51.2%.

Practical mechanics:

  • Tools opt into programmatic calling by referencing the Code Execution tool in allowed_callers.
  • Claude emits code (Python) that runs in a sandboxed Code Execution tool. When that code calls a tool, the server receives tool_use requests with a caller field; results are returned to the code runtime rather than being injected into Claude’s context.
  • Only the code_execution_tool_result (final stdout/content) is returned to the model.
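A sketch of what the opt-in might look like on a tool definition. The allowed_callers field is the one named above; the identifier string for the Code Execution tool is a placeholder, not a verbatim platform value.

```python
# Tool definition opting into programmatic calling. The allowed_callers
# entry is a placeholder identifier -- consult the docs for the exact
# Code Execution tool name.
expense_tool = {
    "name": "get_expenses",
    "description": "Return expense line items for an employee and quarter",
    "input_schema": {
        "type": "object",
        "properties": {
            "employee_id": {"type": "string"},
            "quarter": {"type": "string"},
        },
        "required": ["employee_id", "quarter"],
    },
    # Opt-in: this tool may be invoked from Claude-generated code rather
    # than directly by the model.
    "allowed_callers": ["code_execution"],
}
```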

A concrete example in the source shows a budget compliance workflow where 2,000+ expense line items remain outside the model’s context; Claude’s context receives only the final list of employees who exceeded budgets.
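The kind of orchestration code Claude might emit for that workflow can be sketched as below. The fetch function is a local stub standing in for the real tool call; in the actual runtime, tool results flow into this code inside the sandbox, not into the model's context.

```python
# Stub standing in for a programmatic tool call; in the Code Execution
# environment this would invoke the real expense tool.
def get_expenses(employee_id: str) -> list[dict]:
    data = {
        "alice": [{"amount": 1200}, {"amount": 900}],
        "bob":   [{"amount": 300}, {"amount": 150}],
    }
    return data[employee_id]

BUDGET = 2000

# The (potentially thousands of) line items are aggregated here, in the
# sandbox; only the final summary below would re-enter Claude's context.
over_budget = []
for emp in ["alice", "bob"]:
    total = sum(item["amount"] for item in get_expenses(emp))
    if total > BUDGET:
        over_budget.append(emp)

print(over_budget)  # -> ['alice']
```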

Tool Use Examples: teaching usage patterns beyond schemas

JSON Schema captures structure but not conventions: date formats, ID patterns, which optional fields are commonly populated together. Tool Use Examples embed concrete input examples in tool definitions so Claude learns format conventions, nested structure patterns, and parameter correlations.

Anthropic reports accuracy improvement on complex parameter handling from 72% → 90% after adding input examples. Guidance for examples includes using realistic data, mixing minimal and full examples, and keeping the set concise (1–5 examples per tool).

Key fields: include input_examples in tool definitions alongside the input_schema so examples travel with the tool metadata.
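A sketch of a tool definition carrying such examples, assuming the field shapes described above; the tool itself and its example data are made up for illustration.

```python
# Hypothetical tool definition with input_examples alongside input_schema.
create_event = {
    "name": "create_event",
    "description": "Create a calendar event",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "date": {"type": "string"},  # convention to teach: YYYY-MM-DD
            "attendees": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "date"],
    },
    "input_examples": [
        # Minimal example: required fields only.
        {"title": "1:1 with Sam", "date": "2025-03-14"},
        # Fuller example: shows the date convention and how optional
        # fields are populated together.
        {"title": "Q2 planning", "date": "2025-04-02",
         "attendees": ["sam@example.com", "ada@example.com"]},
    ],
}

# Sanity check: every example uses only declared parameters.
props = set(create_event["input_schema"]["properties"])
assert all(set(ex) <= props for ex in create_event["input_examples"])
```

Mixing a minimal and a full example, as here, follows the 1–5 example guidance while still demonstrating both the format conventions and the parameter correlations.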

When to adopt which feature

Anthropic recommends treating these as targeted mitigations rather than mandatory layers:

  • Use Tool Search Tool when tool definitions consume >10K tokens, when tool selection errors occur, or for multi-server MCP systems.
  • Use Programmatic Tool Calling for large datasets, multi-step workflows, parallelizable operations, and when intermediate data should not shape model reasoning.
  • Use Tool Use Examples when schemas alone leave ambiguity around conventions or correct parameter combinations.

The features are complementary: Tool Search Tool finds the right tools; Programmatic Tool Calling executes efficiently and hides intermediate results; Tool Use Examples reduce parameter errors.

Getting started and resources

These capabilities are available in beta. The platform example shows adding the beta flag and the relevant tools (Tool Search Tool, Code Execution) to the API call. Documentation and cookbooks are linked from the original article:

Original article: https://www.anthropic.com/engineering/advanced-tool-use?
