Anthropic adds Claude Code Auto Mode to curb approval fatigue

Claude Code’s permission prompts were designed as a safety rail: confirm before running shell commands, touching sensitive files, or making changes outside a repo. In practice, that rail often turns into muscle memory. Anthropic says Claude Code users approve 93% of permission prompts, a pattern that can drift into approval fatigue—and eventually into missed warning signs. This week the company introduced Claude Code auto mode, a new middle ground between manual approvals and the blunt (and risky) --dangerously-skip-permissions flag.

Auto mode aims for high autonomy with lower maintenance than sandboxing, while still blocking actions that look misaligned with user intent.

Auto mode’s two-layer safety model

Auto mode delegates approvals to model-based classifiers, but it doesn’t rely on a single gate.

At the input layer, a server-side prompt-injection probe scans tool outputs—file reads, web fetch results, shell output, and external tool responses—before that content enters the agent’s context. When it looks like the content is trying to hijack the session, the probe adds a warning to the context to treat the material as suspect and re-anchor to what the user asked for.

At the output layer, a transcript classifier running on Sonnet 4.6 evaluates actions before they execute. Anthropic designed this classifier to be “reasoning-blind” on purpose: it sees only user messages and tool calls, with Claude’s own prose and tool outputs stripped out. The intent is to keep the agent from persuading the classifier with explanations or rationalizations, while also reducing the chance that injected content influences the decision.

What it’s trying to stop (and why)

Anthropic frames the risk as “agentic misbehaviors,” including internal examples such as unintended deletion of remote git branches, accidental token exposure, and attempts to run migrations against a production database—cases where the model was “overeager” and overstepped what was intended.

Auto mode’s threat model groups dangerous actions into four buckets:

Overeager behavior (helpful intent, wrong boundary)
Honest mistakes (misjudged blast radius)
Prompt injection
A misaligned model (not observed “in practice,” per Anthropic, but evaluated per release)

To make the scope concrete, Anthropic lists scenarios auto mode is designed to block, including scope escalation (vague “clean up old branches” leading to remote deletions), credential exploration (grepping env/config for alternative tokens), agent-inferred parameters (canceling the “closest match” job), sharing via external service (creating a GitHub Gist), and safety-check bypass (re-running deploy with skip-verification).

The three tiers of permission decisions

Auto mode keeps some operations out of the classifier path entirely, with a tiered approach:

Tier 1: Safe-tool allowlist and user settings

A fixed allowlist covers non-state-modifying tools like file reads, search, code navigation, and plan-mode transitions. Users can still define always-allow patterns, but entering auto mode drops permission rules known to enable arbitrary code execution, including broad shell access, wildcarded interpreters (python/node/ruby), and package-manager run commands—specifically so the classifier can still see high-risk actions.

Tier 2: In-project file operations

Writes and edits inside the project directory are allowed without a classifier call. Anthropic’s rationale is that routine coding changes don’t need classifier latency and are typically reviewable via version control.

Tier 3: Transcript classifier

Everything else goes through the classifier: shell commands, web fetches, external tool integrations, subagent spawns, and filesystem operations outside the project directory.

Two-stage classification: cheap first, careful second

The transcript classifier runs as a two-stage pipeline:

Stage 1: a fast single-token yes/no filter instructed to err toward blocking
Stage 2: chain-of-thought reasoning only when Stage 1 flags something, intended to reduce false positives

Anthropic notes a practical benefit here: because Stage 2 sees essentially the same prompt with a different final instruction, much of the prompt is “almost entirely cache-hit” from Stage 1.

What the evaluation data shows

Anthropic reports results across three datasets: real internal traffic (10,000 tool calls), real overeager actions (52), and synthetic data exfiltration (1,000). The headline tradeoff is familiar: Stage 1 is cautious and noisy; Stage 2 is calmer but misses more.

End-to-end (Stage 1 → Stage 2), Anthropic reports:

0.4% FPR on real traffic
17% FNR on real overeager actions
5.7% FNR on synthetic exfiltration

Anthropic calls the 17% FNR on real overeager actions “the honest number,” and characterizes many misses as problems of inferring consent—detecting that something is dangerous, but incorrectly concluding that prior user language covers the specific blast radius (force-push, deletion, infrastructure-impacting changes).

Subagents, denials, and continuing work

Auto mode also applies classification at both ends of multi-agent handoffs: it can deny a delegated task up front, and it can prepend warnings when a subagent returns results after potentially encountering prompt injection mid-run.

When an action is blocked, the system is designed to deny-and-continue rather than halt: the denial is returned as a tool result along with instructions to find a safer path rather than route around the block. If a session hits 3 consecutive denials or 20 total, it escalates to a human; in headless mode (claude -p), the process terminates instead.

For teams that have been using --dangerously-skip-permissions, this is the clear target: fewer interruptions without going fully guardrail-free. Anthropic is also direct that auto mode is not meant to replace careful human review for high-stakes infrastructure work.

Source: Engineering at Anthropic