Anthropic’s Boris Cherny shares tips for long-running Claude Opus tasks

Anthropic’s Boris Cherny says Claude Opus is especially strong at multi-hour, even multi-day, autonomous work. He shares five tactics, ranging from auto permissions and dynamic agent workflows to browser-based self-verification, along with notes on ROI, token usage, and caveats.


TL;DR

  • Claude Opus 4.8 for multi-hour tasks: Thread claims strong fit for long-running work; SWE-Marathon graphic shows 26% top visible score
  • Five practices for long runs: Auto permissions; dynamic workflows; “/goal” and “/loop”; Claude Code in cloud; end-to-end self-verification
  • Web E2E testing preference: “Claude in Chrome” favored over Playwright/Chromium MCP for power and token efficiency
  • Self-verification emphasis: Dynamic workflows plus browser testing prompts to catch edge cases and UI issues
  • Cost/operations notes: Evaluate ROI vs absolute cost; use /usage to identify token-heavy skills, MCPs, plugins
  • Examples and reports: 19-hour run verified ~300 flows; uses include migrations, profiling, flaky CI detection; screenshot shows 550 flows, 14 bugs filed

Boris Cherny’s X thread argues that Claude Opus is well suited to long-running work. The Anthropic staffer offers a set of tips for keeping the model busy for hours or days and points to a benchmark graphic that appears to favor Opus 4.8 on multi-hour software tasks.

Cherny lists five practices for extended autonomous runs: enable auto mode for permissions so Claude does not keep asking for approval; use dynamic workflows to orchestrate hundreds or thousands of agents; nudge the model with commands such as “/goal” or “/loop”; run Claude Code in the cloud; and give the model a way to self-verify its work end to end.

For web work, Cherny states that “Claude in Chrome” is preferable to Playwright or Chromium MCP for E2E testing, calling it “more powerful and more token-efficient.” In another reply, he adds that the most important ingredient is “self-verification” paired with dynamic workflows, with prompts aimed at testing results in a browser and looking for “edge cases and ui issues.”
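The self-verification idea can be illustrated with a minimal sketch of a work-then-check loop. This is not how Claude Code or “Claude in Chrome” is implemented; every name here (`VerificationResult`, `run_until_verified`, `do_work`, `verify`) is hypothetical, and the `verify` callback merely stands in for whatever end-to-end check, such as a browser test, the agent has been given.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class VerificationResult:
    """Outcome of one end-to-end check (hypothetical structure)."""
    passed: bool
    issues: List[str] = field(default_factory=list)


def run_until_verified(
    do_work: Callable[[List[str]], str],
    verify: Callable[[str], VerificationResult],
    max_attempts: int = 5,
) -> Tuple[str, int]:
    """Repeat work until the verifier passes or attempts run out.

    do_work receives the issues found by the previous check, so each
    attempt can address them. Returns the accepted artifact and the
    number of attempts used.
    """
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        artifact = do_work(feedback)
        result = verify(artifact)
        if result.passed:
            return artifact, attempt
        # Feed the failures (e.g. edge cases, UI issues) into the next attempt.
        feedback = result.issues
    raise RuntimeError(f"not verified after {max_attempts} attempts: {feedback}")
```

The point of the design is that the verifier, not a human, decides when the run is finished, which is what makes clear acceptance criteria so useful for long unattended sessions.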

The thread also includes a few caveats from other commenters. One user notes that such workflows seem more manageable when acceptance criteria are clear, while another asks about costs on enterprise accounts. Cherny responds that he thinks about the problem in terms of ROI rather than absolute cost, arguing that the same manual work can amount to “weeks or even months of engineering time.”
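The ROI framing amounts to simple arithmetic, sketched below. The thread gives no figures, so the function name `roi_ratio` and all of its parameters are assumptions for illustration; the inputs would have to come from a team’s own estimates.

```python
def roi_ratio(engineer_hours_saved: float, hourly_rate: float, run_cost: float) -> float:
    """Value of engineering time saved divided by the cost of the run.

    All inputs are the caller's own estimates (hypothetical parameters,
    not figures from the thread). A ratio above 1.0 means the run cost
    less than the manual work it replaced.
    """
    if run_cost <= 0:
        raise ValueError("run_cost must be positive")
    return (engineer_hours_saved * hourly_rate) / run_cost
```

Under this framing, a run that looks expensive in absolute terms can still be worthwhile if it replaces weeks of manual effort.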

He also dismisses the idea that the command set needs to be manually driven, writing that those controls are “not designed for people to invoke them,” and that the model should be told what needs to happen so it can invoke the right skills itself. Later, when asked about mobile workflows, he replies simply: “Just tell claude to use a workflow.”

Cherny also tells one commenter to run “/usage” to see a breakdown of the specific skills, MCPs, and plugins consuming tokens. When asked about long sessions and context issues, he states that “Context rot isn’t a thing with 4.8 imo,” though that remains his view rather than an independently verified conclusion.

Source: X post by Boris Cherny
