Addy Osmani’s latest post on long-running agents looks at where AI agents appear to be heading once a single chat window is no longer enough. The piece argues that the next phase is less about flashier tools and more about systems that can keep moving on a task across multiple sessions, sandboxes, and time spans without losing track of what happened before.
The article separates that idea into distinct categories, including long-horizon reasoning, long-running execution, and persistent agency. It also walks through why today's agents still hit familiar limits: context windows fill up, state disappears between sessions, and models tend to sound more certain about completion than they should.
From there, the post surveys how several major teams are approaching the problem. Anthropic’s harnesses, Cursor’s planner-worker-judge setup, and Google’s Agent Platform each take a slightly different path, but they seem to converge on the same broad structure: keep the model loop separate from the execution sandbox, add durable session history, and make verification part of the workflow rather than an afterthought.
The more practical part of the piece focuses on patterns that make these systems workable in production, such as checkpointing, human approval gates, layered memory, ambient processing, and coordinating multiple agents. Osmani also flags the trade-offs that still look unresolved, including cost, security, drift, and the difficulty of writing specs that survive a long autonomous run.
For readers following the shift from short-lived bots to agents that can keep working overnight or longer, the full post is worth a look. It pulls together the current thinking from practitioners and platform teams into one fairly compact overview.
Source: Addy Osmani