Why long-running coding agents need a harness, not prompts

In a long Twitter thread, sysls maps the failure modes that hit autonomous coding once tasks stretch across sessions and huge contexts. The takeaway: you’ll likely need a dedicated “harness” to orchestrate planning, verification, and cleanup. [https://x.com/systematicls/status/2038241033755168959?……


TL;DR

  • Lifecycle failure taxonomy: Pre-task context gaps, planning misfires, execution context pressure, plan drift, complexity fear
  • Misalignment over “stupidity”: Wrong attack vectors from misread requirements; improve repo coverage and reduce contradictory state
  • Context exhaustion: “Context anxiety” degrades long sessions; use repo-aware session handoffs instead of provider summarization
  • Plan drift (“planning stickiness”): A’ shipped instead of A; downstream code quietly depends on incorrect behavior
  • Post-task rot: Verification laziness and entropy maximization; assign fresh-context agents for verification and cleanup
  • Custom harness rationale: Orchestrate specialized agents (contracts, independent verification, complexity classification, repo hygiene) with telemetry/rubrics

A long Twitter thread from sysls breaks down what tends to go wrong in long-running autonomous coding workflows, and why a “harness” (an orchestration layer around agents) increasingly looks like table stakes rather than a nice-to-have. The post, “How To Solve Problems Of Long Running, Autonomous Agentic Engineering Workflows,” isn’t a generic prompt list; it’s a phase-by-phase inventory of failure modes that show up once tasks span multiple sessions and millions of tokens.

A taxonomy of agent failure modes (and where they happen)

The thread organizes issues across the lifecycle of an agentic task:

  • Pre-task context gaps: Agents can start work without resolving missing or contradictory information, then propagate wrong assumptions.
  • Planning misfires: Even as outright “stupidity” becomes less common, misalignment (choosing an incorrect “attack vector” because requirements were misread) still drives incorrect implementations. sysls emphasizes giving the agent broad repo coverage before it plans and reducing contradictory repo state.
  • Execution under context pressure: The thread calls out context exhaustion (“context anxiety”) as a primary driver of sloppy outcomes in long sessions, and suggests session handoffs as a form of compaction—done with repo-aware fidelity rather than provider-native summarization.
  • Plan drift (“planning stickiness”): A frequent pattern is doing A’ (an approximation) instead of A, with downstream code then quietly depending on the wrong behavior.
  • Complexity fear: For large tasks, agents may dodge work by producing stubs, narrowing scope, or exiting prematurely. The countermeasure is to decompose work aggressively into small, non-daunting units.
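
To make the session-handoff idea concrete, here is a minimal sketch of what a repo-aware handoff record could look like. It is not taken from the thread: the `Handoff` class and all of its field names are hypothetical, and the only assumption is that a harness writes such a record when one session ends and feeds it to the next session as its opening message.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Handoff:
    """Hypothetical handoff record written at the end of one session."""
    task: str                                                 # original requirement, verbatim
    done: list[str] = field(default_factory=list)             # completed steps
    remaining: list[str] = field(default_factory=list)        # steps still open
    touched_files: list[str] = field(default_factory=list)    # repo paths changed so far
    open_questions: list[str] = field(default_factory=list)   # unresolved contradictions

    def to_json(self) -> str:
        """Serialize the record so it can be stored alongside the repo."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Handoff":
        """Rebuild the record at the start of the next session."""
        return cls(**json.loads(text))

    def as_prompt(self) -> str:
        """Render the record as the opening message of a fresh session."""
        sections = [
            ("Done", self.done),
            ("Remaining", self.remaining),
            ("Files touched", self.touched_files),
            ("Open questions", self.open_questions),
        ]
        lines = [f"Task: {self.task}"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in (items or ["(none)"]))
        return "\n".join(lines)
```

The point of the explicit fields is that the next session starts from the original task and concrete repo paths rather than from a lossy free-text summary of the previous conversation.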

Post-task is where things quietly rot

Two post-task items feel especially relevant to real-world repos: verification laziness (tests that “pass” but validate the wrong thing) and entropy maximization (code changes without updating docs, removing dead code, or resolving contradictions). The proposed fix: allocate fresh-context agents specifically for verification and cleanup.
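
One way to read “fresh-context” in practice: the verifier sees the contract and the diff, but not the implementer’s reasoning. The sketch below illustrates that separation under stated assumptions; `Session` and `verifier_context` are hypothetical names, not part of any tool mentioned in the thread.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical record of what the implementing agent produced."""
    contract: str          # what was asked for (the A, not the A')
    diff: str              # the code change under review
    transcript: list[str]  # the implementer's reasoning and chat history


def verifier_context(session: Session) -> list[str]:
    """Build the messages for a fresh verifier: contract and diff only.

    The implementer's transcript is left out on purpose, so the verifier
    cannot inherit the implementer's assumptions.
    """
    return [
        "Check whether this diff satisfies the contract. "
        "Flag tests that pass without exercising the contract.",
        f"CONTRACT:\n{session.contract}",
        f"DIFF:\n{session.diff}",
    ]
```

A cleanup agent could be fed the same way, with the diff plus the docs and files it touches, so that stale documentation and dead code are judged against the repo rather than against the implementer’s narrative.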

Why sysls argues for a custom harness

A key claim is that native tooling (Claude Code, Codex, etc.) offers limited leverage for these behaviors, and that loading orchestration duties onto the same model that does the coding bloats its context. The alternative: a harness that spawns specialized agents for contracts, independent verification, complexity classification, and repo hygiene, backed by telemetry and rubrics. sysls also briefly mentions an “OpenForage Harness” intended for future open-sourcing once it’s “public-use” ready.
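
As a rough illustration of the shape such a harness might take, the sketch below runs one specialized agent per role and records basic telemetry for each call. It is an assumption-laden toy, not the OpenForage Harness or any real tool’s API: `Harness`, `Agent`, and the role names are hypothetical, and the stub lambdas stand in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable

# An "agent" here is any callable from a prompt to a text result. In a real
# harness this would wrap a model call; that wrapper is not shown.
Agent = Callable[[str], str]


@dataclass
class Harness:
    """Hypothetical orchestrator: one specialized agent per role."""
    agents: dict[str, Agent]
    telemetry: list[dict] = field(default_factory=list)

    def run(self, task: str, roles: list[str]) -> dict[str, str]:
        results: dict[str, str] = {}
        for role in roles:
            # Each role gets only the task plus earlier outputs,
            # never another agent's full conversation.
            prior = "\n".join(f"[{r}] {out}" for r, out in results.items())
            prompt = f"{task}\n{prior}".strip()
            output = self.agents[role](prompt)
            results[role] = output
            self.telemetry.append(
                {"role": role, "prompt_chars": len(prompt), "output_chars": len(output)}
            )
        return results


# Usage with stub agents standing in for model calls.
harness = Harness(agents={
    "contract": lambda prompt: "must return 404 for unknown users",
    "implement": lambda prompt: "patched handlers/users.py",
    "verify": lambda prompt: "PASS" if "404" in prompt else "FAIL",
})
results = harness.run("Fix user lookup", ["contract", "implement", "verify"])
```

Keeping the orchestration loop outside any single agent is the design choice that matters here: the coding agent’s context holds only its own task, while the telemetry list gives the harness something to score against rubrics later.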

Replies add texture: Morgan notes the constant need to check whether agents are still working; Leo highlights checkpointing/state recovery; kai frames the whole thing as distributed-systems coordination; and others point out that Codex now has hooks.

Original source: https://x.com/systematicls/status/2038241033755168959?s=12&
