A long Twitter thread from sysls breaks down what tends to go wrong in long-running autonomous coding workflows, and why a “harness” (an orchestration layer around agents) increasingly looks like table stakes rather than a nice-to-have. The post, How To Solve Problems Of Long Running, Autonomous Agentic Engineering Workflows, isn’t a generic prompt list; it’s a phase-by-phase inventory of the failure modes that show up once tasks span multiple sessions and millions of tokens.
A taxonomy of agent failure modes (and where they happen)
The thread organizes issues across the lifecycle of an agentic task:
- Pre-task context gaps: Agents can start work without resolving missing or contradictory information, then propagate wrong assumptions.
- Planning misfires: Even as outright model “stupidity” becomes rarer, misalignment (choosing the wrong “attack vector” because requirements were misread) still drives incorrect implementations. sysls emphasizes giving the agent broad repo coverage before planning and reducing contradictory repo state.
- Execution under context pressure: The thread calls out context exhaustion (“context anxiety”) as a primary driver of sloppy outcomes in long sessions, and suggests session handoffs as a form of compaction, done with repo-aware fidelity rather than provider-native summarization.
- Plan drift (“planning stickiness”): A frequent pattern is doing A’ (an approximation) instead of A, with downstream code then quietly depending on the wrong behavior.
- Complexity fear: Faced with large tasks, agents may dodge work by producing stubs, narrowing scope, or exiting prematurely; the countermeasure is to aggressively decompose work into small, non-daunting units.
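The decomposition idea in the last bullet can be sketched as a simple recursive splitter: keep breaking a task until every leaf falls below a “daunting” size threshold. Everything here (the `Task` shape, the threshold, the naive halving strategy) is an illustrative assumption, not an API from the thread; a real harness would ask a planner agent for semantically meaningful sub-tasks rather than splitting arithmetically.

```python
from dataclasses import dataclass, field

# Illustrative: a task with a rough size estimate (e.g. expected diff lines
# or tokens of work). Names and thresholds are assumptions.
@dataclass
class Task:
    title: str
    estimated_size: int
    subtasks: list = field(default_factory=list)

DAUNTING_THRESHOLD = 100  # above this, an agent is assumed to dodge or stub

def decompose(task: Task) -> list[Task]:
    """Recursively split a task until every leaf is non-daunting."""
    if task.estimated_size <= DAUNTING_THRESHOLD:
        return [task]
    # Naive halving; a planner agent would propose meaningful splits instead.
    half = task.estimated_size // 2
    left = Task(f"{task.title} (part 1)", half)
    right = Task(f"{task.title} (part 2)", task.estimated_size - half)
    return decompose(left) + decompose(right)

leaves = decompose(Task("migrate auth module", 350))
print([t.estimated_size for t in leaves])  # every leaf is <= 100
```

The point isn’t the splitting heuristic; it’s the invariant that no unit of work handed to an agent is large enough to trigger scope-narrowing or premature exit.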
Post-task is where things quietly rot
Two post-task items feel especially relevant to real-world repos: verification laziness (tests that “pass” but validate the wrong thing) and entropy maximization (code changes without updating docs, removing dead code, or resolving contradictions). The proposed fix: allocate fresh-context agents specifically for verification and cleanup.
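The fresh-context idea can be made concrete with a small sketch: the verifier agent is assembled from only the task contract and the resulting diff, never the implementer’s conversation history, so it cannot inherit the implementer’s wrong assumptions. The message schema and function name below are hypothetical, not from the thread or any specific SDK.

```python
# Sketch of fresh-context verification. The verifier's context is built
# from scratch; the implementer's (possibly huge, possibly wrong) history
# is deliberately excluded.

def build_verifier_context(contract: str, diff: str) -> list[dict]:
    """Assemble a clean message list for a verification agent."""
    return [
        {"role": "system",
         "content": "You verify code against a contract. "
                    "Report mismatches; do not fix them."},
        {"role": "user",
         "content": f"Contract:\n{contract}\n\nDiff:\n{diff}"},
    ]

implementer_history = ["...millions of tokens of back-and-forth..."]
ctx = build_verifier_context(
    contract="parse_date() must reject two-digit years",
    diff="+ def parse_date(s): ...",
)
# The implementer's history never reaches the verifier.
assert not any("back-and-forth" in m["content"] for m in ctx)
```

The same pattern applies to the cleanup agent: seed it with the repo’s hygiene rubric and the diff, not the session that produced the mess.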
Why sysls argues for a custom harness
A key claim is that native tooling (Claude Code, Codex, etc.) offers limited leverage over these behaviors, and that stuffing orchestration context into the same model that writes the code bloats its context window. The alternative: a harness that spawns specialized agents for contracts, independent verification, complexity classification, and repo hygiene, backed by telemetry and rubrics. sysls also briefly mentions an “OpenForage Harness” intended for future open-sourcing once it’s “public-use” ready.
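Structurally, such a harness is a dispatcher: each lifecycle phase is routed to a specialized agent started with a fresh context. The role names and phase ordering below are loosely paraphrased from the thread; the `run_phase` stub stands in for an actual LLM call and is purely illustrative.

```python
# Minimal sketch of a harness routing lifecycle phases to specialized
# agents. Role names and the phase list are assumptions, not a real API.

ROLES = {
    "contract": "pin down requirements and resolve contradictions",
    "classify": "estimate task complexity before planning",
    "implement": "write the code for one small sub-task",
    "verify": "check the diff against the contract with fresh context",
    "hygiene": "update docs, delete dead code, reduce repo entropy",
}

def run_phase(role: str, payload: str) -> str:
    """Stand-in for spawning an agent; a real harness calls a model here."""
    return f"[{role}] {ROLES[role]}: {payload}"

def run_task(task: str) -> list[str]:
    """Drive one task through every phase, each with a clean slate."""
    return [run_phase(role, task)
            for role in ("contract", "classify", "implement",
                         "verify", "hygiene")]

for line in run_task("add rate limiting to the API gateway"):
    print(line)
```

The design choice worth noting is separation of budgets: the coding agent never pays context tokens for orchestration state, because that state lives in the harness, not the conversation.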
Replies add texture: Morgan notes the constant need to check whether agents are still working; Leo highlights checkpointing/state recovery; kai frames the whole thing as distributed-systems coordination; and others point out that Codex now has hooks.
Original source: https://x.com/systematicls/status/2038241033755168959?s=12&