A small but telling case study in agent design landed on X this week: Viv Trivedy at LangChain described how changing only the harness—not the underlying model—pushed a coding agent from 52.8% to 66.5% on Terminal Bench 2.0, moving from around Top 30 to Top 5 on the leaderboard. The thread frames this work as “harness engineering”: the practical craft of turning a model’s “spiky intelligence” into something that reliably ships tasks under real constraints.
The agent in question is LangChain’s deepagents-cli, evaluated on Terminal Bench 2.0—an 89-task benchmark spanning areas like machine learning, debugging, and biology. Runs were orchestrated with Harbor using sandbox environments via Daytona, and each action was captured in LangSmith for observability.
Harness engineering, narrowed to three knobs
Rather than treat the harness as an unbounded pile of prompt tweaks and tool swapping, the approach deliberately constrained the optimization space to three levers:
- System prompt
- Tools
- Middleware (hooks around model and tool calls in the agent loop, via LangChain’s middleware)
That focus matters because agent harnesses can otherwise balloon into a brittle tangle of “clever” fixes that are hard to reason about—especially when chasing benchmark points.
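In spirit, the three knobs reduce to a small, inspectable surface. A minimal sketch (names and shapes here are illustrative, not LangChain's actual API) might look like:

```python
# Illustrative sketch of the three harness knobs: a system prompt, a tool
# registry, and middleware hooks around model and tool calls. The hook
# names and signatures are hypothetical, not LangChain's real middleware API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    system_prompt: str
    tools: dict[str, Callable[[str], str]]
    # Hooks that run before each model call (e.g. to inject context).
    before_model: list[Callable[[list], list]] = field(default_factory=list)
    # Hooks that run after each tool call (e.g. to track edits).
    after_tool: list[Callable[[str, str], str]] = field(default_factory=list)

    def prepare(self, messages: list) -> list:
        """Apply before-model middleware to the message list."""
        for hook in self.before_model:
            messages = hook(messages)
        return messages

    def run_tool(self, name: str, arg: str) -> str:
        """Run a tool, letting after-tool middleware see every result."""
        result = self.tools[name](arg)
        for hook in self.after_tool:
            result = hook(name, result)
        return result
```

Keeping the surface this small is what makes the later middleware patterns (checklists, loop detection) composable rather than entangled.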
Tracing as the outer loop: a Trace Analyzer Skill
The biggest meta-move here is treating failure analysis as a repeatable artifact instead of an artisanal debugging session. The team built a Trace Analyzer Skill that:
- Pulls experiment traces from LangSmith
- Spawns parallel error-analysis agents, then synthesizes findings and suggestions
- Aggregates feedback into targeted harness edits
The intent is similar to boosting: focus each iteration on prior mistakes. The thread also notes that a human can optionally sanity-check changes to avoid overfitting to a particular task, which can cause regressions elsewhere.
For agent developers, the key point is that traces become a feedback signal, not just a postmortem. LangSmith stores not only content but also latency, token counts, and costs; those details surfaced in replies when readers asked how context usage is tracked.
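The fan-out/aggregate shape of the skill is simple to sketch. The sketch below is a stand-in only: the real Trace Analyzer Skill pulls traces from LangSmith and spawns LLM-based analysis agents, whereas the `analyze_trace` heuristics and field names here are invented for illustration.

```python
# Hedged sketch of the trace-analysis outer loop: analyze failures in
# parallel, then aggregate findings into a prioritized list of harness
# edits. Trace fields and failure tags are hypothetical.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def analyze_trace(trace: dict) -> list[str]:
    """Stand-in for a per-trace error-analysis agent: tag failure patterns."""
    findings = []
    if trace.get("exited_without_tests"):
        findings.append("no-verification")
    if trace.get("edits_to_same_file", 0) > 5:
        findings.append("doom-loop")
    return findings

def aggregate(traces: list[dict]) -> Counter:
    """Fan out analysis across traces, then count recurring failure modes."""
    with ThreadPoolExecutor() as pool:
        all_findings = pool.map(analyze_trace, traces)
    return Counter(f for findings in all_findings for f in findings)
```

The counts then rank which harness edit (checklist middleware, context injection, loop detection) is likely to pay off most, which is the boosting-like part of the loop.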
The practical changes that moved the score
The thread attributes the bulk of the lift to a handful of harness patterns that show up repeatedly in real-world “async” coding agents.
Self-verification, enforced instead of implied
One common failure mode: an agent writes code, rereads it, decides it “looks right,” and exits. The harness pushes the agent into an explicit loop:
Planning & Discovery → Build → Verify → Fix
“Verify” is framed as running tests and comparing results against the task spec, not the agent’s own code. To prevent premature exits, the harness introduced a PreCompletionChecklistMiddleware that intercepts the agent before it stops and reminds it to do a verification pass—described as similar in spirit to a forced-loop pattern like the “Ralph Wiggum Loop”.
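A standalone approximation of that intercept-before-stop behavior (the real PreCompletionChecklistMiddleware hooks LangChain's agent loop; this version only mimics the control flow, and the state fields are assumptions):

```python
# Minimal sketch of a pre-completion checklist: intercept the agent's
# attempt to finish and return a reminder until verification has run,
# with a budget so the nudge cannot itself become a loop.
class PreCompletionChecklist:
    def __init__(self, max_nudges: int = 1):
        self.nudges = 0
        self.max_nudges = max_nudges

    def intercept(self, state: dict):
        """Return a reminder instead of allowing the stop, or None to allow it."""
        if state.get("verified") or self.nudges >= self.max_nudges:
            return None  # agent may finish
        self.nudges += 1
        return ("Before finishing: run the tests and compare the results "
                "against the task spec, not against your own code.")
```

The nudge budget matters: without it, a forced-loop pattern can keep an agent verifying forever, trading one failure mode for another.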
Context injection about the environment (not just the task)
Terminal Bench tasks come with directory layouts, tooling, and strict timeouts. The harness injects context up front using a LocalContextMiddleware that maps the working directory and related paths and probes for available tools (for example, Python installations). This reduces the agent’s need to “discover” basics through error-prone command fumbling.
The system prompt also emphasizes that work will be evaluated by programmatic tests and that details like file paths in the spec must be followed precisely—an attempt to reduce the kind of drift that makes solutions look plausible but fail scoring.
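The injected context could be as simple as a summary string prepended to the conversation. A rough sketch of the idea (the real LocalContextMiddleware is a LangChain middleware; the function below and its probe list are illustrative):

```python
# Hedged sketch of environment context injection: map the working
# directory and probe for available tools up front, so the agent does
# not burn steps rediscovering basics through failed commands.
import os
import shutil

def environment_context(root: str, probes=("python3", "git", "make")) -> str:
    """Build a one-shot environment summary for prompt injection."""
    entries = sorted(os.listdir(root))[:50]  # cap to keep the prompt small
    found = [p for p in probes if shutil.which(p)]  # tools on PATH
    return (
        f"Working directory: {root}\n"
        f"Top-level entries: {', '.join(entries)}\n"
        f"Available tools: {', '.join(found) or 'none detected'}"
    )
```

Injecting this once, before the first model call, is cheaper than letting the agent run `ls` and `which` itself under a strict timeout.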
Doom loop detection via middleware hooks
Agents can lock into a plan and repeatedly make small edits to the same file without changing approach. To interrupt that, the harness uses LoopDetectionMiddleware to track per-file edit counts and inject a message nudging the model to reconsider once edits exceed a threshold.
This is explicitly described as a heuristic guardrail for current model quirks, with an expectation that future models may need it less. Still, multiple replies echoed the same pattern: tracking repeated edits and inserting a “step back” prompt can cut wasted steps and token spend.
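The per-file counting is straightforward to sketch. As above, this is an approximation of the pattern, not the real LoopDetectionMiddleware, and the threshold and message are invented:

```python
# Minimal sketch of doom-loop detection: count edits per file and inject
# a "step back" nudge once the same file crosses a threshold.
from collections import Counter

class LoopDetector:
    def __init__(self, threshold: int = 4):
        self.edits = Counter()
        self.threshold = threshold

    def on_edit(self, path: str):
        """Call on each file-edit tool call; returns a nudge at the threshold."""
        self.edits[path] += 1
        if self.edits[path] == self.threshold:
            return (f"You have edited {path} {self.threshold} times without "
                    "changing approach. Step back and reconsider the plan "
                    "before editing this file again.")
        return None
```

Firing the nudge exactly once at the threshold (rather than on every subsequent edit) avoids drowning the context in repeated warnings.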
Tuning reasoning budget: the “sandwich”
With Terminal Bench time limits, reasoning depth becomes a tradeoff: more reasoning can improve planning and verification but can also cause timeouts. The model used—kept fixed throughout—is gpt-5.2-codex, which offers reasoning modes (low, medium, high, xhigh).
The harness settled on an xhigh–high–xhigh “reasoning sandwich”: spend more compute upfront to plan, keep the middle cheaper during implementation, then spend more again during verification. In reported trials, running everything at xhigh scored worse due to timeouts (53.9%) versus high (63.6%), and the harness iteration ultimately landed at 66.5%.
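As configuration, the sandwich is just a phase-to-effort mapping. The phase labels and mode names below follow the thread; the selection function itself is an illustrative assumption:

```python
# Sketch of phase-dependent reasoning effort (the "sandwich"): spend
# compute at the ends of the task, stay cheaper in the middle to avoid
# benchmark timeouts.
REASONING_BY_PHASE = {
    "planning": "xhigh",  # more compute up front to plan
    "build": "high",      # cheaper during implementation
    "verify": "xhigh",    # spend again when checking against the spec
}

def reasoning_effort(phase: str) -> str:
    """Pick the reasoning mode for the current phase; default to 'high'."""
    return REASONING_BY_PHASE.get(phase, "high")
```

The reported numbers make the tradeoff concrete: uniform xhigh scored 53.9% because of timeouts, uniform high 63.6%, and the sandwich plus the other harness changes reached 66.5%.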
General principles—and open questions
The thread’s closing guidance is largely about prioritization:
- Context engineering on behalf of agents to reduce search and onboarding failures
- Aggressive self-verification (especially through testing)
- Tracing as an optimization loop, not a dashboard
- Short-term guardrails for bad patterns like blind retries and doom loops
- Tailoring harnesses to models, borrowing from model-specific prompting guides like OpenAI’s Codex and Anthropic’s Claude
Replies also surfaced the natural skepticism: how much the harness generalizes beyond Terminal Bench, which changes contributed most (no ablation study was provided), and whether “reasoning sandwich” is optimal or simply workable. Viv’s response was pragmatic: if forced to pick two broadly useful changes, self-verification and context injection are the place to start.
For anyone wanting to dig into artifacts, LangChain also shared a public dataset of traces and links to Deep Agents open source projects in Python and JavaScript.
Original source: https://x.com/Vtrivedy10/status/2023805578561060992
