Addy Osmani’s post on X argues that coding agents should be understood less as standalone models and more as systems built from the surrounding infrastructure. In the thread, Osmani describes “agent harness engineering” as the work of tightening that scaffolding every time an agent makes a mistake, so the same failure does not recur.
Osmani writes that “agent = model + harness,” and that “if you’re not the model, you’re the harness.” In that view, the harness includes prompts, tool definitions, context rules, hooks, sandboxes, subagents, feedback loops and recovery paths. Osmani also cites other voices in the field, including HumanLayer, Anthropic’s engineering team and Birgitta Böckeler, as converging on similar conclusions about how agents actually operate.
What the harness includes
The post breaks the harness into several parts: memory files such as AGENTS.md; tools and MCP servers; filesystem and sandbox access; orchestration for subagent handoffs; hooks for deterministic checks; and observability for logs, traces, cost and latency. Osmani’s point is that a raw model does not become an agent until some layer gives it state, tool execution and constraints.
That leads to a simple claim repeated throughout the thread: “A decent model with a great harness consistently beats a great model with a bad harness.” Osmani suggests that model selection matters, but that behavior is often dominated by the wrapper around the model rather than the model alone.
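The composition Osmani describes — state, tool execution and constraints wrapped around a model — can be sketched as a minimal loop. This is an illustrative skeleton, not code from the post; the function names, the tool allow-list and the stub model are all assumptions made here for clarity.

```python
# Hypothetical minimal harness: the model proposes an action, the harness
# supplies state (memory), enforces constraints, and executes tools.
# Names are illustrative only; they do not come from Osmani's post.

def load_memory():
    # Stand-in for reading a memory file such as AGENTS.md.
    return ["Always run tests before committing."]

ALLOWED_TOOLS = {"read_file", "run_tests"}  # constraint: a tool allow-list

def fake_model(prompt, memory):
    # Stub for a real model call; returns a proposed tool invocation.
    return {"tool": "run_tests", "args": {}}

def run_tool(name, args):
    # Stub tool executor; a real harness would sandbox this call.
    return f"executed {name}"

def harness_step(prompt):
    memory = load_memory()
    action = fake_model(prompt, memory)
    if action["tool"] not in ALLOWED_TOOLS:
        return "blocked: tool not permitted"
    return run_tool(action["tool"], action["args"])
```

Even in this toy form, the model is only one line of the loop; everything else — memory, the allow-list, the executor — is harness, which is the asymmetry the "decent model, great harness" claim rests on.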
Failures become rules
A central theme in the post is that agent mistakes should become permanent configuration changes. If an agent ignores a convention, Osmani suggests adding it to AGENTS.md. If it runs a destructive command, the harness should block it. If it loses track of a long task, the system should split planning from execution.
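The "destructive command" case in particular lends itself to a deterministic hook. As a rough sketch — the pattern list and function name here are assumptions, not part of the thread — a pre-execution check might reject commands against a deny-list, so the same class of mistake is blocked permanently rather than re-litigated per run:

```python
import re

# Hypothetical pre-execution hook: a deterministic check that rejects
# known-destructive shell commands before the agent can run them.
DENY_PATTERNS = [
    r"\brm\s+-rf\b",          # recursive forced deletion
    r"\bgit\s+push\s+--force\b",  # history-rewriting push
    r"\bdrop\s+table\b",      # destructive SQL
]

def command_allowed(cmd: str) -> bool:
    """Return False if the command matches a known-destructive pattern."""
    lowered = cmd.lower()
    return not any(re.search(p, lowered) for p in DENY_PATTERNS)
```

Each time an agent attempts something destructive that slips through, the fix under this view is one more pattern in the list — a configuration change, not a model change.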
Osmani also quotes HumanLayer’s line that “It’s not a model problem. It’s a configuration problem.” The thread presents that as a counter to the habit of blaming the model whenever an agent behaves badly. Instead, the suggestion is to treat failures as a signal that some rule, hook or tool boundary needs to be tightened.
A counterpoint in the replies
The thread also drew at least one skeptical reply. A commenter argued that not every agent failure looks like a deterministic bug that can be patched away, and that some decisions are closer to judgment calls — for example, whether to refactor now or ship. That response points to a limitation in the harness-first view: some outputs may reflect discretion as much as enforcement.
Osmani closes by pointing to work from Fareed Khan on Claude Code’s architecture and to Flue, a framework from @FredKSchott that Osmani described as “solid” and apparently inspired by an earlier version of the post. The thread’s broader claim is that the most consequential engineering work around coding agents may be moving away from model comparison and toward the design of the runtime around them.
Source: Addy Osmani on X