Leaderboard: Top Agent-Model Pairings and Accuracy on terminal-bench@2.0

terminal-bench@2.0 ranks 99 agent–model setups by verified accuracy, led by Simple Codex on GPT-5.3-Codex at 75.1% ± 2.4. The page lists verified runs and example harbor commands, and emphasizes that submissions may not change timeouts or resources.

TL;DR

  • Leaderboard: 99 agent–model configurations ranked by accuracy on terminal-bench@2.0; entries list agent, model, submission date, agent org, model org, and accuracy (with uncertainty).
  • Submissions may not modify timeouts or resources.
  • Simple Codex on GPT-5.3-Codex — 75.1% ± 2.4 (2026-02-06; OpenAI/OpenAI)
  • Droid on Claude Opus 4.6 — 69.9% ± 2.5 (2026-02-05; Factory/Anthropic)
  • Mux on GPT-5.3-Codex — 68.5% ± 2.4 (2026-02-09; Coder/OpenAI)
  • Representative run commands: harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5 and harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5
  • Runs are verified by the Terminal-Bench team; contacts: alex@laude.org, mikeam@cs.stanford.edu
  • Accuracy values include uncertainty (±)

The terminal-bench@2.0 leaderboard lists 99 agent–model configurations ranked by accuracy on the terminal-bench@2.0 evaluation. The page aggregates recent runs and highlights which agent implementations paired with which models perform best across the benchmark’s tasks.

What the leaderboard shows

The leaderboard presents each entry with agent name, model, submission date, agent org, model org, and reported accuracy (with uncertainty). Results correspond specifically to terminal-bench@2.0 and are displayed as verified runs. A clear constraint for submissions is reiterated on the page: submissions may not modify timeouts or resources.

Top performers

The three highest-ranked entries are:

  1. Simple Codex on GPT-5.3-Codex — 75.1% ± 2.4 (2026-02-06; OpenAI/OpenAI)
  2. Droid on Claude Opus 4.6 — 69.9% ± 2.5 (2026-02-05; Factory/Anthropic)
  3. Mux on GPT-5.3-Codex — 68.5% ± 2.4 (2026-02-09; Coder/OpenAI)

Beyond the top three, the listing includes many well-known agent and model pairings across a range of vendors, with OpenAI, Anthropic, and Google models appearing repeatedly among higher-ranked entries.

Running an evaluation

Example run invocations shown on the leaderboard indicate how results were produced. Two representative commands are given inline on the page:

  • harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5
  • harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Those commands illustrate the benchmark invocation pattern used for submissions and reproducible runs.

Submission and verification

The page notes that a Terminal-Bench team member ran the evaluation and verified the results. Contact channels for submitting agent results are included as mailto links: alex@laude.org and mikeam@cs.stanford.edu.

Practical context for developers

  • Accuracy values include uncertainty (±), which supports more cautious comparison of closely ranked results.
  • The prohibition on changing timeouts/resources is an explicit ruleset constraint; entries are expected to follow a uniform evaluation environment.
  • The leaderboard offers a snapshot of agent–model pairings in active use on terminal-bench@2.0 and can serve as a reference when selecting combinations to reproduce or benchmark internally.
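One way to apply the uncertainty values when comparing entries is to treat each score's ± as a simple interval and check whether two intervals overlap before treating one configuration as clearly better. The sketch below is only an illustration of that reading, not the leaderboard's own statistical methodology; it uses the three top scores quoted above.

```python
def intervals_overlap(acc_a: float, err_a: float,
                      acc_b: float, err_b: float) -> bool:
    """Return True if the ± uncertainty intervals of two accuracy
    scores overlap, i.e. the gap between them may not be meaningful."""
    lo_a, hi_a = acc_a - err_a, acc_a + err_a
    lo_b, hi_b = acc_b - err_b, acc_b + err_b
    return lo_a <= hi_b and lo_b <= hi_a

# Top entry vs. runner-up: 75.1 ± 2.4 vs. 69.9 ± 2.5
# -> intervals [72.7, 77.5] and [67.4, 72.4] do not overlap
print(intervals_overlap(75.1, 2.4, 69.9, 2.5))

# Runner-up vs. third place: 69.9 ± 2.5 vs. 68.5 ± 2.4
# -> intervals [67.4, 72.4] and [66.1, 70.9] overlap
print(intervals_overlap(69.9, 2.5, 68.5, 2.4))
```

By this reading, the lead of the top entry over the runner-up exceeds the combined uncertainty, while the second and third entries are within each other's error bars.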

Full details and the complete table of all 99 entries are available on the original leaderboard page: https://www.tbench.ai/leaderboard/terminal-bench/2.0.