Leaderboard: Top Agent-Model Pairings and Accuracy on terminal-bench@2.0

terminal-bench@2.0 ranks 99 agent–model setups by verified accuracy, led by Simple Codex on GPT-5.3-Codex at 75.1% ± 2.4. The page lists verified runs and example harbor commands, and emphasizes that submissions may not change timeouts or resources.

TL;DR

  • Leaderboard: 99 agent–model configurations ranked by accuracy on terminal-bench@2.0; entries list agent, model, submission date, agent org, model org, and accuracy (with uncertainty).
  • Submissions may not modify timeouts or resources.
  • Simple Codex on GPT-5.3-Codex — 75.1% ± 2.4 (2026-02-06; OpenAI/OpenAI)
  • Droid on Claude Opus 4.6 — 69.9% ± 2.5 (2026-02-05; Factory/Anthropic)
  • Mux on GPT-5.3-Codex — 68.5% ± 2.4 (2026-02-09; Coder/OpenAI)
  • Representative run commands: harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5 and harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5
  • Runs are verified by the Terminal-Bench team; contacts: alex@laude.org, mikeam@cs.stanford.edu
  • Accuracy values include uncertainty (±)

The terminal-bench@2.0 leaderboard lists 99 agent–model configurations ranked by accuracy on the terminal-bench@2.0 evaluation. The page aggregates recent runs and highlights which agent implementations paired with which models perform best across the benchmark’s tasks.

What the leaderboard shows

The leaderboard presents each entry with agent name, model, submission date, agent org, model org, and reported accuracy (with uncertainty). Results correspond specifically to terminal-bench@2.0 and are displayed as verified runs. A clear constraint for submissions is reiterated on the page: submissions may not modify timeouts or resources.

Top performers

The three highest-ranked entries are:

  1. Simple Codex on GPT-5.3-Codex — 75.1% ± 2.4 (2026-02-06; OpenAI/OpenAI)
  2. Droid on Claude Opus 4.6 — 69.9% ± 2.5 (2026-02-05; Factory/Anthropic)
  3. Mux on GPT-5.3-Codex — 68.5% ± 2.4 (2026-02-09; Coder/OpenAI)

Beyond the top three, the listing includes many well-known agent and model pairings across a range of vendors, with OpenAI, Anthropic, and Google models appearing repeatedly among higher-ranked entries.

Running an evaluation

Example run invocations shown on the leaderboard indicate how results were produced. Two representative commands are given inline on the page:

  • harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5
  • harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Those commands illustrate the benchmark invocation pattern used for submissions and reproducible runs.

Submission and verification

The page notes that a Terminal-Bench team member ran the evaluation and verified the results. Contact channels for submitting agent results are included as mailto links: alex@laude.org and mikeam@cs.stanford.edu.

Practical context for developers

  • Accuracy values include uncertainty (±), which supports more cautious comparison of closely ranked results.
  • The prohibition on changing timeouts/resources is an explicit ruleset constraint; entries are expected to follow a uniform evaluation environment.
  • The leaderboard offers a snapshot of agent–model pairings in active use on terminal-bench@2.0 and can serve as a reference when selecting combinations to reproduce or benchmark internally.
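One way to apply the uncertainty values when comparing entries is to treat each score's ± as a simple interval and check whether two intervals overlap before treating one configuration as clearly better. The sketch below is only an illustration of that reading, not the leaderboard's own statistical methodology; it uses the three top scores quoted above.

```python
def intervals_overlap(acc_a: float, err_a: float,
                      acc_b: float, err_b: float) -> bool:
    """Return True if the ± uncertainty intervals of two accuracy
    scores overlap, i.e. the gap between them may not be meaningful."""
    lo_a, hi_a = acc_a - err_a, acc_a + err_a
    lo_b, hi_b = acc_b - err_b, acc_b + err_b
    return lo_a <= hi_b and lo_b <= hi_a

# Top entry vs. runner-up: 75.1 ± 2.4 vs. 69.9 ± 2.5
# -> intervals [72.7, 77.5] and [67.4, 72.4] do not overlap
print(intervals_overlap(75.1, 2.4, 69.9, 2.5))

# Runner-up vs. third place: 69.9 ± 2.5 vs. 68.5 ± 2.4
# -> intervals [67.4, 72.4] and [66.1, 70.9] overlap
print(intervals_overlap(69.9, 2.5, 68.5, 2.4))
```

By this reading, the lead of the top entry over the runner-up exceeds the combined uncertainty, while the second and third entries are within each other's error bars.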

Full details and the complete table of all 99 entries are available on the original leaderboard page: https://www.tbench.ai/leaderboard/terminal-bench/2.0.