AI code review is drifting from novelty into routine infrastructure, and the pricing question is starting to look less theoretical. In a recent analysis of Kilo Code Reviewer, Kilo’s team put real numbers on that question by running the product against actual open-source pull requests and tracking token usage and dollars across two very different models.
The comparison lands at a useful moment: Anthropic has also been highlighting that “deep” multi-agent PR reviews can cost real money—often cited in the $15–$25 range per review—depending on how much context gets pulled in. Kilo’s setup, by contrast, shows what code review spend can look like when the same workflow is run with either a frontier model or a lower-cost alternative.
The test: real Hono PRs, two model tiers
To keep the benchmark grounded, the experiment used two real commits from Hono (forked at v4.11.4), each turned into a PR:
- Small PR (338 lines, 9 files): commit 16321afd, adding `getConnInfo` helpers for AWS Lambda, Cloudflare Pages, and Netlify adapters, plus tests.
- Large PR (598 lines, 5 files): commit 8217d9ec, adjusting JSX link element hoisting and deduplication to match React 19 semantics, with 485 lines of new tests.
Each PR was reviewed twice in Kilo Code Reviewer using Balanced review style with all focus areas enabled, swapping only the model:
- Claude Opus 4.6, the higher-end (and higher-priced) option.
- Kimi K2.5, an open-weight MoE model positioned as a budget pick.
What the reviews actually cost
Across the four runs, the headline number was straightforward: the most expensive review was $1.34 (Claude Opus 4.6 on the 598-line PR). The cheapest landed at $0.05 (Kimi K2.5 on that same PR).
Those totals reflect a combination of per-token pricing and how aggressively the agent pulls surrounding context.
Pricing per token (the built-in multiplier)
Kilo’s numbers list model pricing as:
- Claude Opus 4.6: $5 / million input tokens and $25 / million output tokens
- Kimi K2.5: $0.45 / million input tokens and $2.20 / million output tokens
That’s roughly a 10× difference in per-token costs, before accounting for behavioral differences.
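As a sanity check on that multiplier, the list-price arithmetic is straightforward. The sketch below is a minimal cost calculator using the prices above; the token counts are hypothetical, and it ignores prompt caching or other discounts that real runs may benefit from, so actual per-review charges can come in well under this raw arithmetic.

```python
def review_cost(input_tokens: int, output_tokens: int,
                in_price_per_m: float, out_price_per_m: float) -> float:
    """Raw list-price cost of a single review, in dollars."""
    return (input_tokens / 1_000_000) * in_price_per_m \
         + (output_tokens / 1_000_000) * out_price_per_m

# Hypothetical review: 500k input tokens, 6k output tokens, same for both models.
opus = review_cost(500_000, 6_000, 5.00, 25.00)   # Claude Opus 4.6 list prices
kimi = review_cost(500_000, 6_000, 0.45, 2.20)    # Kimi K2.5 list prices

print(f"Opus: ${opus:.2f}, Kimi: ${kimi:.2f}, ratio: {opus / kimi:.0f}x")
```

Holding token counts fixed isolates the per-token multiplier; as the next section shows, the models do not actually consume the same number of tokens.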
Token usage: the hidden lever is context
Kilo’s reviewers aren’t simply summarizing the diff; they dispatch agents that read changes and then pull in additional files for context. The test shows just how differently models behave under the same settings.
Small PR (338 lines): context expansion vs staying on-diff
- Opus 4.6 used 618,853 input tokens
- Kimi K2.5 used 359,556 input tokens
- Output tokens were similar (6,142 vs 5,374)
The practical difference: Opus pulled in surrounding code (including `handler.ts`) to understand Lambda event types, which led it to catch a missing `LatticeRequestContextV2` type. Kimi stayed closer to the diff and didn’t surface that issue.
Large PR (598 lines): a wider gap
- Opus 4.6 consumed 1,184,324 input tokens
- Kimi K2.5 consumed 219,886 input tokens
In this case, Opus retrieved more of the JSX rendering implementation to reason about deduplication behavior, while Kimi did what Kilo describes as a lighter pass and reported no issues.
Cost per issue found: cheaper isn’t always “worse,” but it can be narrower
One of the more developer-relevant angles here is cost per issue, because raw dollars don’t mean much if the review is routinely missing what matters.
On the small PR:
- Opus 4.6 ($0.73) found 2 issues, including the missing `LatticeRequestContextV2` handling and an `X-Forwarded-For` parsing concern (trusting the first IP, which could be spoofed behind a load balancer). These required reading outside the diff.
- Kimi K2.5 ($0.07) found 3 issues, focusing on defensive coding inside the diff (like a missing null check on `c.env` and a header edge case), plus a test assertion note.
On the large PR:
- Opus 4.6 ($1.34) flagged a potential runtime error: `shouldDeDupeByKey` could throw a `TypeError` if called with an unexpected tag name, because `deDupeKeyMap[tagName]` could return `undefined`.
- Kimi K2.5 ($0.05) found 0 issues and recommended merging, while still describing the PR accurately and noting thorough tests.
Importantly, Kilo notes that the Opus-found issues, like the missing Lattice event type and the potential `TypeError`, were present in the shipped Hono release.
Monthly spend modeling for a 10-dev team
To translate per-PR costs into budgeting, Kilo modeled a 10-developer team opening 3 PRs per day (about 660 PRs/month):
- A “frontier” estimate using the Opus average review cost ($1.04) lands around $660/month.
- A “budget” estimate using the Kimi average ($0.06) lands around $40/month.
- A mixed approach (20% frontier, 80% budget) comes out to about $165/month.
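The monthly figures follow from simple multiplication. The sketch below reproduces the model (10 devs × 3 PRs/day × ~22 working days) with the article's average per-review costs; the exact outputs differ slightly from the article's rounded figures.

```python
DEVS, PRS_PER_DAY, WORKDAYS = 10, 3, 22
PRS_PER_MONTH = DEVS * PRS_PER_DAY * WORKDAYS   # 660 PRs/month

OPUS_AVG, KIMI_AVG = 1.04, 0.06                 # average $/review from the article

frontier = PRS_PER_MONTH * OPUS_AVG             # every review on Opus 4.6
budget = PRS_PER_MONTH * KIMI_AVG               # every review on Kimi K2.5
mixed = 0.2 * frontier + 0.8 * budget           # 20% frontier, 80% budget

print(f"frontier ~ ${frontier:.0f}/mo, budget ~ ${budget:.0f}/mo, mixed ~ ${mixed:.0f}/mo")
```

The mixed figure shows why routing matters: shifting just 20% of reviews to the frontier model recovers most of the budget savings while keeping deep review available.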
Picking models based on PR criticality
The practical guidance is less about declaring a universal winner and more about matching model behavior to the moment:
- Frontier models (here, Opus 4.6) appear better at pulling broader context and catching issues that require understanding code outside the diff.
- Budget models (here, Kimi K2.5) can be a low-cost way to get baseline coverage—especially where speed and price matter more than exhaustive context expansion.
- A mixed policy (budget by default, frontier on merges to main or release branches) is presented as a way to control spend while reserving deeper review for higher-stakes changes.
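A mixed policy like that can be expressed as a few lines of routing logic. The sketch below is hypothetical, not Kilo Code Reviewer's actual configuration; the model identifiers, branch names, and function name are all assumptions for illustration.

```python
FRONTIER_MODEL = "claude-opus-4.6"   # assumed identifier
BUDGET_MODEL = "kimi-k2.5"           # assumed identifier

# Branches whose merges get the more expensive, wider-context review.
HIGH_STAKES_BRANCHES = {"main", "release"}

def pick_review_model(target_branch: str) -> str:
    """Route PR reviews: frontier model for high-stakes merges, budget otherwise."""
    base = target_branch.split("/")[0]   # treat "release/1.2" like "release"
    return FRONTIER_MODEL if base in HIGH_STAKES_BRANCHES else BUDGET_MODEL

print(pick_review_model("main"))        # routes to the frontier model
print(pick_review_model("feature/x"))   # routes to the budget model
```

In practice the routing signal could also be path-based (e.g. changes touching auth or payment code) rather than branch-based; the mechanism stays the same.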