Factory Research has published a benchmark of 13 AI models for code review, comparing price and performance across 50 real pull requests from Sentry, Grafana, Keycloak, Discourse, and Cal.com. The report uses the same prompts and a high reasoning setting for every model, then has an LLM judge score each model's findings against a human-curated set of known bugs.
The headline result is straightforward: GPT-5.2 and Claude Opus 4.6 land near the top on F1, but GPT-5.2 gets there at a lower cost per pull request. Factory Research also places Kimi K2.5 and Gemini 3 Flash in a more economical cluster, suggesting that lower-priced models can still capture a meaningful share of the quality of more expensive systems.
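To make the F1 ranking concrete, here is a minimal sketch of how this kind of scoring works. It is not Factory Research's evaluation code: the report uses an LLM judge to decide whether a finding matches a curated bug, which the sketch approximates with set intersection over hypothetical bug identifiers.

```python
def score_review(model_findings: set[str], known_bugs: set[str]) -> dict[str, float]:
    """Compute precision, recall, and F1 for one pull request.

    Assumption: each finding and each curated bug is reduced to an
    identifier, so a match is simple set membership. In the actual
    benchmark an LLM judge performs this matching step.
    """
    true_positives = len(model_findings & known_bugs)
    precision = true_positives / len(model_findings) if model_findings else 0.0
    recall = true_positives / len(known_bugs) if known_bugs else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: the model flags three issues, two of which
# match a curated list of three known bugs.
print(score_review({"bug-1", "bug-2", "noise-1"}, {"bug-1", "bug-2", "bug-3"}))
# precision = 2/3, recall = 2/3, f1 = 2/3
```

F1 penalizes both failure modes the report describes: a model that stays silent to avoid mistakes loses recall, while a model that flags everything loses precision.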
The report also argues that newer models are not automatically better. GPT-5.4 appears overly cautious, while GPT-5.5 seems to generate too many false positives. Factory Research further notes that open-source options, including GLM-5.1 and Kimi K2.5, compete more closely with frontier models than pricing alone might suggest.
Beyond the rankings, the full piece goes into methodology, cost efficiency, and token usage, and it links to the open-source benchmark data and evaluation scripts. It also mentions Droid Action, a GitHub Action for running AI code review on pull requests, along with the broader benchmarking setup behind Factory’s code review selection process.
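Since the report breaks out cost efficiency by token usage, a back-of-the-envelope calculation shows how a per-pull-request cost can be derived. All prices and token counts below are made-up placeholders, not figures from the Factory Research report.

```python
def cost_per_pr(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost of one review, given per-million-token pricing."""
    return (
        input_tokens / 1_000_000 * usd_per_m_input
        + output_tokens / 1_000_000 * usd_per_m_output
    )


# A review that reads ~60k tokens of diff and context and writes ~4k
# tokens of findings, at illustrative $2.50 / $10.00 per million tokens:
print(f"${cost_per_pr(60_000, 4_000, 2.50, 10.00):.4f}")  # $0.1900
```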
Source: Factory.ai