Factory Research has published a benchmark of 13 AI models for code review, comparing price and performance across 50 real pull requests from Sentry, Grafana, Keycloak, Discourse, and Cal.com. The report uses the same prompts and a high reasoning setting for every model, then has an LLM judge score each model's findings against a human-curated set of known bugs.
The headline result is straightforward: GPT-5.2 and Claude Opus 4.6 land near the top on F1, but GPT-5.2 gets there at a lower cost per pull request. Factory Research also places Kimi K2.5 and Gemini 3 Flash in a more economical cluster, suggesting that lower-priced models can still capture a meaningful share of the quality of more expensive systems.
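To make the F1 ranking concrete, here is a minimal sketch of how this kind of scoring works. It is not Factory Research's evaluation code: the report uses an LLM judge to decide whether a finding matches a curated bug, which the sketch approximates with set intersection over hypothetical bug identifiers.

```python
def score_review(model_findings: set[str], known_bugs: set[str]) -> dict[str, float]:
    """Compute precision, recall, and F1 for one pull request.

    Assumption: each finding and each curated bug is reduced to an
    identifier, so a match is simple set membership. In the actual
    benchmark an LLM judge performs this matching step.
    """
    true_positives = len(model_findings & known_bugs)
    precision = true_positives / len(model_findings) if model_findings else 0.0
    recall = true_positives / len(known_bugs) if known_bugs else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: the model flags three issues, two of which
# match a curated list of three known bugs.
print(score_review({"bug-1", "bug-2", "noise-1"}, {"bug-1", "bug-2", "bug-3"}))
# precision = 2/3, recall = 2/3, f1 = 2/3
```

F1 penalizes both failure modes the report describes: a model that stays silent to avoid mistakes loses recall, while a model that flags everything loses precision.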
The report also argues that newer models are not automatically better. GPT-5.4 appears overly cautious, while GPT-5.5 seems to generate too many false positives. Factory Research further notes that open-source options, including GLM-5.1 and Kimi K2.5, compete more closely with frontier models than pricing alone might suggest.
Beyond the rankings, the full piece goes into methodology, cost efficiency, and token usage, and it links to the open-source benchmark data and evaluation scripts. It also mentions Droid Action, a GitHub Action for running AI code review on pull requests, along with the broader benchmarking setup behind Factory’s code review selection process.
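Since the report breaks out cost efficiency by token usage, a back-of-the-envelope calculation shows how a per-pull-request cost can be derived. All prices and token counts below are made-up placeholders, not figures from the Factory Research report.

```python
def cost_per_pr(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost of one review, given per-million-token pricing."""
    return (
        input_tokens / 1_000_000 * usd_per_m_input
        + output_tokens / 1_000_000 * usd_per_m_output
    )


# A review that reads ~60k tokens of diff and context and writes ~4k
# tokens of findings, at illustrative $2.50 / $10.00 per million tokens:
print(f"${cost_per_pr(60_000, 4_000, 2.50, 10.00):.4f}")  # $0.1900
```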
Source: Factory.ai