GPT-5.2 Leads in AI Code-Review Test of Three Models

Kilo benchmarked GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro on a 560-line TypeScript PR seeded with 18 issues. GPT-5.2 found the most and flagged an auth bypass; Claude was fastest with perfect security detection; Gemini exposed N+1 and CSV bugs but missed an admin check.

TL;DR

  • All three models detected the critical vulnerabilities (both SQL injections, path traversal, CSV formula injection) and flagged a loop-bounds error; no false positives were reported.
  • Test setup: ~560-line TypeScript PR (Hono, Prisma, SQLite) with 18 planted issues across security, correctness, performance, authorization, style, and concurrency.
  • GPT-5.2 — most comprehensive: ~3 min, 13 issues; unique finds include authorization bypass in task duplication, blocking fs.writeFileSync in export, and search returning all users’ tasks; output included inline comments, severity, fixes, and a summary table.
  • Claude Opus 4.5 — fastest (~1 min), 8 issues (6 critical); perfect on planted security issues, caught pagination offset bug and snake_case naming inconsistency; output used “Suggested change” diffs and grouped severity summaries.
  • Gemini 3 Pro — ~2 min, 9 issues; flagged an N+1 query, swallowed exceptions, and CSV escaping problems, but missed an admin-authorization check on /export/all.
  • Practical comparison: frontier models excel at performance-pattern detection and deeper authorization analysis (GPT-5.2 and Gemini found N+1; GPT-5.2 found the duplication bypass); free models (e.g., Grok Code Fast 1) remained competitive on core security screening; all models missed a race condition in bulk-assign.

Kilo’s evaluation of three frontier code-review models—GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro—ran each model against the same 560-line TypeScript PR with 18 planted issues to compare detection rates, performance patterns, and output formats.

Testing methodology

The test used a TypeScript task-management API built with Hono, Prisma, and SQLite. The feature branch added user search, bulk operations, and CSV export across four new files (~560 lines). The PR contained 18 intentional issues grouped into security, correctness, performance, authorization, style, and concurrency categories. Each model reviewed the PR with a Balanced review style and all focus areas enabled; reviews were capped at 10 minutes, and none required more than 3.

Results overview

All three frontier models detected the critical vulnerabilities: both SQL injection instances, path traversal, and CSV formula injection. The models also flagged a loop-bounds error that could cause undefined array access. No false positives were reported—every flagged item was a real problem.
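
The exact planted loop isn't reproduced in the write-up, but that class of bug is easy to picture. A minimal TypeScript illustration (variable names are hypothetical, not the repo's code):

```typescript
const items = ['a', 'b', 'c'];

// BUG: <= runs one iteration past the end, so items[items.length] is undefined
for (let i = 0; i <= items.length; i++) {
  console.log(items[i].toUpperCase()); // throws a TypeError on the final iteration
}

// Fix: use a strict bound (i < items.length) or iterate with for...of
```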

GPT-5.2 — most comprehensive

GPT-5.2 completed its review in about 3 minutes and found 13 issues. Notable findings unique to GPT-5.2:

  • Authorization bypass in task duplication: a bulk-duplicate endpoint allowed clients to specify an owner ID, enabling creation of tasks in other users’ accounts (see the sketch after this list).
  • Blocking file write: use of fs.writeFileSync() in the export path, which blocks the Node.js event loop and can freeze request handling for large exports.
  • Search endpoint returning all tasks instead of only the current user’s tasks, exposing other users’ data.
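
Kilo’s post doesn’t reproduce the endpoint source, but the ownership bypass follows a familiar shape. A minimal sketch of the vulnerable pattern, with hypothetical route, model, and field names (Hono plus Prisma, as in the test repo):

```typescript
import { Hono } from 'hono';
import { PrismaClient } from '@prisma/client';

const app = new Hono<{ Variables: { userId: string } }>();
const prisma = new PrismaClient();

// Hypothetical bulk-duplicate handler illustrating the bypass: ownership is
// taken from the request body instead of the authenticated session.
app.post('/tasks/bulk-duplicate', async (c) => {
  const { taskIds, ownerId } = await c.req.json<{ taskIds: string[]; ownerId: string }>();

  const tasks = await prisma.task.findMany({ where: { id: { in: taskIds } } });
  for (const t of tasks) {
    await prisma.task.create({
      // BUG: ownerId is attacker-controlled, so duplicates can be created
      // in any user's account.
      data: { title: t.title, status: t.status, userId: ownerId },
    });
  }
  return c.json({ duplicated: tasks.length });
});

// Fix: derive ownership from the authenticated user, e.g.
//   data: { title: t.title, status: t.status, userId: c.get('userId') }
```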

GPT-5.2’s output included inline comments with severity labels, an impact assessment, recommended fixes with code examples, and a summary table.

Claude Opus 4.5 — fastest, focused on security and style

Claude Opus 4.5 finished in about 1 minute and reported 8 issues (6 critical, 2 lower severity). Key observations:

  • Caught the pagination offset bug (offset computed as page * limit, causing skipped results; see the sketch after this list).
  • Flagged a naming-convention inconsistency: bulk operations used snake_case (updated_count, failed_count) while the codebase otherwise used camelCase.
  • Achieved perfect detection across planted security issues.
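
The offset bug is easy to reproduce in miniature. A sketch assuming a Prisma task model with userId and createdAt fields (names hypothetical):

```typescript
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Hypothetical listing query showing the planted off-by-one: with 1-based page
// numbers, page * limit skips the entire first page of results.
async function listTasks(userId: string, page: number, limit: number) {
  return prisma.task.findMany({
    where: { userId },
    skip: page * limit, // BUG: page 1 starts at row `limit`, not row 0
    take: limit,
    orderBy: { createdAt: 'desc' },
  });
}

// Fix: convert the 1-based page to a 0-based offset.
//   skip: (page - 1) * limit
```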

Output used inline comments with “Suggested change” diff blocks and a grouped summary by severity for concise, actionable review notes.

Gemini 3 Pro — spotted performance anti-patterns

Gemini 3 Pro took roughly 2 minutes and found 9 issues. Highlights:

  • Detected an N+1 query pattern where assignee info was fetched per task, turning a single-page request into dozens or hundreds of queries (see the sketch after this list).
  • Found swallowed exceptions in a bulk-update loop where errors were caught but not handled or logged.
  • Identified CSV escaping problems when owner names include commas, which corrupt CSV alignment.
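
The N+1 pattern Gemini flagged typically looks like the sketch below (model and relation names are assumptions, not the repo’s actual schema): one query for the page of tasks, then one extra query per task for its assignee.

```typescript
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Hypothetical N+1: fetch a page of tasks, then look up each assignee separately.
async function listTasksWithAssignees(limit: number) {
  const tasks = await prisma.task.findMany({ take: limit });
  return Promise.all(
    tasks.map(async (task) => ({
      ...task,
      // One additional query per task: 50 tasks means 51 round trips.
      assignee: await prisma.user.findUnique({ where: { id: task.assigneeId } }),
    })),
  );
}

// Fix: load the relation in the same query.
//   prisma.task.findMany({ take: limit, include: { assignee: true } })
```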

Gemini missed a critical admin-authorization check on the /export/all endpoint (marked “admin only” in comments but lacking role enforcement), an issue the other two models caught.
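
The planted gap is the classic “comment promises, code doesn’t enforce” pattern. A minimal sketch with hypothetical route and field names:

```typescript
import { Hono } from 'hono';
import { PrismaClient } from '@prisma/client';

const app = new Hono<{ Variables: { role: string } }>();
const prisma = new PrismaClient();

// Admin only: exports every user's tasks as CSV.
app.get('/export/all', async (c) => {
  // BUG: nothing actually checks the caller's role; any authenticated caller
  // can export every user's tasks.
  const tasks = await prisma.task.findMany();
  const csv = tasks.map((t) => `${t.id},${t.title},${t.status}`).join('\n');
  return c.text(csv, 200, { 'Content-Type': 'text/csv' });
});

// Fix: enforce the role before exporting.
//   if (c.get('role') !== 'admin') return c.json({ error: 'Forbidden' }, 403);
```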

Detection rates and category breakdown

  • Security: GPT-5.2 and Claude Opus 4.5 hit 100% on the planted security issues; Gemini missed the admin-authorization check. All three caught SQL injection and path traversal.
  • Performance: GPT-5.2 flagged two of three performance problems (N+1 and sync file writes); Gemini flagged the N+1; Claude Opus 4.5 detected none.
  • Authorization & logic: GPT-5.2 demonstrated the deepest authorization analysis by finding the task-duplication bypass.

Additional findings and common misses

  • GPT-5.2 was the only model to surface the task-duplication authorization bypass, a finding beyond the planted set of issues.
  • All three models missed a race condition in bulk-assign: a check-then-update pattern where two concurrent requests (or intervening deletes) can corrupt data. Detecting this requires reasoning about interleaving operations across requests; a sketch of the pattern follows this list.
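
For reference, the check-then-update shape all three models missed looks roughly like this (function and model names are assumptions):

```typescript
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Hypothetical bulk-assign with a check-then-update race: the check and the
// update are separate queries, so two concurrent requests can each pass the
// check and then write conflicting assignments, and a task deleted in between
// is silently dropped from the update.
async function bulkAssign(taskIds: string[], assigneeId: string) {
  const existing = await prisma.task.findMany({ where: { id: { in: taskIds } } }); // check
  if (existing.length !== taskIds.length) {
    throw new Error('Some tasks do not exist');
  }

  // ...another request can delete or reassign these tasks right here...

  await prisma.task.updateMany({ // ...then update
    where: { id: { in: taskIds } },
    data: { assigneeId },
  });
}

// One mitigation: run both steps atomically, e.g. inside
// prisma.$transaction(async (tx) => { ...same queries against tx... }).
```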

Frontier models vs free models

When compared to three free models tested earlier (Grok Code Fast 1, MiniMax M2, Devstral 2):

  • Grok Code Fast 1 matched Claude Opus 4.5’s detection rate (44%) and caught all five planted security vulnerabilities (both SQL injections, missing admin check, path traversal, CSV injection).
  • Frontier advantages were clear in performance pattern detection and deeper authorization analysis: GPT-5.2 and Gemini found N+1 issues and GPT-5.2 uncovered the task-duplication bypass; none of the free models detected performance problems.
  • For core security screening—SQL injection, path traversal, missing authorization—free models like Grok Code Fast 1 performed competitively with frontier models.

Practical takeaways

  • GPT-5.2 provided the widest coverage and uncovered both security and performance issues that other models missed, making it suited for thorough audits where breadth matters.
  • Claude Opus 4.5 combined speed and reliable security detection, fitting high-velocity workflows that require quick, consistent security checks.
  • Gemini 3 Pro surfaced useful performance anti-patterns but missed some authorization logic; pairing with manual review is advisable for authorization-sensitive code.
  • For basic security screening, free models can be effective and cost-efficient.

Full test details and original results are available at https://blog.kilo.ai/p/code-reviews-sota.
