Kilo’s evaluation of three frontier code-review models—GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro—ran each model against the same 560-line TypeScript PR with 18 planted issues to compare detection rates, performance patterns, and output formats.
Testing methodology
The test used a TypeScript task-management API built with Hono, Prisma, and SQLite. The feature branch added user search, bulk operations, and CSV export across four new files (~560 lines). The PR contained 18 intentional issues grouped into security, correctness, performance, authorization, style, and concurrency categories. Each model reviewed the PR with a Balanced review style and all focus areas enabled; reviews were capped at 10 minutes, and none took more than 3.
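For orientation, here is a minimal sketch of what a route in a Hono + Prisma + SQLite task API of this kind might look like; the endpoint and model names are illustrative assumptions, not the actual test repository:

```typescript
import { Hono } from "hono";
import { PrismaClient } from "@prisma/client";

// Illustrative shape of the stack under test; the real schema and routes are not published here.
const prisma = new PrismaClient();
const app = new Hono();

// A typical task-listing endpoint in this stack
app.get("/tasks", async (c) => {
  const tasks = await prisma.task.findMany();
  return c.json(tasks);
});

export default app;
```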
Results overview
All three frontier models detected the critical vulnerabilities: both SQL injection instances, path traversal, and CSV formula injection. The models also flagged a loop-bounds error that could cause undefined array access. No false positives were reported—every flagged item was a real problem.
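To make the most critical category concrete, here is a hedged sketch of the raw-SQL injection pattern such a finding typically points at in a Prisma codebase; the function, table, and column names are assumptions, not the actual planted code:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Vulnerable: user input is interpolated directly into raw SQL (assumed shape, not the test code).
async function searchTasksUnsafe(term: string) {
  return prisma.$queryRawUnsafe(`SELECT * FROM Task WHERE title LIKE '%${term}%'`);
}

// Safer: Prisma's tagged-template form parameterizes the input instead of splicing it into the string.
async function searchTasksSafe(term: string) {
  return prisma.$queryRaw`SELECT * FROM Task WHERE title LIKE ${"%" + term + "%"}`;
}
```

The CSV formula-injection finding is handled analogously on the export side: cell values beginning with =, +, -, or @ can be interpreted as formulas by spreadsheet applications, so the usual mitigation is to prefix such values (for example with a single quote) before writing the export.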
GPT-5.2 — most comprehensive
GPT-5.2 completed its review in about 3 minutes and found 13 issues. Notable findings unique to GPT-5.2:
- Authorization bypass in task duplication: a bulk-duplicate endpoint allowed clients to specify an owner ID, enabling creation of tasks in other users’ accounts.
- Blocking file write: use of fs.writeFileSync() in the export path, which blocks the Node.js event loop and can freeze request handling for large exports.
- Search endpoint returning all tasks instead of only the current user’s tasks, exposing other users’ data.
Its output included inline comments with severity labels, an impact assessment, recommended fixes with code examples, and a summary table.
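As an illustration of the duplication bypass, the unsafe pattern is trusting an owner ID supplied in the request body rather than deriving it from the authenticated session. Route, field, and helper names below are hypothetical, not the PR's actual code:

```typescript
import { Hono } from "hono";
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();
const app = new Hono();

app.post("/tasks/bulk-duplicate", async (c) => {
  const { taskIds, ownerId } = await c.req.json();

  // Vulnerable: ownerId comes from the client, so a caller can create copies
  // of tasks inside another user's account.
  const originals = await prisma.task.findMany({ where: { id: { in: taskIds } } });
  for (const t of originals) {
    await prisma.task.create({ data: { title: t.title, ownerId } });
  }
  return c.json({ duplicated: originals.length });
});

// Fix sketch: take the owner from the authenticated context (e.g. a userId set by auth
// middleware) and verify the source tasks belong to that user before duplicating.
```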
Claude Opus 4.5 — fastest, focused on security and style
Claude Opus 4.5 finished in about 1 minute and reported 8 issues (6 critical, 2 lower severity). Key observations:
- Caught the pagination offset bug (offset computed as page * limit, causing skipped results).
- Flagged a naming-convention inconsistency: bulk operations used snake_case (updated_count, failed_count) while the codebase otherwise used camelCase.
- Achieved perfect detection across the planted security issues.
Output used inline comments with “Suggested change” diff blocks and a grouped summary by severity for concise, actionable review notes.
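To illustrate the offset bug: with 1-indexed page numbers the offset should be (page - 1) * limit, so computing page * limit silently skips the first page of results. The query-parameter handling below is an assumed sketch, not the PR's exact code:

```typescript
import { Hono } from "hono";
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();
const app = new Hono();

app.get("/tasks", async (c) => {
  const page = Number(c.req.query("page") ?? "1");   // 1-indexed page number
  const limit = Number(c.req.query("limit") ?? "20");

  const tasks = await prisma.task.findMany({
    skip: page * limit, // bug: page 1 starts at offset `limit`, so the first results are never returned
    take: limit,
  });
  return c.json(tasks);
});

// Correct offset for 1-indexed pages: skip: (page - 1) * limit
```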
Gemini 3 Pro — spotted performance anti-patterns
Gemini 3 Pro took roughly 2 minutes and found 9 issues. Highlights:
- Detected an N+1 query pattern where assignee info was fetched per task, turning a single-page request into dozens or hundreds of queries.
- Found swallowed exceptions in a bulk-update loop where errors were caught but not handled or logged.
- Identified CSV escaping problems where owner names containing commas corrupt column alignment in the export.
Gemini 3 Pro missed a critical admin-authorization check on the /export/all endpoint (marked “admin only” in comments but lacking role enforcement), an issue the other two models caught.
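A hedged sketch of the N+1 shape Gemini flagged, with assumed model and relation names: the assignee is fetched one query at a time instead of being joined into the list query.

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// N+1: one query for the task list, then one additional query per task (assumed schema).
async function listTasksNPlusOne(ownerId: string) {
  const tasks = await prisma.task.findMany({ where: { ownerId } });
  return Promise.all(
    tasks.map(async (task) => ({
      ...task,
      assignee: task.assigneeId
        ? await prisma.user.findUnique({ where: { id: task.assigneeId } })
        : null,
    }))
  );
}

// Single-query alternative: load the relation with `include`.
async function listTasksSingleQuery(ownerId: string) {
  return prisma.task.findMany({ where: { ownerId }, include: { assignee: true } });
}
```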
Detection rates and category breakdown
- Security: GPT-5.2 and Claude Opus 4.5 hit 100% on the planted security issues; Gemini missed the admin-authorization check. All three caught SQL injection and path traversal.
- Performance: GPT-5.2 flagged two of three performance problems (N+1 and sync file writes); Gemini flagged the N+1; Claude Opus detected none.
- Authorization & logic: GPT-5.2 demonstrated the deepest authorization analysis by finding the task-duplication bypass.
Additional findings and common misses
- GPT-5.2 was the only model to find the task-duplication bypass beyond the planted set of issues.
- All three models missed a race condition in bulk-assign: a check-then-update pattern where two concurrent requests (or intervening deletes) can corrupt data. Detecting this requires reasoning about interleaving operations across requests.
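A hedged sketch of that check-then-update shape, with assumed names: the existence check and the update run as separate statements, so a concurrent delete or a competing bulk request can slip in between them.

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Racy: the assignee may be deleted between the check and the update (assumed shape, not the PR code).
async function bulkAssignUnsafe(taskIds: string[], assigneeId: string) {
  const assignee = await prisma.user.findUnique({ where: { id: assigneeId } });
  if (!assignee) throw new Error("assignee not found");
  await prisma.task.updateMany({ where: { id: { in: taskIds } }, data: { assigneeId } });
}

// One mitigation: run the check and update inside a single transaction (and/or rely on a
// foreign-key constraint so assigning to a deleted user fails at the database level).
async function bulkAssignSafer(taskIds: string[], assigneeId: string) {
  await prisma.$transaction(async (tx) => {
    const assignee = await tx.user.findUnique({ where: { id: assigneeId } });
    if (!assignee) throw new Error("assignee not found");
    await tx.task.updateMany({ where: { id: { in: taskIds } }, data: { assigneeId } });
  });
}
```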
Frontier models vs free models
When compared to three free models tested earlier (Grok Code Fast 1, MiniMax M2, Devstral 2):
- Grok Code Fast 1 matched Claude Opus 4.5’s detection rate (44%) and caught all five planted security vulnerabilities (both SQL injections, missing admin check, path traversal, CSV injection).
- Frontier advantages were clear in performance pattern detection and deeper authorization analysis: GPT-5.2 and Gemini found N+1 issues and GPT-5.2 uncovered the task-duplication bypass; none of the free models detected performance problems.
- For core security screening—SQL injection, path traversal, missing authorization—free models like Grok Code Fast 1 performed competitively with frontier models.
Practical takeaways
- GPT-5.2 provided the widest coverage and uncovered both security and performance issues that other models missed, making it well suited to thorough audits where breadth matters.
- Claude Opus 4.5 combined speed and reliable security detection, fitting high-velocity workflows that require quick, consistent security checks.
- Gemini 3 Pro surfaced useful performance anti-patterns but missed some authorization logic; pairing with manual review is advisable for authorization-sensitive code.
- For basic security screening, free models can be effective and cost-efficient.
Full test details and original results are available at https://blog.kilo.ai/p/code-reviews-sota.