GLM-5.2 benchmark hype questioned after real-world code audit

In a post on X, Paweł Huryn says GLM-5.2 looks “clearly worse” than GPT-5.5 and Opus 4.8 in his blind-graded bug hunt. Across 60 audits, GPT-5.5 and Opus 4.8 improved with max reasoning, while GLM-5.2 didn’t.

June 21, 2026

TL;DR

Claim: many people citing GLM-5.2 benchmarks may not have tested the model themselves; Huryn reports it “clearly worse” than Opus 4.8 and GPT-5.5 in his own audit
Test design: 60 blind-graded runs on one real codebase with 21 planted bugs; 10 audits per model at each reasoning level (high and max)
Reasoning scaling (share of planted bugs found, high→max): GPT-5.5 20%→31%; Opus 4.8 12%→24%; GLM-5.2 18%→16%
Token output at max: GPT-5.5 ~13k–27k; Opus 4.8 ~23k–57k; GLM-5.2 ~16k–18k
Max-only results: GPT-5.5 31% (66/210), Opus 4.8 24% (51/210), GLM-5.2 16% (34/210)
Method notes: single-agent, read-only audits graded against a withheld answer key; GPT-5.5 ran on Codex, Opus 4.8 and GLM-5.2 on Claude Code, with GLM-5.2 routed through an OpenRouter proxy

In a post on X, Paweł Huryn argues that many people citing GLM-5.2 benchmarks may not have tested the model themselves, and he adds that it appears "clearly worse than Opus 4.8 or GPT-5.5" in his own code-audit run.

Huryn’s post quotes a larger blind-graded test he published in the same thread, in which he audited three models against one real codebase containing 21 planted bugs. The setup used 60 total runs: for each model, 10 audits at high reasoning and 10 more at max reasoning.
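As a rough sketch of how those numbers fit together, the run grid can be enumerated in a few lines of Python. The variable names are illustrative and not from Huryn's thread; only the counts (three models, two effort levels, 10 audits each, 21 planted bugs) come from the post.

```python
from itertools import product

# Figures taken from the article; the variable names are illustrative.
MODELS = ["GPT-5.5", "Opus 4.8", "GLM-5.2"]
EFFORT_LEVELS = ["high", "max"]
AUDITS_PER_CELL = 10  # audits per model at each effort level
PLANTED_BUGS = 21

# One entry per audit run: (model, effort level, audit index).
runs = [
    (model, effort, i)
    for model, effort in product(MODELS, EFFORT_LEVELS)
    for i in range(AUDITS_PER_CELL)
]

total_runs = len(runs)
max_only_runs = sum(1 for _, effort, _ in runs if effort == "max")
# Bug-detection opportunities per model at a single effort level.
opportunities_per_cell = AUDITS_PER_CELL * PLANTED_BUGS

print(total_runs, max_only_runs, opportunities_per_cell)  # 60 30 210
```

The 210 opportunities per model at one effort level match the denominators reported for the max-reasoning results.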

In that test, GPT-5.5 and Opus 4.8 improved when reasoning effort was raised. GLM-5.2 did not. According to the figures shared in the post, GPT-5.5 went from 20% to 31% of planted bugs found per audit, while Opus 4.8 rose from 12% to 24%. GLM-5.2 moved from 18% to 16%.
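To make the size of those shifts concrete, here is a small sketch that turns the reported percentages into percentage-point and relative changes. The function name `effort_effect` is hypothetical, and the inputs are the rounded figures from the post, so the relative changes are approximate.

```python
# (high, max) share of planted bugs found, in percent, as reported in the post.
reported = {
    "GPT-5.5": (20, 31),
    "Opus 4.8": (12, 24),
    "GLM-5.2": (18, 16),
}


def effort_effect(high, max_):
    """Return (percentage-point change, relative change in percent)."""
    points = max_ - high
    relative = 100 * points / high
    return points, round(relative, 1)


effects = {model: effort_effect(*pair) for model, pair in reported.items()}
for model, (points, relative) in effects.items():
    print(f"{model}: {points:+d} points ({relative:+.1f}% relative)")
```

On these figures, Opus 4.8 doubles its detection rate and GPT-5.5 gains about half again, while GLM-5.2 slips by two points.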

Huryn also points to a token-count difference that appears to matter for the result. At max effort, GPT-5.5 generated about 13k to 27k tokens per run, and Opus 4.8 about 23k to 57k. GLM-5.2, by contrast, went from roughly 16k to 18k tokens and found about the same set of bugs.

The thread’s summary is blunt: "Reasoning effort is a lever on the closed frontier models. On the open one, turning it up changes almost nothing."

A second graphic in the thread, a poster headlined "Top of the arena. Bottom of the bug hunt.", presents the same result more visually. For the maximum-reasoning run, it lists GPT-5.5 at 31% (66/210), Opus 4.8 at 24% (51/210), and GLM-5.2 at 16% (34/210).
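The percentages follow directly from the counts. The sketch below assumes the denominator of 210 is 10 max-reasoning audits times 21 planted bugs, which fits the test design but is an inference rather than something the poster states; the per-audit averages are derived here and do not appear in the thread.

```python
AUDITS = 10        # max-reasoning audits per model
PLANTED_BUGS = 21  # bugs planted in the codebase
found_at_max = {"GPT-5.5": 66, "Opus 4.8": 51, "GLM-5.2": 34}

# Assumed denominator: every audit had a chance to find every planted bug.
opportunities = AUDITS * PLANTED_BUGS  # 210

summary = {}
for model, found in found_at_max.items():
    rate = round(100 * found / opportunities)  # percent, rounded as on the poster
    per_audit = found / AUDITS                 # average bugs found per audit
    summary[model] = (rate, per_audit)
    print(f"{model}: {rate}% ({found}/{opportunities}), "
          f"about {per_audit} of {PLANTED_BUGS} bugs per audit")
```

Read this way, GPT-5.5 averaged 6.6 of the 21 planted bugs per audit at max reasoning, Opus 4.8 averaged 5.1, and GLM-5.2 averaged 3.4.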

The method note attached to the graphics describes the test as "60 blind-graded runs" for the full high-versus-max comparison and "30 audits" for the max-only poster. It also notes that the audits were single-agent, read-only, and graded against a withheld answer key. GPT-5.5 ran on Codex, while Opus 4.8 and GLM-5.2 ran on Claude Code, with GLM going through an OpenRouter proxy.

Source: Paweł Huryn on X
