GLM-5.2 benchmark hype questioned after real-world code audit

In a post on X, Paweł Huryn says GLM-5.2 looks “clearly worse” than GPT-5.5 and Opus 4.8 in his blind-graded bug hunt. Across 60 audits, GPT-5.5 and Opus 4.8 found more bugs at max reasoning than at high reasoning, while GLM-5.2 didn’t.

TL;DR

  • Claim: many people citing GLM-5.2 benchmarks may not have tested the model; Huryn reports it “clearly worse” than Opus 4.8 and GPT-5.5
  • Test design: 60 blind-graded runs on one real codebase with 21 planted bugs; 10 audits per model at each of high and max reasoning
  • Reasoning scaling: GPT-5.5 20%→31% bugs found; Opus 4.8 12%→24%; GLM-5.2 18%→16%
  • Token output at max: GPT-5.5 ~13k–27k; Opus 4.8 ~23k–57k; GLM-5.2 ~16k–18k
  • Max-only results: GPT-5.5 31% (66/210), Opus 4.8 24% (51/210), GLM-5.2 16% (34/210)
  • Method notes: single-agent, read-only; withheld answer key; GPT-5.5 on Codex, others on Claude Code, GLM via OpenRouter

In a post on X, Paweł Huryn argues that many people citing GLM-5.2 benchmarks may not have tested the model themselves, and he adds that it appears "clearly worse than Opus 4.8 or GPT-5.5" in his own code-audit run.

Huryn’s post quotes a larger blind-graded test he published in the same thread, in which he audited three models against one real codebase containing 21 planted bugs. The setup used 60 total runs: for each of the three models, 10 audits at high reasoning and 10 more at max reasoning.

In that test, GPT-5.5 and Opus 4.8 improved when reasoning effort was raised. GLM-5.2 did not. According to the figures shared in the post, GPT-5.5 went from 20% to 31% of planted bugs found per audit, while Opus 4.8 rose from 12% to 24%. GLM-5.2 moved from 18% to 16%.
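To make the size of those shifts concrete, the short Python sketch below computes the change in percentage points from high to max reasoning. It uses only the rounded figures quoted above; the variable and function names are illustrative and do not come from Huryn's post.

```python
# Share of planted bugs found per audit (percent), as quoted in the post:
# (high reasoning, max reasoning). Values are the rounded published figures.
reported_rates = {
    "GPT-5.5": (20, 31),
    "Opus 4.8": (12, 24),
    "GLM-5.2": (18, 16),
}


def point_change(high_pct: int, max_pct: int) -> int:
    """Change in detection rate, in percentage points, from high to max."""
    return max_pct - high_pct


changes = {
    model: point_change(high_pct, max_pct)
    for model, (high_pct, max_pct) in reported_rates.items()
}

for model, delta in changes.items():
    print(f"{model}: {delta:+d} percentage points")
```

On these figures, the two closed models each gain more than ten points, while GLM-5.2 slips by two. Because the inputs are already rounded, the differences are approximate.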

Huryn also points to a token-count difference that appears to matter for the result. At max effort, GPT-5.5 generated about 13k to 27k tokens per run, and Opus 4.8 about 23k to 57k. GLM-5.2, by contrast, went from roughly 16k to 18k tokens and found about the same set of bugs.

The thread’s summary is blunt: "Reasoning effort is a lever on the closed frontier models. On the open one, turning it up changes almost nothing."

A second graphic in the thread, a poster headlined "Top of the arena. Bottom of the bug hunt.", presents the max-reasoning result visually. It lists GPT-5.5 at 31% (66/210), Opus 4.8 at 24% (51/210), and GLM-5.2 at 16% (34/210).
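The 210 in each fraction is consistent with 21 planted bugs multiplied by 10 max-reasoning audits per model. That reading is an inference from the test design, not something the post states outright. Under that assumption, a minimal Python sketch reproduces the listed percentages from the raw counts; the names here are illustrative.

```python
# Assumption: the denominator is planted bugs x audits per model (21 x 10).
PLANTED_BUGS = 21
MAX_AUDITS_PER_MODEL = 10
opportunities = PLANTED_BUGS * MAX_AUDITS_PER_MODEL  # 210 chances to find a bug

# Bugs found across the max-reasoning audits, as listed on the poster.
bugs_found = {"GPT-5.5": 66, "Opus 4.8": 51, "GLM-5.2": 34}

# Detection rate per model, rounded to the nearest whole percent.
detection_pct = {
    model: round(100 * found / opportunities)
    for model, found in bugs_found.items()
}

for model, pct in detection_pct.items():
    print(f"{model}: {pct}% ({bugs_found[model]}/{opportunities})")
```

Rounded to whole percentages, the results match the 31%, 24%, and 16% shown on the poster.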

The method note attached to the graphics describes the test as "60 blind-graded runs" for the full high-versus-max comparison and "30 audits" for the max-only poster. It also notes that the audits were single-agent, read-only, and graded against a withheld answer key. GPT-5.5 ran on Codex, while Opus 4.8 and GLM-5.2 ran on Claude Code, with GLM going through an OpenRouter proxy.

Source: Paweł Huryn on X
