Opus 4.7’s mixed reviews expose cracks in AI evaluation

Anthropic has just rolled out Opus 4.7, and reactions are sharply split. Simone Civetta argues the divide reflects uneven workflow impact, shifting benchmarks, and prompt breakage tied to stronger instruction following. He suggests higher “effort” modes to curb under-thinking.

TL;DR

  • Mixed early reactions to Opus 4.7 framed as uneven impacts across real workflows
  • Arena.ai results: better in some areas, “much worse” in others, suggesting targeted gains with tradeoffs
  • System Card MRCR v2: Steep recall drop vs Opus 4.6; metric reportedly deemed obsolete by Anthropic engineers
  • Instruction-following changes: More literal compliance can break previously effective prompts like messy CLAUDE.md
  • Adaptive thinking behavior: At “medium effort,” reasoning may be skipped; effort treated as ceiling, not floor
  • Mitigation suggested: xhigh/max effort mode for critical work, with higher plan usage; model drift requires resilient workflows

Anthropic’s latest model update, Opus 4.7, is already drawing sharply mixed reactions—and a new post from Simone Civetta (viteinfinite on X) argues the disagreement says as much about AI evaluation as it does about the release itself.

Civetta opens by juxtaposing two extremes of early sentiment: “Opus 4.7 is a disgruntled employee” versus “No, it’s the first model that truly gets what I want.” The point is not that either camp is definitively right, but that sharply divergent feedback can itself be a signal: model changes are landing unevenly across real workflows.

Two charts, two stories

Civetta points to two datapoints that, together, help explain why the discourse is so split:

  • Arena.ai results: Opus 4.7 appears to score better in some areas and “much worse” in others, reinforcing the idea that improvements may be targeted, or may come with tradeoffs.
  • MRCR v2 recall in Anthropic’s own System Card: Civetta highlights a “steep drop” in MRCR v2 recall for Opus 4.7, noting that MRCR v2 is the same evaluation Anthropic used “only weeks ago” to support Opus 4.6’s 1M-context quality.

Anthropic engineers have reportedly said MRCR v2 is now obsolete, Civetta adds. Even so, the post argues that an evaluation moving from “headline evidence” to “outdated” within weeks suggests a fragile measurement layer.

Instruction-following gains can destabilize prompts

Civetta’s read is that Anthropic’s models have historically had “somewhat worse instruction following” than OpenAI’s, but a stronger ability to understand a user’s “true goals.” If Anthropic invested in tightening instruction following, as the company has indicated, he argues that can cut both ways.

More steerability can also mean the model is more likely to comply with “wrong” instructions embedded in a “suboptimal system prompt” or a messy CLAUDE.md. That can create a practical regression: prompts and skills that worked well on Opus 4.6 may contain nuances that Opus 4.7 now follows more literally, disrupting established workflows. Civetta notes that prompt breakage is common across model releases, and a risk Anthropic itself has warned about.
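A hypothetical illustration of the failure mode (the file contents and comments below are invented for this example, not taken from the post): a stale directive that an earlier, looser model quietly ignored becomes a hard rule once instruction following tightens.

```markdown
# CLAUDE.md (excerpt, invented example)

- Always write new modules in CoffeeScript   <!-- stale: team migrated to TypeScript long ago -->
- Never modify files under src/legacy/       <!-- still intended -->
```

A model that infers “true goals” might skip the first rule as obviously outdated; a model that complies literally will follow it, producing a regression that looks like the model got worse when the prompt was simply never cleaned up.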

Adaptive thinking and the “effort” ceiling

The post also flags “adaptive thinking behavior” as a potential pitfall: at “medium effort,” Opus 4.7 often skips reasoning entirely, which can reduce reliability, because the “effort” parameter acts as a ceiling on the thinking budget rather than a floor.

As a mitigation, the article cites Boris Cherny’s suggestion to use “xhigh effort mode” (or “max” for critical work) to prevent under-thinking—at the cost of increased plan usage.
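A minimal sketch of the same mitigation at the API level. The “medium/xhigh/max” effort modes named in the post are product-level settings; whether and how they map to an API parameter is not confirmed here, so this sketch instead pins an explicit extended-thinking budget via the documented `thinking` field of the Anthropic Messages API. The model id and budget values are illustrative assumptions.

```python
# Sketch: requesting a generous thinking budget instead of relying on
# adaptive "effort" heuristics. Assumes the Anthropic Messages API's
# documented extended-thinking parameter; model id is hypothetical.

def build_request(prompt: str, thinking_budget: int = 32_000) -> dict:
    """Build a messages.create payload with an explicit thinking budget.

    Note: budget_tokens is still a *ceiling* on reasoning tokens, not a
    floor -- the model may use fewer. max_tokens must exceed the budget
    so the final answer has room after thinking.
    """
    return {
        "model": "claude-opus-4-7",  # hypothetical id, for illustration
        "max_tokens": thinking_budget + 8_000,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Review this diff for concurrency bugs.")
# The payload would then be sent via anthropic.Anthropic().messages.create(**req).
```

The tradeoff Cherny describes shows up directly here: a larger budget raises both reliability and token consumption, which is the “increased plan usage” cost.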

Light analysis: the evaluation gap is showing

One plausible explanation for the mixed results is a disconnect between internal testing and the messiness of external usage: the author floats “dogfooding bias,” and separately a push toward power efficiency that leans heavily on the thinking budget for steering. If either holds, benchmarks and system cards can look coherent while day-to-day prompt habits feel disrupted.

The conclusion is less about Opus 4.7 specifically than about operational reality: model drift is constant, and prompts, harnesses, and workflows need to be designed to absorb it.

Source: https://x.com/viteinfinite/status/2046826248769351756?s=20
