Opus 4.7’s mixed reviews expose cracks in AI evaluation

Anthropic has just rolled out Opus 4.7, and reactions are sharply split. Simone Civetta argues the divide reflects uneven workflow impact, shifting benchmarks, and prompt breakage tied to stronger instruction following. He suggests higher “effort” modes to curb under-thinking.

TL;DR

  • Mixed early reactions to Opus 4.7 framed as uneven impacts across real workflows
  • Arena.ai results: better in some areas, “much worse” in others, suggesting targeted gains with tradeoffs
  • System Card MRCR v2: Steep recall drop vs Opus 4.6; metric reportedly deemed obsolete by Anthropic engineers
  • Instruction-following changes: More literal compliance can break previously effective prompts like messy CLAUDE.md
  • Adaptive thinking behavior: At “medium effort,” reasoning may be skipped; effort treated as ceiling, not floor
  • Mitigation suggested: xhigh/max effort mode for critical work, with higher plan usage; model drift requires resilient workflows

Anthropic’s latest model update, Opus 4.7, is already drawing sharply mixed reactions—and a new post from Simone Civetta (viteinfinite on X) argues the disagreement says as much about AI evaluation as it does about the release itself.

Civetta opens by juxtaposing two extremes of early sentiment: “Opus 4.7 is a disgruntled employee” versus “No, it’s the first model that truly gets what I want.” The point is not that either camp is definitively right, but that sharply divergent feedback can itself be a signal: model changes are landing unevenly across real workflows.

Two charts, two stories

Civetta points to two datapoints that, together, help explain why the discourse is so split:

  • Arena.ai results: Opus 4.7 appears to score better in some areas and “much worse” in others, reinforcing the idea that improvements may be targeted, or may come with tradeoffs.
  • MRCR v2 recall in Anthropic’s own System Card: Civetta highlights a “steep drop” in MRCR v2 recall for Opus 4.7, noting that MRCR v2 is the same evaluation Anthropic used “only weeks ago” to support Opus 4.6’s 1M-context quality.

Anthropic engineers have reportedly said MRCR v2 is now obsolete, Civetta adds. Even so, the post argues that an evaluation moving from “headline evidence” to “outdated” within weeks suggests a fragile measurement layer.

Instruction-following gains can destabilize prompts

Civetta’s read is that Anthropic’s models have historically had “somewhat worse instruction following” than OpenAI’s, but a stronger ability to understand a user’s “true goals.” If Anthropic invested in tightening instruction following, as the company has indicated, he argues that can cut both ways.

More steerability can also mean the model is more likely to comply with “wrong” instructions embedded in a “suboptimal system prompt” or a messy CLAUDE.md. That can create a practical regression: prompts and skills that worked well on Opus 4.6 may contain nuances that Opus 4.7 now follows more literally, disrupting established workflows. Civetta notes that prompt breakage is common across model releases, and a risk Anthropic itself has warned about.
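A hypothetical illustration of the failure mode (the file contents and comments below are invented for this example, not taken from the post): a stale directive that an earlier, looser model quietly ignored becomes a hard rule once instruction following tightens.

```markdown
# CLAUDE.md (excerpt, invented example)

- Always write new modules in CoffeeScript   <!-- stale: team migrated to TypeScript long ago -->
- Never modify files under src/legacy/       <!-- still intended -->
```

A model that infers “true goals” might skip the first rule as obviously outdated; a model that complies literally will follow it, producing a regression that looks like the model got worse when the prompt was simply never cleaned up.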

Adaptive thinking and the “effort” ceiling

The post also flags “adaptive thinking behavior” as a potential pitfall: at “medium effort,” Opus 4.7 often skips reasoning entirely, which can reduce reliability, because the “effort” parameter acts as a ceiling on the thinking budget rather than a floor.

As a mitigation, the article cites Boris Cherny’s suggestion to use “xhigh effort mode” (or “max” for critical work) to prevent under-thinking—at the cost of increased plan usage.
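A minimal sketch of the same mitigation at the API level. The “medium/xhigh/max” effort modes named in the post are product-level settings; whether and how they map to an API parameter is not confirmed here, so this sketch instead pins an explicit extended-thinking budget via the documented `thinking` field of the Anthropic Messages API. The model id and budget values are illustrative assumptions.

```python
# Sketch: requesting a generous thinking budget instead of relying on
# adaptive "effort" heuristics. Assumes the Anthropic Messages API's
# documented extended-thinking parameter; model id is hypothetical.

def build_request(prompt: str, thinking_budget: int = 32_000) -> dict:
    """Build a messages.create payload with an explicit thinking budget.

    Note: budget_tokens is still a *ceiling* on reasoning tokens, not a
    floor -- the model may use fewer. max_tokens must exceed the budget
    so the final answer has room after thinking.
    """
    return {
        "model": "claude-opus-4-7",  # hypothetical id, for illustration
        "max_tokens": thinking_budget + 8_000,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Review this diff for concurrency bugs.")
# The payload would then be sent via anthropic.Anthropic().messages.create(**req).
```

The tradeoff Cherny describes shows up directly here: a larger budget raises both reliability and token consumption, which is the “increased plan usage” cost.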

Light analysis: the evaluation gap is showing

One plausible explanation for the mixed results is a disconnect between internal testing and the messiness of external usage: the author floats “dogfooding bias,” and separately a push toward power efficiency that leans heavily on the thinking budget for steering. If either holds, benchmarks and system cards can look coherent while day-to-day prompt habits feel disrupted.

The conclusion is less about Opus 4.7 specifically than about operational reality: model drift is constant, and prompts, harnesses, and workflows need to be designed to absorb it.

Source: https://x.com/viteinfinite/status/2046826248769351756?s=20
