Claude Opus 4.8 vs MiniMax M3: code audit cost showdown

Kilo Blog ran the same TypeScript code audit across Claude Opus 4.8 settings and MiniMax M3, tracking spend, speed, and issues found. The results highlight a sharp cost-vs-coverage tradeoff that may reshape how teams budget AI reviews.

June 10, 2026


Anthropic MiniMax Benchmark

TL;DR

Side-by-side code audit: Claude Opus 4.8 vs MiniMax M3 on one codebase, with one prompt, on a practical review task
Cost vs capability tradeoff: MiniMax M3 produced a strong audit at much lower cost
Claude scaling: Higher reasoning settings generally pushed further, with sharply higher costs
Diminishing returns: Increased spend on Claude did not consistently yield proportionally better output
Configuration-focused comparison: Multiple Claude configurations compared against a single MiniMax run
Timing insight: Wall-clock time tracked token usage more closely than it tracked which model was run

Kilo Blog has published a detailed comparison of Claude Opus 4.8 and MiniMax M3, testing both models on the same code audit and tracking how they performed on cost, speed, and issue detection. The setup is intentionally narrow and practical: one codebase, one prompt, and a side-by-side look at how each model handled a real review task.

The results appear to show a familiar tradeoff. MiniMax M3 delivered a surprisingly strong audit for a far lower bill, while Claude Opus 4.8 generally pushed further as its reasoning level increased. The write-up suggests that the better-performing Claude settings also came with a much steeper cost, and that extra spending did not always translate into proportionally better output.

What makes the piece interesting is that it compares multiple Claude configurations against a single MiniMax run instead of reducing model choice to one headline number. That approach lets the authors show where the higher-end model wins, where it falters, and where a cheaper alternative keeps up better than expected. The full article covers the methodology, the findings, and how the models scaled under the same workload.

There is also a useful discussion of timing and efficiency: the article notes that wall-clock time tracked token usage more closely than it tracked which model was used. For teams weighing AI-assisted code review or other repeated auditing tasks, the comparison offers a concrete way to think about coverage versus spend.
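One way to make the coverage-versus-spend question concrete is to compare cost per issue found, and the marginal cost of each extra issue when stepping up to a pricier configuration. The TypeScript sketch below is only an illustration: the `AuditRun` shape, the helper names, the run names, and every number in it are hypothetical placeholders, not figures from the Kilo Blog post.

```typescript
// Hypothetical sketch: every name and number here is a placeholder,
// not a result reported in the Kilo Blog comparison.
interface AuditRun {
  name: string;
  costUsd: number;     // total spend for the run
  issuesFound: number; // distinct valid issues the run reported
}

// Average spend per issue found; Infinity if the run found nothing.
function costPerIssue(run: AuditRun): number {
  return run.issuesFound > 0 ? run.costUsd / run.issuesFound : Infinity;
}

// Extra spend per extra issue when moving from a cheaper run to a pricier one.
// Returns Infinity when the pricier run found no additional issues.
function marginalCostPerIssue(base: AuditRun, upgraded: AuditRun): number {
  const extraIssues = upgraded.issuesFound - base.issuesFound;
  const extraCost = upgraded.costUsd - base.costUsd;
  return extraIssues > 0 ? extraCost / extraIssues : Infinity;
}

// Placeholder runs, chosen only to make the arithmetic easy to follow.
const cheapRun: AuditRun = { name: "budget-model", costUsd: 2, issuesFound: 10 };
const priceyRun: AuditRun = { name: "premium-high-effort", costUsd: 20, issuesFound: 16 };

console.log(costPerIssue(cheapRun));                    // 0.2
console.log(costPerIssue(priceyRun));                   // 1.25
console.log(marginalCostPerIssue(cheapRun, priceyRun)); // 3
```

With these made-up inputs, the pricier run finds more issues, but each additional issue costs far more than the cheaper run's average. That is the diminishing-returns pattern the write-up describes; the actual magnitudes are in the original post.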

The full breakdown, including the audit setup and the per-run results, is available in the original post on Kilo Blog.

Source: Kilo Blog


