DeepSeek V4 Pro and Flash benchmarked against Claude Opus

A recent benchmark write-up by Darko at Kilo Blog tests DeepSeek V4 Pro and Flash against Claude Opus 4.7 and Kimi K2.6 using a heavier FlowGraph workflow. It digs into where each model shines, where it stumbles, and how pricing shifts the value equation.

TL;DR

  • Benchmark scope: DeepSeek V4 Pro and V4 Flash vs Claude Opus 4.7 and Kimi K2.6, using a heavier FlowGraph setup
  • V4 Pro performance: Stronger than Kimi K2.6, but behind Claude Opus 4.7 on tougher workflow segments
  • V4 Pro economics: A temporary pricing promotion and lower cache pricing make it more attractive on cost than its list price alone suggests
  • V4 Flash positioning: Dramatically cheaper than the other models in the comparison, but more uneven in how it handles the workflow
  • Tool/agent behavior: V4 Flash tool-calling held up better than expected at its low price point
  • Failure analysis focus: Scheduling, recovery, validation, and build integrity; coordination-heavy edge cases still differentiate models

Kilo Blog’s latest benchmark write-up looks at DeepSeek V4 Pro and DeepSeek V4 Flash alongside Claude Opus 4.7 and Kimi K2.6, using the same heavier FlowGraph setup the publication used in an earlier comparison. The post paints DeepSeek’s new open-weight pair as relevant contenders, with Pro landing in the middle of the pack and Flash aiming at the ultra-low-cost end of the market.

According to the blog, V4 Pro came in with a stronger showing than Kimi K2.6, while still trailing Claude Opus 4.7 on the tougher parts of the workflow. The write-up also notes that DeepSeek’s temporary pricing promotion and lower cache pricing make the model more attractive on cost than its list price alone might suggest.
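To make the pricing point concrete, here is a minimal sketch of how cached-input pricing and a promotional discount could feed into a cost-per-point figure. It is not the blog's methodology. The `Pricing` fields, the function names, and the assumption that the promotion applies as a flat percentage off the whole bill are all hypothetical, and no real prices or scores are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (hypothetical structure, placeholder values only)."""
    input_per_m: float           # list price for uncached input tokens
    cached_input_per_m: float    # price for input tokens served from cache
    output_per_m: float          # list price for output tokens
    promo_discount: float = 0.0  # assumed flat fraction off the bill, e.g. 0.5 for 50%


def run_cost(p: Pricing, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost of one run; cached_tokens is the part of input_tokens that hit the cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    gross = (
        uncached_tokens * p.input_per_m
        + cached_tokens * p.cached_input_per_m
        + output_tokens * p.output_per_m
    ) / 1_000_000
    return gross * (1.0 - p.promo_discount)


def cost_per_point(cost: float, score: float) -> float:
    """Dollars spent per benchmark point earned."""
    if score <= 0:
        raise ValueError("score must be positive")
    return cost / score
```

Running `run_cost` for the same token counts with `promo_discount=0.0` and `cached_tokens=0` gives the list-price cost, so comparing the two results shows how much of a model's apparent value comes from the promotion and cache hits rather than from its base rates.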

V4 Flash, meanwhile, appears to be a different kind of product altogether. The post describes it as dramatically cheaper than the other models in the comparison, but also more uneven in how it handled the workflow. Even so, the author suggests its agent/tool-calling behavior held up better than expected for such a low-price run.

The rest of the article focuses on where each model stumbled in a complex backend-building task, especially around scheduling, recovery, validation, and build integrity. Rather than treating benchmark scores as the whole story, the post argues that the gap between open-weight and proprietary models is narrowing in broad coverage, while the hardest coordination-heavy edge cases remain the real separator.

Readers interested in the detailed failure cases, the cost-per-point breakdown, and the side-by-side scoring table can find the full post on Kilo Blog.

Source: Kilo Blog
