Moonshot open-sources Kimi K2.6 to power long-horizon coding

Moonshot AI has just rolled out Kimi K2.6, bringing an open-source model tuned for long-running coding, heavy tool use, and multi-agent “swarm” workflows. The company claims multi-hour runs with thousands of tool calls, plus stronger benchmarks across SWE-Bench and agentic tasks.


TL;DR

  • **Open-sourced Kimi K2.6:** Focus on coding performance, long-horizon tool use, and **agent swarm** workflows
  • **Availability:** Via Kimi.com, Kimi App, API, and Kimi Code
  • **Long-horizon coding demos:** 4,000+ tool calls, 12+ hours; **Zig** inference optimized to ~193 tokens/sec (~20% faster than LM Studio)
  • **exchange-core optimization:** 13 hours, 1,000+ tool calls, 4,000+ LOC changed; thread topology 4ME+2RE → 2ME+1RE; median throughput +185% (0.43 → 1.24 MT/s)
  • **Agent Swarm scaling:** **300 sub-agents** and **4,000 coordinated steps** (up from 100 and 1,500); targets lower latency and higher-quality multi-part deliverables
  • **Benchmarks up vs K2.5:** HLE-Full w/ tools 54.0, Terminal-Bench 66.7, SWE-Bench Pro 58.6, Verified 80.2; official results via API and KVV

Moonshot AI has open-sourced its latest model, Kimi K2.6, positioning the release around coding performance, longer-running tool use, and “agent swarm” workflows. The company says K2.6 is available through Kimi.com, the Kimi App, the API, and Kimi Code.

According to Moonshot AI, K2.6 targets “long-horizon execution” tasks—jobs that require sustained multi-step work across tools—alongside stronger coding and improved coordination across multiple agents.

Long-horizon coding: multi-hour runs and performance tuning

Moonshot AI says Kimi K2.6 improves on long-horizon coding across languages including Rust, Go, and Python, and across categories like front-end work, DevOps, and performance optimization. On the company’s internal “Kimi Code Bench,” K2.6 is reported to show “significant improvements” over Kimi K2.5.

Two showcase examples emphasize long, tool-heavy sessions:

  • **Local model deployment and inference optimization on a Mac:** Moonshot AI says K2.6 downloaded and deployed **Qwen3.5-0.8B** locally, then implemented and optimized inference in **Zig**. Across **4,000+ tool calls**, **12+ hours** of execution, and **14 iterations**, throughput improved from about **15** to **~193 tokens/sec**, which the company says is **~20% faster than LM Studio**.
  • **Overhauling exchange-core:** K2.6 also “autonomously” optimized **exchange-core**, described as an 8-year-old open-source financial matching engine. Moonshot AI reports **13 hours** of execution, **12 optimization strategies**, **1,000+ tool calls**, and modifications across **4,000+ lines of code**. The company says K2.6 used CPU and allocation flame graphs to rework bottlenecks and changed thread topology from **4ME+2RE** to **2ME+1RE**, resulting in a **185%** median-throughput increase (**0.43 → 1.24 MT/s**) and a **133%** peak-throughput gain (**1.23 → 2.86 MT/s**); the arithmetic behind these percentages is checked in the sketch below.
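
For a quick sense of scale, the reported before/after numbers can be sanity-checked with simple ratio arithmetic. The figures are Moonshot AI’s; the helper below is only an illustrative check:

```python
def pct_gain(before: float, after: float) -> float:
    """Percentage improvement going from `before` to `after`."""
    return (after / before - 1) * 100

# Zig inference demo: ~15 -> ~193 tokens/sec
print(f"Inference speedup: {pct_gain(15, 193):.0f}%")      # ~1187% (nearly 13x)

# exchange-core: median throughput 0.43 -> 1.24 MT/s
print(f"Median throughput: {pct_gain(0.43, 1.24):.0f}%")   # ~188%, in line with the reported 185%

# exchange-core: peak throughput 1.23 -> 2.86 MT/s
print(f"Peak throughput:   {pct_gain(1.23, 2.86):.0f}%")   # ~133%, matching the reported figure
```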

Enterprise beta feedback concentrates on reliability and tool use

Moonshot AI also included a set of enterprise beta-test quotes (presented as “randomly ordered”) that repeatedly highlight longer-running stability, more reliable tool calling, and better behavior in large codebases.

Among the claims: **anything.com** said K2.6 “runs longer-horizon tasks before hitting a wall,” **Augment Code** pointed to “surgical precision in large codebases,” and **CodeBuddy** reported internal gains including **12%** higher code generation accuracy, **18%** improved long-context stability, and **96.60%** tool invocation success rate. **Vercel** said it saw “more than 50% improvement” on a Next.js benchmark and described K2.6 as a candidate for “agentic coding and front-end generation through AI Gateway.”

Coding-driven design and simple full-stack workflows

Moonshot AI is also pitching K2.6 as capable of generating complete front-end interfaces from prompts, including interactive elements and animations such as scroll-triggered effects. The company says K2.6 can use image and video generation tools to produce coherent site assets, and extend beyond static front-end work into “simple full-stack workflows,” including authentication and database operations for lightweight use cases like transaction logging or session management.
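
To make the “lightweight” end of that range concrete, the transaction-logging case amounts to little more than an append-only table. The sketch below is our own minimal illustration (not Moonshot AI’s generated code), using Python’s standard library:

```python
import sqlite3
from datetime import datetime, timezone

# Append-only transaction log: the kind of lightweight database operation described.
conn = sqlite3.connect("app.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS transactions ("
    "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  user_id TEXT NOT NULL,"
    "  amount REAL NOT NULL,"
    "  logged_at TEXT NOT NULL)"
)

def log_transaction(user_id: str, amount: float) -> None:
    """Record one transaction with a UTC timestamp."""
    conn.execute(
        "INSERT INTO transactions (user_id, amount, logged_at) VALUES (?, ?, ?)",
        (user_id, amount, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

log_transaction("alice", 42.50)
```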

To evaluate this, Moonshot AI says it created an internal “Kimi Design Bench” with four categories: Visual Input Tasks, Landing Page Construction, Full-Stack Application Development, and General Creative Programming.

Agent swarms: scaling to 300 sub-agents and 4,000 steps

K2.6 also updates Moonshot AI’s “Agent Swarm” system. The company says the new architecture scales horizontally to **300 sub-agents** executing simultaneously across **4,000 coordinated steps**, up from K2.5’s **100 sub-agents** and **1,500 steps**. Moonshot AI frames this as a way to reduce end-to-end latency while raising quality for multi-part deliverables, including documents, websites, slides, and spreadsheets.
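
Moonshot AI has not published the swarm internals, so the sketch below only illustrates the general fan-out pattern such a coordinator implies: many concurrent sub-agents capped at a fixed ceiling. All names and the asyncio framing are our assumptions:

```python
import asyncio

MAX_SUB_AGENTS = 300  # concurrency ceiling reported for K2.6 (K2.5: 100)

async def run_sub_agent(step_id: int) -> str:
    """Placeholder for one sub-agent executing one slice of the overall plan."""
    await asyncio.sleep(0.01)  # stands in for real model/tool calls
    return f"step {step_id}: done"

async def coordinate(total_steps: int) -> list[str]:
    """Fan out `total_steps` units of work, never exceeding the agent ceiling."""
    sem = asyncio.Semaphore(MAX_SUB_AGENTS)

    async def bounded(step_id: int) -> str:
        async with sem:  # at most MAX_SUB_AGENTS sub-agents in flight at once
            return await run_sub_agent(step_id)

    return await asyncio.gather(*(bounded(i) for i in range(total_steps)))

results = asyncio.run(coordinate(4_000))  # the 4,000 coordinated steps claimed
print(len(results), "steps completed")
```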

Moonshot AI also describes turning uploaded files (PDFs, spreadsheets, slides, Word documents) into reusable “Skills,” with the model preserving their “structural and stylistic DNA” for future work.

Proactive agents, Claw Bench, and “Claw Groups” preview

For proactive, always-on agent scenarios, Moonshot AI says K2.6 performs well with autonomous agents such as OpenClaw and Hermes. The company says its RL infra team ran a K2.6-backed agent autonomously for **5 days**, handling monitoring, incident response, and system operations.

Moonshot AI quantifies improvements through an internal “Claw Bench,” spanning Coding Tasks, IM Ecosystem Integration, Information Research & Analysis, Scheduled Task Management, and Memory Utilization, reporting that K2.6 outperforms K2.5 in task completion rates and tool invocation accuracy.

The company also teased “Claw Groups” as a research preview, described as a shared, heterogeneous environment where multiple agents and humans collaborate, with K2.6 acting as a coordinator that assigns tasks based on tools and skills, and reassigns work when an agent stalls.
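
That stall-handling behavior maps onto a familiar timeout-and-reassign pattern. The sketch below is our own minimal rendering of the idea, not Moonshot AI’s design; the timeout value is an arbitrary assumption:

```python
import asyncio

STALL_TIMEOUT = 2.0  # seconds; kept short for the demo, a real coordinator would allow far more

async def run_with_reassignment(task: str, agents) -> str:
    """Try each agent in turn, reassigning the task whenever the current one stalls."""
    for agent in agents:
        try:
            return await asyncio.wait_for(agent(task), timeout=STALL_TIMEOUT)
        except asyncio.TimeoutError:
            continue  # agent stalled past the deadline; hand the task to the next one
    raise RuntimeError(f"no agent completed task: {task!r}")

async def stalled_agent(task: str) -> str:
    await asyncio.sleep(60)  # never finishes within the timeout
    return f"{task}: done (slow)"

async def healthy_agent(task: str) -> str:
    return f"{task}: done"

print(asyncio.run(run_with_reassignment("draft the slides", [stalled_agent, healthy_agent])))
```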

Benchmarks: K2.6 vs K2.5 and selected closed models

Moonshot AI published a benchmark table covering agentic tasks, coding, reasoning/knowledge, and vision. A few highlighted K2.6 results include:

  • **HLE-Full w/ tools:** 54.0 (vs 50.2 for K2.5)
  • **BrowseComp:** 83.2 (vs 74.9 for K2.5)
  • **DeepSearchQA (f1):** 92.5 (vs 89.0 for K2.5)
  • **Terminal-Bench 2.0 (Terminus-2):** 66.7 (vs 50.8 for K2.5)
  • **SWE-Bench Pro:** 58.6 (vs 50.7 for K2.5)
  • **SWE-Bench Verified:** 80.2 (vs 76.8 for K2.5)

To reproduce “official Kimi-K2.6 benchmark results,” Moonshot AI recommends using the official API, and points to its Kimi Vendor Verifier (KVV) write-up: https://kimi.com/blog/kimi-vendor-verifier.
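
Moonshot’s platform exposes an OpenAI-compatible chat API, so a reproduction run typically starts with a call like the one below. The base URL and the `kimi-k2.6` model ID are our assumptions for illustration; confirm both, plus the exact sampling settings, against the official docs and the KVV write-up:

```python
from openai import OpenAI  # Moonshot's endpoint follows the OpenAI-compatible schema

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.ai/v1",  # assumed international endpoint
)

response = client.chat.completions.create(
    model="kimi-k2.6",  # assumed model ID; check the official model list
    messages=[{"role": "user", "content": "Reverse a linked list in Python."}],
    temperature=0.0,  # benchmark reproduction usually pins sampling settings
)
print(response.choices[0].message.content)
```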

---

Source: Kimi K2.6: Advancing Open-Source Coding
