Qwen3.5-397B-A17B launches with multimodal, agent-ready efficiency

Qwen has just rolled out Qwen3.5-397B-A17B, the first open-weight release in the Qwen3.5 series, built for native multimodal support and agent-style workloads. The broader family adds Qwen3.5-Plus, which pairs a hybrid linear-attention + sparse MoE design with a 1M-token context window.

TL;DR

  • Qwen3.5-397B-A17B released: First open-weight Qwen3.5 model; native multimodal and agent-style workloads focus
  • Model specs: 397B total parameters, with the A17B suffix denoting roughly 17B active parameters per token (per Qwen’s MoE naming convention); trained “for real-world agents”
  • Architecture: Hybrid linear attention + sparse MoE; positioned for long-running generations and multi-step tasks
  • Training and performance claims: Large-scale RL environment scaling, 201 languages & dialects, 8.6x–19.0x decoding throughput vs Qwen3-Max

Qwen3.5-397B-A17B is now out as the first open-weight release in the Qwen3.5 series, positioning the lineup around native multimodal capabilities and a systems-level focus on agent-style workloads. Alongside the open-weight drop, the broader family includes Qwen3.5-Plus, described as a native vision-language model built on a hybrid linear-attention + sparse MoE design aimed at improving inference efficiency. It carries an extended 1M-token context window for long-context multimodal reasoning and agent workflows.

What’s shipping in Qwen3.5-397B-A17B

Qwen’s announcement centers on a model with 397B total parameters; the A17B suffix, following the naming convention of Qwen’s earlier MoE releases, indicates roughly 17B parameters active per token. The release is framed explicitly around real-world agent behavior, and the company highlights:

  • Native multimodal support
  • Training “for real-world agents”
  • Hybrid linear attention + sparse MoE
  • Large-scale RL environment scaling
  • 201 languages & dialects
  • Apache 2.0 licensing
  • A reported 8.6x–19.0x decoding-throughput gain over Qwen3-Max

Separate follow-up posts also point to slides or figures covering efficiency, infrastructure, and both LM and VLM performance.

Why the architecture choice matters for agentic coding

The technical throughline here is efficiency on long-running generations. Observers in the thread note the intuition behind the combination: linear attention replaces the quadratic cost of softmax attention with a constant-size per-token state update, which matters as contexts grow, while sparse MoE increases total capacity without activating every parameter on each token. Both properties help when a model is expected to sustain long rollouts across multi-step tasks; the sketch below illustrates each mechanism in isolation.
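
To make that intuition concrete, here is a minimal NumPy sketch of both mechanisms. It is illustrative only, not Qwen's implementation: the feature map phi, the gating matrix W_gate, and the experts list are assumed stand-ins.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention via running sums.

    Instead of materializing the O(T^2) softmax score matrix, keep a
    running (d x d_v) state S = sum_j phi(K_j) V_j^T and a normalizer
    z = sum_j phi(K_j), so each new token costs O(d * d_v), independent
    of sequence length. phi is a simple positive feature map (elu + 1).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    z = np.zeros(d)
    out = np.zeros_like(V)
    for t in range(T):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])             # accumulate K^T V state
        z += k                             # accumulate normalizer
        out[t] = (q @ S) / (q @ z + 1e-6)  # attend using the running state
    return out

def topk_moe(x, W_gate, experts, k=2):
    """Sparse MoE routing: each token runs only its top-k experts.

    Total capacity scales with len(experts); per-token compute scales
    with k, which is the efficiency argument for MoE decoding.
    """
    logits = x @ W_gate                    # one score per expert
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                   # softmax over the selected k
    return sum(g * experts[i](x) for g, i in zip(gates, top))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, n_experts = 16, 8, 4
    Q, K, V = rng.normal(size=(3, T, d))
    print(linear_attention(Q, K, V).shape)  # (16, 8)

    experts = [lambda x, W=rng.normal(size=(d, d)): x @ W
               for _ in range(n_experts)]
    W_gate = rng.normal(size=(d, n_experts))
    print(topk_moe(V[0], W_gate, experts).shape)  # (8,)
```

The pairing is complementary: linear attention keeps per-token decode cost flat as rollouts grow, while top-k routing keeps per-token FLOPs tied to k experts rather than all of them.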

That theme is echoed by multiple replies emphasizing throughput as the practical constraint for multi-agent coding pipelines, where latency and sustained decoding costs can dominate overall task time.
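
A quick back-of-envelope illustrates the stakes. Every number below except the speedup is an assumption chosen for illustration; only the 8.6x figure comes from the announcement's claimed range:

```python
# Back-of-envelope: decoding time for one multi-step agent task.
# steps, tokens_per_step, and baseline_tps are illustrative assumptions.
steps = 30                # tool calls / reasoning steps per task (assumed)
tokens_per_step = 800     # generated tokens per step (assumed)
baseline_tps = 40         # baseline decode speed, tokens/sec (assumed)
speedup = 8.6             # low end of the claimed 8.6x-19.0x range

total_tokens = steps * tokens_per_step          # 24,000 tokens
print(f"baseline: {total_tokens / baseline_tps / 60:.1f} min decoding")
print(f"at {speedup}x: {total_tokens / (baseline_tps * speedup) / 60:.1f} min")
```

Under these assumptions, pure decoding drops from 10 minutes to roughly 1.2 minutes per task, a difference that compounds quickly across a multi-agent pipeline.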

The ecosystem is already in motion: Unsloth AI says it has published GGUFs for local runs here: https://t.co/j5vIkGID5y
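
For anyone trying those GGUFs locally, a minimal sketch with llama-cpp-python might look like the following. The filename and quantization level are placeholders; check Unsloth's page for the actual artifacts, and note that a model this size has steep memory requirements even quantized:

```python
# Hypothetical local-run sketch; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-397B-A17B-Q4_K_M.gguf",  # placeholder filename
    n_ctx=32768,       # context window; tune to available RAM/VRAM
    n_gpu_layers=-1,   # offload every layer to GPU if it fits
)
out = llm("Write a haiku about mixture-of-experts.", max_tokens=64)
print(out["choices"][0]["text"])
```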

Open questions: requirements and agent evaluation

Notably, the replies quickly converge on practical deployment questions, especially minimum and recommended system requirements, alongside requests for more complete agent-evaluation artifacts: tool-call traces, or end-to-end task-success suites spanning languages and multi-step workflows.
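
As an example of the artifacts being asked for, a per-step tool-call trace could be as simple as one JSON record per step. Every field name below is hypothetical, sketched only to show what such a suite would capture:

```python
import json

# Hypothetical shape for one step of an agent's tool-call trace.
trace_step = {
    "step": 3,
    "model": "Qwen3.5-397B-A17B",
    "tool_call": {
        "name": "run_tests",              # hypothetical tool name
        "arguments": {"path": "tests/"},
    },
    "tool_result": {"passed": 41, "failed": 2},
    "latency_ms": 1840,                   # wall-clock time for the step
}
print(json.dumps(trace_step, indent=2))
```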

Original source: https://x.com/i/status/2023331062433153103

Continue the conversation on Slack

Did this article spark your interest? Join our community of experts and enthusiasts to dive deeper, ask questions, and share your ideas.
