Qwen launches Qwen3.5 Small models: 0.8B to 9B

With the launch of Qwen3.5 Small, Qwen is betting on more capability with less compute across four sizes, including Base models for fine-tuning. The lineup hits Ollama on day one with tool calling, “thinking,” and multimodal support for local-first builds.


TL;DR

  • Qwen3.5 Small models released: 0.8B, 2B, 4B, 9B, plus matching Base models for research/fine-tuning
  • Core features: native multimodal support, improved architecture, scaled reinforcement learning (RL) on the Qwen3.5 foundation
  • Sizing guidance: 0.8B/2B for edge; 4B for lightweight agents; 9B aims to narrow the gap with larger models
  • Ollama day-one support: native tool calling, thinking, multimodal; ollama run qwen3.5:{9b|4b|2b|0.8b}
  • Early reports: 9B at ~30 tokens/s on a Ryzen AI Max+ 395 (Q4_K_XL, 256k context, <16GB VRAM); Unsloth GGUFs for Macs and other devices with 6GB of RAM

The Qwen3.5 Small model series is now out in four sizes (Qwen3.5-0.8B, 2B, 4B, and 9B), framing the release around a familiar developer tradeoff: getting more capability out of fewer parameters and less compute. Alongside the chat-oriented models, Qwen says it is shipping Base variants in the same sizes, positioning the lineup for research and fine-tuning work rather than only drop-in assistant use.

What Qwen is shipping

The series is presented as a set of “small models” built on the same Qwen3.5 foundation, with native multimodal support, an improved architecture, and scaled reinforcement learning.

Qwen’s own sizing guidance is straightforward:

  • 0.8B / 2B: positioned as tiny and fast, aimed at edge device scenarios
  • 4B: described as a strong multimodal baseline for lightweight agents
  • 9B: framed as compact, while “closing the gap” with much larger models

Distribution links were shared for both major model hubs: Hugging Face and ModelScope.

Local-first tooling shows up quickly

Notably, the models landed in Ollama the same day. In a separate post, Ollama said the Qwen3.5 Small models are available there, that all of them support native tool calling, thinking, and multimodal capabilities in Ollama, and shared run commands for each size:

  • ollama run qwen3.5:9b
  • ollama run qwen3.5:4b
  • ollama run qwen3.5:2b
  • ollama run qwen3.5:0.8b
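To make the tool-calling support concrete, here is a minimal sketch of the request body you would POST to a local Ollama server's /api/chat endpoint. The model tag comes from the run commands above; the get_weather tool, its schema, and the prompt are hypothetical examples for illustration, not anything Qwen or Ollama published.

```python
import json

# Hypothetical tool definition (OpenAI-style function spec, as Ollama's
# /api/chat "tools" field expects). Name and schema are made up here.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request payload for POST http://localhost:11434/api/chat
payload = {
    "model": "qwen3.5:4b",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [get_weather_tool],
    "stream": False,
}

print(json.dumps(payload, indent=2))
```

If the model decides to use the tool, the response's message carries a tool_calls list; your code runs the function and sends the result back as a "tool" role message for the final answer.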

Early community signals: edge, agents, and practical throughput

The initial replies and quotes quickly centered on local hosting and agent workflows, especially around the 4B and 9B sizes. One developer called out the 9B as “solid for local hosting,” while another highlighted the 4B as an “embedded agent” sweet spot for multi-step tool calls.

There were also early performance anecdotes. Petri Kuittinen reported running Qwen3.5-9B at ~30 tokens/s on an AMD Ryzen AI Max+ 395 using Q4_K_XL quantization and a 256k context window, adding that the setup needed less than 16 GB of VRAM. Separately, Unsloth AI said the Qwen3.5 Small models can run locally via their GGUFs on a Mac or other device with 6 GB of RAM, linking here: https://t.co/7Jmp13uYfU.
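A quick back-of-envelope check makes the sub-16 GB report plausible. The sketch below estimates only the weight memory of a 9B-parameter model; the ~4.5 bits per parameter figure is an assumption typical of Q4_K-style schemes (4-bit weights plus per-block scales), not an official number for this release.

```python
# Rough estimate of weight memory for a 9B model under 4-bit-family
# quantization. 4.5 bits/param is an assumed effective rate for
# Q4_K-style schemes, not a published figure.
params = 9e9            # 9B parameters
bits_per_param = 4.5    # assumed: 4-bit weights + per-block scale overhead

weight_bytes = params * bits_per_param / 8
weight_gib = weight_bytes / 2**30

print(f"Estimated weight memory: {weight_gib:.1f} GiB")  # roughly 4.7 GiB
```

The rest of a 16 GB budget would go to the KV cache (which grows with that 256k context), activations, and runtime overhead, so the headroom is tighter than the weight figure alone suggests.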

As often happens with small-model drops, the thread also filled with practical questions—minimum hardware, GPU speed expectations, and Mac support—suggesting immediate interest in deploying these models outside the datacenter.

Source: https://x.com/Alibaba_Qwen/status/2028460046510965160

Continue the conversation on Slack

Did this article spark your interest? Join our community of experts and enthusiasts to dive deeper, ask questions, and share your ideas.

Join our community