Alibaba rolls out Qwen3.7-Plus, a multimodal agent model

Alibaba has just rolled out Qwen3.7-Plus, bringing vision and language into an API-first model for coding, GUI/CLI automation, and agent workflows. Early benchmarks and demos look promising, but third-party validation will be key.

June 1, 2026

•

Qwen

TL;DR

Qwen3.7-Plus introduced: Multimodal agent model combining vision and language; API access via Alibaba Cloud Model Studio
Agent focus: “Multimodal interactive hybrid agent” for GUI/CLI tasks, coding agent, productivity assistant, visual agent
Claimed capabilities: Complex visual understanding, visual reasoning, grounding, tool use, task execution in code/GUI environments
Benchmarks shown: Terminal-Bench 2.0 70.3; SWE-bench Multilingual 75.8; SWE-bench Pro 57.6
More scores: ScreenSpot Pro 79.0; BFCLv4 72.9; RealWorldQA 86.9; compared against multiple named models
Positioning and feedback: API-first release; requests for open weights/variants; concerns about UI misreads and added failure modes

Alibaba’s Qwen team introduced Qwen3.7-Plus on Monday as what it calls a “multimodal agent model” that brings vision and language into a single system, with API access through Alibaba Cloud Model Studio. In the launch post, the company describes the model as a “multimodal interactive hybrid agent” for GUI and CLI tasks, a coding agent and productivity assistant, and a “visual agent” for perception, reasoning, grounding, and search-augmented QA.

Qwen’s own text around the release claims that the model’s multimodal gains go beyond visual recognition and extend to “understanding complex visual inputs,” “reasoning over visual information,” “using tools to solve problems,” and task execution in code or GUI environments. The post also points to demos for a “multimodal interactive hybrid agent” and a browser agent.

Reaction under the post quickly turned to access. Several commenters asked for open weights, smaller variants, or a Hugging Face link, while others asked how the new model compares with Qwen 3.7-Max and whether the benchmark set captures more demanding forms of multimodality. One commenter also suggested that vision-based agents still misread UI elements and that adding multimodal input could add failure modes as well as capability.

Source: X post

Continue the conversation on Slack

Did this article spark your interest? Join our community of experts and enthusiasts to dive deeper, ask questions, and share your ideas.

Join our community

OpenRouter adds Alibaba’s Qwen3.7-Max with prompt caching

OpenRouter has just rolled out Alibaba’s Qwen3.7-Max, positioning it as the flagship Qwen3.7 model for agent-centric coding, productivity, and long-horizon execution. The launch highlights claimed benchmark gains over Qwen3.6 and explicit prompt caching, as users press for more proof.

May 22, 2026

1 shared tag

Hugging Face cofounder touts Qwen 27B on MacBook Pro

Hugging Face cofounder Julien Chaumond says running Qwen3.6 27B locally via Llama.cpp in Pi felt “pretty magical,” nearing Claude Opus for real code tasks. Replies quickly honed in on RAM, speed, and battery life trade-offs.

Apr 24, 2026

1 shared tag

Alibaba previews Qwen3.6-Max

Alibaba’s Qwen team has unveiled Qwen3.6-Max-Preview as an early look at its next flagship model. The pitch: stronger agentic coding, improved instruction following, and better “real-world” reliability—alongside hints that more Qwen3.6 models are coming.

Apr 20, 2026

1 shared tag

Continue the conversation on Slack

Related Articles

OpenRouter adds Alibaba’s Qwen3.7-Max with prompt caching

Hugging Face cofounder touts Qwen 27B on MacBook Pro

Alibaba previews Qwen3.6-Max