Mistral Small 4 unifies chat, reasoning, code, and vision

Mistral has just rolled out Mistral Small 4, an Apache 2.0 MoE model that combines fast chat, deep reasoning, multimodal input, and coding-agent behavior in one checkpoint. It also adds a 256K context window and a new reasoning_effort toggle for speed vs depth.


TL;DR

  • Mistral Small 4: Single checkpoint for chat, coding, agentic tasks, reasoning, and native image inputs; Apache 2.0
  • Runtime control: reasoning_effort tunes latency vs depth; "none" for fast chat, "high" for step-by-step reasoning
  • Architecture: MoE with 128 experts (4 active per token), 119B total parameters (6B active per token; 8B incl. embeddings/output), 256K context
  • Efficiency claims vs Small 3: 40% lower completion time in a latency-optimized setup; 3× more requests/second in a throughput-optimized setup
  • Benchmarks focus: Shorter outputs for similar accuracy; claims match/surpass GPT-OSS 120B on LCR, LiveCodeBench, AIME 2025

Mistral Small 4 is Mistral’s latest release in the Small family, and it’s aiming squarely at a common pain point in modern model lineups: switching between “fast chat,” “serious reasoning,” “multimodal,” and “coding agent” variants depending on the task. Small 4 rolls those into a single model—pulling capabilities Mistral associates with Magistral (reasoning), Pixtral (multimodal), and Devstral (agentic coding)—and ships under the Apache 2.0 license.

One model, multiple modes (without switching checkpoints)

Mistral describes Small 4 as a hybrid model optimized for general chat, coding, agentic tasks, and complex reasoning, with native text and image inputs. The point isn’t just breadth; it’s making those behaviors selectable at runtime.

The notable control here is configurable reasoning effort. A new parameter, reasoning_effort, is intended to let teams trade latency for depth depending on context:

  • reasoning_effort="none" targets fast, lightweight responses (positioned as equivalent in chat style to Mistral Small 3.2)
  • reasoning_effort="high" is aimed at deeper, step-by-step reasoning, with verbosity described as comparable to earlier Magistral models
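The toggle above can be sketched as a request payload. This is an illustration only: the model ID, payload shape, and the exact parameter name/values are assumptions here, not confirmed API details — check Mistral's API documentation before relying on them.

```python
def build_chat_request(prompt: str, reasoning_effort: str = "none") -> dict:
    """Build a hypothetical chat-completions payload with the reasoning toggle.

    NOTE: "mistral-small-4" is a placeholder model ID and the payload shape
    is assumed for illustration; consult the official API reference.
    """
    return {
        "model": "mistral-small-4",  # hypothetical model ID
        "messages": [{"role": "user", "content": prompt}],
        # "none" = fast, lightweight chat; "high" = deeper step-by-step reasoning
        "reasoning_effort": reasoning_effort,
    }

# Same checkpoint, two behaviors, selected per request:
fast = build_chat_request("Summarize this document.")
deep = build_chat_request("Prove this carefully.", reasoning_effort="high")
```

The practical upshot is that routing logic lives in the request, not in a model-picker: a latency-sensitive chat path and a reasoning-heavy path can share one deployment.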

Architecture and scale: MoE with a long context window

Under the hood, Small 4 is a MoE model with:

  • 128 experts, with 4 active per token
  • 119B total parameters, with 6B active parameters per token (or 8B including embedding and output layers)
  • A 256K context window for long-form interactions and document analysis

This combination—large total capacity with a smaller active slice—fits the model’s emphasis on efficiency while keeping headroom for harder tasks.
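The back-of-envelope arithmetic on those published figures makes the "small active slice" concrete:

```python
# Figures as reported in the post.
total_experts = 128
active_experts = 4
total_params_b = 119   # total parameters, billions
active_params_b = 6    # active per token, billions (8 incl. embeddings/output)

expert_fraction = active_experts / total_experts    # 4/128 = 0.03125
active_fraction = active_params_b / total_params_b  # 6/119 ≈ 0.050

print(f"{expert_fraction:.1%} of experts active per token")        # 3.1%
print(f"~{active_fraction:.1%} of total params active per token")  # ~5.0%
```

In other words, each token is processed by roughly 5% of the network's weights, which is where the latency and throughput story in the next section comes from.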

Efficiency claims: lower latency, higher throughput

Compared to Mistral Small 3, Mistral reports:

  • 40% reduction in end-to-end completion time in a latency-optimized setup
  • 3× more requests per second in a throughput-optimized setup

That framing is practical for teams that care less about a single benchmark score and more about serving behavior under load.
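Translating those two claims into per-request and fleet-level terms (the baseline requests/second below is a made-up number, used only to show the multiplier):

```python
# What the reported numbers imply, taken at face value.
latency_reduction = 0.40                          # 40% lower completion time
per_request_speedup = 1 / (1 - latency_reduction)  # ≈ 1.67× faster per request

baseline_rps = 10.0          # hypothetical Small 3 throughput, requests/sec
small4_rps = baseline_rps * 3  # "3× more requests per second"
```

Note the two figures come from different setups (latency-optimized vs throughput-optimized), so they describe different serving configurations rather than one combined gain.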

Benchmarks, with an emphasis on “performance per token”

Beyond raw scores, Mistral leans into shorter outputs for comparable accuracy. The post claims Small 4 (with reasoning enabled) matches or surpasses GPT-OSS 120B on LCR, LiveCodeBench, and AIME 2025 while generating shorter responses. As a concrete example, on AA LCR it reports 0.72 accuracy at 1.6K characters of output, where Qwen models reportedly need 3.5–4× more (5.8–6.1K characters) for similar performance.

The argument is straightforward: fewer generated tokens can mean lower latency and reduced inference cost, especially when long-form reasoning is common.
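Checking the reported AA LCR figures bears out the "3.5–4×" framing, under the simplifying assumption that inference cost scales roughly linearly with generated length (characters standing in for tokens):

```python
# Reported output lengths on AA LCR at similar accuracy.
small4_chars = 1_600
qwen_chars_low, qwen_chars_high = 5_800, 6_100

ratio_low = qwen_chars_low / small4_chars    # 3.625
ratio_high = qwen_chars_high / small4_chars  # 3.8125

# Under the linear-cost assumption, the longer outputs cost ~3.6–3.8× more
# per response for the same answer quality.
```

Character counts are an imperfect proxy for tokens across tokenizers, so treat the ratio as directional rather than a billing estimate.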

Deployment: open-source, but tuned for NVIDIA stacks

Mistral positions Small 4 as runnable with relatively constrained enterprise infrastructure, listing minimum and recommended configurations:

  • Minimum: 4× NVIDIA HGX H100, 2× NVIDIA HGX H200, or 1× NVIDIA DGX B200
  • Recommended: 4× HGX H100, 4× HGX H200, or 2× DGX B200

On the software side, it’s described as available via community and common inference frameworks including vLLM, llama.cpp, SGLang, and Transformers, with inference optimizations called out for vLLM and SGLang through collaboration with NVIDIA. Mistral also notes it has joined the NVIDIA Nemotron Coalition as a founding member.
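Since vLLM is one of the listed frameworks, serving would look something like the sketch below. The model ID is a placeholder (substitute the actual Hugging Face repo name), and the parallelism degree should match your GPU topology; this is not an official launch command.

```shell
# Sketch: serving via vLLM's OpenAI-compatible server.
# <huggingface-model-id> is a placeholder for the real checkpoint name.
vllm serve <huggingface-model-id> \
  --tensor-parallel-size 8 \
  --max-model-len 262144   # 256K context window
```

The long context is the main sizing consideration here: the KV cache for a 256K-token sequence dominates memory planning, which is consistent with the multi-GPU minimum configurations listed above.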

Availability: API, AI Studio, Hugging Face, and NVIDIA NIM

Small 4 is available through several channels: Mistral's API, AI Studio, Hugging Face, and NVIDIA NIM.

Source: https://mistral.ai/news/mistral-small-4
