Mistral Small 4 is Mistral’s latest release in the Small family, and it’s aiming squarely at a common pain point in modern model lineups: switching between “fast chat,” “serious reasoning,” “multimodal,” and “coding agent” variants depending on the task. Small 4 rolls those into a single model—pulling capabilities Mistral associates with Magistral (reasoning), Pixtral (multimodal), and Devstral (agentic coding)—and ships under the Apache 2.0 license.
One model, multiple modes (without switching checkpoints)
Mistral describes Small 4 as a hybrid model optimized for general chat, coding, agentic tasks, and complex reasoning, with native text and image inputs. The point isn’t just breadth; it’s making those behaviors selectable at runtime.
The notable control here is configurable reasoning effort. A new parameter, reasoning_effort, is intended to let teams trade latency for depth depending on context:
- reasoning_effort="none" targets fast, lightweight responses (positioned as equivalent in chat style to Mistral Small 3.2)
- reasoning_effort="high" is aimed at deeper, step-by-step reasoning, with verbosity described as comparable to earlier Magistral models
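In practice, the parameter would presumably ride along with an ordinary chat-completion request. Here is a minimal sketch of building such a request body; the model identifier and the exact payload shape are assumptions for illustration, not taken from Mistral's documentation:

```python
# Sketch: a chat-completion payload that toggles reasoning depth via
# reasoning_effort. The model id and payload shape are illustrative
# assumptions, not confirmed API details.
def build_request(prompt: str, effort: str = "none") -> dict:
    if effort not in {"none", "high"}:
        raise ValueError(f"unsupported reasoning_effort: {effort!r}")
    return {
        "model": "mistral-small-4",  # hypothetical identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # "none" = fast chat, "high" = deep reasoning
    }

fast = build_request("Summarize this paragraph.")              # latency-optimized
deep = build_request("Prove the claim step by step.", "high")  # depth-optimized
```

The appeal of a single runtime knob is that routing logic lives in application code rather than in a registry of separate checkpoints.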
Architecture and scale: MoE with a long context window
Under the hood, Small 4 is a mixture-of-experts (MoE) model with:
- 128 experts, with 4 active per token
- 119B total parameters, with 6B active parameters per token (or 8B including embedding and output layers)
- A 256k context window for long-form interactions and document analysis
This combination—large total capacity with a smaller active slice—fits the model’s emphasis on efficiency while keeping headroom for harder tasks.
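The figures above imply that only a small slice of the network fires per token. A quick back-of-the-envelope check, using only the numbers quoted in the list:

```python
# Back-of-the-envelope: what fraction of Small 4's parameters are
# active per token, from the figures quoted above.
total_b = 119          # total parameters, in billions
active_b = 6           # active per token (expert weights only)
active_with_io_b = 8   # including embedding and output layers

experts_total, experts_active = 128, 4

print(f"active weight slice: {active_b / total_b:.1%}")  # roughly 5% per token
print(f"experts used: {experts_active / experts_total:.1%} of the pool")
```

So each token pays the compute cost of a ~6B model while the full 119B of capacity remains available for the router to draw on.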
Efficiency claims: lower latency, higher throughput
Compared to Mistral Small 3, Mistral reports:
- 40% reduction in end-to-end completion time in a latency-optimized setup
- 3× more requests per second in a throughput-optimized setup
That framing is practical for teams that care less about a single benchmark score and more about serving behavior under load.
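Taken at face value, the two multipliers compound for capacity planning. A toy estimate using only the reported deltas (the baseline figures below are invented purely for illustration):

```python
# Toy serving estimate from the reported deltas vs Mistral Small 3:
# 40% lower end-to-end completion time, 3x requests per second.
# Baseline numbers are made up for illustration only.
baseline_latency_s = 2.0   # hypothetical Small 3 completion time
baseline_rps = 10.0        # hypothetical Small 3 throughput

small4_latency_s = baseline_latency_s * (1 - 0.40)  # 40% reduction
small4_rps = baseline_rps * 3                       # 3x throughput
```

The same hardware footprint would, under these claims, serve three times the request volume while each individual response also completes sooner.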
Benchmarks, with an emphasis on “performance per token”
Beyond raw scores, Mistral leans into shorter outputs at comparable accuracy. The post claims Small 4 (with reasoning enabled) matches or surpasses GPT-OSS 120B on AA LCR, LiveCodeBench, and AIME 2025 while generating shorter responses. As one concrete example, on AA LCR it reports 0.72 accuracy with 1.6K characters of output, while Qwen models reportedly need 3.5–4× more output length (5.8–6.1K characters) for similar performance.
The argument is straightforward: fewer generated tokens can mean lower latency and reduced inference cost, especially when long-form reasoning is common.
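The "performance per token" framing can be made concrete with the AA LCR numbers quoted above. One assumption here: the Qwen models are treated as reaching roughly the same 0.72 accuracy, since the post only says "similar performance":

```python
# Accuracy per 1K output characters on AA LCR, from the figures above.
# Assumes Qwen lands near the same 0.72 accuracy ("similar performance")
# at its longer output lengths.
small4_acc, small4_kchars = 0.72, 1.6
qwen_kchars_low, qwen_kchars_high = 5.8, 6.1

small4_density = small4_acc / small4_kchars   # accuracy per 1K chars
qwen_density = small4_acc / qwen_kchars_high  # same accuracy, more output

length_ratio = (qwen_kchars_low / small4_kchars,
                qwen_kchars_high / small4_kchars)  # the claimed 3.5-4x range
```

The 5.8–6.1K figures do work out to roughly 3.6–3.8× Small 4's 1.6K, consistent with the post's "3.5–4×" characterization.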
Deployment: open-source, but tuned for NVIDIA stacks
Mistral positions Small 4 as runnable on relatively constrained enterprise infrastructure, listing minimum and recommended configurations:
- Minimum: 4× NVIDIA HGX H100, 2× NVIDIA HGX H200, or 1× NVIDIA DGX B200
- Recommended: 4× HGX H100, 4× HGX H200, or 2× DGX B200
On the software side, it’s described as available through common community inference frameworks, including vLLM, llama.cpp, SGLang, and Transformers, with inference optimizations for vLLM and SGLang developed in collaboration with NVIDIA. Mistral also notes it has joined the NVIDIA Nemotron Coalition as a founding member.
Availability: API, AI Studio, Hugging Face, and NVIDIA NIM
Small 4 is available through several channels:
- Mistral API and AI Studio
- Hugging Face Repository
- build.nvidia.com for prototyping on NVIDIA accelerated computing
- Day-0 production availability as an NVIDIA NIM, with customization via NVIDIA NeMo for domain-specific fine-tuning
- Technical documentation via Mistral’s AI Governance Hub
