Google’s DiffusionGemma promises 4x faster text generation

Google has just rolled out DiffusionGemma, an Apache 2.0 open model that generates 256-token blocks in parallel for faster inference. The 26B MoE model activates 3.8B parameters at runtime and targets speed-first local workflows, at some cost to output quality.

June 11, 2026

TL;DR

DiffusionGemma: Experimental Apache 2.0 open text-diffusion model, generating text in parallel rather than token-by-token
Model profile: 26B MoE; 3.8B active parameters at inference; fits 18GB VRAM when quantized
Speed claims: Up to 4× faster on dedicated GPUs; vendor benchmarks of 1000+ tok/s on an H100 and 700+ tok/s on an RTX 5090
Architecture: 256-token parallel blocks with bi-directional attention; diffusion head based on Gemma 4 and Gemini Diffusion research
Use cases: In-line editing, code infilling, rapid iteration, amino acid sequences, mathematical graphs; iterative refinement via placeholder updates
Access and support: Weights on Hugging Face; implementations via MLX, vLLM, Transformers, Hackable Diffusion, NeMo, and soon llama.cpp

Google has released DiffusionGemma, an experimental open model under Apache 2.0 that the company describes as a text-diffusion system designed to generate text in parallel rather than token by token. Google claims the 26B Mixture of Experts model can reach "up to 4x faster" inference on dedicated GPUs, trading some output quality for that speed.

The model activates only 3.8B parameters during inference, according to Google, which says it fits within 18GB of VRAM when quantized. The company also reports throughput of "1000+ tokens per second" on a single NVIDIA H100 and "700+ tokens per second" on an NVIDIA GeForce RTX 5090. Both figures were measured on dedicated hardware and should be treated as vendor-provided benchmarks.
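The 18GB figure can be sanity-checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not Google's method. It assumes that all 26B weights stay resident in memory (as is typical for MoE models, since any expert may be selected), and it ignores the KV cache, activations, and quantization metadata. The bit widths are illustrative assumptions; the article does not say which quantization format Google used.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores KV cache, activations, and per-block quantization scales,
    so real usage will be somewhat higher.
    """
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count reported in the article

# Bit widths below are assumptions chosen for illustration.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Under these assumptions only the 4-bit case (about 13 GB of weights) leaves headroom inside an 18GB budget, which is consistent with Google tying the figure to a quantized model.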

Google says DiffusionGemma builds on its Gemma 4 family and Gemini Diffusion research, using a diffusion head intended to increase generation speed. The model generates 256-token blocks in parallel, which gives every token access to the rest of the block through bi-directional attention. Google suggests that makes the system useful for speed-sensitive workflows such as in-line editing, code infilling, rapid iteration, amino acid sequences, and mathematical graphs.
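To illustrate the difference between token-by-token generation and block-wise bi-directional attention, here is a minimal sketch of the two attention masks. It is a conceptual illustration only, not DiffusionGemma's implementation. In particular, letting each block also see all earlier blocks is an assumption; the article only states that tokens see the rest of their own block. The demo uses a block size of 4 instead of 256 to keep the output readable.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n: int, block: int) -> list[list[bool]]:
    """Position i may attend to every position in its own block.

    Assumption for this sketch: it may also attend to all earlier blocks.
    """
    return [[(j // block) <= (i // block) for j in range(n)] for i in range(n)]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for visible positions and '.' for hidden ones."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print("causal:")
print(render(causal_mask(8)))
print("block bi-directional (block size 4):")
print(render(block_bidirectional_mask(8, 4)))
```

In the causal mask the first token sees nothing to its right. In the block mask it sees every other position in its block, which is what lets a diffusion-style model fill in a gap using context on both sides.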

There is a clear trade-off, at least as Google presents it. The company states that DiffusionGemma’s output quality is lower than standard Gemma 4 models, and recommends Gemma 4 for applications that require the best possible output rather than speed. Google also notes that the parallel-decoding approach appears best suited to low- and medium-batch local inference, while offering less advantage in high-QPS cloud serving.

The company highlights a few examples of where the model’s structure may help. In one case, Unsloth fine-tuned DiffusionGemma to solve Sudoku, a task that autoregressive models reportedly struggle with because the right value at one position can depend on tokens that come later in the sequence. Google also points to iterative refinement, where the model starts with placeholder tokens and repeatedly updates them until the text converges.
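The iterative refinement loop can be sketched with a toy example. This is a hypothetical illustration of the general idea, not Google's algorithm. `toy_denoiser` stands in for the model and cheats by reading the target text; its confidence score (higher when neighbouring slots are already filled) is invented for the demo. A real model predicts tokens from learned weights and may also revise tokens it filled in earlier, which this sketch does not do.

```python
MASK = "<mask>"  # placeholder token (name is an assumption for this sketch)


def toy_denoiser(tokens: list[str], target: list[str]) -> dict[int, tuple[str, float]]:
    """Hypothetical stand-in for the model.

    For every masked slot, propose a token and a confidence score.
    It reads the answer from `target`, which a real model cannot do.
    """
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        filled = sum(t != MASK for t in neighbours)
        # More filled neighbours means higher confidence; 1/(i+1) breaks ties.
        proposals[i] = (target[i], filled + 1.0 / (i + 1))
    return proposals


def refine(target: list[str], per_step: int = 2, max_steps: int = 50):
    """Start from all placeholders; commit the most confident slots each step."""
    tokens = [MASK] * len(target)
    steps = 0
    while MASK in tokens and steps < max_steps:
        proposals = toy_denoiser(tokens, target)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:per_step]
        for i in best:
            tokens[i] = proposals[i][0]
        steps += 1
        print(f"step {steps}: {' '.join(tokens)}")
    return tokens, steps


final, n_steps = refine("the quick brown fox jumps over".split())
```

Each pass fills several slots at once rather than one token per forward pass, which is where the parallel speed-up in this family of models comes from.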

For developers interested in trying the model, Google says the weights are available on Hugging Face, alongside a developer guide and A Visual Guide to DiffusionGemma. The company also lists support paths through MLX, vLLM, Hugging Face Transformers, Hackable Diffusion, NVIDIA NeMo, and, soon, llama.cpp. Google further says it worked with NVIDIA on optimized support for consumer and enterprise hardware, including NVFP4 acceleration, with cloud deployment options through Gemini Enterprise Agent Platform Model Garden and NVIDIA NIM.

Source: Google Blog
