Google has released DiffusionGemma, an experimental open model under Apache 2.0 that the company describes as a text-diffusion system designed to generate text in parallel rather than token by token. Google claims the 26B-parameter Mixture of Experts (MoE) model can reach "up to 4x faster" inference on dedicated GPUs, while changing the usual trade-off between speed and output quality.
The model activates only 3.8B parameters during inference, according to Google, and is positioned to fit within 18GB of VRAM when quantized. The company also reports throughput of "1000+ tokens per second" on a single NVIDIA H100 and "700+ tokens per second" on an NVIDIA GeForce RTX 5090. Those figures are tied to dedicated hardware and should be treated as vendor-provided benchmarks.
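As a rough check on the memory claim, the back-of-the-envelope calculation below estimates weight storage at a few bit widths. It assumes that all 26B parameters must be resident in memory even though only 3.8B are active per token, which is typical for MoE models, and it ignores activations, caches, and runtime overhead. The 18GB figure is not tied to a specific quantization format here, so the bit widths are illustrative, and `weight_memory_gb` is a hypothetical helper rather than anything from Google's tooling.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # total parameter count cited by Google
ACTIVE_PARAMS = 3.8e9  # parameters active during inference, per Google

# Bit widths are illustrative assumptions, not formats confirmed for this model.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")

# Active parameters drive per-token compute; total parameters drive memory.
print(f"active share of parameters: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions, 4-bit weights come to roughly 13GB, which would leave headroom inside an 18GB budget, while 8-bit weights at roughly 26GB would not fit.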
Google says DiffusionGemma builds on its Gemma 4 family and Gemini Diffusion research, using a diffusion head intended to increase generation speed. The model generates 256-token blocks in parallel, which gives every token access to the rest of the block through bi-directional attention. Google suggests that makes the system useful for speed-sensitive workflows such as in-line editing, code infilling, and rapid iteration, as well as for work involving amino acid sequences and mathematical graphs.
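The difference between token-by-token and block-parallel generation can be pictured as a difference in attention masks. The sketch below is a minimal illustration, not Google's implementation: it assumes that tokens in a block can see each other and all earlier blocks, which is one common design and is not confirmed for DiffusionGemma. The function names and the tiny block size are chosen for readability only.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n: int, block_size: int) -> list[list[bool]]:
    """Block mask: position i may attend to every position in its own block
    (including later ones) and to all earlier blocks.

    Attending to earlier blocks is an assumption made for this sketch.
    """
    return [[j // block_size <= i // block_size for j in range(n)]
            for i in range(n)]


def show(mask: list[list[bool]]) -> None:
    """Print one row per query position; '#' marks a visible key position."""
    for row in mask:
        print("".join("#" if visible else "." for visible in row))


# A 6-token sequence with blocks of 3 stands in for 256-token blocks.
show(causal_mask(6))
print()
show(block_bidirectional_mask(6, block_size=3))
```

In the causal mask the first token sees only itself, while in the block mask it also sees the tokens that follow it inside the same block.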
There is a clear trade-off, at least as Google presents it. The company states that DiffusionGemma’s output quality is lower than that of standard Gemma 4 models, and recommends Gemma 4 for applications that require the best possible output rather than speed. Google also notes that the parallel-decoding approach appears best suited to low- and medium-batch local inference, while offering less of an advantage in high-QPS (queries per second) cloud serving.
The company highlights a few examples of where the model’s structure may help. In one case, Unsloth fine-tuned DiffusionGemma to solve Sudoku, a task that autoregressive models reportedly struggle with because the correct value for an early cell depends on tokens that have not been generated yet. Google also points to iterative refinement, where the model starts with placeholder tokens and repeatedly updates them until the text converges.
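The idea of starting from placeholders and refining in parallel can be shown with a toy example. The sketch below is not DiffusionGemma's algorithm: it replaces the learned model with a hand-written rule on a 4x4 Sudoku, filling every empty cell that has exactly one legal value. `MASK`, `candidates`, `refine`, and the puzzle itself are invented for illustration.

```python
MASK = 0  # placeholder standing in for a masked token

PUZZLE = [
    [1, 0, 0, 4],
    [0, 4, 1, 0],
    [2, 0, 4, 0],
    [0, 3, 0, 1],
]


def candidates(grid, r, c):
    """Values 1-4 not already used in the cell's row, column, or 2x2 box."""
    used = set(grid[r]) | {grid[i][c] for i in range(4)}
    br, bc = 2 * (r // 2), 2 * (c // 2)
    used |= {grid[i][j] for i in range(br, br + 2) for j in range(bc, bc + 2)}
    return [v for v in range(1, 5) if v not in used]


def refine(grid, max_steps=16):
    """Fill placeholders over several passes instead of left to right.

    Each pass inspects every placeholder against the same snapshot and then
    commits all cells that have exactly one candidate, so updates within a
    pass happen in parallel. The loop stops when a pass changes nothing.
    """
    grid = [row[:] for row in grid]
    for _ in range(max_steps):
        updates = {}
        for r in range(4):
            for c in range(4):
                if grid[r][c] == MASK:
                    options = candidates(grid, r, c)
                    if len(options) == 1:
                        updates[(r, c)] = options[0]
        if not updates:
            break
        for (r, c), value in updates.items():
            grid[r][c] = value
    return grid


for row in refine(PUZZLE):
    print(row)
```

Several cells in this puzzle cannot be settled on the first pass and only resolve once cells later in reading order have been filled, which is the kind of dependency a strict left-to-right generator cannot exploit.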
For developers interested in trying the model, Google says the weights are available on Hugging Face, alongside a developer guide and "A Visual Guide to DiffusionGemma." The company also lists support paths through MLX, vLLM, Hugging Face Transformers, Hackable Diffusion, NVIDIA NeMo, and, soon, llama.cpp. Google further says it worked with NVIDIA on optimized support for consumer and enterprise hardware, including NVFP4 acceleration, with cloud deployment options through Gemini Enterprise Agent Platform Model Garden and NVIDIA NIM.
Source: Google Blog

