NVIDIA ships Nemotron 3 Ultra: open 550B MoE for agents

NVIDIA has rolled out Nemotron 3 Ultra, an open 550B-parameter mixture-of-experts (MoE) model built for long-running agents. NVIDIA claims 5x faster inference and up to 30% lower costs on complex agentic workloads compared with other open frontier models, and says the weights, data, and recipes are fully open.

TL;DR

  • NVIDIA ships Nemotron 3 Ultra: a 550B MoE open model designed for long-running agents
  • Claimed 5× faster inference and up to 30% lower cost for complex agentic tasks versus other open frontier models
  • Target workloads: coding and deep research; supports planning, tool use, failure recovery, next-step decisions
  • Hybrid Mamba-Transformer MoE architecture; positioned for tool-heavy agent workflows, not simple chat
  • Benchmarks reported: RULER @ 1M 95%, PinchBench 91%, Terminal-Bench 2.0 54%, EnterpriseOps-Gym 33%
  • Fully open release: weights, synthetic data, post-training recipes; on Hugging Face; post-trained with openclaw, Hermes Agent, LangChain

NVIDIA AI posted on X that it is shipping Nemotron 3 Ultra, a “550B MoE frontier-intelligence open model” built for long-running agents. The company claims the model delivers “5x faster inference” and can lower the cost of complex agentic tasks by up to “30%” versus other open frontier models.

In the launch thread, NVIDIA AI states that Ultra is aimed at workloads such as coding and deep research, where agents spend time planning, using tools, recovering from failures, and deciding what to do next. The company attributes the model’s efficiency to a hybrid Mamba-Transformer MoE architecture, which it says allows more reasoning cycles within the same time budget.
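The launch post does not detail how Nemotron 3 Ultra's "Latent MoE" layers route tokens, so the snippet below is only a generic illustration of sparse top-k mixture-of-experts routing, the general idea behind MoE efficiency: each token is scored against every expert, but only the k highest-scoring experts are actually evaluated, so far fewer parameters are active per token than the model's total count. It is not NVIDIA's implementation, it omits the Mamba (state-space) layers entirely, and every name in it (`moe_forward`, `gate_weights`, the toy scaling experts) is hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    token: Vector,
    gate_weights: Sequence[Vector],
    experts: Sequence[Callable[[Vector], Vector]],
    k: int = 2,
) -> Vector:
    """Hypothetical sketch: route one token to its top-k experts and mix their outputs."""
    # Gate score per expert: dot product of the token with that expert's gate vector.
    logits = [sum(t * w for t, w in zip(token, g)) for g in gate_weights]
    # Keep only the k highest-scoring experts (sparse activation).
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize the gate scores over the selected experts only.
    weights = softmax([logits[i] for i in top])
    out = [0.0] * len(token)
    for w, i in zip(weights, top):
        y = experts[i](token)  # only the k selected experts are ever evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out


# Toy usage: four "experts" that simply scale their input by 1, 2, 3 and 4.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
out = moe_forward([1.0, 0.0], gate_weights, experts, k=2)
print([round(v, 3) for v in out])  # → [3.462, 0.0]
```

Here the gate scores are 1, 2, 0 and 3, so only the experts scaling by 4 and 2 run, weighted by a softmax over their two scores; the other two experts cost nothing for this token. Real MoE layers use learned gates, feed-forward experts, and load-balancing terms that this sketch leaves out.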

NVIDIA also published comparison visuals that place Nemotron 3 Ultra alongside GLM 5.1, Kimi K2.6 and Qwen3.5 across several agentic benchmarks. In the table shown in the post, Nemotron 3 Ultra is listed at 91% on PinchBench for agent productivity, 33% on EnterpriseOps-Gym for long-horizon planning, 54% on Terminal-Bench 2.0 for coding, 82% on IFBench, 1,448 on GDPval-AA, 56% on ProfBench (Search), and 95% on RULER @ 1M for long context. The same table shows some rivals with higher scores on selected rows, so the company’s broader “leading accuracy” claim appears to depend on the benchmark.

The images also show a “Nemotron 3 - Hybrid Mamba Transformer Latent MoE” architecture diagram and a separate agent workflow schematic, suggesting the model is being positioned for tool-heavy, long-running systems rather than simple chat. Another chart compares accuracy and “Relative Throughput (Output tokens/s/GPU),” with visible labels including 5.9, 1.0, 1.2 and 3.7.

NVIDIA AI further mentions that Ultra was post-trained for agent harnesses including openclaw, NousResearch Hermes Agent, and LangChain. The company also states that Nemotron 3 Ultra is “fully open,” including model weights, synthetic data and post-training recipes, and that it is available on Hugging Face.

The launch drew quick responses from other accounts, including Glean, which posted that the model is “coming soon” to its enterprise stack, and Unsloth AI, which said it had uploaded Dynamic GGUF files for local use.

Source: NVIDIA AI on X
