NVIDIA ships Nemotron 3 Ultra: open 550B MoE for agents

NVIDIA has just rolled out Nemotron 3 Ultra, a 550B MoE open model built for long-running agents. It promises 5x faster inference and up to 30% lower costs on complex agentic workloads. NVIDIA says weights, data, and recipes are fully open.

June 5, 2026

LLM NVIDIA

TL;DR

NVIDIA is shipping Nemotron 3 Ultra, a 550B MoE open model designed for long-running agents
Claimed 5x faster inference and up to 30% lower cost for complex agentic tasks versus other open frontier models
Target workloads: coding and deep research; supports planning, tool use, failure recovery, next-step decisions
Hybrid Mamba-Transformer MoE architecture; positioned for tool-heavy agent workflows, not simple chat
Benchmarks reported: RULER @ 1M 95%, PinchBench 91%, Terminal-Bench 2.0 54%, EnterpriseOps-Gym 33%
Fully open release: weights, synthetic data, post-training recipes; available on Hugging Face; post-trained for agent harnesses including openclaw, Hermes Agent, and LangChain

NVIDIA AI posted on X that it is shipping Nemotron 3 Ultra, a “550B MoE frontier-intelligence open model” built for long-running agents. The company claims the model delivers “5x faster inference” and can lower the cost of complex agentic tasks by up to “30%” versus other open frontier models.

In the launch thread, NVIDIA AI states that Ultra is aimed at workloads such as coding and deep research, where agents spend time planning, using tools, recovering from failures, and deciding what to do next. The company attributes the model's efficiency to a hybrid Mamba-Transformer MoE architecture, which it says allows more reasoning cycles within the same time budget.
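To put the "more reasoning cycles within the same time budget" idea in concrete terms, here is a back-of-the-envelope sketch. Only the 5x speedup and the "up to 30%" cost reduction come from NVIDIA's post; the baseline throughput, tokens per cycle, time budget, and baseline task cost are made-up illustrative numbers. The sketch also assumes the speedup applies uniformly to token generation and ignores time spent waiting on tools.

```python
def cycles_in_budget(tokens_per_second: float, tokens_per_cycle: int,
                     budget_seconds: float) -> int:
    """Number of complete reasoning cycles that fit in a fixed time budget."""
    return int(tokens_per_second * budget_seconds // tokens_per_cycle)


# Hypothetical baseline figures, chosen only to make the arithmetic easy.
BASELINE_TPS = 40.0          # output tokens per second (assumed)
TOKENS_PER_CYCLE = 2_000     # tokens generated per plan/act/observe cycle (assumed)
BUDGET_SECONDS = 600.0       # ten-minute task budget (assumed)
BASELINE_TASK_COST = 10.0    # cost of one task, arbitrary units (assumed)

# Figures claimed in NVIDIA's post.
CLAIMED_SPEEDUP = 5.0
CLAIMED_MAX_COST_REDUCTION = 0.30

baseline_cycles = cycles_in_budget(BASELINE_TPS, TOKENS_PER_CYCLE, BUDGET_SECONDS)
fast_cycles = cycles_in_budget(BASELINE_TPS * CLAIMED_SPEEDUP, TOKENS_PER_CYCLE,
                               BUDGET_SECONDS)
best_case_cost = BASELINE_TASK_COST * (1 - CLAIMED_MAX_COST_REDUCTION)

print(f"cycles at baseline: {baseline_cycles}")
print(f"cycles at 5x throughput: {fast_cycles}")
print(f"best-case task cost: {best_case_cost:.2f}")
```

Under these assumptions the agent completes five times as many cycles in the same ten minutes. In practice the gain would be smaller whenever tool calls, rather than token generation, dominate wall-clock time.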

NVIDIA also published comparison visuals that place Nemotron 3 Ultra alongside GLM 5.1, Kimi K2.6 and Qwen3.5 across several agentic benchmarks. In the table shown in the post, Nemotron 3 Ultra is listed at 91% on PinchBench for agent productivity, 33% on EnterpriseOps-Gym for long-horizon planning, 54% on Terminal-Bench 2.0 for coding, 82% on IFBench, 1,448 on GDPval-AA, 56% on ProfBench (Search), and 95% on RULER @ 1M for long context. The same table shows some rivals with higher scores on selected rows, so the company’s broader “leading accuracy” claim appears to depend on the benchmark.

The images also show a “Nemotron 3 - Hybrid Mamba Transformer Latent MoE” architecture diagram and a separate agent workflow schematic, suggesting the model is being positioned for tool-heavy, long-running systems rather than simple chat. Another chart compares accuracy and “Relative Throughput (Output tokens/s/GPU),” with visible labels including 5.9, 1.0, 1.2 and 3.7.
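For readers unfamiliar with what a tool-heavy, long-running agent workflow looks like in code, the following is a minimal sketch of the plan, act, recover, and decide loop described in the thread. It is not NVIDIA's harness and does not call Nemotron 3 Ultra: `stub_policy` is a hypothetical stand-in for the model, and `flaky_search` is an invented tool that fails once so the recovery path is exercised.

```python
from __future__ import annotations

from typing import Callable

_call_counts = {"search": 0}


def flaky_search(query: str) -> str:
    """Hypothetical tool that times out on its first call, then succeeds."""
    _call_counts["search"] += 1
    if _call_counts["search"] == 1:
        raise TimeoutError("search backend timed out")
    return f"results for {query!r}"


TOOLS: dict[str, Callable[[str], str]] = {"search": flaky_search}


def stub_policy(history: list[dict]) -> dict:
    """Stand-in for the model: chooses the next action from the history so far."""
    if not history:
        return {"action": "search", "input": "Nemotron 3 Ultra"}
    last = history[-1]
    if last["status"] == "error":
        # Failure recovery: retry the same tool call with the same input.
        return {"action": last["action"], "input": last["input"]}
    # The last tool call succeeded, so finish with its output.
    return {"action": "finish", "input": last["output"]}


def run_agent(policy, tools, max_steps: int = 5) -> tuple[str | None, list[dict]]:
    """Run the loop until the policy finishes or the step budget runs out."""
    history: list[dict] = []
    for _ in range(max_steps):
        step = policy(history)
        if step["action"] == "finish":
            return step["input"], history
        try:
            output = tools[step["action"]](step["input"])
            history.append({**step, "status": "ok", "output": output})
        except Exception as exc:
            # Record the failure so the policy can react on the next step.
            history.append({**step, "status": "error", "output": str(exc)})
    return None, history


answer, trace = run_agent(stub_policy, TOOLS)
print(answer)
print([step["status"] for step in trace])
```

Each pass through the loop is one model call, which is why per-token throughput matters more for this kind of workload than for single-turn chat.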

NVIDIA AI further mentions that Ultra was post-trained for agent harnesses including openclaw, NousResearch Hermes Agent, and LangChain. The company also states that Nemotron 3 Ultra is “fully open,” including model weights, synthetic data and post-training recipes, and that it is available on Hugging Face.

The launch drew quick responses from other accounts, including Glean, which posted that the model is “coming soon” to its enterprise stack, and Unsloth AI, which said it had uploaded Dynamic GGUF files for local use.

Source: NVIDIA AI on X
