Your agents are choking on their own words—15× more tokens per task—until a 12-billion-parameter slice of a 120-billion-parameter giant promises to remember everything without bankrupting you.
NVIDIA calls the fix Nemotron 3 Super. The model is free to download under an open-weights license and ships as an NVIDIA NIM microservice available today from NVIDIA's registry, Hugging Face, Perplexity, or OpenRouter.
The pitch: a one-million-token context window, 5× the throughput, and double the accuracy of its predecessor. The hitch? No one will tell you what the cloud API costs.
Multi-agent workflows generate up to 15× more tokens than standard chat because each interaction requires resending full histories, including tool outputs and intermediate reasoning.
That 15× multiplier lands directly on your cloud bill—whatever API you're currently using, every agent round-trip costs fifteen times what a single chat call does. Nemotron 3 Super's undisclosed per-token pricing means teams cannot yet model whether the switch pays off.
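The arithmetic behind that multiplier is easy to sketch. The numbers below are illustrative assumptions, not NVIDIA figures: an agent loop that resends its full history (prompt, prior completions, tool outputs) on every round bills tokens quadratically, while a stateless chat call bills once.

```python
# Illustrative sketch of agent-loop token billing.
# All token counts are assumptions chosen for the example,
# not measurements or NVIDIA-published figures.

def chat_tokens(prompt: int, completion: int) -> int:
    """Tokens billed for one stateless chat call."""
    return prompt + completion

def agent_tokens(rounds: int, prompt: int, completion: int,
                 tool_output: int) -> int:
    """Tokens billed across an agent loop that resends the full
    history (prompt + all prior completions and tool outputs)
    as input on every round."""
    total = 0
    history = prompt
    for _ in range(rounds):
        total += history + completion          # input + output billed this round
        history += completion + tool_output    # history grows every round
    return total

single = chat_tokens(prompt=500, completion=500)           # 1,000 tokens
agent = agent_tokens(rounds=6, prompt=500, completion=500,
                     tool_output=200)                      # 16,500 tokens
print(agent / single)  # ~16x under these assumptions
```

With just six rounds and modest tool output, the loop already bills over fifteen times the tokens of a single call, and the multiplier keeps climbing with every additional round, because each round re-pays for all the history before it.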
Enterprise logos (Amdocs, Palantir, Cadence, Dassault, Siemens) dot the slide deck, while AI-native shops like Perplexity, CodeRabbit, and Factory already run the NIM in production.
The model runs in NVFP4 on Blackwell, up to 4× faster than FP8 on NVIDIA Hopper "with no loss in accuracy," according to the vendor. That's a big claim until independent benchmarks show up.
Bottom line: the weights are free, the NIM container is ready to deploy, but the meter on your cloud bill is still a black box. Until NVIDIA posts dollars per million tokens, mid-sized SaaS teams live in the same guessing game—just with longer memory.
Source: NVIDIA