
One-Million-Token Memory or Million-Dollar Bill? Inside NVIDIA’s Token-Guzzling Agent Fix

15× token bloat meets 1 M-context Nemotron 3 Super—free weights, undisclosed API price.

Your agents are choking on their own words—15× more tokens per task—until a 12-billion-parameter slice of a 120-billion-parameter giant promises to remember everything without bankrupting you.

NVIDIA calls the fix Nemotron 3 Super. The model is free to download under an open-weights license and ships as an NVIDIA NIM microservice available today from NVIDIA's registry, Hugging Face, Perplexity, or OpenRouter.

The pitch: one-million-token context, 5× the throughput, and double the accuracy of its predecessor. The hitch? No one will tell you what the cloud API costs.

Multi-agent workflows generate up to 15× more tokens than standard chat because each interaction requires resending full histories, including tool outputs and intermediate reasoning.
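Resending the full history every round means cumulative tokens grow with the square of the turn count, which is how the multiplier climbs so fast. A toy sketch of that growth (turn counts and token sizes are illustrative assumptions, not measurements from the announcement):

```python
# Toy model: cumulative prompt tokens when every agent round-trip
# resends the entire conversation history, tool outputs included.
# All token counts here are illustrative assumptions.

def cumulative_tokens(rounds: int, tokens_per_turn: int) -> int:
    """Round i must resend all i prior turns; summing 1..rounds
    gives quadratic growth in total tokens sent."""
    return sum(i * tokens_per_turn for i in range(1, rounds + 1))

single_chat = 1_000  # one plain chat call, assumed ~1k tokens
agent_run = cumulative_tokens(rounds=10, tokens_per_turn=500)

print(agent_run)               # 27500 tokens for a 10-round agent task
print(agent_run / single_chat) # 27.5x a single chat call
```

Even this crude model lands in the same order of magnitude as the 15× figure: the multiplier is driven by round count, not model choice.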

That 15× multiplier lands directly on your cloud bill—whatever API you're currently using, every agent round-trip costs fifteen times what a single chat call does. Nemotron 3 Super's undisclosed per-token pricing means teams cannot yet model whether the switch pays off.
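To see why the missing number matters, here is a minimal budget sketch. Every price, task count, and token size below is a placeholder assumption, since NVIDIA has published no per-token rate:

```python
# Placeholder budget model. The candidate prices are pure assumptions;
# Nemotron 3 Super's actual API pricing is undisclosed.

AGENT_BLOAT = 15            # the reported multi-agent token multiplier
CHAT_TOKENS = 2_000         # assumed tokens for one plain chat task
TASKS_PER_MONTH = 100_000   # assumed team workload

def monthly_bill(usd_per_million_tokens: float) -> float:
    """Monthly spend once the agent multiplier is applied."""
    total_tokens = TASKS_PER_MONTH * CHAT_TOKENS * AGENT_BLOAT
    return total_tokens / 1e6 * usd_per_million_tokens

# At this workload, every $1 per million tokens swings the bill by
# $3,000/month -- which is why an undisclosed price blocks any
# switch-or-stay decision.
for price in (0.50, 2.00, 5.00):
    print(f"${price:.2f}/Mtok -> ${monthly_bill(price):,.0f}/month")
```

The model is trivial on purpose: the unknown price is the only free variable, and it dominates the outcome.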

Enterprise names Amdocs, Palantir, Cadence, Dassault, and Siemens dot the slide deck, while AI-native shops like Perplexity, CodeRabbit, and Factory already run the NIM in production.

The model runs in NVFP4 precision on Blackwell, up to 4× faster than FP8 on NVIDIA Hopper “with no loss in accuracy,” according to the vendor. That’s a big claim to take on faith until independent benchmarks show up.

Bottom line: the weights are free, the NIM container is ready to deploy, but the meter on your cloud bill is still a black box. Until NVIDIA posts dollars per million tokens, mid-sized SaaS teams live in the same guessing game—just with longer memory.

Source: NVIDIA