NVIDIA Releases Nemotron-Labs Diffusion Language Models — 4-6x Throughput vs Autoregressive

Published 2026-05-23 | Ingested 2026-05-25 | Foundation Models | Medium | ⭐ Timeline Candidate

Summary

NVIDIA's Nemotron-Labs team published its diffusion language models (DLMs), which generate multiple tokens in parallel via iterative denoising over 32-token blocks rather than the sequential token-by-token decoding of autoregressive (AR) models. The Nemotron-Labs Diffusion 8B model achieves approximately 865 tokens/second on B200 hardware, roughly 4x the AR baseline, while improving accuracy by 1.2% over Qwen3 8B. Self-speculation modes push throughput further, to 6-6.4x the AR baseline. The model s
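
The throughput figures in the summary can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes the 4x and 6-6.4x multipliers refer to the same AR baseline on the same B200 hardware, which the summary does not state explicitly. The function and constant names are hypothetical, and because the quoted inputs are approximate, the derived values are ballpark estimates rather than reported results.

```python
# Back-of-the-envelope check of the throughput figures quoted in the summary.
# Assumption (not confirmed by the source): every multiplier is measured
# against the same autoregressive (AR) baseline on the same hardware.

DLM_TOKENS_PER_SEC = 865.0        # approximate DLM throughput on B200
DLM_SPEEDUP_VS_AR = 4.0           # "roughly 4x the AR baseline"
SELF_SPEC_SPEEDUPS = (6.0, 6.4)   # self-speculation range, relative to AR


def implied_ar_baseline(dlm_tps: float, speedup: float) -> float:
    """AR tokens/second implied by a DLM throughput and its speedup factor."""
    return dlm_tps / speedup


def implied_throughput(baseline_tps: float, speedup: float) -> float:
    """Tokens/second implied by a baseline and a speedup factor."""
    return baseline_tps * speedup


ar_baseline = implied_ar_baseline(DLM_TOKENS_PER_SEC, DLM_SPEEDUP_VS_AR)
self_spec_low, self_spec_high = (
    implied_throughput(ar_baseline, s) for s in SELF_SPEC_SPEEDUPS
)

print(f"Implied AR baseline:      ~{ar_baseline:.0f} tokens/s")
print(f"Implied self-speculation: ~{self_spec_low:.0f} to ~{self_spec_high:.0f} tokens/s")
```

Under these assumptions the AR baseline works out to about 216 tokens/second, and the self-speculation modes to roughly 1,300 to 1,380 tokens/second.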

Alignment: New signal not yet covered

Related Positions: Multi-Model Multi-Vendor, AI Infrastructure Strategy

nvidia, nemotron-labs, diffusion-language-models, inference-throughput, b200, self-speculation, model-architecture