News
Accelerating Data Processing with NVIDIA Multi-Instance GPU and NUMA Node Localization
2 hours, 55 minutes ago (870 words) This post first analyzes the memory hierarchy of NVIDIA GPUs, discussing the power and performance impacts of data transfer over the die-to-die link. It then reviews how to use NVIDIA Multi-Instance GPU (MIG) mode to achieve data localization. Finally, it…
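As a rough illustration of the localization idea (not from the post; the MIG UUID and core list below are hypothetical placeholders), a worker process can be confined to a single MIG instance via CUDA_VISIBLE_DEVICES and pinned to the CPU cores of the NUMA node nearest that GPU slice:

```python
import os

# Hypothetical values: get the real MIG UUID from `nvidia-smi -L` and the
# NUMA node's CPU list from /sys/devices/system/node/node0/cpulist.
MIG_UUID = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
NUMA0_CORES = set(range(0, 16))

# Restrict this process to one MIG instance...
os.environ["CUDA_VISIBLE_DEVICES"] = MIG_UUID

# ...and pin it to the CPU cores of the matching NUMA node, so
# host<->device traffic stays off the die-to-die link (Linux-only API).
os.sched_setaffinity(0, NUMA0_CORES)

# Any CUDA work launched after this point sees only the MIG slice and
# runs on NUMA-local cores.
```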
Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai
1 day, 2 hours ago (1169 words) Joint benchmarking with Nebius shows that fractional GPUs significantly improve throughput and utilization for production LLM workloads. As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent…
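To see why fractioning lifts utilization, here is a minimal scheduling sketch (mine, not the Run:ai API): first-fit packing of fractional GPU requests versus granting each workload a whole GPU.

```python
def pack(requests, num_gpus):
    """First-fit packing of fractional GPU requests (0 < r <= 1).
    Returns the free capacity left on each GPU."""
    free = [1.0] * num_gpus
    for r in requests:
        for i, f in enumerate(free):
            if r <= f + 1e-9:
                free[i] -= r
                break
        else:
            raise RuntimeError("no GPU with enough free capacity")
    return free

requests = [0.25, 0.5, 0.25, 0.5, 0.25]  # five small inference workloads
print(pack(requests, num_gpus=2))        # fractional: fits on 2 GPUs -> [0.0, 0.25]
# Whole-GPU allocation would need 5 GPUs, each mostly idle.
```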
Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute
1 day, 3 hours ago (700 words) The leaderboard scores how fast users' custom GPU kernels solve a set of standard problems like vector addition, sorting, and matrix multiply. Python dominates machine learning for its ergonomics, but writing truly fast GPU code has historically meant dropping into…
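The cuda.compute API itself isn't shown in the excerpt; as a stand-in, here is what the leaderboard's simplest problem, vector addition, looks like as a Python-authored GPU kernel using Numba (an assumption for illustration, not the post's library):

```python
import numpy as np
from numba import cuda  # stand-in for cuda.compute, whose API isn't shown here

@cuda.jit
def vec_add(a, b, out):
    i = cuda.grid(1)      # global thread index
    if i < out.size:      # guard against launching past the array end
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

threads = 256
blocks = (n + threads - 1) // threads
vec_add[blocks, threads](a, b, out)  # Numba transfers host arrays automatically

assert np.allclose(out, a + b)
```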
Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy
1 week, 3 days ago (1026 words) NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture traditionally requires significant manual effort. To address this challenge, today we are announcing the availability of AutoDeploy as a beta…
How to Build License-Compliant Synthetic Data Pipelines for AI Model Distillation
2 weeks, 2 hours ago (816 words) Specialized AI models are built to perform specific tasks or solve particular problems. But if you've ever tried to fine-tune or distill a domain-specific model, you've probably hit a few blockers. These challenges often prevent promising AI projects…
Accelerating Long-Context Model Training in JAX and XLA
2 weeks, 2 days ago (783 words) To understand why NVSHMEM provides significant speedups for long-context training, it's necessary to first understand how context parallelism works and the unique communication patterns it creates. This section explains why the fine-grained, latency-sensitive communication of ring attention makes it an…
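For intuition about the communication pattern the post refers to, here is a small NumPy simulation of ring attention's block rotation (a sketch of the pattern only; real implementations overlap each transfer with the matching attention compute, which is where fine-grained NVSHMEM puts help):

```python
import numpy as np

def ring_attention(Q, K, V, P):
    """Simulate context-parallel attention across P ranks.
    Each rank owns one shard of the sequence; KV shards rotate around
    the ring while Q stays put."""
    d = Q.shape[1]
    Qs, Ks, Vs = (np.array_split(M, P) for M in (Q, K, V))
    num = [np.zeros_like(Qs[r]) for r in range(P)]            # running exp(S) @ V
    den = [np.zeros((Qs[r].shape[0], 1)) for r in range(P)]   # running softmax denom
    for step in range(P):
        for r in range(P):
            src = (r + step) % P  # KV block this rank holds at this step
            S = np.exp(Qs[r] @ Ks[src].T / np.sqrt(d))
            num[r] += S @ Vs[src]
            den[r] += S.sum(axis=1, keepdims=True)
        # Here each rank would send its KV block to its ring neighbor:
        # P-1 small, latency-sensitive transfers per step, which is what
        # makes fine-grained one-sided communication attractive.
    return np.concatenate([n / d_ for n, d_ in zip(num, den)])

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
S = np.exp(Q @ K.T / 2.0)  # reference: full (non-causal) softmax attention
ref = (S / S.sum(axis=1, keepdims=True)) @ V
assert np.allclose(ring_attention(Q, K, V, P=4), ref)
```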
Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel
2 weeks, 3 days ago (878 words) In LLM training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is challenging. EP communication is essentially all-to-all, but its dynamic and sparse nature (only the top-k experts are activated per token, rather than all experts) makes it difficult to implement…
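A toy routing sketch (illustrative only, not the post's implementation) makes the sparseness concrete: with top-k routing, every EP rank exchanges tokens with every other rank, but the per-pair volume is small and changes batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, topk, ep_ranks = 64, 8, 2, 4
experts_per_rank = num_experts // ep_ranks

# Router: each token picks its top-k experts from (here random) logits.
logits = rng.normal(size=(num_tokens, num_experts))
topk_experts = np.argsort(logits, axis=1)[:, -topk:]

# All-to-all send counts from one source rank: how many (token, expert)
# pairs go to each EP rank. The pattern is dense (every rank talks to
# every rank) but sparse in volume (topk of num_experts) and varies per
# batch, which is why static, evenly sized all-to-all buffers waste bandwidth.
dest_rank = topk_experts // experts_per_rank
send_counts = np.bincount(dest_rank.ravel(), minlength=ep_ranks)
print(send_counts, "of", num_tokens * topk, "routed pairs")
```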
Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core
3 weeks, 1 day ago (1340 words) This post introduces Dynamic Context Parallelism (Dynamic-CP), a scheduling approach in NVIDIA Megatron Core used for LLM post-training or DiT pre-training. It dynamically selects the CP size per microbatch to efficiently handle variable-length sequences, achieving up to a 1.48x speedup on real-world…
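The post's exact selection policy isn't in the excerpt; as a sketch of the idea, one could pick the smallest power-of-two CP size that keeps each microbatch's longest sequence under a per-GPU token budget (the budget and the power-of-two constraint are assumptions here):

```python
def pick_cp_size(seq_len, max_tokens_per_gpu=4096, max_cp=8):
    """Smallest power-of-two CP group that keeps each GPU's shard within
    its token budget (a sketch; a real policy would also balance
    attention FLOPs, not just token counts)."""
    cp = 1
    while seq_len / cp > max_tokens_per_gpu and cp < max_cp:
        cp *= 2
    return cp

# Variable-length microbatches map to different CP sizes, so short
# sequences no longer pay communication costs sized for the longest ones.
for seq_len in (2048, 8192, 30000):
    print(seq_len, "->", pick_cp_size(seq_len))  # 1, 2, 8
```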
Updating Classifier Evasion for Vision Language Models
3 weeks, 1 day ago (588 words) How much influence can we exert over the LLM if we control the image input? Can we adapt classic adversarial image generation techniques to VLMs? If so, this may impact how we secure systems integrating these VLMs into control flow…
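One such classic technique is FGSM: perturb the image one step along the sign of the loss gradient with respect to the pixels. A minimal NumPy sketch on a toy linear "classifier" (a stand-in for a VLM's vision encoder, where the gradient would come from autodiff through the model):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=784)  # toy linear classifier weights
x = rng.normal(size=784)  # "image" pixels
y = 1.0                   # true label in {-1, +1}

def hinge_loss(x):
    return max(0.0, 1.0 - y * float(w @ x))

# FGSM: one step along sign(d loss / d pixels). For the hinge loss the
# input gradient is -y * w when the margin is violated; perturbing in
# that direction drives the score away from the true label.
grad_x = -y * w
eps = 0.25
x_adv = x + eps * np.sign(grad_x)

print("clean loss:", hinge_loss(x), "adversarial loss:", hinge_loss(x_adv))
```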
Accelerating Diffusion Models with an Open, Plug-and-Play Offering
3 weeks, 2 days ago (663 words) NVIDIA Research has built a new open source library called NVIDIA FastGen that unifies state-of-the-art video diffusion distillation techniques. Accelerating diffusion sampling without sacrificing quality and diversity has emerged as a key open challenge, with video generation being one of…