News
vllm.model_executor.warmup.qwen_triton_warmup
3+ hour, 25+ min ago (26+ words) vLLM docs Warm up Qwen Triton kernels from the loaded model's compile keys. Warm Qwen Triton kernels reported by the JIT monitor....
serving - vLLM
3+ hour, 27+ min ago (73+ words) vLLM docs Postprocess a GenerateResponse into a ChatCompletionResponse. Postprocess a list of GenerateResponses into a CompletionResponse. Extract multimodal metadata from a rendered engine prompt. Returns None for text-only prompts. When request.chat_request is provided, the…...
vllm.v1.attention.backends.mla.prefill.aiter_flash_attn
3+ hour, 59+ min ago (54+ words) vLLM docs AITER Flash Attention backend for MLA prefill (ROCm). This backend calls aiter.flash_attn_varlen_func directly, which natively supports different q/k and v head dims (qk headdim 192, v headdim 128) without padding V, and dispatches to the fast aiter::fmha_fwd_ kernel…...
vllm.entrypoints.scale_out.token_in_token_out
3+ hour, 27+ min ago (15+ words) vLLM docs Encode/decode utilities for multimodal tensors and field metadata...
base - vLLM
3+ day, 23+ hour ago (221+ words) Abstract base class for MoE kernel oracles. Each MoE oracle (unquantized / fp8 / nvfp4 / mxfp4 / mxfp8 / int8 / int_wna16) is responsible for selecting the right MoE kernel backend for a given (model, hardware, deployment-config) tuple. The current implementation expresses this responsibility as module-level functions that…...
sparse_attn
4+ day, 3+ hour ago (113+ words) vLLM docs ROCm gfx942/gfx950 block-sparse GQA prefill kernel for MiniMax-M3. Only the prefill path is specialized on CDNA: each 128-token KV block is split into SUB_K-token sub-tiles to right-size the per-block QK/PV MFMAs. Everything else -- the decode split-K…...
diffusion - vLLM
2+ week, 3+ day ago (108+ words) diffusion vLLM docs Configuration for discrete diffusion (dLLM) models. Configuration for discrete diffusion language models (dLLMs). dLLMs generate tokens via iterative denoising over a fixed-length canvas rather than left-to-right autoregressive decoding. They reuse the speculative-decoding data path…...
Optimization and Tuning
7+ mon, 3+ week ago (1562+ words) This guide covers optimization strategies and performance tuning for vLLM V1. Running out of memory? Consult this guide on how to conserve memory. vLLM provides 4 optimization levels (-O0, -O1, -O2, -O3) that allow users to trade off startup time for performance: For more…...
deepep_v2 - vLLM
2+ week, 6+ day ago (102+ words) deepep_v2 vLLM docs Prepare/Finalize using DeepEP v2 Elastic Buffer (unified API). Supports two modes controlled by the use_cudagraph constructor arg: Decode mode (use_cudagraph=True): - do_expand=False, do_cpu_sync=False - Tokens returned in original order with recv_topk_idx (global IDs) - Worst-case tensor allocation; padding rows zeroed via…...
vllm.model_executor.layers.fused_moe.experts.lora_experts_mixin
2+ week, 6+ day ago (27+ words) vLLM docs The helper methods are pure functions of their inputs; all required state is on lora_context or passed as arguments....