News

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > warmup > qwen_triton_warmup

vllm.model_executor.warmup.qwen_triton_warmup

3+ hour, 25+ min ago  (26+ words) vLLM docs Warm up Qwen Triton kernels from the loaded model's compile keys. Warm Qwen Triton kernels reported by the JIT monitor....

vLLM docs
docs.vllm.ai > en > latest > api > vllm > entrypoints > scale_out > derender > serving

serving - vLLM

3+ hour, 27+ min ago  (73+ words) vLLM docs Postprocess a GenerateResponse into a ChatCompletionResponse. Postprocess a list of GenerateResponses into a CompletionResponse. Extract multimodal metadata from a rendered engine prompt. Returns None for text-only prompts. When request.chat_request is provided, the…...

vLLM docs
docs.vllm.ai > en > latest > api > vllm > v1 > attention > backends > mla > prefill > aiter_flash_attn

vllm.v1.attention.backends.mla.prefill.aiter_flash_attn

3+ hour, 59+ min ago  (54+ words) vLLM docs AITER Flash Attention backend for MLA prefill (ROCm). This backend calls aiter.flash_attn_varlen_func directly, which natively supports different q/k and v head dims (qk headdim 192, v headdim 128) without padding V, and dispatches to the fast aiter::fmha_fwd_ kernel…...
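
To illustrate what "different q/k and v head dims without padding V" means, here is a minimal pure-Python sketch of single-head, non-causal softmax attention. It is not the AITER kernel and does not call aiter; the helper names `attention` and `rand_matrix` and the sequence lengths are hypothetical. Only the head dims 192 and 128 come from the snippet above. The output width follows V's head dim, and zero-padding V to 192 columns yields the same first 128 columns plus columns of zeros.

```python
import math
import random

QK_HEAD_DIM = 192  # q/k head dim quoted in the snippet
V_HEAD_DIM = 128   # v head dim quoted in the snippet


def attention(q, k, v):
    """Naive single-head softmax attention (illustrative only).

    q: [n_q][d_qk], k: [n_kv][d_qk], v: [n_kv][d_v] -> [n_q][d_v]
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v_row[c] for w, v_row in zip(weights, v)) / total for c in range(d_v)]
        )
    return out


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of uniform values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
q = rand_matrix(3, QK_HEAD_DIM, rng)
k = rand_matrix(5, QK_HEAD_DIM, rng)
v = rand_matrix(5, V_HEAD_DIM, rng)

out = attention(q, k, v)  # V used at its native width of 128

# The padding alternative: widen V to 192 with zeros, then slice afterwards.
v_padded = [row + [0.0] * (QK_HEAD_DIM - V_HEAD_DIM) for row in v]
out_padded = attention(q, k, v_padded)
```

The padded variant spends extra work on columns that are always zero, which is the cost a backend with native support for mismatched head dims avoids.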

vLLM docs
docs.vllm.ai > en > latest > api > vllm > entrypoints > scale_out > token_in_token_out

vllm.entrypoints.scale_out.token_in_token_out

3+ hour, 27+ min ago  (15+ words) vLLM docs Encode/decode utilities for multimodal tensors and field metadata...

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > fused_moe > oracle > base

base - vLLM

3+ day, 23+ hour ago  (221+ words) Abstract base class for MoE kernel oracles. Each MoE oracle (unquantized / fp8 / nvfp4 / mxfp4 / mxfp8 / int8 / int_wna16) is responsible for selecting the right MoE kernel backend for a given (model, hardware, deployment-config) tuple. The current implementation expresses this responsibility as module-level functions that…...

vLLM docs
docs.vllm.ai > en > latest > api > vllm > models > minimax_m3 > amd > ops > sparse_attn

sparse_attn

4+ day, 3+ hour ago  (113+ words) vLLM docs ROCm gfx942/gfx950 block-sparse GQA prefill kernel for MiniMax-M3. Only the prefill path is specialized on CDNA: each 128-token KV block is split into SUB_K-token sub-tiles to right-size the per-block QK/PV MFMAs. Everything else -- the decode split-K…...
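
The sub-tiling described above can be pictured with a small index calculation. This is only a sketch of the token ranges involved, not the kernel itself: the function name `sub_tiles` and the value 32 used for SUB_K are assumptions for illustration, and only the 128-token block size comes from the snippet.

```python
KV_BLOCK = 128  # tokens per KV block, as stated in the snippet


def sub_tiles(block_start, sub_k, block_size=KV_BLOCK):
    """Return the [start, end) token ranges of each sub-tile in one KV block.

    `sub_k` stands in for SUB_K; it must divide the block size evenly.
    """
    if sub_k <= 0 or block_size % sub_k != 0:
        raise ValueError("sub_k must be a positive divisor of block_size")
    return [
        (block_start + offset, block_start + offset + sub_k)
        for offset in range(0, block_size, sub_k)
    ]


# Third KV block (tokens 256..383) with an assumed SUB_K of 32.
tiles = sub_tiles(block_start=2 * KV_BLOCK, sub_k=32)
for start, end in tiles:
    print(f"sub-tile covers tokens [{start}, {end})")
```

A smaller sub-tile means each QK and PV product covers fewer KV tokens at a time, which is the "right-sizing" the snippet refers to.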

vLLM docs
docs.vllm.ai > en > latest > api > vllm > config > diffusion

diffusion - vLLM

2+ week, 3+ day ago  (108+ words) diffusion vLLM docs Configuration for discrete diffusion (dLLM) models. Configuration for discrete diffusion language models (dLLMs). dLLMs generate tokens via iterative denoising over a fixed-length canvas rather than left-to-right autoregressive decoding. They reuse the speculative-decoding data path…...
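
As a rough picture of "iterative denoising over a fixed-length canvas", the toy sketch below starts from a fully masked canvas and commits the most confident proposals each step, in whatever positions they fall. Everything here is hypothetical: `toy_denoiser` is a deterministic stand-in for a model, and `denoise`, `per_step`, and the confidence rule are illustrative choices, not vLLM's configuration or API.

```python
MASK = None  # marker for a canvas slot that has not been denoised yet


def toy_denoiser(canvas, vocab):
    """Hypothetical model stand-in: propose (token, confidence) per masked slot."""
    middle = len(canvas) // 2
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok is MASK:
            token = vocab[i % len(vocab)]
            confidence = 1.0 / (1 + abs(i - middle))  # most confident near the middle
            proposals[i] = (token, confidence)
    return proposals


def denoise(length, vocab, per_step=2):
    """Fill a fixed-length canvas by committing the top proposals each step."""
    canvas = [MASK] * length
    fill_order = []
    steps = 0
    while any(tok is MASK for tok in canvas):
        proposals = toy_denoiser(canvas, vocab)
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            canvas[i] = proposals[i][0]
            fill_order.append(i)
        steps += 1
    return canvas, fill_order, steps


canvas, fill_order, steps = denoise(length=6, vocab=["a", "b", "c"])
print(canvas, fill_order, steps)
```

Unlike autoregressive decoding, the fill order here is driven by confidence rather than position, and several slots are committed per step.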

vLLM docs
docs.vllm.ai > en > latest > configuration > optimization

Optimization and Tuning

7+ mon, 3+ week ago  (1562+ words) This guide covers optimization strategies and performance tuning for vLLM V1. Running out of memory? Consult this guide on how to conserve memory. vLLM provides 4 optimization levels (-O0, -O1, -O2, -O3) that allow users to trade off startup time for performance: For more…...

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > fused_moe > prepare_finalize > deepep_v2

deepep_v2 - vLLM

2+ week, 6+ day ago  (102+ words) deepep_v2 vLLM docs Prepare/Finalize using DeepEP v2 Elastic Buffer (unified API). Supports two modes controlled by the use_cudagraph constructor arg: Decode mode (use_cudagraph=True): - do_expand=False, do_cpu_sync=False - Tokens returned in original order with recv_topk_idx (global IDs) - Worst-case tensor allocation; padding rows zeroed via…...

vLLM docs
docs.vllm.ai > en > stable > api > vllm > model_executor > layers > fused_moe > experts > lora_experts_mixin

vllm.model_executor.layers.fused_moe.experts.lora_experts_mixin

2+ week, 6+ day ago  (27+ words) vLLM docs The helper methods are pure functions of their inputs; all required state is on lora_context or passed as arguments....
