News
flashinfer_trtllm_moe
12 hours, 34 min ago (32 words) vLLM Docs Supports only Blackwell-family GPUs. BF16 kernels do not support non-gated MoE. Supports TRTLLM. The kernel does not support EPLB. This method mirrors mk.FusedMoEPermuteExpertsUnpermute. is_supported_config for BF16 unquantized kernels...
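The constraints in the entry above (Blackwell only, BF16 restricted to gated unquantized MoE, no EPLB) suggest the shape of a support-check function. The sketch below is illustrative only: `DeviceInfo`, `MoEConfig`, and this `is_supported_config` signature are assumptions, not vLLM's actual API.

```python
# Hypothetical sketch of the capability gating described above.
# All names here are illustrative stand-ins, not vLLM's real classes.
from dataclasses import dataclass


@dataclass
class DeviceInfo:
    compute_capability: tuple  # (major, minor); (10, 0) ~ Blackwell


@dataclass
class MoEConfig:
    dtype: str      # "bf16", "fp8", ...
    gated: bool     # gated MoE experts
    quantized: bool
    use_eplb: bool  # expert-parallel load balancing


def is_supported_config(device: DeviceInfo, cfg: MoEConfig) -> bool:
    """Mirror the constraints listed above: Blackwell-family GPUs only,
    no EPLB, and BF16 kernels only for gated, unquantized MoE."""
    if device.compute_capability[0] < 10:  # only Blackwell-family GPUs
        return False
    if cfg.use_eplb:                       # kernel does not support EPLB
        return False
    if cfg.dtype == "bf16":                # BF16 path: gated, unquantized only
        return cfg.gated and not cfg.quantized
    return True


blackwell = DeviceInfo((10, 0))
print(is_supported_config(blackwell, MoEConfig("bf16", True, False, False)))   # True
print(is_supported_config(blackwell, MoEConfig("bf16", False, False, False)))  # False
```

The early-return style keeps each documented restriction a single, auditable line.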
Prompt Embedding Inputs
5 months, 4 weeks ago (213 words) This page teaches you how to pass prompt embedding inputs to vLLM. You can pass prompt embeddings from Hugging Face Transformers models to the 'prompt_embeds' field of the prompt dictionary, as shown in the following examples: Our OpenAI-compatible server…
Multimodal Inputs
5 months, 4 weeks ago (889 words) This page teaches you how to pass multi-modal inputs to multi-modal models in vLLM. We are actively iterating on multi-modal support. See this RFC for upcoming changes, and open an issue on GitHub if you have any feedback or…
gemma4_utils
3 weeks, 2 days ago (340 words) Gemma4 output parsing utilities for offline inference. Standalone functions that parse decoded model text to extract structured thinking content and tool calls from Gemma4 models. These are pure-Python utilities with zero heavy dependencies; they work on raw decoded strings from any inference…
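Pure-string parsing of thinking content and tool calls, as described above, can be sketched with the standard `re` module. The `<think>` and `<tool_call>` tag names below are assumptions for illustration, not Gemma4's actual output format.

```python
import re

# Illustrative sketch of pure-Python output parsing as described above.
# The tag names are assumed, not Gemma4's real delimiters.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_output(text: str):
    """Split decoded model text into (thinking, tool_calls, answer)."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(text))
    tool_calls = [m.strip() for m in TOOL_RE.findall(text)]
    # Whatever remains after stripping both tag kinds is the final answer.
    answer = TOOL_RE.sub("", THINK_RE.sub("", text)).strip()
    return thinking, tool_calls, answer


text = '<think>plan</think><tool_call>{"name": "get_time"}</tool_call>The time is noon.'
print(parse_output(text))  # ('plan', ['{"name": "get_time"}'], 'The time is noon.')
```

Because the functions only touch decoded strings, they carry no torch or vLLM dependency, matching the "zero heavy dependencies" claim above.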
Mxfp8 Linear Kernel
3 weeks, 3 days ago (46 words) vLLM Docs Base class for MXFP8 quantized linear kernels. Each subclass implements a specific GEMM backend (FlashInfer CUTLASS, Marlin, emulation). Configuration for an MXFP8 linear layer. All MXFP8 layers share the same structure: FP8-E4M3 weights with uint8 (E8M0) per-block scales at block size 32.
emulation
3 weeks, 3 days ago (10 words) vLLM Docs Software emulation fallback for MXFP8 (dequant to BF16).
mxfp8 - vLLM
3 weeks, 3 days ago (28 words) mxfp8 vLLM Docs Configuration for an MXFP8 linear layer. All MXFP8 layers share the same structure: FP8-E4M3 weights with uint8 (E8M0) per-block scales at block size 32.
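The MXFP8 entries above describe the same structure: each block of 32 FP8-E4M3 weights shares one uint8 E8M0 scale, and the emulation path dequantizes to a wider type. A minimal sketch of that dequantization, assuming plain Python floats stand in for already-decoded FP8 values (this is an illustration, not vLLM's kernel):

```python
# Software-emulation sketch of MXFP8 dequantization as described above:
# one uint8 E8M0 scale per block of 32 weights. E8M0 is a bare biased
# exponent, so the scale is always a power of two.
BLOCK = 32


def e8m0_to_float(scale_byte: int) -> float:
    """E8M0 scale: 2**(byte - 127); byte 127 means a scale of 1.0."""
    return 2.0 ** (scale_byte - 127)


def dequantize(fp8_values, scales):
    """Multiply each block of 32 decoded FP8 values by its block's scale."""
    assert len(fp8_values) == BLOCK * len(scales)
    out = []
    for block_idx, s in enumerate(scales):
        factor = e8m0_to_float(s)
        block = fp8_values[block_idx * BLOCK:(block_idx + 1) * BLOCK]
        out.extend(v * factor for v in block)
    return out


# One block of 1.5s scaled by 2**(128 - 127) = 2.0.
print(dequantize([1.5] * 32, [128])[:3])  # [3.0, 3.0, 3.0]
```

Power-of-two scales are why E8M0 needs no mantissa bits: applying a scale is an exponent shift, which emulation expresses as a multiply after widening.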
fireredlid
3 weeks, 4 days ago (73 words) vLLM Docs FireRedLID feature extractor and processor. Raw waveform → 80-dim log-mel filterbank (via kaldi_native_fbank). The Processor wraps the FeatureExtractor and a tokenizer. Extracts 80-dim log-mel filterbank features from raw waveforms, applies CMVN, and returns padded feature tensors…
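The CMVN step mentioned above (cepstral mean and variance normalization) normalizes each feature dimension across the frames of an utterance. A pure-Python sketch of per-utterance CMVN, assuming a frames-by-dims list layout; the actual extractor computes its features via kaldi_native_fbank:

```python
# Sketch of per-utterance CMVN as described above: per feature dimension,
# subtract the mean over time and (optionally) divide by the standard
# deviation. Pure-Python illustration on small lists.
def apply_cmvn(feats, norm_var=True, eps=1e-8):
    """feats: list of frames, each a list of feature values (e.g. 80 dims)."""
    num_frames = len(feats)
    dim = len(feats[0])
    means = [sum(f[d] for f in feats) / num_frames for d in range(dim)]
    if norm_var:
        variances = [
            sum((f[d] - means[d]) ** 2 for f in feats) / num_frames
            for d in range(dim)
        ]
        stds = [max(v, eps) ** 0.5 for v in variances]  # eps avoids div-by-zero
    else:
        stds = [1.0] * dim
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in feats]


# Two 2-dim frames: each dimension ends up zero-mean, unit-variance.
print(apply_cmvn([[1.0, 2.0], [3.0, 4.0]]))  # [[-1.0, -1.0], [1.0, 1.0]]
```

After CMVN, each dimension is zero-mean (and unit-variance when `norm_var=True`), which makes features comparable across recording conditions before padding and batching.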
triton_w4a16
3 weeks, 4 days ago (169 words) Triton-based W4A16 GEMM kernel for ROCm (MI300 and newer). Supports GPTQ-format int4 weights (uint4b8 symmetric, uint4 asymmetric) with grouped quantization. Weight tensors are transposed from the compressed-tensors checkpoint layout to the kernel's [K, N//8] layout. Fused W4A16 GEMM using GPTQ-packed int4 weights. Activation matrix [M, K], float16 or…
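The `N//8` in the layout above reflects that eight 4-bit weights pack into each 32-bit word, and "uint4b8" means symmetric values stored unsigned with an implicit zero-point of 8. A small sketch of that unpacking and grouped dequantization; the real kernel does this fused inside Triton, so this is illustration only:

```python
# Sketch of GPTQ-style int4 unpacking as described above: eight 4-bit
# nibbles per 32-bit word, and uint4b8 weights recovered by subtracting
# the bias of 8 before applying the per-group scale.
def unpack_int4(word: int):
    """Extract eight unsigned 4-bit values from one 32-bit word (LSB first)."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequant_uint4b8(word: int, scale: float):
    """uint4b8: stored value minus 8 gives the signed weight; scale per group."""
    return [(v - 8) * scale for v in unpack_int4(word)]


# 0x98 packs nibbles 8 and 9 in its low bits -> 0.0 and 0.5 at scale 0.5.
print(dequant_uint4b8(0x98, 0.5)[:2])  # [0.0, 0.5]
```

Packing eight weights per word is what turns an [K, N] int4 weight matrix into the kernel's [K, N//8] storage layout.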
base - v LLM
3 weeks, 4 days ago (133 words) Base class for NVFP4 quantized linear kernels. Each subclass implements a specific GEMM backend (CUTLASS, Marlin, etc.). The kernel selection mechanism iterates over registered subclasses in priority order, calling is_supported and can_implement to find the best match for the current hardware. Run the…
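The priority-ordered selection mechanism described above can be sketched with subclass registration and a first-match loop. Class names, the `hardware` string, and the config dict below are all illustrative assumptions, not vLLM's exact API:

```python
# Minimal sketch of priority-ordered kernel selection as described above:
# subclasses self-register, and the first whose is_supported/can_implement
# checks pass wins. Names are illustrative, not vLLM's real classes.
class KernelBase:
    _registry = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        KernelBase._registry.append(cls)  # definition order = priority order

    @classmethod
    def is_supported(cls, hardware: str) -> bool: ...

    @classmethod
    def can_implement(cls, config: dict) -> bool: ...


class CutlassKernel(KernelBase):  # highest priority: registered first
    @classmethod
    def is_supported(cls, hardware): return hardware == "sm100"

    @classmethod
    def can_implement(cls, config): return config.get("group_size") == 16


class MarlinKernel(KernelBase):   # broader fallback
    @classmethod
    def is_supported(cls, hardware): return hardware in ("sm80", "sm90", "sm100")

    @classmethod
    def can_implement(cls, config): return True


def select_kernel(hardware: str, config: dict):
    """Return the first registered kernel supporting this hardware/config."""
    for kernel in KernelBase._registry:
        if kernel.is_supported(hardware) and kernel.can_implement(config):
            return kernel
    raise RuntimeError("no supported kernel for this hardware/config")


print(select_kernel("sm90", {"group_size": 16}).__name__)  # MarlinKernel
```

Splitting the check into `is_supported` (hardware) and `can_implement` (layer config) lets a fast backend decline gracefully so a slower but more general one is picked instead.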