
vLLM Docs
docs.vllm.ai > en > latest > api > vllm > distributed > stateless_coordinator

stateless_coordinator

9+ hour, 22+ min ago  (53+ words) A stateless version of the GroupCoordinator class in parallel_state. It creates CPU, device, and TCPStore-based communication groups that are independent of PyTorch's WORLD group. Hence, communication groups with a different set of participant GPUs can be created…
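The core idea in the snippet, assigning ranks through a shared store per named group rather than through one global WORLD group, can be sketched with a toy in-memory store. All names below are illustrative; a real implementation would use something like torch.distributed's TCPStore for cross-process rendezvous:

```python
class RendezvousStore:
    """Toy stand-in for a TCPStore-style key-value store used for rendezvous."""

    def __init__(self):
        self._groups = {}

    def join(self, group_name, participant):
        # Rank is assigned per group, with no reference to any global WORLD rank.
        members = self._groups.setdefault(group_name, [])
        members.append(participant)
        return len(members) - 1

    def members(self, group_name):
        return tuple(self._groups.get(group_name, ()))


store = RendezvousStore()
# Overlapping groups with different participant sets coexist independently:
print(store.join("tp", "gpu0"))  # 0
print(store.join("tp", "gpu1"))  # 1
print(store.join("dp", "gpu1"))  # 0 -- gpu1 has rank 0 in a different group
```

The point is that each group bootstraps its own membership, so a "different set of participant GPUs" per group falls out naturally.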

vLLM Docs
docs.vllm.ai > projects > gaudi > en > 0.11.2 > dev_guide > profiling > e2e-profiling.html

End-to-End Profiling - vLLM Hardware Plugin for Intel® Gaudi®

1+ mon, 1+ week ago  (201+ words) E2E profiling captures all relevant data in a single run, combining multiple trace sources. Due to the large amount of data collected during E2E profiling, Python stack events in the PyTorch Profiler are disabled by default. If you need Python stack events, use either PyTorch…

vLLM Docs
docs.vllm.ai > projects > ascend > zh-cn > main > user_guide > release_notes.html

Release Notes

3+ mon, 2+ week ago  (1724+ words) This is the final release of v0.13.0 for vLLM Ascend. Please follow the official doc to get started. Qwen3-Next: full support for the Qwen3-Next series, including 80B-A3B-Instruct with full graph mode, MTP, quantization (W8A8), NZ optimization, and chunked prefill. Fixed multiple accuracy and…

vLLM Docs
docs.vllm.ai > en > latest > models > supported_models

Supported Models

3+ mon, 3+ week ago  (1230+ words) vLLM supports generative and pooling models across various tasks. For each task, we list the model architectures that have been implemented in vLLM. Alongside each architecture, we include some popular models that use it. If vLLM natively supports a model, its…

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > entrypoints > openai > speech_to_text

speech_to_text

3+ mon, 3+ week ago  (105+ words) Convert tokens to verbose segments. This method expects the model to produce timestamps as tokens (similar to Whisper). If the tokens do not include timestamp information, the segments may not be generated correctly. Note: fields like avg_logprob, compression_ratio, and no_speech_prob are not supported…
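The described behavior, turning a flat token stream with interleaved Whisper-style timestamp tokens into verbose segments, can be sketched generically. The token format `<|0.00|>` and the function name are assumptions for illustration, not vLLM's actual implementation:

```python
import re

# Whisper-style timestamp token, e.g. "<|0.00|>" or "<|1.20|>" (assumed format).
TS = re.compile(r"^<\|(\d+(?:\.\d+)?)\|>$")


def tokens_to_segments(tokens):
    """Group text tokens between consecutive timestamp tokens into
    (start, end, text) segments.  Tokens with no timestamp information
    produce no segments, matching the caveat in the docs snippet."""
    segments, start, buf = [], None, []
    for tok in tokens:
        m = TS.match(tok)
        if m:
            t = float(m.group(1))
            if start is not None and buf:
                segments.append((start, t, "".join(buf).strip()))
            start, buf = t, []
        elif start is not None:
            buf.append(tok)
    return segments


print(tokens_to_segments(["<|0.00|>", "Hello", " world", "<|1.20|>"]))
# [(0.0, 1.2, 'Hello world')]
```

A stream with no timestamp tokens at all simply yields an empty segment list, which is one way the "segments may not be generated correctly" caveat manifests.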

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > distributed > kv_transfer > kv_connector > v1 > multi_connector

multi_connector

3+ mon, 3+ week ago  (81+ words) A wrapper for using multiple KVConnectors at the same time. Returns the required KV cache layout, e.g. HND or NHD, or None if the connector does not require a specific layout. Sets xPU-specific copy ops for all sub-connectors. Sets the KV connector handshake…
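The layout-reconciliation part of such a wrapper can be sketched as follows. This is a minimal illustration of the pattern, assuming hypothetical class and attribute names, not vLLM's real MultiConnector API:

```python
class MultiConnector:
    """Illustrative wrapper that delegates to several sub-connectors and
    reconciles their required KV cache layouts ('HND', 'NHD', or None)."""

    def __init__(self, connectors):
        self.connectors = connectors

    def required_kv_cache_layout(self):
        # Collect the concrete layout requirements; None means "no preference".
        layouts = {c.required_layout for c in self.connectors} - {None}
        if len(layouts) > 1:
            raise ValueError(f"sub-connectors disagree on KV layout: {layouts}")
        return layouts.pop() if layouts else None


class Conn:
    """Stub sub-connector holding only its layout requirement."""

    def __init__(self, required_layout=None):
        self.required_layout = required_layout


mc = MultiConnector([Conn("HND"), Conn(None)])
print(mc.required_kv_cache_layout())  # HND
```

The design choice worth noting is that a sub-connector with no layout requirement never constrains the group; only conflicting concrete requirements are an error.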

vLLM Docs
docs.vllm.ai > en > latest > usage > troubleshooting

Troubleshooting

3+ mon, 3+ week ago  (961+ words) This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please search existing issues first to see if it has already been reported. If not, please file a new issue, providing as much relevant…

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > model_loader > weight_utils

weight_utils

3+ mon, 3+ week ago  (481+ words) Context manager that provides an atomic file-writing routine. The context manager writes to a temporary file and, if successful, atomically replaces the original file. Parameters include the path to the file to write and the file mode for the temporary file (e.g., 'w…
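The technique described, write to a temp file, then atomically swap it into place, is a standard stdlib pattern. A minimal sketch (the name `atomic_writer` is illustrative, not necessarily the vLLM helper's name):

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_writer(path, mode="w"):
    """Write to a temporary file in the destination's directory, then
    atomically replace the destination on success.  os.replace is atomic
    when source and destination live on the same filesystem, which the
    shared directory guarantees."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, mode) as f:
            yield f
        os.replace(tmp, path)          # success: swap into place atomically
    except BaseException:
        os.unlink(tmp)                 # failure: the original file is untouched
        raise


with atomic_writer("demo.txt") as f:
    f.write("hello")
print(open("demo.txt").read())  # hello
```

Readers of `path` therefore never observe a half-written file: they see either the old contents or the complete new contents.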

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > quantization > modelopt

modelopt - vLLM

3+ mon, 3+ week ago  (222+ words) Supports loading KV-cache scaling factors from FP8 checkpoints. Linear method for Model Optimizer static quantization. Supports loading FP8 checkpoints with static weight scale and activation scale; future support might be added for dynamic scales. Pads the intermediate size so FlashInfer kernels' alignment constraints are met…
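The padding step mentioned at the end is ordinary round-up-to-a-multiple arithmetic. A sketch of the general technique, with an assumed function name and an assumed alignment of 256 purely for illustration:

```python
def pad_to_multiple(size: int, align: int) -> int:
    """Round size up to the next multiple of align, the usual way to
    satisfy a kernel's alignment constraint on a tensor dimension."""
    return (size + align - 1) // align * align


# An intermediate size of 11000 with a (hypothetical) 256-element
# alignment requirement gets padded up to 11008:
print(pad_to_multiple(11000, 256))  # 11008
print(pad_to_multiple(11008, 256))  # 11008 -- already aligned, unchanged
```

The extra padded elements carry no information; kernels simply require the dimension to be divisible by the alignment so their tiled loads line up.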

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > reasoning > step3_reasoning_parser

step3_reasoning_parser

3+ mon, 3+ week ago  (120+ words) Reasoning parser for the Step3 model. The Step3 model uses a dedicated token to denote the end of reasoning text. This parser extracts all content before that token as reasoning content. Extract reasoning content from a delta message, handling streaming output where previous + delta = current…
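The streaming invariant in the snippet (previous + delta = current) can be illustrated with a generic splitter. The end-of-reasoning marker below is a placeholder; Step3's actual token (which the snippet elides) differs, and the function name is an assumption:

```python
END = "<end-of-reasoning>"  # placeholder marker; the real Step3 token differs


def split_delta(previous: str, delta: str):
    """Split a streaming delta into (reasoning_delta, content_delta),
    given the invariant previous + delta = current accumulated text."""
    current = previous + delta
    if END in previous:                 # reasoning already closed earlier
        return "", delta
    idx = current.find(END)
    if idx == -1:                       # still inside the reasoning section
        return delta, ""
    # The end token completes within this delta: split around it.
    reasoning = current[len(previous):idx] if idx > len(previous) else ""
    content = current[idx + len(END):]
    return reasoning, content


print(split_delta("I think", " so<end-of-reasoning>Answer"))
# (' so', 'Answer')
```

Searching `current` rather than `delta` alone is what handles a marker that straddles two deltas, arriving half in one chunk and half in the next.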