
Update dependency vllm to v0.18.0

This MR contains the following updates:

Package  Update  Change
vllm     minor   ==0.17.1 -> ==0.18.0

Release Notes

vllm-project/vllm (vllm)

v0.18.0


vLLM v0.18.0
Known issues
  • Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#​37618)
  • If you previously ran into CUBLAS_STATUS_INVALID_VALUE and had to use a workaround in v0.17.0, you can now simply reinstall torch 2.10.0: PyTorch has published an updated wheel that fixes the underlying bug.
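A minimal reinstall sketch, assuming a pip-managed environment; add your usual CUDA index URL if your setup needs one:

```python
import subprocess

# Reinstall torch 2.10.0 so the republished wheel replaces the one that
# triggered CUBLAS_STATUS_INVALID_VALUE under v0.17.0. --no-deps avoids
# touching the rest of the environment.
subprocess.run(
    ["pip", "install", "--force-reinstall", "--no-deps", "torch==2.10.0"],
    check=True,
)
```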
Highlights

This release features 445 commits from 213 contributors (61 new)!

  • gRPC Serving Support: vLLM now supports gRPC serving via the new --grpc flag (#36169), enabling high-performance RPC-based serving alongside the existing HTTP/REST interface (see the launch sketch after this list).
  • GPU-less Render Serving: New vllm launch render command (#​36166, #​34551) enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference.
  • NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler (#​29184), significantly reducing spec decode overhead.
  • KV Cache Offloading Improvements: Smart CPU offloading that stores only frequently-reused blocks (#​35342), plus FlexKV as a new offloading backend (#​34328) and support for multiple KV groups in offloading spec (#​36610).
  • Elastic Expert Parallelism Milestone 2: NIXL-EP integration (#​35627) enables dynamic GPU scaling for MoE experts, with new --enable-ep-weight-filter CLI option (#​37351) for faster EP model loading.
  • FlashInfer 0.6.6: Updated FlashInfer dependency (#​36768) with numerous performance and correctness improvements.
  • Responses API Streaming Tool Calls: The OpenAI Responses API now supports tool/function calling with streaming (#​29947).
  • Online Beam Search for ASR: Beam search support for encoder/decoder models in both offline (#36153) and online (#36160) transcription.
  • Ray No Longer a Default Dependency: Ray has been removed as a default dependency (#​36170) — install it explicitly if needed.
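A launch sketch for the first two highlights. The --grpc flag (#36169) and the vllm launch render subcommand (#36166) come from the notes above; the model name, port, and remaining arguments are illustrative assumptions, not documented defaults.

```python
import subprocess

# 1) Serve over HTTP as before, now also exposing a gRPC endpoint (#36169).
#    Model id and port are placeholders.
server = subprocess.Popen([
    "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",
    "--grpc",           # new in v0.18.0: enable gRPC serving
    "--port", "8000",
])

# 2) GPU-less render process (#36166): multimodal preprocessing only, so it
#    can be scheduled on a CPU-only node. Arguments are assumptions.
renderer = subprocess.Popen([
    "vllm", "launch", "render", "meta-llama/Llama-3.1-8B-Instruct",
])

for proc in (server, renderer):
    proc.wait()
```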
Model Support
  • New architectures: Sarvam MoE (#​33942), OLMo Hybrid (#​32550), HyperCLOVAX-SEED-Think-32B VLM (#​31471), HyperCLOVAX-SEED-Think-14B (#​37107), Kimi-Audio-7B-Instruct (#​36127), ColPali late-interaction retrieval (#​36818), ERNIE pooling models (#​36385).
  • Speculative decoding: Eagle3 for Qwen3.5 (#​36658), Eagle3 for Kimi K2.5 MLA (#​36361), Eagle for Mistral Large 3 with dense layers (#​36163).
  • LoRA: Whisper LoRA (#​29856), FP8 LoRA dense kernel (#​35242).
  • Multimodal: Online use_audio_in_video (#​36319), audio extraction from MP4 for Nemotron Nano VL (#​35539), audio transcription for MP4/M4A/WebM (#​35109), expose media_io_kwargs at runtime (#​34778), fast media preprocessing for Nano Nemotron VL (#​35657).
  • Compatibility: Gemma/Gemma2 inputs_embeds (#​36787), SigLIP/CLIP Transformers v5 (#​37200), fused expert weights in Transformers backend (#​36997).
  • Performance: Qwen3 Next fused GDN kernel (#​35777), LFM2 tuned H100 MoE configs (#​36699).
  • Fixes: DeepSeek-V3.2 tokenizer space stripping (#​37004), Qwen3.5 tool calling (#​36774), Qwen3-VL timestamp mismatch (#​36136), Qwen3-Next TP>1 weight sharding (#​36242), Qwen3-ASR torch.compile (#​35869), MiniCPM-V audio inference (#​36751), MiniCPM-O 4.5 ViT attention (#​34127), routed experts for hybrid models (#​35744), Qwen2.5-Omni/Qwen3-Omni multi-video audio_in_video (#​37147), DeepSeek-OCR empty images crash (#​36670).
Engine Core
  • Model Runner V2: Probabilistic rejection sampling for spec decode (#​35461), pooling models (#​36019), extensible CUDA graph dispatch (#​35959), WhisperModelState (#​35790), XD-RoPE (#​36817), model_state CUDA graph capture (#​36544).
  • KV cache offloading: Reuse-frequency-gated CPU stores (#​35342), FlexKV offloading backend (#​34328), multiple KV groups (#​36610), async scheduling fix (#​33881).
  • Speculative decoding: NGram GPU implementation with async scheduler (#​29184), fused EAGLE step slot mapping (#​33503).
  • Performance: Removed busy loop from idle buffer readers (#28053), +2.7% E2E pooling throughput via worker-side maxsim (#36159) and +3.2% via batched maxsim (#36710), CUDA graph memory accounting during profiling (#30515), checkpoint prefetch to OS page cache (#36012), InstantTensor weight loader (#36139), sporadic stall fix via pin_memory removal (#37006).
  • Stability: VLM concurrent throughput degradation fix (#​36557), DP deadlock fix (#​35194), DeepSeek V3.2 OOM during CG profiling (#​36691), Ray DP startup crash (#​36665), NCCL rank calculation fix (#​36940), zero-init MLA output buffers for NaN prevention (#​37442), CUDA OOM fix (#​35594).
  • Defaults: Cascade attention disabled by default (#​36318).
  • Extensibility: OOT linear method registration (#​35981), custom collective ops registration for non-CUDA platforms (#​34760).
Kernel
  • FA4 for MLA prefill (#​34732).
  • FlashInfer Sparse MLA: FP8 KV cache support (#​35891), CUDA graphs on ROCm (#​35719), MTP lens > 1 on ROCm (#​36681).
  • TRTLLM FP8 MoE modular kernel (#​36307).
  • FP8 KV cache for Triton MLA decode (#​34597).
  • FlashInfer MoE A2A kernel (#​36022).
  • Remove chunking from FusedMoE for full batch processing (#​34086).
  • CustomOp FusedRMSNormGated for torch.compile compatibility (#​35877).
  • Mamba2 SSD prefill Triton kernel optimization (#​35397).
  • DeepSeek-V3.2: Vectorized MLA query concat kernel (#​34917), optimized FP8 KV cache gather for context parallel (#​35290).
  • 320-dimension MLA head size support (#​36161).
  • Packed recurrent fast path for decode (#​36596).
  • EP scatter race condition fix (#​34991).
Hardware & Performance
Large Scale Serving
  • Elastic EP Milestone 2: NIXL-EP integration (#35627), --enable-ep-weight-filter for faster EP loading (#37351); see the launch sketch after this list.
  • PD Disaggregation: ~5% scheduler overhead reduction (#​35781), KV transfer fix with spec decode (#​35158), P/D for hybrid SSM-FA models via NIXL (#​36687), PP for multimodal models on Transformers backend (#​37057).
  • KV Connectors: HMA + NIXL connector (#​35758), FlexKV offloading (#​34328), worker→scheduler metadata (#​31964), All-to-All DCP backend (#​34883).
  • LMCache: Fault tolerance mechanism (#​36586), memory leak fix (#​35931), race condition fix (#​35831), TP size for MLA multi-reader locking (#​36129).
  • EP loading: Skip non-local expert weights (#​37136).
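A hedged launch sketch for the elastic-EP items: --enable-expert-parallel is a pre-existing vLLM flag, --enable-ep-weight-filter is the new option from #37351, and the model id and parallel sizes are placeholders.

```python
import subprocess

# Expert-parallel MoE serving with the new EP weight filter (#37351), which
# pairs with skipping non-local expert weights at load time (#37136).
# Model id and parallel sizes are illustrative assumptions.
subprocess.Popen([
    "vllm", "serve", "your-org/your-moe-model",   # placeholder model id
    "--tensor-parallel-size", "8",
    "--enable-expert-parallel",
    "--enable-ep-weight-filter",                  # new in v0.18.0
])
```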
Quantization
  • ModelOpt MXFP8 MoE support (#​35986).
  • MXFP4 MoE routing simulation override for accuracy (#​33595).
  • FP8 LoRA dense kernel (#​35242).
  • ROCm: Quark W4A8 MXFP4/FP8 for LinearLayer (#​35316), compressed-tensors fix for DeepSeek-R1 on MI300x (#​36247).
  • Fixes: MLA crash with AWQ/GPTQ quantized models (#​34695), score layer quantization for reranker models (#​35849), GLM-4.1V non-default quantization (#​36321), FP8 k_scale/v_scale loading for Qwen3-MoE (#​35656).
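For orientation on the FP8 items above, a generic serving sketch: --quantization fp8 and --kv-cache-dtype fp8 are long-standing vLLM flags shown only for context, and note the Qwen3.5 FP8-KV-cache accuracy issue on B200 under Known issues.

```python
import subprocess

# FP8 weight quantization plus an FP8 KV cache; both flags predate this
# release and are shown only to situate the fixes above. Placeholder model.
subprocess.Popen([
    "vllm", "serve", "your-org/your-model",
    "--quantization", "fp8",
    "--kv-cache-dtype", "fp8",
])
```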
API & Frontend
  • gRPC: New --grpc flag for gRPC serving (#​36169).
  • GPU-less serving: vllm launch render for preprocessing-only serving (#​36166), vllm launch for GPU-less preprocessing (#​34551).
  • Responses API: Streaming tool/function calling (#29947; see the client sketch after this list), reasoning item fixes (#34499, #36516).
  • Anthropic API: Accept redacted thinking blocks (#​36992).
  • ASR: Online beam search transcriptions (#​36160), offline beam search (#​36153), audio transcription for MP4/M4A/WebM (#​35109), realtime endpoint metrics (#​35500).
  • Tool calling: Granite4 tool parser (#​36827), Qwen3Coder anyOf double encoding fix (#​36032).
  • New options: --distributed-timeout-seconds (#​36047), --attention-backend auto (#​35738), reasoning_effort=none (#​36238), PyTorch profiler schedule (#​35240).
  • Cohere Embed v2 API support (#​37074).
  • Azure Blob Storage support for RunAI Model Streamer (#​34614).
  • Graceful shutdown timeout for in-flight requests (#​36666).
  • Fixes: tool_choice=required exceeding max_tokens crash (#​36841), negative max_tokens with long prompts (#​36789), concurrent classify/token_classify race (#​36614), Anthropic billing header prefix cache miss (#​36829), render endpoint crash for multimodal requests (#​35684), xgrammar dtype mismatch on macOS CPU (#​32384), minimax_m2 tool parser with stream interval > 1 (#​35895).
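A client-side sketch of streaming tool calls over the Responses API (#29947), using the standard OpenAI Python client against a local vLLM server. The tool definition and event handling follow the OpenAI Responses API shape; the base URL, model id, and tool are assumptions, not vLLM-specific guarantees.

```python
from openai import OpenAI

# Point the standard OpenAI client at a local vLLM server (URL/key assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    input="What is the weather in Paris?",
    tools=[{
        "type": "function",
        "name": "get_weather",                 # hypothetical tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    stream=True,  # new in v0.18.0: tool calls now work with streaming
)

# Print streamed events; function-call argument deltas arrive incrementally.
for event in stream:
    print(event.type, getattr(event, "delta", ""))
```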
Security
  • Respect user trust_remote_code setting in NemotronVL and KimiK25 (#​36192).
  • Upgrade xgrammar for security fix (#​36168).
  • Guard RLHF weight sync deserialization behind insecure serialization flag (#​35928).
Dependencies
Breaking Changes
  1. Ray no longer a default dependency — install explicitly if needed (#36170); see the sketch after this list.
  2. Deprecated items removed — items previously deprecated and scheduled for removal in v0.18 have been removed (#36470, #36006).
  3. Cascade attention disabled by default (#​36318).
  4. swap_space parameter removed (V0 deprecation, #​36216).
  5. Monolithic TRTLLM MoE disabled for renormalize routing — late fix cherry-picked (#​37591).
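For item 1, a minimal migration sketch, assuming pip and a deployment that actually uses the Ray distributed executor; whether plain ray or an extras bundle is right for you depends on your previous setup.

```python
import subprocess

# Ray no longer ships with vllm by default (#36170); install it explicitly
# alongside the upgrade if you rely on the Ray executor.
subprocess.run(["pip", "install", "vllm==0.18.0", "ray"], check=True)
```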
New Contributors 🎉

Configuration

📅 Schedule: Branch creation - Only on Sunday ( * * * * 0 ) in timezone America/Los_Angeles, Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻️ Rebasing: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this MR and you won't be reminded about these updates again.


  • If you want to rebase/retry this MR, check this box

This MR has been generated by Renovate Bot.
