
Update dependency vllm to v0.18.0

This MR contains the following updates:

Package  Update  Change
vllm     minor   ==0.17.1 -> ==0.18.0

Release Notes

vllm-project/vllm (vllm)

v0.18.0


vLLM v0.18.0
Known issues
  • Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#​37618)
  • If you previously ran into CUBLAS_STATUS_INVALID_VALUE and had to use a workaround in v0.17.0, you can now simply reinstall torch 2.10.0: PyTorch has published an updated wheel that fixes the underlying bug.
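A minimal reinstall sketch, assuming a pip-managed environment; add your usual CUDA index URL if your setup needs one:

```python
import subprocess

# Reinstall torch 2.10.0 so the republished wheel replaces the one that
# triggered CUBLAS_STATUS_INVALID_VALUE under v0.17.0. --no-deps avoids
# touching the rest of the environment.
subprocess.run(
    ["pip", "install", "--force-reinstall", "--no-deps", "torch==2.10.0"],
    check=True,
)
```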
Highlights

This release features 445 commits from 213 contributors (61 new)!

  • gRPC Serving Support: vLLM now supports gRPC serving via the new --grpc flag (#36169), enabling high-performance RPC-based serving alongside the existing HTTP/REST interface (see the launch sketch after this list).
  • GPU-less Render Serving: New vllm launch render command (#​36166, #​34551) enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference.
  • NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler (#​29184), significantly reducing spec decode overhead.
  • KV Cache Offloading Improvements: Smart CPU offloading that stores only frequently-reused blocks (#​35342), plus FlexKV as a new offloading backend (#​34328) and support for multiple KV groups in offloading spec (#​36610).
  • Elastic Expert Parallelism Milestone 2: NIXL-EP integration (#​35627) enables dynamic GPU scaling for MoE experts, with new --enable-ep-weight-filter CLI option (#​37351) for faster EP model loading.
  • FlashInfer 0.6.6: Updated FlashInfer dependency (#​36768) with numerous performance and correctness improvements.
  • Responses API Streaming Tool Calls: The OpenAI Responses API now supports tool/function calling with streaming (#​29947).
  • Online Beam Search for ASR: Beam search support for encoder/decoder models in both offline (#36153) and online (#36160) transcription.
  • Ray No Longer a Default Dependency: Ray has been removed as a default dependency (#​36170) — install it explicitly if needed.
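A launch sketch for the first two highlights. The --grpc flag (#36169) and the vllm launch render subcommand (#36166) come from the notes above; the model name, port, and remaining arguments are illustrative assumptions, not documented defaults.

```python
import subprocess

# 1) Serve over HTTP as before, now also exposing a gRPC endpoint (#36169).
#    Model id and port are placeholders.
server = subprocess.Popen([
    "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",
    "--grpc",           # new in v0.18.0: enable gRPC serving
    "--port", "8000",
])

# 2) GPU-less render process (#36166): multimodal preprocessing only, so it
#    can be scheduled on a CPU-only node. Arguments are assumptions.
renderer = subprocess.Popen([
    "vllm", "launch", "render", "meta-llama/Llama-3.1-8B-Instruct",
])

for proc in (server, renderer):
    proc.wait()
```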
Model Support
  • New architectures: Sarvam MoE (#​33942), OLMo Hybrid (#​32550), HyperCLOVAX-SEED-Think-32B VLM (#​31471), HyperCLOVAX-SEED-Think-14B (#​37107), Kimi-Audio-7B-Instruct (#​36127), ColPali late-interaction retrieval (#​36818), ERNIE pooling models (#​36385).
  • Speculative decoding: Eagle3 for Qwen3.5 (#​36658), Eagle3 for Kimi K2.5 MLA (#​36361), Eagle for Mistral Large 3 with dense layers (#​36163).
  • LoRA: Whisper LoRA (#​29856), FP8 LoRA dense kernel (#​35242).
  • Multimodal: Online use_audio_in_video (#​36319), audio extraction from MP4 for Nemotron Nano VL (#​35539), audio transcription for MP4/M4A/WebM (#​35109), expose media_io_kwargs at runtime (#​34778), fast media preprocessing for Nano Nemotron VL (#​35657).
  • Compatibility: Gemma/Gemma2 inputs_embeds (#​36787), SigLIP/CLIP Transformers v5 (#​37200), fused expert weights in Transformers backend (#​36997).
  • Performance: Qwen3 Next fused GDN kernel (#​35777), LFM2 tuned H100 MoE configs (#​36699).
  • Fixes: DeepSeek-V3.2 tokenizer space stripping (#​37004), Qwen3.5 tool calling (#​36774), Qwen3-VL timestamp mismatch (#​36136), Qwen3-Next TP>1 weight sharding (#​36242), Qwen3-ASR torch.compile (#​35869), MiniCPM-V audio inference (#​36751), MiniCPM-O 4.5 ViT attention (#​34127), routed experts for hybrid models (#​35744), Qwen2.5-Omni/Qwen3-Omni multi-video audio_in_video (#​37147), DeepSeek-OCR empty images crash (#​36670).
Engine Core
  • Model Runner V2: Probabilistic rejection sampling for spec decode (#​35461), pooling models (#​36019), extensible CUDA graph dispatch (#​35959), WhisperModelState (#​35790), XD-RoPE (#​36817), model_state CUDA graph capture (#​36544).
  • KV cache offloading: Reuse-frequency-gated CPU stores (#​35342), FlexKV offloading backend (#​34328), multiple KV groups (#​36610), async scheduling fix (#​33881).
  • Speculative decoding: NGram GPU implementation with async scheduler (#​29184), fused EAGLE step slot mapping (#​33503).
  • Performance: Removed busy loop from idle buffer readers (#28053), +2.7% E2E pooling throughput via worker-side maxsim (#36159) and +3.2% via batched maxsim (#36710), CUDA graph memory accounting during profiling (#30515), checkpoint prefetch to OS page cache (#36012), InstantTensor weight loader (#36139), sporadic stall fix via pin_memory removal (#37006).
  • Stability: VLM concurrent throughput degradation fix (#​36557), DP deadlock fix (#​35194), DeepSeek V3.2 OOM during CG profiling (#​36691), Ray DP startup crash (#​36665), NCCL rank calculation fix (#​36940), zero-init MLA output buffers for NaN prevention (#​37442), CUDA OOM fix (#​35594).
  • Defaults: Cascade attention disabled by default (#​36318).
  • Extensibility: OOT linear method registration (#​35981), custom collective ops registration for non-CUDA platforms (#​34760).
Kernel
  • FA4 for MLA prefill (#​34732).
  • FlashInfer Sparse MLA: FP8 KV cache support (#​35891), CUDA graphs on ROCm (#​35719), MTP lens > 1 on ROCm (#​36681).
  • TRTLLM FP8 MoE modular kernel (#​36307).
  • FP8 KV cache for Triton MLA decode (#​34597).
  • FlashInfer MoE A2A kernel (#​36022).
  • Remove chunking from FusedMoE for full batch processing (#​34086).
  • CustomOp FusedRMSNormGated for torch.compile compatibility (#​35877).
  • Mamba2 SSD prefill Triton kernel optimization (#​35397).
  • DeepSeek-V3.2: Vectorized MLA query concat kernel (#​34917), optimized FP8 KV cache gather for context parallel (#​35290).
  • 320-dimension MLA head size support (#​36161).
  • Packed recurrent fast path for decode (#​36596).
  • EP scatter race condition fix (#​34991).
Hardware & Performance
Large Scale Serving
  • Elastic EP Milestone 2: NIXL-EP integration (#35627), --enable-ep-weight-filter for faster EP loading (#37351); see the launch sketch after this list.
  • PD Disaggregation: ~5% scheduler overhead reduction (#​35781), KV transfer fix with spec decode (#​35158), P/D for hybrid SSM-FA models via NIXL (#​36687), PP for multimodal models on Transformers backend (#​37057).
  • KV Connectors: HMA + NIXL connector (#​35758), FlexKV offloading (#​34328), worker→scheduler metadata (#​31964), All-to-All DCP backend (#​34883).
  • LMCache: Fault tolerance mechanism (#​36586), memory leak fix (#​35931), race condition fix (#​35831), TP size for MLA multi-reader locking (#​36129).
  • EP loading: Skip non-local expert weights (#​37136).
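A hedged launch sketch for the elastic-EP items: --enable-expert-parallel is a pre-existing vLLM flag, --enable-ep-weight-filter is the new option from #37351, and the model id and parallel sizes are placeholders.

```python
import subprocess

# Expert-parallel MoE serving with the new EP weight filter (#37351), which
# pairs with skipping non-local expert weights at load time (#37136).
# Model id and parallel sizes are illustrative assumptions.
subprocess.Popen([
    "vllm", "serve", "your-org/your-moe-model",   # placeholder model id
    "--tensor-parallel-size", "8",
    "--enable-expert-parallel",
    "--enable-ep-weight-filter",                  # new in v0.18.0
])
```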
Quantization
  • ModelOpt MXFP8 MoE support (#​35986).
  • MXFP4 MoE routing simulation override for accuracy (#​33595).
  • FP8 LoRA dense kernel (#​35242).
  • ROCm: Quark W4A8 MXFP4/FP8 for LinearLayer (#​35316), compressed-tensors fix for DeepSeek-R1 on MI300x (#​36247).
  • Fixes: MLA crash with AWQ/GPTQ quantized models (#​34695), score layer quantization for reranker models (#​35849), GLM-4.1V non-default quantization (#​36321), FP8 k_scale/v_scale loading for Qwen3-MoE (#​35656).
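For orientation on the FP8 items above, a generic serving sketch: --quantization fp8 and --kv-cache-dtype fp8 are long-standing vLLM flags shown only for context, and note the Qwen3.5 FP8-KV-cache accuracy issue on B200 under Known issues.

```python
import subprocess

# FP8 weight quantization plus an FP8 KV cache; both flags predate this
# release and are shown only to situate the fixes above. Placeholder model.
subprocess.Popen([
    "vllm", "serve", "your-org/your-model",
    "--quantization", "fp8",
    "--kv-cache-dtype", "fp8",
])
```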
API & Frontend
  • gRPC: New --grpc flag for gRPC serving (#​36169).
  • GPU-less serving: vllm launch render for preprocessing-only serving (#​36166), vllm launch for GPU-less preprocessing (#​34551).
  • Responses API: Streaming tool/function calling (#29947; see the client sketch after this list), reasoning item fixes (#34499, #36516).
  • Anthropic API: Accept redacted thinking blocks (#​36992).
  • ASR: Online beam search transcriptions (#​36160), offline beam search (#​36153), audio transcription for MP4/M4A/WebM (#​35109), realtime endpoint metrics (#​35500).
  • Tool calling: Granite4 tool parser (#​36827), Qwen3Coder anyOf double encoding fix (#​36032).
  • New options: --distributed-timeout-seconds (#​36047), --attention-backend auto (#​35738), reasoning_effort=none (#​36238), PyTorch profiler schedule (#​35240).
  • Cohere Embed v2 API support (#​37074).
  • Azure Blob Storage support for RunAI Model Streamer (#​34614).
  • Graceful shutdown timeout for in-flight requests (#​36666).
  • Fixes: tool_choice=required exceeding max_tokens crash (#​36841), negative max_tokens with long prompts (#​36789), concurrent classify/token_classify race (#​36614), Anthropic billing header prefix cache miss (#​36829), render endpoint crash for multimodal requests (#​35684), xgrammar dtype mismatch on macOS CPU (#​32384), minimax_m2 tool parser with stream interval > 1 (#​35895).
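A client-side sketch of streaming tool calls over the Responses API (#29947), using the standard OpenAI Python client against a local vLLM server. The tool definition and event handling follow the OpenAI Responses API shape; the base URL, model id, and tool are assumptions, not vLLM-specific guarantees.

```python
from openai import OpenAI

# Point the standard OpenAI client at a local vLLM server (URL/key assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    input="What is the weather in Paris?",
    tools=[{
        "type": "function",
        "name": "get_weather",                 # hypothetical tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    stream=True,  # new in v0.18.0: tool calls now work with streaming
)

# Print streamed events; function-call argument deltas arrive incrementally.
for event in stream:
    print(event.type, getattr(event, "delta", ""))
```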
Security
  • Respect user trust_remote_code setting in NemotronVL and KimiK25 (#​36192).
  • Upgrade xgrammar for security fix (#​36168).
  • Guard RLHF weight sync deserialization behind insecure serialization flag (#​35928).
Dependencies
Breaking Changes
  1. Ray no longer a default dependency — install explicitly if needed (#36170); see the sketch after this list.
  2. Deprecated items removed — items previously deprecated and scheduled for removal in v0.18 have been removed (#36470, #36006).
  3. Cascade attention disabled by default (#​36318).
  4. swap_space parameter removed (V0 deprecation, #​36216).
  5. Monolithic TRTLLM MoE disabled for renormalize routing — late fix cherry-picked (#​37591).
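For item 1, a minimal migration sketch, assuming pip and a deployment that actually uses the Ray distributed executor; whether plain ray or an extras bundle is right for you depends on your previous setup.

```python
import subprocess

# Ray no longer ships with vllm by default (#36170); install it explicitly
# alongside the upgrade if you rely on the Ray executor.
subprocess.run(["pip", "install", "vllm==0.18.0", "ray"], check=True)
```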
New Contributors 🎉

Configuration

📅 Schedule: Branch creation - Only on Sunday ( * * * * 0 ) in timezone America/Los_Angeles, Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻️ Rebasing: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this MR and you won't be reminded about these updates again.


  • If you want to rebase/retry this MR, check this box

This MR has been generated by Renovate Bot.
