CVE-2026-53923

Description

vLLM is an inference and serving engine for large language models (LLMs). In versions 0.5.5 up to, but not including, 0.23.1rc0, an integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes only part of each tensor to be processed. The output tensor is allocated at full size with torch::empty, which leaves its memory uninitialized, but the dequantize CUDA kernel writes only the truncated number of elements. The rest of the output tensor keeps whatever was previously stored in that GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, resulting in information disclosure. The vulnerability is fixed in 0.23.1rc0.
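The advisory does not say which integer type the dimensions are narrowed to. The sketch below assumes, purely for illustration, that a 64-bit element count ends up in a signed 32-bit integer (as happens when an int64_t is passed to a C++ int parameter) and models how much of a full-size output would never be written. The helpers to_int32, elements_left_unwritten and checked_numel are hypothetical and are not part of vLLM; the guard in checked_numel is one possible mitigation, not the upstream fix.

```python
# Hypothetical model of the truncation described above. Assumption: a 64-bit
# element count is narrowed to a signed 32-bit int; the advisory does not
# name the actual type or the exact variables involved.

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Emulate two's-complement wraparound when an integer is narrowed to int32."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def elements_left_unwritten(rows: int, cols: int) -> int:
    """Count output elements a kernel bounded by the truncated count never writes.

    The output is allocated with the full element count (as torch::empty would
    be), but the kernel only iterates up to the narrowed value. A negative
    narrowed value is treated as "no elements processed".
    """
    total = rows * cols                  # size used for the allocation
    processed = max(to_int32(total), 0)  # count the kernel actually covers
    return total - processed


def checked_numel(rows: int, cols: int) -> int:
    """Reject shapes whose element count does not fit in int32 instead of
    silently truncating it (an illustrative guard only)."""
    total = rows * cols
    if total > INT32_MAX:
        raise ValueError(f"tensor with {total} elements exceeds the int32 range")
    return total


if __name__ == "__main__":
    rows, cols = 131_072, 40_000               # 5_242_880_000 elements
    print(to_int32(rows * cols))               # 947912704
    print(elements_left_unwritten(rows, cols))  # 4294967296
```

In this model the allocation covers about 5.2 billion elements while the kernel touches fewer than 1 billion, so the remaining 2^32 elements keep stale GPU memory contents. Validating the count before launch (or using 64-bit indexing throughout) avoids the silent partial write.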

CVSS breakdown

CVSS 4.0

Attack Vector: Network
Attack Complexity: Low
Attack Requirements: None
Privileges Required: None
User Interaction: Passive
Confidentiality (Vulnerable System): Low
Integrity (Vulnerable System): Low
Availability (Vulnerable System): None
Confidentiality (Subsequent System): None
Integrity (Subsequent System): None
Availability (Subsequent System): None
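The advisory lists the CVSS 4.0 base metrics individually but does not print the compact vector string. The sketch below reconstructs that string from the values above using the metric abbreviations defined in the CVSS v4.0 specification, and parses it back; the helper names cvss4_vector and parse_cvss4 are hypothetical, and the sketch does not compute a numeric score.

```python
# Base metrics exactly as listed in this advisory, keyed by the CVSS v4.0
# abbreviations, in the order the specification uses for the vector string.
BASE_METRICS = [
    ("AV", "N"),  # Attack Vector: Network
    ("AC", "L"),  # Attack Complexity: Low
    ("AT", "N"),  # Attack Requirements: None
    ("PR", "N"),  # Privileges Required: None
    ("UI", "P"),  # User Interaction: Passive
    ("VC", "L"),  # Confidentiality (Vulnerable System): Low
    ("VI", "L"),  # Integrity (Vulnerable System): Low
    ("VA", "N"),  # Availability (Vulnerable System): None
    ("SC", "N"),  # Confidentiality (Subsequent System): None
    ("SI", "N"),  # Integrity (Subsequent System): None
    ("SA", "N"),  # Availability (Subsequent System): None
]


def cvss4_vector(metrics: list[tuple[str, str]]) -> str:
    """Join metric/value pairs into a CVSS v4.0 vector string."""
    return "CVSS:4.0/" + "/".join(f"{key}:{value}" for key, value in metrics)


def parse_cvss4(vector: str) -> dict[str, str]:
    """Split a CVSS v4.0 vector string back into a metric -> value mapping."""
    prefix, *parts = vector.split("/")
    if prefix != "CVSS:4.0":
        raise ValueError(f"not a CVSS 4.0 vector: {vector!r}")
    return dict(part.split(":", 1) for part in parts)


if __name__ == "__main__":
    vector = cvss4_vector(BASE_METRICS)
    print(vector)  # CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:P/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N
```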

Affected products

vllm-project / vllm: >= 0.5.5, < 0.23.1rc0
