Description
vLLM is an inference and serving engine for large language models (LLMs). From 0.18.0 to before 0.20.0, the extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty). A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. This vulnerability is fixed in 0.20.0.
CVSS breakdown
CVSS 3.1
Attack Vector
Network
Attack Complexity
Low
Privileges Required
Low
User Interaction
None
Scope
Unchanged
Confidentiality
None
Integrity
None
Availability
High
Affected products
- vllm-project / vllm>= 0.18.0, < 0.20.0 – >= 0.18.0, < 0.20.0