The vulnerability lies in the propose method of the ExtractHiddenStatesProposer class in vLLM's speculative decoding implementation. A refactoring in version 0.18.0 removed a call to .unsqueeze(-1) on the returned tensor, sampled_token_ids. The change was incorrect because, after the first decode step, the rejection sampler can produce a tensor of shape (batch_size, 2), while the downstream code that applies penalty parameters expects shape (batch_size, 1). The mismatch raises a RuntimeError, which crashes the server. The patch, commit bd1c3a9c34ce623edca021623c720fff1b8cf588, is consistent with this analysis: it slices sampled_token_ids to [:, :1], so the tensor has the expected shape before it is returned. ExtractHiddenStatesProposer.propose is therefore the direct source of the vulnerability.
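
The shape mismatch and the effect of the [:, :1] slice can be illustrated with a minimal sketch. This is not vLLM code: nested Python lists stand in for the tensors, and the names shape, apply_penalties, propose_unpatched, and propose_patched are hypothetical helpers introduced only for illustration. The sketch assumes only what is described above, namely that the downstream consumer requires exactly one token column per request.

```python
from typing import List, Tuple

# Stand-in for a 2-D tensor of shape (batch_size, num_tokens).
TokenBatch = List[List[int]]


def shape(batch: TokenBatch) -> Tuple[int, int]:
    """Return (rows, cols) of a rectangular nested list."""
    cols = len(batch[0]) if batch else 0
    return (len(batch), cols)


def apply_penalties(sampled_token_ids: TokenBatch) -> TokenBatch:
    """Hypothetical downstream consumer that requires shape (batch_size, 1)."""
    rows, cols = shape(sampled_token_ids)
    if cols != 1:
        raise RuntimeError(f"expected shape ({rows}, 1), got ({rows}, {cols})")
    return sampled_token_ids


def propose_unpatched(sampled_token_ids: TokenBatch) -> TokenBatch:
    """Models the vulnerable behavior: the sampler output is returned as-is."""
    return sampled_token_ids


def propose_patched(sampled_token_ids: TokenBatch) -> TokenBatch:
    """Models the fix: keep only the first column, like tensor[:, :1]."""
    return [row[:1] for row in sampled_token_ids]


# Illustrative sampler output after the first decode step: shape (2, 2).
sampler_output = [[11, 12], [21, 22]]

# The patched path always hands the consumer a (batch_size, 1) batch.
safe = apply_penalties(propose_patched(sampler_output))

# The unpatched path forwards the (2, 2) batch and the consumer raises.
try:
    apply_penalties(propose_unpatched(sampler_output))
    crashed = False
except RuntimeError:
    crashed = True
```

Slicing rather than reshaping keeps the batch dimension intact and is a no-op when the input already has a single column, so the same code path is safe on the first decode step and on later ones.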