The vulnerability exists in the vllm.multimodal.hasher.MultiModalHasher class, specifically within the serialization logic used for hashing multimodal inputs. The root cause is that the serialize_item method serialized PIL.Image.Image and numpy.ndarray objects lossily, discarding information needed to distinguish them.
For images, serialize_item previously called obj.tobytes(), which extracts only raw pixel data. It critically omitted metadata such as the image dimensions (width, height), color mode, and other format-specific information (like palette data stored in Image.info). As a result, two visually distinct images (e.g., a 30x100 image and a 100x30 image, or two palette images with different palettes but identical pixel indices) could serialize to the same byte stream whenever their raw pixel data was byte-identical.
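The image half of the problem can be reproduced directly with Pillow. A minimal sketch, using the 30x100 vs. 100x30 example from above (grayscale mode "L", so one byte per pixel):

```python
from PIL import Image

# 3000 bytes of pixel data: enough for a 30x100 or a 100x30 grayscale image.
data = bytes(i % 256 for i in range(3000))

img_a = Image.frombytes("L", (30, 100), data)   # 30 wide, 100 tall
img_b = Image.frombytes("L", (100, 30), data)   # 100 wide, 30 tall

assert img_a.size != img_b.size                  # visually distinct images...
assert img_a.tobytes() == img_b.tobytes()        # ...identical byte streams
```

Any hasher fed only these byte streams cannot tell the two images apart.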
Similarly, for numpy.ndarray objects (which could also originate from torch.Tensor or scalar inputs), serialize_item used obj.tobytes(), which serializes the array's data buffer but records neither its shape nor its dtype. Arrays with the same total number of elements but different dimensions (e.g., a 2x6 array vs. a 3x4 array) therefore produced the same byte stream whenever their flattened data was identical.
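The array case is just as easy to demonstrate; a minimal NumPy sketch showing both a shape collision and a dtype collision:

```python
import numpy as np

# Same 12 bytes of data, different shapes: tobytes() sees no difference.
a = np.arange(12, dtype=np.uint8).reshape(2, 6)
b = np.arange(12, dtype=np.uint8).reshape(3, 4)
assert a.shape != b.shape
assert a.tobytes() == b.tobytes()

# dtype can also collide: four int16 zeros and eight int8 zeros are
# both eight zero bytes in the underlying buffer.
c = np.zeros(4, dtype=np.int16)
d = np.zeros(8, dtype=np.int8)
assert c.tobytes() == d.tobytes()
```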
The hash_kwargs method uses serialize_item (indirectly, through item_to_bytes in the vulnerable version) to convert all inputs into byte sequences before feeding them to a blake3 hash. Because serialize_item produced identical byte sequences for the distinct inputs described above, hash_kwargs generated identical hash values for them. This hash collision is the primary impact: it leads to incorrect cache hits (e.g., the system treating two different images as the same), and in turn to potential data leakage or incorrect application behavior.
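The end-to-end collision can be sketched with a simplified stand-in for this flow. The helper below is hypothetical (not vLLM's actual code), and hashlib.blake2b substitutes for blake3, which is a third-party package; the structure (serialize each keyword argument, feed the bytes to the hash) mirrors the description above:

```python
import hashlib
import numpy as np

def vulnerable_hash(**kwargs) -> str:
    """Hypothetical sketch of the vulnerable hash_kwargs flow."""
    h = hashlib.blake2b()
    for key, value in sorted(kwargs.items()):
        h.update(key.encode())
        h.update(value.tobytes())  # shape, dtype, and metadata are lost here
    return h.hexdigest()

x = np.arange(12, dtype=np.uint8).reshape(2, 6)
y = np.arange(12, dtype=np.uint8).reshape(3, 4)

# Two distinct inputs, one hash: the cache would treat them as identical.
assert vulnerable_hash(image=x) == vulnerable_hash(image=y)
```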
The fix augments serialize_item to fold the critical metadata into the byte stream fed to the hasher: images are converted to RGBA for a consistent representation and serialized together with their dimensions and mode, and numpy arrays are serialized together with their shape and dtype. Distinct inputs therefore produce distinct byte sequences, preventing the hash collisions.
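The shape of the fix can be sketched as follows. This is an illustrative serializer, not vLLM's exact implementation: it prepends a metadata prefix (type tag, dimensions, mode or shape/dtype) to the data buffer, so the earlier collisions disappear:

```python
import numpy as np
from PIL import Image

def serialize_item(obj) -> bytes:
    """Hypothetical sketch of the patched serialization."""
    if isinstance(obj, Image.Image):
        # Converting to RGBA normalizes the mode and folds palette data
        # into the pixel buffer itself.
        img = obj.convert("RGBA")
        meta = f"image:{img.size[0]}x{img.size[1]}:{img.mode}".encode()
        return meta + img.tobytes()
    if isinstance(obj, np.ndarray):
        meta = f"ndarray:{obj.shape}:{obj.dtype}".encode()
        return meta + obj.tobytes()
    raise TypeError(f"unsupported type: {type(obj)!r}")

a = np.arange(12, dtype=np.uint8).reshape(2, 6)
b = np.arange(12, dtype=np.uint8).reshape(3, 4)
assert a.tobytes() == b.tobytes()            # raw buffers still collide...
assert serialize_item(a) != serialize_item(b)  # ...the serialized forms do not
```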