The vulnerability lies in how Ray Data, Ray's data processing library, deserializes custom Arrow extension types. The root cause is the _deserialize_with_fallback function, introduced in python/ray/air/util/tensor_extensions/arrow.py as part of PR #54831. This function first tries to deserialize the data with cloudpickle.loads() and falls back to the safer json.loads() only if that attempt fails.
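The pickle-first, JSON-second pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not Ray's actual code: the function name deserialize_with_fallback, the broad except clause, and the sample metadata are assumptions, and the standard library's pickle stands in for cloudpickle so the sketch has no third-party dependencies.

```python
import json
import pickle  # stand-in for cloudpickle in this sketch


def deserialize_with_fallback(serialized: bytes):
    """Hypothetical pickle-first, JSON-second deserializer (not Ray's code)."""
    try:
        # Unsafe step: the bytes are unpickled before any validation.
        return pickle.loads(serialized)
    except Exception:
        # Reached only when the bytes are not a valid pickle stream.
        return json.loads(serialized)


# Made-up metadata, encoded both ways.
json_metadata = b'{"shape": [2, 3]}'
pickled_metadata = pickle.dumps({"shape": [2, 3]})

from_json = deserialize_with_fallback(json_metadata)
from_pickle = deserialize_with_fallback(pickled_metadata)
```

Both inputs decode to the same dictionary, which is why the fallback looks harmless for well-formed files. The problem is the ordering: whoever wrote the bytes decides whether the pickle branch is taken, so the JSON path offers no protection.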
When a Ray Data process reads a Parquet file, PyArrow identifies the custom Ray tensor extension types (ray.data.arrow_tensor, ray.data.arrow_tensor_v2, etc.) in the file's schema. For these types, PyArrow invokes their respective __arrow_ext_deserialize__ methods to process the extension metadata.
Analysis of commit 6654b6f5d6547d259995445d1611b199596a430b shows that the __arrow_ext_deserialize__ methods of ArrowTensorTypeV1, ArrowTensorTypeV2, and ArrowVariableShapedTensorType were modified to call the new _deserialize_with_fallback function. That function takes the serialized metadata directly from the Parquet file and passes it to cloudpickle.loads(). Because cloudpickle executes arbitrary code when it deserializes a crafted payload, an attacker can build a malicious Parquet file that triggers remote code execution when a Ray Data application reads it. Execution happens as soon as the file's schema is parsed, before any data rows are accessed.
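The reason unpickling untrusted bytes is dangerous can be shown with a deliberately harmless example. This is a minimal sketch using the standard library's pickle rather than cloudpickle. The class BenignPayload and the helper json_only_deserialize are hypothetical names introduced for illustration, and the payload only adds two numbers. A pickle stream can name any importable callable and its arguments, and the unpickler calls it during loading.

```python
import json
import operator
import pickle


class BenignPayload:
    """Hypothetical stand-in for attacker-controlled extension metadata."""

    def __reduce__(self):
        # The pickle stream records "call operator.add(40, 2) on load".
        # A real attacker would name a far less harmless callable.
        return (operator.add, (40, 2))


payload = pickle.dumps(BenignPayload())

# The callable chosen by the payload runs inside loads(); the caller gets
# its return value instead of a BenignPayload instance.
result = pickle.loads(payload)


def json_only_deserialize(serialized: bytes):
    """Hypothetical safer variant: parse the bytes as JSON and nothing else."""
    return json.loads(serialized)


try:
    json_only_deserialize(payload)
    rejected = False
except ValueError:
    # Pickle bytes are not valid JSON, so the payload is refused
    # without anything being executed.
    rejected = True
```

Here result is the value returned by the injected call, and the JSON-only path rejects the same bytes outright. json.loads only ever builds plain dictionaries, lists, strings, and numbers, which is why the source describes it as the safer option.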