The vulnerability is a memory exhaustion issue (CWE-409: Improper Handling of Highly Compressed Data) in the LZW decompression code of the pypdf library. An attacker can craft a PDF whose compressed data stream expands to an extremely large size on decompression, consuming all available memory and causing a denial of service.
Analysis of the provided patch commit e51d07807ffcdaf18077b9486dadb3dc05b368da shows where the vulnerability is located and how it was fixed:
- pypdf/_codecs/_codecs.py: The core of the vulnerability lies in the LzwCodec.decode method. The patch introduces a mechanism to track the length of the output stream (output_length) and compares it against a max_output_length. If the limit is exceeded, a LimitReachedError is raised. Before this change, there was no such limit, allowing for uncontrolled memory allocation.
- pypdf/filters.py: The LZWFlateDecode.decode method is the entry point for applying the LZW filter. It was modified to instantiate LzwCodec with a newly defined LZW_MAX_OUTPUT_LENGTH. This change ensures that the safety limit is applied whenever the LZW filter is used.
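The idea behind the fix can be illustrated with a minimal sketch. This is not pypdf's code: it is a simplified textbook LZW decoder over a list of integer codes (no variable code widths, no clear or end-of-data codes), and lzw_decode_bounded and its local LimitReachedError class are hypothetical names chosen to mirror the identifiers mentioned above. It only shows how a running output_length compared against max_output_length stops a small input from expanding without bound.

```python
class LimitReachedError(Exception):
    """Hypothetical stand-in for the error type named in the patch."""


def lzw_decode_bounded(codes, max_output_length):
    """Decode integer LZW codes, refusing to emit more than max_output_length bytes.

    Simplified illustration only: fixed dictionary start, no clear/EOD codes.
    """
    # The dictionary starts with every single-byte string (codes 0..255).
    table = [bytes([i]) for i in range(256)]
    chunks = []
    output_length = 0
    previous = b""

    for code in codes:
        if code < len(table):
            entry = table[code]
        elif code == len(table) and previous:
            # Code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")

        # Enforce the cap before keeping the newly decoded bytes.
        output_length += len(entry)
        if output_length > max_output_length:
            raise LimitReachedError(
                f"decoded output exceeds {max_output_length} bytes"
            )
        chunks.append(entry)

        # Grow the dictionary as a standard LZW decoder does.
        if previous:
            table.append(previous + entry[:1])
        previous = entry

    return b"".join(chunks)
```

In this sketch, the code sequence [65, 256, 257, ...] makes every new code refer to the entry being created, so n codes decode to 1 + 2 + ... + n bytes. Without the length check, a few thousand codes would already produce megabytes of output; with it, decoding stops as soon as the cap is crossed.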
Therefore, while this vulnerability is being exploited, a runtime profiler would show significant time and memory being spent in the LzwCodec.decode function, which is called by LZWFlateDecode.decode. These two functions are the primary indicators that the vulnerable code path is executing.
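One way to make such an observation is sketched below, using only Python's standard cProfile, pstats, and tracemalloc modules. The expand function is a hypothetical stand-in for an expensive decode call and profile_call is an illustrative helper; neither is part of pypdf. In a real investigation, the profiled callable would be whatever code opens and reads the suspicious PDF.

```python
import cProfile
import io
import pstats
import tracemalloc


def expand(count):
    # Hypothetical stand-in for a decode step that amplifies its input.
    return b"".join(b"A" * n for n in range(count))


def profile_call(func, *args):
    """Run func under cProfile and tracemalloc.

    Returns (result, text report of the top functions, peak traced bytes).
    """
    tracemalloc.start()
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Render the five entries with the highest cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue(), peak
```

Calling profile_call(expand, 500) returns the decoded bytes, a report in which expand appears among the top cumulative entries, and the peak number of bytes allocated while the call ran. Applied to a vulnerable decode path, the same approach would be expected to surface the decode function at the top of the report alongside a large peak allocation.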