An attacker who crafts a malicious PDF file can cause a denial of service (DoS) through excessive memory and CPU consumption. The root cause is missing size validation when parsing the /ToUnicode CMap, which maps character codes to Unicode values during text extraction.
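To make the amplification concrete, the following minimal sketch computes how many mappings a single range entry would expand to. The entry text and the helper name bfrange_entry_count are hypothetical illustrations, not taken from pypdf or from a real exploit file.

```python
def bfrange_entry_count(start_hex: str, end_hex: str) -> int:
    """Return how many character codes one '<start> <end> <target>' entry covers."""
    start = int(start_hex, 16)
    end = int(end_hex, 16)
    # A reversed range (end < start) covers nothing.
    return max(0, end - start + 1)


# Hypothetical CMap entry: a few dozen bytes of input in the PDF.
entry = "<00000000> <00FFFFFF> <0000>"
start_hex, end_hex, _target = (token.strip("<>") for token in entry.split())

entry_bytes = len(entry)
expanded = bfrange_entry_count(start_hex, end_hex)
print(f"{entry_bytes} bytes of input expand to {expanded} mappings")
```

The point of the sketch is the ratio: the cost of an unbounded parser is driven by the numeric width of the range, not by the size of the file.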
The analysis of the patch commit 77d7b8d7cfbe8dd179858dfa42666f73fc6e57a2 reveals that the core of the vulnerability lies in two functions within the pypdf._cmap module: parse_bfrange and parse_bfchar.
- pypdf._cmap.parse_bfrange: This function handles bfrange operators. A malicious PDF can define a range with a very large difference between its start and end values. The original code would loop through this entire range, creating dictionary entries and consuming significant CPU and memory. The patch mitigates this by adding a check (_check_mapping_size) to ensure the size of the range does not exceed a predefined limit before entering the loop.
- pypdf._cmap.parse_bfchar: This function handles bfchar operators, which define individual character mappings. A malicious PDF can include a vast number of these mappings. The original code would process all of them, leading to a large dictionary being created in memory. The patch adds a similar size check at the beginning of the function to limit the number of mappings processed.
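The mitigation pattern described above, checking the size before doing the work, can be sketched as follows. This is an illustrative approximation rather than pypdf's actual code: the names check_mapping_size, expand_bfrange, add_bfchars, and MappingLimitError, as well as the limit value, are assumptions made for the example.

```python
MAX_MAPPING_ENTRIES = 100_000  # hypothetical limit, not pypdf's real value


class MappingLimitError(ValueError):
    """Raised when a CMap would create more entries than allowed (hypothetical)."""


def check_mapping_size(current: int, additional: int,
                       limit: int = MAX_MAPPING_ENTRIES) -> None:
    # Reject the request before any entries are allocated.
    if current + additional > limit:
        raise MappingLimitError(
            f"mapping would grow to {current + additional} entries (limit {limit})"
        )


def expand_bfrange(mapping: dict, start: int, end: int, first_target: int) -> None:
    """Map each code in [start, end] to consecutive Unicode code points."""
    size = max(0, end - start + 1)
    check_mapping_size(len(mapping), size)  # the check precedes the loop
    for offset in range(size):
        mapping[start + offset] = chr(first_target + offset)


def add_bfchars(mapping: dict, pairs: list) -> None:
    """Add individual (code, target) mappings, bounded by the same limit."""
    check_mapping_size(len(mapping), len(pairs))  # the check precedes the loop
    for code, target in pairs:
        mapping[code] = chr(target)


cmap = {}
expand_bfrange(cmap, 0x20, 0x22, 0x41)    # small range: accepted
add_bfchars(cmap, [(0x30, 0x61)])         # single mapping: accepted

try:
    expand_bfrange(cmap, 0x0, 0xFFFFFF, 0x0)  # oversized range: rejected up front
    rejected = False
except MappingLimitError:
    rejected = True
```

Because the check runs before the loop, an oversized range is rejected in constant time and the existing mapping is left unchanged.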
These functions are the direct points of exploitation: they process attacker-controlled data from the PDF without size limits, which leads to the DoS condition. Any operation that involves text extraction from a PDF, such as calling the PageObject.extract_text() method, could trigger these vulnerable functions.