The analysis focused on the patches introduced in versions 2.91.0 and 2.94.0 of the docling library, which were referenced in the security advisory. By comparing the code before and after the patches, I identified two key functions in docling/backend/html_backend.py that were the source of the vulnerabilities.
- HTMLDocumentBackend._render_with_browser: Commit 9813190ab4126c1ff2fde1e3e72322821530390b shows that this function originally created its browser context without any restrictions. The patch adds two controls: it disables JavaScript and installs request routing that blocks unauthorized resource loading. Without these controls, the vulnerable versions exposed this function as a primary entry point for exploitation.
- HTMLDocumentBackend._load_image_data: Commit cd0cb695303d8ce1b3c9fe620b182b0e22d8c53f reveals multiple vulnerabilities in this single function. The original code performed HTTP requests (requests.get), base64 decoding (base64.b64decode), and file access (os.path.isfile) without proper validation, which allowed Server-Side Request Forgery (SSRF), Uncontrolled Resource Consumption (DoS), and Path Traversal. The patch adds validation and limits throughout: IP address validation (_validate_url_safety), size checks for remote and data URI images, and stricter timeout handling.
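
To make the kind of checks described for _load_image_data concrete, here is a minimal sketch of address validation and a data URI size limit. It is not docling's code: the names check_url_is_public, decode_data_uri_image, and MAX_IMAGE_BYTES, and the 20 MiB limit, are hypothetical and chosen only for illustration. The actual _validate_url_safety in the patch may work differently.

```python
import base64
import ipaddress
import socket
from urllib.parse import urlparse

# Hypothetical limit for illustration; not the value used by docling.
MAX_IMAGE_BYTES = 20 * 1024 * 1024


def is_public_ip(address: str) -> bool:
    """Return True only if the address is not in an internal or special range."""
    ip = ipaddress.ip_address(address.split("%", 1)[0])  # drop IPv6 scope id
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def check_url_is_public(url: str) -> None:
    """Raise ValueError unless the URL is http(s) and resolves only to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"scheme not allowed: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no host")
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror as exc:
        raise ValueError(f"cannot resolve host: {parsed.hostname}") from exc
    # Reject if any resolved address is internal, not just the first one.
    for info in infos:
        if not is_public_ip(info[4][0]):
            raise ValueError(f"host resolves to a non-public address: {info[4][0]}")


def decode_data_uri_image(uri: str, max_bytes: int = MAX_IMAGE_BYTES) -> bytes:
    """Decode a base64 data URI, rejecting oversized payloads before decoding."""
    header, sep, payload = uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    # Four base64 characters encode at most three bytes, so this is a
    # conservative upper bound on the decoded size.
    if (len(payload) * 3) // 4 > max_bytes:
        raise ValueError("data URI payload exceeds size limit")
    return base64.b64decode(payload, validate=True)
```

A check like this only covers the initial URL. A complete defense would also need to cap the bytes read from the response, set timeouts, and re-validate any redirect target, since the address seen at validation time may differ from the one actually fetched.
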
Both functions process externally supplied HTML and the resources it references, such as images, which is where the vulnerabilities lie. The patches introduce security controls that were previously absent, which confirms these two functions as the vulnerable ones.