The vulnerability is a classic XML External Entity (XXE) injection in the HTMLSectionSplitter class of the langchain-text-splitters library. It arose from a combination of two factors:
- User-Controlled Input: The HTMLSectionSplitter.__init__ method accepted an xslt_path parameter, allowing an attacker to specify the location of an XSLT stylesheet. This served as the injection vector.
- Unsafe Parsing: The HTMLSectionSplitter.convert_possible_tags_to_header method used lxml.etree.parse() to process the stylesheet from the provided path. This parsing was done without any security hardening, meaning external entities within a malicious XSLT file would be resolved. This could be exploited to read sensitive files from the local system or initiate server-side requests (SSRF).
An attacker could exploit this by instantiating HTMLSectionSplitter with a path to a crafted malicious XSLT file and then calling a method like split_text or split_documents, which in turn calls the vulnerable convert_possible_tags_to_header function.
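The flow described above can be sketched roughly as follows. This is a simplified stand-in written directly against lxml, not the library's actual source: the helper name load_stylesheet_unsafely and the file name caller_supplied.xslt are hypothetical, and a benign stylesheet takes the place of an attacker's file. The point is only that a caller-chosen path reaches etree.parse() and etree.XSLT() with no explicit restrictions.

```python
import tempfile
from pathlib import Path

from lxml import etree


def load_stylesheet_unsafely(xslt_path: str) -> etree.XSLT:
    """Hypothetical stand-in for the pre-patch flow, not the library's code."""
    # No hardened parser is passed, so entity handling falls back to
    # lxml's defaults, which depend on the installed lxml version.
    tree = etree.parse(xslt_path)
    # No access_control argument: no explicit restrictions on the transform.
    return etree.XSLT(tree)


# A benign stylesheet stands in for the attacker-supplied file.
BENIGN_XSLT = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out><xsl:value-of select="/doc/title"/></out>
  </xsl:template>
</xsl:stylesheet>"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "caller_supplied.xslt"
    path.write_text(BENIGN_XSLT, encoding="utf-8")
    # The path is chosen by the caller; nothing checks where it points.
    transform = load_stylesheet_unsafely(str(path))
    result = transform(etree.fromstring("<doc><title>Hello</title></doc>"))
    output = str(result)
```

Whatever file the path points to is parsed and compiled as a stylesheet, which is why control over that one argument was enough to serve as the injection vector.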
The patch effectively mitigates this vulnerability through two main changes:
- Removing the Attack Vector: The xslt_path parameter was removed from the constructor, forcing the class to use a hardcoded, trusted default XSLT file. This eliminates the ability for an attacker to supply a malicious file.
- Defense-in-Depth: As a secondary protection, the XML and XSLT parsers were hardened by explicitly disabling network access, entity resolution, and DTD loading, and by applying a strict access control policy. This ensures that even if an attacker found another way to control the XSLT content, the parser would not process dangerous entities or access external resources.
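A minimal sketch of what such hardening can look like in lxml is shown below. It is an illustration under assumptions, not the patch itself: the helper names make_hardened_parser and load_trusted_stylesheet are hypothetical, while the XMLParser options (resolve_entities, no_network, load_dtd) and XSLTAccessControl.DENY_ALL are existing lxml features that correspond to the protections listed above. The temporary secret.txt file only serves to check that an external entity is not expanded.

```python
import tempfile
from pathlib import Path

from lxml import etree


def make_hardened_parser() -> etree.XMLParser:
    # Hypothetical helper name; the three flags are real XMLParser options.
    return etree.XMLParser(
        resolve_entities=False,  # leave entity references unexpanded
        no_network=True,         # never fetch resources over the network
        load_dtd=False,          # do not load external DTDs
    )


def load_trusted_stylesheet(xslt_path: str) -> etree.XSLT:
    tree = etree.parse(xslt_path, make_hardened_parser())
    # DENY_ALL forbids file and network access during the transform.
    return etree.XSLT(tree, access_control=etree.XSLTAccessControl.DENY_ALL)


TRUSTED_XSLT = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out><xsl:value-of select="/doc/title"/></out>
  </xsl:template>
</xsl:stylesheet>"""

with tempfile.TemporaryDirectory() as tmp:
    secret = Path(tmp) / "secret.txt"
    secret.write_text("TOP-SECRET-MARKER", encoding="utf-8")

    # Classic XXE shape: an external entity pointing at a local file.
    xxe_doc = (
        f'<!DOCTYPE r [<!ENTITY xxe SYSTEM "{secret.as_uri()}">]>'
        "<r>&xxe;</r>"
    )
    root = etree.fromstring(xxe_doc.encode("utf-8"), make_hardened_parser())
    serialized = etree.tostring(root).decode("utf-8")
    leaked = "TOP-SECRET-MARKER" in serialized

    # A trusted, fixed stylesheet still works under the restrictions.
    xslt_file = Path(tmp) / "trusted.xslt"
    xslt_file.write_text(TRUSTED_XSLT, encoding="utf-8")
    transform = load_trusted_stylesheet(str(xslt_file))
    doc = etree.fromstring("<doc><title>Hello</title></doc>")
    trusted_output = str(transform(doc))
```

The two layers are independent: the parser flags govern how the stylesheet file itself is read, while the access control policy limits what a compiled stylesheet may touch while it runs.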
Therefore, the functions HTMLSectionSplitter.__init__ and HTMLSectionSplitter.convert_possible_tags_to_header are the key indicators of this vulnerability, as one provides the entry point and the other performs the unsafe operation.