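The vulnerability is a classic XML Entity Expansion (XEE) flaw, better known as a 'billion laughs' attack. It is sometimes loosely grouped with XML External Entity (XXE) issues, although it needs only internal entities declared in the document's own DTD. It affected multiple data readers in the LlamaIndex ecosystem that parse XML from external sources, such as API responses or sitemap files. The root cause was the use of Python's standard xml.etree.ElementTree.fromstring function, which does not itself guard against nested entity expansion in DTDs; the standard library documentation warns against using this module on untrusted data. An attacker could craft a malicious XML document whose entities each reference several copies of the previous one, so that a payload of a few hundred bytes expands exponentially during parsing. The application would then consume an excessive amount of memory, leading to a Denial of Service (DoS).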
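To make the expansion mechanism concrete, here is a deliberately tiny, harmless sketch using only the standard library. The helper name build_nested_entity_doc and the depth and fanout values are illustrative assumptions, not taken from the affected readers; the point is only that the parsed text grows as fanout ** depth while the document itself grows linearly.

```python
import xml.etree.ElementTree as ET


def build_nested_entity_doc(depth, fanout):
    """Build a small XML document whose entities nest `depth` levels deep.

    Hypothetical helper for illustration: each entity e<N> references
    `fanout` copies of e<N-1>, so the fully expanded text contains
    fanout ** depth copies of the base string "ha".
    """
    decls = ['<!ENTITY e0 "ha">']
    for level in range(1, depth + 1):
        refs = f"&e{level - 1};" * fanout
        decls.append(f'<!ENTITY e{level} "{refs}">')
    dtd = "\n  ".join(decls)
    return f"<!DOCTYPE root [\n  {dtd}\n]>\n<root>&e{depth};</root>"


# Keep the numbers tiny: 4 ** 3 = 64 copies of "ha", i.e. 128 characters.
doc = build_nested_entity_doc(depth=3, fanout=4)

# The standard parser expands the internal entities while parsing.
root = ET.fromstring(doc)

print("document length:", len(doc))
print("expanded text length:", len(root.text))
```

With larger depth and fanout values the expanded size quickly outgrows available memory, which is the behavior the attack relies on. Recent Expat releases add their own amplification limits, so the exact failure mode can depend on the Expat version the interpreter is linked against.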
The patch addresses this issue by replacing all instances of xml.etree.ElementTree.fromstring with defusedxml.ElementTree.fromstring (or defusedxml.fromstring). The defusedxml library is designed specifically to harden XML parsing against such attacks; by default it rejects documents that declare entities instead of expanding them. The vulnerable functions were identified by locating every call site in the codebase where the unsafe parsing function was applied to data that could originate from an external, untrusted source. The affected readers include those for Pubmed, Stripe Docs, and generic web sitemaps.
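The swap described above can be sketched as follows. This is a minimal illustration, not the patch itself: the wrapper name parse_untrusted_xml is hypothetical, and the stdlib fallback (an Expat pre-scan that rejects any entity declaration) is an assumption added only so the sketch still runs where defusedxml is not installed. The affected readers depend on defusedxml directly.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

try:
    # The change the patch describes: same call shape, hardened parser.
    from defusedxml.ElementTree import fromstring as _defused_fromstring
except ImportError:
    # Assumption for this sketch only: tolerate a missing dependency.
    _defused_fromstring = None


def _forbid_entity_decl(name, is_parameter_entity, value, base,
                        system_id, public_id, notation_name):
    """Expat EntityDeclHandler that aborts on the first entity declaration."""
    raise ValueError(f"entity declaration not allowed: {name!r}")


def parse_untrusted_xml(text):
    """Parse XML from an untrusted source (hypothetical wrapper).

    Raises ValueError if the document declares entities. With defusedxml
    this is its EntitiesForbidden exception, a ValueError subclass.
    """
    if _defused_fromstring is not None:
        return _defused_fromstring(text)

    # Fallback: scan with a bare Expat parser first. The handler fires in
    # the DTD, before any entity reference in the body can be expanded.
    guard = expat.ParserCreate()
    guard.EntityDeclHandler = _forbid_entity_decl
    guard.Parse(text, True)
    return ET.fromstring(text)


# Usage: an ordinary sitemap-like document parses as before.
sitemap = "<urlset><url><loc>https://example.com/</loc></url></urlset>"
print(parse_untrusted_xml(sitemap).find("url/loc").text)

# A document that declares entities is rejected instead of expanded.
payload = '<!DOCTYPE r [<!ENTITY a "ha"><!ENTITY b "&a;&a;">]><r>&b;</r>'
try:
    parse_untrusted_xml(payload)
except ValueError as exc:
    print("rejected:", type(exc).__name__)
```

Because the function signature matches the original call, each reader needs only its import changed, which is why the fix could be applied uniformly across call sites.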