The vulnerability lies in the lxml-html-clean library's default HTML cleaning process, which fails to remove the <base> HTML tag. The analysis of the patch in commit 9c5612ca33b941eec4178abf8a5294b103403f34 pinpoints the exact location of the fix.
The file lxml_html_clean/clean.py contains the Cleaner class, which is responsible for the sanitization logic. The __call__ method of this class iterates through the HTML document and removes unwanted tags.
Before the patch, the __call__ method had no specific logic to handle the <base> tag. By default, the cleaner removes page structure tags such as <head>, but <base> was not included in this set. An attacker could inject a <base href="http://evil.com/"> tag, and it would be preserved in the cleaned output. Browsers would then resolve every relative URL on the page (links, scripts, and stylesheets) against the attacker's domain, which enables phishing, XSS, or defacement.
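The effect of a surviving <base> tag on relative URLs can be illustrated with the standard library alone. This is a minimal sketch of browser-style URL resolution, not code from lxml-html-clean; the page URL, the resolve helper, and the script path are hypothetical names chosen for illustration.

```python
from urllib.parse import urljoin

# Hypothetical values, for illustration only.
PAGE_URL = "https://trusted.example/articles/post.html"
INJECTED_BASE = "http://evil.com/"


def resolve(relative_url, base_href=None):
    """Resolve a relative URL roughly the way a browser would.

    If the document carries a <base href>, that value (itself resolved
    against the page URL) becomes the base; otherwise the page URL is used.
    """
    base = urljoin(PAGE_URL, base_href) if base_href else PAGE_URL
    return urljoin(base, relative_url)


# Without an injected <base>, the script stays on the trusted origin.
without_base = resolve("static/app.js")

# With the injected <base>, the same relative path points at the attacker.
with_base = resolve("static/app.js", INJECTED_BASE)

print(without_base)
print(with_base)
```

The same relative path ends up on two different origins, which is why a single preserved <base> tag is enough to redirect every relative resource on the page.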
The patch introduces a check within the Cleaner.__call__ method. It adds the <base> tag to the kill_tags set whenever the <head> tag is being removed. Tags in kill_tags are dropped together with their content, so if the page structure is being cleaned (which is the default behavior), any malicious <base> tags are eliminated as well.
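The logic described above can be modeled roughly as follows. This is a simplified, hypothetical sketch and not the library's actual source: build_tag_sets and its patched flag are invented for illustration, and only the set names remove_tags and kill_tags and the page_structure option are taken from the Cleaner class.

```python
def build_tag_sets(page_structure=True, patched=True):
    """Hypothetical model of how the cleaner decides which tags to drop.

    remove_tags: the tag itself is dropped, its children are kept.
    kill_tags:   the tag is dropped together with its content.
    """
    remove_tags = set()
    kill_tags = set()

    if page_structure:
        # Page structure tags are stripped when page_structure is enabled.
        remove_tags.update(("head", "html", "title"))
        if patched:
            # The fix as described: <base> is killed whenever <head> is removed.
            kill_tags.add("base")

    return remove_tags, kill_tags


# Before the patch, <base> appears in neither set and therefore survives.
_, kill_before = build_tag_sets(patched=False)

# After the patch, <base> is killed under the default configuration.
_, kill_after = build_tag_sets(patched=True)

print("base" in kill_before, "base" in kill_after)
```

Note that in this model, as in the description of the patch, <base> is only added when page structure cleaning is enabled, so a configuration that turns that option off would need to handle <base> separately.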
Therefore, the vulnerable function is Cleaner.__call__, because it contains the flawed sanitization logic that lets the <base> tag pass through. The user-facing function clean_html is the entry point that invokes this method with its default, insecure configuration.