The vulnerability exists in the NLTK downloader component, specifically in how it handles package metadata from remote XML index files. The root cause is a lack of input validation on the id and subdir attributes of a package defined in the XML index. An attacker can host a malicious XML file with path traversal sequences (e.g., ../) in these attributes.
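As a rough illustration of the missing check, the sketch below parses a small index and flags entries whose id or subdir is not a plain path component. The XML layout, the URLs, the allowed-character policy, and the helper names is_safe_component and unsafe_packages are assumptions made for this example; they are not NLTK's actual index format or API.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical index content; the real index layout may differ.
INDEX_XML = """
<packages>
  <package id="punkt" subdir="tokenizers" url="https://example.invalid/punkt.zip"/>
  <package id="../../evil" subdir="../outside" url="https://example.invalid/evil.zip"/>
</packages>
"""

# Assumed policy: one path component made of letters, digits, '_', '-' or '.'.
_SAFE_COMPONENT = re.compile(r"[A-Za-z0-9_.-]+")


def is_safe_component(value):
    """Return True if value is a single path component that cannot traverse."""
    if value in (".", ".."):
        return False
    return _SAFE_COMPONENT.fullmatch(value) is not None


def unsafe_packages(xml_text):
    """Return the ids of packages whose id or subdir fails the check."""
    root = ET.fromstring(xml_text)
    flagged = []
    for pkg in root.iter("package"):
        pkg_id = pkg.get("id", "")
        subdir = pkg.get("subdir", "")
        if not (is_safe_component(pkg_id) and is_safe_component(subdir)):
            flagged.append(pkg_id)
    return flagged
```

Rejecting any separator outright is deliberately strict; a real fix might need to allow nested subdirectories and would then have to validate each component.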
When a user attempts to download a package using nltk.downloader.Downloader.download(), the following occurs:
- The Package.fromxml method is called to parse the malicious XML, which in turn calls the Package.__init__ constructor.
- Inside Package.__init__, the unvalidated subdir and id attributes are used with os.path.join to create a filename attribute on the Package object. This filename now contains a path that traverses outside the intended download directory.
- The Downloader.download method eventually calls Downloader._download_package.
- In _download_package, the malicious info.subdir is used with os.makedirs to create directories at an arbitrary location. Subsequently, the malicious info.filename is joined with the base download directory to create a final filepath. This filepath is then used with open() to write a file, resulting in an arbitrary file write vulnerability.
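The path arithmetic behind these steps can be sketched without touching the file system. The snippet below assumes the filename is built roughly as os.path.join(subdir, id + extension) and then joined with the download directory, as described above; resolve_target and is_within are hypothetical helpers for this example, not part of NLTK.

```python
import os


def resolve_target(download_dir, subdir, pkg_id, ext=".zip"):
    """Build the final path the way the steps above describe (assumed pattern)."""
    filename = os.path.join(subdir, pkg_id + ext)
    # normpath collapses any ".." segments so the real destination is visible.
    return os.path.normpath(os.path.join(download_dir, filename))


def is_within(base_dir, target):
    """Return True if target lies inside base_dir after normalization."""
    base = os.path.abspath(base_dir)
    target = os.path.abspath(target)
    return os.path.commonpath([base, target]) == base


base = os.path.abspath("nltk_data")
benign = resolve_target(base, "tokenizers", "punkt")
hostile = resolve_target(base, "../outside", "evil")

print(is_within(base, benign))   # benign path stays under the download directory
print(is_within(base, hostile))  # traversal path escapes it
```

A containment check like is_within, applied before os.makedirs and open(), is one possible mitigation. It does not resolve symlinks; os.path.realpath could be substituted for os.path.abspath if that matters.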
The identified functions are the key components in this attack chain, from the initial parsing of malicious data to the final vulnerable file system operations.