| Package Name | Ecosystem | Vulnerable Versions | First Patched Version |
|---|---|---|---|
| llama-index | pip | < 0.12.41 | 0.12.41 |
| llama-index-readers-docugami | pip | < 0.3.1 | 0.3.1 |

The vulnerability is in the `DocugamiReader` class of the llama-index library, specifically in how it generates document chunk IDs. The `_build_framework_chunk` function, nested within the `load_data` method, derived each ID from an MD5 hash of the chunk's text content alone. As a result, two structurally different chunks with identical text produced the same ID. When such a collision occurred, one chunk overwrote the other in storage, resulting in data loss.

The consequences can be significant: important semantic or legal information can be lost from documents, the hierarchical relationship between parent and child chunks can break, and AI models relying on this data can produce inaccurate or "hallucinated" responses.

The patch hashes the chunk's XPath together with its text. Even if two chunks have identical text, their different structural locations (represented by their XPaths) yield different hashes, which prevents the collisions and the associated data loss. `_build_framework_chunk` is identified as the vulnerable function because it contains the flawed ID generation logic.
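
The collision and the fix described above can be illustrated with a minimal, self-contained sketch. This is not the library's actual code: the helper names `chunk_id_text_only` and `chunk_id_with_xpath`, the sample XPaths, the sample text, and the way the XPath and text are joined before hashing are all assumptions made for illustration. Only the general idea (MD5 over text alone versus MD5 over XPath plus text) comes from the advisory.

```python
import hashlib


def chunk_id_text_only(text: str) -> str:
    """Hypothetical stand-in for the vulnerable scheme: hash the text only."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def chunk_id_with_xpath(text: str, xpath: str) -> str:
    """Hypothetical stand-in for the patched scheme: hash XPath plus text.

    The newline separator is an assumption; the real patch may combine
    the two values differently.
    """
    material = f"{xpath}\n{text}"
    return hashlib.md5(material.encode("utf-8")).hexdigest()


# Two chunks at different structural locations with identical text
# (illustrative data, not taken from a real document).
chunks = [
    {"xpath": "/doc/section[1]/clause[2]", "text": "The tenant shall pay rent monthly."},
    {"xpath": "/doc/section[4]/clause[1]", "text": "The tenant shall pay rent monthly."},
]

# Keyed by text-only ID: the second chunk silently overwrites the first.
store_old = {chunk_id_text_only(c["text"]): c for c in chunks}

# Keyed by XPath-aware ID: both chunks are kept.
store_new = {chunk_id_with_xpath(c["text"], c["xpath"]): c for c in chunks}

print(len(store_old))  # 1
print(len(store_new))  # 2
```

In this sketch the ID serves only as a storage key, so including the XPath changes which chunks are considered distinct without requiring a different hash function.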