The vulnerability is a heap-based buffer overflow in the hashmap implementation used by Nokogiri's Gumbo HTML5 parser. The root cause is a type confusion in how strings were stored in the hashmap: the hashmap was configured to store items of a fixed size (e.g., the size of a pointer), but the functions in gumbo-parser/src/string_set.c passed variable-length strings directly, as if the string data were the items themselves.
When a function like gumbo_string_set_insert was called with a string longer than the hashmap's item size, the underlying hashmap_set (and its internal counterpart hashmap_set_with_hash) would attempt to copy the oversized string into a smaller, fixed-size buffer on the heap, causing an overflow.
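To make the size mismatch concrete, here is a minimal sketch. It assumes the item size equals sizeof(char *), which the description above gives only as an example. The helper bytes_past_slot and the ITEM_SIZE macro are hypothetical names for illustration and are not part of Gumbo or its hashmap. The sketch only computes how far a string would extend past one fixed-size slot; it does not perform any out-of-bounds access.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: one hashmap item is pointer-sized. */
#define ITEM_SIZE sizeof(char *)

/* Hypothetical helper: number of bytes of str (including its NUL
 * terminator) that would not fit in a single fixed-size item slot.
 * Returns 0 when the whole string fits. */
size_t bytes_past_slot(const char *str) {
    size_t needed = strlen(str) + 1;
    return needed > ITEM_SIZE ? needed - ITEM_SIZE : 0;
}
```

Any string for which this returns a nonzero value is longer than the slot it was being treated as, which is the condition the paragraph above describes.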
The patch addresses this by changing the logic to store pointers to strings (char**) in the hashmap instead of the string data itself. This is reflected in the changes to gumbo_string_set_insert and gumbo_string_set_contains, which now pass the address of the string pointer (&str). Consequently, the helper functions string_compare and string_hash were updated to dereference the pointers before using the string values.
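The pointer-storing pattern can be sketched with a toy fixed-size-item set. This is not the Gumbo or hashmap source: toy_set, the toy_* functions, the capacity, and the callback signatures are all assumptions made for illustration. What it shows is the idea described above: the caller passes &str, the container copies exactly one pointer's worth of bytes, and the compare and hash callbacks dereference that pointer before touching the string.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_CAPACITY 16  /* hypothetical fixed capacity */

/* Toy container: every slot holds exactly one pointer-sized item. */
typedef struct {
    const char *slots[TOY_CAPACITY];
    int used[TOY_CAPACITY];
} toy_set;

/* Items are pointers to string pointers, so dereference before comparing. */
static int toy_string_compare(const void *a, const void *b) {
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strcmp(sa, sb);
}

/* FNV-1a over the pointed-to string, not over the pointer bytes. */
static uint64_t toy_string_hash(const void *item) {
    const char *s = *(const char *const *)item;
    uint64_t h = 14695981039346656037ULL;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Copies exactly sizeof(const char *) bytes from item, never more. */
static int toy_set_insert_item(toy_set *set, const void *item) {
    size_t start = (size_t)(toy_string_hash(item) % TOY_CAPACITY);
    for (size_t probe = 0; probe < TOY_CAPACITY; probe++) {
        size_t idx = (start + probe) % TOY_CAPACITY;
        if (!set->used[idx]) {
            memcpy(&set->slots[idx], item, sizeof(const char *));
            set->used[idx] = 1;
            return 1;
        }
        if (toy_string_compare(&set->slots[idx], item) == 0) {
            return 1;  /* already present */
        }
    }
    return 0;  /* set is full */
}

static int toy_set_contains_item(const toy_set *set, const void *item) {
    size_t start = (size_t)(toy_string_hash(item) % TOY_CAPACITY);
    for (size_t probe = 0; probe < TOY_CAPACITY; probe++) {
        size_t idx = (start + probe) % TOY_CAPACITY;
        if (!set->used[idx]) {
            return 0;  /* hit an empty slot: not present */
        }
        if (toy_string_compare(&set->slots[idx], item) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Wrappers pass the address of the string pointer (&str), not the string. */
int toy_string_set_insert(toy_set *set, const char *str) {
    return toy_set_insert_item(set, &str);
}

int toy_string_set_contains(const toy_set *set, const char *str) {
    return toy_set_contains_item(set, &str);
}
```

Because only the pointer is copied, the length of the string no longer matters to the container. The trade-off in this sketch is that the set does not own the string data, so each inserted string must outlive the set.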
Therefore, the vulnerable functions are those in string_set.c that passed raw string data to the hashmap API where a fixed-size item was expected, namely gumbo_string_set_insert and gumbo_string_set_contains, which led to the memory corruption.