Description
The RecursiveUrlLoader class in @langchain/community is a web crawler that recursively follows links from a starting URL. Its preventOutside option (enabled by default) is intended to restrict crawling to the same site as the base URL.
The implementation used String.startsWith() to compare URLs, which does not perform semantic URL validation. An attacker who controls content on a crawled page could include links to domains that share a string prefix with the target (e.g., https://example.com.attacker.com passes a startsWith check against https://example.com), causing the crawler to follow links to attacker-controlled or internal infrastructure.
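The bypass can be reproduced with plain string comparison (the attacker domain below is illustrative):

```javascript
// Naive prefix check, as the vulnerable implementation performed it:
// the URL is treated as an opaque string, not parsed semantically.
const base = "https://example.com";

// Attacker-controlled host that merely shares a string prefix with the base
const malicious = "https://example.com.attacker.com/page";

// Passes the prefix check even though the host is attacker.com's subdomain
console.log(malicious.startsWith(base)); // true
```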
Additionally, the crawler performed no validation against private or reserved IP addresses. A crawled page could include links targeting cloud metadata services (169.254.169.254), localhost, or RFC 1918 addresses, and the crawler would fetch them without restriction.
Impact
An attacker who can influence the content of a page being crawled (e.g., by placing a link on a public-facing page, forum, or user-generated content) could cause the crawler to:
- Fetch cloud instance metadata (AWS, GCP, Azure), potentially exposing IAM credentials and session tokens
- Access internal services on private networks (10.x, 172.16.x, 192.168.x)
- Connect to localhost services
- Exfiltrate response data via attacker-controlled redirect chains
This is exploitable in any environment where RecursiveUrlLoader runs on infrastructure with access to cloud metadata or internal services — which includes most cloud-hosted deployments.
Resolution
Two changes were made:
- Origin comparison replaced. The startsWith check was replaced with a strict origin comparison using the URL API. This correctly validates scheme, hostname, and port as a unit, preventing subdomain-based bypasses.