Web Content (Crawler) Connector Troubleshooting Guide
Tips for troubleshooting issues that can occur when you ingest website content into Data Cloud using the Web Content (Crawler) connector.
When you create a new connection, Data Cloud performs a validation test.
The test connection fails.
Check whether the website you're ingesting blocks the connector. A website can block the connector in two ways.
- User-Agent Blocking: By default, the connector uses the Mozilla/5.0 user agent. Some sites may reject it. If your connection test fails, try entering a different user agent in the User Agent field.
- IP Blocking: Sites that use services like Cloudflare may block unauthorized IP addresses. Confirm with your IT or security team that the website doesn’t block any of the addresses listed in IP addresses used by Data Cloud services.
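To check for user-agent blocking before you change the connection settings, you can compare how the site responds to different user agents. The following is a minimal sketch, not part of the connector. The function names, the second user-agent string, and the set of status codes treated as a block are illustrative assumptions. It runs from your own machine, so it can reveal user-agent blocking but tells you nothing about IP blocking of Data Cloud addresses.

```python
import urllib.error
import urllib.request

# Illustrative values: the first is the connector default named in this guide,
# the second is a hypothetical alternative to try in the User Agent field.
CANDIDATE_USER_AGENTS = [
    "Mozilla/5.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]

# Assumption: these status codes are treated as a sign that the request was rejected.
BLOCK_STATUSES = {401, 403, 406, 429}


def build_request(url, user_agent):
    """Create a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url, user_agent, timeout=10):
    """Return the HTTP status code the site sends back for this user agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the status code is still useful here.
        return err.code


def looks_blocked(status):
    """Heuristic: True if the status code suggests the request was rejected."""
    return status in BLOCK_STATUSES


# Example usage (performs network requests; replace the URL with your site):
# for ua in CANDIDATE_USER_AGENTS:
#     status = probe("https://www.example.com/", ua)
#     print(ua, status, "blocked?" if looks_blocked(status) else "ok")
```

If one user agent gets a 200 response and another gets a 403, the site is probably filtering by user agent, and the accepted value is a candidate for the User Agent field.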
When you ingest data using the Web Content (Crawler) connector, Data Cloud syncs it.
The sync fails.
Review the troubleshooting guide for common errors and solutions.
When you ingest content from a website using the Web Content (Crawler) connector, your records can be viewed on the Data Stream tab.
No records were ingested.
- Check the format of your ingested files. The Web Content (Crawler) connector currently supports only HTML and PDF files. Data Cloud skips and doesn’t index other file formats.
- Dynamic Content and JavaScript Rendering: The Web Content (Crawler) connector doesn’t support content that is dynamically rendered via JavaScript at runtime. To validate whether this is the issue, inspect the website page source.
- Right-click on the page.
- From the context menu, click View Page Source (on Mac, press Command+Option+U).
- Check if the expected content is present. If the content is visible only after the page loads in a browser and is missing from the raw HTML source, it’s likely rendered by JavaScript and can’t be ingested.
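Alternatively, disable JavaScript in the browser developer tools to see the page as the connector does.
- Right-click on the page.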
- From the context menu, click Inspect.
- Click the Settings icon in the top-right corner of the developer tools panel.
- Scroll down to Debugger and check Disable JavaScript.
- Refresh the page to view what the connector can ingest.
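The two checks above (file format, and whether the content exists without JavaScript) can also be scripted when you have many pages to verify. The following is a minimal sketch using only the Python standard library. The function names and the mapping of HTML and PDF to the `text/html` and `application/pdf` media types are assumptions for illustration, and the connector's own parsing may differ.

```python
from html.parser import HTMLParser

# Assumption: HTML and PDF files are served with these media types.
SUPPORTED_TYPES = ("text/html", "application/pdf")


def is_supported_format(content_type):
    """True if a Content-Type header value names HTML or PDF."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in SUPPORTED_TYPES


class _StaticTextCollector(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def static_text(raw_html):
    """Return the whitespace-normalized text present in the raw HTML source."""
    parser = _StaticTextCollector()
    parser.feed(raw_html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def is_in_static_html(raw_html, expected):
    """True if the expected text is present without running any JavaScript."""
    needle = " ".join(expected.split()).lower()
    return needle in static_text(raw_html).lower()
```

Pass the page source (for example, the text shown by View Page Source) as `raw_html` and a sentence you expect to be ingested as `expected`. A `False` result suggests the content is injected by JavaScript at runtime.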