Web Content (Crawler) Connector Limitations

Learn about the functional limitations of the Web Content (Crawler) connector that affect which content it ingests and how it crawls your site.

  • HTML and PDF pages are supported.
  • Supported image formats: .jpg, .jpeg, .png.
  • Video and audio files (.mp3, .mp4, etc.) aren’t supported.
  • The Web Content (Crawler) connector identifies all ingested content as English and doesn’t provide any other language metadata.
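
As a rough pre-check, the format limits listed above can be applied to a list of URLs before you configure a crawl. The sketch below is illustrative only: the function name classify_url is hypothetical, and judging a file by its URL extension is an assumption, since the connector may decide based on the response's content type instead. URLs with no extension or an unlisted one are reported as "unknown" rather than guessed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats named in the list above.
SUPPORTED_EXTENSIONS = {".html", ".pdf", ".jpg", ".jpeg", ".png"}
# Only the two media extensions named explicitly above; the list says "etc."
UNSUPPORTED_EXTENSIONS = {".mp3", ".mp4"}


def classify_url(url):
    """Return 'supported', 'unsupported', or 'unknown' based on the URL's file extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in SUPPORTED_EXTENSIONS:
        return "supported"
    if suffix in UNSUPPORTED_EXTENSIONS:
        return "unsupported"
    # No extension, or one the list above doesn't mention.
    return "unknown"
```

For example, classify_url("https://example.com/media/intro.mp4") returns "unsupported", while a URL such as "https://example.com/about" returns "unknown" and needs a manual check.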

The Web Content (Crawler) connector doesn’t support content that JavaScript renders dynamically at runtime. Content rendered this way isn’t ingested.
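
One way to estimate whether a page is affected by the JavaScript limitation above is to check if a phrase you expect to be ingested is present in the raw HTML, before any script runs. The following is a minimal sketch using Python's standard library. The names StaticTextExtractor and appears_without_javascript are hypothetical, and this only approximates what a non-rendering crawler sees; it is not the connector's actual parsing logic.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects text present in the raw HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text visible in the markup without executing any JavaScript."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def appears_without_javascript(html, phrase):
    """True if the phrase is in the static markup, so it doesn't depend on JavaScript."""
    return phrase.lower() in static_text(html).lower()
```

If a phrase that is visible in the browser is missing from the static markup, that content is likely injected at runtime and falls under this limitation.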

The connector’s rate limit is 5 records (pages) per second. If your website can’t serve requests at this rate, the connector may fail.
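
To get a rough sense of whether a site tolerates this rate, you can request a handful of its pages at about 5 per second and look for non-200 responses such as 429 or 503. The sketch below is an assumption-laden probe, not a reproduction of the connector's behavior. The names fetch_status and probe_rate are hypothetical, the requests are sequential, and the fixed pause between them means the real rate stays at or below 5 per second.

```python
import time
import urllib.error
import urllib.request

CRAWL_RATE_PER_SECOND = 5  # rate stated in the limitation above


def fetch_status(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error statuses (for example 429 or 503) are reported, not raised.
        return err.code


def probe_rate(urls, fetch=fetch_status, rate=CRAWL_RATE_PER_SECOND, sleep=time.sleep):
    """Request each URL, pausing 1/rate seconds after each; return {url: status} for non-200s."""
    interval = 1.0 / rate
    failures = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            failures[url] = status
        sleep(interval)
    return failures
```

An empty result from probe_rate suggests the sampled pages were served without throttling at this pace. Run it only against sites you own or are permitted to test.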

The site being ingested must allow requests from Apache HttpClient.
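
If a firewall or bot filter on your site blocks requests by user agent, a quick check is to send a request that identifies itself the way Apache HttpClient does by default. The sketch below assumes the connector uses a User-Agent of the form "Apache-HttpClient/<version> (Java/<version>)"; the exact string the connector sends isn't stated here, and the version numbers are placeholders. The names build_probe_request and is_allowed are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: Apache HttpClient's default User-Agent format, with placeholder versions.
ASSUMED_USER_AGENT = "Apache-HttpClient/4.5.13 (Java/11)"


def build_probe_request(url, user_agent=ASSUMED_USER_AGENT):
    """Build a GET request that carries the assumed Apache HttpClient User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def is_allowed(url, opener=urllib.request.urlopen):
    """True if the site answers the probe with a 2xx status, False on an HTTP error status."""
    request = build_probe_request(url)
    try:
        with opener(request, timeout=10) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError:
        # For example 403 Forbidden from a user-agent filter.
        return False
```

A False result for a page that loads normally in a browser suggests that requests are being filtered by user agent and that the site's rules need an exception.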

Learn about Data Cloud’s general limitations, which may also affect this connector.