Prepare Your Web Content Connection (Crawler)
To set up a website content connection with Data Cloud, you need to provide some details. Gather this information before you get started.
Prepare the authentication details for the website you want to ingest.
- Website: Identify the website you want to ingest.
- Authentication Method: Determine the website authentication method. If your website is accessible to anyone, you don’t need to provide any credentials. However, if your website uses basic authentication, get a username and password, and ensure that the user associated with the username has the necessary access to the content you want to ingest, as the connector gets the same access permissions as this user.
- User Agent: If your website requires a specific user agent, prepare the required string.
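Before you enter these details, it can help to confirm that the credentials and user agent string actually grant access to the content. The following is a minimal sketch of the request headers that a basic authentication check would send. The function name `build_crawl_headers` and the sample values are hypothetical, and the sketch does not describe how the connector itself builds its requests.

```python
import base64


def build_crawl_headers(username=None, password=None, user_agent=None):
    """Return HTTP headers for a manual access check (illustrative only)."""
    headers = {}
    # Basic authentication sends "username:password", Base64-encoded.
    if username is not None and password is not None:
        credentials = f"{username}:{password}".encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(credentials).decode("ascii")
    # Only set a user agent when the website requires a specific one.
    if user_agent:
        headers["User-Agent"] = user_agent
    return headers


# Hypothetical values; replace them with the details you gathered.
headers = build_crawl_headers("crawler_user", "example-password", "AcmeCrawler/1.0")
print(sorted(headers))
```

You can pass the resulting dictionary to an HTTP client of your choice, for example as the `headers` argument of `urllib.request.Request`, and check that a protected page returns the expected content.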
Decide what content to include in the ingestion.
- Starting URL: Decide on a starting URL for your website ingestion, for example, your site’s homepage: https://acme.com.
- Crawl Depth: Decide how many page levels from the starting URL to include in the ingestion. For example, if your starting URL is the website’s homepage and the crawl depth is 1, the homepage and the pages it links to directly are ingested. If the crawl depth is 2, then the homepage, the pages it links to, and the pages those pages link to are ingested. Links outside of the website’s domain are not ingested.
- Included Pages: To ingest specific web pages, prepare a regular expression that matches the pages that you want to ingest. For example, if you enter .*/content/.*, all pages under the starting URL that include /content/ in their URL path are included. If you don’t specify any included pages, all pages under the sitemap URL are included.
Example
Instead of manually listing every URL you want to include in the ingestion, you can use a regular expression (regex) that includes URLs with specific text or structure patterns.
- For URLs containing the text “support-content”, use the regular expression .*support-content.* to include any URL containing the specified text, regardless of what comes before or after it.
- For URLs containing the path “/support/content/”, use the regular expression .*\/support\/content\/.* to include any URL containing the specified path, regardless of what comes before or after it.
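You can try the two patterns above against a few sample URLs before you enter them. This is a minimal sketch using Python's `re` module. It assumes the pattern must match the whole URL, which is consistent with the leading and trailing `.*` in the examples, but the connector's regex engine may behave differently. The sample URLs and the helper name `is_included` are hypothetical.

```python
import re

# The two example patterns from the text above.
INCLUDE_PATTERNS = {
    "text": r".*support-content.*",
    "path": r".*\/support\/content\/.*",
}


def is_included(url, pattern):
    """Return True if the whole URL matches the pattern (assumed semantics)."""
    return re.fullmatch(pattern, url) is not None


# Hypothetical URLs for a quick check.
sample_urls = [
    "https://acme.com/support-content/faq",
    "https://acme.com/support/content/setup",
    "https://acme.com/blog/news",
]

for url in sample_urls:
    matched = [name for name, pattern in INCLUDE_PATTERNS.items() if is_included(url, pattern)]
    print(url, "->", matched or "not included")
```

Under this assumption, the first URL matches only the text pattern, the second matches only the path pattern, and the third matches neither.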
- Included Page Elements: To ingest specific page elements, prepare a jQuery selector to specify which HTML element to include. By default, the body page element, representing the content of the HTML page, is included.
To extract a jQuery selector, use your browser’s dev tools to inspect a page, find the HTML element you want to ingest, right-click it, and then from the menu, click Copy > Copy selector.
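To choose a crawl depth, it can help to see how the number of included pages grows with each level. The following sketch walks a small, hypothetical link graph breadth-first, following the crawl depth description earlier in this section. The graph, the function name `pages_within_depth`, and the exact host comparison used to skip links outside the domain are assumptions for illustration, not the connector's actual implementation.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical site structure: each page maps to the pages it links to.
LINKS = {
    "https://acme.com": [
        "https://acme.com/products",
        "https://acme.com/support",
        "https://partner.example/offers",  # outside the website's domain
    ],
    "https://acme.com/products": ["https://acme.com/products/widget"],
    "https://acme.com/support": ["https://acme.com/support/content/setup"],
    "https://acme.com/products/widget": ["https://acme.com/products/widget/specs"],
}


def pages_within_depth(start_url, crawl_depth, links):
    """Return the set of pages reachable within crawl_depth link levels."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == crawl_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            # Skip pages already collected and links outside the domain.
            if target in seen or urlparse(target).netloc != domain:
                continue
            seen.add(target)
            queue.append((target, depth + 1))
    return seen


for depth in (1, 2):
    print(depth, sorted(pages_within_depth("https://acme.com", depth, LINKS)))
```

In this sample graph, a depth of 1 covers the homepage and the two pages it links to on the same domain, and a depth of 2 adds the pages those two pages link to. The partner link is skipped at every depth.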