Prepare Your Web Content Connection (Sitemap)
To set up a website content connection with Data Cloud, you need to provide some details. Gather this information before you get started.
Prepare the authentication details for the website you want to ingest.
- Website: Identify the website you want to ingest.
- Authentication Method: Determine the website authentication method. If your website is accessible to anyone, you don’t need to provide any credentials. If your website uses basic authentication, get a username and password. Make sure that the user associated with the username has access to the content you want to ingest, because the connector gets the same access permissions as this user.
- User Agent: If your website requires a specific user agent, prepare the required string.
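Before you enter these details, it can help to confirm that the credentials and user agent string actually reach your content. The following is a minimal sketch that uses only the Python standard library, not part of Data Cloud. The helper names build_request and check_access, and the sample credentials and user agent in the usage note, are hypothetical.

```python
import base64
import urllib.request


def build_request(url, username=None, password=None, user_agent=None):
    """Build a GET request that mirrors the prepared connection details."""
    request = urllib.request.Request(url)
    if username is not None and password is not None:
        # HTTP basic authentication sends base64("username:password").
        raw = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    if user_agent:
        # Only needed when the website requires a specific user agent.
        request.add_header("User-Agent", user_agent)
    return request


def check_access(request, timeout=10):
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for responses such as 401 or 403,
    which usually means the credentials or permissions need another look.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, check_access(build_request("https://help.yourdomain.com/sitemap.xml", "ingest-user", "secret", "MyCrawler/1.0")) returns the status code if the site accepts the request. For a publicly accessible website, omit the username and password.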
Decide what content to include in the ingestion.
- Sitemap URL: Get a valid URL of an XML sitemap ending with /sitemap.xml (for example: https://help.yourdomain.com/sitemap.xml). Data Cloud also supports URLs for Gzip-compressed XML sitemaps.
- Ingested Pages: To ingest specific web pages, prepare a regular expression that matches the pages that you want to ingest. For example, if you enter .*/content/.*, all pages under the sitemap URL that include /content/ in their URL path are included. If you don’t specify any included pages, all pages under the sitemap URL are included.
Example
Instead of manually listing every URL you want to include in the ingestion, you can use a regular expression (regex) that includes URLs with specific text or structure patterns.
- For URLs containing the text “support-content”, use the regular expression .*support-content.* to include any URLs containing the specified text, regardless of what comes before or after it.
- For URLs containing the path “/support/content/”, use the regular expression .*\/support\/content\/.* to include any URLs containing the specified path, regardless of what comes before or after it.
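To try out the two expressions from this example before you enter them, you can test them against a few sample URLs. This sketch assumes that the pattern must match the entire URL, which is consistent with the leading and trailing .* in the examples but isn't confirmed here. The is_included helper and the sample URLs are hypothetical.

```python
import re

# The two regular expressions from the example above.
TEXT_PATTERN = r".*support-content.*"
PATH_PATTERN = r".*\/support\/content\/.*"


def is_included(url, pattern):
    """Return True if the whole URL matches the pattern (assumed behavior)."""
    return re.fullmatch(pattern, url) is not None


if __name__ == "__main__":
    samples = [
        "https://help.yourdomain.com/support-content/faq",
        "https://help.yourdomain.com/support/content/getting-started",
        "https://help.yourdomain.com/blog/news",
    ]
    for url in samples:
        print(url, is_included(url, TEXT_PATTERN), is_included(url, PATH_PATTERN))
```

Note that the two patterns aren't interchangeable: a URL containing /support-content/ doesn't contain the path /support/content/, and the reverse is also true.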
- Included Page Elements: To ingest specific page elements, prepare a jQuery selector to specify which HTML element to include. By default, the body page element, representing the content of the HTML page, is included.
To extract a jQuery selector, use your browser’s dev tools to inspect a page, find the HTML element you want to ingest, right-click it, and then from the menu, click Copy > Copy selector.
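To preview which pages a given sitemap and included-pages expression would select, you can parse the sitemap yourself. This is a rough sketch that uses only the Python standard library. It illustrates the sitemap format and doesn't describe how Data Cloud processes it. The helper names sitemap_urls and select_pages are hypothetical, and the sketch assumes that the sitemap uses the standard sitemaps.org namespace and that the expression must match the entire URL.

```python
import gzip
import re
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(raw):
    """Return the page URLs listed in a sitemap, given its raw bytes.

    Accepts plain XML or Gzip-compressed XML; Gzip data is detected by
    its two-byte magic number.
    """
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def select_pages(urls, include_pattern=None):
    """Keep the URLs that match the pattern; keep all if no pattern is given."""
    if not include_pattern:
        return list(urls)
    return [url for url in urls if re.fullmatch(include_pattern, url)]
```

You can pass sitemap_urls the bytes of a sitemap file that you downloaded from your website, and then call select_pages with the regular expression you plan to enter.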