Create an Unstructured Data Connection from the Web Content Sitemap
Create an unstructured data lake object (UDLO) in Data Cloud to ingest your organization’s marketing, ecommerce, documentation, or other website content.
See the Unstructured Data Reference for a list of supported file formats.
User Permissions Needed | |
---|---|
To connect unstructured data: | Data Cloud Architect |
Before you begin:
- Make sure you've set up a Web Content (Sitemap) connection.
-
From App Launcher, select Data Cloud.
-
Click Data Lake Objects and then click New.
-
To choose a method for creating your data lake object, select From External Files, and click Next.
-
Choose the Web Content (Sitemap) connector, and click Next.
-
To specify what content is ingested into Data Cloud, from the Select Connection dropdown list, select the Web Content (Sitemap) connection you previously created. Data Cloud auto-populates the source based on the connection that you select.
-
Configure what content is ingested from the website you connect to Data Cloud.
This connector currently supports only HTML and PDF formats. Data Cloud skips other file types and doesn't ingest them.
- Included Web Pages: Enter the regular expression you prepared to match the pages you want to ingest. For example, if you enter .*/content/.*, all pages under the starting URL that include /content/ in their URL path are included. Leave this field empty to include all pages under the starting URL.
Example
Instead of manually listing every URL you want to include in the ingestion, you can use a regular expression (regex) that includes URLs with specific text or structure patterns.
-
For URLs containing the text “support-content”, use the regular expression .*support-content.* to include any URL that contains the specified text, regardless of what comes before or after it.
-
For URLs containing the path “/support/content/”, use the regular expression .*\/support\/content\/.* to include any URL that contains the specified path, regardless of what comes before or after it.
- Included Page Element: Enter the jQuery selector you prepared to specify which HTML element to include. Add the element without any surrounding parentheses or single quotes. The first matching HTML element and its child elements are included. You can only enter one jQuery selector. By default, the body page element, representing the content of the HTML page, is included.
To extract a jQuery selector, use your browser’s dev tools to inspect a page, find the HTML element you want to ingest, right-click it, and then from the menu, click Copy > Copy selector.
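Before you enter a pattern in the Included Web Pages field, it can help to preview which sitemap URLs the pattern would select. The following Python sketch shows one way to do that locally. It assumes that a pattern must match the entire URL, which is consistent with the leading and trailing .* in the examples above but isn't confirmed by this topic. The sample sitemap, the example.com URLs, and the function names sitemap_urls and preview_included are hypothetical and aren't part of Data Cloud.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the public sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/support/content/reset-password.html</loc></url>
  <url><loc>https://www.example.com/support-content/faq.html</loc></url>
  <url><loc>https://www.example.com/blog/release-notes.html</loc></url>
</urlset>
"""


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def preview_included(urls: list[str], pattern: str) -> list[str]:
    """Return the URLs that the pattern matches in full.

    An empty pattern keeps every URL, mirroring an empty
    Included Web Pages field. Whole-URL matching is an assumption.
    """
    if not pattern:
        return list(urls)
    compiled = re.compile(pattern)
    return [url for url in urls if compiled.fullmatch(url)]


if __name__ == "__main__":
    urls = sitemap_urls(SAMPLE_SITEMAP)
    for pattern in (r".*support-content.*", r".*\/support\/content\/.*", ""):
        print(repr(pattern), preview_included(urls, pattern))
```

A local preview like this only approximates the connector's behavior. Confirm the result by checking which pages appear after the first ingestion.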
-
Click Next. The connector ingests the data every 24 hours. To view the ingestion status, go to your Data Stream status.
-
To set up your unstructured data lake object (UDLO) and its associated unstructured data model object (UDMO), add an Object Name and an Object API Name for the UDLO. See Data Lake Object Naming Standards.
-
Map the UDLO to a UDMO.
- To create a new UDMO, click New. Then, from the Data Space dropdown list, select the data space in which to create it. Add an Object Name and an Object API Name for the UDMO. See Data Lake Object Naming Standards.
- To use an existing UDMO, click Existing, and then select a data space and the UDMO that you want to use.
-
Optionally, leave the checkbox selected to create a search index configuration for the UDMO. The system defaults automatically select text fields and a chunking strategy for each field. To create a search index configuration later instead, deselect the checkbox.
-
Click Next, or if you created a search index configuration, review the details, and save your work.
The data stream now ingests your website data into an unstructured data lake object (UDLO) and maps it to an unstructured data model object (UDMO). If you created a search index configuration, a search index is built from this UDMO, and you can use it to ground AI-generated responses.