Create an Unstructured Data Connection from the Web Content Sitemap

Create an unstructured data lake object (UDLO) in Data 360 to ingest your organization’s marketing, ecommerce, documentation, or other website content.

See Unstructured Data File Formats and Connectors for a list of supported file formats.

User Permissions Needed
To connect unstructured data: Data Cloud Architect

Before you begin:

  • Make sure you’ve set up a Web Content (Sitemap) connection.

  1. From App Launcher, select Data Cloud.

  2. Click Data Lake Objects and then click New.

  3. To choose a method for creating your data lake object, select From External Files, and click Next.

  4. Choose the Web Content (Sitemap) connector, and click Next.

  5. To specify what content Data 360 ingests, from the Select Connection dropdown list, select the Web Content (Sitemap) connection you previously created. Data 360 populates the source based on the connection that you select.

  6. Configure what content Data 360 ingests from the website you connect.

    This connector currently supports HTML and PDF files and the image formats .jpg, .jpeg, and .png. Data 360 skips all other file types and doesn’t ingest them.

    • Included Web Pages: Enter the regular expression you prepared to match the pages you want to ingest. For example, if you enter .*/content/.*, all pages under the starting URL that include /content/ in their URL path are included. Leave this field empty to include all pages under the starting URL. For additional examples, see Regular Expression Examples.

    • Included Page Element: Enter the jQuery selector you prepared to specify which HTML element to include. Add the element without any surrounding parentheses or single quotes. The first matching HTML element and its child elements are included. You can only enter one jQuery selector. By default, the body page element, representing the content of the HTML page, is included.

      To extract a jQuery selector, use your browser’s dev tools to inspect a page, find the HTML element you want to ingest, right-click it, and then from the menu, click Copy > Copy selector.

    • Select Ingest data from external domains if you want to include content that is linked from the site you are crawling. This can be useful if your documentation site has a link to your corporate website. In this case, your corporate content would be ingested as well.

      It is your obligation to ensure that you have the rights to the data collected using this feature. Salesforce disclaims all liability with respect to such data collected.

    • Included Website URL Paths: When the connector runs, the ingestion process confirms that the links in the sitemap match the sitemap domain. Some URLs in your sitemap can have related sub-URLs that you want to keep, such as language versions, pagination, or filtered pages.
      To include all associated sub-URLs, select Primary and sub-paths. To exclude them and ingest only the main URL, select Primary path only.

    • Select Remove URL Parameters to strip query parameters from links before they're ingested. This action removes the URL text that follows the ?.

  7. To ingest only files that were added or changed from a specific date forward:

    • In the Content Filtering Details section, use the calendar picker to select a date. You can’t select a range, only a start date. Data 360 ingests all files added or changed from this date forward.
  8. In the Page Elements section, select the HTML elements to include in the ingestion, in addition to the <body> element. You can select none, one, or multiple elements. When the connector runs, it ingests the content located inside the selected HTML elements. If you don’t select any elements, the <body> element is ingested by default.

  9. Click Next. The connector ingests the data every 24 hours. To view the ingestion status, go to your Data Stream status.

  10. To set up your unstructured data lake object (UDLO) and its associated data model object, add an Object Name and an Object API Name for the UDLO. See Data Lake Object Naming Standards.

  11. Map the UDLO to an unstructured data model object (UDMO).

    • To create a new UDMO, click New. Then select from the Data Space dropdown list a data space in which to create it. Add an Object Name and an Object API Name for the UDMO. See Data Lake Object Naming Standards.
    • To use an existing UDMO, click Existing, select a data space, and then select the UDMO from the list.
  12. Optionally, leave the checkbox selected to create a search index configuration for the UDMO. The configuration uses system defaults that automatically select text fields and a chunking strategy for each field. If you deselect the checkbox, you can create a search index configuration later.

  13. Select the checkbox to enable Content Harmonization for the UDLO. Leave it deselected to enable content harmonization later.

    If you enable content harmonization now, AI Enrichments (Summary and Q&A) are disabled by default. To enable AI Enrichments, create a Harmonization Configuration and enable Einstein.

    When you enable content harmonization, you enable collection of Content Viewer engagement data.

  14. Click Next, or if you created a search index configuration, review the details, and save your work.

The data stream now ingests your website data into an unstructured data lake object and maps it to an unstructured data model object (UDMO). A search index is created from this UDMO, and you can use it to ground AI-generated responses. For tips on tracking your connector’s ingestion progress, see Monitoring Guidelines.
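
As a rough illustration of the start-date filter in step 7, the sketch below lists the sitemap URLs whose <lastmod> value falls on or after a chosen date. It assumes a sitemap that follows the sitemaps.org protocol and carries <lastmod> values. How Data 360 itself determines when a page was added or changed isn't described here, so treat this only as a way to preview which URLs might qualify. The sample sitemap, its URLs, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap, used only for illustration.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/content/a.html</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://www.example.com/content/b.html</loc><lastmod>2024-06-01</lastmod></url>
  <url><loc>https://www.example.com/content/c.html</loc></url>
</urlset>
""".strip()


def urls_changed_since(sitemap_xml: str, start: date) -> list[str]:
    """Return <loc> values whose <lastmod> is on or after the start date.

    Assumption: entries without <lastmod> are skipped, because this sketch
    has no other way to tell when they changed.
    """
    root = ET.fromstring(sitemap_xml)
    selected = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None or lastmod is None:
            continue
        # Assumes <lastmod> starts with a full YYYY-MM-DD date;
        # any time-of-day suffix is ignored.
        changed = date.fromisoformat(lastmod.strip()[:10])
        if changed >= start:
            selected.append(loc.strip())
    return selected


recent = urls_changed_since(SAMPLE_SITEMAP, date(2024, 3, 1))
print(recent)
```

Like the calendar picker, the function takes only a start date and no end date, so every entry from that date forward is returned.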

Regular Expression Examples

Instead of manually listing every URL you want to include in the ingestion, you can use a regular expression (regex) that includes URLs with specific text or structure patterns.

  • For URLs containing the text “support-content”, use the regular expression .*support-content.*. This includes any URLs containing the specified text, regardless of what comes before or after it.
  • For URLs containing the path “/support/content/”, use the regular expression .*\/support\/content\/.*. This includes any URLs containing the specified path, regardless of what comes before or after it.
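
Before you enter a pattern in the Included Web Pages field, it can help to try it against a few URLs locally. The sketch below checks the two patterns above with Python's re module. The sample URLs and the helper name are hypothetical, and the sketch assumes the pattern must match the whole URL; because both patterns begin and end with .*, a substring search gives the same result. Python's regex dialect might differ in small ways from the one Data 360 uses.

```python
import re

# The two patterns from the examples above, written as Python raw strings.
TEXT_PATTERN = re.compile(r".*support-content.*")
PATH_PATTERN = re.compile(r".*\/support\/content\/.*")

# Hypothetical URLs, used only for illustration.
SAMPLE_URLS = [
    "https://www.example.com/support-content/faq.html",
    "https://www.example.com/en/support/content/install.html",
    "https://www.example.com/blog/news.html",
]


def matching_urls(pattern: re.Pattern, urls: list[str]) -> list[str]:
    """Return the URLs that the pattern matches in full."""
    return [url for url in urls if pattern.fullmatch(url)]


text_matches = matching_urls(TEXT_PATTERN, SAMPLE_URLS)
path_matches = matching_urls(PATH_PATTERN, SAMPLE_URLS)
print(text_matches)
print(path_matches)
```

With these samples, the first pattern selects only the support-content URL and the second selects only the /support/content/ URL. The blog URL matches neither.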

After you create the UDLO, you can edit its settings. Each connector has different settings available.

  1. From the Data Cloud App in the App Launcher, click Data Stream.
  2. Find the row that has the Web Content (Sitemap) data stream.
  3. At the end of the row, expand the menu and click Edit.
  4. Change any of the available settings. See the list below.
  5. Save your changes.

For this connector, you can edit the following settings.

  • Data Stream Name: Choose a different name (see step 10 above).
  • Refresh Mode: This connector only uses Upsert. There is no need to change it.
  • Included Web Pages: Enter a different regular expression (see step 6 above).
  • Included Page Element: Enter a different jQuery selector (see step 6 above).
  • Ingest Data from External Domains: Select to include content that is linked from the site you are crawling, or deselect to exclude it (see step 6 above).
  • Included Website URL Paths: Choose a different option (see step 6 above).
  • Remove URL Parameters: Select to strip query parameters from links before they're ingested (see step 6 above).
  • Page Elements: Select different HTML elements (see step 8 above).
  • Schedule: Set the time interval for refresh.
  • Refresh Immediately: Start refresh now.
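
To picture how the Ingest Data from External Domains and Remove URL Parameters settings change the set of links, here is a minimal sketch. It is not the connector's implementation: the function names and sample URLs are hypothetical, and it assumes that "same domain" means an exact host name match, which might not be how Data 360 compares domains.

```python
from urllib.parse import urlsplit


def remove_url_parameters(url: str) -> str:
    """Drop the "?" and the URL text that follows it."""
    base, _, _ = url.partition("?")
    return base


def is_same_domain(link: str, sitemap_url: str) -> bool:
    """Assumption: two URLs share a domain when their host names are equal."""
    link_host = urlsplit(link).hostname
    return link_host is not None and link_host == urlsplit(sitemap_url).hostname


def links_to_ingest(links, sitemap_url, ingest_external=False, remove_parameters=False):
    """Apply the two settings to a list of links and return the result."""
    selected = []
    for link in links:
        # Skip links to other hosts unless external domains are allowed.
        if not ingest_external and not is_same_domain(link, sitemap_url):
            continue
        selected.append(remove_url_parameters(link) if remove_parameters else link)
    return selected


# Hypothetical sitemap location and links, used only for illustration.
SITEMAP_URL = "https://docs.example.com/sitemap.xml"
LINKS = [
    "https://docs.example.com/guide?page=2",
    "https://www.example.com/about",
]

print(links_to_ingest(LINKS, SITEMAP_URL, remove_parameters=True))
print(links_to_ingest(LINKS, SITEMAP_URL, ingest_external=True))
```

With both settings off, the sketch keeps only the link on the sitemap's own host and leaves its query string intact.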