Create an Unstructured Data Connection from the Web Content Crawler

Create an unstructured data lake object (UDLO) in Data 360 to ingest your organization’s marketing, ecommerce, documentation, or other website content.

See Unstructured Data File Formats and Connectors for a list of supported file formats.

User Permissions Needed
To connect unstructured data: Data Cloud Architect

Before you begin:

  • Make sure you’ve set up a Web Content (Crawler) connection.

  1. From App Launcher, select Data Cloud.

  2. Click Data Lake Objects and then click New.

  3. To choose a method for creating your data lake object, select From External Files, and click Next.

  4. Choose the Web Content (Crawler) connector, and click Next.

  5. To specify what content is ingested into Data 360, from the Select Connection dropdown list, select the Web Content (Crawler) connection you previously created. Data 360 populates the source based on the connection that you select.

  6. Configure what content is ingested from the website you connect to Data 360.

    This connector supports HTML, PDF, and image formats: .jpg, .jpeg, .png. Data 360 skips other file types and doesn’t ingest them.

    • Crawl Depth: Enter the number of levels (1–8) to include in the ingestion, counted from the starting URL you provided when you set up the connection. For example, if you enter your website’s home page URL and set the crawl depth to 1, the home page and the pages it links to directly are ingested. If the crawl depth is 2, the home page, the pages it links to, and the pages those pages link to are ingested. Links outside of the website’s domain aren’t ingested.

    • Included Page Element: Enter the jQuery selector you prepared to specify which HTML element to include. Add the selector without any surrounding parentheses or single quotes. The first matching HTML element and its child elements are included. You can enter only one jQuery selector. By default, the body page element, which represents the content of the HTML page, is included.

      To get a jQuery selector, use your browser’s developer tools to inspect a page, find the HTML element you want to ingest, right-click it, and then from the menu, click Copy > Copy selector.

    • Remove URL Parameters: Select this option to strip the query parameters (everything after the “?”) from each URL, which produces a “clean link”.

    • Included Web Pages: Enter the regular expression you prepared to match the pages you want to ingest. For example, if you enter .*/content/.*, all pages under the starting URL that include /content/ in their URL path are included. Leave this field empty to include all pages under the starting URL. For additional examples of regular expressions, see Regular Expression Examples.

  7. In the Page Elements section, select the HTML elements to include in the ingestion, in addition to the <body> element. You can select none, one, or multiple elements. When the connector runs, it ingests the content located inside the selected HTML elements. If you don’t select any elements, the <body> element is ingested by default.

  8. Click Next. The connector ingests the data every 24 hours. To view the ingestion status, go to your Data Stream status.

  9. To set up your unstructured data lake object (UDLO) and its associated unstructured data model object (UDMO), add an Object Name and an Object API Name for the UDLO. See Data Lake Object Naming Standards.

  10. Map the UDLO to a UDMO.

    • To create a new UDMO, click New. Then select the data space in which to create it from the Data Space dropdown list. Add an Object Name and an Object API Name for the UDMO. See Data Lake Object Naming Standards.
    • To use an existing UDMO, click Existing, select a data space, and then select the UDMO from the list.

  11. Optionally, leave the checkbox selected to create a search index configuration for the UDMO. The configuration uses system defaults that automatically select text fields and a chunking strategy for each field. If you deselect the checkbox, you can create a search index configuration later.

  12. Select the checkbox to enable Content Harmonization for the UDLO. Leave it deselected to enable content harmonization later.

    If you enable content harmonization now, AI Enrichments (Summary and Q&A) are disabled by default. To enable AI Enrichments, create a Harmonization Configuration and enable Einstein.

    When you enable content harmonization, you enable collection of Content Viewer engagement data.

  13. Click Next, or if you created a search index configuration, review the details, and save your work.

The data stream now ingests your website data into an unstructured data lake object (UDLO) and maps it to an unstructured data model object (UDMO). A search index is created from this UDMO, and you can use it to ground AI-generated responses. See Monitoring Guidelines for tips on how to track your connector’s ingestion progress.
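To make the Crawl Depth setting from step 6 more concrete, here is a minimal Python sketch of how a depth limit bounds which pages are reached. It is an illustration only and not the connector's implementation. The LINKS graph, the example.com URLs, and the function name pages_within_depth are hypothetical, and "same domain" is simplified to an exact host match, which is an assumption.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph: each page URL maps to the URLs it links to.
LINKS = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/x",  # outside the starting domain
    ],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/a/1": ["https://example.com/a/1/deep"],
}


def pages_within_depth(start_url, crawl_depth, links):
    """Return the sorted pages reachable in at most crawl_depth link hops.

    Assumption: a link is followed only if its host equals the host of
    the starting URL.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == crawl_depth:
            continue  # depth limit reached; do not follow this page's links
        for target in links.get(url, []):
            if urlparse(target).netloc != host:
                continue  # skip links outside the starting domain
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return sorted(seen)
```

With this toy graph, a depth of 1 returns the home page plus /a and /b, and a depth of 2 also returns /a/1. The other.org link is skipped at every depth.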

Instead of manually listing every URL you want to include in the ingestion, you can use a regular expression (regex) that includes URLs with specific text or structure patterns.

  • For URLs containing the text “support-content”, use the regular expression .*support-content.*. This includes any URLs containing the specified text, regardless of what comes before or after it.
  • For URLs containing the path “/support/content/”, use the regular expression .*\/support\/content\/.*. This includes any URLs containing the specified path, regardless of what comes before or after it.
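Before you enter a regular expression in the Included Web Pages field, you can check it against a few sample URLs. The following Python sketch tests the two patterns above. The sample URLs and the matching_urls helper are hypothetical, and the use of a whole-URL match is an assumption about how the pattern is applied. Because both patterns start and end with .*, a whole-URL match and a substring search give the same result here.

```python
import re

# The two patterns from the examples above.
TEXT_PATTERN = r".*support-content.*"
PATH_PATTERN = r".*\/support\/content\/.*"

# Hypothetical sample URLs used only for this check.
SAMPLE_URLS = [
    "https://example.com/help/support-content/reset-password",
    "https://example.com/support/content/faq",
    "https://example.com/blog/launch",
]


def matching_urls(pattern, candidates):
    """Return the candidate URLs that the pattern matches in full.

    An empty pattern includes every URL, mirroring the documented
    behavior of leaving the Included Web Pages field empty.
    """
    if not pattern:
        return list(candidates)
    compiled = re.compile(pattern)
    return [url for url in candidates if compiled.fullmatch(url)]


text_matches = matching_urls(TEXT_PATTERN, SAMPLE_URLS)
path_matches = matching_urls(PATH_PATTERN, SAMPLE_URLS)
```

Here text_matches contains only the support-content URL, and path_matches contains only the /support/content/ URL. The blog URL matches neither pattern.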

After you create the UDLO, you can edit its settings. Each connector has different settings available.

  1. From App Launcher, select Data Cloud.
  2. Select the Data Stream tab.
  3. Find the row for the Web Content (Crawler) connection data stream.
  4. At the end of the row, expand the menu and click Edit.
  5. Change any of the available settings. See the list below.
  6. Save your changes.

For this connector, you can edit the following settings.

  • Data Stream Name: Choose a different name (see step 9 above).
  • Refresh Mode: This connector only uses Upsert. There is no need to change it.
  • Crawl Depth: Choose a different crawl depth (see step 6 above).
  • Included Page Element: Enter a different jQuery selector (see step 6 above).
  • Included Web Pages: Enter a different regular expression (see step 6 above).
  • Remove URL Parameters: Select to strip the query parameters from each URL, which produces a clean link (see step 6 above).
  • Included Page Elements: Select different HTML elements to include (see step 7 above).
  • Schedule: Set the time interval for refresh.
  • Refresh Immediately: Start refresh now.
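
As an illustration of the Remove URL Parameters setting, the following Python sketch shows one way to produce a "clean link" by dropping everything after the "?". It is a sketch under assumptions and not the connector's actual logic. The clean_link function name and the sample URLs are hypothetical, and dropping the fragment along with the query string is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_link(url):
    """Return the URL without its query string.

    Assumption: the fragment (the part after "#") is dropped as well,
    so that nothing after the path remains.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical URLs that differ only in their query parameters.
raw_urls = [
    "https://example.com/products?utm_source=mail",
    "https://example.com/products?page=2&sort=asc",
    "https://example.com/docs/",
]

# After cleaning, the two product URLs collapse into a single entry.
cleaned = sorted({clean_link(url) for url in raw_urls})
```

In this example, cleaned holds two URLs, https://example.com/docs/ and https://example.com/products, because the two product links become identical once their parameters are removed.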