Step 1: Strategize Website Ingestion

Plan your website ingestion. An informed strategy at this point helps you avoid issues after Data 360 ingests your content and you build your agent.

Select your marketing, ecommerce, documentation, or any other website you want to ingest into Data 360 to support Agentforce. You can ingest the entirety of your website's content, or you can choose to scope the ingestion to specific pages or areas.

  • You can ingest only public content. Data 360 doesn’t ingest content that's behind a login.
  • Data 360 doesn’t ingest content that's generated dynamically by using JavaScript.

To check whether your site content is generated dynamically using JavaScript:

  1. On your website, open a page you want to ingest.
  2. Right-click anywhere on the page, and then click Inspect to open the developer tools.
  3. Click the settings icon, and scroll down to the Debugger section.
  4. Select Disable JavaScript, and refresh the page. If the page appears empty after the refresh, Data 360 can’t ingest its content.
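You can also approximate this check programmatically. The sketch below is a rough heuristic, not part of Data 360: it strips scripts and tags from a page's raw HTML and flags the page as JavaScript-rendered when little visible text remains. The 200-character threshold and the sample markup are assumptions for illustration.

```python
import re

def looks_javascript_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: a page rendered mostly by JavaScript usually ships little
    visible text in its raw HTML. Strip scripts, styles, and tags, then
    check how much text remains. The threshold is an assumption."""
    # Remove script and style blocks entirely, then all remaining tags.
    stripped = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    visible = " ".join(text.split())
    return len(visible) < min_text_chars

# Example: an app shell with only an empty root div is JavaScript-rendered.
shell = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"
print(looks_javascript_rendered(shell))  # True: almost no static text
```

A real implementation would fetch the page first (for example, with `urllib.request`) and could compare the raw HTML against the browser-rendered version, but the browser check above remains the authoritative test.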

Depending on the content you want to make available for Agentforce, you can ingest your entire website or parts of it. Define your ingestion scope, and list a few sample pages. You can use this page list later to validate that the correct content ingests successfully.

Cirrus is a sustainable energy solution provider that wants to build a service agent to answer questions about its products, using content from its public site: https://cirrus-static-site.vercel.app.

As Cirrus wants the agent to answer only product-related questions, the team scopes the ingestion to specific product-related pages and lists sample pages from the company’s website to include in the ingestion.

  • https://cirrus-static-site.vercel.app/products/solar_panels/standard-solar-panel-250w.html
  • https://cirrus-static-site.vercel.app/products/starter_sets/homestarter-kit.html
  • https://cirrus-static-site.vercel.app/products/small_turbines/tinyturbine-250w.html

Data 360 offers a growing list of connectors and strategic partners to help you ingest content from external sources into Salesforce. See the comprehensive list of connectors and integrations. Of these, two connectors ingest website content.

For public website ingestion, we recommend the Web Content (Sitemap) connector. A sitemap provides a defined list of the website's URLs and can include metadata such as each page’s last-updated date. This defined list and metadata make the ingestion process predictable and easy to manage. A crawler, however, doesn't use a closed list of URLs and instead discovers pages by following links based on crawl depth and rules. While the crawling ingestion method is more flexible, it can be less predictable.

If your site has a relevant XML sitemap that includes the pages you want to ingest, select the Web Content (Sitemap) connector. Use the Web Content (Crawler) connector when a sitemap isn't available, or if your site has a sitemap that doesn't include the content you want to ingest.

To check for a sitemap:

  1. Go to https://<your domain>/robots.txt. The robots.txt page of a website typically lists the site's sitemap.
  2. If you can’t find the robots.txt page, try appending sitemap.xml to your domain, for example, https://<your domain>/sitemap.xml, or try another sitemap index URL exposed by the site.
  3. Can’t find the sitemap? Go to Strategize Connection with Web Content (Crawler) Connector to proceed with the Web Content (Crawler) connector.
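The robots.txt lookup in step 1 can be scripted. This sketch parses a robots.txt body for Sitemap directives; the robots.txt content shown is an assumed example based on Cirrus's published sitemap URL.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Extract Sitemap: entries from a robots.txt body.
    Directive names in robots.txt are case-insensitive by convention."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

# Assumed robots.txt content for the Cirrus example site.
robots = """User-agent: *
Allow: /
Sitemap: https://cirrus-static-site.vercel.app/sitemap.xml"""
print(sitemaps_from_robots(robots))
```

In practice you'd fetch https://<your domain>/robots.txt first and pass the response body to this function.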

Cirrus finds its sitemap on this URL: https://cirrus-static-site.vercel.app/robots.txt.

Your site can have multiple sitemaps for different languages, product areas, or content types. Select the relevant sitemap based on the content that you want to ingest. See, for example, this sitemap for English content only: https://help.your-domain.com/oie/en-us/Sitemap.xml.

After you find the right sitemap, verify that it includes the pages you want to ingest. Search the sitemap for your sample URLs, or for distinctive parts of those URLs, and confirm that they appear.
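If the sitemap is large, a short script can do this verification for you. The sketch below parses a sitemap's `<loc>` entries and reports which sample URLs are missing; the truncated toy sitemap is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

def missing_samples(sitemap_xml: str, sample_urls: list[str]) -> list[str]:
    """Return the sample URLs that do NOT appear in the sitemap's <loc> entries."""
    root = ET.fromstring(sitemap_xml)
    # Sitemaps use the sitemaps.org namespace; match <loc> regardless of prefix.
    locs = {el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text}
    return [u for u in sample_urls if u not in locs]

# Toy sitemap containing only one of the Cirrus sample pages.
sitemap = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://cirrus-static-site.vercel.app/products/solar_panels/standard-solar-panel-250w.html</loc></url>
</urlset>"""
samples = [
    "https://cirrus-static-site.vercel.app/products/solar_panels/standard-solar-panel-250w.html",
    "https://cirrus-static-site.vercel.app/products/starter_sets/homestarter-kit.html",
]
print(missing_samples(sitemap, samples))  # the second sample is missing from this toy sitemap
```

An empty result means every sample page is covered by the sitemap and your scope is safe to configure.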

After you select the relevant ingestion method and sitemap, decide how to scope the ingestion so that it includes only the relevant pages. A simple way to scope content ingestion is by filtering URLs with a regular expression. Prepare a regular expression that matches the pages you want to ingest. For example, if you enter .*/content/.*, all pages under the starting URL that include /content/ in their URL path are ingested. If you don’t specify any included pages, all pages under the sitemap URL are included.

Cirrus doesn't want to ingest its entire site; the team is interested only in product-related pages. Based on the sample pages they list, a suitable regular expression pattern is .*/products/.*. This pattern matches URLs that contain /products/, which aligns with the company's content ingestion scope.
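Before configuring the connector, you can sanity-check a pattern like this locally. In the sketch below, the non-product URL is a hypothetical example, and the pattern is applied as a full match; how the connector itself anchors the pattern may differ.

```python
import re

# Cirrus's filter pattern, applied here as a full-match regex (an assumption).
PRODUCT_FILTER = re.compile(r".*/products/.*")

urls = [
    "https://cirrus-static-site.vercel.app/products/solar_panels/standard-solar-panel-250w.html",
    "https://cirrus-static-site.vercel.app/about-us.html",  # hypothetical non-product page
]
ingested = [u for u in urls if PRODUCT_FILTER.fullmatch(u)]
print(ingested)  # only the /products/ URL survives the filter
```

Running your full sample-page list through the pattern this way confirms the filter keeps the pages you want and drops the rest.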

After strategizing your website ingestion, make sure that you have the following:

  • Your selected web content connector
  • The content scope for the agent
  • The URL of your sitemap
  • The filtering logic to limit ingestion to the relevant pages

Cirrus gathers its findings.

  • Selected website content connector: Web Content (Sitemap)
  • Content scope: Product-related
  • Sitemap URL: https://cirrus-static-site.vercel.app/sitemap.xml
  • Filter pattern: .*/products/.*