Strategize Connection with Web Content (Crawler) Connector

Follow these steps to continue strategizing your content ingestion with the Web Content (Crawler) connector. If your site has a relevant sitemap, you can skip these steps and continue straight to setting up your connector.

The Starting URL is the page on your site where Data 360 begins crawling for ingestion when you use the Web Content (Crawler) connector. Crawl Depth is the number of link levels Data 360 follows from that starting URL. For example, if the starting URL is your home page and the crawl depth is 1, Data 360 ingests the home page and the pages it links to directly. At depth 2, it also ingests pages linked from those pages.
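Data 360's crawler internals aren't documented here, but the depth behavior described above can be sketched as a breadth-first traversal over a link graph. The pages and links below are made up for illustration:

```python
from collections import deque

# Hypothetical site link graph: page -> pages it links to directly.
links = {
    "/": ["/products/", "/about.html"],
    "/products/": ["/products/small_turbines/"],
    "/products/small_turbines/": ["/products/small_turbines/tinyturbine-250w.html"],
}

def crawl(start, depth):
    """Collect every page within `depth` link hops of `start`, breadth-first."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        page, hops = frontier.popleft()
        if hops == depth:
            continue  # don't follow links beyond the crawl depth
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return seen

# Depth 1 from the home page ingests home plus its direct links;
# depth 2 also ingests pages linked from those pages.
print(sorted(crawl("/", 1)))
print(sorted(crawl("/", 2)))
```

At depth 1 the product detail page is out of reach from the home page; it only appears once the depth covers every hop on the navigation path.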

To find a starting URL and crawl depth:

  1. List a few representative pages you want to include in the ingestion, like the sample pages from your strategizing exercise. Open each in the browser and note how you navigate there from a natural entry point like home or a docs landing page.
  2. Pick a starting URL that's the closest shared ancestor on those navigation paths. This is a single page that still leads to every target page, but sits no higher in the site than it needs to, so the crawl doesn't pull in unrelated sections.
  3. Set crawl depth to the smallest number that still reaches your target pages from that starting URL: count how many link hops it takes from the starting URL down to the deepest page you need.
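For sites whose URL paths mirror their navigation, the hop count in step 3 can be estimated from path segments. This is only a proxy: the real measure is link hops, so always verify by clicking through the actual navigation. The helper below is a hypothetical illustration, not part of the connector:

```python
from urllib.parse import urlparse

def hops_by_path(start_url, target_url):
    """Rough crawl-depth estimate: extra path segments between start and target.

    Assumes the site's URL structure mirrors its link structure, which isn't
    guaranteed -- a page two segments deep may still be one click away.
    """
    start = [s for s in urlparse(start_url).path.split("/") if s]
    target = [s for s in urlparse(target_url).path.split("/") if s]
    if target[:len(start)] != start:
        raise ValueError("target is not under the starting URL")
    return len(target) - len(start)

depth = hops_by_path(
    "https://cirrus-static-site.vercel.app/products/",
    "https://cirrus-static-site.vercel.app/products/small_turbines/tinyturbine-250w.html",
)
print(depth)  # 2: the target sits two path segments below the starting URL
```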

Cirrus wants Agentforce to answer questions about TinyTurbine 250W. A representative target page is the product detail page: https://cirrus-static-site.vercel.app/products/small_turbines/tinyturbine-250w.html.

From the site navigation, the Products level that covers this and all other products is a practical starting URL: https://cirrus-static-site.vercel.app/products/.

With crawl depth 2, Data 360 ingests the Products page, the category pages it links to directly (such as Small Turbines), and every page those category pages link to, including the TinyTurbine 250W page and any sibling product pages in that section.

After you select the relevant ingestion method and decide on a starting URL and crawl depth, decide how to scope the site ingestion to include only the relevant pages. A simple way to scope content ingestion is by filtering URLs with a regular expression. Prepare a regular expression that matches the pages you want to ingest. For example, if you enter .*/content/.*, all pages under the starting URL that include /content/ in their URL path get ingested. If you don’t specify any included pages, all pages under the starting URL get ingested.
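Whether the connector anchors the pattern to the whole URL or searches within it isn't specified here; the sketch below assumes full-match semantics, which is why the leading and trailing .* are needed. The example URLs are hypothetical:

```python
import re

# Include filter from the example: any URL whose path contains /content/.
# The leading and trailing .* let the pattern match the full URL.
include = re.compile(r".*/content/.*")

urls = [
    "https://example.com/content/guide.html",
    "https://example.com/content/faq/returns.html",
    "https://example.com/pricing.html",
]
ingested = [u for u in urls if include.fullmatch(u)]
print(ingested)  # only the two /content/ pages pass the filter
```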

Cirrus only wants to ingest product-related pages. Based on the sample pages they've previously selected, a suitable regular expression pattern is .*/products/.*. This pattern matches any URL that contains /products/ in its path, which aligns with Cirrus's content ingestion scope.
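It's worth sanity-checking the pattern against the representative pages before configuring the connector. The non-product URL below is hypothetical, and the check again assumes full-match semantics:

```python
import re

pattern = re.compile(r".*/products/.*")

# Cirrus's representative target page should pass the filter...
target = "https://cirrus-static-site.vercel.app/products/small_turbines/tinyturbine-250w.html"
assert pattern.fullmatch(target)

# ...while a hypothetical non-product page should not.
assert not pattern.fullmatch("https://cirrus-static-site.vercel.app/about.html")
print("filter scopes ingestion to product pages")
```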

After strategizing your website ingestion, make sure that you have:

  • Your selected web content connector
  • The content scope for the agent
  • The starting URL and crawl depth
  • The filtering logic to limit ingestion to the relevant pages

Cirrus gathers its findings for its ingestion strategy.
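Those findings can be captured as a simple record for reference during setup. The field names below are illustrative, not the Web Content (Crawler) connector's actual configuration schema:

```python
# Illustrative summary of Cirrus's ingestion strategy -- field names are
# hypothetical, not the connector's real configuration schema.
ingestion_strategy = {
    "connector": "Web Content (Crawler)",
    "content_scope": "Product-related pages, e.g. TinyTurbine 250W",
    "starting_url": "https://cirrus-static-site.vercel.app/products/",
    "crawl_depth": 2,
    "include_pattern": r".*/products/.*",
}

for key, value in ingestion_strategy.items():
    print(f"{key}: {value}")
```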