Step 2: Set Up a Web Content Connector and Verify Content Ingestion
Now that you've planned your site ingestion strategy and selected a web content connector, set it up. Follow the specific instructions in the connector documentation, noting that additional settings are available to further refine your content ingestion.
After you set up your connector, confirm that the expected content is in Data 360.
Follow the setup instructions in the documentation for your selected connector. As you set up the connector, don't enable the default search index. Instead, create a customized search index later by following the instructions in Create Search Index and Retriever.
- Set up the Web Content (Sitemap) connector.
OR
- Set up the Web Content (Crawler) connector.
After you set up your connector and Data 360 starts the ingestion process, check the syncing status on the Data Streams tab (see Monitor the Status of Unstructured Connectors Data Ingestion). Then continue with the relevant validations for the sitemap or crawler connector.
Verify that Data 360 ingested the number of pages you expected based on your sitemap and filters.
Based on your sitemap and filtering strategy, estimate the number of pages you expect Data 360 to ingest: review your sitemap and count the entries that match your filters. Then compare this estimate against the number of pages that Data 360 actually ingested.
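For larger sitemaps, the counting step can be scripted. The following is a minimal sketch, assuming a standard sitemaps.org XML sitemap and assuming that the connector's filter behaves like a full-match regular expression on each URL. Neither assumption comes from the connector documentation, so confirm the matching behavior there. The function name `count_matching_urls` and the sample sitemap are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def count_matching_urls(sitemap_xml: str, pattern: str) -> int:
    """Count <loc> entries whose URL fully matches the filter pattern.

    Assumption: the filter is applied as a full-match regular expression.
    """
    root = ET.fromstring(sitemap_xml)
    regex = re.compile(pattern)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return sum(1 for url in urls if regex.fullmatch(url))


# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/widget</loc></url>
  <url><loc>https://example.com/products/gadget</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

expected_pages = count_matching_urls(SAMPLE_SITEMAP, r".*/products/.*")
print(expected_pages)
```

In practice, you would load your own sitemap file instead of the sample string and pass the same expression you configured in the connector.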
On the Data Streams tab, make sure your data stream’s Last Run Status is Success and query the Data Lake Object (DLO) to confirm the number of ingested HTML pages.
- On the Data Cloud app, go to the Query Editor tab.
- To create a new workspace, give it a name and save your changes.
- On the left, select Data Lake Objects and select your Data Lake Object (DLO).
- Click Create Query on the right.
- Paste the following query example, replacing DLO API Name with your Data Lake Object API name. To find your Data Lake Object (DLO) API name, look for it in double quotes in the Query Editor or on your DLO tab.
Cirrus has filtered its ingestion to product-related content (.*/products/.*). The team reviews the sitemap and counts the entries matching this pattern. They find 20 URLs on the sitemap that contain this expression, which gives them a concrete number to compare against Data 360. After they run the query, the result is 20, which matches the number of product pages found on the sitemap.
The exact count may vary over time as the source site changes, but the process stays the same: define the expected scope first, then compare it to what was actually ingested.
The result of the query should be reasonably close to the expected number of pages you identified. If there's a meaningful gap, investigate before you continue. Possible reasons for a mismatch include:
- The regular expression filtered out more than you intended.
- Some URLs in the sitemap point to non-HTML assets.
- The source content changed between your review and the sync.
- Some pages were skipped or failed during ingestion.
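To narrow down which of these reasons applies, it can help to compare the URL lists rather than only the counts. The sketch below assumes you have exported the expected URLs from your sitemap review and the ingested URLs from your query results into two plain lists. The function name `diagnose_gap`, the extension list, and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: file extensions treated as non-HTML assets; adjust for your site.
NON_HTML_EXTENSIONS = (".pdf", ".jpg", ".png", ".zip")


def diagnose_gap(expected_urls, ingested_urls):
    """Group the differences between expected and ingested URLs."""
    expected, ingested = set(expected_urls), set(ingested_urls)
    missing = expected - ingested

    def is_non_html(url):
        # Look only at the path so query strings don't hide the extension.
        return urlparse(url).path.lower().endswith(NON_HTML_EXTENSIONS)

    return {
        # Expected URLs that point to assets the connector may not ingest as pages.
        "missing_non_html": sorted(u for u in missing if is_non_html(u)),
        # Expected pages that were filtered out, skipped, or failed.
        "missing_other": sorted(u for u in missing if not is_non_html(u)),
        # Ingested URLs you did not expect, for example content added after your review.
        "unexpected": sorted(ingested - expected),
    }


# Hypothetical URL lists, for illustration only.
expected = [
    "https://example.com/products/widget",
    "https://example.com/products/manual.pdf",
    "https://example.com/products/gadget",
]
ingested = [
    "https://example.com/products/widget",
    "https://example.com/products/new-item",
]

report = diagnose_gap(expected, ingested)
for category, urls in report.items():
    print(category, urls)
```

URLs under `missing_other` are the ones to check first against your regular expression and the ingestion status.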
To validate a crawler-based ingestion, use the sample pages you identified earlier.
- On the Data Cloud app, go to the Query Editor tab.
- To create a new workspace, give it a name and save your changes.
- On the left, select Data Lake Objects and select your Data Lake Object (DLO).
- Click Create Query on the right.
- Paste the following query example, replacing DLO API Name with your Data Lake Object API name. To find your Data Lake Object (DLO) API name, look for it in double quotes in the Query Editor or on your DLO tab.
This query helps confirm that Data 360 ingested the relevant content. For each sample page, check that:
- The expected page exists.
- The stored URL is correct.
- The content type is correct.
- The connector ingested the kind of content that you intended for agent grounding.
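If you have more than a few sample pages, these checks can be run over exported query results. The following is a minimal sketch, assuming each result row is available as a dictionary. The keys `url` and `content_type` are hypothetical stand-ins, not actual DLO field names, so map them to the fields your query returns. The function name `check_sample_pages` is also hypothetical.

```python
def check_sample_pages(sample_urls, rows, expected_type="text/html"):
    """Return a list of problems found for the sample pages.

    Assumption: each row is a dict with hypothetical keys "url" and
    "content_type"; replace them with the fields your query returns.
    """
    by_url = {row["url"]: row for row in rows}
    problems = []
    for url in sample_urls:
        row = by_url.get(url)
        if row is None:
            # The expected page is missing, or its stored URL differs.
            problems.append(f"{url}: not found in query results")
        elif not row["content_type"].startswith(expected_type):
            problems.append(
                f"{url}: unexpected content type {row['content_type']}"
            )
    return problems


# Hypothetical sample pages and query results, for illustration only.
samples = [
    "https://example.com/products/widget",
    "https://example.com/products/gadget",
    "https://example.com/products/brochure",
]
rows = [
    {"url": "https://example.com/products/widget", "content_type": "text/html"},
    {"url": "https://example.com/products/brochure", "content_type": "application/pdf"},
]

problems = check_sample_pages(samples, rows)
for problem in problems:
    print(problem)
```

An empty result means every sample page was found with the expected content type. Whether the ingested text is suitable for agent grounding still requires a manual review of the content itself.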
If the content validation is unsuccessful, fix the ingestion setup before continuing.
Before moving on to create a search index, make sure that the answer to each of these questions is yes.
- Did the sync complete successfully?
- Are your sample pages ingested?
- Does the content in Data 360 match the intended scope?
If the answer to any of these questions is no, fix the ingestion setup before continuing. It's easier to correct filtering or connector configuration now than after you build the search index and connect it to an agent.
- Step 3: Create Search Index and Retriever: Create the search index and retriever that your agent will use over the ingested website content.