Work with Data Cloud Data using Amazon SageMaker ML Capabilities

In an earlier blog post, I explained how to connect your Salesforce Data Cloud org to Amazon SageMaker and how to create a Data Wrangler flow. Once you’ve done that, you’re ready for the next steps: importing your data from Data Cloud into Amazon SageMaker and preparing it for use in a machine learning (ML) model.

In this post, you’ll learn how to use Amazon SageMaker to view and query your Data Cloud data and metadata. You’ll also see how to use Data Wrangler flows to process, cleanse, transform, prepare, and analyze your data for use in an ML model. Finally, I’ll show you how to use Data Wrangler flows in a Jupyter Notebook as an input for your ML models.

If you haven’t already read the previous post, I encourage you to do so as it explains some important prerequisite steps for the topics covered here.

Viewing your metadata in Amazon SageMaker

After your Salesforce Data Cloud org is connected to Amazon SageMaker, you can easily view your org’s metadata in Data Wrangler. You can see data lake objects, data model objects, the fields in each object, and the data in the fields of each object.

Amazon SageMaker console with Data Cloud objects.

Querying your Data Cloud data and metadata in Amazon SageMaker

You can use SOQL to query data lake objects and data model objects in Data Cloud. After you type a query and click Run, the results — including the fields and values of your query — are shown on the Query Results tab.

SOQL query in Amazon SageMaker.
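
If you run the same kind of query often, a small helper can keep the statement consistent. The sketch below only builds a simple SELECT ... FROM ... LIMIT string that you could paste into the query editor. The helper name `build_query` and the object name `Customer_Profile__dlm` are hypothetical. The field names are borrowed from the transform example later in this post.

```python
def build_query(object_name: str, fields: list[str], limit: int = 100) -> str:
    """Return a simple SELECT statement for one Data Cloud object."""
    if not fields:
        raise ValueError("at least one field is required")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT {', '.join(fields)} FROM {object_name} LIMIT {limit}"


# Hypothetical object API name, used for illustration only.
query = build_query(
    "Customer_Profile__dlm",
    ["state__c", "case_count__c", "tenure__c"],
    limit=50,
)
print(query)
```

Keeping a LIMIT on exploratory queries is a reasonable habit while you are still deciding which fields belong in your dataset.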

You can then click Create a Dataset to use your Data Cloud data in your ML models. Now that you have your data imported into Amazon SageMaker and you have your dataset created, let’s look at what you can do with the data.

Preprocessing your Data Cloud data in Amazon SageMaker Data Wrangler

Data Wrangler flow with transformations.

Data Wrangler enables you to prepare and transform your data so that you can use it in your ML models. Data Wrangler has built-in transforms, configurable from the user interface, that you can use to transform columns without any code. You can also create your own custom transformations with code using PySpark, Python, pandas, and PySpark SQL. You can view an extensive list of the types of transformations that are available out-of-the-box and popular custom transformations in the Amazon SageMaker Data Wrangler documentation.

As an example, the following PySpark transform creates categorical values for state__c, case_count__c, and tenure__c.
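
To give a feel for this kind of transform, here is a minimal sketch written with pandas instead of PySpark so that it runs locally without a Spark session. It is not the exact transform from the flow. The bucket boundaries, the labels, the new column names, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd


def add_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Add categorical versions of state__c, case_count__c, and tenure__c."""
    out = df.copy()
    # Map each distinct state to an integer code (alphabetical order).
    out["state_cat"] = out["state__c"].astype("category").cat.codes
    # Assumed buckets: 0 cases, 1 to 3 cases, more than 3 cases.
    out["case_count_cat"] = pd.cut(
        out["case_count__c"],
        bins=[-1, 0, 3, float("inf")],
        labels=["none", "few", "many"],
    )
    # Assumed buckets: up to 1 year, 1 to 5 years, more than 5 years.
    out["tenure_cat"] = pd.cut(
        out["tenure__c"],
        bins=[-1, 1, 5, float("inf")],
        labels=["new", "established", "long_term"],
    )
    return out


# Made-up sample rows, not real Data Cloud data.
sample = pd.DataFrame(
    {
        "state__c": ["CA", "NY", "CA"],
        "case_count__c": [0, 2, 7],
        "tenure__c": [0.5, 3, 10],
    }
)
result = add_categories(sample)
print(result)
```

The original columns are left in place here, so you can compare the raw and bucketed values before deciding which ones to keep.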

Some common transforms are “Drop Column,” which removes the columns you choose so that they are excluded from training and predictions, and “Drop Missing,” which removes rows that have missing or blank values. You can also use “One-Hot Encode” to turn categorical variables into numerical ones, which is necessary because most ML algorithms require numerical inputs for training and prediction.
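
If you want to see what these three transforms do to a table, the following pandas sketch mimics them on a few made-up rows. It is an approximation of the behavior, not Data Wrangler's own implementation, and the `record_id__c` column is hypothetical.

```python
import pandas as pd

# Made-up rows; record_id__c is a hypothetical identifier column.
raw = pd.DataFrame(
    {
        "record_id__c": ["a1", "a2", "a3", "a4"],
        "state__c": ["CA", "NY", None, "CA"],
        "case_count__c": [0, 2, 5, 1],
    }
)

prepared = (
    raw.drop(columns=["record_id__c"])  # like "Drop Column"
    .dropna()                           # like "Drop Missing"
    .pipe(pd.get_dummies, columns=["state__c"], dtype=int)  # like "One-Hot Encode"
)
print(prepared)
```

The row with the missing state is gone, and `state__c` has been replaced by one 0/1 column per state value.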

Analyzing Data Cloud data in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler includes built-in analyses that help you generate visualizations and analyze your data with just a few clicks. You can also create custom analyses using your own code. For example, you can use a histogram to see the frequency distribution of a specific feature, as in the example below. You can also create scatter plots, summaries, and more. To learn more about the analyses you can perform on your Data Cloud data in Amazon SageMaker Data Wrangler, see Analyze and Visualize in the Amazon SageMaker documentation.

Histogram in Amazon SageMaker
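
A histogram is just a count of how many values fall into each bin. The pandas sketch below computes those counts for a tenure feature so you can see the numbers behind the chart. The bin edges and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up tenure values in years.
tenure = pd.Series([0.5, 1, 2, 3, 4, 7, 10, 12], name="tenure__c")

# Assumed bin edges; each bin includes its right edge, e.g. (0, 2].
bins = [0, 2, 5, 10, 15]

# Count the values in each bin, then order the result by bin.
counts = pd.cut(tenure, bins=bins).value_counts().sort_index()
print(counts)
```

Changing the bin edges changes the shape of the distribution you see, so it is worth trying a few settings before drawing conclusions about a feature.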

Using your Data Cloud data in an ML model

Data from a Data Wrangler flow can be used as an input for your ML models, and can be imported into your SageMaker Jupyter notebooks. The example below references a product recommendation flow, which is a Data Wrangler flow that was created in Amazon SageMaker. For more on this use case and its implementation, see Use the Amazon SageMaker and Salesforce Data Cloud integration to power your Salesforce apps with AI/ML.

Jupyter notebook referencing a Data Wrangler flow.
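
Before you wire a flow into a notebook, it can help to check which steps it contains. The sketch below assumes that an exported .flow file is JSON with a top-level `nodes` list whose entries carry `type` and `operator` keys. Treat that layout, the operator strings, and the file name in the comment as assumptions, and inspect your own flow file to confirm them.

```python
import json


def summarize_flow(flow: dict) -> list[str]:
    """Return one 'TYPE: operator' line per node in a flow definition."""
    lines = []
    for node in flow.get("nodes", []):
        node_type = node.get("type", "UNKNOWN")
        operator = node.get("operator", "unknown")
        lines.append(f"{node_type}: {operator}")
    return lines


# Inline stand-in for a flow file, with made-up operator names.
# In a notebook you might load a real file instead, for example:
#   with open("product-recommendation.flow") as f:  # hypothetical file name
#       flow = json.load(f)
flow = json.loads(
    """
    {
      "nodes": [
        {"type": "SOURCE", "operator": "example.source"},
        {"type": "TRANSFORM", "operator": "example.drop_missing"},
        {"type": "TRANSFORM", "operator": "example.one_hot_encode"}
      ]
    }
    """
)

for line in summarize_flow(flow):
    print(line)
```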

You can also export your Data Cloud data and use it to train ML models with Amazon SageMaker Autopilot. Autopilot analyzes your data and performs the preprocessing necessary for model training. It then splits the data into training and validation sets and selects candidate models that fit your data. Lastly, it ranks the optimized model candidates based on their performance and identifies the best-performing model. Once Autopilot has identified the best-performing model, it gives you the option to deploy the model automatically by generating the model artifacts and an endpoint that exposes an API. External applications like Einstein Copilot Studio can then send data to the endpoint and receive the corresponding predictions or inferences.
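
To make the last step concrete, here is a minimal sketch of a client sending rows to a deployed endpoint and reading back predictions. It assumes the endpoint accepts and returns CSV, which depends on how your model was deployed. The endpoint name is hypothetical, and `StubRuntimeClient` is a local stand-in so the example runs without AWS credentials. With real credentials you would pass `boto3.client("sagemaker-runtime")` instead.

```python
import io


def rows_to_csv(rows):
    """Serialize a list of feature rows as CSV text, one row per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def predict(client, endpoint_name, rows):
    """Send rows to a SageMaker endpoint and return one prediction per line."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",  # assumption: the endpoint accepts CSV
        Accept="text/csv",       # assumption: the endpoint returns CSV
        Body=rows_to_csv(rows).encode("utf-8"),
    )
    return response["Body"].read().decode("utf-8").strip().splitlines()


class StubRuntimeClient:
    """Local stand-in for the runtime client; returns "0" for every row."""

    def invoke_endpoint(self, EndpointName, ContentType, Accept, Body):
        row_count = len(Body.decode("utf-8").splitlines())
        fake_output = "\n".join(["0"] * row_count)
        return {"Body": io.BytesIO(fake_output.encode("utf-8"))}


# "my-autopilot-endpoint" is a hypothetical endpoint name.
predictions = predict(
    StubRuntimeClient(),
    "my-autopilot-endpoint",
    [[1, 0, 2], [0, 1, 7]],
)
print(predictions)
```

Passing the client in as a parameter keeps the request logic easy to test locally before you point it at a live endpoint.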

Conclusion

This post covered how to use Salesforce Data Cloud and Amazon SageMaker together to query, import, and analyze your Data Cloud data. Just as important, it showed how to use Data Wrangler flows to preprocess your Data Cloud data so that it is ready for artificial intelligence and ML models. It also showed how to import that data into a SageMaker Jupyter notebook from a Data Wrangler flow and how to use it to train ML models with Amazon SageMaker Autopilot. With that, you have all you need to start experimenting with your Data Cloud data, artificial intelligence, and machine learning!

Resources

Learn more about Salesforce Data Cloud
Trailhead: Artificial Intelligence Fundamentals
Trailhead: Generative AI Basics
Trailhead: Data Fundamentals for AI

About the author

Danielle Larregui is a Senior Developer Advocate at Salesforce focusing on the Data Cloud platform. She enjoys learning about cloud technologies, speaking at and attending tech conferences, and engaging with technical communities. You can follow her on X (Twitter).