You are planning a Force.com implementation with large volumes of data. Your data model is in place, all your code is written and has been tested, and now it’s time to load the objects, some of which have tens of millions of records.

What is the most efficient way to get all those records into the system?

The Force.com Extreme Data Loading Series

This is the fourth entry in a six-part series of blog posts covering many aspects of data loading for very large enterprise deployments.

Here are the topics planned for this series.

Designing the data model for performance
Loading data into a lean configuration
Suspending events that fire on insert
Sequencing load operations
Loading and extracting data
Taking advantage of deferred sharing calculations

This post explains how to best sequence setup and loading operations to optimize overall throughput when inserting very large volumes of data.

Why does the sequence of my load operations matter?

The relationships between your objects, how you configure roles and users, and what you define in your sharing settings can cause each step that you perform during a data load to affect later steps. The sequencing of some of these steps is a hard constraint—you cannot load child records in a master-detail relationship before loading their parent records because you need the parent IDs to complete your data load. Other constraints are subtler but could have even greater effects on overall processing time. For example, if you load your data first, then move the users who own the records around in your role hierarchy, the system must perform additional sharing calculations, which can slow down your role updates.
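The hard constraint described above (parents before their master-detail children) can be worked out mechanically once the dependencies are written down. The following is a minimal sketch, not a prescribed tool: the object names and the `dependencies` mapping are hypothetical, and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter


def load_order(parents_of):
    """Return object names ordered so every parent precedes its children.

    parents_of maps each object name to the set of objects it depends on
    (for example, the master side of a master-detail relationship).
    """
    return list(TopologicalSorter(parents_of).static_order())


# Hypothetical object names, for illustration only.
dependencies = {
    "Account": set(),
    "Contact": {"Account"},
    "Invoice__c": {"Account"},
    "Invoice_Line__c": {"Invoice__c"},
}

order = load_order(dependencies)
print(order)
```

A sort like this only captures the hard ordering constraint. The subtler effects, such as sharing recalculation after role changes, still need to be checked through sandbox testing.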

The following recommendations for sequencing loads are best practices that we developed after working with customers with very large data volumes. As always, you should test and adjust new configurations and loading sequences in a sandbox organization to ensure that you have the most efficient process for production.

1. Configuring Your Organization for the Data Load

Consider enabling the parallel recalculation and defer sharing calculation features. To enable these features—or to ask if your organization already has or could benefit from them—contact salesforce.com Customer Support.
Create the role hierarchy.
Load users, assigning them to appropriate roles.
Configure Public Read/Write organization-wide sharing defaults on the objects you plan to load.*

2. Preparing to Load Data

Make sure the data is clean, especially in foreign key relationships. When there’s an error, parallel loads switch to single execution mode, slowing down the load considerably.
Suspend events that fire on insert. (See the previous entry in this series, on suspending events that fire on insert.)
Perform advance testing to tune your batch sizes for throughput. For both the Bulk API and the SOAP API, look for the largest batch size that is possible without generating network timeouts from large records, or from additional processing on inserts or updates that can’t be deferred until after the load completes.
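One way to check foreign key relationships before a load is to compare the keys in the child file against the keys in the parent file. The sketch below assumes the records are already parsed into Python dictionaries. The function name `find_orphans`, the field name `AccountKey`, and the sample keys are all hypothetical.

```python
def find_orphans(children, parent_keys, fk_field):
    """Return child rows whose foreign key is missing or not in parent_keys."""
    known = set(parent_keys)
    return [row for row in children if row.get(fk_field) not in known]


# Hypothetical sample data, for illustration only.
parents = ["A-001", "A-002"]
children = [
    {"Name": "c1", "AccountKey": "A-001"},
    {"Name": "c2", "AccountKey": "A-999"},  # references an unknown parent
    {"Name": "c3", "AccountKey": ""},       # blank foreign key
]

orphans = find_orphans(children, parents, "AccountKey")
print([row["Name"] for row in orphans])
```

Rows reported here can be corrected or set aside before the load starts, so they do not cause errors in the middle of a parallel load.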

3. Executing the Data Load

Load parent objects before their master-detail children, then extract keys as needed for later loading. See additional details in the second and third entries in this series.
Use the fastest operation possible: insert is faster than upsert, and even insert + update can be faster than upsert alone.
When processing updates, only send fields that have changed for existing records.
Group child records by ParentId, making sure that separate batches don’t reference the same ParentIds. This practice can greatly reduce or eliminate the risk of record-locking errors. If this cannot be arranged, you also have the option of using the Bulk API in serial execution mode to avoid locking from parallel updates.
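The grouping by ParentId described above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: records are plain dictionaries, the helper name `batch_by_parent` and the `max_batch_size` parameter are hypothetical, and submitting the batches to the API is left out.

```python
from collections import defaultdict


def batch_by_parent(records, max_batch_size, parent_field="ParentId"):
    """Split records into batches so that no parent ID spans two batches.

    All children of one parent stay together. A parent with more children
    than max_batch_size ends up alone in an oversized batch, which the
    caller would need to handle separately.
    """
    # Collect children under their parent, preserving first-seen order.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[parent_field]].append(rec)

    batches, current = [], []
    for children in groups.values():
        # Start a new batch if this whole group would not fit.
        if current and len(current) + len(children) > max_batch_size:
            batches.append(current)
            current = []
        current.extend(children)
    if current:
        batches.append(current)
    return batches


# Hypothetical sample data, for illustration only.
records = [
    {"Id": i, "ParentId": p}
    for i, p in enumerate(["P1", "P2", "P1", "P3", "P1", "P2"])
]
batches = batch_by_parent(records, max_batch_size=4)
print([len(b) for b in batches])
```

Keeping each parent's children in a single batch means that parallel batches do not compete for a lock on the same parent record. When a single parent has too many children to fit in one batch, the serial execution mode mentioned above remains the fallback.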

4. Configuring Your Organization for Production

Defer sharing calculations before performing some or all of the operations below, depending on the results of your sandbox testing.
Change Public Read/Write organization-wide sharing models to Public Read Only or Private, where appropriate.
Create or configure public groups and queues.
Configure sharing rules.
If you are not using deferred sharing calculation, create public groups, queues, and sharing rules one at a time, and allow sharing calculations to complete before moving on to the next one.*
Resume events that fire on insert so validation and data enhancement processes run properly in production.

Summary

Loading large volumes of data can be a complex process with a lot of moving parts. If you focus only on making individual processes as fast as possible, and don’t address correct sequencing of the steps, just one out-of-place step could dramatically slow later operations. By understanding how various configuration and loading steps affect one another, you can sequence your operations intelligently and increase your overall loading throughput.

About the Author

Bud Vieira is an Architect Evangelist within the Technical Enablement team of the salesforce.com Customer-Centric Engineering group. The team’s mission is to help customers understand how to implement technically sound salesforce.com solutions. Check out all of the resources that this team maintains on the Architect Core Resources page of Developer Force.