Image credit: flickr.com/photos/62904109@N00/231849844/

You are planning a Force.com implementation with large volumes of data. Your data model is in place, all your code is written and has been tested, and now it’s time to load the objects, some of which have tens of millions of records.

What is the most efficient way to get all those records into the system?

The Force.com Extreme Data Loading Series

This is the fifth entry in a six-part series of blog posts covering many aspects of data loading for very large enterprise deployments.

Here are the topics planned for this series.

Designing the data model for performance
Loading data into a lean configuration
Suspending events that fire on insert
Sequencing load operations
Loading and extracting data
Taking advantage of deferred sharing calculations

In this post, we’ll look at actual customer use cases involving high-volume data loading and extraction, and outline some tools and strategies that you can use to get data into or out of Salesforce as quickly as possible

Data Loading Stories from the Field

All of the implementation planning from the previous posts in this series has been leading up to this moment—you’ve made your loading configuration as lean as possible, and now you’re ready to load a huge number of records into your Salesforce organization. There is no question which tool is best suited to this task; the Bulk API is by far the quickest way to insert, query, and delete records.

The Bulk API is very thoroughly documented in the Bulk API Developer’s Guide, some of its key advantages are highlighted in this wiki article, and you can learn how to monitor and view details of currently running bulk load jobs by visiting Help & Training. So instead of covering that information here, we’ll highlight some examples that show how our customers have used the Bulk API to manage their large volumes of data.

Chunking Queries to Extract Large Data Volumes

Customer A had two crucial use cases that involved performing large-scale data extractions.

After loading millions of accounts, Customer A needed to pause its data loading and extract the account record IDs before loading the accounts’ associated contacts, cases, opportunities, and custom objects.
To perform one of its daily sales management processes, Customer A needed both to extract millions of leads coming in from an external system and to quickly process them so that they could be distributed to sales agents.

To meet its business needs, Customer A worked with salesforce.com Customer Support to design an effective strategy for chunking its extraction queries to achieve optimum performance.

Salesforce.com Customer Support told Customer A that when queries might return many millions of records, they should be broken into smaller chunks to avoid running into query time limits or even session termination.
The Salesforce platform does not currently allow you to chunk queries based on record ID. To define chunk boundaries, Customer A generated another sequence of numbers by adding an autonumber field to the records that it was extracting. Because autonumber fields are treated as text, they can’t be used in the extract query as the upper and lower bounds of a chunk boundary. So Customer A also created a formula field to convert the query’s autonumbers into numeric values for the query. Finally, Customer A worked with salesforce.com Customer Support to create a custom index on the formula field to speed up its query.
Salesforce.com Customer Support told Customer A that the most effective chunk size would feature fewer than 250,000 records; if chunks feature more than 250,000 records, the Force.com query optimizer’s sampling code causes the query to be driven by the full table rather than an index.
Customer A used Informatica to automate the construction of the batches, the submission of the queries, and the collection of the results.
Using this approach, Customer A was able to extract 16 million of 300 million records in about 30 minutes.

Bulk Deleting and Inserting Records to Rebuild a Large Territory Management Configuration

Customer B was managing a very large Territory Management implementation, assigning owners to more than 60 million accounts and associated objects, while also facing some challenging business conditions.

The key consideration driving the overall design was the need to completely rebuild the territory hierarchy and all assignments in less than 48 hours. (Note: This is not considered a best practice, as processing incremental changes is more efficient. In this case it was a contractual requirement.)
To meet this requirement, Customer B worked with salesforce.com Customer Support to design an assignment engine that could determine the new territory structure and territory assignments, and save that data in custom objects.
Customer B used the Bulk API to delete all of the territories and user territory assignments from the group membership tables—and all assignments of data to territories from the sharing tables—and then to rebuild the territory management configuration based on the new assignments stored in the custom objects.

Tip: If you use the Bulk API’s “hard delete” option, the records that you delete are not moved to the Recycle Bin—they are immediately flagged for physical deletion. Until records are physically deleted, they still might be returned by the Force.com query optimizer and count against its selectivity thresholds, which help determine which index, if any, should drive each of your queries.

Developing a Proof of Concept for Searching Across a Large Data Set

Customer C was planning to create a very large searchable database of customer order information, and it worked with salesforce.com Customer Support to develop a proof of concept.

Customer C loaded over 500 million records from 14 source files, each of which was between 35 and 40 GB in size.
It used the Informatica Powercenter Plug-in to pre-process, sort, and aggregate the data in those files.
Finally, Customer C upserted the records to the Bulk API using the Informatica Cloud, achieving a sustained rate of 9 to 10 million per hour.

Helpful Data Loading Hints

When designing your chunking strategy, remember that the platform puts a 10-minute limit on the execution time of each batch. If a batch fails to execute within that time limit, the system will try to execute it again. If the batch does not complete within 10 retries, it will fail.
Larger batches can allow you to perform data loads faster—if those batches avoid hitting limits and timing out. Test various data loading options in a sandbox organization to help you tune your batch sizes to optimize your overall data throughput.
The Bulk API can load batches faster if you process them in parallel, but beware of data skew, which can be caused by a large number of child objects being associated with the same account. You can find more information about data skew in this blog post and this paper.

Summary

When you have tens of millions or hundreds of millions of records to load or extract from your Salesforce organization, the Bulk API should be your tool of choice. By chunking your queries into batches, tuning your batch sizes for optimum throughput, and avoiding locking errors, you can achieve very high performance for your data movement tasks.

Related Resources

About the Author

Bud Vieira is an Architect Evangelist within the Technical Enablement team of the salesforce.com Customer-Centric Engineering group. The team’s mission is to help customers understand how to implement technically sound salesforce.com solutions. Check out all of the resources that this team maintains on the Architect Core Resources page of Developer Force.