The size and complexity of Salesforce platform implementations continue to increase as customers migrate business-critical operations to the platform. With data sets running to tens or hundreds of millions of records, and an ever-increasing number of integration points, data management is front and center among the concerns of Salesforce administrators. Best practices for operations such as cross-system synchronization, data replication, and archiving stress that only data that has changed within a certain time window should be queried and migrated. But some use cases, such as a first-time sync or the initial population of a data mart, require querying a full table. And in systems with a very high volume of transactions, the recently changed data can be a large percentage of an object’s records.
Administrators who routinely extract large amounts of data from Salesforce to complete these tasks are familiar with the Bulk API’s query capability and with separating large queries into chunks. At extremely high volumes (hundreds of millions of records), defining those chunks by filtering on field values may not be practical, because the number of rows returned may exceed the selectivity threshold of Salesforce’s query optimizer. The result could be a full table scan with very slow performance, or even a query that fails to complete.
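To make this concrete, here is a sketch of attribute-based chunking on a date range; the object, fields, and date boundaries are illustrative, and each query is meant to retrieve one slice of the table:

```
SELECT Id, Subject FROM Task
WHERE CreatedDate >= 2014-01-01T00:00:00Z AND CreatedDate < 2014-04-01T00:00:00Z
```

At hundreds of millions of rows, even a single slice like this can exceed the optimizer’s selectivity threshold, which is exactly the situation where attribute filtering breaks down.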
So if attribute filtering doesn’t help you break the data up into small enough chunks, what can you do?
PK Chunking to the rescue!
Some of our larger enterprise customers have recently been using a strategy we call PK Chunking to handle large data set extracts. PK stands for Primary Key (the object’s record ID), which is always indexed. With this method, customers first query the target table to identify a number of chunks of records with sequential IDs. They then submit separate queries to extract the data in each chunk, and finally combine the results.
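As a sketch of the manual approach, each chunk becomes an ID-range query; the record ID boundaries below are illustrative, and the ranges together cover the whole table:

```
SELECT Name FROM Account WHERE Id >= '001300000000000' AND Id < '00130000000132G'
SELECT Name FROM Account WHERE Id >= '00130000000132G' AND Id < '00130000000264W'
```

Because record IDs are assigned sequentially and are always indexed, each of these range queries stays under the optimizer’s selectivity threshold no matter how the field values are distributed.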
With the arrival of the Spring ’15 release, we’re pleased to announce that PK Chunking is a supported feature of the Salesforce Bulk API. Now you can get the performance benefits of PK Chunking without doing all the work of splitting the queries into manageable chunks. You can simply enter a few parameters on your Bulk API job, and the platform will automatically split the query into separate chunks, execute a query for each chunk and return the data. Here are the basics:
- You can use PK Chunking with most standard objects and all custom objects.
- To enable the feature, specify the header `Sforce-Enable-PKChunking` on the job request for your Bulk API query (a request sketch follows this list).
- By default, the Bulk API splits the query into 100,000-record chunks. You can use the `chunkSize` header field to configure smaller chunks, or larger ones up to 250,000. Larger chunk sizes use fewer Bulk API batches but may not perform as well. For each object you are extracting, you might need to experiment a bit to determine the optimal chunk size.
- You can perform filtering while using PK Chunking by simply including a WHERE clause in the Bulk API query. In this case, a chunk may return fewer records than the number you specified in `chunkSize`.
- If an object is supported, you can also use PK Chunking to query the object’s sharing table. In this case, determining the chunks is more efficient if the boundaries are defined on the parent object’s record IDs rather than on the share table’s record IDs. To take advantage of this, set the value of the `parent` header field to the name of the parent object. For example, when querying `OpportunityShare`, set `parent` to `Opportunity`.
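Putting the pieces together, creating a PK-chunked Bulk API query job might look like the following sketch; the instance name, API version, and session ID are placeholders:

```
POST https://yourInstance.salesforce.com/services/async/33.0/job
X-SFDC-Session: <sessionId>
Sforce-Enable-PKChunking: chunkSize=100000
Content-Type: application/xml; charset=UTF-8

<?xml version="1.0" encoding="UTF-8"?>
<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
  <operation>query</operation>
  <object>Account</object>
  <contentType>CSV</contentType>
</jobInfo>
```

The SOQL query itself is then submitted as a batch to this job. When the platform splits the query, the original batch shows a state of NOT_PROCESSED, and a separate batch delivers the results for each chunk.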
Examples
Suppose a customer is using the custom object MessageStatus__c to keep track of a high volume of phone calls, emails, and other communications. They want to perform a complete extract and limit the number of chunks to make consolidating the data easier. They can perform a Bulk API query on MessageStatus__c with a header like this:
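```
Sforce-Enable-PKChunking: chunkSize=250000
```

Using the maximum chunk size of 250,000 yields the fewest chunks to consolidate, at the possible cost of some per-query performance.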
Another customer is planning a security audit and wants to identify all of the manual shares that exist on their Account records. To execute this, they can perform a bulk query on AccountShare, using the filter `WHERE RowCause = 'Manual'`, with a header like this:
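```
Sforce-Enable-PKChunking: parent=Account
```

Setting `parent=Account` tells the platform to define the chunk boundaries on Account record IDs rather than on the AccountShare IDs, which makes determining the chunks more efficient.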
For more details on the use of the PK Chunking header and its options, see the Salesforce Bulk API Developer’s Guide, linked under Related Resources below.
Happy Chunking!
Summary
When you need to extract tens or hundreds of millions of records from Salesforce, you’ll get better performance and reliability if you split the job into a number of separate queries that each retrieve a smaller portion of the data. When the number of records in a single query is below the selectivity threshold of Salesforce’s query optimizer, the platform can process the queries more efficiently. The new PK Chunking feature of the Bulk API automates this process by using the Primary Key (the record ID) of an object to break the data up into manageable chunks and query them separately. This feature is supported for all custom objects, most standard objects, and their sharing tables.
Related Resources
- Best Practices for Deployments with Large Data Volumes
- Inside the Force.com Query Optimizer Webinar
- Salesforce Search and Query Optimization Cheat Sheet
- Salesforce Bulk API Developer’s Guide (search for PK Chunking)
About the Author
Bud Viera is an Architect Evangelist within the Technical Enablement team of the salesforce.com Customer-Centric Engineering group. The team’s mission is to help customers understand how to implement technically sound salesforce.com solutions. Check out all of the resources that this team maintains on the Architect Core Resources page of Salesforce Developers.