You’ve been tasked with extracting data from a Salesforce object. If you’re dealing with small volumes of data, this operation might be simple, involving only a few button clicks using some of the great tools available on the AppExchange.
But when it comes to dealing with millions of records in a limited time frame, you might need to take extra steps to optimize the data throughput.
Here’s a common scenario: Your business analysis team is architecting an external business intelligence (BI) system, and you need to integrate data from your Salesforce organization into this new system through a daily integration. Although you have built application logic to extract only the deltas (i.e., records newly inserted or updated since the last extraction), you still need to extract hundreds of thousands of records from the tens of millions stored in your Salesforce object. You also have only a limited window in which to complete this task as a nightly batch job, and you can’t use the weekly export service because you need to extract the data more than once per week.
Trying to extract everything from a large object in a single transaction is a problem because the query takes too long and, in most cases, times out. We recommend architecting your data extractions to retrieve only the delta (i.e., updated, newly inserted, and deleted) rows and then updating only those rows in the external system. However, if the number of records you need to extract is still in the millions, the first step in resolving this issue is to split, or “chunk,” the transaction into smaller pieces so that each request does not time out. After doing that, make each of the smaller requests complete faster, and then run the requests in parallel to compress the overall time it takes to extract all of the data.
Using a date range to group data into sets can be easy and clean, but the total number of records that fall within a given range might exceed the Force.com query optimizer’s selectivity threshold. When this happens, the underlying query must perform a full object scan, often leading to timeouts. (Refer to the Database Query & Search Optimization Cheat Sheet to see what the selectivity thresholds are.) If you are certain that filtering by the date range alone will always be selective, you can use that field to chunk your data. Otherwise, chunking on a sequential, unique ID is more reliable and can be implemented easily by following these steps. First, find the smallest and largest values of the sequence field (here, “ExportID__c”) with two quick queries:
SELECT ExportID__c FROM YourObject__c ORDER BY ExportID__c ASC NULLS LAST LIMIT 1;
SELECT ExportID__c FROM YourObject__c ORDER BY ExportID__c DESC NULLS LAST LIMIT 1;
Hint: Although you should have no records with the “ExportID__c” field empty (i.e., null), check for them before running your queries so that you don’t miss any records during the extraction.
Next, retrieve the data in chunks by filtering on a range of “ExportID__c” values sized to stay selective. For example:
SELECT Field1__c, Field2__c, [Field3__c …] FROM YourObject__c WHERE ExportID__c > 1000000 AND ExportID__c <= 1200000;
When submitting the subsequent requests to retrieve the data, simply increment the “ExportID__c” range in each query until you have gone through all of your data; because each range is independent, you can also submit several of these requests in parallel to compress the overall extraction time. Things get a little more complicated if you mass delete records often. While the approach described above still works, the gaps created by deleting records reduce the number of records included in each range, making it less optimal from a performance perspective. In other words, you could have a “chunk” that returns only a few or even no rows. There is a solution to this problem, which I will cover in a later Technical Enablement post.
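To make the chunked, parallel extraction concrete, here is a minimal client-side sketch. It is only a sketch, assuming the third-party simple_salesforce Python library, placeholder credentials, and the hypothetical “YourObject__c”/“ExportID__c” names used above; adjust the field list, chunk size, and worker count to your own data volumes and API limits.

# Minimal sketch of a chunked, parallel extraction client (not production code).
# Assumes the third-party simple_salesforce library and the hypothetical
# YourObject__c / ExportID__c names used in the examples above.
from concurrent.futures import ThreadPoolExecutor
from simple_salesforce import Salesforce

CHUNK_SIZE = 200000   # records per ExportID__c range; tune so each request stays fast
FIELDS = "Id, Field1__c, Field2__c"

sf = Salesforce(username="user@example.com",
                password="password",
                security_token="token")

# Find the smallest and largest ExportID__c values (the two boundary queries above).
lo = int(sf.query("SELECT ExportID__c FROM YourObject__c "
                  "ORDER BY ExportID__c ASC NULLS LAST LIMIT 1")["records"][0]["ExportID__c"])
hi = int(sf.query("SELECT ExportID__c FROM YourObject__c "
                  "ORDER BY ExportID__c DESC NULLS LAST LIMIT 1")["records"][0]["ExportID__c"])

def extract_chunk(bounds):
    """Pull one ExportID__c range; query_all transparently follows queryMore paging."""
    start, end = bounds
    soql = (f"SELECT {FIELDS} FROM YourObject__c "
            f"WHERE ExportID__c > {start} AND ExportID__c <= {end}")
    return sf.query_all(soql)["records"]

# Build the range boundaries, then run a few requests at a time in parallel.
# For heavy production use, consider giving each worker its own connection.
ranges = [(s, min(s + CHUNK_SIZE, hi)) for s in range(lo - 1, hi, CHUNK_SIZE)]
with ThreadPoolExecutor(max_workers=4) as pool:
    for records in pool.map(extract_chunk, ranges):
        print(f"retrieved {len(records)} records")   # hand each chunk to the BI load step

Each worker runs an ordinary SOQL query over its own range, so the same selectivity considerations described earlier determine how large each chunk can safely be.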
Increasing data throughput not only shortens the time it takes to retrieve the data, it also improves the consistency of the data you extract at a specific moment in time. For example, by running the extraction during off-peak hours and within a shorter window, you are less likely to run into data discrepancies, such as an extracted child record from one of your queries turning into an orphan because someone deleted its parent record before the subsequent query ran.
Here are some best practices for increasing throughput.
If you need an extra boost to increase throughput, consider using a skinny table, which can reduce the database table joins and parallel processing that happen under the hood. A skinny table can be a good fit if you are extracting a subset of fields (fewer than 100) from the base table, and if those fields’ types are supported by skinny tables. If you decide to use one, you must first create an Apex trigger that copies the value of the auto-number field (or the sequence ID field) into a regular, non-formula custom field, because formula fields cannot be included in skinny tables.
When extracting large data volumes, consider splitting requests into smaller sets to avoid timeouts, and use techniques such as indexed filtering, parallel requests, and skinny tables to maximize data throughput. In addition, when you’re designing your extraction process, keep idempotence in mind: re-running a load should not create duplicates or corrupt previously loaded data. It’s a key part of every successful integration implementation.
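As a brief illustration of that last point, here is a minimal sketch of an idempotent load step, assuming a hypothetical local SQLite staging table keyed by the Salesforce record Id; re-running the same chunk simply overwrites the same rows instead of creating duplicates.

# Minimal sketch of an idempotent load step (hypothetical staging table and field names).
import sqlite3

conn = sqlite3.connect("staging.db")
conn.execute("""CREATE TABLE IF NOT EXISTS your_object (
                    sf_id  TEXT PRIMARY KEY,
                    field1 TEXT,
                    field2 TEXT)""")

def load_chunk(records):
    """Upsert keyed by Salesforce Id, so re-running a chunk leaves the table consistent."""
    conn.executemany(
        "INSERT OR REPLACE INTO your_object (sf_id, field1, field2) VALUES (?, ?, ?)",
        [(r["Id"], r["Field1__c"], r["Field2__c"]) for r in records])
    conn.commit()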
Daisuke Kawamoto is an Architect Evangelist within the Technical Enablement team of the salesforce.com Customer-Centric Engineering group. The team’s mission is to help customers understand how to implement technically sound salesforce.com solutions. Check out all of the resources that this team maintains on the Architect Core Resources page of Developer Force.