You’ve been tasked with extracting data from a Salesforce object. If you’re dealing with small volumes of data, this operation might be simple, involving only a few button clicks using some of the great tools available on the AppExchange.
But when it comes to dealing with millions of records in a limited time frame, you might need to take extra steps to optimize the data throughput.
Optimizing Data Extraction in Force.com
Here’s a common scenario: Your business analysis team is architecting an external business intelligence (BI) system, and you want both to integrate data from your Salesforce organization into this new system and to run a daily integration. Although you have built application logic to extract only the deltas (i.e., newly inserted or updated records since the last extraction), you still have hundreds of thousands of records you need to extract from tens of millions of records in your Salesforce object. You also have a limited window in which you can complete this task as a nightly batch job, and you can’t use the weekly export service because you need to extract the data more than once per week.
Trying to get everything out of a large object in a single transaction is a problem because the query takes too long and, in most cases, times out. We recommend both architecting data extractions to retrieve only the delta (i.e., updated, newly inserted, and deleted) rows and updating those rows in the external system. However, if the number of records you need to extract is still in the millions, the first step in resolving this issue is to split or “chunk” the transaction into smaller pieces so that each request does not time out. After doing that, make each of the smaller requests complete faster, then run the requests in parallel to compress the overall time it takes to extract all of the data.
Chunking the Data into Smaller Sets
Using a date range to group data into sets can be easy and clean, but the total number of records that could fall within a given range might go above the Force.com query optimizer’s selectivity threshold. When this happens, the underlying query must do a full object scan, often leading to timeouts. (Refer to the Database Query & Search Optimization Cheat Sheet to see what the selectivity thresholds are.) If you are certain that filtering by the date range alone will always be selective, you can use this field to chunk your data. Otherwise, using a unique ID is more reliable, and can be easily implemented by following these steps.
- Create or use an existing auto-number field. Alternatively, you can use any number field (or combination of number fields) that makes up a unique value, as long as the values are sequential.
Note: If you are adding a new auto-number field, make sure to enable the “Generate Auto Number for existing records…” option.
- Create a formula field that converts the auto-number field text value into a numeric value—you cannot use an index with comparison operators such as “<=” (less than or equal to) or “>” (greater than) for the text-based auto-number field. In this example, we’ll name this field “ExportID.”
- Place a custom index on the formula field by contacting salesforce.com Customer Support.
Hint: Depending on the characteristics and distribution of your data, adding another selective filter, such as “CreatedDate,” and placing a compound (or two-column) index might help group the data sets in a more meaningful, manageable way.
- Run pre-queries to determine the boundaries of your data set: one query to find the lowest “ExportID__c” value and another to find the highest (see the example queries after this list).
Hint: Although you should have no records with the “ExportID__c” field empty (i.e., null), check for them before running your queries so that they are not missed in the extraction.
- Add a range filter to your extraction (i.e., a WHERE clause in your SOQL) to limit the number of targeted rows so that it stays below the selectivity threshold (see the range query after this list).
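To make these steps concrete, here is roughly what the queries might look like. The Account object, the queried fields, and the boundary values are illustrative assumptions, and the “ExportID__c” formula field is assumed to be defined as something like VALUE(Export_Auto_Number__c), where Export_Auto_Number__c is the hypothetical auto-number field from the first step.

To find the lowest boundary:

```
SELECT ExportID__c FROM Account ORDER BY ExportID__c ASC NULLS LAST LIMIT 1
```

To find the highest boundary:

```
SELECT ExportID__c FROM Account ORDER BY ExportID__c DESC NULLS LAST LIMIT 1
```

And, per the hint above, a quick check for records with an empty “ExportID__c”:

```
SELECT COUNT() FROM Account WHERE ExportID__c = null
```

With the boundaries in hand, the extraction query for the first chunk might look like the following, where the 250,000-record chunk size is an arbitrary example; choose a size that keeps each chunk below the selectivity threshold.

```
SELECT Id, Name, LastModifiedDate FROM Account WHERE ExportID__c >= 1 AND ExportID__c <= 250000
```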
When submitting the subsequent requests to retrieve the data, simply replace the “ExportID__c” values by incrementing the ranges until you have gone through all of your data.

Things get a little more complicated if you mass delete records often. While the approach described above still works, the gaps created by deleting records reduce the number of records included in each range, making it less optimal from a performance perspective. In other words, you could have a “chunk” that returns only a few or even no rows. There is a solution to this problem, which I will cover in a later Technical Enablement post.
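To illustrate how the ranges are incremented, here is a minimal Apex sketch that generates one range-filtered query per chunk from the boundary values returned by the pre-queries. The class name, the object and field names, and the chunk size are all hypothetical, and many teams would implement this loop in the external client that submits the extraction requests instead.

```
// Hypothetical helper that builds one SOQL query per chunk of ExportID__c values.
public class ExportChunkQueryBuilder {
    public static List<String> buildChunkQueries(Decimal lowest, Decimal highest, Integer chunkSize) {
        List<String> queries = new List<String>();
        for (Decimal lowerBound = lowest; lowerBound <= highest; lowerBound = lowerBound + chunkSize) {
            Decimal upperBound = lowerBound + chunkSize - 1;
            queries.add('SELECT Id, Name, LastModifiedDate FROM Account'
                + ' WHERE ExportID__c >= ' + String.valueOf(lowerBound)
                + ' AND ExportID__c <= ' + String.valueOf(upperBound));
        }
        return queries;
    }
}
```

For example, buildChunkQueries(1, 5000000, 250000) would produce 20 queries, each covering a 250,000-record range, and each query could then be submitted as its own request.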
Increasing Data Throughput
Increasing data throughput not only shortens the time it takes to retrieve the data, but it also improves the consistency of the extracted data as a snapshot of a specific moment in time. For example, by running the extraction during off-peak hours and within a shorter window, you are less likely to have data discrepancy issues, such as having an extracted child record in one of your queries turn into an orphan record because someone deleted its parent record before the subsequent query ran.
Here are some best practices for increasing throughput.
- Run your requests in parallel.
- Have an administrator with access to all the data perform the extraction. This practice can help you minimize sharing calculation overhead.
- Always use the Bulk API or batch Apex when dealing with large data volumes (see the Batch Apex sketch after this list).
Note: Your requests might not always be returned in the order that you submitted them.
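As an illustration of the batch Apex option, here is a minimal sketch that reads one chunk of records with a range filter on the hypothetical “ExportID__c” field. The object, the fields, the boundary values, and what you do with each scope of records are assumptions rather than a prescribed implementation; with the Bulk API, you would instead submit each range-filtered query as its own batch within a single job so that the platform can process the batches in parallel.

```
// A minimal Batch Apex sketch that extracts one chunk of records for export.
public class AccountExportBatch implements Database.Batchable<sObject> {
    public Database.QueryLocator start(Database.BatchableContext bc) {
        // The range filter keeps this chunk below the selectivity threshold (values are illustrative).
        return Database.getQueryLocator(
            'SELECT Id, Name, LastModifiedDate FROM Account'
            + ' WHERE ExportID__c >= 1 AND ExportID__c <= 250000');
    }
    public void execute(Database.BatchableContext bc, List<sObject> scope) {
        // Stage or send this scope of records to the external BI system here.
    }
    public void finish(Database.BatchableContext bc) {
        // Optionally kick off the next range or send a completion notification.
    }
}
```

You would start a run with Database.executeBatch(new AccountExportBatch(), 2000), where 2000 is the maximum scope size per execute call.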
If you need an extra boost to increase throughput, consider using a skinny table, which can reduce the database table joins and parallel processing that happens under the hood. It can be a good fit if you are extracting a subset of fields (fewer than 100 fields) from the base table, and if those fields’ types are supported by the skinny table. If you decide to use a skinny table, you must first create an Apex trigger to take the auto-number field (or the sequence ID field) and store its value in a static custom field because formula fields cannot be included in skinny tables.
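Here is a minimal sketch of such a trigger, assuming a hypothetical auto-number field named Export_Auto_Number__c and a plain number field named Export_ID_Static__c that the skinny table can include. A production version would also need to guard against trigger recursion and to backfill existing records (for example, with a one-time batch job) before the skinny table is created.

```
// Hypothetical trigger that copies the auto-number value into a plain number field,
// because formula fields cannot be included in a skinny table.
// Assumes the auto-number display format is purely numeric (e.g., {0}) so it parses cleanly.
trigger CopyExportIdToStaticField on Account (after insert) {
    List<Account> updates = new List<Account>();
    for (Account acct : Trigger.new) {
        // Auto-number values are assigned at insert, so read them in an after trigger;
        // records in Trigger.new are read-only here, so stage a separate update.
        if (acct.Export_Auto_Number__c != null) {
            updates.add(new Account(
                Id = acct.Id,
                Export_ID_Static__c = Decimal.valueOf(acct.Export_Auto_Number__c)));
        }
    }
    if (!updates.isEmpty()) {
        update updates;
    }
}
```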
Additional Data Extraction Considerations
- Remember that idempotence is an important design consideration in successful extraction processes. Make sure that your job is designed so that resubmitting failed requests fills in the missing records without creating duplicate records for partial extractions.
- When testing and measuring the performance of your extraction, beware of the effects of the Force.com platform’s caching. The more tests you run, the more likely that the data extraction will complete faster because of the underlying database cache utilization. While it is great to have better performance, don’t schedule your batch jobs based on the assumption that you will always see the best results.
- Be aware of the Force.com governor limits. Follow best practices for deployments with large data volumes to reduce the risk of hitting limits when executing jobs and SOQL queries.
Summary
When extracting large data volumes, consider splitting requests into smaller sets to avoid timeouts and using techniques such as indexed filtering, parallel requests, and skinny tables to maximize data throughput. In addition, when you’re designing your extraction process, keep idempotence in mind—it’s a key part of every successful integration implementation.
Related Resources
- Long- and Short-Term Approaches for Tuning Force.com Performance
- Best Practices for Deployments with Large Data Volumes
- Database Query & Search Optimization Cheat Sheet
- Architect Core Resources
About the Author
Daisuke Kawamoto is an Architect Evangelist within the Technical Enablement team of the salesforce.com Customer-Centric Engineering group. The team’s mission is to help customers understand how to implement technically sound salesforce.com solutions. Check out all of the resources that this team maintains on the Architect Core Resources page of Developer Force.