Force.com developers, have you ever been in the following situation? You’re working with large data volumes, and because you’re already aware of the Salesforce application’s multi-tenant architecture and governor limits, you’ve chosen to use batch Apex for asynchronous processing. However, when you try to process records, the jobs are aborted.
In this blog post, we’ll cover what you can do to ensure that the SOQL selecting records for processing doesn’t time out. Accomplishing this goal will allow you to take full advantage of the batch Apex framework, which is designed for processing large data volumes.
Batch Apex Architecture
Batch Apex allows you to run complex and long-running processes on the Force.com platform, and it typically goes through the following general steps.
- The start method of a batch Apex class collects the records or objects that will be processed. Using well-performing SOQL in the start method reduces the chances of a job being aborted.
- The execute method picks up batches of these records and does the bulk of the processing.
- After the batches are processed, the finish method sends confirmation emails or executes post-processing operations.
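The three steps above map directly onto the methods of the Database.Batchable interface. Here’s a minimal sketch of a batch Apex class; the class name, field names, and date filter are illustrative, not from the original post.

```apex
// A minimal batch Apex sketch. AccountCleanupBatch and its query are
// illustrative placeholders.
global class AccountCleanupBatch implements Database.Batchable<sObject> {

    // start(): collect the records to process. A selective query here
    // reduces the chance of the job being aborted.
    global Database.QueryLocator start(Database.BatchableContext bc) {
        return Database.getQueryLocator(
            'SELECT Id FROM Account WHERE CreatedDate = LAST_N_DAYS:30');
    }

    // execute(): called once per batch; does the bulk of the processing.
    global void execute(Database.BatchableContext bc, List<Account> scope) {
        for (Account a : scope) {
            // process each record here
        }
        update scope;
    }

    // finish(): post-processing, such as sending a confirmation email
    // or chaining another job.
    global void finish(Database.BatchableContext bc) {
    }
}
```

The job is kicked off with `Database.executeBatch(new AccountCleanupBatch(), 200);`, where the second argument is the batch size.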
Tuning the SOQL in the Start Method
Optimizing the SOQL in the start method is key to avoiding timeouts of a batch Apex job. You might want to process millions of records, but the job is aborted if the start method’s query runs for longer than 2 minutes. How can you avoid this scenario?
The first step involves ensuring that the query is selective and leveraging indexes.
Force.com’s query optimizer checks the filters in the query to determine if an index will be used.
| An index is used with… | When… |
| --- | --- |
| Standard indexed fields | The filter matches less than 30% of the total records, up to 1 million records |
| Custom indexed fields | The filter matches less than 10% of the total number of records, up to 333,333 records |
| The AND clause | At least one filter meets the conditions in this table for the appropriate type of indexed fields* |
| The OR clause | All fields must be indexed and meet the conditions in this table for the appropriate type of indexed fields* |
* For AND and OR clauses, Force.com’s query optimizer applies several algorithms and creates runtime joins to decide if an index is used. The table features baselines which, when met, will result in an index being used.
Consider the following query.
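The query isn’t reproduced here, but based on the three filters discussed below, it would have been shaped roughly like this sketch. Status__c is an assumed custom field name; CreatedDate and SLA__c are the fields referenced in the discussion.

```apex
// Representative query with three filters: one standard indexed field
// (CreatedDate) and two custom fields (Status__c is a hypothetical name).
List<Account> accts = [
    SELECT Id, Name
    FROM Account
    WHERE CreatedDate = THIS_YEAR
      AND Status__c = 'Active'
      AND SLA__c = 'Gold'
];
```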
Start by checking the total record count of accounts, and remember that the count includes soft-deleted records: records that are in the Recycle Bin, or that have been removed from it but not yet physically purged. Then evaluate each filter individually to determine whether it is selective.
A standard index already exists on the CreatedDate field, and including this filter to process smaller time ranges is recommended. The other two fields would require salesforce.com Customer Support to create custom indexes. In order for the query to use an index, the CreatedDate filter must match fewer than 30% of all records (up to 1 million records), or one of the two remaining filters must match fewer than 10% of all records (up to 333,333 records).
If a filter is selective and isn’t already indexed, make the field an External Id or contact salesforce.com Customer Support. Customer Support can create single- and two-column custom indexes on most fields, with the exception of multi-picklist fields, some formula fields, and long text area and rich text area fields. For more information about indexing formula fields, see Force.com SOQL Best Practices: Nulls and Formula Fields.
If the query is still not selective after you create custom indexes, is it possible to add more criteria? In most cases, adding additional selective filters or processing in multiple jobs allows the start method query to complete. In the previous example, we could create multiple jobs with different values for the SLA__c filter and select smaller ranges for the CreatedDate filter.
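Splitting the work that way can be as simple as submitting one batch job per filter value. In this sketch, AccountBatch is a hypothetical batchable class that accepts its start query as a constructor argument; the SLA__c values and the 90-day window are illustrative.

```apex
// Sketch: one job per SLA__c value, each restricted to a smaller
// CreatedDate range. AccountBatch is a hypothetical batchable class
// that takes its start query as a constructor argument.
for (String sla : new List<String>{'Gold', 'Silver', 'Bronze'}) {
    Database.executeBatch(new AccountBatch(
        'SELECT Id FROM Account ' +
        'WHERE SLA__c = \'' + sla + '\' ' +
        'AND CreatedDate = LAST_N_DAYS:90'));
}
```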
When Selective Filters Aren’t Enough
When dealing with large data volumes, having selective filters might not be enough to avoid timeouts. Suppose you have 5 million accounts, and you want to process 80% of the records represented by the following query. Assume that additional filters cannot be added, and that attempts to use multiple jobs have not succeeded.
Because you are trying to query for 4 million of the 5 million accounts, there is no way to use a selective filter to stay under the 10% of total records (up to 333,333 records) threshold, even if Rating__c is indexed.
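The query itself isn’t shown here, but one matching roughly 4 million of the 5 million accounts would look something like this sketch; the Rating__c values are illustrative.

```apex
// Matches ~80% of all accounts, so no index can make it selective.
// The Rating__c values are illustrative.
List<Account> accts = [
    SELECT Id, Name, Description
    FROM Account
    WHERE Rating__c IN ('Warm', 'Cold')
];
```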
As an alternative, you might see improved performance by removing the filters completely from the start method and filtering out the unwanted records in the execute method. Consider this technique only after attempting to tune the query and contacting salesforce.com Customer Support, and use it only when timeouts are occurring because of large data volumes. Outside of this use case, it will likely perform worse than a properly optimized query would.
Note that the selection of the Name and Description fields has been moved to the SOQL in the execute method, and that this tactic doesn’t improve or degrade the performance of the query in the start method. For any job, no matter which fields are in the SELECT clause for the start method, the batch Apex framework is selecting only the Id field before creating batches of records to be processed in the execute method.
The query results in a full-table scan, and although this solution is counterintuitive, it has helped in some scenarios involving timeouts due to large data volumes. With that said, only consider this technique after working with salesforce.com Customer Support to tune the query.
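Putting the pieces together, the pattern looks like the following sketch: the start method selects only Ids with no WHERE clause, and the execute method re-selects the Name and Description fields and applies the filter to each small batch. The class name and Rating__c values are illustrative.

```apex
// Sketch of the unfiltered-start pattern. The class name and the
// Rating__c filter values are illustrative.
global class ProcessAccountsBatch implements Database.Batchable<sObject> {

    // No filters here: the framework keeps only the Ids it needs
    // to build batches, regardless of the SELECT clause.
    global Database.QueryLocator start(Database.BatchableContext bc) {
        return Database.getQueryLocator('SELECT Id FROM Account');
    }

    global void execute(Database.BatchableContext bc, List<Account> scope) {
        // Re-select the needed fields and filter within this small batch.
        List<Account> wanted = [
            SELECT Id, Name, Description
            FROM Account
            WHERE Id IN :scope
              AND Rating__c IN ('Warm', 'Cold')
        ];
        for (Account a : wanted) {
            // process each matching record here
        }
    }

    global void finish(Database.BatchableContext bc) {
    }
}
```

Because the filter now runs against a batch of at most a few hundred records, each execute call stays well within limits even though the start query scans the whole table.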
The performance of batch Apex jobs can depend on sharing, large data volumes, the underlying records being processed, and how records are segmented between jobs. However, for processing to even begin, the SOQL in the start() method must complete. Therefore, it’s crucial to design the SOQL to handle increasing data volumes and avoid timeouts.
If you’re facing timeouts, engage with salesforce.com Customer Support for help with:
- Optimizing the SOQL in the start method. Customer Support can create custom indexes.
- Breaking up the job into multiple smaller jobs. Segmenting by CreatedDate is a recommended option.
- Selecting the Ids without any filters. Only use this tactic if you’ve already tried the previous two solutions, and timeouts are still an issue.
Related Resources
- Using Batch Apex
- Working with Very Large SOQL Queries
- Best Practices for Deployments with Large Data Volumes
- Architect Core Resources
About the Author and CCE Technical Enablement
John Tan is a member of Technical Enablement within the salesforce.com Customer-Centric Engineering group. The team’s mission is to help customers understand how to implement technically sound Salesforce solutions. Check out all of the resources that this team maintains on the Architect Core Resources page of Developer Force.