Working with large data sets in Force.com
If you spend any time on our Discussion Boards (Force.com newbie tip – start using the boards NOW!), you'll notice that one of the most common topics is the SOQL governor limit of returning at most 50,000 records in a single query. I've seen some variation of this question often enough to think that a quick summary of all the different options for querying and processing large data sets in Force.com is in order, and so here goes. Note that this post is not intended to be a deep-dive into the implementation and nuances of each option. Rather, it is intended as a index/jumping point for anyone trying to query or process a large data set with links to find additional information for each option.
1) Apex queryMore pattern: Lets start with the basic – how you can query upto a maximum of 50,000 records in SOQL. The most efficient way to query upto 50,000 records is by using the following syntax:
for (List<Account> accts : [select name from Account])
for (Account a : accts)
Using the above for-loop syntax, Apex will internally implement the query/queryMore pattern and return batches of 200 records, upto a maximum of 50,000. More information on this pattern can be found here.
2) Batch Apex: In general, the best way to query and process large data sets in the Force.com platform (and arguably in any platform/language) is to do it asynchronously in batches. You can query and process upto 50 million records using Batch Apex and you can find more details about that feature here. Make sure you read the best practices and governor limits sections at the end of that link before implementing your first Batch Apex class. Batch Apex doesn't work in all use cases (for e.g. if you have a synchronous use case like a VF page that needs to query more than 50,000 records), but its a great tool to have in your toolkit. You can even combine Batch Apex with the Apex Scheduler and schedule your batch job to run at pre-determined time intervals.
3) @ReadOnly (Pilot) : The @ReadOnly annotation is a new feature of the platform that is currently in Pilot (i.e. you'll need to contact Salesforce support in order to have it enabled in your Org) that allows you to perform unrestricted queries against the Force.com database. If that sounds too good to be true, it is! There are some important requirements for using the @ReadOnly annotation (for e.g., it's only available for Web Services and the Schedulable interfaces and you cannot perform DML operations within the same request) and so it's only suitable for a certain sub-set of use cases. For example, you could use the @ReadOnly annotation if you wanted to develop a custom 'reporting' web service that returned some key metrics about your (very large) data set.
4) Standard List Controller and StandardSetController (Visualforce): One of the most common use cases for performing large SOQL queries is to implement a custom Visualforce page that paginates over a large data set. Instead of using direct SOQL and implementing your own custom pagination logic however, consider using the built-in Standard List controllers or StandardSetController. Note that while you're still only allowed to iterate through a maximum of 50,000 records using either of these controllers, you should think long and hard about developing any UI that's needs to display any more records to begin with.
5) APIs – All the options listed above apply to logic that you might need to develop natively on the Force.com platform to process large data sets. There is however another use case to consider – querying and/or processing large data sets from outside the Force.com platform (for use cases like integration, archiving etc.). For that, you can pick an API from the rich Force.com library of open, standards based APIs. All the APIs will allow you to query very large data sets (i.e. you're not limited to 50,000 records) using the 'queryMore' pattern. You can choose from among the SOAP API, the REST API or the Bulk API depending on the specific use case and requirements (e.g. sync or async, SOAP or REST, expected data volume etc.).
I hope that this list will be helpful in deciding how to process large data sets in Force.com. As always, comments and questions are welcome.