Processing Large Amounts of Data with APIs (Part 1 of 2)

When working in an enterprise environment, you may need to process large amounts of Salesforce records using the Platform APIs. In this two-post series, we’ll present options for processing data at scale. In this first post, we’ll focus on read operations with the REST API and the Bulk APIs. We’ll present which option is best for specific use cases and what are the limits, and we’ll share request examples. In the second post, we’ll take a look at write operations with the Composite API and the Bulk APIs.

REST API

The REST API lets you read a large number of records synchronously, and it is the easiest API to use. However, it has some limitations. First of all, it can only return a maximum of 2000 records per API call. If you need more data, you’ll need additional requests to retrieve batches of up to 2000 records. Secondly, this API is not suited for write operations on large batches from several thousand to millions of records.

The REST API is generally best suited when working with a user interface because of its more granular approach. It’s also simpler to implement compared to the Bulk API as it relies on a single type of request, and those requests are synchronous.

To read a batch of records, simply make a GET request on the Query endpoint with a q parameter holding a URL-encoded SOQL query.

If the SOQL query returns less than 2000 records, the API response will look like this:

Response’s highlights:

The totalSize attribute gives you the total number of records that can be selected with the SOQL query
The done flag set to true indicates that there are no additional records to fetch for this query

Tip: REST API responses are available in JSON by default, but they can also be obtained as XML by adding an Accept: application/xml header to the requests.

If a SOQL query returns more than 2000 records, the API response contains only a subset of the results and you’ll have to either use a query locator or pagination to access the remaining results.

Using the query locator to retrieve additional results

If a query returns more than 2000 records, the first API response will look like this:

Response’s highlights:

The totalSize attribute gives you the total number of records that can be selected with the SOQL query. In this case, this number is larger than the number of elements in the records array.
The done flag set to false indicates that there are additional records that can be returned (this response only includes the first 2000 records).
nextRecordsUrl holds a link with the query locator URL.
records holds the first 2000 records returned by the query.

A query locator represents a server-side cursor on the database. The query locator acts as an identifier to get additional query result sets. As of Winter ’23 (API v56.0), a cursor remains available for up to two days and there are no limits to the number of open cursors.

The end of the query locator URL is composed of two parts separated by a dash (-): the query locator ID and an offset. In the previous example, 0r82211CelDTLqEAEX-2000 was the query locator for the next batch of results starting with the 2000th record. If we wanted to get a batch starting from the 4000th record, we could directly use 0r82211CelDTLqEAEX-4000.

Running a GET request on the query locator URL will get you the next batch of up to 2000 records. If the new batch is the last one, then the done flag will be true and nextRecordsUrl won’t be present. If there are some additional records, done will be false and you’ll get another query locator URL to retrieve the next batch of records (the query locator ID will remain the same but the offset will be incremented).

The query locator URL lets you control the results offset, but you cannot specify the result batch size. This means that the query locator is not suited for working with a user interface as you probably wouldn’t display 2000 records at a time. Pagination is a better approach in this case.

Using pagination to retrieve additional results

When working with a user interface, pagination can help to fetch up to 4000 records. Pagination relies on the same REST query endpoint, but the difference is that you control the response with three keywords in the SOQL query:

1. One or more sort criteria thanks to the ORDER BY clause
2. A page size that is less than 2000 records thanks to LIMIT clause
3. A result offset in the form of an OFFSET clause. The offset value cannot exceed 2000. You can use the offset to calculate a page index with this formula offset = page index * page size

For example, the following query would return a batch of 1000 accounts ordered by creation date starting from the 2000th result.

While pagination is generally best used for UI use cases, it can also be used on long-running operations provided that results are ordered by a field for which values do not change during result traversal (created date is a good example).

The REST API works well to retrieve several thousand records, but beyond this scale, from tens of thousands to millions of records, you’ll want to use the Bulk APIs.

Bulk APIs

The Bulk APIs take scale to another level at the expense of synchronicity. Longer processing time to retrieve more data implies that the query operations are no longer synchronous, unlike the REST API. This means that the Bulk API generally requires more code on the client side and requires some sort of polling mechanism.

There are two generations of Bulk APIs: Bulk API and Bulk API 2.0. These two APIs operate differently and have distinct pros and cons. We won’t replicate the content of the Bulk API 2.0 and Bulk API Developer Guide in this post, but we’ll highlight key differentiators between the two API generations.

The main difference between the two APIs is the number of requests that are required to process a query. The Bulk API 2.0 was introduced to reduce this number to a minimum of three requests compared to a minimum of six for the original Bulk API.

Here’s a diagram that illustrates the requests required for a Bulk API 2.0 query:

A diagram that illustrates the requests required for a Bulk API 2.0 query

The client sends a first request that creates and starts a query job and specifies a SOQL query.
While the query is being processed, the client makes calls to poll for the job status.
Once the job is completed, the client can retrieve the query results in a single request.

By comparison, the following diagram presents the same query with the original Bulk API:

A workflow that illustrates the requests required for a Bulk API query

The client sends a first request that creates a query job.
The client may add one or more query batches to the job with SOQL queries.
Once batches are added, the client closes the job so that it can be executed.
While the job is running, the client polls for job status.
Once the job is complete, the client fetches the job result IDs.
Thanks to the job result IDs, the client fetches the results’ data.

Besides the higher number of requests, and as of Winter ’23 (API v56.0), the original Bulk API still retains one advantage over the Bulk API 2.0: the variety of supported formats. The original Bulk API accepts and produces CSV, XML, or JSON. Binary data is also supported for ingest operations. By contrast, Bulk API 2.0 only works with CSV.

Regardless of the Bulk API type that you use, be sure to enable compression for the API responses. All you need to do is to pass an Accept-Encoding: gzip header in your request when retrieving the result sets. This small change will make a major difference in terms of performance.

Closing words

You now have a glimpse at the different API options for reading data at scale. The number of records that you aim to process should be the primary driver for choosing the right API for the right job, but be mindful of other constraints. Be sure to refer to the API documentation (see guides in the resources section) to get an understanding of the correct use cases and limits for these APIs. I also recommend that you use our Salesforce Platform APIs Postman collection to try out the different APIs before implementing a new client.

Stay tuned for the second post of this series in which we’ll focus on write operations with the Bulk APIs and the Composite API.

We’ll leave you with this table that provides a good summary of the key differences between these APIs for read operations:

	REST API with query locator	REST API with pagination	Bulk API	Bulk API 2.0
Operation type	Read only		Create, read, update, upsert, and delete
Maximum number of records returned per query	2000 records per response (pagination cannot access records beyond the first 4000 records because of the offset limit)		From several hundred thousand to millions of records (see documentation for details)
Process type	Synchronous		Asynchronous
Minimum number of request types to get first result batch	1		6	3
Supported formats	JSON or XML		CSV, JSON, or XML + binary (ingest only)	CSV
Query results lifespan	2 days	N/A provided that the ordering field values don’t change during result traversal	7 days

Resources

Query endpoint documentation in the REST API Developer Guide
Bulk API 2.0 and Bulk API Developer Guides:
- Bulk API
- Bulk API 2.0
Salesforce Developer Limits and Allocations Quick Reference
Salesforce Platform APIs Postman collection

About the author

Philippe Ozil is a Principal Developer Advocate at Salesforce where he focuses on the Salesforce Platform. He writes technical content and speaks frequently at conferences. He is a full stack developer and enjoys working on DevOps, robotics, and VR projects. Follow him on Twitter @PhilippeOzil or check his GitHub projects @pozil.

REST API

Response’s highlights:

Using the query locator to retrieve additional results

Response’s highlights:

Using pagination to retrieve additional results

Bulk APIs

Closing words

Resources

About the author

Get the latest Salesforce Developer blog posts and podcast episodes via Slack or RSS.