Loading Large Data Sets with the Force.com Bulk API

Abstract

Loading very large sets of data is becoming a common task for implementations, integrations and ongoing support. The standard Force.com SOAP API requires developers and administrators to implement complex processes to upload data in bite-sized chunks, monitor results and retry failed records. This method is acceptable for small data loads but becomes unwieldy and time-consuming with large data sets, sometimes requiring numerous man-hours and multiple computers operating in tandem.

The new Force.com Bulk API was designed specifically to offload the complex and time-consuming process of importing large data sets from your client application to the Force.com platform. The Bulk API and its supporting web interface allow you to create and upload import jobs, monitor and manage those jobs through an easy-to-use web interface, and receive notifications when they complete. A few client applications already support the Bulk API, but we'll develop our own Java command line application that uses the Bulk API to upload large sets of data from a CSV file or database. This application can then be used as a starting point for your own custom bulk data loader.

Using the Force.com SOAP API for Large Data Sets

When writing your own data loader solution there's a lot of work to do with the Force.com SOAP API, and you need to know quite a bit about how it and Force.com function. For example, to load 1 million records, your application will need to loop over your record set and segment your records into chunks of 200 or fewer records at a time.

You'll then have to use something like Apache Axis to convert the records from your CSV file into XML and send these SOAP requests across the wire to Force.com. Force.com catches your requests, parses the SOAP packages back into its internal format, saves the records to the database and then returns the results as a SOAP package back to your waiting, open connection. There are a number of things that can go wrong, and your application must include robust error handling that takes into consideration:

  • What to do if your network connection drops
  • How to handle HTTP requests that may time out due to long running processes (triggers, workflows, locked records)
  • How to log and retry failed records
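To make that concrete, here is a minimal sketch of the chunk-and-save loop described above, using the Force.com Web Service Connector (WSC) partner client rather than raw Apache Axis calls. The class and method structure are illustrative only; a real loader also needs timeout handling and restart bookkeeping:

   import com.sforce.soap.partner.PartnerConnection;
   import com.sforce.soap.partner.SaveResult;
   import com.sforce.soap.partner.sobject.SObject;
   import java.util.ArrayList;
   import java.util.List;

   public class SoapChunkLoader {
       // records: the full data set as SObjects; conn: an already logged-in connection
       public static void load(PartnerConnection conn, List<SObject> records) throws Exception {
           final int CHUNK = 200; // create() accepts at most 200 records per call
           List<SObject> failed = new ArrayList<SObject>();
           for (int i = 0; i < records.size(); i += CHUNK) {
               List<SObject> slice = records.subList(i, Math.min(i + CHUNK, records.size()));
               SaveResult[] results = conn.create(slice.toArray(new SObject[0]));
               for (int j = 0; j < results.length; j++) {
                   if (!results[j].isSuccess()) {
                       // log the error and queue the record for a retry pass
                       System.err.println(results[j].getErrors()[0].getMessage());
                       failed.add(slice.get(j));
                   }
               }
           }
           // retry logic, dropped connections and restarts are still all up to you
       }
   }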

As an application developer, some other things you need to take into consideration include:

  • Processing 1 million records will require 5000 API calls to Force.com. That's a lot of open connections and round trips to make. What happens when an exception occurs? How do you effectively restart a failed process?
  • To get an accurate picture of the upload process, you'll need to build in some sort of local logging or monitoring. What's the best way to do this so that others can monitor progress?
  • To speed up the import process, can you process records in parallel? Can you add multiple computers to the upload process or do your records need to be processed serially to help prevent record locking?
  • How do you tune batch sizes effectively during the upload process if you run into trigger problems?

That's a lot of work to write and maintain in your own solution using the SOAP API. Thankfully, there's an easier way to do this.

Overview of the Force.com Bulk API

The REST-based Bulk API was developed specifically to simplify the process of uploading large amounts of data. It is optimized for inserting, updating, upserting, and deleting large numbers of records asynchronously by submitting them in batches to Force.com, to be processed in the background.

Uploaded records are streamed to Force.com to create a new job. As the data rolls in for the job it is stored in temporary storage and then sliced up into user-defined batches (max of 10,000 records). Even while your data is still being sent to the server, the Force.com platform submits the batches for processing.

[Image: Bulk overview.png]

Batches can be processed in parallel or serially depending upon your needs. The Bulk API moves the functionality and work from your client application to the server. The API logs the status of each job and tries to reprocess failed records for you automatically. If a DML operation times out, the Bulk API automatically puts it back in the queue and retries it for you. Each batch is processed independently, and once a batch finishes (successfully or not), the job is updated with the results. Jobs can be monitored and administered from the Salesforce.com Admin UI by anyone with appropriate access.

Note that Force.com is a multitenant architecture. The actual queue holding the batches may in fact hold batches for many different customers (platform application developers have no visibility into this data of course), and as such it is a shared resource. Consequently, the speed at which the queue is processed will vary depending on usage.

Based upon the chart below, there are a number of compelling reasons to use the Bulk API over the SOAP API for loading data.

[Image: Bulk chart.png]

Processing Bulk API Data

Any import processes you currently have running should continue to work with the Bulk API, as it uses the same save logic and has the same processing side effects as the SOAP API. To stay within the current governor limits, records are processed 200 at a time from your batches, so if your triggers are written to process 200 records successfully you should be fine. Batches time out after 10 minutes, so you may have to tweak the batch size to get under this processing cap. When the job finishes it returns the same success and failure CSV files as the SOAP API.

It's important to take the Bulk API limits into account. Beyond those limits, there are a few behavioral changes to be aware of:

  • Null values are processed differently: instead of sending a blank value in the CSV file, you need to use #N/A (see the sample rows after this list).
  • There is no field truncation header option.
  • Binary files cannot be processed with the Bulk API.
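For example, in these rows (the names are illustrative) the #N/A token in the second record tells the Bulk API to store a null in the Title field, where a blank value would not:

   FirstName,LastName,Title
   Jeff,Douglas,Developer
   Sally,Smith,#N/A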

Using the Bulk API and Monitoring Jobs

The Salesforce.com Data Loader v17 supports the Bulk API, so it's easy to get started uploading large data sets. Your user profile must have the "API Enabled" permission selected, so if you are a System Administrator, you are all set. To get started, open up the Data Loader and edit the settings for the Bulk API.

[Image: Bulk settings.png]

Once you've uploaded a sample CSV to your Developer or Sandbox org, log into Salesforce.com and go to Setup -> Administrative Setup -> Monitoring -> Bulk Data Load Jobs to see your job in progress. From here you can monitor your current processing quota, view completed jobs and manage running jobs.

[Image: Bulk monitor.png]

Writing Applications for the Bulk API

In addition to processing records faster, the Bulk API was also developed to make the job of creating your own client application much faster and easier:

  • There is no XML to process; you can send the CSV files to the server
  • Your exception handling is greatly simplified due to functionality pushed to the server
  • Fewer network round-trips to handle since connections are only needed during the initial data upload
  • Fewer API calls
  • Anyone with the appropriate access can monitor and manage jobs from the Salesforce.com Admin UI

Writing applications using the Bulk API should look very familiar. The API uses HTTP GET and POST methods to interact with XML and CSV files. Like most requests with Force.com, you'll need a URI endpoint and a valid session. The Bulk API does not provide login functionality, so you'll have to use the Web Services API to establish a session for the user in Force.com. Once your session is established, you call operation-specific URIs to process data and request status with the Bulk API. Here are some samples of working with jobs via the Bulk API.
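Here is a minimal sketch of establishing that session, assuming the WSC library and its generated partner stubs (the same approach the documentation's sample application uses); the API version in the derived endpoint is illustrative:

   import com.sforce.soap.partner.Connector;
   import com.sforce.soap.partner.PartnerConnection;
   import com.sforce.ws.ConnectorConfig;

   public class BulkSession {
       public static ConnectorConfig login(String username, String password) throws Exception {
           ConnectorConfig config = new ConnectorConfig();
           config.setUsername(username);
           config.setPassword(password);
           // Connector.newConnection() performs the SOAP login call
           PartnerConnection connection = Connector.newConnection(config);
           // derive the Bulk API endpoint from the SOAP endpoint returned by login
           String soapEndpoint = config.getServiceEndpoint();
           String restEndpoint = soapEndpoint.substring(0, soapEndpoint.indexOf("Soap/"))
                   + "async/17.0";
           config.setRestEndpoint(restEndpoint);
           return config; // config.getSessionId() now holds a valid session ID
       }
   }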

Creating a New Job

You create a new job by POSTing a request to the following URI. The request body identifies the type of object processed in all associated batches.

URI

   https://instance_name-api.salesforce.com/services/async/APIversion/job

Example request body

   <?xml version="1.0" encoding="UTF-8"?>
   <jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
    <operation>insert</operation>
    <object>Account</object>
   </jobInfo>

Example response body

   <?xml version="1.0" encoding="UTF-8"?>
   <jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
    <id>750D0000000002lIAA</id>
    <operation>insert</operation>
    <object>Account</object>
    <createdById>005D0000001ALVFIA4</createdById>
    <createdDate>2009-04-14T18:15:59.000Z</createdDate>
    <systemModstamp>2009-04-14T18:15:59.000Z</systemModstamp>
    <state>Open</state>
   </jobInfo>
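For illustration, the same POST can be made with nothing but the JDK; the session ID comes from the login step, and the instance name and API version are placeholders:

   import java.io.OutputStream;
   import java.net.HttpURLConnection;
   import java.net.URL;

   public class CreateJob {
       public static void main(String[] args) throws Exception {
           String endpoint = "https://instance_name-api.salesforce.com/services/async/17.0/job";
           String sessionId = "<session ID from login>";
           String body = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                   + "<jobInfo xmlns=\"http://www.force.com/2009/06/asyncapi/dataload\">"
                   + "<operation>insert</operation><object>Account</object></jobInfo>";

           HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
           conn.setRequestMethod("POST");
           conn.setRequestProperty("X-SFDC-Session", sessionId); // Bulk API session header
           conn.setRequestProperty("Content-Type", "application/xml; charset=UTF-8");
           conn.setDoOutput(true);
           OutputStream out = conn.getOutputStream();
           out.write(body.getBytes("UTF-8"));
           out.close();
           System.out.println("HTTP " + conn.getResponseCode());
           // the response body is the jobInfo XML shown above, including the new job id
       }
   }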

Getting the Status of a Job

You can fetch the details of an existing job by sending a GET request to the URI below.

URI

   https://instance_name-api.salesforce.com/services/async/APIversion/job/jobId

Example response body

   <?xml version="1.0" encoding="UTF-8"?>
   <jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
    <id>750D0000000002lIAA</id>
    <operation>insert</operation>
    <object>Account</object>
    <createdById>005D0000001ALVFIA4</createdById>
    <createdDate>2009-04-14T18:15:59.000Z</createdDate>
    <systemModstamp>2009-04-14T18:15:59.000Z</systemModstamp>
    <state>Closed</state>
   </jobInfo>
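With the WSC async client (an assumption; the raw GET works just as well), the same status check is a one-liner. The BulkConnection is constructed from the login configuration shown earlier:

   import com.sforce.async.BulkConnection;
   import com.sforce.async.JobInfo;

   public class JobStatus {
       // equivalent to the GET request above; bulk = new BulkConnection(config)
       public static void printStatus(BulkConnection bulk, String jobId) throws Exception {
           JobInfo job = bulk.getJobStatus(jobId);
           System.out.println("Job " + job.getId() + " is " + job.getState());
       }
   }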

Closing a Job

You can close a job by sending a POST request to the URI below. The request URI identifies the job to close. When a job is closed, no more batches can be added.

URI

   https://instance_name-api.salesforce.com/services/async/APIversion/job/jobId

Example request body

   <?xml version="1.0" encoding="UTF-8"?>
   <jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
    <state>Closed</state>
   </jobInfo>

Example response body

   <?xml version="1.0" encoding="UTF-8"?>
   <jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
    <id>750D0000000002lIAA</id>
    <operation>insert</operation>
    <object>Account</object>
    <createdById>005D0000001ALVFIA4</createdById>
    <createdDate>2009-04-14T18:15:59.000Z</createdDate>
    <systemModstamp>2009-04-14T18:15:59.000Z</systemModstamp>
    <state>Closed</state>
   </jobInfo>
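And the corresponding close call with the WSC async client, again assuming a BulkConnection built from the login configuration:

   import com.sforce.async.BulkConnection;
   import com.sforce.async.JobInfo;

   public class CloseJob {
       // equivalent to POSTing <state>Closed</state> to the job URI above
       public static void close(BulkConnection bulk, String jobId) throws Exception {
           JobInfo closed = bulk.closeJob(jobId);
           System.out.println("Job state: " + closed.getState()); // prints Closed
       }
   }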

Of course there is a lot more you can do with the API so for complete details, see the Force.com Bulk API Developer's Guide.

Building a Java Command Line Data Loader

The Appendix of the Bulk API documentation has a sample Java application that you can use for testing or as a starting point to build your own data loader. The sample code does the following (a condensed sketch follows the list):

  • Creates a connection to the target org
  • Creates a new Job in Salesforce.com for the sObject type and operation
  • Reads data from a CSV file and splits into batches that are streamed to Salesforce.com for processing
  • Closes the job, waits for notification that the job has been completed and then prints out the results.
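Here's what that flow boils down to, assuming the WSC async client and a local accounts.csv file (both assumptions; the real sample splits large files into multiple batches and polls each one until it completes):

   import com.sforce.async.BatchInfo;
   import com.sforce.async.BulkConnection;
   import com.sforce.async.ContentType;
   import com.sforce.async.JobInfo;
   import com.sforce.async.OperationEnum;
   import java.io.FileInputStream;

   public class CsvLoader {
       public static void run(BulkConnection bulk) throws Exception {
           // 1. create the job
           JobInfo job = new JobInfo();
           job.setObject("Account");
           job.setOperation(OperationEnum.insert);
           job.setContentType(ContentType.CSV);
           job = bulk.createJob(job);

           // 2. stream the CSV up as a batch
           FileInputStream csv = new FileInputStream("accounts.csv");
           BatchInfo batch = bulk.createBatchFromStream(job, csv);
           csv.close();

           // 3. close the job, then poll batch.getState() until processing finishes
           bulk.closeJob(job.getId());
           System.out.println("Batch " + batch.getId() + ": " + batch.getState());
       }
   }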

To take this a little further, I've created a command line application and added some additional real-world functionality. When importing data, one of the major tasks is consolidating, cleansing and de-duping data from multiple data sources. This is much easier to do through SQL scripting, and since we are already working in the database, let's eliminate the step of manually exporting our data set into a CSV file. I've made some additions to the sample code that let you query a MySQL database (any JDBC-compliant database will do), convert the query results into a CSV file and tie that into the existing sample code's import functionality.
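A sketch of that database step, assuming a JDBC URL and query of your own (quoting of embedded commas is omitted for brevity):

   import java.io.FileWriter;
   import java.io.PrintWriter;
   import java.sql.Connection;
   import java.sql.DriverManager;
   import java.sql.ResultSet;
   import java.sql.ResultSetMetaData;
   import java.sql.Statement;

   public class DbToCsv {
       public static void export(String jdbcUrl, String sql, String csvFile) throws Exception {
           Connection db = DriverManager.getConnection(jdbcUrl);
           Statement stmt = db.createStatement();
           ResultSet rs = stmt.executeQuery(sql);
           ResultSetMetaData meta = rs.getMetaData();
           PrintWriter out = new PrintWriter(new FileWriter(csvFile));
           // header row: column labels must match the target org's field names
           for (int i = 1; i <= meta.getColumnCount(); i++) {
               out.print((i > 1 ? "," : "") + meta.getColumnLabel(i));
           }
           out.println();
           while (rs.next()) {
               for (int i = 1; i <= meta.getColumnCount(); i++) {
                   String value = rs.getString(i);
                   // Bulk API convention: #N/A sets the field to null
                   out.print((i > 1 ? "," : "") + (value == null ? "#N/A" : value));
               }
               out.println();
           }
           out.close();
           rs.close();
           stmt.close();
           db.close();
       }
   }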

[Image: Bulk console.png]

The application is fairly generic at this point and you can certainly expand it to meet your needs. The entire project is available on Force.com Code Share so you can download it for your own use.

Other Bulk API Applications

Another application that you can use to learn how to program against the Bulk API is the open source Apex Data Loader.

Best Practices for Using the Bulk API

Consider the following best practices when using the Bulk API:

  • Test the Bulk API first in a full copy Sandbox with small datasets until you are satisfied with the results.
  • Change any existing import spreadsheets to use #N/A instead of blank values.
  • Make sure you check the client application for failed jobs as you will need to re-process them manually.
  • When using a database to load data, make sure your column names match the field names in the target org.
  • Set your batch size as high as possible without causing timeouts. The largest batch size may not always yield the best results.
  • Processing in parallel may produce faster load times, but ensure through testing that it does not produce record-locking errors.
  • It's important to handle failed records appropriately. The documentation provides some helpful advice.

Summary

The new Force.com Bulk API makes it much quicker and easier to load thousands or millions of records into your org with minimal effort. The Bulk API is now part of the Data Loader (v18) so you can take advantage of it right away. If you need to develop your own data loading process, the Bulk API is a welcome relief from using the Web Services API. This once error-prone process is now refreshingly streamlined. In this article we created a small Java command line application that uploads data from a CSV file or database and reports on the status of the job. Anyone with appropriate access can log into the target org and view running jobs or abort the import process altogether.

About the Author

Jeff Douglas is a Senior Technical Consultant at Appirio where he creates cutting-edge applications on the Force.com platform for some of the best companies in the world. He is a foster and adoptive parent and enjoys the fact that he doesn't have to reboot his cloud. He actively blogs about cloud computing (especially Force.com) at http://blog.jeffdouglas.com.