Data Chunking Techniques for Massive Orgs

Learn how to use 2 awesome PK chunking techniques along with some JavaScript to effectively query large databases that would otherwise be impossible to query.

Guest Post: Daniel Peter is a Lead Applications Engineer at Kenandy, Inc., building the next generation of ERP on the Salesforce App Cloud. He’s also a co-organizer of the Bay Area Salesforce Developer User Group. You can reach him on Twitter @danieljpeter or www.linkedin.com/in/danieljpeter.

Data too big to query? Don’t want to use the Bulk API? Want to stay native on the Salesforce platform? Don’t mind a little JavaScript?  Read on to find out how you can chunk even the largest database into submission!

Big Heart Pet Brands is a $2.3 billion (with a B) a year company. They are one of the largest pet food companies in the world, and they are using Kenandy on Salesforce to run their business. In order for them to go live at the beginning of 2015, we had to make sure we could scale to support their needs for real-time access to their large data volumes. Extremely large Salesforce customers call for extremely innovative solutions!

Prerequisite: Creating a Large Data Set

I set the stage for this demonstration by creating a ton of data. First I defined an empty Large_Object__c with a few custom fields: Some_Number__c (a number), Some_Datetime__c (a date/time), and Autonumber__c (an autonumber).

Then I kicked off 5 copies of this batch at the same time. Here is the Apex code:

public with sharing class FillObjectBatch implements Database.Batchable<Integer> {

    public Iterable<Integer> start(Database.BatchableContext BC) {
        List<Integer> intList = new List<Integer>();
        
        //4,000 scopes x 2,000 records each = 8M per run; 5 concurrent runs = 40M total
        for (Integer i=1; i<=4000; i++) {
            intList.add(i); 
        }
        
        return intList;
    }
 
    public void execute(Database.BatchableContext BC, List<Integer> intList) {
        List<Large_Object__c> largeObjList = new List<Large_Object__c>();
        for (Integer i: intList) {
            for (Integer j=0; j<2000; j++) {
                largeObjList.add(
                    new Large_Object__c(
                            Some_Number__c=getRand(),
                            Some_Datetime__c=System.now()
                        )
                );                  
            }
        }
        insert largeObjList; 
    }
 
    public static Decimal getRand() {
        Integer lower = 0;
        Integer upper = 100;
        
        //get a random integer between 0 and 100
        Decimal retDec = Math.round( Math.random() * (upper - lower) ) + lower;
        
        //give it a random decimal part
        retDec += Math.random();
        
        //round it to 2 decimal places
        retDec = retDec.setScale(2, System.RoundingMode.HALF_EVEN);
        
        return retDec;
    }
 
    public void finish(Database.BatchableContext BC) {
    }

    //Run with: Database.executeBatch(new FillObjectBatch(), 1);
}

I let it run overnight… and presto! Forty meeellion records!

This is too many records to query a COUNT() of.

Running a Salesforce report on this many records takes a very long time to load (10 minutes), and will usually time out.

So how can you query your {!expletive__c} data?

There are plenty of resources out there on how to design and query large databases. Indexing, skinny tables, pruning records, and horizontal partitioning are some popular techniques. You also need to understand how to write selective queries. The query optimizer is a great tool to help you write selective queries.

But I’m not going to go into detail on these concepts. Instead I want to talk about something unique you may not have heard about before, PK Chunking. This is a technique you can use as a last resort for huge data volumes. If you’ve indexed away, written a good query, and your query still times out, you may want to consider the PK Chunking techniques I am going to teach you. It’s a great technique to have in your toolbox.

PK Chunking to the Rescue!

The bigger the haystack, the harder it is to find the needle. PK chunking turns the big haystack into many smaller haystacks and sends a team of needle hunters off to check each small haystack at the same time. Multi-tenant, cloud platforms are very good at doing many small things at the same time. Why not use that to our advantage?

In order to chunk our database into smaller portions to search, we will use the Salesforce Id field of the object. This is a very special field that has a lightning-fast index, much faster than custom indexes.

So to get the first and last Ids in the database we can do these SOQL queries:

SELECT Id FROM Large_Object__c ORDER BY Id ASC LIMIT 1
SELECT Id FROM Large_Object__c ORDER BY Id DESC LIMIT 1

Those return in no time at all, since Id is so well indexed. But how do we get all the Ids in between without querying the 40M records? If we could get all those Ids, we could use them to chunk up our SOQL queries, like this:

SELECT Id, Autonumber__c, Some_Number__c 
FROM Large_Object__c 
WHERE Some_Number__c > 10 AND Some_Number__c < 20
AND Id >= 'a00J000000BWNYk'
AND Id <= 'a00J000000BWO4z'

(many more queries…)

SELECT Id, Autonumber__c, Some_Number__c 
FROM Large_Object__c 
WHERE Some_Number__c > 10 AND Some_Number__c < 20
AND Id >= 'a00J000000EEDO2'
AND Id <= 'a00J000000EEFUN'

We can run 800 queries like this, with Id ranges that partition our database down to 50,000 records per query. Each query runs super fast since Id is so well indexed. It is similar to querying a database with only 50,000 records in it, not 40M!

But suppose we instead tried to run this SOQL query on its own:

SELECT Id, Autonumber__c, Some_Number__c 
FROM Large_Object__c 
WHERE Some_Number__c > 10 AND Some_Number__c < 20

Run against the whole database, it would just time out. And even if it didn’t time out, it could return too many records and would fail for that reason. What we have here is a guaranteed failure with a backup plan for failure!

So how do we get those 800 ranges of Salesforce Ids to divide and conquer our goliath of a database? How do we run 800 queries and assemble the results of them?

The Two Best Methods for PK Chunking

There are two methods of PK chunking I’m going to discuss: Query Locator based PK chunking (QLPK) and Base62 based chunking (Base62PK). Each has its own pros and cons, and which one to use will depend on your situation.

Query Locator PK chunking (QLPK)

QLPK leverages the fact that the Salesforce SOAP and REST APIs have the ability to create a very large, server side cursor, called a Query Locator. More on cursors here. Think of it as a List<sObject> on the database server which doesn’t have the size limitations of a List in Apex. So in our example we would create the cursor like this:

SELECT Id FROM Large_Object__c

That’s right, just the Id, and no WHERE clause. A WHERE clause would likely cause the creation of the cursor to time out, unless it was really selective, so we just leave it off. The easiest way to use the SOAP API from a Visualforce page is the AJAX Toolkit. We execute the query from the AJAX Toolkit asynchronously, with a timeout set to 15 mins. In this case it takes about 6 mins to get the query locator. It is very cool that this allows us to get a 40M record query locator in six minutes; trying to do the same via an Apex query would fail after 2 minutes. We have a much larger limit this way.

Then when we execute this query, we get back the first 2000 records plus a query locator:

Typically you would use this information to keep calling queryMore, getting all the records in the query 2000 at a time, in serial fashion. However, we are going to use this information in a different way, since we don’t care about the records themselves, and we want much larger chunks of Ids than 2000 – 50,000 in this case.

The queryLocator value that is returned is simply the Salesforce Id of the server-side cursor that was created. It is used with a dash (“-”) and an offset to jump into the cursor at that offset and return 2000 records, like this: 01gJ000000KnR3xIAF-2000. We are going to use the query locator in this fashion to get all the Id chunks in the whole database:

01gJ000000KnR3xIAF-50000
01gJ000000KnR3xIAF-100000
…
01gJ000000KnR3xIAF-39950000
01gJ000000KnR3xIAF-39999999

Through some calculations, loops, and custom concatenated queryMore requests (full code here), we are able to blast through the 40M record query locator in 800 chunks of 50k to get all the Id chunks. We keep the first Id of each batch of 2000 records returned and discard the remaining 1999. Each chunk’s first Id then becomes the “less than” filter for the previous chunk. Since we rely on the next chunk to get the “less than” filter for the current chunk, we can’t really use these Id ranges until they are all complete; we need to sort and assemble them all to have complete ranges. This is OK, as we can get through all the queryMore requests in less than a minute in this case. Building that initial query locator is the most expensive part. The net result of chunking the query locator is that we now have a list of Id ranges which we can use to make very selective, fast-running queries. In fact, we can even request these queries in parallel!
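The bookkeeping above can be sketched in plain JavaScript. These function names are illustrative, not from the article’s repo: one helper builds the list of locator-offset strings to feed to queryMore, and another turns the first Ids collected from each jump into sorted start/end ranges (the lexicographic sort works because Salesforce’s base62 digit ordering matches ASCII ordering).

```javascript
// Build the "locator-offset" strings used to jump through the cursor.
function buildOffsets(locator, totalRecords, chunkSize) {
  const offsets = [];
  for (let pos = chunkSize; pos < totalRecords; pos += chunkSize) {
    offsets.push(locator + '-' + pos);
  }
  return offsets;
}

// firstIds: the first Id of each 2000-record page we jumped to.
// Each chunk's end is the next chunk's first Id, which is why we
// must wait for every jump to complete before using the ranges.
function buildRanges(firstIds, lastIdInDb) {
  const sorted = firstIds.slice().sort();
  const ranges = [];
  for (let i = 0; i < sorted.length; i++) {
    const start = sorted[i];
    const end = i + 1 < sorted.length ? sorted[i + 1] : lastIdInDb;
    ranges.push({ start: start, end: end });
  }
  return ranges;
}
```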

Here is a video of the query locator chunking in action. WARNING: Blasting through query locators can be highly addictive.

And here is the object we end up with:

You can see it is an array with 800 items. Each item has the first and last Id we need to filter our query down to 50k records. You can iterate over the list of Id ranges in a for loop, and asynchronously fire off 1 JavaScript remote action, or perhaps even 1 AJAX Toolkit query request, for each of the 800 Id ranges. If using remote actions, make sure to set “buffer: false”, or you will most likely hit errors due to the response being over 15MB. This is because without “buffer: false” Salesforce will batch your requests together; we want the full 15MB for each request. The callback function for each query adds the results into a master results variable and increments a variable which counts how many total callbacks have fired. When the total callbacks fired equals the size of our list, we know we got all the results. These queries can even be aggregate queries, in which case the chunk size can be much larger – think 1M instead of 50k. You can also use the @ReadOnly annotation to use chunks of 100k. There are many ways to adjust this technique depending on the data you are trying to get out of the object.
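The fan-out/fan-in pattern described above boils down to a few lines of JavaScript. Here `queryFn` is a hypothetical stand-in for whatever fires one chunked query (a remote action or AJAX Toolkit call); the names are illustrative only:

```javascript
// Fire one request per Id range; collect results in a master array;
// detect completion by counting callbacks.
function runChunkedQueries(ranges, queryFn, onDone) {
  const allResults = [];
  let completed = 0;
  ranges.forEach(function (range) {
    queryFn(range, function (records) {
      Array.prototype.push.apply(allResults, records);
      completed++;
      if (completed === ranges.length) {
        onDone(allResults); // every chunk has reported back
      }
    });
  });
}
```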

I ran an example that calls a remote action and saves the autonumbers where the number on the record is between 10 and 20. Essentially it runs 800 instances of this SOQL query, each with a different Id range filter:

SELECT Id, Autonumber__c, Some_Number__c 
FROM Large_Object__c 
WHERE Some_Number__c > 10 AND Some_Number__c < 20
AND Id >= 'a00J000000BWNYk'
AND Id <= 'a00J000000BWO4z'

We end up with a JavaScript array containing the 3,994,748 results!

Amazing! Try querying 40M records for almost 4M results in Apex and see how far you get. Even a batch job doing this would take many hours.

Base62 Chunking (Base62PK)

This is a very exciting method of chunking the database, because it doesn’t need that expensive initial query locator. It doesn’t bother to gather up all the ACTUAL Ids in the database like in QLPK. It instead gets the very first Id in the database and the very last Id, and figures out all the ranges in between with Apex. Getting the first and last Id is almost instantaneous, because the Ids are so well indexed:

Take a look at this short video to see how fast it runs:

Ok ok, so maybe a sub 1 second video isn’t that interesting. But you get the idea. It’s fast!

In order to explain how we “figure out” all the Ids that lie between the first and last Id in the database, we need to look at the structure of the Salesforce Id itself. It is more than just an auto-incrementing primary key; it is actually a composite key. This is the best description I have found of what the keys are comprised of.

Here’s the visual summary:

For the purposes of Base62 PK chunking, we just care about the last part of the Id – the large number. In the base 10 decimal system, 1 character can have 10 different values. In base 62, 1 character can have 62 different values, since it uses all the numbers, plus all the lowercase letters, plus all the uppercase letters. This means with 9 characters of base 62 we can represent a number as big as 13,537,086,546,263,551 (13.5 quadrillion!) instead of just 999,999,999 (under 1 billion) in base 10. More unique values in a smaller space = more better!
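A quick sanity check of that capacity claim, using JavaScript BigInt (plain Numbers lose precision above 2^53, so BigInt is required here):

```javascript
// Largest value representable in 9 base62 characters vs 9 decimal digits.
const base62Max = 62n ** 9n - 1n; // 13,537,086,546,263,551
const base10Max = 10n ** 9n - 1n; // 999,999,999
```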

Our method of Base62 PK chunking lops off the last 9 characters of the first and last Id, converts them to long integers, chunks up everything in between, converts the boundaries back to a Base62 representation, and ultimately synthesizes all those Salesforce Id ranges. See this portion of the code in GitHub for more details.
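The production code is Apex (shown further down), but the range math is easy to see in a compact JavaScript sketch using BigInt. The function names here are illustrative, not from the article’s repo:

```javascript
// Digit ordering matches the Apex code below: 0-9, A-Z, a-z.
const DIGITS = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

function base62Decode(s) {
  let n = 0n;
  for (const c of s) {
    n = n * 62n + BigInt(DIGITS.indexOf(c));
  }
  return n;
}

function base62Encode(n) {
  if (n === 0n) return '0';
  let out = '';
  while (n > 0n) {
    out = DIGITS[Number(n % 62n)] + out;
    n = n / 62n;
  }
  return out;
}

// Split the numeric span between two 9-character Id suffixes into
// equal steps and re-encode each boundary.
function chunkBoundaries(firstSuffix, lastSuffix, chunks) {
  const lo = base62Decode(firstSuffix);
  const hi = base62Decode(lastSuffix);
  const step = (hi - lo) / BigInt(chunks);
  const bounds = [];
  for (let i = 0; i < chunks; i++) {
    bounds.push(base62Encode(lo + step * BigInt(i)));
  }
  bounds.push(base62Encode(hi)); // last boundary is the true last Id
  return bounds;
}
```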

Because we ignore the pod identifier, and assume all the ids are on the same pod (the pod of the lowest id), this technique falls apart if you start to use it in orgs with pod splits, or in sandboxes with a mixture of sandbox and production data in them. In these cases, it is probably better to use QLPK. But Base62PK could be enhanced to support multiple pods with some extra work.

All in all, when our Base62PK run completes we get the same number of results (3,994,748) as when we did QLPK. Yay!

Converting from Base62 to decimal and back is a cool problem to solve. All the Apex code is in the GitHub repo at the end of this article, but here is the juicy part:

    /*
    * BEGIN Base62 encode / decode code
    * This is used to convert Salesforce Ids to Integers and back for chunking
    *
    * This code ported from python here: https://gist.github.com/adyliu/4494223
    */
    
    static String basedigits = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    
    public static Long base62decode(String s) {
        Long ret = 0;
        Long mult = 1;
        
        String[] chars = s.split('');
        chars.remove(0);
        
        String[] charsReversed = new List<String>();
        
        for(Integer i = chars.size()-1; i>=0; i--) {
            charsReversed.add(chars.get(i));
        }
        
        for (String c: charsReversed) {
            Integer index = basedigits.indexOf(c);
            system.assert(index > -1, 'bad character');
            ret += mult*index;
            mult *= 62;
        }
        
        return ret;
    }    
    
    
    public static String base62encode(Long num) {
        system.assert(num >= 0, 'expected positive number');
        if (num == 0) {
            return '0';
        }

        List<String> retList = new List<String>();

        String[] basedigitsArr = basedigits.split('');
        basedigitsArr.remove(0);

        while (num != 0) {
            Integer cIndex = (Integer)Math.Mod(num, 62);
            retList.add(basedigitsArr[cIndex]);
            num = (num/62);
        }
        
        List<String> retListReversed = new List<String>();
        for(Integer i = retList.size()-1; i>=0; i--) {
            retListReversed.add(retList.get(i));
        }   
            
        return String.join(retListReversed, '');
    }  
    
    /*
    * END Base62 encode / decode code
    */

The legendary Ron Hess and I ported that Base 62 Apex over from Python in an epic 10 minute pair programming session!

On Your Marks, Get Set, Go!

Time for a head-to-head comparison of these two methods to see which one is faster.

QLPK: 11 mins 50 seconds
Base62PK: 5 mins 9 seconds

In this case Base62 is over twice as fast! But that won’t always be the case. The 40M records were created all at once, so the ids were really dense. What can happen in practice is that the records build and are then deleted over time. This leaves lots of “holes” in the ids which are returned by Base62PK chunking. So your responses tend to be more sparsely populated as compared to QLPK. This means you may have to make more requests to get all of the ids. Since every situation will have a different data profile, it’s best to experiment to find out the fastest method.

Make sure to use appropriate on-screen progress indicators for wait times like these. Five minutes is a long time for users to wait for a process to finish, but if they know it is working on querying 40M records, and they have something to look at while they wait, it can be acceptable.

Taking Time Out to Discuss Timeouts

After we’ve gotten the chunking ranges and we are making all the individual chunked queries, we run the risk of any one of those requests timing out. The larger our chunk size, the greater the risk of this happening, with either chunking method (QLPK or Base62PK): we made our smaller haystacks too big to search for our needles in the 120 seconds we get before Salesforce times out the query. However, most of the time if you try the same query a second time, it will succeed. Occasionally it will take 3 or more tries, but most of the queries return on the first or second attempt. This behavior comes from “cache warming”: even though the query timed out the first time, the database did some caching magic which makes the data more readily available the next time we request it. In fact, Salesforce’s own Bulk API will retry up to 15 times on a query. To allow for this, we look at the response of the remoting request to see if it timed out, and fire it off again if it did.

There is a lot of room to optimize the retry logic, such as waiting, or only retrying x number of times. Our simple example just retries right away and repeats until it succeeds.
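The simple retry loop looks roughly like this in JavaScript. The names are illustrative; `attempt` stands in for one chunked remoting request that calls back with an error on timeout:

```javascript
// Retry-until-success wrapper with an optional cap on attempts.
function retryQuery(attempt, maxTries, done) {
  let tries = 0;
  function go() {
    tries++;
    attempt(function (err, result) {
      if (err && tries < maxTries) {
        go(); // immediate retry, as in the simple example
      } else {
        done(err, result, tries);
      }
    });
  }
  go();
}
```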

A Word on Concurrency

These parallel query techniques make it possible to hit a “ConcurrentPerOrgApex Limit exceeded” exception. Salesforce limits the number of Apex processes running for 5 seconds or longer to 10 per org. You can handle these with a wait-and-retry similar to the timeout logic I have shown. But most importantly, measure the execution time of your code yourself. If it is close to 5 seconds, see what you can do to optimize it. Adding indexes to the fields in the WHERE clause of your chunk query is often all it takes to stay well away from the 5 second mark. You don’t want any of your parallel pieces getting close to 5 seconds, as that may impact users of other parts of your Salesforce org. Save long running requests for where they are really needed – not here.

In my examples I am making all 800 requests in parallel. That’s a large number of connections to keep open at once! Chrome seems to handle it just fine, but for a production system that needs stability, I would recommend implementing a rolling window approach which keeps x number of connections alive at once. This can be a custom setting you tweak if the need arises.
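A minimal rolling window could look like the sketch below (illustrative names, not the article’s code): keep at most `limit` requests in flight, and start the next one as each callback returns.

```javascript
// tasks: array of functions, each taking a completion callback.
function rollingWindow(tasks, limit, onAllDone) {
  let next = 0;
  let active = 0;
  let finished = 0;
  function launch() {
    while (active < limit && next < tasks.length) {
      active++;
      const task = tasks[next++];
      task(function () {
        active--;
        finished++;
        if (finished === tasks.length) {
          onAllDone();
        } else {
          launch(); // backfill the window
        }
      });
    }
  }
  launch();
}
```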

Other Ideas

There are other ways to chunk Base62 numbers. For extra geek points you could operate purely in Base62 for all of it, incrementing your Id by advancing the characters. Salesforce’s 64-bit Long goes into the quintillions, so I didn’t need to do this, but there may be some efficiency gain from it.

This example just queries a massive amount of data, but you can take this to the next level and use it to write data back into Salesforce. This makes for some turbo-charged batch processing jobs!

If you need to execute this in the back end, you could write the Id ranges into a temporary object which you iterate over in a batch. However, you won’t get the same performance this way; it may be acceptable if you run 5 concurrent batches. I haven’t tested this approach.

There are other methods of PK chunking, for example serial chunking without a query locator: query with LIMIT 50000, then issue the next query with the Id greater than the last Id of the previous query. However, I found this to be the slowest and least useful method, so I left it out. Maybe you can think of a method better than all of these!

finish()

PK chunking is a valuable technique; Salesforce uses it themselves for the Bulk API. However, when you learn how to use this hammer, be cautious of wanting to pound every nail with it. Only use this as a last resort, but if the requirements truly dictate this approach, it will deliver. It requires a lot of JavaScript knowledge to build a robust solution, so make sure you have that skillset if you want to go this route.

Resources

GitHub repo with all the code used in this article: https://github.com/danieljpeter/pkChunking
