by Jennifer Wyher and Paul Battisson, April 2015
In Machine Learning With Apex Part 1, we discussed machine learning as a general concept by describing its history and its types and by presenting real-world applications. We also introduced the k-means clustering algorithm and how its results can help businesses make better-informed decisions. Now we'll discuss the difficulties in implementing the k-means algorithm, or any machine learning system for that matter, on the Salesforce1 Platform and present how these challenges can be overcome. We promise useful Apex coding tricks that can be applied to everyday code to save on machine resources, computing time, and ultimately, LIMITS!
As a multi-tenant platform, Salesforce1 imposes limits to protect against monopolization of resources. While Salesforce continues to increase limits in each release, we, as developers, will eternally be presented with design decisions that pit code and/or process simplicity against computational resource consumption. As Dan Appleman suggested in his book, Advanced Apex Programming, “Don’t focus on the values that you aren’t supposed to exceed. Instead, consider each limit a pointer to an operation that you want to optimize throughout your code.” Machine learning systems require lengthy data calculations over numerous iterations and therefore inherently require substantial computational resources. However, by taking Appleman’s advice and viewing limits as a challenge to write better code, machine learning applications can be successfully implemented on the platform. Bring on the challenges!
The k-means algorithm iterates over all provided data points, groups them into k clusters, and finally calculates new centroids based on the groupings. This process is repeated until the centroids stop moving by more than a chosen margin of error. The number of iterations needed to reach the final centroids is not known in advance and can be quite large. An execution of the algorithm can take anywhere from a couple of iterations to a hundred or more, depending on the number of data points, clusters, variables, and variation in the data. For our solution in Apex to be successful, it must not be bounded by the number of iterations necessary to solve the problem.
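The loop described above can be sketched in plain Apex, with no platform data involved. This is a minimal illustration rather than the authors' implementation: the class name `KMeansSketch`, its method names, the choice of squared Euclidean distance, and the rule of keeping the old centroid for an empty cluster are all assumptions made for the example.

```apex
// Hypothetical in-memory sketch of one k-means iteration (not the article's code).
public class KMeansSketch {
    // Index of the centroid closest to the point, by squared Euclidean distance.
    public static Integer nearest(List<Double> point, List<List<Double>> centroids) {
        Integer best = 0;
        Double bestDistance = null;
        for (Integer c = 0; c < centroids.size(); c++) {
            Double distance = 0;
            for (Integer d = 0; d < point.size(); d++) {
                Double diff = point.get(d) - centroids.get(c).get(d);
                distance += diff * diff;
            }
            if (bestDistance == null || distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    // One iteration: assign every point to a cluster, then average each cluster.
    public static List<List<Double>> iterate(List<List<Double>> points,
                                             List<List<Double>> centroids) {
        Integer k = centroids.size();
        Integer dims = centroids.get(0).size();
        Double zero = 0;
        List<List<Double>> sums = new List<List<Double>>();
        List<Integer> counts = new List<Integer>();
        for (Integer c = 0; c < k; c++) {
            List<Double> zeros = new List<Double>();
            for (Integer d = 0; d < dims; d++) {
                zeros.add(zero);
            }
            sums.add(zeros);
            counts.add(0);
        }
        for (List<Double> point : points) {
            Integer cluster = nearest(point, centroids);
            for (Integer d = 0; d < dims; d++) {
                sums.get(cluster).set(d, sums.get(cluster).get(d) + point.get(d));
            }
            counts.set(cluster, counts.get(cluster) + 1);
        }
        List<List<Double>> updated = new List<List<Double>>();
        for (Integer c = 0; c < k; c++) {
            if (counts.get(c) == 0) {
                updated.add(centroids.get(c)); // assumption: empty cluster keeps its centroid
                continue;
            }
            List<Double> mean = new List<Double>();
            for (Integer d = 0; d < dims; d++) {
                mean.add(sums.get(c).get(d) / counts.get(c));
            }
            updated.add(mean);
        }
        return updated;
    }

    // True when no centroid coordinate moved by more than the tolerance.
    public static Boolean converged(List<List<Double>> previous,
                                    List<List<Double>> updated, Double tolerance) {
        for (Integer c = 0; c < previous.size(); c++) {
            for (Integer d = 0; d < previous.get(c).size(); d++) {
                if (Math.abs(previous.get(c).get(d) - updated.get(c).get(d)) > tolerance) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

A caller would repeat `iterate` until `converged` returns true. Nothing in this sketch limits how many repetitions that takes, which is exactly the problem a single Apex transaction cannot absorb.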
Introduced in Winter '13, chained batches allow batch processes to be queued from within one another. This newer functionality lets developers daisy-chain batches by calling the executeBatch method from within the finish method. In our case, by queueing another execution of the iteration (batch) at the end of each iteration, we are able to mimic the recursive nature of the algorithm while using batches to minimize total activity within a single execution context.
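A minimal skeleton of this chaining pattern might look like the following. The class name `IterationChainSketch`, the `maxIterations` cap, and the use of Account as a stand-in for the data point object are assumptions for illustration. A real implementation would stop on convergence rather than on a fixed cap, which is used here only to keep the sketch finite.

```apex
// Hypothetical sketch: each batch job is one iteration and queues the next one.
public class IterationChainSketch implements Database.Batchable<sObject> {
    private final Integer iteration;     // which pass of the algorithm this job represents
    private final Integer maxIterations; // assumed safety cap, not part of the article

    public IterationChainSketch(Integer iteration, Integer maxIterations) {
        this.iteration = iteration;
        this.maxIterations = maxIterations;
    }

    public Database.QueryLocator start(Database.BatchableContext bc) {
        // Account stands in for whatever object holds the data points.
        return Database.getQueryLocator('SELECT Id FROM Account');
    }

    public void execute(Database.BatchableContext bc, List<sObject> scope) {
        // Assign each record in this chunk to its nearest centroid (omitted).
    }

    public void finish(Database.BatchableContext bc) {
        // Chaining: queue the next iteration from the end of this one.
        if (shouldContinue(iteration, maxIterations)) {
            Database.executeBatch(new IterationChainSketch(iteration + 1, maxIterations));
        }
    }

    // Kept separate so the stopping rule can be checked without running a job.
    public static Boolean shouldContinue(Integer iteration, Integer maxIterations) {
        return iteration < maxIterations;
    }
}
```

The chain would be started with something like `Database.executeBatch(new IterationChainSketch(1, 100));`. Values set in the constructor are available in `finish`, so the iteration counter travels with each queued job.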
At this stage of the design, the flow of the algorithm can be visualized as follows:
Aligning a batch series to an iteration of the algorithm solves our theoretically infinite execution conundrum; however, we would still benefit from optimizing the code within the batches to minimize the work done per transaction context. In the final step of the algorithm, we calculate each new centroid’s position as the average of all the data points assigned to its cluster. For these calculations to be possible, the application must have knowledge of all data points assigned to each cluster. To calculate the average in this fashion, we would need either to keep the data points belonging to each cluster in memory or to retrieve them again via SOQL queries - two actions we want to minimize. We need an alternate approach that doesn’t require referencing the data points a second time.
By implementing the Database.Stateful interface on the batch class, we can maintain state across the batch transactions, exactly what is needed! As we analyze each data point, we determine which cluster it should be assigned to, immediately add its values to that cluster's running total, and increment the cluster's count, all in member variables. The state of these member variables is maintained across batches and can be used to calculate the averages and new centroids in the finish method. The outcome is minimization of in-memory data, data iterations, and queries.
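The running-total idea can be sketched as below. This is an illustration under stated assumptions, not the authors' class: `CentroidTotalsBatchSketch` and its method names are hypothetical, and Account with `AnnualRevenue` and `NumberOfEmployees` merely stands in for a two-variable data point object.

```apex
// Hypothetical sketch: per-cluster sums and counts kept in stateful member variables.
public class CentroidTotalsBatchSketch
        implements Database.Batchable<sObject>, Database.Stateful {
    // Database.Stateful preserves these values between execute() calls.
    private List<List<Double>> centroids;
    private List<List<Double>> sums = new List<List<Double>>();
    private List<Integer> counts = new List<Integer>();

    public CentroidTotalsBatchSketch(List<List<Double>> centroids) {
        this.centroids = centroids;
        Double zero = 0;
        for (List<Double> centroid : centroids) {
            List<Double> zeros = new List<Double>();
            for (Integer d = 0; d < centroid.size(); d++) {
                zeros.add(zero);
            }
            sums.add(zeros);
            counts.add(0);
        }
    }

    public Database.QueryLocator start(Database.BatchableContext bc) {
        return Database.getQueryLocator(
            'SELECT AnnualRevenue, NumberOfEmployees FROM Account ' +
            'WHERE AnnualRevenue != null AND NumberOfEmployees != null');
    }

    public void execute(Database.BatchableContext bc, List<Account> scope) {
        for (Account a : scope) {
            Double employees = a.NumberOfEmployees;
            accumulate(new List<Double>{ a.AnnualRevenue.doubleValue(), employees });
        }
    }

    // Assign the point to its nearest centroid and fold it into that cluster's totals.
    public void accumulate(List<Double> point) {
        Integer best = 0;
        Double bestDistance = null;
        for (Integer c = 0; c < centroids.size(); c++) {
            Double distance = 0;
            for (Integer d = 0; d < point.size(); d++) {
                Double diff = point.get(d) - centroids.get(c).get(d);
                distance += diff * diff;
            }
            if (bestDistance == null || distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        for (Integer d = 0; d < point.size(); d++) {
            sums.get(best).set(d, sums.get(best).get(d) + point.get(d));
        }
        counts.set(best, counts.get(best) + 1);
    }

    // New centroid = running total / count; an empty cluster keeps its old centroid.
    public List<List<Double>> newCentroids() {
        List<List<Double>> updated = new List<List<Double>>();
        for (Integer c = 0; c < centroids.size(); c++) {
            if (counts.get(c) == 0) {
                updated.add(centroids.get(c));
                continue;
            }
            List<Double> mean = new List<Double>();
            for (Integer d = 0; d < sums.get(c).size(); d++) {
                mean.add(sums.get(c).get(d) / counts.get(c));
            }
            updated.add(mean);
        }
        return updated;
    }

    public void finish(Database.BatchableContext bc) {
        // The averages come from the totals alone; no data point is read a second time.
        System.debug(newCentroids());
    }
}
```

Because only one sum vector and one count per cluster are retained, the memory held between chunks depends on the number of clusters and variables, not on the number of data points.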
Our evolved process is depicted below.
As we have seen, at its core a machine learning system is an iterative process running a large number of calculations. One of the governor limits that many developers on the Salesforce1 Platform do not pay particular attention to is the CPU time governor limit, brought in as a replacement for the previous script statements governor limit in the Winter '14 release. Rather than counting the number of statements that were executed within a transaction, the new limit enforces a rule that the amount of time spent utilizing the CPU for the particular execution context can be no longer than 10 seconds for synchronous code and 60 seconds for asynchronous code. When executing large serialization and deserialization of data, processing large volumes of records or performing repetitive calculations, these limits can be quickly consumed. We need to ensure that we are effectively and efficiently executing our code to help speed up the execution of our batches, but also ensure that we can process a large data volume in each transaction.
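One way to keep an eye on this limit from inside a transaction is the standard `Limits` class, which reports CPU time consumed so far and the ceiling for the current context, both in milliseconds. The helper below is a small sketch; the class name `CpuBudgetSketch` and the idea of a reserve threshold are assumptions for illustration.

```apex
// Hypothetical helper for checking how much CPU time a transaction has left.
public class CpuBudgetSketch {
    // Milliseconds of CPU time still available in the current transaction.
    public static Integer remainingMillis() {
        return Limits.getLimitCpuTime() - Limits.getCpuTime();
    }

    // True when fewer than reserveMillis milliseconds remain.
    public static Boolean shouldStop(Integer reserveMillis) {
        return remainingMillis() < reserveMillis;
    }
}
```

A long-running loop could call `CpuBudgetSketch.shouldStop(2000)` every few hundred records and exit cleanly instead of hitting the limit. The same two `Limits` calls, taken before and after a block of code, give a rough measure of what that block costs.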
Where we are iterating through our records to perform some calculation on them, any improvement we can make to the speed of our loops is likely to yield the greatest improvement in performance for the system. Consider first, as a simple example, the loop below, where we iterate over a set of contacts related to an account and update a single field. If we execute this loop for an account with over 10,000 contacts, to simulate a reasonable data set, and record the time taken, we find that the overall execution took approximately 2.4 seconds.
If we now update the code to look like the image below and run the same experiment, we find that the execution took just 0.2 seconds, over 90% less time. This has to do with the way in which Apex allocates memory and retrieves the values we are using while looping. Although a full explanation of why this occurs is outside the scope of this article, we should be aware of this improved way of writing our code to ensure that our loops are executing in the most performant way possible. For more information on this looping method, see here.
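The two listings referred to above were images and are not available in this text. As a hedged illustration only, the sketch below shows a pair of loop forms that is often compared in Apex: the for-each (iterator) loop and an index-based loop that reads the list size once. Whether these are the exact forms the authors measured is an assumption, and the class name `LoopFormsSketch` is hypothetical.

```apex
// Hypothetical sketch of two loop forms; not necessarily the article's listings.
public class LoopFormsSketch {
    // Form 1: for-each loop over the list.
    public static void updateWithForEach(List<Contact> contacts) {
        for (Contact c : contacts) {
            c.Description = 'Processed';
        }
    }

    // Form 2: index-based loop with the list size read once up front.
    public static void updateWithIndex(List<Contact> contacts) {
        for (Integer i = 0, size = contacts.size(); i < size; i++) {
            contacts[i].Description = 'Processed';
        }
    }
}
```

Both methods produce the same result. Timings differ between orgs and releases, so it is worth measuring each form with `Limits.getCpuTime()` on your own data rather than relying on published figures.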
Creating a machine learning system in Apex is challenging on its own, but we want to go further and visually show the results from each iteration and the process overall. For this to be possible, we need to store the results of each iteration, including the data points assigned to each cluster and the calculated centroids. You can think of the cluster assignment as a list of record IDs and the centroid as a matrix with a row for each cluster.
As a first pass at solving our storage challenge, we considered saving the entire result in JSON format in a Long Text Area field. However, given the maximum field length of 131,072 characters, this approach limited our storage to roughly 6,898 record IDs (about 19 characters per ID, consistent with an 18-character ID plus a one-character delimiter) - far fewer than we anticipated processing.
The logical place to store and retrieve larger text strings is in Attachments. Attachments allow us to store the results of each iteration in a document containing our JSON-serialized clustered data and attach it to the Iteration record as the result of that process. The Iteration record also holds the details of the final calculated centroid, also serialized. JSON serialization is used to speed reads and writes. By using Attachments, we remove the shortcoming of storing the results in a limited-length field, and the number of records that can be processed is now limited only by SOQL limits.
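Storing a result this way could look roughly like the sketch below. The class name `IterationResultStoreSketch`, the file name, and the shape of the result (a map from a cluster label to a list of record IDs held as strings) are assumptions for illustration. `Attachment`, `Blob.valueOf`, `JSON.serialize`, and `JSON.deserialize` are standard platform features.

```apex
// Hypothetical sketch: cluster assignments serialized to JSON inside an Attachment.
public class IterationResultStoreSketch {
    // Build (but do not insert) an Attachment holding the serialized assignments.
    public static Attachment build(Id parentId, Map<String, List<String>> assignments) {
        return new Attachment(
            ParentId = parentId,                 // e.g. the Iteration record
            Name = 'iteration-result.json',      // assumed file name
            ContentType = 'application/json',
            Body = Blob.valueOf(JSON.serialize(assignments))
        );
    }

    // Read the assignments back out of an Attachment body.
    public static Map<String, List<String>> read(Attachment att) {
        return (Map<String, List<String>>) JSON.deserialize(
            att.Body.toString(), Map<String, List<String>>.class);
    }
}
```

The caller decides when to save, for example `insert IterationResultStoreSketch.build(iterationId, assignments);` in the `finish` method of the batch, where `iterationId` is the ID of the parent record.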
As mentioned above, all of the results for the system need to be stored in an attachment as JSON, rather than in a field on the record, to allow us to store the necessary volumes of information. We want to be able to display this in a figure similar to the image below so that we can visualize the clusters and the assignments we have made.
In order to display these in a step-by-step fashion, we must load the JSON data from the attachment onto a Visualforce page and render it on a chart. Loading this data into memory in the controller and passing it down to the page in a variable would make rendering simple, but holding so many attachments in memory would break the heap size governor limit. The solution is to use a remote action from the page that takes in parameters identifying the iteration data set we want to retrieve and returns that data correctly formatted for our use.
As you will see in the codebase and example in the next article, we process and format the data on the server in our remote action Apex, rather than processing it on the client side. This improves the responsiveness of the page by letting it focus solely on rendering the image, rather than on looping through a large volume of data to ensure it is correctly formatted for the charting library.
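Until that codebase is available, the remote action approach described above can be sketched as follows. Everything specific here is an assumption: the class name `ClusterChartControllerSketch`, the single `iterationId` parameter, the stored JSON shape (cluster label mapped to a list of record IDs), and the returned row shape, since the charting library's expected format is not given in this article.

```apex
// Hypothetical sketch: a remote action that loads one iteration's data on demand.
public with sharing class ClusterChartControllerSketch {
    // Called from the page via JavaScript remoting; loads only the requested iteration.
    @RemoteAction
    public static List<Map<String, Object>> getIterationData(Id iterationId) {
        List<Attachment> results = [
            SELECT Body FROM Attachment
            WHERE ParentId = :iterationId
            ORDER BY CreatedDate DESC
            LIMIT 1
        ];
        if (results.isEmpty()) {
            return new List<Map<String, Object>>();
        }
        return format(results[0].Body.toString());
    }

    // Server-side formatting: one row per cluster, ready for the page to draw.
    public static List<Map<String, Object>> format(String jsonBody) {
        Map<String, List<String>> assignments =
            (Map<String, List<String>>) JSON.deserialize(
                jsonBody, Map<String, List<String>>.class);
        List<Map<String, Object>> rows = new List<Map<String, Object>>();
        for (String cluster : assignments.keySet()) {
            rows.add(new Map<String, Object>{
                'cluster' => cluster,
                'size' => assignments.get(cluster).size(),
                'recordIds' => assignments.get(cluster)
            });
        }
        return rows;
    }
}
```

On the page, the method would be called with `Visualforce.remoting.Manager.invokeAction`, passing the iteration ID and a callback that hands the returned rows to the chart. Only one attachment body is held in memory per call, which keeps each request well away from the heap limit.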
Jennifer is the Multichannel Practice Lead and a Technical Architect at Mavens Consulting. She has 10+ years of IT experience, is a Certified Salesforce Developer and an active member of the PhillyForce user group. You can follow her on Twitter @JenWyher.
Paul is a Force.com MVP and Technical Architect for Mavens Consulting with over 5 years' experience in developing applications for the Salesforce1 Platform, and he enjoys playing with other languages and frameworks. He is a Certified Salesforce Developer and Advanced Developer, and runs the North UK Developer Group and the Force.com Cast video series (see @forcedotcomcast on Twitter). You can follow him on Twitter @pbattisson or at http://www.paulbattisson.com.