by Jennifer Wyher and Paul Battisson, February 2015
Over the past few years, machine learning has emerged as one of the hottest topics in the technology sector, as both the volume of data being collected and the resources available to process it have grown rapidly. As more and more data is stored on the Salesforce1 Platform, the demand to perform machine learning analysis and obtain insights into that data will grow. In this article we, Jennifer Wyher and Paul Battisson, both of Mavens Consulting, discuss how you can perform machine learning on the Salesforce1 Platform. We cover some of the inherent limitations and how they can be addressed, then work through a practical example of building a machine learning system using Apex on the Salesforce1 Platform to see how the pieces fit together. This article is based on, and expands upon, the talk “Building Machine Learning Systems Using Apex” that we delivered at Dreamforce 2014.
In 1959, machine learning and artificial intelligence pioneer Arthur Samuel defined machine learning as the “field of study that gives computers the ability to learn without being explicitly programmed.” By this, Samuel meant that the machine itself could take some data and infer from it information that the system could then use to improve its own performance. In modern applications, machine learning is often focused on processing datasets too large for a human to comprehend, finding patterns and correlations within the data for use in detailed analytics.
There are two main types of machine learning algorithms according to most definitions: supervised learning and unsupervised learning. In supervised learning, a data set containing training examples and their targets is provided so that an algorithm can determine the parameters it should use when making future predictions. Prime examples of where this is applicable are house price prediction and optical character recognition (OCR). In the OCR example, a training set of different ways to write the alphabet is provided, and the algorithm works through it, making a prediction about each letter, being informed whether the prediction was correct, and adjusting its parameters accordingly. For example, in the image below we can see seven similar but slightly different ways of writing the capital letter “A.” Although a human easily recognizes all of these as the same letter, a computer must be trained on a large sample of data to make the prediction correctly.
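The predict, check, and adjust loop described above can be sketched with a perceptron, one of the simplest supervised learners. This is an illustrative sketch in Python rather than Apex, and it is not an OCR system: the function names `predict` and `train` and the toy training set are assumptions made up for this example.

```python
# Minimal supervised-learning sketch (hypothetical example, not from the talk).
# Each training example is a pair: (features, target), with target 0 or 1.

def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train(examples, epochs=20, learning_rate=1):
    """Perceptron training: predict, compare with the target, adjust."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for features, target in examples:
            # error is 0 for a correct prediction, +1 or -1 otherwise.
            error = target - predict(weights, bias, features)
            # Parameters only change when the prediction was wrong.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Made-up training set: the target is 1 only when both inputs are 1.
training_set = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train(training_set)
predictions = [predict(weights, bias, features) for features, _ in training_set]
```

After training, `predictions` matches the targets in `training_set`. Real supervised problems such as OCR use far more features and examples, but the loop has the same shape.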
Unsupervised learning, in contrast, focuses on the machine being given a large volume of data and then interrogating that data to find patterns without any guidance on the expected outcome. An example of such a problem is clustering, where we have a series of objects that we wish to group together logically. The example system we are going to construct together in this series is a clustering system using the KMeans clustering algorithm, one of the most commonly used clustering algorithms. In the image below we can see the clustering algorithm applied to a series of data points that correspond to baseball hits. Using the clusters presented here, our system has determined the best fielding positions for our players. This analysis could be run separately against each opposition team (or even each hitter) so that the defense is well placed each time, thereby maximizing the chance of an out.
So we have seen what the different types of algorithms are, but what are some example uses on the Salesforce1 Platform? As mentioned previously, we are going to detail how to construct a simple KMeans application to allow us to cluster some records, but what records are we going to cluster and why?
Let’s consider an example. As an organization, Acme Inc. sells different heating and cooling equipment, warranties, service agreements for the equipment, and advisory services. They have over 100,000 customers across the United States, ranging from small shops to large office complexes and residential blocks. Acme’s marketing and sales teams would like to have their customers split into groups and profiled so they can better market and sell their products to each group, as well as understand what the typical profile of a group member is.
The KMeans clustering algorithm is an unsupervised learning algorithm that takes a set of data and separates it into some number (called K) of distinct groups (called clusters). The average of all the members of a cluster is called the cluster centroid, and can be thought of as the average member of that cluster across all parameters. The algorithm works by randomly initializing a set of K centroids and then assigning every data point to its closest centroid to form the clusters. Each centroid’s position is then recalculated as the average of all the data points assigned to its cluster, and the points are reassigned to their nearest centroid. This process repeats until the centroids stop moving (or move by less than a chosen margin of error). In the image below we can see a visualization of this, where the centroids for the three different clusters are shown moving along the lines before ending in their final positions, with the cluster assignments highlighted in red, green, and blue.
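The steps just described can be sketched in a few lines. The following is a minimal illustration in Python, not the Apex implementation developed in this series; the function names, the `seed` and `tolerance` parameters, and the sample points are all assumptions chosen for the example. It picks the initial centroids at random from the data points themselves, which is one common way to initialize.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Average of a non-empty list of points, taken coordinate by coordinate."""
    count = len(points)
    return tuple(sum(coords) / count for coords in zip(*points))


def k_means(points, k, max_iterations=100, tolerance=1e-9, seed=0):
    """Return (centroids, clusters) after iterating assign-then-average."""
    rng = random.Random(seed)
    # Random initialization: use k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the average of its members.
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean_point(members) if members else centroids[i]
                         for i, members in enumerate(clusters)]
        # Stop once no centroid moved by more than the (squared) tolerance.
        moved = max(squared_distance(old, new)
                    for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= tolerance:
            break
    return centroids, clusters


# Made-up two-dimensional data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
```

With this sample data the two groups are far enough apart that the three points near the origin end up in one cluster and the other three in the second. On less clearly separated data, different random starting centroids can lead to different final clusters, so it is common to run the algorithm several times.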
If Acme were to run this algorithm on their account data, using parameters such as account revenue, number of employees, revenue for each account by product line, total wattage of products installed, and so on, they would be able to segment their accounts into clusters that profile the accounts based upon the selected parameters. They could then review how each account differs from the “average” account in its cluster and make informed decisions about which products or services might be best to approach that account with. Profiling the accounts would also allow the marketing team to analyze how to better market to the accounts in a cluster after reviewing that cluster’s profile. The information on the typical member and on each account’s cluster membership can be stored and reviewed on a regular basis. If the marketing department runs a big campaign aimed at increasing the number of service agreements their customers have, they can rerun the clustering to see whether, in general, the number of service agreements for the average customer has increased, or to identify any changes in customer demographics.
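As a rough illustration of comparing an account with the “average” account in its cluster, the sketch below computes a cluster's average member and then the per-parameter difference for one account. It is written in Python rather than Apex, and the field names, function names, and figures are all hypothetical, made up purely for the example.

```python
def average_member(accounts):
    """Average each numeric parameter across the accounts in one cluster."""
    names = accounts[0].keys()
    return {name: sum(account[name] for account in accounts) / len(accounts)
            for name in names}


def difference_from_average(account, average):
    """Per parameter: how far the account sits above (+) or below (-) average."""
    return {name: account[name] - average[name] for name in average}


# Hypothetical accounts that the clustering placed in the same cluster.
cluster_accounts = [
    {"annual_revenue": 400000, "employees": 30, "service_agreements": 1},
    {"annual_revenue": 600000, "employees": 50, "service_agreements": 3},
]

cluster_average = average_member(cluster_accounts)
differences = difference_from_average(cluster_accounts[0], cluster_average)
```

Here the first account has fewer service agreements than its cluster's average, which is the kind of gap a sales team might choose to follow up on. Storing `cluster_average` after each run would also make it possible to compare the typical member before and after a campaign.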
We now have a clearer understanding of what we are aiming to do and why we want to do it. In Machine Learning With Apex Part 2, we will see how this can be done in Apex at an abstract level, along with the limitations and issues we will have to cope with when implementing this system in Apex.
Jennifer is the Multichannel Practice Lead and a Technical Architect at Mavens Consulting. She has 10+ years of IT experience, is a Certified Salesforce Developer and an active member of the PhillyForce user group. You can follow her on Twitter @JenWyher.
Paul is a Force.com MVP and Technical Architect for Mavens Consulting with over 5 years of experience in developing applications for the Salesforce1 Platform, and he also enjoys playing with other languages and frameworks. He is a Certified Salesforce Developer and Advanced Developer, and runs the North UK Developer Group and the Force.com Cast video series (see @forcedotcomcast on Twitter). You can follow him on Twitter @pbattisson or at http://www.paulbattisson.com.