Hadoop and Big Data: Use Cases at Salesforce.com
Salesforce.com is the premier cloud computing service provider for the enterprise. It provides several popular services such as Sales, Service, Marketing, Force.com, Chatter, Desk, and Work to more than 130,000 customers and millions of users. This results in over a billion transactions per day, accessed through multiple channels: API, Web, and Mobile. Being able to gather events (also called clickstream or interaction data) in a central location is one of the key advantages of being a cloud provider. This event data is extremely useful for internal and product use cases, which in turn gives users a better overall experience of Salesforce.com services.
Events are gathered through application instrumentation and logging. For each logged event, we collect information about the interaction: organizationId, userId, API/URI/Mobile details, IP address, response time, and other details. From an internal perspective, this data lets us troubleshoot performance problems, detect application errors, and measure usage and adoption, both for the Salesforce code base and for custom applications built on our platform. Many internal users at Salesforce use logs on a daily basis. Key user groups are the R&D, Technical Operations, Product Support, Security, and Data Science teams. We use a combination of log mining tools: Splunk for operational use cases and Hadoop for analytic use cases.
This post digs into a couple of interesting use cases on our Hadoop platform. Hadoop is a popular open source MapReduce framework that is used to answer simple and complex questions on large data sets. We have used it successfully for internal as well as product use cases. Internal examples include product metrics and capacity planning; product examples include Chatter file and user recommendations, and search relevancy.
Product metrics are important for Product Managers who need to understand usage and adoption of their features. They also let the executive leadership team make decisions based on trends. Not surprisingly, it is often as important to kill unpopular features as it is to invest in popular ones. With product metrics, the goal was to define features, their log instrumentation, and a standard set of metrics, and then to measure and visualize them. We used Custom Objects on the Force.com platform to record feature definitions and their log instrumentation. While we use Pig extensively to mine the logs on an ad hoc basis, we didn’t think it worthwhile or productive for every Product Manager to write their own Pig scripts for their features. So we wrote a custom Java program to auto-generate Pig scripts from the predefined feature instrumentation. These scripts ran on the Hadoop cluster, aggregating the data and storing the daily summaries in another Custom Object. Finally, we used Salesforce.com Reports and Dashboards to visualize this data in a self-service way for PMs and executives. More information on this can be found in this recorded webinar and associated slides.
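To make the script generation step concrete, here is a minimal sketch of turning a feature definition into a Pig script. It is only an illustration: our generator was written in Java, and the `FeatureDefinition` fields, the log schema, the paths, and the chosen metrics below are all assumptions made for the example, not the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical stand-in for a feature record stored in a Custom Object."""
    name: str       # label shown in reports
    log_field: str  # log column used to detect the feature (assumed)
    pattern: str    # regular expression that identifies the feature (assumed)


def generate_pig_script(feature, input_path, output_path):
    """Return Pig Latin text that summarizes one feature's usage per organization.

    The log schema in the LOAD statement is illustrative; a real log line
    carries many more fields.
    """
    statements = [
        f"logs = LOAD '{input_path}' USING PigStorage('\\t') "
        "AS (organizationId:chararray, userId:chararray, "
        "uri:chararray, responseTime:long);",
        # Keep only the events that match this feature's instrumentation.
        f"hits = FILTER logs BY {feature.log_field} MATCHES '{feature.pattern}';",
        "by_org = GROUP hits BY organizationId;",
        # One summary row per organization: request count and mean latency.
        "summary = FOREACH by_org GENERATE group AS organizationId, "
        "COUNT(hits) AS requests, AVG(hits.responseTime) AS avgResponseTime;",
        f"STORE summary INTO '{output_path}' USING PigStorage('\\t');",
    ]
    return "\n".join(statements) + "\n"
```

Calling `generate_pig_script(FeatureDefinition("Chatter Files", "uri", ".*/chatter/files.*"), "/logs/day", "/metrics/day")` returns a five-statement script that can be submitted to the cluster. Keeping the feature definition as data is what lets a Product Manager add a feature without writing any Pig.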
Collaborative filtering is a popular algorithm behind many recommendation systems on Amazon, Facebook, Twitter, and other popular sites. At Salesforce.com, we used it to recommend files and users to follow. Files are to the enterprise what photos are to Facebook. Given the importance of files, we wanted to recommend appropriate files to existing Chatter users as well as to new users in the enterprise. Amazon’s item-to-item collaborative filtering algorithm for recommending items on their website was a good inspiration. We adapted the algorithm for our use case and chose a community-based approach, which, among other benefits, avoided the “cold start” problem for a new user in the enterprise. We ran the algorithm on the Hadoop cluster as Java MapReduce jobs. More information on this can be found in the same recorded webinar and associated slides.
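For readers unfamiliar with item-to-item collaborative filtering, the sketch below shows the basic idea on a single machine: files that are used by the same people are treated as similar, and a user is recommended files similar to the ones they already use. This is a plain textbook variant with cosine-style scoring, not our adapted community-based algorithm or the MapReduce jobs; the function names and the `(user, file)` input shape are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(interactions):
    """Build file-to-file similarity scores from (user_id, file_id) pairs."""
    files_by_user = defaultdict(set)
    users_by_file = defaultdict(set)
    for user, file_id in interactions:
        files_by_user[user].add(file_id)
        users_by_file[file_id].add(user)

    # Count how many users touched both files of each pair.
    co_counts = defaultdict(int)
    for files in files_by_user.values():
        for a, b in combinations(sorted(files), 2):
            co_counts[(a, b)] += 1

    # Normalize by popularity so widely used files do not dominate.
    sims = defaultdict(dict)
    for (a, b), shared in co_counts.items():
        score = shared / sqrt(len(users_by_file[a]) * len(users_by_file[b]))
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(user_files, sims, top_n=3):
    """Rank files the user has not touched by summed similarity."""
    scores = defaultdict(float)
    for file_id in user_files:
        for other, score in sims.get(file_id, {}).items():
            if other not in user_files:
                scores[other] += score
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [file_id for file_id, _ in ranked[:top_n]]
```

Note that `recommend` returns an empty list for a user with no file activity. That is exactly the “cold start” gap that a plain item-to-item approach leaves open.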
Hadoop and other associated big data technologies are important to our success. Salesforce.com is active in the open source community, with many contributions to Pig and HBase. Recently we open sourced Phoenix, a SQL-based technology on top of HBase.
We are just getting started 🙂