Open Source at Salesforce – Part 2: How We Contribute

In Part 2 of this series, Ian looks at how Salesforce contributes to numerous open-source projects, including Apache Qpid, Apache Phoenix, HBase, Hadoop, and Pig.

One of the great things about open source is that it lets companies with large engineering teams, like Salesforce, use the specialized expertise of their engineers for a much broader impact. Note: If you missed our open source intro, it’s a great place to start!

In Part 1 of this series, we looked at a few of the open source tools, frameworks, and products that engineers use to support the Salesforce service. Now we’ll look at how engineers contribute improvements to those same programs back to the community.

Contributing to Apache Qpid

At Salesforce, we deliver 3 major releases a year and dozens of patches. We need the ability to resolve customer issues quickly.

As an example, take our Message Queue (MQ) layer. A message queue is a way of deferring the execution of a piece of code to a later time; like a line at the bank, each request “queues up” and waits for its turn to be run. In the early days, many Salesforce developers wrote their own implementations of this common and useful pattern, but these were eventually centralized into a single consolidated queue system running on a closed-source commercial product.
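To make the pattern concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how the Salesforce MQ layer or Apache Qpid is implemented; the `SimpleQueue` class and the example tasks are hypothetical names invented for this example.

```python
from collections import deque
from typing import Callable, Deque, List


class SimpleQueue:
    """Toy first-in, first-out task queue (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # Tasks wait here until a worker is ready to run them.
        self._pending: Deque[Callable[[], str]] = deque()

    def enqueue(self, task: Callable[[], str]) -> None:
        """Accept a task now; it is not executed until drain() is called."""
        self._pending.append(task)

    def drain(self) -> List[str]:
        """Run every waiting task in arrival order and collect the results."""
        results: List[str] = []
        while self._pending:
            task = self._pending.popleft()
            results.append(task())
        return results


if __name__ == "__main__":
    queue = SimpleQueue()
    queue.enqueue(lambda: "send welcome email")
    queue.enqueue(lambda: "recalculate report")
    # Nothing has run yet; the work happens only when the queue is drained.
    print(queue.drain())
```

A production queue adds the parts that are hard to get right, such as persistence, retries, and delivery across machines, which is why a shared, centralized system is attractive.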

This worked well enough…until a bug appeared. In some situations, the task would fail with an opaque error message. The engineering team worked with the vendor for many release cycles, trying one patch after another to no avail. Had it been an open source program, the team would have made fixing this terrible bug their top priority and squashed it. But as it was, they were at the mercy of the vendor.

To solve this problem, as well as improve scalability, the engineering team searched for a replacement and eventually settled on Apache Qpid. The team once again ran into mysterious bugs… but because it was an open source project, they were able to look at the source code, debug the issues, fix them, and contribute the fixes back, including a client-side fix that resulted in a 40 percent reduction in memory usage! And not only does Salesforce benefit, but everyone who uses Apache Qpid gets this improvement.

Test All The Things

Selenium is a browser-based automation tool. When you want to write integration tests against the user interface of a product (for example, the rendered web page), it’s helpful to have a tool that lets you craft those tests in a cross-platform, cross-browser way. Salesforce makes heavy use of Selenium in running our massive suite of functional and integration tests against every code check-in. Salesforce engineer Luke Inman-Semerau has been a committer on the Selenium project for the last 3 years, and has been a key contributor on documentation, Python, and Java.

Salesforce is a mobile-first platform, and that’s just as evident in our testing. We’ve been heavily involved in 3 different mobile drivers for Selenium. Luke is a committer on both the Selendroid project (Selenium for Android) and the ios-driver project (Selenium for iOS), along with Salesforce engineer Roman Salvador. Both Selendroid and ios-driver were created at eBay, and Salesforce was an early adopter who helped incubate and evolve them to their current state.

We’ve also adapted Selenium to work with other mobile devices. Salesforce engineer Jim Evans has produced a new open source library, Windows Phone Driver, that allows you to use the same Selenium WebDriver API to automate web applications running on Windows Phone 8.1.

Greg Wester, Sagar Wanaselja, and David Louvton all presented on how Salesforce uses Selenium earlier this year.

Batch Data: Hadoop

The core of Salesforce’s business, of course, is data. Nearly every operation on Salesforce uses data in one form or another: viewing Accounts, executing Apex and Visualforce, generating reports, etc. Optimizing our use of data is the largest part of almost every engineer’s job at Salesforce. And, no surprise, open source is a key part of this too.

Processing large batches of data can be a big resource draw. Doing this against a standard database is challenging because it requires you to first extract and transform the data, then load it somewhere else where you’ll do the heavy lifting. In 2004, Google’s MapReduce paradigm took the batch-processing world by storm and was quickly given a vibrant open source life in the form of Apache Hadoop. Hadoop sends the computations to the data instead of the other way around.
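As a rough illustration of the MapReduce paradigm, the sketch below counts words in plain Python using the three classic phases: map, shuffle, and reduce. It runs in a single process and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example. In a real Hadoop job, the map step would run on the machines that already hold each block of data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["open source at scale", "open data"]))
```

Because each map call looks at one line and each reduce call looks at one key, both phases can be spread across many machines, which is what makes the model a good fit for large batches.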

At Salesforce, CTO Walter Macklem started a project in 2010 to introduce Hadoop, under the codename “Gridforce.” This made use of the Hadoop Distributed File System (HDFS) and now also uses Apache Pig, a high-level language for writing MapReduce programs. Prashant Kommireddi, one of the leads of the Gridforce team, is a committer on Pig and contributes regularly to its development.

Today, batch processing with Hadoop is used extensively in back-end processing, such as improving search relevance and discovering recommendations for items to follow in Chatter. It’s also part of a new program that allows access to pre-processed log files (code name “ELF”).

Getting Committed with HBase

Relational databases are extraordinary and powerful pieces of software, and Salesforce relies on them. This gives us a wide range of capabilities and a consistent basis for storing all customer data. However, it comes with some inherent limitations. Because of the depth of relational capabilities in the product (including triggers, views, indexes, and wide-ranging atomic transactions), there comes a point of diminishing returns: the amount of engineering effort required to incrementally improve performance becomes prohibitive.

So we asked ourselves: What if we could store vast numbers of records but with fewer assumptions and capabilities? What if we could scale our data storage hardware out horizontally but with the same data safety guarantees?

The answer to these questions is Apache HBase. HBase is a horizontally scalable “NoSQL” database based largely on the design of Google’s Bigtable system. You may be familiar with HBase because it’s the same technology that runs the massive infrastructure behind Facebook Messages. It’s a fault-tolerant, consistent row store that scales linearly by adding commodity hardware machines. The trade-off is that some things that are easy for a relational database (like transactions and indexes) are comparatively much harder. (Though, as you’ll see in Part 3, we’re closing that gap via Apache Phoenix, a SQL library on HBase that was open-sourced by Salesforce!)

What will run on HBase? Initial features are targeting audit and compliance use cases, such as audit history, event tracking, and archival storage of older records. Eventually, though, we’re aiming to offer high-safety, low-cost large storage for “big objects” that use familiar APIs but don’t give you all the features of classic Salesforce objects.

HBase was the first major open source software project that Salesforce got deeply involved with at a community level. Lars Hofhansl, an architect at Salesforce for over 10 years, became an HBase committer in 2012, and he has since gone on to be the 0.94 series release manager and a member of the HBase PMC. Jesse Yates, another HBase committer, is also on the Salesforce HBase engineering team and has coauthored many critical HBase features, including Snapshots.

Both Lars and Jesse spoke at HBaseCon 2014 (as did Salesforce engineers Eli Levine and James Taylor). Along the way, the focus of Salesforce’s engineering efforts on HBase has been on bringing it up to the same level of world-class resiliency and data safety that we demand for our enterprise customers. (This is my team, so I could go on for hours about it. But I’ll leave that for a future post. However, you can check out Lars’, Jesse’s, and James’ presentations on our Slideshare channel.)

Speaking of Open Source Committers…

As you can see, Salesforce isn’t just an open source consumer; we are also a contributor. But, beyond that, we’re also an open source “pusher” in that we actively support a large roster of engineers who work on open-source projects part-time or full-time. This includes project leaders like Tom Lane (PostgreSQL), Matz (Ruby), Jason van Zyl (Maven), and Damien Katz (CouchDB), as well as all the other folks listed above.

In Part 3, we’ll talk about how Salesforce has been busy creating and releasing new open-source projects.

July 23, 2014
