Apache Phoenix: A small step for big data

Every part of our lives is now teeming with data. Every customer, app and product is connected and generating massive data streams. Companies are becoming increasingly data-driven in their decisions. At salesforce.com, we spend every day thinking about customer success. In my world, that means asking how developers can build great apps that turn all this data into great insights. To that end, we started an open source project in the big data space called Phoenix. The Phoenix project just hit a huge milestone, bringing every developer even closer to big data.

What is Apache Phoenix?

Apache Phoenix is a layer over HBase that allows developers to write SQL queries to answer questions against very large data sets. HBase is a distributed, horizontally scalable key-value store that persists its data in the Hadoop Distributed File System (HDFS). Phoenix allows a developer to express queries in familiar SQL, and then executes those queries extremely quickly. Phoenix differs from many other SQL layers on Hadoop in that it compiles queries down to native HBase client calls.

Say you are a health care company, and you have 50 million customers with digital pedometers. Say you want to reward the top ten most active people in the month of May. It’s expensive to store 10K time series data points for each of 50M customers in a traditional database, and it would be expensive to query them there. Using Hadoop, HBase and Phoenix, a developer can answer that question with a simple SQL query and get back results with blazing speed.

SELECT name FROM pedometer_data WHERE sex='F' AND age > 40 ORDER BY step_count DESC LIMIT 10;

When Phoenix compiles this query, part of it will run on the server, and part will run on the client. Say you want to find the top ten out of a billion rows. Phoenix can execute the query across a billion rows where the data lives, return the top ten from every region, and then have the client select the final top ten. For some of our use cases, we’ve found Phoenix to be thousands of times faster than other SQL over Hadoop projects. See some performance results here: http://phoenix.incubator.apache.org/performance.html
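
To make that server/client split concrete, here is a minimal Python sketch of the general idea: each region computes its own top N, and the client merges those short lists into the final answer. This is only an illustration of the technique, not Phoenix's actual code. The row shape and the function names (region_top_n, client_merge, top_n) are hypothetical, and plain lists stand in for HBase regions.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical row shape for illustration: (name, step_count).
Row = Tuple[str, int]


def region_top_n(rows: Iterable[Row], n: int) -> List[Row]:
    """Server-side step: a region keeps only its own n largest rows."""
    return heapq.nlargest(n, rows, key=lambda row: row[1])


def client_merge(partials: Iterable[List[Row]], n: int) -> List[Row]:
    """Client-side step: merge the per-region results into the final top n."""
    candidates = [row for partial in partials for row in partial]
    return heapq.nlargest(n, candidates, key=lambda row: row[1])


def top_n(regions: List[List[Row]], n: int) -> List[Row]:
    """Run the region step on every region, then the client step once."""
    return client_merge((region_top_n(region, n) for region in regions), n)


if __name__ == "__main__":
    # Three small lists stand in for three regions of a much larger table.
    regions = [
        [("ann", 5), ("bea", 9)],
        [("cat", 7), ("dee", 1)],
        [("eve", 8)],
    ]
    print(top_n(regions, 2))
```

The point of the design is that the client only ever sees n rows per region, no matter how many rows each region holds, so the expensive scan stays next to the data.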

To take the example further, we could create a secondary index of the customers who are most active:

CREATE INDEX step_count_idx ON pedometer_data(step_count DESC)
INCLUDE(name,sex,age);

Then, the same query would run even faster by automatically using the secondary index.
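
One way to see why such an index helps: if rows are already stored in descending step_count order, and the index also carries name, sex and age, a reader can walk the index from the top, apply the filter, and stop once it has ten matches. The Python sketch below illustrates that idea only. It is not how Phoenix is implemented, and whether Phoenix stops early in exactly this way is an assumption here. IndexRow, build_index and top_from_index are hypothetical names.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class IndexRow(NamedTuple):
    """Hypothetical index entry: the sort key plus the included columns."""
    step_count: int
    name: str
    sex: str
    age: int


def build_index(rows: Iterable[IndexRow]) -> List[IndexRow]:
    """Keep entries ordered by step_count descending, like the index above."""
    return sorted(rows, key=lambda row: row.step_count, reverse=True)


def top_from_index(
    index: List[IndexRow],
    predicate: Callable[[IndexRow], bool],
    limit: int,
) -> Tuple[List[str], int]:
    """Walk the index in order and stop as soon as `limit` rows match.

    Returns the matching names and the number of index entries scanned.
    """
    names: List[str] = []
    scanned = 0
    for row in index:
        scanned += 1
        if predicate(row):
            names.append(row.name)
            if len(names) == limit:
                break
    return names, scanned


if __name__ == "__main__":
    index = build_index([
        IndexRow(100, "ann", "F", 45),
        IndexRow(900, "bea", "F", 50),
        IndexRow(800, "carl", "M", 60),
        IndexRow(700, "dee", "F", 41),
        IndexRow(50, "eve", "F", 30),
        IndexRow(10, "fay", "F", 70),
    ])
    names, scanned = top_from_index(
        index, lambda r: r.sex == "F" and r.age > 40, limit=2
    )
    print(names, scanned)
```

Because the filter columns are included in the index entries, the sketch never needs to look anything up in the main table, and it reads only as many entries as it takes to find the requested number of matches.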

What happened last week?

The open source project hit a key milestone last week when it graduated from incubation to a top-level project. The Apache contributor model helps projects like this grow their community to include people from across multiple companies, all working toward the same end: the success of the project. We’re proud to have seen this project start as an internal Salesforce project, open up externally as a github/forcedotcom project, become an Apache Incubator project, and now graduate to an Apache top-level project.

Wut, salesforce.com and Open Source?

Salesforce.com operates a cloud service that performs more than 1.8B Force.com transactions, 1B ExactTarget Marketing Cloud transactions, and 5B Heroku web requests daily. We use a wide range of open source technologies in our platform architecture and contribute back to many projects. We’ve been lucky to contribute to projects such as HBase, Pig, Ruby, Postgres, Maven, Solr, buildpacks and more. Working with these innovative projects helps us delight our customers with our service.

What does the future bring?

We love using open source, we love contributing to open source, and we love to make connections between our services and open source projects. Tell us what open source projects you are working on, and how we can connect or get involved.