Every release, there is a period of time between our internal release freeze and the sandbox release. I like to think of this time as Hammertime. When it’s Hammertime, we do a lot of work to make sure we have not introduced regressions between releases.
In this window, we run a process we call “The Hammer”. The Hammer means taking every single Apex test that you or anyone else has created and running it twice. We run the test once in the existing version of our service – the one you’re using today in production – and once in the release candidate version. We compare the results to identify any unexpected functionality changes between releases.
The next time you find yourself wondering, “why do they make us have code coverage if they don’t force us to have assertions in the tests?”, you now have your answer! Even if your tests don’t follow good practice and never actually assert anything, simply executing the code in your tests gives us what we need to run the hammer.
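That said, assertions are what make a test valuable to you, not just to us. As a quick illustration (the InvoiceRollup class and its sum method are invented for this example), compare a coverage-only test with one that actually checks behavior:

@isTest
private class InvoiceRollupTest {
    @isTest
    static void coverageOnly() {
        // Executes the code path – enough for coverage, and enough for the hammer...
        InvoiceRollup.sum(new List<Decimal>{ 1.5, 2.5 });
    }

    @isTest
    static void assertsBehavior() {
        // ...but only an assertion protects *you* if the behavior ever changes.
        Decimal total = InvoiceRollup.sum(new List<Decimal>{ 1.5, 2.5 });
        System.assertEquals(4.0, total, 'Rollup should sum the line amounts');
    }
}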
Apex is not alone in running hammer tests. Visualforce, packaging, Trialforce, and dashboards all go through a similar process. The Apex process is the most involved, but the general principle of finding potential issues before you do is applied to all of these.
The Hammer Process
The soundbite is simple: We run your tests against the old version and the new version of the system, and we make sure nothing changes. More precisely, we capture the result log from each test run against the existing system and compare it with the log from the same test run against the release candidate. The two runs should behave identically. If they fail, they should both fail, and at the same place. If they pass, they should both pass. The log output should be more-or-less identical.
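To make that rule concrete, here is a deliberately simplified sketch in Apex – this is not our actual tooling, and the RunResult shape and normalize() scrubbing are invented for illustration:

public class HammerComparator {
    // Hypothetical shape of one test run's outcome.
    public class RunResult {
        public Boolean passed;
        public String failureLocation;
        public String log;
    }

    public static Boolean behaviorMatches(RunResult oldRun, RunResult newRun) {
        // Rule 1: pass/pass or fail/fail.
        if (oldRun.passed != newRun.passed) return false;
        // Rule 2: if both failed, they must fail at the same place.
        if (!oldRun.passed && oldRun.failureLocation != newRun.failureLocation) return false;
        // Rule 3: logs should match once volatile values are normalized away.
        return normalize(oldRun.log) == normalize(newRun.log);
    }

    static String normalize(String log) {
        // Stand-in for the real scrubbing: blank out record ids and timestamps.
        return log.replaceAll('[a-zA-Z0-9]{15,18}', '<id>')
                  .replaceAll('\\d{4}-\\d{2}-\\d{2}T[0-9:.]+Z?', '<time>');
    }
}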
That would be as easy as the soundbite if there were 60 Apex tests out there to run. In fact, there are over 60 million tests out there to run. Twice. We have to run all of these tests, compare the results, identify regressions, and fix them, in a few short weeks before the sandbox release.
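Some back-of-the-envelope math (assuming roughly the three-week window mentioned below) shows the scale:

    60,000,000 tests × 2 runs = 120,000,000 test executions
    120,000,000 executions ÷ (21 days × 86,400 seconds/day) ≈ 66 executions per second, sustained, around the clock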
Consider, for a moment, the challenge that this presents.
We need to set up the application stack to run in two different versions. We need to store the output of all of the tests so the runs can be compared. We need to run every single test, twice. We need to compare all 60 million result pairs in their entirety to look for differences. And we can’t touch your data, ever, so all of this has to run in our secure data centers.
As you form a mental picture of all the different things that go into this massive effort, you can understand how it got nicknamed “The Hammer”.
Maintaining security around your org and your org’s data is critical, and it poses a few challenges. We can’t inspect your debug logs, so we have to use the test results as a proxy. You might have an Apex statement saying system.debug(anObject.superSecret__c + ' is the secret code'); and you might not want me to see that when I’m looking for regressions. In addition, many tests still use the older seeAllData=true pattern, so the hammer needs to run with access to your actual data – but we can’t log in to your org or look at that actual data.
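For illustration only (the SuperSecret__c field and the scenario are invented), an old-style test looks something like this – and it cannot run without real records:

@isTest
private class OldStyleTest {
    // seeAllData=true: the query below reads live production records,
    // so the hammer has to run this test alongside your actual data.
    @isTest(seeAllData=true)
    static void readsProductionData() {
        Account a = [SELECT Id, SuperSecret__c FROM Account LIMIT 1];
        // Debug output like this can contain values that no human
        // at Salesforce should ever see.
        System.debug(a.SuperSecret__c + ' is the secret code');
        System.assertNotEquals(null, a.Id, 'Expected at least one account');
    }
}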
Thus, this whole process has to run in our data centers, where security is tight and nothing gets in or out. We borrow some unused space in the databases to make all of this happen, and we have to do it without interfering with your day-to-day processing. Multiple teams are involved in making this all operate, from our core infrastructure teams to our security teams to the Apex team. Once test execution is complete, the log differences are analyzed by an automated application that bundles up the potential issues (super-secret data scrubbed) and sends them over the wall to us.
The next challenge is triaging the “potential” issues. Every real bug is going to trip up more than one org’s tests, so duplicates abound. Some of these get automatically filtered; some do not. There are also quite a few “red herring” items that do not represent actual regressions but still show up as differences between the old and new test runs. Thus begins the manual-plus-automated phase, where the results are inspected by hand and fed back through a slightly smarter analyzer for several cycles, until we’ve distilled the true findings of the entire operation.
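To give a flavor of the duplicate-collapsing step, here is a hypothetical sketch – the RunDiff shape is invented, and the real analyzer is far more sophisticated:

public class DiffTriage {
    // Hypothetical record of one old-vs-new difference.
    public class RunDiff {
        public String failingClass;
        public String failureLocation;
        public String errorType;
    }

    public static Map<String, List<RunDiff>> groupDuplicates(List<RunDiff> allDiffs) {
        Map<String, List<RunDiff>> byFingerprint = new Map<String, List<RunDiff>>();
        for (RunDiff d : allDiffs) {
            // One real bug trips many orgs' tests; a shared fingerprint
            // collapses those hits into a single candidate issue.
            String key = d.failingClass + ':' + d.failureLocation + ':' + d.errorType;
            if (!byFingerprint.containsKey(key)) {
                byFingerprint.put(key, new List<RunDiff>());
            }
            byFingerprint.get(key).add(d);
        }
        // Each key is now one candidate bug; its list size hints at impact.
        return byFingerprint;
    }
}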
After using borrowed database space to run 60 million tests, and after multiple automated analysis passes, we are left with several very tired people and about 30 unique bugs requiring a fix. We have three weeks to sort through those before the sandbox release shows up, so that you do not have one of your tests fail after the upgrade.
Our Commitment To You
This complex juggling act happens three times a year. We make a commitment to you that your customizations will continue to run as you expect, no matter how much we change the system. In all my years in enterprise software, I have never heard of such a commitment. I’ve never heard of such a smooth upgrade path: you go to bed on Friday, you wake up on Saturday, and your system has been upgraded. I still have flashbacks to lengthy, painful upgrade projects at other companies I have worked for – those were never easy times. To save you from flashbacks like these, we perform the complex ballet described here (a description that still makes it sound easier than it is) three times a year.
You can help us fulfill this commitment by using data silo. Tests that use the data silo pattern (seeAllData=false) create their own data and don’t need access to the production data stores at all. These tests are much faster for us to run, and much more reliable. We don’t need to worry about reading your debug statements, because they output fabricated data. We don’t need to worry about having a copy of your production data, because data silo tests do not rely on that data to succeed. Org-independent tests make The Hammer smile.
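Here is the earlier example reworked in the data silo style (the DiscountService class is invented); notice that it fabricates every record it touches:

@isTest
private class DataSiloTest {
    // seeAllData=false: the test builds its own records, so it never
    // needs – and never touches – production data.
    @isTest(seeAllData=false)
    static void calculatesDiscountOnFabricatedData() {
        Account a = new Account(Name = 'Hammer Test Account');
        insert a;

        Test.startTest();
        Decimal discount = DiscountService.calculate(a.Id); // invented service
        Test.stopTest();

        System.assert(discount == 0, 'A brand-new account gets no discount');
    }
}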
Nobody’s Perfect, But We Try To Get Close
Despite the significant effort we put into hammer tests, we don’t catch every single problem. Our analysis tools can incorrectly ignore actual failures, and the team will sometimes do the same. There are areas of custom org code that are left uncovered by Apex tests (tsk tsk). There are callouts to web services we cannot predict. There are environmental issues that tests do not reveal. When we run The Hammer, however, we do catch a great number of the issues that our normal testing missed.
For the 60 million tests we ran this past release, we uncovered 25 potential regressions in time to fix them prior to release. On the surface, that sounds like a lot of work for a tiny result. But each of those 25 issues was going to impact someone in some way, potentially in some critical way, and such issues often span multiple customers. The result is quite large from the point of view of every customer who never had to know they were saved from a big headache.
Considering that we are running six billion transactions a month, I believe we can lay claim to six sigma ninja status, in an actual statistical six-sigma sense. A few issues still slip past the hammer and make it to production, but as you can see from the process described here, we try our hardest to help you avoid them.
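As a rough back-of-the-envelope illustration (the ten-escape figure is an assumption for the sake of the math, not a measured number): six sigma allows about 3.4 defects per million opportunities, and

    10 escaped regressions ÷ 6,000,000,000 monthly transactions ≈ 0.0017 defects per million

which is far below that bar.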