Solving Java Memory Regressions with Zero Overhead and High Accuracy

Customer trust is Salesforce’s highest priority. Our customers trust us with their data and trust that our software platform will perform reliably. They also expect our applications and architecture to deliver the fastest, most responsive user experience. That’s why performance is at the top of our priorities.

The Salesforce Performance Engineering team is tasked with ensuring that the platform and SaaS applications perform at the highest level. Our team conducts extensive performance tests continually. We monitor and analyze the results and resolve any regressions that are found. Even a degradation of a few percentage points in performance is not allowed to go into production.

The performance testing is done in the form of workloads. A workload is a repeatable load test consisting of a set of user requests that exercise specific features or functionalities (Apex cache, Visualforce pages, or Chatter feeds for example). A given workload is run periodically, usually daily, on the latest code version at that time. We achieve repeatability and high accuracy of the test through full automation of the run, data collection, and data analysis. The performance engineering team relies mostly on open source tools (e.g., JMeter for generating load) and tools developed in-house (e.g., test automation orchestration, data collection, and results processing).

The code of core application servers is built with performance in mind. It is extensively instrumented to provide various performance metrics that supply extremely valuable information, especially for monitoring production health and troubleshooting incidents. This information is recorded in the server logs and is collected and analyzed. Besides the server logs, our test automation collects and analyzes system performance metrics provided by OS (e.g. CPU utilization) and JVM (e.g. garbage collection logs). Collected data are aggregated into a set of workload performance parameters that are closely watched from test to test. When a degradation (also called a regression) in performance of existing functionality or metric is observed, the performance team opens an investigation into the regression and drives it to full resolution.

Memory Allocations Heavily Influence Application Performance

Application performance depends on many factors, including the architecture of the system, the algorithms used to achieve a given functionality, the efficiency of the code and database queries, the cache system, the database, and so on. Among these factors, object allocations play an important role in the performance of a Java application, or of any other application running on a VM that manages application memory. An increase in the number and/or size of allocated objects may require more operations by the application code. Also, a higher object allocation rate usually increases the overhead of memory management by the host VM.

Therefore, object allocations and JVM heap performance are among the key metrics closely watched in the internal test workloads run by the Performance Engineering team. They are also closely monitored in production. A memory regression in a Java application usually results in an increase in the number of garbage collections (GCs) and their duration. Thus, basic GC statistics available through JVM logs (which are always turned on in our application using the -XX:+PrintGCDetails JVM flag) can be used to monitor and detect Java memory regressions.

Solving Memory Regressions in a Complex Application

While detecting memory regressions is a relatively easy task, finding the root cause of the increase in memory allocations is usually a very hard problem to tackle. A number of commercial and open source tools exist that aim to help in solving this problem. Commercial tools like YourKit can track object allocations by instrumenting the bytecode of the application. Instrumentation is done by an agent attached to the JVM at startup. Another approach to solving memory regressions is to take heap dumps while the application is running and inspect them later with tools like Eclipse Memory Analyzer (MAT), YourKit, etc. In addition, ThreadMXBean, which is part of the JMX MBeans, can be used to estimate the amount of memory allocated in a given transaction. This usually requires embedding an instrumentation framework in the application that collects this data and records it in the logs for every transaction executed by the application.

Our experience shows that, unfortunately, for a complex Java application none of these methods guarantees solving memory allocation regressions. This is even more evident for minor regressions, where the difference in memory allocation between the compared code versions is less pronounced. Here is why.

Java profilers that track memory allocations through bytecode instrumentation are usually not suitable for complex applications because of the very significant (10x-100x) overhead they add at runtime. The more complex the application is, and hence the more objects it allocates during the run, the larger the overhead. The overhead may be reduced by filtering out allocations of non-interesting classes. However, that requires significant research into the profiled code to identify the classes that might be causing the memory regression. Even then, the overhead may be significant. Also, due to the complexity of the code, there is a chance that the classes whose objects caused the regression are deemed non-interesting and are therefore filtered out.

Another common approach, analyzing memory regressions with heap dumps taken at random moments of application runtime, rarely reveals the source of memory regression. This method may succeed when the regression is caused by a new class type introduced in the regressed version of the code. Therefore, comparing objects’ class names found in the two heap dumps taken on different versions of the application might reveal the new class as the source of the memory regression. However, in a general case this approach of comparing heap dumps taken at random moments is rarely successful, especially when the difference in memory allocations is relatively small and the class names didn’t change. Content of a heap dump, even if taken at the same relative time during the workload run, highly depends on what and how many transactions were run, and how much time before the heap dump was taken a GC event happened. Hence, it is almost impossible to do apples-to-apples comparisons of two heap dumps taken during the workload run on different versions of the code.

Finally, the ThreadMXBean approach also has limitations because it only provides the amount of memory allocated by a given thread. It does not tell us what type of objects were allocated and in what part of the code. We use this approach, along with GC logs, as the first line of defense against memory regressions. Using ThreadMXBean, our code tracks the amount of memory allocated by every transaction and records it in the server log along with other performance parameters for the given transaction. Then, using log mining tools like Splunk, we analyze this data to pinpoint transactions that are the source of regression in a given workload.
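To illustrate the ThreadMXBean measurement, here is a minimal sketch, not our production instrumentation. It assumes a HotSpot-based JVM, where the platform ThreadMXBean also implements the com.sun.management.ThreadMXBean extension; the class name AllocationMeter and the sample "transaction" are hypothetical.

```java
import java.lang.management.ManagementFactory;

final class AllocationMeter {
    // HotSpot-specific extension of the standard ThreadMXBean (an assumption
    // about the JVM in use; the cast fails on JVMs that do not provide it).
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Bytes allocated by the current thread while the transaction runs. */
    static long allocatedBytes(Runnable transaction) {
        long threadId = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(threadId);
        transaction.run();
        return THREADS.getThreadAllocatedBytes(threadId) - before;
    }

    public static void main(String[] args) {
        // Hypothetical transaction: keep a reference so the array is really allocated.
        double[][] sink = new double[1][];
        long bytes = allocatedBytes(() -> sink[0] = new double[100 * 1024]);
        System.out.println("Allocated bytes in transaction: " + bytes);
    }
}
```

The result is a per-thread byte count only, which is exactly the limitation described above: it identifies the expensive transaction, not the objects or the code that allocated them.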

Collecting Information About Allocated Objects with Zero Overhead

If these approaches do not help, what can we do to solve memory allocation regressions? Let’s summarize what we need to succeed:

We want to record all object allocations and associated parameters (e.g. object type and amount of bytes allocated) during the run of our workload.
We do not want our workload to be impaired by overhead either caused by bytecode instrumentation, or overhead associated with collecting the data about allocated objects by the profiling agent.
We need to ensure high accuracy of the results collected in test experiments.

At first sight, collecting all object allocations with no overhead might seem to be impossible to achieve, as any additional work requires some extra effort to accomplish the work. Unless… memory allocations are already recorded for us for free! Yes, all objects are allocated on the heap. Hence, a heap dump of all objects (including unreachable ones) would contain all objects allocated by the application since the last garbage collection cycle.

However, as we noted earlier, simply taking a heap dump even at a predefined time instance does not help much in solving memory regression. Why? Because it may not contain all objects produced by a set of transactions we would like to analyze. It may not contain all objects because a GC event might have removed some of them and we are not in control of when JVM triggers a GC – unless we can implicitly control it!

How can we avoid a GC in a Java application? Relatively easily: run it on a host with an infinite amount of memory (RAM)! Nowadays, it is not uncommon for a developer to own a workstation with 64 GB of RAM. For the limited set of transactions we want to investigate, that amount of RAM can practically be considered infinite. All we need to do is properly configure the JVM heap parameters to avoid a GC during the execution of these transactions.

As an example, consider the throughput (parallel) collector of the JVM. For the purpose of this discussion, it is enough to know that the JVM heap is split into a Young generation and an Old generation, and that the Young generation is further partitioned into an Eden space and two Survivor spaces. The size of the heap and its generations is specified by JVM parameters: -Xmn, the size of the Young generation; -Xms, the initial total size of the heap; and -Xmx, the maximum total size of the heap. The sizes of the Eden and Survivor spaces can be controlled by the JVM flag -XX:SurvivorRatio, which defines the ratio between Eden and each Survivor space. Thus, setting -XX:SurvivorRatio to a large number forces the JVM to dedicate most of the Young generation to Eden space.

For the throughput collector, there are two types of GC events: minor and full garbage collections. A minor collection is triggered when there is no available space in Eden to allocate new objects. A full GC is triggered when the space in the Old generation is not enough to accommodate objects promoted from the Young generation. As we noted earlier, our goal is to avoid any GC, as it removes allocated objects from the heap. Thus, for the throughput garbage collector, we need to:

Set the maximum and initial size of the heap for the tested application to the largest value that does not exceed the RAM of the host where the test is run.
Set the maximum size of Young generation very close to the maximum total size of heap.
Make Eden space occupy most of the Young generation by setting -XX:SurvivorRatio to a large number (e.g., 20).
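The effect of -XX:SurvivorRatio on the Eden size can be estimated with simple arithmetic. The sketch below assumes the Young generation consists of Eden plus two equally sized Survivor spaces, with Eden being SurvivorRatio times larger than one Survivor space. The class name EdenSizing is hypothetical, and the sizes the JVM actually picks may differ somewhat because of alignment and adaptive sizing.

```java
final class EdenSizing {
    /**
     * Approximate Eden size for a given Young generation size.
     * young = eden + 2 * survivor and eden = ratio * survivor,
     * hence eden = young * ratio / (ratio + 2).
     */
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes * survivorRatio / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long young = 3900 * mb; // matches -Xmn3900m from the Appendix
        for (int ratio : new int[] {8, 20, 100}) {
            System.out.printf("SurvivorRatio=%d -> Eden is roughly %d MB%n",
                    ratio, edenBytes(young, ratio) / mb);
        }
    }
}
```

Under these assumptions, a 3900 MB Young generation with a ratio of 20 leaves roughly 3545 MB for Eden, and each increase of the ratio moves a little more of the Survivor spaces into Eden.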

Note that even if the application we need to test uses a different garbage collector in production (which may also have a different heap layout), nothing prevents us from switching to the throughput collector and using the configuration described above. Changing the garbage collector should not affect object allocations in the application.

Once we configure heap for the size that allows the largest amount of object allocations without garbage collection, we need to focus on the workload where we observed the memory allocation regression:

Identify workload transactions that contribute the largest amount to the memory regression.
Limit memory allocations performed during the workload run to the amount available in the young generation of the heap we configured earlier.

To identify transactions that contribute the most to the memory regression, we use information from the application logs collected with the help of ThreadMXBean. If this information is not available in your application, transactions that contribute the most to the regression can be identified by re-running the workload with only a single type of transaction for each type involved in the workload, and then using GC logs to identify which run (and thus type of transaction) shows the largest regression in memory allocations.

Having identified the type of transaction that is the largest contributor to the memory regression, we need to modify the workload to run only this type of transaction. We will then run it for a duration and at a frequency that allow all objects allocated during the run to fit in the Eden space we configured.
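A rough budget for the run can be derived from the Eden size and the average number of bytes a transaction allocates (for example, as reported by ThreadMXBean). The following is only a sketch: the class name EdenBudget, the safety fraction, and the sample numbers are assumptions for illustration, not measured values.

```java
final class EdenBudget {
    /**
     * Largest number of transactions whose allocations fit into the given
     * share of Eden. safetyFraction (0..1] leaves headroom for allocations
     * made by background threads.
     */
    static long maxTransactions(long edenBytes, long bytesPerTransaction, double safetyFraction) {
        if (bytesPerTransaction <= 0) {
            throw new IllegalArgumentException("bytesPerTransaction must be positive");
        }
        if (safetyFraction <= 0.0 || safetyFraction > 1.0) {
            throw new IllegalArgumentException("safetyFraction must be in (0, 1]");
        }
        return (long) (edenBytes * safetyFraction) / bytesPerTransaction;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long eden = 3545 * mb;          // hypothetical Eden size
        long perTransaction = 8 * mb;   // hypothetical average allocation per transaction
        System.out.println("Transactions that fit: "
                + maxTransactions(eden, perTransaction, 0.8));
    }
}
```

The estimate is only a starting point; the heap occupancy observed during the run remains the final check that no GC occurred.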

Runtime Phases of a Typical Workload

A typical workload has a startup (warm-up) phase followed by a steady state phase. In the startup phase the application is initialized, the cache is warmed up, and so on. In the steady state phase the load to the system under test does not vary much and the transaction response time has a relatively low variance. We are interested in investigating the steady state part of the workload, not the transactions and memory allocations happening in the startup phase.


Figure 1. Heap occupancy and different phases of a running Java app

Phases of running Java applications can be identified by monitoring the JVM heap, which can be done with the help of various tools (e.g., JConsole). JConsole can be attached to a running JVM process and allows us to see how heap occupancy changes over time as the workload runs. The startup phase and steady state phase generally differ by the memory allocation rate (see Figure 1). Looking at the chart of heap occupancy during the steady state we can determine how long our workload can run between consecutive GC events. If that duration is too short, the load and hence memory allocations can be reduced by tuning parameters of the workload (e.g., number of concurrent threads in JMeter that implement the transaction we choose to run). To increase accuracy of the test results, we recommend running a fixed number of transactions during the workload, as opposed to a fixed duration workload.
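Heap occupancy can also be sampled from inside the JVM, for instance by a monitoring thread. The sketch below is an in-process alternative to JConsole, not part of our tooling. It uses the standard MemoryPoolMXBean API and assumes the Eden pool can be recognized by a name containing "Eden" (for example "PS Eden Space" with the throughput collector); the class name EdenMonitor is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class EdenMonitor {
    /** Current usage of the Eden pool, or null if no pool name contains "Eden". */
    static MemoryUsage edenUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Eden")) {
                return pool.getUsage();
            }
        }
        return null; // some collectors do not expose an Eden pool
    }

    public static void main(String[] args) {
        MemoryUsage eden = edenUsage();
        if (eden == null) {
            System.out.println("No Eden pool found for this collector");
        } else {
            long mb = 1024L * 1024L;
            System.out.println("Eden used: " + eden.getUsed() / mb + " MB, committed: "
                    + eden.getCommitted() / mb + " MB");
        }
    }
}
```

Sampling this value periodically gives the same sawtooth picture as the JConsole chart: a sudden drop in Eden usage indicates that a GC has run.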

We now know what transactions we should run and how many of them we can run before a GC happens, which takes us to the algorithm for recording all memory allocations without any overhead:

Start the profiled application with heap parameters enabling the largest size of Eden space for the given hardware (RAM).
Attach a heap monitoring tool (JConsole) to the JVM process.
Monitor heap occupancy and identify when application startup phase is over.
If a warm-up of the application is required, run the workload with the set of transactions causing the largest memory regression for a sufficient amount of time.
After the warm-up period is over, trigger full GC for the JVM process to clean the heap from objects allocated during startup/warm-up phases.
Run the workload while monitoring heap occupancy to make sure no GC happens during the run of the workload.
Record objects allocated during the steady state of the workload by triggering heap dump for the JVM process (jmap -dump:format=b,file=hd.hprof <pid>) that will include all objects (live and unreachable).
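The last steps of the algorithm can also be driven from code when the workload runs inside the profiled JVM. The sketch below is one possible way to do it under that assumption. It uses the standard GarbageCollectorMXBean counters to confirm that no GC ran, and the HotSpot-specific HotSpotDiagnosticMXBean.dumpHeap with live=false so that unreachable objects are kept in the dump. The class name GcFreeRun and the empty placeholder workload are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcFreeRun {
    /** Total number of collections reported by all collectors since JVM start. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the count is undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Runs the workload and reports whether it finished without any GC. */
    static boolean ranWithoutGc(Runnable workload) {
        long before = totalCollections();
        workload.run();
        return totalCollections() == before;
    }

    /** Dumps live and unreachable objects; fails if the file already exists. */
    static void dumpAllObjects(String file) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(file, false); // live=false keeps unreachable objects
    }

    public static void main(String[] args) throws IOException {
        System.gc(); // request a full GC after warm-up (a request, not a guarantee)
        boolean clean = ranWithoutGc(() -> { /* steady-state workload goes here */ });
        if (!clean) {
            System.out.println("A GC ran during the workload; reduce the load and retry.");
        } else if (args.length > 0) {
            dumpAllObjects(args[0]); // e.g. hd.hprof
        } else {
            System.out.println("No GC during the run; pass a file name to write the heap dump.");
        }
    }
}
```

Checking the GC counters before and after the run replaces watching the heap chart by eye, and a run that fails the check is simply discarded and repeated with a lighter load.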

This algorithm needs to be repeated for the baseline and regressed versions of the code. The two heap dumps produced as the result of the algorithm will contain all objects allocated during the steady state phase of the run. Thus, comparison of the content of the heap dumps with tools like Eclipse MAT or YourKit should reveal differences in the type/number/size of objects allocated.

Example

Consider an example of profiling a simple Java application that allocates double arrays wrapped in the class MyDoubleArray (see the Appendix for the code of this example). In the base run, the application allocates 2000 MyDoubleArray objects, each containing a double[102400] array. The regressed version of the code allocates the same number of MyDoubleArray objects, but with the double array larger by 100 elements (i.e., double[102500]). A comparison of the heap dumps produced by the algorithm we discussed shows this difference in the amount of memory occupied by double[] (see Figure 2). Note that other objects have the same count and occupy the same space in the two heap dumps.

Figure 2. Object allocations recorded in the heap dump of base run (left) and regressed run (right).
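The comparison itself boils down to a per-class difference between two histograms. As a sketch of that idea, the code below diffs two maps from class name to bytes. The class name HistogramDiff and the map-based input are hypothetical; the double[] figures count array payload only (8 bytes per element, headers ignored) for the 2000 arrays of the example, and the MyDoubleArray figure is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class HistogramDiff {
    /** Per-class change in bytes (regressed minus baseline); unchanged classes are omitted. */
    static Map<String, Long> diff(Map<String, Long> baseline, Map<String, Long> regressed) {
        Map<String, Long> delta = new TreeMap<>();
        for (Map.Entry<String, Long> entry : regressed.entrySet()) {
            long change = entry.getValue() - baseline.getOrDefault(entry.getKey(), 0L);
            if (change != 0) {
                delta.put(entry.getKey(), change);
            }
        }
        // Classes that disappeared in the regressed run show up as negative deltas.
        for (Map.Entry<String, Long> entry : baseline.entrySet()) {
            if (!regressed.containsKey(entry.getKey()) && entry.getValue() != 0) {
                delta.put(entry.getKey(), -entry.getValue());
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        Map<String, Long> baseline = new HashMap<>();
        baseline.put("double[]", 2000L * 102400 * 8);  // payload of the base run
        baseline.put("MyDoubleArray", 2000L * 16);     // placeholder shallow size
        Map<String, Long> regressed = new HashMap<>();
        regressed.put("double[]", 2000L * 102500 * 8); // payload of the regressed run
        regressed.put("MyDoubleArray", 2000L * 16);
        System.out.println(diff(baseline, regressed));
    }
}
```

For the example, the only entry that survives the diff is double[], with 2000 x 100 x 8 = 1,600,000 additional bytes of payload.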

Key Takeaways

Object allocations in a Java application heavily influence its performance, hence avoiding regressions in this area is very important.
Existing tools and methods, even when systematically combined, may not help in solving Java memory regressions; this is even more true for complex applications and smaller regressions.
The presented algorithm records object allocations with zero overhead and high accuracy, and its results can be used to identify the root cause of a memory regression in a complex Java application.

Appendix

Java heap parameters provided at startup of the ObjectAllocator application:

-XX:+UseParallelOldGC -Xmn3900m -Xms4396m -Xmx4396m -XX:SurvivorRatio=20 -XX:MaxPermSize=100m

Code of ObjectAllocator application used in the example

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Package-private so that both classes can live in ObjectAllocator.java.
class MyDoubleArray {
	private final double[] arr;

	MyDoubleArray(int size) {
		arr = new double[size];
	}
}

public class ObjectAllocator {
	// Allocates iterations * 10 MyDoubleArray objects. In the regressed
	// variant every array is 100 elements longer.
	private static void runAllocationsSingleThread(long iterations, int objectSize,
	                                               boolean regressed) {
		int regressedObjectSize = 100;
		int objects = 10;
		MyDoubleArray tmpObj;

		if (regressed)
			objectSize += regressedObjectSize;
		for (long i = 0; i < iterations; i++) {
			for (int j = 0; j < objects; j++) {
				tmpObj = new MyDoubleArray(objectSize);
			}
		}
	}

	public static void main(String[] args) throws IOException {
		BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
		System.out.println("Press enter to start warmup");
		in.readLine();
		// Warm-up phase.
		runAllocationsSingleThread(100, 1500 * 1024, false);
		System.out.println("Run increased memory allocations (true/false)?");
		boolean regressed = Boolean.parseBoolean(in.readLine());
		if (regressed)
			System.out.println("Running regressed application");
		else
			System.out.println("Running baseline application");
		// Measured run: 200 * 10 = 2000 objects, each holding double[102400]
		// (baseline) or double[102500] (regressed).
		runAllocationsSingleThread(200, 100 * 1024, regressed);
		System.out.println("main run completed, press enter");
		// Keep the JVM alive until the heap dump has been taken.
		in.readLine();
	}
}