Profiling is a powerful tool to get inside the mind of your application and understand the root cause of bottlenecks, and it can cut troubleshooting time by an order of magnitude. However, the workflow for setting it up, operating it, and analyzing the results is usually not trivial. Setting up heap profiling in continuous mode can impact the performance of the application, whereas setting it up in on-demand mode is arduous. Users typically juggle multiple tools to trigger heap dumps, parse them, and analyze them.
To address these issues, we’ve built a CPU and Memory Profiling feature for Java right into SnappyFlow to provide a seamless monitoring experience for SREs and performance engineers. It is easy to instrument profiling into an application, trigger profiling on demand, and analyze the profiles right then and there – while remaining in the context of the application and easily accessing metrics, logs, and traces in an integrated workflow.
Memory profiling is the process of analyzing the memory used by a Java process at a given point in time. To understand memory profiling better, let us look at how the Java Virtual Machine (which runs the Java process) handles memory. The heap is where the JVM stores referenced objects as they are created; the heap grows (up to the predefined maximum heap size) and shrinks during runtime. A heap dump is a snapshot of the Java heap at a given point in time. This snapshot contains information on the different objects and classes and their individual memory usage at the moment the dump was triggered. A heap dump can be triggered manually, automated on OutOfMemory errors, or requested on demand by a heap analysis tool. Analyzing a heap dump helps developers pinpoint specific issues in the code, such as large data structures or unused (but still referenced) objects occupying memory.
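As an illustration of how an on-demand heap dump can be triggered (independently of any particular tool), HotSpot JVMs expose a diagnostic MXBean for exactly this; the class and file names below are illustrative:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the HotSpot diagnostic MXBean exposed by the running JVM
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Snapshot only live (reachable) objects into an .hprof file;
        // recent JDKs require the file name to end in ".hprof"
        String file = "demo-heap.hprof";
        diagnostics.dumpHeap(file, true);
        System.out.println("Heap dump written: "
                + new java.io.File(file).length() + " bytes");
    }
}
```

The same dump can be taken from the command line with `jmap -dump:live,format=b,file=heap.hprof <pid>`, which is how it is typically done against a running production process.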
CPU profiling provides thread-level CPU usage and helps identify the threads and code paths that consume the most CPU time.
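To sketch what thread-level CPU accounting looks like at the JVM level (independent of any particular profiler), the standard `ThreadMXBean` reports per-thread CPU time; the class name below is illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuDemo {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            if (info == null) continue; // thread exited in the meantime
            // CPU time in nanoseconds; -1 if measurement is unsupported/disabled
            long cpuNanos = threads.getThreadCpuTime(id);
            System.out.printf("%-24s %12d ns%n", info.getThreadName(), cpuNanos);
        }
    }
}
```

A sampling profiler essentially collects this kind of data, plus stack traces, at a regular interval to attribute CPU time to methods.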
Memory management in Java is handled by the JVM garbage collector – a big reason for Java’s growth and popularity. While garbage collection is generally efficient and automatic, it is quite common for applications to suffer from crippling memory leaks and out-of-memory errors. Over-reliance on the garbage collector, poor handling of object references, and misconfigured heap sizes are typical causes of memory leaks.
An OutOfMemoryError occurs when the allocated heap is not large enough to hold all the referenced objects, or when a bug in the code makes some objects grow far too large. It is important to note that garbage collection frees only objects that are no longer referenced; it will not clear objects that are still in use. This means nothing stops a reachable object from growing until it exhausts the heap and triggers a runtime error.
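A minimal, hypothetical illustration of this pattern: a static collection keeps every added object reachable, so the garbage collector can never reclaim them and the heap grows until the limit is hit. The class name and sizes are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    // A static collection stays reachable for the lifetime of the class,
    // so everything added to it is never eligible for garbage collection
    private static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest() {
        // Per-request data cached "temporarily" but never evicted
        CACHE.add(new byte[1024 * 1024]); // ~1 MB retained per call
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20; i++) handleRequest();
        // With a small enough -Xmx, a loop like this eventually ends in
        // java.lang.OutOfMemoryError: Java heap space
        System.out.println("Retained: " + CACHE.size() + " MB");
    }
}
```

In a heap dump of such a process, the `ArrayList` and its backing `byte[]` entries would show up as the dominant retained memory, pointing straight at the leaking code path.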
A very large heap size can seemingly stave off OutOfMemory issues, at the expense of higher memory requirements at the infrastructure level, but it does not give a clear picture of what is causing the error. The JVM also allows fine-tuning garbage collection to suit the application – the number of parallel collection threads, parallel garbage collection for scavenges/full collections, the old/new generation size ratio, and the eden/survivor space ratio. A larger heap increases the execution time of each garbage collection but decreases the number of collections; a smaller heap decreases execution time but increases the number of collections.
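The tuning knobs mentioned above correspond to standard HotSpot command-line flags; the values and the `app.jar` name below are placeholders, not recommendations:

```shell
# Illustrative HotSpot sizing/GC flags (placeholder values):
#   -Xms / -Xmx              initial and maximum heap size
#   -XX:NewRatio             old-to-young generation size ratio
#   -XX:SurvivorRatio        eden-to-survivor space ratio
#   -XX:ParallelGCThreads    number of parallel GC worker threads
java -Xms512m -Xmx2g -XX:NewRatio=2 -XX:SurvivorRatio=8 \
     -XX:ParallelGCThreads=4 -jar app.jar
```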
During the app development and testing phases, OutOfMemory errors occur quite often, and when they do, they can be identified and plugged given the luxury of time. In production, however, these errors tend to surface only after prolonged application run time, and once they occur, the issue needs to be identified and plugged as soon as possible. In general, memory leaks are very gradual and go unnoticed during the dev/testing phase.
In a microservices architecture with multiple applications running in parallel, the overall performance is determined by the aggregate performance of every single application. Thus, it becomes important to drill down to the individual process level to troubleshoot performance issues.
There are many standalone tools for heap dump analysis and profiling, such as VisualVM, JProfiler, and Eclipse Memory Analyzer. While these tools are powerful in their own right, they have some major shortcomings:
In a typical SRE use case, the troubleshooting workflow starts with APM tracing data to identify bottlenecks. Once a process is identified as slow or stuck, a heap dump analysis and profiling of that process helps us drill down. In such scenarios, the ability to quickly shift between tracing, heap dump, and profiling data can significantly improve troubleshooting times.