What is Trace Retention?

Tracing is an indispensable tool for application performance management (APM) providing insights into how a certain transaction or a request performed – the services involved, the relationships between the services and the duration of each service. This is especially useful in a multi-cloud, distributed microservices environment with complex interdependent services. These data points in conjunction with logs and metrics from the entire stack provide crucial insights into the overall application performance and help debug applications and deliver a consistent end-user experience.

Amongst all observability ingest data, trace data is typically stored for an hour or two. This is because trace data by itself is humongous. For just one transaction, there will be multiple services or APIs involved and imagine an organization running thousands of business transactions an hour which translates to hundreds of millions of API calls an hour. Storing traces for all these transactions would need Tera Bytes of storage and extremely powerful compute engines for index, visualization, and search.

Why is it Required?

To strike a balance between storage/compute costs and troubleshooting ease, most organizations choose to retain only a couple of hours of trace data. What if we need historical traces? Today, modern APM tools like SnappyFlow have the advantage of intelligently and selectively retaining certain traces beyond this limit of a couple of hours. This is enabled for important API calls and certain calls which are deemed anomalous by the tool. In most troubleshooting scenarios, we do not need all the trace data. For example, a SaaS-based payment solutions provider would want to monitor more important APIs/services related to payments rather than say customer support services.

Intelligent Trace Retention with SnappyFlow

SnappyFlow by default retains traces for
HTTP requests with durations > 90th percentile (anomalous incidents)

In addition to these rules, users can specify additional rules to filter out services, transaction types, request methods, response codes and transaction duration. These rules are run every 30 minutes and all traces that satisfy these conditions are retained for future use.

With the built-in trace history retention and custom filters enabled, SREs and DevOps practitioners can look further to understand historical API performance, troubleshoot effectively and provide end-users with a consistent and delightful user experience.