Tracing in Application Performance Troubleshooting

This three-part blog covers the basics of tracing, how to use trace data to troubleshoot, and how to set up tracing in SnappyFlow.

Why Tracing?

Automation of business processes has radically changed the application landscape. Traditionally, an application carried out a basic business procedure and stored the results of that process in a database. Many such applications were built, each independently implementing a simple process, and collaboration between them took place offline. As the need for real-time, online services grew, applications became more complex and real-time interaction between applications increased. Applications evolved from a simple sequential execution model to a distributed, concurrent execution model.

Tracking down problems with these distributed, asynchronous and concurrent applications is hard

Traditionally, failures were tracked by their symptoms - alerts based on time series data, or monitoring error events from logs. Once a symptom is found, identifying the root cause is done by analyzing logs and/or by correlating time series metrics from different applications and instances. Because of the asynchronous and concurrent nature of execution, it is very difficult to trace the exact sequence of events which led to the failure by these traditional means.

Tracing stitches together or aggregates the actions performed by applications to service a request. This aggregated data is then presented in a chronologically organized manner for analysis. A context is created when a request is received by an application and the request is tracked, using this context, through all the execution paths in the application(s). At each execution path, the entry and exit times along with other useful information are logged. The traces thus collected are analyzed using powerful visualizations to identify the hot spots and bottlenecks.

Trace logs contain spans, transactions and traces. A trace is a collection of transactions and spans that share a common starting point, or root. A span holds information about the activity in one execution path: it records the time from the start to the end of the activity and its parent-child relationships with other spans. A transaction is a special span captured at the entry point of a service, such as an HTTP or RPC processor, a message broker or a cron job.
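One way to picture these three terms is as a small data model. The sketch below is illustrative only: the class names, field names and sample values are assumptions, not the schema of any particular tracing backend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    trace_id: str             # shared by everything under the same root
    parent_id: Optional[str]  # None for the root of the trace
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


@dataclass
class Transaction(Span):
    # A transaction is a span captured at a service entry point.
    entry_type: str = "http"


def children_of(spans, parent_id):
    """Direct children of a span, in chronological order."""
    return sorted((s for s in spans if s.parent_id == parent_id),
                  key=lambda s: s.start_ms)


def render_tree(spans, parent_id=None, depth=0):
    """Return the parent-child hierarchy as indented text lines."""
    lines = []
    for s in children_of(spans, parent_id):
        kind = "transaction" if isinstance(s, Transaction) else "span"
        lines.append(f"{'  ' * depth}{s.name} [{kind}] {s.duration_ms:.0f} ms")
        lines.extend(render_tree(spans, s.span_id, depth + 1))
    return lines


# Hypothetical trace: a checkout request that calls a second service.
trace = [
    Transaction("t1", "trace-1", None, "GET /checkout", 0, 120, "http"),
    Span("s1", "trace-1", "t1", "SELECT cart", 5, 40),
    Span("s2", "trace-1", "t1", "POST payment-service", 45, 110),
    Transaction("t2", "trace-1", "s2", "POST /charge", 50, 105, "http"),
]

print("\n".join(render_tree(trace)))
```

In this example the second transaction is a child of the span that made the outgoing call, so a single trace can cover more than one service.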

Using Trace data to troubleshoot effectively

Trace data is used to quickly identify the root cause of a failure. In asynchronous, concurrent applications, failures or delays occur in one of many execution paths. Detecting them effectively requires powerful visualization and analysis tools. To troubleshoot a failure, a user needs to know:

  • Contextual view of execution – an easy to track view to understand the sequence in which the transaction execution progressed and the time taken in each step
  • Child transactions and spans – delays in child transactions can contribute to the overall delay. Prior execution times (average, median, 95th percentile, 99th percentile) help compare the current execution against previous runs
  • Time spent in each span and comparison with prior execution – it is important to know how the current span duration ranks in comparison to previous runs. Typically this is done by comparing the current duration with the average, median, 95th percentile and 99th percentile values
  • Percentage of time spent by a span with respect to overall time - this will help identify the hot spots in execution
  • Cumulative span execution time – cumulative span execution time is computed after considering the span parallelism. This value measures delay contributed by all spans to the overall delay. The gap between the cumulative span execution time and the total transaction duration gives an indication about the time the transaction is either spending in additional processing or waiting on resources - I/O wait, DB locks, compute time, event loop saturation etc
  • Stack traces – stack traces are useful to quickly identify the failing execution path and pinpoint the reason for the failure. A stack trace lists the stack frames from the point where execution failed up to the start of the application
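Several of the measures in this list are simple to compute once span timings are available. The sketch below is one possible reading, not SnappyFlow's actual computation: it assumes that "considering the span parallelism" means overlapping spans are counted once (the union of their time intervals), and all span names and durations are made up for illustration.

```python
from statistics import mean, median, quantiles


def cumulative_span_time(intervals):
    """Length of the union of (start, end) intervals, so that spans
    running in parallel are not counted twice."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def baseline(history):
    """Reference values from the durations of previous runs."""
    # 99 cut points; index 94 is the 95th percentile, index 98 the 99th.
    cuts = quantiles(history, n=100, method="inclusive")
    return {"avg": mean(history), "median": median(history),
            "p95": cuts[94], "p99": cuts[98]}


# Hypothetical transaction lasting 200 ms with three child spans (ms offsets).
transaction_ms = 200.0
spans = {"auth": (0, 30), "db.query": (20, 90), "cache.get": (100, 150)}

covered = cumulative_span_time(spans.values())
unaccounted = transaction_ms - covered  # extra processing or waiting on resources
share = {name: 100 * (end - start) / transaction_ms
         for name, (start, end) in spans.items()}

# Hypothetical durations (ms) of "db.query" in previous runs.
history = [60, 62, 65, 58, 70, 61, 64, 90, 63, 59]
reference = baseline(history)
current = spans["db.query"][1] - spans["db.query"][0]
slower_than_p95 = current > reference["p95"]

print(f"cumulative span time: {covered} ms, unaccounted: {unaccounted} ms")
print("share of transaction time (%):", share)
print("db.query slower than p95 of previous runs:", slower_than_p95)
```

Here `auth` and `db.query` overlap, so their combined contribution is smaller than the sum of their individual durations, and the remainder of the transaction time is what the gap described above is meant to expose.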

SnappyFlow supports distributed tracing compliant with the OpenTracing standard. Tracing allows users to visualize the sequence of steps a transaction (whether API or non-API, such as a Celery job) takes during its execution. This analysis is extremely powerful: it allows pinpointing the source of a problem, such as an abnormal amount of time spent on an execution step, or identifying the point of failure in a transaction.
