This 3 part blog touches upon the basics of tracing, how you can use tracing data to troubleshoot and setting up tracing functionality in SnappyFlow
Automation of business processes has radically changed the application landscape. Traditionally, applications used to carry out a basic business procedure and store in a database the results of that process. Multiple such applications were built, each working independently in implementing a simple process. Collaboration between such applications used to take place offline. As the need to provide real-time and online services increased, applications became more complex and there was an increase in real-time interaction between applications. Applications evolved from simple sequential execution model to distributed concurrent execution model.
Tracking down problems with these distributed, asynchronous and concurrent applications is hard
Traditionally, failures were tracked by their symptoms - alerts based on time series data, or monitoring error events from logs. Once a symptom is found, identifying the root cause is done by analyzing logs and/or by correlating time series metrics from different applications and instances. Because of the asynchronous and concurrent nature of execution, it is very difficult to trace the exact sequence of events which led to the failure by these traditional means.
Tracing stitches together or aggregates the actions performed by applications to service a request. This aggregated data is then presented in a chronologically organized manner for analysis. A context is created when a request is received by an application and the request is tracked, using this context, through all the execution paths in the application(s). At each execution path, the entry and exit times along with other useful information are logged. The traces thus collected are analyzed using powerful visualizations to identify the hot spots and bottlenecks.
Trace logs contain spans, transactions and traces. Trace is a collection of transactions and spans that have a common starting point or root. Spans contain information about the activity in an execution path. It has the measurement of time from start to end of the activity and also includes parent child relationship with other spans. Transaction is a special span, which is captured at the entry of a service like http/rpc processor, message broker, cron job etc.
The trace data is used to quickly identify the root-cause of the failure. In asynchronous concurrent applications, failures or delays occur in one of the many execution paths. To effectively, detect these failures or delays, powerful visualization and analysis tools are required. In order to trouble shoot a failure user needs to know:
SnappyFlow supports distributed tracing compliant with Opentracing standard. Tracing allows users to visualize the sequence of steps a transaction (whether API or non-API such as a Celery job) takes during its execution. This analysis is extremely powerful and allows pinpointing the source of problems such as abnormal time being spent on an execution step or identifying point of failure in a transaction.