To illustrate how SnappyFlow can be used to analyze application performance, we will consider a sample application and walk through the traces. The sample application provides users the ability to provision computing instances in different public cloud providers.
Application topology is shown in the figure below:
The execution flow for the various user requests is summarized in the table below:
| Method | API | Execution flow |
|---|---|---|
| POST | `/api/provision` | Request is processed by provision-manager. Task details are updated in the DB. The manager sends a task execution request to the worker pool via Kafka and responds with 202 Accepted. One of the workers reads the request from Kafka, accepts the task, and sends an "in-progress" notification to the manager via Kafka while it continues to process the request. The manager reads the notification from the notification topic and updates the task status. When the worker completes the provisioning task, it sends a "completed" notification via Kafka, and the manager updates the task status again. |
| | | Manager updates the task details synchronously and sends a 200 OK response. |
| | | Manager deletes the task details synchronously and sends a 200 OK response. |
| | | Manager accepts the request and forwards it to a worker. The worker responds to the forwarded request with the required data, and the manager responds to the original request using that data. |
| GET | `/api/workers/reset-counters` | Manager accepts the request and forwards it to a worker. The worker responds to the forwarded request with the required data, and the manager responds to the original request using that data. |
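The asynchronous provisioning flow in the first row can be sketched in Java (the application's language). This is a minimal, illustrative sketch, not the actual application code: the class and member names are ours, and an in-memory map and queue stand in for the real database and Kafka topic.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;

// Sketch of the asynchronous provisioning flow; names are hypothetical.
class ProvisionManager {
    final Map<String, String> taskDb = new HashMap<>();     // task id -> status (stand-in DB)
    final Queue<String> requestTopic = new ArrayDeque<>();  // stand-in Kafka request topic

    /** Handles POST /api/provision: persist the task, enqueue it, return 202. */
    int handleProvision(String requestBody) {
        String taskId = UUID.randomUUID().toString();
        taskDb.put(taskId, "pending");   // task details updated in the DB
        requestTopic.add(taskId);        // execution request sent to the worker pool
        return 202;                      // respond with 202 Accepted immediately
    }

    /** Worker notifications ("in-progress", "completed") update the task status. */
    void onWorkerNotification(String taskId, String status) {
        taskDb.put(taskId, status);
    }
}
```

The point of the design is that the HTTP response returns as soon as the task is recorded and enqueued; the actual provisioning happens later in a worker, with status updates flowing back over the notification topic.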
In the example application, when a client request is received by provision-manager, a trace context is created and attached to that request. As the request traverses processing functions such as the Kafka producer, Kafka consumer, DB queries, and HTTP processing, the trace context is extracted and maintained. The trace data produced at each stage of execution contains this trace context, which makes it possible to stitch the events together and visualize the execution path.
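Conceptually, this propagation can be sketched as follows. sfTrace does this automatically through runtime instrumentation, so the `TraceContext` class and the `trace-context` header name below are purely illustrative assumptions: the producer injects the context into message headers, and the consumer extracts it and continues the same trace with a new span.

```java
import java.util.Map;
import java.util.UUID;

// Illustrative sketch of trace-context propagation over message headers.
// The class and header name are assumptions, not sfTrace internals.
class TraceContext {
    final String traceId;
    final String spanId;

    TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    /** Create a fresh root context when a client request first arrives. */
    static TraceContext newRoot() {
        return new TraceContext(UUID.randomUUID().toString(), UUID.randomUUID().toString());
    }

    /** Producer side: inject the context into outgoing message headers. */
    void inject(Map<String, String> headers) {
        headers.put("trace-context", traceId + ":" + spanId);
    }

    /** Consumer side: extract the context and start a child span. */
    static TraceContext extract(Map<String, String> headers) {
        String[] parts = headers.get("trace-context").split(":", 2);
        // Same trace id, new span id: later, all events sharing the trace id
        // can be stitched together into one execution path.
        return new TraceContext(parts[0], UUID.randomUUID().toString());
    }
}
```

Because every stage carries the same trace id, the backend can reassemble the full path even when it crosses process boundaries via Kafka.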
The figure above shows a flame graph of the initial execution of "POST /api/provision". The sequence diagram shows the three steps:
All traces were captured by sfTrace without any modification to the application code. sfTrace instrumented the Java classes at runtime to introduce the trace context and record the start and end of each action.
A flame graph can also help identify gaps in execution. Significant gaps between transactions or spans are highlighted in the flame graph and may indicate delays caused by resource constraints or other scheduling issues.
The journey of a trace through various services is summarized in the Transactions view. These services may be invoked through HTTP/RPC requests, a message broker, or similar inter-process communication frameworks. The transaction summary view lists the related transactions in chronological order, along with the duration spent in each transaction, performance statistics from previous runs, and the percentage of the overall duration contributed by each related transaction.
If a related transaction takes longer than the average or 95th-percentile values from previous runs, that transaction could be causing a delay and should be analyzed in detail using the Span view.
The percentage duration spent by each transaction provides an easy way to identify hot spots in execution.
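The percentage contribution is simple arithmetic: each transaction's duration divided by the total duration of the trace. A small sketch (the class and method names are ours, not SnappyFlow's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the percentage-duration computation used to spot hot spots.
class HotSpots {
    /** Returns each transaction's duration as a percentage of the total. */
    static Map<String, Double> percentages(Map<String, Long> durationsMs) {
        long total = durationsMs.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : durationsMs.entrySet()) {
            out.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return out;
    }
}
```

A transaction contributing, say, 75% of the overall duration is an immediate candidate for deeper span-level analysis.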
Each transaction is broken down into spans, and each span represents a step in the execution path the transaction took to complete.
Spans are of different types: DB, Cache, Template, External, and Custom. The span list can be filtered by span type for analysis.
Similar to the transaction view, the span view provides the following metrics for each span:
sfTrace collects traces from all the services it monitors. The Real-time view provides insight into the overall performance of the application, and filters can be used to drill down to problem transactions.
Response time is often used as a measure of performance. By selecting a specific transaction, it is possible to see the variation in its response time range. Once a transaction with a very high variance between minimum and maximum response times is identified, the response time filter slider can be used to select a range. All traces with response times within the selected range are shown and can then be analyzed using the flame graph, transaction list, and span list.
The result of a transaction, or the response code of an HTTP request, is another important parameter that is monitored. If the result is FAILURE or the HTTP response code indicates a failure, those traces can be filtered separately and analyzed.
The Aggregate view captures a summary of all traces. For each unique trace, it shows the number of occurrences, the average number of spans captured, the number of errors, and the average, median, 95th-percentile, and 99th-percentile response times.
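SnappyFlow's exact percentile method is not documented here; the sketch below uses the simple nearest-rank definition to show how such aggregate statistics can be derived from raw response-time samples (the class name is illustrative):

```java
import java.util.Arrays;

// Sketch of aggregate-view statistics over response-time samples.
class TraceStats {
    /** Nearest-rank percentile over a sorted copy of the samples. */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    /** Mean response time; the median is simply the 50th percentile. */
    static double average(double[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }
}
```

With 100 samples of 1 ms through 100 ms, the 95th percentile is 95 ms and the 99th is 99 ms, which is why these tails surface slow outliers that the average alone would hide.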
By clicking a specific trace in the Aggregate view, the user can view its real-time statistics and drill down to specific instances of the trace for troubleshooting.
The transaction map is a visual summary of the interactions between different application components. During its execution, a transaction depends on databases, message buses, and other application services. The transaction map shows the connectivity between components, along with statistics on query rate and average, minimum, and maximum response times. This high-level view helps you drill down to individual transaction flows and identify anomalies.
Here is an example of how tracing can help troubleshoot issues quickly. Consider the following flame graph of a trace, which shows all spans encountered during its execution. All spans are DB operations to Postgres, Redis, and Elasticsearch. A closer view of the flame graph shows many Elasticsearch DB spans with longer durations.
To analyze this further, the Span view of the trace is selected. In the span view shown below, notice that 95.3% of the total transaction duration is spent in DB spans.
To view the spans contributing to this high percentage of time, the spans can be sorted based on duration percentage.
The sorted span list clearly shows the eight spans contributing to this high percentage of execution time. Examining the stack trace for an affected DB span reveals the actual Elasticsearch query, and analyzing the query confirms the behavior. Developers can now use this information to optimize the query and its execution.
sfTrace captures exceptions generated during a transaction's execution along with the stack trace, which makes debugging fast and easy.
Using the filter, select traces which show Failure.
Select one of the traces to view the flame graph.
The flame graph shows that the transaction encountered an error: an exception occurred almost at the end of execution.
In the Transaction view, the result status shows Failure.
The span list shows all 7 spans; the error may have occurred in any one of them.
The error tab shows the details of this exception with the complete stack trace. No further debugging is needed; the root cause is detailed here: one of the DB spans querying Elasticsearch was using an index that does not exist in Elasticsearch.
Using the response time filter, it was observed that some traces in a service were taking longer to complete.
Viewing the flame graph for such a trace showed two potential issues. The parent transaction of the trace invokes another service, and that service contributes part of the execution time.
The Transaction view showed the time spent executing the related service (41.31% of the total duration). It also highlighted that the child service took more time compared to its previous median execution time (see the yellow highlight), and that the total duration was beyond the previous 99th-percentile value.
In the Span list, selecting the child transaction shows the spans associated with it. Many of these spans showed higher execution times compared to prior runs; the child transaction had 93 spans in all.
To see which spans contribute to the higher execution times, the list is sorted on the % time spent column (IN(%)). Observe that many spans' runtimes exceed the previous 95th- or 99th-percentile values.
Click on the affected span to view its details, which show the DB query being executed along with the full statement. Analyzing each highlighted span helps identify the delays and optimize the code for further performance improvement.
From this blog series, we hope you can appreciate the power of tracing in performance troubleshooting.
SnappyFlow is designed to help organizations monitor a variety of applications and physical and virtual infrastructure, including Kubernetes pods. Built-in features such as ready-to-use dashboards, tracing, synthetic monitoring, and alerting help you monitor your infrastructure with complete application context and troubleshoot problems with remarkable speed. SnappyFlow customers have been able to reduce troubleshooting times from weeks to hours and optimize their overall running costs.
To experience SnappyFlow in action, get in touch with us or sign up for a Free Trial.