This situation presented another challenge. The system was designed to scale up as and when necessary. The key questions were: was the application scaling up efficiently, and was the scaling driven by increased data load or by inherent application issues? The large volume of data arriving in short bursts added further problems at the network layer: lost packets, connection timeouts, and response time degradation. When a pod failed and restarted, it was impossible to tell whether the cause was overload or an application issue.
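As an illustration of this kind of on-demand scaling, the sketch below creates a CPU-based HorizontalPodAutoscaler with the official Kubernetes Python client. The deployment name, namespace, and thresholds are illustrative assumptions rather than details of the actual deployment described here.

```python
from kubernetes import client, config

# Load local kubeconfig; inside a cluster, use config.load_incluster_config().
config.load_kube_config()

# Scale the (hypothetical) "ingest-api" Deployment between 2 and 20 replicas,
# targeting roughly 70% average CPU utilization across pods.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="ingest-api-hpa", namespace="ingest"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="ingest-api"
        ),
        min_replicas=2,
        max_replicas=20,
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ingest", body=hpa
)
```

With a policy like this the replica count follows load, which is why attribution matters: a pod restart can stem from a genuine overload the autoscaler has not yet absorbed, or from an application-level fault.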
SnappyFlow provided an application-centric view of the overall system and a simplified view of application performance and metrics data. It helped the team understand system load, how that load was balanced across pods, and resource utilization, and it provided the insights needed to fine-tune individual services for predictable performance, cost, and scaling.
Reduction in Storage Costs
Reduction in troubleshooting times: from weeks to hours
2TB of raw data parsed into 75GB of structured data and stored in Elasticsearch per day
2TB of data per day
3,000 requests at peak (each transferring 250MB to 3GB of data)
The overall system was fine-tuned to handle large point loads from thousands of devices
The overall data pipeline was streamlined
Right-sizing of pods for proper scaling (see the sketch after this list)
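The right-sizing mentioned above typically comes down to setting resource requests and limits that match observed utilization. A minimal sketch, again using the Kubernetes Python client; the deployment name, namespace, and resource values are assumptions for illustration.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Strategic-merge patch that updates only the named container's resources.
# Requests reflect typical usage; limits cap a single pod during bursts.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "ingest-api",
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "1Gi"},
                            "limits": {"cpu": "1", "memory": "2Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="ingest-api", namespace="ingest", body=patch)
```

Requests sized from observed utilization give the scheduler accurate information for placement, while limits keep a single noisy pod from degrading its neighbours during bursts.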
A hierarchical view from stack to application to pods to containers
Linking metrics across application, Kubernetes, and logs to deliver actionable insights
Powerful Kubernetes, Node.js/Express, and Java monitoring with APM
Huge reduction in EFS storage costs by streamlining the data pipeline
Use of tiered S3 storage to save costs (see the sketch after this list)
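Tiered S3 is typically implemented with a lifecycle policy that moves older objects to cheaper storage classes. A minimal sketch using boto3; the bucket name, prefix, transition days, and expiry are assumptions for illustration, not the actual policy used here.

```python
import boto3

s3 = boto3.client("s3")

# Age parsed log data through cheaper tiers and expire it after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-parsed-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-log-data",
                "Filter": {"Prefix": "parsed/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```

Data that is rarely queried after the first few weeks then ages out of the expensive hot tier automatically, with no change to the ingest path.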