This situation presented another challenge. The system was designed to scale up as and when necessary. The key questions were: was the application scaling up efficiently, and was the scaling driven by increased data load or by inherent application issues? The large volume of data arriving in short bursts added further problems at the network layer: lost packets, connection timeouts, and response time degradation. When a pod failed and restarted, it was impossible to tell whether the cause was overload or an application issue.
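As an illustration of this kind of on-demand scaling, the sketch below creates a CPU-based HorizontalPodAutoscaler with the official Kubernetes Python client. The deployment name, namespace, and thresholds are illustrative assumptions rather than details of the actual deployment described here.

```python
from kubernetes import client, config

# Load local kubeconfig; inside a cluster, use config.load_incluster_config().
config.load_kube_config()

# Scale the (hypothetical) "ingest-api" Deployment between 2 and 20 replicas,
# targeting roughly 70% average CPU utilization across pods.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="ingest-api-hpa", namespace="ingest"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="ingest-api"
        ),
        min_replicas=2,
        max_replicas=20,
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ingest", body=hpa
)
```

With a policy like this the replica count follows load, which is why attribution matters: a pod restart can stem from a genuine overload the autoscaler has not yet absorbed, or from an application-level fault.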
SnappyFlow provided an application-centric view of the overall system and a simplified view of application performance and metrics data. It helped the team understand system load, how that load was balanced across pods, and resource utilization, and it provided the insights needed to fine-tune individual services for predictable performance, cost, and scaling.
Reduction in Storage Costs
Reduction in troubleshooting times: from weeks to hours
2TB of raw data parsed into 75GB of structured data and stored in Elasticsearch per day
2TB of data per day
3,000 requests at peak (each transferring 250MB to 3GB of data)
The overall system was fine-tuned to handle large point loads from thousands of devices
The overall data pipeline was streamlined
Right-sizing of pods for proper scaling (see the sketch after this list)
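The right-sizing mentioned above typically comes down to setting resource requests and limits that match observed utilization. A minimal sketch, again using the Kubernetes Python client; the deployment name, namespace, and resource values are assumptions for illustration.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Strategic-merge patch that updates only the named container's resources.
# Requests reflect typical usage; limits cap a single pod during bursts.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "ingest-api",
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "1Gi"},
                            "limits": {"cpu": "1", "memory": "2Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="ingest-api", namespace="ingest", body=patch)
```

Requests sized from observed utilization give the scheduler accurate information for placement, while limits keep a single noisy pod from degrading its neighbours during bursts.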
A hierarchical view from stack to application to pods to containers
Linking metrics across application, Kubernetes, and logs to deliver actionable insights
Powerful Kubernetes, Node.js/Express, and Java monitoring with APM
Huge reduction in EFS storage costs by streamlining the data pipeline
Use of tiered S3 storage to save costs (see the sketch after this list)
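Tiered S3 is typically implemented with a lifecycle policy that moves older objects to cheaper storage classes. A minimal sketch using boto3; the bucket name, prefix, transition days, and expiry are assumptions for illustration, not the actual policy used here.

```python
import boto3

s3 = boto3.client("s3")

# Age parsed log data through cheaper tiers and expire it after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-parsed-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-log-data",
                "Filter": {"Prefix": "parsed/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```

Data that is rarely queried after the first few weeks then ages out of the expensive hot tier automatically, with no change to the ingest path.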