Predictive performance management of a petabyte-scale Kubernetes application for a global hi-tech major

While collecting device diagnostics is a fairly routine workload, the client faced some unique challenges:
  • A large volume of diagnostics data per device
  • Thousands of devices sending data in bursts
  • A massive storage requirement, running into petabytes, on Amazon EFS
Challenge

This situation presented a further issue. The system was designed to scale up as and when necessary. The key questions were: was the application scaling up efficiently? Was the scaling driven by increased data load or by inherent application issues? The large volume of data arriving in short bursts added another problem at the network layer: lost packets, connection timeouts, and response-time degradation. If a pod failed and restarted, it was impossible to attribute the problem to overload or to application issues.
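One piece of the attribution puzzle is whether a restarted container was OOM-killed or exited with an application error. Below is a minimal sketch of that check using the official Kubernetes Python client; the namespace and label selector are illustrative placeholders, not the client's actual configuration.

```python
# Minimal sketch: classify pod restarts as OOM kills vs. application errors,
# using the official Kubernetes Python client (pip install kubernetes).
# The "ingest" namespace and "app=diagnostics" selector are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("ingest", label_selector="app=diagnostics")
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated if cs.last_state else None
        if cs.restart_count and term:
            # "OOMKilled" points at memory pressure; "Error" with a
            # non-zero exit code points at an application-level failure.
            print(f"{pod.metadata.name}/{cs.name}: restarts={cs.restart_count}, "
                  f"reason={term.reason}, exit_code={term.exit_code}")
```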

How SnappyFlow Helped

SnappyFlow provided an application-centric view of the overall system and a simplified view of application performance and metrics data. It helped the team understand system load, how that load was balanced across pods, and resource utilization, and it provided the insights needed to fine-tune individual systems for predictable performance, cost, and scaling.

SnappyFlow helped to
  • Detect root causes of out-of-memory issues through better observability of application metrics, container metrics, and logs
  • Reduce the infrastructure footprint by right-sizing containers and hosts, made possible by understanding load and performance patterns (see the sketch after this list)
  • Cut data costs substantially by resolving performance bottlenecks and drastically reducing data buildup
  • Detect faulty elements and trigger support requests
  • Detect systemic patterns linked to the quality of products/versions
  • Predict failures through signature analysis
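As a rough illustration of the right-sizing step, the hypothetical sketch below derives container resource requests and limits from observed usage percentiles. The percentile choices and headroom factor are assumptions for illustration, not SnappyFlow's actual algorithm.

```python
# Hypothetical sketch of right-sizing: suggest CPU/memory requests and limits
# from observed usage samples (millicores and MiB). The percentiles and
# headroom factor are illustrative assumptions.
from statistics import quantiles

def right_size(samples_mcpu, samples_mib, headroom=1.2):
    """Suggest requests at the 50th percentile of observed usage and
    limits at the 95th percentile plus headroom."""
    p = lambda xs, q: quantiles(xs, n=100)[q - 1]
    return {
        "requests": {"cpu": f"{int(p(samples_mcpu, 50))}m",
                     "memory": f"{int(p(samples_mib, 50))}Mi"},
        "limits":   {"cpu": f"{int(p(samples_mcpu, 95) * headroom)}m",
                     "memory": f"{int(p(samples_mib, 95) * headroom)}Mi"},
    }

# Example: usage samples collected from container metrics over a day.
print(right_size([120, 180, 240, 950, 300] * 20, [256, 300, 512, 480, 700] * 20))
```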
75%

Reduction in storage costs

5X

Reduction in troubleshooting time: from weeks to hours

Daily ingest

2TB of raw data parsed down to 75GB of structured data and stored in Elasticsearch per day
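A minimal sketch of such a parse-and-index step is shown below, using the official Elasticsearch Python client; the index name, field names, and input format are illustrative assumptions about the pipeline, not its actual implementation.

```python
# Minimal sketch: parse raw diagnostics lines into structured documents and
# bulk-index them into Elasticsearch (pip install elasticsearch). The index
# name, field names, and line format are illustrative assumptions.
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def parse(line: str) -> dict:
    # Keep only the fields worth querying; dropping the raw payload is what
    # shrinks terabytes of raw input to a much smaller structured footprint.
    record = json.loads(line)
    return {"device_id": record["device"], "ts": record["timestamp"],
            "metric": record["name"], "value": record["value"]}

def actions(path: str):
    with open(path) as f:
        for line in f:
            yield {"_index": "diagnostics", "_source": parse(line)}

ok, errors = bulk(es, actions("diagnostics.log"), raise_on_error=False)
print(f"indexed={ok}, errors={len(errors)}")
```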

Archive data

2TB of data per day

Ingest rate

3,000 requests at peak, each transferring 250MB to 3GB of data

Benefits

Scalability

The overall system was fine-tuned to handle large point loads from thousands of devices

The overall data pipeline was streamlined

Pods were right-sized for proper scaling

Debuggability

A hierarchical view from stack to application to pods to containers

Metrics linked across application, Kubernetes, and logs to deliver actionable insights

Powerful APM-based monitoring of Kubernetes, Node.js/Express, and Java

Storage cost

A huge reduction in EFS storage costs by streamlining the data pipeline

Use of tiered S3 to save costs
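One common way to implement such tiering is an S3 lifecycle rule that moves objects to cheaper storage classes as they age. The sketch below uses boto3; the bucket name, prefix, and transition schedule are illustrative assumptions, not the client's actual policy.

```python
# Hypothetical sketch of S3 tiering via a lifecycle rule (pip install boto3).
# The bucket name, prefix, and transition schedule are illustrative.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="diagnostics-archive",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-aging-diagnostics",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                # Recent data stays in Standard; older data moves down-tier.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }]
    },
)
```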
