Cloud Performance Monitoring
Cloud performance monitoring tracks, in real time, the performance of infrastructure and applications hosted in the cloud to ensure that services run at optimum levels. Proactive monitoring is important for identifying bottlenecks, resolving issues, and preventing outages before they affect end users.
Cloud providers offer their own monitoring tools, such as CloudWatch from AWS, Azure Monitor from Azure, and Cloud Monitoring (formerly Stackdriver) from GCP. These tools are useful for monitoring services provided by the cloud platform, but for application-level performance monitoring it is important to use a third-party APM tool such as New Relic.
Case Study on Cloud Monitoring Service for one of the major US-based Customers
Tech Vedika developed a platform for verifying the credentials of third-party vendors, which was later migrated to AWS Cloud. We currently offer cloud-managed services to the client, and one of the most important aspects is performance monitoring. The client wanted the application to work smoothly, so our job was to ensure that the infrastructure and application ran without interruptions or outages.
- Infrastructure Monitoring – We used a combination of OpsRamp and CloudWatch
- Application Monitoring – We used New Relic
- Database Performance Monitoring – We used OpsRamp and New Relic
Key initiatives taken by Tech Vedika
- Proactive Monitoring – We set up a 24x7 cloud monitoring team for incident tracking and management. Each cloud service has its own set of metrics to monitor, and each metric must be configured with threshold values that generate warning and critical alerts. These triggers help us address an issue before it hampers the application. Instead of configuring each metric separately, we created a template of standard metrics and applied it to each service:
- In the case of EC2 service, we configured different metrics to monitor the performance of CPU, memory, hard disk, system state, network in/out, etc.
- For services such as AWS RDS, we monitored CPU, memory, free storage space, slow-running queries, database connections, etc.
- For application load balancers, we monitored 4XX and 5XX errors and the healthy and unhealthy status of the target machines on the ALB.
- Infrastructure Monitoring – We proposed and onboarded the OpsRamp tool for infrastructure monitoring. The tool enabled the monitoring of services including EC2, RDS, application load balancers, and application URLs. OpsRamp is an ITOM tool with embedded lightweight ITSM that helps with incident, problem, and change management in addition to infrastructure monitoring. We also used cloud-native monitoring tools such as CloudWatch to monitor some services.
- Application Monitoring – New Relic was used to monitor metrics such as web transaction time, response time, error rate, Apdex score, etc. New Relic was integrated with OpsRamp so that any alert generated from New Relic can be sent to OpsRamp. OpsRamp was used as a centralized monitoring tool by the Cloud Monitoring team to monitor all types of incidents.
- Monitor resource utilization – We monitored the utilization trend of existing resources, observed the usage, and right-sized the instances
- New Service onboarding – We ran performance tests to identify the right size for each new service according to the client's specific project requirements.
- Identified long-running queries and carried out a root-cause analysis to fix the issues
- 4XX and 5XX issues monitored – The alerts helped us identify website-related issues. To mitigate 5XX errors from Amazon SNS, we added retry logic in the application for all failed requests
- Performance Testing – We tested various metrics to check different aspects of performance:
- Stress Test – To check the stability, reliability, and responsiveness during a high workload
- Load Test – To check system performance when multiple users log in simultaneously
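The threshold template described under Proactive Monitoring can be pictured with a small sketch. This is not the OpsRamp or CloudWatch configuration used in the project; the metric names, the threshold numbers, and the `Threshold`, `classify`, and `evaluate` helpers are all hypothetical and only illustrate how one reusable template can turn metric readings into warning and critical alerts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric (values are illustrative)."""
    metric: str
    warning: float
    critical: float
    higher_is_worse: bool = True  # False for metrics such as free disk space


# Hypothetical template for an EC2-style instance; the numbers are placeholders,
# not the thresholds used in the case study.
EC2_TEMPLATE = [
    Threshold("cpu_percent", warning=75.0, critical=90.0),
    Threshold("memory_percent", warning=80.0, critical=95.0),
    Threshold("disk_free_percent", warning=20.0, critical=10.0, higher_is_worse=False),
]


def classify(threshold: Threshold, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single metric reading."""
    if threshold.higher_is_worse:
        if value >= threshold.critical:
            return "critical"
        if value >= threshold.warning:
            return "warning"
    else:
        if value <= threshold.critical:
            return "critical"
        if value <= threshold.warning:
            return "warning"
    return "ok"


def evaluate(template, readings):
    """Apply every threshold in the template to a dict of metric readings.

    Metrics that are missing from the readings are skipped rather than guessed.
    """
    return {
        t.metric: classify(t, readings[t.metric])
        for t in template
        if t.metric in readings
    }


if __name__ == "__main__":
    sample = {"cpu_percent": 82.0, "memory_percent": 60.0, "disk_free_percent": 8.0}
    print(evaluate(EC2_TEMPLATE, sample))
```

Defining the template once and applying it to every instance of a service keeps thresholds consistent and avoids configuring each metric by hand.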
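The retry logic mentioned for 5XX errors could look roughly like the following. This is a minimal sketch under stated assumptions, not the application's actual code: `send` is a hypothetical zero-argument callable that returns an HTTP-style status code and stands in for whatever client call the application makes, and the attempt count and delays are placeholder values.

```python
import random
import time


def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-5XX status or attempts run out.

    Returns a (status, attempts_used) tuple. The sleep function is injectable
    so the backoff can be skipped in tests.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    status = None
    for attempt in range(1, max_attempts + 1):
        status = send()
        if not 500 <= status <= 599:
            # Success or a client-side (4XX) error: retrying will not help.
            return status, attempt
        if attempt < max_attempts:
            # Exponential backoff with a little jitter so that many clients
            # do not retry at exactly the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    return status, max_attempts


if __name__ == "__main__":
    # Simulated service that fails twice with 5XX and then succeeds.
    responses = iter([503, 500, 200])
    print(call_with_retry(lambda: next(responses), sleep=lambda _: None))
```

Only 5XX responses are retried here, because 4XX responses usually indicate a problem with the request itself. AWS SDKs generally ship their own configurable retry behavior, so in practice a wrapper like this may only be needed around calls the SDK does not already cover.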
Migration to the cloud doesn’t guarantee performance. Application performance is not the provider’s responsibility, and sometimes the built-in tools are not sufficient. An experienced managed service provider can monitor the performance of the cloud infrastructure and ensure that the application and services operate effectively without outages or other issues.
Cloud performance monitoring helps you gain visibility into the cloud infrastructure, including application and database performance, and gives you metric-based insights. You also gain insights into how the application performs over time. You can leverage this information to avoid errors in the future and continuously optimize performance.
As part of its managed services, Tech Vedika supports customers in monitoring servers and services in multi-cloud environments such as AWS, Azure, and GCP, as well as in on-premise environments. Our cloud monitoring tool offers a single-pane, common dashboard view that helps development and support teams track key metrics and system health. The tool uses AI and ML models that are trained to correlate alerts in real time to identify the probable root cause of issues for quick resolution by Tech Vedika cloud engineering teams.