How do you know that the IaaS platform (be it on-prem or in the cloud) is serving its workload well? If you depend on complaints, then you are running complaint-based operations.
Changing from reactive to proactive is unfortunately a complex undertaking, especially in a large organisation where there are many roles and people. It requires transformation and resetting behaviour. It is not easy to get customers to agree on an SLA when you’ve promised them “good” for years.
So what can you do?
- Measure your actual performance
- Improve it if the reality is not what you expect from a decent IaaS. If there are no complaints yet, even better, as it means you do not have to panic and rush the improvement.
- Get buy-in from your management on the SLA. Establish this as your Internal SLA.
The diagram below shows how the Internal SLA represents the intermediate step. The Formal SLA is typically less stringent, giving you a buffer.
You do not need the following in this intermediate step:
- Class of Service. You don’t have to have Gold, Silver, etc. tiers. You can keep your mixed workloads in the same clusters or datastores. This means in vR Ops, you do not need to create a policy. Just use the base, active policy. This simplifies adoption.
- Per VM SLA. You’re measuring the infra for now. If you mix workloads, then a per VM SLA is impossible to achieve.
Since you’re measuring just the infra, it’s a lot easier to implement. You start by establishing the KPI of a cluster. Below is my recommendation.
The above includes Disk metrics, so there is no need to look at the datastores separately. You just need to focus on the cluster object. This also removes complexity when a cluster has multiple datastores, and vice versa.
A Cluster KPI is simply the average of the above metrics. Now that you have a single metric for a cluster that combines its Performance and Utilization, you can aggregate across clusters. I’d recommend you count the number of clusters whose KPI falls into the red zone. If your environment is healthy, the count will be 0. You can repeat this check every 5 minutes. In a performing IaaS, you will get a flat line as shown below. Since it’s a trend line, you can see the performance over time, which shows you whether there is a pattern. Since you’re counting the number of clusters in the red zone, this metric scales across thousands of clusters. You certainly expect all your clusters to perform!
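To make the arithmetic concrete, here is a minimal Python sketch of the idea, not the vR Ops super metric itself. It assumes each metric has already been normalised to a 0-100 score where higher is healthier, and it uses a red-zone threshold of 25. Both the threshold and the metric names are my assumptions for illustration; substitute whatever your KPI actually uses.

```python
from statistics import mean

# Assumed red-zone threshold (hypothetical value, adjust to your own zones).
RED_THRESHOLD = 25.0


def cluster_kpi(metric_scores):
    """Return the plain average of a cluster's per-metric scores.

    Each score is assumed to be normalised to 0-100, higher meaning healthier.
    """
    return mean(metric_scores.values())


def count_red_clusters(clusters, threshold=RED_THRESHOLD):
    """Count the clusters whose KPI falls below the red-zone threshold."""
    return sum(1 for scores in clusters.values() if cluster_kpi(scores) < threshold)


# Illustrative data only; the metric names are placeholders.
clusters = {
    "cluster-a": {"cpu": 90.0, "memory": 80.0, "disk": 100.0, "network": 90.0},
    "cluster-b": {"cpu": 10.0, "memory": 20.0, "disk": 30.0, "network": 20.0},
}

red_count = count_red_clusters(clusters)
print(f"Clusters in the red zone: {red_count}")
```

Run on a 5-minute schedule, the single number this produces is what you would plot as the trend line. A healthy environment keeps it at 0.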
The table complements the line chart. It lists each cluster, sorted with the worst-performing cluster first. You can click on a cluster and drill down into its KPI, if you customize the cluster summary page.
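The ordering of that table is easy to reproduce outside the product if you want to sanity-check it. The sketch below assumes you already have one KPI value per cluster and that a lower value means worse performance. The cluster names and numbers are made up.

```python
def worst_first(cluster_kpis):
    """Return (cluster, kpi) pairs sorted so the lowest KPI comes first."""
    return sorted(cluster_kpis.items(), key=lambda item: item[1])


# Illustrative values only.
kpis = {"cluster-a": 90.0, "cluster-b": 20.0, "cluster-c": 55.0}

table = worst_first(kpis)
for name, kpi in table:
    print(f"{name:<12} {kpi:5.1f}")
```

This is only an illustration. In vR Ops the dashboard widgets do the calculation and the sorting for you.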
The above requires a super metric, my favourite feature in vR Ops. Here is what it looks like:
I removed the disk metrics as I want to avoid double counting when used with vSAN. I also removed one network metric.
Below is the result. I ran a preview on one of the clusters to show you what it looks like.
Hope it helps you take the first step toward proactive operations.