This post is part of the Operationalize Your World program. Do read it first so you get the context.
You provide IaaS to your customers. A cluster, be it a VMware vSphere Cluster or a Microsoft Hyper-V Cluster, is a key component of this service offering. In the hyper-converged era, your cluster does storage too. You take time to architect it, making sure all best practices are considered. You also consider performance and ensure it’s not a slow system.
Why is it then, when a VM owner complains that her VM is slow and she blames your infrastructure, you start troubleshooting? Doesn’t that show your lack of confidence in your own creation? If your cluster is indeed fast, why can’t you just show it and be done in 60 seconds?
Worth pondering, isn’t it? 😉
It’s hard to reach a formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂
You need to be able to show something like this.
In the above chart, there is a threshold. It defines the acceptable level of performance. It quantifies exactly what you mean when you promise “fast”. It is your Performance SLA. Review this if you need more details.
You assure them it will be fast, and you’ve backed it up with measurable metrics. You prove that with the 2nd line. That’s the actual performance. Fast or not is no longer debatable.
You measure performance every 5 minutes, not every hour. In a 30-day month, that is 12 x 24 x 30 = 8640 proofs. Having that many data points backing you up helps in showing that you’re doing your job.
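The arithmetic is easy to verify. Here is a minimal sketch, assuming a 30-day month and one sample every 5 minutes (the variable names are mine, for illustration only):

```python
# Count how many 5-minute samples are collected in a 30-day month.
SAMPLE_INTERVAL_MINUTES = 5

samples_per_hour = 60 // SAMPLE_INTERVAL_MINUTES  # 12 samples per hour
samples_per_day = samples_per_hour * 24           # 288 samples per day
samples_per_month = samples_per_day * 30          # 8640 samples per month

print(samples_per_month)
```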
Now that you’ve got the Performance SLA, how do you implement it in vRealize Operations?
I’ll take disk latency as an example, as it’s easy to understand.
The chart below shows the disk latency of 6 VMs, from 9:00 am until 9:35 am. What do you spot?
The average is good. It is mostly below 5 ms.
The worst is bad. It is hovering around 20 ms. It is 4x higher than the average, indicating that at least one VM is hit. The storage subsystem is unable to serve all VMs. It’s struggling to deliver.
Let’s plot a line along the worst (highest) disk latency. The bold red line is the maximum among the disk latencies of all the VMs. We call this Max (VM Disk Latency) in the cluster.
A cluster typically has a lot more than 6 VMs. It’s common to see >100 VMs. Plotting >100 lines will make the chart unreadable. Plus, at this juncture, you’re interested in the big picture first. You want to know if the cluster is performing fast.
This is the power of a super metric. It tracks the maximum among all VMs, creating its own metric as a result. You lose the information on which VM contributed the value at any given time, as the super metric is made of >1 VM.
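To make the idea concrete, here is a minimal Python sketch of what Max (VM Disk Latency) computes. It is not vRealize Operations super metric syntax. The VM names and latency values are made up, and `cluster_max_and_avg` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical disk latency in ms: one list per VM, one value per 5-minute sample.
latency_ms = {
    "vm-01": [2.0, 3.0, 2.5, 3.0],
    "vm-02": [1.5, 2.0, 2.0, 2.5],
    "vm-03": [3.0, 19.0, 4.0, 3.5],
    "vm-04": [2.5, 3.0, 21.0, 3.0],
}

def cluster_max_and_avg(samples_by_vm):
    """Return one (max, average) pair per sample interval across all VMs."""
    # zip(*...) regroups the per-VM lists into per-interval tuples.
    per_interval = zip(*samples_by_vm.values())
    return [(max(values), mean(values)) for values in per_interval]

series = cluster_max_and_avg(latency_ms)
for worst, average in series:
    print(f"max={worst:.1f} ms  avg={average:.2f} ms")
```

In this sample data the maximum comes from vm-03 in one interval and vm-04 in the next, yet the resulting series carries no VM name. That is the information you give up in exchange for a single readable line.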
The next chart has all the details removed. We just have the Maximum and the Average. It’s clear now that the max is much higher than average.
We added 3 dotted lines in the above chart. They are the 3 possible outcomes. If your Maximum is:
- below the threshold, then you are good. The cluster is serving all its VMs well.
- near the threshold, then your capacity is full. Do not add more VMs.
- above the threshold, then your cluster is overloaded. Move VMs to reduce demand before the VM Owners complain.
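The three outcomes can be sketched as a simple check. This is an illustration only: the `classify` function is hypothetical, and treating "near" as 80% or more of the SLA is my assumption, not a number from vRealize Operations.

```python
def classify(max_latency_ms, sla_ms, near_fraction=0.8):
    """Compare the cluster's Max (VM Disk Latency) with the Performance SLA.

    near_fraction is an assumed cut-off: values at or above this fraction
    of the SLA, but not over it, are treated as "near".
    """
    if max_latency_ms > sla_ms:
        return "above"  # overloaded: move VMs out to reduce demand
    if max_latency_ms >= near_fraction * sla_ms:
        return "near"   # capacity is full: do not add more VMs
    return "below"      # all VMs are being served well

for value in (2.0, 4.5, 20.0):
    print(value, classify(value, sla_ms=5.0))
```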
Can you see the importance of the Performance SLA?
It’s there to protect your job. Without the line, your reputation is at risk. Say you’ve been delivering Disk Latency at <1 ms on your all-flash array. Everyone is happy. Of course! 🙂
You then perform storage maintenance for just 1 hour. During that period, disk latency goes up to 4 ms. It is still a respectable number. In fact, it’s a pretty damn good number. But you get a complaint. It happens to coincide with the time you did the maintenance.
Can you guess who is responsible for the slowness experienced by the business?
You bet. Your fault 🙁
But if you have established a Performance SLA, you’re protected. Say you promise 5 ms. You will be able to say “Yes, I knew it would go up as we’re doing maintenance. I’ve catered for this in my design. I knew we could still deliver as per agreed SLA.”
Let’s now show a real example. This is what it actually looks like in vR Ops 6.4.
Notice the Maximum is >10x higher than the average, and the average is very stable. Once the Cluster is unable to cope, you’d see a pattern like this. Almost all VMs are served well, but 1-2 are not. The maximum is high because there is always at least 1 VM that isn’t served well.
Only when the Cluster is unable to serve ~50% of the VMs will the average become high too.
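A quick calculation shows why the average hides the problem. This is a sketch with made-up numbers: I assume 100 VMs, 2 ms for a VM that is served well and 20 ms for one that is not. `cluster_average` is a hypothetical helper.

```python
def cluster_average(total_vms, degraded_vms, healthy_ms=2.0, degraded_ms=20.0):
    """Average latency when some VMs are degraded and the rest are healthy."""
    healthy_vms = total_vms - degraded_vms
    return (healthy_vms * healthy_ms + degraded_vms * degraded_ms) / total_vms

few = cluster_average(100, 2)    # only 2 of 100 VMs are hit
half = cluster_average(100, 50)  # half of the VMs are hit

print(f"2 VMs degraded:  avg={few:.2f} ms, max=20.0 ms")
print(f"50 VMs degraded: avg={half:.2f} ms, max=20.0 ms")
```

With 2 VMs hit, the average barely moves while the maximum is already at 20 ms. The average only climbs once a large share of the VMs is affected.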
BTW, do you notice the metric names differ?
- The Max is a super metric.
- The Average is a regular metric.
This is because the metrics at higher-level objects (e.g. Cluster, Host) are all averages. None of them is the real peak. Review this “when is a peak not a true peak” article.
The above is for Disk. IaaS consists of providing the following as a service:
- CPU
- RAM
- Disk
- Network
Hence we need to display 4 line charts, showing that each service is delivered well.
As every Service Tier performs differently, you need to show it per service tier. A Gold Tier delivers faster performance than a Silver Tier, but if its latency is higher than its SLA, it’s still not performing. Performance is relative to what you promise.
Since VMs move around in a cluster due to DRS and HA, we need to track at Cluster level. Tracking at Resource Pool level is operationally challenging. Do not mix service tiers in one cluster, as Tier 3 performance can impact Tier 1. The only way you can protect the higher tier is with Reservation, which has its own operational complications.
Once I know what to display, I’d normally sketch it on a whiteboard, often with customers. It helps me to think clearly.
This is what the dashboard looks like. It starts with a list of clusters. Selecting a cluster will automatically show its performance. It shows CPU, RAM and Disk. Network dropped packets should be 0 at all times, hence they are not shown. You can track them at data center level, not cluster.
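Here is a small sketch of "performance is relative to what you promise". The tier thresholds and the sample values are hypothetical, and `sla_compliance` is an illustrative helper, not a vRealize Operations function.

```python
# Hypothetical Performance SLA per tier, in ms of disk latency.
SLA_MS = {"gold": 5.0, "silver": 10.0}

def sla_compliance(max_latency_samples, sla_ms):
    """Percentage of samples where the cluster maximum stayed within the SLA."""
    within = sum(1 for value in max_latency_samples if value <= sla_ms)
    return 100.0 * within / len(max_latency_samples)

# Made-up Max (VM Disk Latency) samples for one cluster in each tier.
gold_samples = [3.0, 4.0, 6.0, 4.5]
silver_samples = [7.0, 8.0, 9.0, 6.0]

gold_pct = sla_compliance(gold_samples, SLA_MS["gold"])
silver_pct = sla_compliance(silver_samples, SLA_MS["silver"])
print(f"gold: {gold_pct:.0f}%  silver: {silver_pct:.0f}%")
```

In this made-up data the Gold cluster is faster than the Silver cluster in every sample, yet it is the one that misses its SLA.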