This post is part of the Operationalize Your World program. Do read that first to get the context.
In the previous post, I covered why over-provisioned VMs are bad. We also talked about the technique. Let’s discuss the implementation for CPU in this post. I’m using the CPU Demand counter here. For better accuracy, you should use CPU Run minus CPU Overlap. Read this for the reason. Yes, create a super metric for it.
Create a dynamic group that captures all the large VMs. Depending on your environment, you can grab the VMs with 8 or more vCPUs, or those with 6 or more. In the screenshot below, I’m using 8 vCPU.
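To make the Run minus Overlap idea concrete, here is a minimal Python sketch of the arithmetic. It is not super metric syntax. It assumes both counters are VM-level totals in milliseconds over a 20-second sample, and the function name and sample numbers are made up for illustration.

```python
def cpu_used_percent(run_ms, overlap_ms, vcpus, interval_ms=20_000):
    """Percent of the VM's configured CPU time spent running, net of overlap.

    Assumption: run_ms and overlap_ms are summed across all vCPUs of the VM
    for one sample of interval_ms milliseconds.
    """
    if vcpus <= 0 or interval_ms <= 0:
        raise ValueError("vcpus and interval_ms must be positive")
    net_ms = run_ms - overlap_ms
    return 100.0 * net_ms / (interval_ms * vcpus)


# Hypothetical 8 vCPU VM: 16,400 ms of Run, 400 ms of Overlap in 20 seconds.
print(cpu_used_percent(run_ms=16_400, overlap_ms=400, vcpus=8))
```

Dividing by the number of vCPUs keeps the result on a 0 to 100 scale regardless of VM size, which makes large VMs comparable with each other.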
What do you notice about the CPU Utilization in the following screenshot?
The Large VMs as a group are only using 7.61% at most!
That means not a single one of them used more than 8% CPU over a period of 24 hours. This is an example of severe over-provisioning.
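If it helps to see why a low group maximum is such a strong statement, here is a small Python sketch of the calculation. The VM names and sample values are made up, and the helper is only an illustration of the idea, not how the product computes it.

```python
from statistics import mean


def group_max_and_average(samples_by_vm):
    """Return (maxima, averages) across the group for every collection cycle.

    samples_by_vm maps a VM name to its CPU % samples. All lists are
    assumed to have the same length and to be aligned in time.
    """
    per_cycle = list(zip(*samples_by_vm.values()))
    maxima = [max(values) for values in per_cycle]
    averages = [mean(values) for values in per_cycle]
    return maxima, averages


# Made-up samples for two large VMs over three collection cycles.
large_vms = {
    "vm-a": [2.0, 7.61, 3.0],
    "vm-b": [4.0, 1.0, 5.0],
}
maxima, averages = group_max_and_average(large_vms)
group_peak = max(maxima)  # highest value any VM reached in the whole period
print(f"Group peak: {group_peak}%")
```

Because the group maximum takes the busiest VM at every cycle, a peak below 8% means every VM in the group stayed below 8% for the whole period.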
Once you have the super metrics for the group (the maximum and the average), you can display them in the dashboard. You can use a line chart or a View. I use a View as I do not need to show them as 2 separate charts. What do you see from the following example?
- The area marked 1 is not what you want to see. None of the Large VMs are doing any work. This means they are all oversized.
- The area marked 2 is healthier. At any given moment, at least one of the VMs is doing work. The Demand counter can go above 100% because the hypervisor performs IO (storage or network) using another core.
- The average remains low all the time. This is over a 1-month period, with 5-minute granularity. It shows that the majority of the VMs are oversized.
Now, the above is good as an overall view. But it’s missing something. Can you guess what?
Yes, it’s missing the VMs themselves. What if upper management wants to see the utilisation of all the VMs at a glance?
We can create a table that has data transformation. The table complements the line chart by listing all the VMs. From the list, you can see which VMs are the most over-provisioned, because the list is sorted. You can sort by the 5-minute peak or the 1-hour peak.
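The difference between the two peaks matters, so here is a minimal Python sketch of the table logic. It assumes the 1-hour peak means the highest hourly average of the 5-minute samples. The VM names and numbers are made up.

```python
def five_minute_peak(samples):
    """Highest single 5-minute sample."""
    return max(samples)


def one_hour_peak(samples, per_hour=12):
    """Highest hourly average (assumes 12 five-minute samples per hour)."""
    blocks = [samples[i:i + per_hour] for i in range(0, len(samples), per_hour)]
    return max(sum(block) / len(block) for block in blocks)


def rank_most_overprovisioned(samples_by_vm, peak=five_minute_peak):
    """Sort VMs by peak, lowest first, so the most oversized VM tops the list."""
    rows = [(name, peak(samples)) for name, samples in samples_by_vm.items()]
    return sorted(rows, key=lambda row: row[1])


# Two hours of made-up data: vm-a has one short spike, vm-b is steady.
cpu = {
    "vm-a": [10.0] * 11 + [90.0] + [10.0] * 12,
    "vm-b": [30.0] * 24,
}
print(rank_most_overprovisioned(cpu, five_minute_peak))
print(rank_most_overprovisioned(cpu, one_hour_peak))
```

In this example the order flips: vm-a looks busy on the 5-minute peak because of a single spike, but on the 1-hour peak it is the more over-provisioned of the two.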
What’s the limitation of the table?
- It does not show the VM distribution. Where are the large VMs? Do they exist in a cluster where they are not supposed to exist?
This is where the Heat Map comes in.
- We group them by Cluster, then by ESXi host, so we can see where they are. You want to see them well spread across your clusters, and not concentrated on just 1 host.
- The heat map is sized by vCPU configuration. In this way, a bigger VM will have a bigger box. A 32 vCPU VM will have a box that is 4x larger than an 8 vCPU VM, so it will stand out. You can see in the following example that some large VMs are much larger than the rest. I have monster VMs in this environment.
A great feature of the heat map is color. It’s very visual. We know that both under-provisioning and over-provisioning are bad. So I set the color spectrum. I choose:
- black for 0
- red for 100
- green for 50
If I do the right sizing, I’d see mostly green. If I under-provision, I’d see mostly red. If I over-provision, which you can expect in most environments, guess what? They are black!
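To show how such a spectrum turns a utilisation number into a color, here is a small Python sketch. It assumes simple linear blending between the three stops. The actual blending in the heat map widget may differ, and the function name is made up.

```python
def heat_color(cpu_percent):
    """Map CPU % to an (R, G, B) tuple: black at 0, green at 50, red at 100.

    Assumption: colors are blended linearly between neighbouring stops.
    """
    value = max(0.0, min(100.0, cpu_percent))  # clamp to the 0..100 range
    stops = [
        (0.0, (0, 0, 0)),      # black: idle, oversized
        (50.0, (0, 255, 0)),   # green: right sized
        (100.0, (255, 0, 0)),  # red: undersized
    ]
    for (low, low_rgb), (high, high_rgb) in zip(stops, stops[1:]):
        if value <= high:
            t = (value - low) / (high - low)
            return tuple(round(a + (b - a) * t) for a, b in zip(low_rgb, high_rgb))
    return stops[-1][1]


for pct in (0, 7.61, 50, 100):
    print(pct, heat_color(pct))
```

A VM running at 7.61% lands very close to black, which is why an over-provisioned environment looks dark on the heat map.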
That’s all you need to see the overall picture.
A VM Owner does not care about your overall picture. She just cares about her VM. That means we need to drill down into an individual VM.
To facilitate that, we need a list of VMs. I use a Top-N as it enables me to sort the VMs. The good thing about Top-N is you can go back to any period in time. The heat map only allows you to see current data.
The timeline in Top-N is set to just 1 hour. There is no point setting it longer, as a longer period just averages the values and hides the peaks. What you want is already provided by the View List. Use that to pick the VM to downsize. The Top-N is merely there to drive the other widgets.
We also add a table. It shows the peak utilisation of each individual vCPU. It shows the value in seconds, following the vCenter real-time chart, where 20 seconds = 100%.
The table does not quickly answer what the CPU utilisation is 95% of the time. This is where the Forensic comes in. It shows the 95th percentile. You expect that green vertical line to be at the 80% mark, indicating the VM is correctly sized.
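For readers who want to see what the 95th percentile does to a series, here is a minimal Python sketch. It uses the nearest-rank method, which is an assumption on my part; the Forensic widget may calculate it differently. The sample data is made up.

```python
def percentile(samples, pct=95):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("samples must not be empty")
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Made-up series: 19 samples at 20% CPU and a single spike to 95%.
cpu = [20] * 19 + [95]
print("Peak:", max(cpu))
print("95th percentile:", percentile(cpu))
```

The single spike sets the peak, but the 95th percentile stays at 20. Measured against the 80% mark, this hypothetical VM would still be a downsizing candidate.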
The table and Forensic are useful. What’s their limitation?
- They are not as user friendly as a line chart.
- Plus, the VM Owner wants to see the utilization of each vCPU. This lets her see clearly whether a specific peak was genuine demand or not.
The chart is busy because the 5-minute granularity is maintained. There is no roll-up. You can zoom into any specific time of interest.
I’m only showing the first 16 vCPU. You can configure it to show the rest. My screen is not big enough to show all 16 vCPU. If yours is not big enough, or you need to show more than 16, create multiple View widgets.
How do they fit together on the dashboard? Here is what they look like.
I hope you found it useful. Happy rightsizing!