Part 1 explained a new concept, where we added Availability SLA and Performance SLA as the basis of Capacity Management. In this part, I will provide the formula for each chart. We will cover Tier 1 first, followed by Tier 2 and 3.
You should be performing capacity planning at the cluster level, not at the data center or host level.
Compute (Tier 1)
To recap, we do not have over-subscription in Tier 1; we only have it in Tier 2 and 3. As a result, Tier 1 is simpler, as we are essentially following the allocation model.
Super Metric Formula: Maximum number of VMs allowed in one cluster – Number of VMs in the cluster
- Apply the Availability policy at the cluster level, since it makes more sense there. Applying it at the ESXi host level is less applicable because of HA. Yes, the chance of a host going down is higher than that of an entire cluster going down. However, HA will reboot the VMs, and VM owners may not notice if the service is not affected. On the other hand, if a cluster goes down, it’s a major issue.
- The limitation of this formula is that it assumes your cluster size does not vary. This is a fair assumption, as you should keep things consistent. If for some reason you have, say, 3 cluster sizes (e.g. 8, 10, 12), then you need 3 super metrics.
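As a rough illustration of the formula above, here is a minimal Python sketch. It is not super metric syntax. The policy table `MAX_VMS_BY_CLUSTER_SIZE` and its values are hypothetical, standing in for whatever Availability policy you set per cluster size.

```python
# Hypothetical Availability policy: maximum VMs allowed per cluster,
# keyed by cluster size (number of ESXi hosts). Values are illustrative only.
MAX_VMS_BY_CLUSTER_SIZE = {8: 80, 10: 100, 12: 120}


def vms_remaining(cluster_size: int, current_vm_count: int) -> int:
    """Maximum number of VMs allowed in the cluster minus the VMs already in it."""
    return MAX_VMS_BY_CLUSTER_SIZE[cluster_size] - current_vm_count


# A 10-host cluster currently running 72 VMs.
print(vms_remaining(10, 72))  # 28
```

One entry per cluster size mirrors the point above: each distinct cluster size needs its own cap, and hence its own super metric.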
Supply: Total physical cores of all ESXi Hosts – HA buffer
- We can choose physical cores or physical threads. One is conservative, while the other is aggressive. The ideal number is 1.5 times the number of physical cores. My recommendation: take the cores, not the threads. This is because it is Tier 1, your highest and best tier.
- Threshold: 10% of your capacity, as it takes time to buy a cluster (which also needs storage). You are also not aiming to run your ESXi hosts at 100% utilization.
- You do not have to build your threshold (which is actually your buffer) into the super metric formula, as it’s dynamic. Once it’s hard-coded in the super metric, changing it does not change the history. It is dynamic because it depends on the business situation. If a large project is going live in a few weeks, your buffer needs to cater for it. This is why we need to stay close to the business. It’s also something you should know from your actual experience in your company; you have that gut feel and estimate.
Demand: Total vCPU for all the VMs.
- If you are using virtual threads in your VM, count each of them as a full vCPU. For example, a VM with 2 vCPU and 2 threads per core should be counted as 4 vCPU.
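The CPU supply and demand described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Host` and `VM` classes are hypothetical, and the HA buffer is assumed to be the cores of the largest `ha_hosts` hosts, which is one common way to size it but not something this article prescribes. The threshold is kept outside the supply and demand functions, in line with the advice not to hard-code it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    physical_cores: int  # physical cores, not hyper-threads (Tier 1 recommendation)


@dataclass
class VM:
    vcpus: int
    threads_per_core: int = 1  # virtual threads count as full vCPUs


def cpu_supply(hosts: List[Host], ha_hosts: int = 1) -> int:
    """Total physical cores minus the HA buffer.

    Assumption: the HA buffer equals the cores of the largest `ha_hosts` hosts.
    """
    cores = sorted((h.physical_cores for h in hosts), reverse=True)
    return sum(cores[ha_hosts:])


def cpu_demand(vms: List[VM]) -> int:
    """Total vCPU of all VMs, counting each virtual thread as a full vCPU."""
    return sum(vm.vcpus * vm.threads_per_core for vm in vms)


def below_threshold(supply: int, demand: int, threshold: float = 0.10) -> bool:
    """True when remaining capacity has dropped under the buffer (10% here)."""
    return (supply - demand) < threshold * supply


hosts = [Host(physical_cores=24) for _ in range(8)]
vms = [VM(vcpus=2, threads_per_core=2), VM(vcpus=4)]
supply, demand = cpu_supply(hosts), cpu_demand(vms)
print(supply, demand, below_threshold(supply, demand))  # 168 8 False
```

Because `threshold` is just a parameter, you can raise it ahead of a large project without touching the supply or demand calculations.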
Supply: Total physical RAM of all ESXi Hosts – HA buffer
- No need to include ESXi vmkernel RAM as it’s negligible. If you are using VSAN and NSX, you can add some buffer. You do not need to include virtual appliances, as they take the form of a VM and hence are already included in the Demand.
- Threshold: set the same number, which is 10% in this example.
Demand: Total vRAM for all the VMs
Super Metric Formula: Max (ESXi Host vmnic utilization) in the cluster
This number has to be below your physical capacity. Ideally, it leaves a buffer so the network can handle spikes from network-intensive events.
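To make the network formula concrete, here is a small Python sketch of taking the maximum vmnic utilization across all hosts in a cluster and checking it against capacity. The host names, utilization figures, 10 Gbps link speed and 20% buffer are all hypothetical values chosen for illustration.

```python
from typing import Dict


def max_vmnic_utilization(cluster: Dict[str, Dict[str, float]]) -> float:
    """Highest utilization (Mbps) of any vmnic on any ESXi host in the cluster."""
    return max(util for nics in cluster.values() for util in nics.values())


def has_headroom(peak_mbps: float, link_capacity_mbps: float,
                 buffer_fraction: float) -> bool:
    """True when the peak stays under physical capacity minus the chosen buffer."""
    return peak_mbps <= link_capacity_mbps * (1 - buffer_fraction)


# Hypothetical sample: host -> vmnic -> utilization in Mbps.
cluster = {
    "esxi-01": {"vmnic0": 2100.0, "vmnic1": 3400.0},
    "esxi-02": {"vmnic0": 7600.0, "vmnic1": 1200.0},
}
peak = max_vmnic_utilization(cluster)
print(peak, has_headroom(peak, link_capacity_mbps=10_000, buffer_fraction=0.2))
```

Using the maximum rather than the average means a single saturated vmnic is enough to surface on the chart.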
- The above formula is all you need for Tier 1.
- In an emergency, as a temporary solution, you can still deploy VMs while waiting for your new cluster to arrive. This is because you have the HA buffer, and ESXi hosts are known for their high uptime.
Tier 2 and 3
Tier 2 and 3 are different, as there is over-subscription. Since we overcommit CPU and RAM, we can no longer use the allocation model. We need to take performance into account.
- Super Metric Formula: Maximum (VM CPU Contention) in the cluster
- Super Metric Formula: Average (VM CPU Contention) in the cluster
- Super Metric Formula: Maximum (VM RAM Contention) in the cluster
- Super Metric Formula: Average (VM RAM Contention) in the cluster
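The four formulas above share one shape: take a per-VM contention value and roll it up to the cluster as a maximum and an average. A minimal Python sketch of that roll-up follows. The VM names and contention percentages are made up, and the function is an illustration only, not super metric syntax.

```python
from statistics import mean
from typing import Dict, Tuple


def contention_summary(vm_contention_pct: Dict[str, float]) -> Tuple[float, float]:
    """Return (maximum, average) contention across all VMs in the cluster."""
    values = list(vm_contention_pct.values())
    return max(values), mean(values)


# Hypothetical CPU contention (%) per VM; the same function works for RAM.
cpu_contention = {"vm-a": 1.0, "vm-b": 3.0, "vm-c": 8.0}
worst, average = contention_summary(cpu_contention)
print(worst, average)  # 8.0 4.0
```

The maximum shows whether any single VM is suffering, while the average shows whether the cluster as a whole is under pressure.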
For the total number of VMs left in the cluster, see Tier 1. It’s the same formula, just a different policy. Naturally, you have a higher threshold.
For the ESXi vmnic utilization, see Tier 1. The identical formula is used.
Indeed, a few line charts are all you need to manage capacity. I am aware it is not a fully automated solution. However, my customers found it logical and easy to understand. It follows an 80/20 principle, where you are given the 20% room to make the judgement call as the expert.
To see the actual super metric examples, proceed to part 3.