Let’s elaborate on peaks. How do you define peak utilization or contention without being overly conservative or aggressive?
There are two dimensions to peaks: you can measure them across time or across members of the group. Let’s take a cluster with 8 ESXi hosts as an example:
- You measure across members of the group. For each sample period, take the utilization from the host with the highest utilization. In our cluster example, let’s say that at 1:30 pm, host number 7 has the highest utilization among all hosts, hitting 80%. We then take the cluster peak utilization at 1:30 pm to be 80% as well. You repeat this process for each sample period. You may get different hosts at different times, so you will not know which host provides the peak value; that varies from sample to sample. This method results in over-reporting, as it takes the peak of a single member. You can technically argue that this is the true peak.
- You measure across time. You take the average utilization of the cluster, roll it up to a longer time period, and take the peak of that longer period. For example, the cluster average utilization peaks at 80% at 1:30 pm. You roll up the data for one day, so the peak utilization for that day is 80%. This is the most common approach. The problem with this approach is that the value is actually an average. For the cluster to hit 80% average utilization, some hosts have to exceed 80%, which means you cannot rule out the possibility that one host is near 100%. The same logic applies to a VM: if a VM with 16 vCPUs hits 80% utilization, some cores have probably hit 100%. This method results in under-reporting, as it is an average.
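The two approaches above can be sketched in a few lines of code. This is an illustrative sketch only, not vRealize Operations code, and the utilization numbers are made up for the example:

```python
# Rows = 5-minute samples, columns = hosts 1..8 (CPU utilization, %).
# Hypothetical data: host 7 spikes to 80% in the 1:30 pm sample.
samples = [
    # h1  h2  h3  h4  h5  h6  h7  h8
    [50, 55, 48, 60, 52, 58, 62, 49],   # 1:25 pm
    [55, 60, 50, 65, 54, 61, 80, 52],   # 1:30 pm
    [52, 57, 49, 58, 51, 59, 64, 50],   # 1:35 pm
]

# Approach 1: measure across members of the group.
# For each sample, take the highest utilization among the hosts.
# The 5-minute granularity is retained.
peak_across_members = [max(row) for row in samples]
print(peak_across_members)         # [62, 80, 64]

# Approach 2: measure across time.
# Average the cluster at each sample, then take the peak of those
# averages over the rolled-up period (here, the whole window).
cluster_average = [sum(row) / len(row) for row in samples]
peak_across_time = max(cluster_average)
print(peak_across_time)            # 59.625 -> the 1:30 pm average
```

Approach 1 reports the 80% actually hit by host 7; approach 2 reports only the ~60% cluster average at that moment, hiding the hottest host.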
The first approach is useful if you want detailed information, as you retain the 5-minute granularity. With the second approach, you lose the granularity and each sample becomes one day (or one month, depending on your timeline), so you do not know what time of day the peak occurred. The first approach will also report higher values than the second, because in most cases your cluster is not perfectly balanced (identical utilization on every host). In a tier 1 cluster, where you do not oversubscribe, I’d recommend the first approach as it will capture the host with the highest peak. The first approach can be achieved by using super metrics in vRealize Operations. The second approach requires the View widget with data transformation.
Does this mean you should always use the first approach? The answer is no. The first approach can be too aggressive when the number of members is high. If your data center has 500 hosts and you use the first approach, your overall data center peak utilization will always be high: all it takes is one host hitting a peak at any given time. The same situation applies to contention. All it takes is one big VM, which tends to have higher contention, to skew the peak contention figure for the cluster.
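A small sketch shows why approach 1 becomes too aggressive at scale. The numbers are hypothetical: 500 hosts sitting at a calm 40%, with a different single host briefly spiking to 90% in each sample:

```python
n_hosts, n_samples = 500, 12
peaks, averages = [], []
for t in range(n_samples):
    utilization = [40.0] * n_hosts
    utilization[t % n_hosts] = 90.0   # a different host spikes each sample
    peaks.append(max(utilization))
    averages.append(sum(utilization) / n_hosts)

print(peaks[:3])       # [90.0, 90.0, 90.0] -> reported "peak" is always 90%
print(averages[0])     # 40.1 -> the data center is actually near idle
```

With a large group, one outlier per sample is almost guaranteed, so the peak-of-members line stays pinned high even when the group as a whole is nearly idle.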
The first approach fits a use case where automatic load balancing should happen, so you expect an overall balanced distribution. A DRS cluster is a good example.