Continuing the blog on vSphere ESXi Performance dashboard, here is the Cluster Performance dashboard:
It’s designed to be similar to the ESXi dashboard, so it’s easier to learn. Do read the ESXi one first, as it’s a building block for the cluster dashboard.
While a cluster is technically a collection of ESXi hosts, it has its own characteristics. So here are the changes:
- Added VM Disk Latency. This covers the HCI scenario, or designs where the datastores do not span across clusters.
- Cater for the scenario where the cluster is unbalanced. The root cause could be reservation, limit, VM affinity, etc. But the first thing is to determine whether there is an imbalance to begin with. So for CPU, I plot both the cluster average and the highest among its hosts. In a perfectly balanced cluster, the average and the highest will be very similar in both value and pattern. In an unbalanced one, the pattern or the value (or both) will differ.
- For RAM, since there are 2 counters (Consumed and Active), it would be confusing to plot the Average and Max for both; you would end up with 4 line charts. So I simply plot Consumed (average) and Active (average).
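To make the balance check concrete, here is a minimal Python sketch of the same idea: compare the cluster average against the busiest host. The host names, sample values and the 20-point gap threshold are my assumptions for illustration, not vRealize Operations counters.

```python
from statistics import mean

def cluster_balance(host_cpu_usage: dict, gap_threshold: float = 20.0) -> dict:
    """host_cpu_usage maps host name -> CPU usage %, one sample per host."""
    avg = mean(host_cpu_usage.values())
    busiest_host, peak = max(host_cpu_usage.items(), key=lambda kv: kv[1])
    return {
        "cluster_avg": round(avg, 1),
        "peak": peak,
        "busiest_host": busiest_host,
        # A large gap between the busiest host and the cluster average suggests imbalance
        "balanced": (peak - avg) <= gap_threshold,
    }

# One host runs much hotter than the rest: flagged as unbalanced
print(cluster_balance({"esxi-01": 35.0, "esxi-02": 38.0, "esxi-03": 88.0}))
```

In the dashboard this comparison is visual (two line charts); the sketch just shows the same test applied to a single sample.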
The above dashboard helps you troubleshoot a specific cluster. If you have many clusters, how do you know which ones to look at first? You need a table listing all clusters. You want to compare their performance, not their utilization. The table below does not list utilization, as it’s not primary information; it would clutter the table and may even mislead you into looking at the wrong cluster.
The above is good if you have fewer than 100 clusters. What if you have a lot more? The View List lets you filter to a specific vCenter or Datacenter.
The above table is good, but what if you can’t look at it every 5 minutes? What if you look at it once a day, or once a week? If you look at it on a Sunday morning when there is no load, what data should we show?
- We can show the current data, which may not reflect problems in the past.
- We can show the average of the week, which will look good because the quiet periods mask the busy ones.
- We can show the worst of the week, which will look bad but may not be relevant, as it could be a one-time, 5-minute peak.
This is where Percentile comes in handy: it lets you ignore the outliers.
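A quick sketch of why percentile works, using made-up data for a week of 5-minute samples (2,016 of them): the current value and the average hide the spike, the max overstates it, and the 95th percentile ignores the brief outlier.

```python
import statistics

# 2,016 five-minute samples: a quiet week with one 80-minute spike in the middle
week = [2.0] * 1000 + [40.0] * 16 + [2.0] * 1000

current = week[-1]                          # last sample: misses the spike entirely
average = statistics.mean(week)             # ~2.3: the spike is smoothed away
worst = max(week)                           # 40.0: dominated by the brief peak
p95 = sorted(week)[int(len(week) * 0.95)]   # 2.0: the top 5% of samples are ignored

print(current, round(average, 1), worst, p95)
```

If the spike had lasted more than 5% of the week, the 95th percentile would catch it, which is exactly the behavior you want from a daily or weekly roll-up.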
Just like the ESXi dashboard, you can find this dashboard on VMware Sample Exchange.
vRealize Operations 7.0 enhances the widgets and dashboards, enabling us to create a better user experience. With that, I’m happy to share the VMware ESXi Performance dashboard:
The above dashboard is color coded. The idea is that you just need to glance to confirm everything is green. You only need to look at a counter if it is not green.
Layout-wise, it’s split into 4 levels. Do click to enlarge it, as there are descriptions added on the image. The dashboard shows Performance first, then Utilization. Can you guess why?
Performance: What counters define your ESXi Performance?
- We know that utilization is not performance. It’s related, but it’s not the same thing. An ESXi with low utilization could be a sign of something wrong. Could the CPU and RAM be waiting for disk? Could the network be dropping packets?
- A high-performing ESXi is one that does its job well. It serves its workload easily; it’s not struggling to juggle the demands of all the VMs running on it. So performance must be measured in terms of how the VMs are being served. There are 2 sub-dimensions to this.
- How bad is the problem? This covers the depth.
- How widespread is the problem? This covers the breadth.
- How bad the problem is can be quantified by taking the worst CPU Contention or RAM Contention experienced across all the VMs.
- How widespread the problem is can be quantified by the percentage of VMs facing contention.
- The 2 sub-dimensions complement each other, giving you insight into the performance of your ESXi. If you have very bad contention but it only impacts a small percentage of VMs, then the problem is narrow. This could be a sign of monster VMs. If the worst contention is not that bad but it impacts almost all VMs, then the ESXi itself is struggling.
- Do you know why I don’t add VM Disk Latency? Even on vSAN, the culprit may not be the ESXi you’re looking at.
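The depth and breadth idea can be sketched in a few lines of Python. The per-VM contention numbers and the 1% "facing contention" cut-off are illustrative assumptions, not values from the dashboard.

```python
def depth_and_breadth(vm_contention: dict, threshold: float = 1.0):
    """Return (worst contention, % of VMs above the threshold)."""
    worst = max(vm_contention.values())                        # depth: how bad
    affected = sum(1 for c in vm_contention.values() if c > threshold)
    breadth = 100.0 * affected / len(vm_contention)            # breadth: how widespread
    return worst, breadth

# A monster VM: the problem is deep (38% contention) but narrow (1 of 4 VMs)
print(depth_and_breadth({"monster-01": 38.0, "web-01": 0.2, "web-02": 0.4, "db-01": 0.3}))
```

A deep-but-narrow result points at a specific VM; a shallow-but-wide result points at the host itself.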
Utilization: Drive it high as you paid for the whole box
- Now that you can measure Performance, you have the confidence to drive utilization high. There is no need to artificially set aside headroom. Hence Utilization is shown below Performance, as it’s secondary.
- For RAM, both Consumed and Active are shown. If Active is low, there is no need to upgrade RAM, as Consumed contains disk cache. For me, it’s fine for Consumed to be at 95% so long as RAM Contention is 0.
- For CPU, both Demand and Usage are shown. Do you know the difference between the two?
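The RAM rule above can be written as a tiny sketch. The 90% Active threshold is my assumption; the rule that high Consumed alone never triggers an upgrade follows the text.

```python
def ram_upgrade_needed(consumed_pct: float, active_pct: float, contention_pct: float) -> bool:
    # Consumed is intentionally not a trigger: it includes disk cache,
    # so even 95% Consumed is fine while there is no contention.
    # Upgrade only when contention appears or Active runs genuinely high.
    return contention_pct > 0 or active_pct > 90.0

# 95% Consumed, low Active, zero contention: no upgrade needed
print(ram_upgrade_needed(consumed_pct=95.0, active_pct=30.0, contention_pct=0.0))
```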
- Download the dashboard from VMware code.
- Import the dashboard, view, and supermetric.
- Enable the supermetric in your base policy. Hope it’s a good introduction to the awesome power of supermetric!
- Replace your ESXi Summary Page with this. My brother Sunny has documented the steps here.
Hope you find it useful. Next is vSphere Cluster Performance dashboard.
Folks like Daniel in Hong Kong, Sajag in Thailand, and Ramandeep in the US have noticed that I shifted my recommendation from CPU Contention to CPU Ready as the Performance SLA. The reason is essentially Change Management. Moving from complaint-based operations to SLA-based operations is a transformation. It’s not something you do in a month. You need to enlighten your boss and your customers. It’s a paradigm shift that can take months.
As a result, CPU Ready is a better start than CPU Contention. Your IaaS business is not ready for Contention, pun intended.
CPU Ready is more stable than CPU Contention, as it’s not affected by Hyper-Threading and Power Management.
- Running both hyper-threads on a core means each thread naively gets only 50% of the core’s cycles. Since HT gives only a 1.25x boost overall, each thread effectively gets 62.5% when both are running. That reduction is accounted for in CPU Contention, which is why it can spike to >35% when Ready is not even 1%. Test this by running 2 large VMs on 1 ESXi: if the ESXi has 16 cores and 32 threads, run 2x 16-vCPU VMs, both at 100%. Set Power Management to Maximum so you eliminate frequency scaling from impacting CPU Contention. Both should experience minimal CPU Ready but high CPU Contention. My guess is CPU Ready will be <1%, while CPU Contention will be >35%.
- Power Management. As you can see here, in general you should take advantage of power savings: the performance degradation is minimal while the savings are substantial. CPU Contention accounts for this frequency drop. My guess is that a frequency drop of 25% will result in CPU Contention of 25%. I write "guess" as I have not seen a test.
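The hyper-threading arithmetic above works out as follows; this is just a worked example of the 62.5% and >35% figures, not a vR Ops formula.

```python
ht_boost = 1.25               # a core's total throughput with both threads busy
per_thread = ht_boost / 2     # each busy hyper-thread's share: 0.625 of a full core
contention = 1 - per_thread   # shortfall charged as CPU Contention: 0.375

print(per_thread, contention)  # -> 0.625 0.375
```

That 37.5% shortfall is why the two-VM test above should show CPU Contention above 35% while CPU Ready stays under 1%.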
Considering the above, Ready is a lot less volatile, which makes it more suitable as an SLA. Operationally, it’s easier to implement, and it’s easier to explain to folks less familiar with the VMkernel CPU scheduler.
If you use CPU Contention as a formal SLA, you may spend a lot of time troubleshooting degradation that the business doesn’t even notice.
Where do you use CPU Contention then?
- If the value is low, then you don’t need to check CPU Ready, Co-Stop, Power Management or CPU overcommit, as they are all accounted for in CPU Contention.
- If the value is high (my take is > 37.5%), then follow these steps:
- Check CPU Run Queue, CPU Context Switch, Guest OS CPU Usage, CPU Ready and CPU Co-Stop. Ensure all the CPU counters are good. If they are all low, then the cause is Frequency Scaling and/or HT. If they are not low, check VM CPU Limit and CPU Share.
- Check ESXi power management. If it is correctly set to Maximum, then Frequency Scaling is ruled out and you’re left with HT as the factor; otherwise Frequency Scaling could be at play. A simple solution for apps that are sensitive to frequency scaling is to set power management to Maximum.
- Check CPU overcommit at the time of the issue. If there are more vCPUs than pCores on that ESXi, then HT could be impacting; otherwise HT is not a factor. IMHO, it’s rare for an application not to tolerate HT, as HT is transparent to it. While HT reduces each thread’s CPU time by 37.5%, a CPU that is 60% faster (1 / 0.625 = 1.6x) will logically make up for it.
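The three steps above can be condensed into a sketch of a decision helper. The inputs are booleans and counts you gather manually from the counters mentioned; the function and its return strings are my own framing, not anything in vRealize Operations.

```python
def high_contention_suspect(counters_low: bool, power_max: bool, vcpu: int, pcore: int) -> str:
    """Rough triage when CPU Contention is high (e.g. > 37.5%)."""
    if not counters_low:        # step 1: Run Queue, Ready, Co-Stop etc. are not low
        return "check VM CPU Limit and CPU Share"
    if not power_max:           # step 2: power management is not set to Maximum
        return "Frequency Scaling (HT possible too)"
    if vcpu > pcore:            # step 3: more vCPUs than physical cores
        return "HT"
    return "neither HT nor Frequency Scaling suspected"

# All counters low, power at Maximum, 40 vCPUs on 16 pCores
print(high_contention_suspect(counters_low=True, power_max=True, vcpu=40, pcore=16))  # -> HT
```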
Unfortunately, there is no way to check the individual impact of HT and Frequency Scaling directly, as there is no separate counter for each. You can see it indirectly by checking CPU Demand or CPU Usage: if there is a dip at the same time CPU Contention went up, but CPU Run does not dip, then HT or Frequency Scaling impacted the VM.
Hope that clarifies. If your observations in production differ from the above, do email me.