Continuing from the blog post on the vSphere ESXi Performance dashboard, here is the Cluster Performance dashboard:
It’s designed to be similar to the ESXi dashboard, so it’s easier to learn. So do read the ESXi post first, as it’s a building block for the cluster.
While a cluster is technically a collection of ESXi hosts, it does have its own characteristics. So here are the changes:
- Added VM Disk Latency. This covers the HCI scenario, or designs where the datastores do not span across clusters.
- Cater for the scenario where the cluster is unbalanced. The root cause could be a reservation, a limit, VM affinity, etc. But the first step is to determine whether there is an imbalance to begin with. So for CPU, I plot both the cluster average and the highest value among its hosts. In a perfectly balanced cluster, the average and the highest will be very similar in both value and pattern. In an unbalanced cluster, their values differ, their patterns differ, or both.
- For RAM, since there are 2 counters (Consumed and Active), it would be confusing to plot the Average and Max for both. You would end up with 4 lines on the chart. So I simply plot Consumed (average) and Active (average).
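The CPU balance check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the dashboard is built: the function names, the sample values, and the 20-point gap threshold are all assumptions made for the example.

```python
from statistics import mean


def cpu_balance(host_cpu_pct):
    """Return (cluster average, highest host, gap) for one point in time.

    host_cpu_pct holds one CPU usage value (in %) per ESXi host.
    """
    avg = mean(host_cpu_pct)
    highest = max(host_cpu_pct)
    return avg, highest, highest - avg


def is_unbalanced(host_cpu_pct, gap_threshold=20.0):
    """Flag the cluster when the busiest host sits far above the average.

    gap_threshold is an assumed, illustrative value in percentage points.
    """
    _, _, gap = cpu_balance(host_cpu_pct)
    return gap > gap_threshold


# Hypothetical samples: four hosts per cluster.
balanced = [42.0, 45.0, 40.0, 43.0]
skewed = [20.0, 22.0, 18.0, 90.0]

print(cpu_balance(balanced), is_unbalanced(balanced))
print(cpu_balance(skewed), is_unbalanced(skewed))
```

A single sample only compares values. To compare patterns as well, you would run the same check over a time series and watch whether the gap stays small throughout.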
The above dashboard helps you troubleshoot a specific cluster. If you have many clusters, how do you know which ones to look at first? You need a table listing all clusters. You want to compare their performance, not their utilization. The table below does not list their utilization, as it’s not primary information. It would clutter the table, and may even mislead you into looking at the wrong cluster.
The above is good if you have fewer than 100 clusters. What if you have a lot more? The View List lets you filter down to a specific vCenter or Datacenter.
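To make the idea of a filtered, performance-ranked cluster table concrete, here is a minimal Python sketch. It does not use any vRealize Operations API. The `ClusterRow` type, its fields, and the choice of VM disk latency as the ranking column are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterRow:
    name: str
    vcenter: str
    datacenter: str
    vm_disk_latency_ms: float  # hypothetical performance column


def rank_clusters(rows, vcenter=None, datacenter=None):
    """Filter to one vCenter and/or Datacenter, then list the worst performers first."""
    selected = [
        r for r in rows
        if (vcenter is None or r.vcenter == vcenter)
        and (datacenter is None or r.datacenter == datacenter)
    ]
    return sorted(selected, key=lambda r: r.vm_disk_latency_ms, reverse=True)


# Made-up rows for the example.
rows = [
    ClusterRow("Cluster-A", "vc-01", "DC-East", 4.0),
    ClusterRow("Cluster-B", "vc-01", "DC-East", 18.5),
    ClusterRow("Cluster-C", "vc-02", "DC-West", 9.2),
]

for row in rank_clusters(rows, vcenter="vc-01"):
    print(row.name, row.vm_disk_latency_ms)
```

Note that the table is sorted on a performance counter rather than utilization, in line with the point made above.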
The above table is good, but what if you can’t look at it every 5 minutes? What if you look at it once a day? Or once a week? If you look at it on Sunday morning when there is no load, what data should it show?
- We can show the current data, which may not show problems in the past.
- We can show the average of the week, which will likely look good, since averaging hides short-lived problems.
- We can show the worst of the week, which will look bad but may not be relevant, as it could be a one-time, 5-minute peak.
This is where percentile comes in handy. It lets you ignore the outliers.
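The difference between average, worst, and percentile is easier to see with numbers. The sketch below uses a nearest-rank percentile, which is one common definition and an assumption here, not necessarily the one the product uses. The week of latency samples is made up: mostly 2 ms, a sustained stretch at 12 ms, and a single 80 ms spike.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# One week of 5-minute samples = 7 * 24 * 12 = 2016 values (made-up data).
week = [2.0] * 1865 + [12.0] * 150 + [80.0]

print("average:", round(mean(week), 2))   # roughly 2.8, looks healthy
print("worst:", max(week))                # 80.0, the one-time spike
print("95th percentile:", percentile(week, 95))  # 12.0, the sustained problem
```

In this example the average hides the problem, the worst value is dominated by a single 5-minute spike, and the 95th percentile lands on the sustained 12 ms period.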
Just like the ESXi dashboard, you can find this dashboard in VMware Sample Exchange.