vSphere Cluster Performance Dashboard

Continuing the blog on vSphere ESXi Performance dashboard, here is the Cluster Performance dashboard:

It’s designed to be similar to the ESXi dashboard, so it’s easier to learn. Do read the ESXi post first, as it’s a building block for the cluster.

While a cluster is technically a collection of ESXi hosts, it does have its own characteristics. So here are the changes:

  • Added VM Disk Latency. This covers the HCI scenario, or designs where the datastores do not span clusters.
  • Catered for the scenario where the cluster is unbalanced. The root cause could be a reservation, limit, VM affinity, etc., but the first step is to determine whether there is an imbalance to begin with. So for CPU, I plot both the cluster average and the highest among its hosts. In a perfectly balanced cluster, the average and the highest will be very similar in value and pattern. In an unbalanced cluster, their patterns differ, their values differ, or both.
  • For RAM, since there are 2 counters (Consumed and Active), it would be confusing to plot the Average and Max for both, as you would end up with 4 line charts. So I simply plot Consumed (average) and Active (average).
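
The balance check described above can be sketched in a few lines. This is only an illustration, not how the dashboard computes it: the function name, the sample values, and the 20-point gap threshold are all assumptions made for the example.

```python
from statistics import mean


def cluster_cpu_balance(host_cpu_pct, gap_threshold_pct=20.0):
    """Compare the cluster average CPU with its busiest host.

    host_cpu_pct: CPU utilization (%) of each ESXi host at one point in time.
    gap_threshold_pct: hypothetical cut-off, tune it for your environment.
    """
    average = mean(host_cpu_pct)
    highest = max(host_cpu_pct)
    gap = highest - average
    return {
        "average": average,
        "highest": highest,
        "gap": gap,
        "unbalanced": gap > gap_threshold_pct,
    }


# Made-up samples: one evenly loaded cluster, one with a single hot host.
balanced = cluster_cpu_balance([52.0, 55.0, 50.0, 53.0])
skewed = cluster_cpu_balance([20.0, 25.0, 90.0, 25.0])

print(balanced["unbalanced"], skewed["unbalanced"])
```

A single point in time only compares values. To compare patterns as well, you would run the same check across the whole time series.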

The above dashboard helps you troubleshoot a specific cluster. If you have many clusters, how do you know which ones to look at first? You need a table listing all clusters. You want to compare their performance, not their utilization. The table below does not list utilization, as it is not primary information. It would clutter the table, and may even mislead you into looking at the wrong cluster.

The above is good if you have <100 clusters. What if you have a lot more? The View List lets you filter into a specific vCenter or Datacenter.

The above table is good, but what if you can’t look at it every 5 minutes? What if you look at it once a day? Or once a week? If you look at it on Sunday morning when there is no load, what data should it show?

  • We can show the current data, which may not reveal problems in the past.
  • We can show the average of the week, which will look good, as averaging hides short-lived problems.
  • We can show the worst of the week, which will look bad but may not be relevant, as it could be a one-time, 5-minute peak.

This is where Percentile comes in handy. It lets you ignore the outliers.
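
To make the difference concrete, here is a small sketch comparing average, worst, and 95th percentile over a week of 5-minute samples. The sample values are made up, and the nearest-rank percentile below is just one common definition. It is not necessarily the method vRealize Operations uses.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# One week of 5-minute samples = 7 * 24 * 12 = 2016 data points (made-up values, in %).
one_spike = [1.0] * 2015 + [40.0]        # a single 5-minute peak
sustained = [1.0] * 1814 + [30.0] * 202  # a problem lasting about 10% of the week

for name, data in (("one_spike", one_spike), ("sustained", sustained)):
    average = sum(data) / len(data)
    print(name, round(average, 2), max(data), percentile(data, 95))
```

With the single spike, the worst value is 40 but the 95th percentile stays at 1, so the outlier is ignored. With the sustained problem, the average still looks low, while the 95th percentile reports 30.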

Just like the ESXi dashboard, you can find this dashboard in VMware Sample Exchange.

ESXi Performance Dashboard

vRealize Operations 7.0 enhances the widgets and dashboards, which enables us to create a better user experience. With that, I’m happy to share the VMware ESXi Performance dashboard:

The above dashboard is color coded. The idea is that you just need to glance at it to see that everything is green. You only need to look at the counters that are not green.

Layout wise, it’s split into 4 levels. Do click to enlarge it, as there are descriptions added on the image. The dashboard shows Performance first, then Utilization. Can you guess why?

Performance: What counters define your ESXi Performance? 

  • We know that utilization is not performance. It’s related, but it’s not the same thing. An ESXi with low utilization could be a sign of something wrong. Could it be CPU and RAM are waiting for Disk? Could it be networks are dropping packets?
  • A high performing ESXi is one that does its job well. It serves its workload easily. It’s not struggling to juggle the demands from all the VMs running on it. So performance must be measured in terms of how well the VMs are being served. There are 2 sub-dimensions to this.
    • How bad is the problem? This covers the depth.
    • How widespread is the problem? This covers the breadth.
  • How bad the problem is can be quantified by taking the worst CPU Contention or RAM Contention experienced by any of the VMs.
  • How widespread the problem is can be quantified by the percentage of VMs facing contention.
  • The 2 sub-dimensions complement each other. Together they give you an insight into the performance of your ESXi. If you have very bad contention, but it only impacts a small percentage of VMs, then the problem is narrow. This could be a sign of monster VMs. If the worst contention is not that bad, but it impacts almost all VMs, then the ESXi itself is struggling.
  • Do you know why I don’t add VM Disk Latency? Even on vSAN, the solution may not be on the ESXi you’re looking at.
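
The depth and breadth idea above can be sketched as follows. This is an illustration only, not the actual supermetric shipped with the dashboard. The function name, the VM names, and the 1% threshold for "facing contention" are all assumptions.

```python
def esxi_performance(vm_contention_pct, threshold_pct=1.0):
    """Return (depth, breadth) for one ESXi host.

    vm_contention_pct: mapping of VM name -> CPU (or RAM) contention in %.
    threshold_pct: hypothetical level above which a VM counts as facing contention.
    """
    values = list(vm_contention_pct.values())
    if not values:
        return 0.0, 0.0
    depth = max(values)  # how bad: the worst contention among the VMs
    affected = sum(1 for v in values if v > threshold_pct)
    breadth = 100.0 * affected / len(values)  # how widespread: % of VMs affected
    return depth, breadth


# Narrow problem: one monster VM suffers, the rest are fine.
narrow = {"vm-a": 0.2, "vm-b": 0.1, "vm-c": 0.3, "monster": 15.0}
# Broad problem: nothing extreme, but every VM is affected.
broad = {"vm-a": 3.0, "vm-b": 2.5, "vm-c": 3.5, "vm-d": 2.0}

print(esxi_performance(narrow))
print(esxi_performance(broad))
```

The first host shows high depth with low breadth, the second shows moderate depth with 100% breadth, which matches the two cases described in the list.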

Utilization: Drive it high as you paid for the whole box

  • Now that you can measure Performance, you have the confidence to drive utilization high. No need to add artificial headroom. Hence Utilization is shown below Performance, as it’s secondary.
  • For RAM, both Consumed and Active are shown. If Active is low, there is no need to upgrade RAM, as Consumed contains disk cache. For me, it’s fine for Consumed to be 95% so long as RAM Contention is 0.
  • For CPU, both Demand and Usage are shown. Do you know the difference between them?

Installation

  • Download the dashboard from VMware Code.
  • Import the dashboard, view, and supermetric.
  • Enable the supermetric in your base policy. Hope it’s a good introduction to the awesome power of supermetrics!
  • Replace your ESXi Summary Page with this. My brother Sunny has documented how here.

Hope you find it useful. Next is the vSphere Cluster Performance dashboard.

Allocation Model in vSphere

The Allocation model, using vCPU:pCore and vRAM:pRAM ratios, is one of the 2 capacity models used in VMware vSphere. Together with the Utilization model, it helps the Infra team manage capacity. The problem with both models is that neither of them measures performance. While they correlate with performance, they are not counters for it.

As part of Operationalize Your World, we proposed a measurement for performance. We modeled performance and developed a counter for it. For the very first time, Performance can be defined and quantified. We also added an availability concept, in the form of a concentration risk ratio. Most businesses cannot tolerate too many critical VMs going down at the same time, especially if they are revenue generating.

Since the debut of Operationalize Your World at VMworld 2015, hundreds of customers have validated this new metric. With performance added, we are in a position to revise VMware vSphere capacity management.

We can now refine Capacity Management and split it into Planning, Monitoring and Troubleshooting.

Planning Stage

At this stage, we do not know what the future workload will be. We can plan to deliver a certain level of performance at a certain level of utilization. We use the allocation ratio at this stage. The Allocation Ratio directly relates to your cost, hence your price. If a physical core costs $10 per month, and you do 5:1 over-commit, then each vCPU should be priced at no less than $2 per month. Lower than this, and you will make a loss. In practice it has to be higher than $2, unless you can sell all resources on Day 1 for 3 years.
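
The pricing arithmetic can be written down as a tiny sketch. The function and the `sold_fraction` parameter are hypothetical, added only to show why the price has to sit above the break-even figure when not everything is sold.

```python
def break_even_vcpu_price(core_cost_per_month, overcommit_ratio, sold_fraction=1.0):
    """Minimum monthly price per vCPU that recovers the cost of one physical core.

    sold_fraction: share of the sellable vCPUs actually sold (hypothetical
    parameter; 1.0 means everything is sold from Day 1).
    """
    if overcommit_ratio <= 0 or not 0 < sold_fraction <= 1:
        raise ValueError("overcommit_ratio must be > 0 and sold_fraction in (0, 1]")
    return core_cost_per_month / (overcommit_ratio * sold_fraction)


# The example from the text: $10 per core per month at 5:1 over-commit.
print(break_even_vcpu_price(10, 5))       # fully sold
print(break_even_vcpu_price(10, 5, 0.5))  # only half of the vCPUs sold
```

Fully sold, the break-even is $2 per vCPU. If only half of the vCPUs are sold, it doubles to $4.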

We also consider availability at this stage. For example, if the business can only tolerate 100 mission critical VMs going down when a cluster goes down, then we plan our cluster size accordingly. No point planning a large cluster when you can only put 100 VMs on it. 100 VMs, at an average size of 8 vCPUs, is 800 vCPUs, which needs 400 cores at 2:1 over-commit. Using 40-core ESXi hosts, that’s only 10 hosts. No point building a cluster of 16.
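
The same sizing calculation as a minimal sketch. The function name is made up for illustration, and it deliberately ignores any spare hosts you might add for HA or maintenance, which the example in the text does not cover.

```python
import math


def hosts_needed(vm_count, avg_vcpus_per_vm, overcommit_ratio, cores_per_host):
    """Number of ESXi hosts needed for a given VM count and over-commit ratio.

    Spare capacity for HA or maintenance is not included.
    """
    total_vcpus = vm_count * avg_vcpus_per_vm
    cores = math.ceil(total_vcpus / overcommit_ratio)
    return math.ceil(cores / cores_per_host)


# The example from the text: 100 VMs x 8 vCPUs, 2:1 over-commit, 40-core hosts.
print(hosts_needed(100, 8, 2, 40))
```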

Monitoring Stage

This is where you check if Plan meets Actual. You have live VMs running, so you have real data, not a spreadsheet 🙂 . There are 2 possible situations:

  1. Over-commit
  2. No over-commit

With no over-commit, the utilization of the cluster will never exceed 100%. Hence there is no point measuring utilization. There will be no performance issue either, since the VMs do not compete for resources. No contention means ideal performance. So there is no point measuring performance. The only relevant metrics are availability and allocation.

With over-commit, the opposite happens. The Ratio is no longer valid, as we can have performance issues. It’s also not relevant, since we have real data. If you plan on 8:1 over-commit, but at 4:1 you have performance issues, do you keep going? You don’t, even if you make a loss because your financial plan was based on 8:1. You need to figure out why and solve it. If you cannot solve it, then you remain at 4:1. What you learn is that your plan did not pan out as planned 😉

There are 3 reasons why ratio (read: allocation model) can be wrong:

Mark Achtemichuk, VMware performance guru, summarizes it well here. Quoting him:

There is no common ratio and in fact, this line of thinking will cause you operational pain.

Troubleshooting Stage

If you have plenty of capacity but you have a performance problem, you enter capacity troubleshooting. A typical cause of poor performance when utilization is not high is contention. The VMs are competing for resources. This is where the Cluster Performance (%) counter comes into play. It gives an early warning, hence acting as a Leading Indicator.

Summary

You no longer have to build in a buffer to ensure performance. You can go higher on the consolidation ratio, as you can now measure performance.

If you are a Service Provider, you can now offer premium pricing, as you can back it up with a Performance SLA.

If you are a customer of an SP, then you can demand a Performance SLA. You do not need to rely on the ratio as a proxy.