SDDC Dashboards: The Dining Area

This post continues from the Operationalize Your World post. Do read it first so you get the context.

If your customers are happy, your internal problem is secondary.

To ensure that your customers are happy, there are a few things you must be able to prove:

  1. Are the VMs up?
    • This is the #1 Job. It is more important than security and performance. If the VM is dead, there is nothing to talk about 🙂
  2. Are they fast?
    • Just because they are up does not mean they are fast!
    • Is your IaaS serving them well?
  3. If not, which VMs are hit? By what and when?
    • Who are the victims?
  4. Who’s causing the problem?
    • Who are the villains?
    • We saw performance degrade on a cluster of 500 VMs when just 1-2 VMs ran IOmeter.
    • VMs with excessive usage hurt the business.
  5. When a VM Owner complains, can your Help Desk add value within 1 minute?
    • Who has time to play corporate ping-pong when there is so much to do?
  6. We know we have the Over Provisioning disease.
    • But how bad is it exactly? It impacts both Performance and Capacity.
    • Can you right-size VMs without impacting performance?
  7. Are there any configuration issues you need to be aware of?

Let’s go through the dashboards that answer those questions, starting from Question 1.

Are the VMs up?

The dashboard helps in the following areas:

  1. What’s the overall uptime? Your CIO may ask you for the overall uptime over time. You can provide a line chart, showing the aggregate uptime among all the VMs.
  2. What’s the Uptime for each VM per month? The table on the dashboard is grouped by month. It’s showing Sep 2016. All VMs are showing 100%, which is what you want to see before you go for lunch or holiday 🙂
  3. What’s the VM availability now? The heat map provides an easy visualisation. You expect green for all VMs.
  4. If a VM Uptime is <100%, when was it down and for how long? You can click on the heat map, and a line chart will be shown automatically. What you want to see is a straight line.
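
To make the per-VM monthly uptime figure concrete, here is a minimal sketch of the arithmetic. It assumes you have exported availability samples as (VM name, timestamp, is-up) tuples at a fixed interval. The function name, the sample data and the export format are illustrative assumptions, not something vR Ops provides in this shape.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def monthly_uptime(samples):
    """Return {(vm, "YYYY-MM"): uptime percent} from (vm, timestamp, is_up) samples."""
    up = defaultdict(int)
    total = defaultdict(int)
    for vm, ts, is_up in samples:
        key = (vm, ts.strftime("%Y-%m"))
        total[key] += 1
        if is_up:
            up[key] += 1
    return {key: 100.0 * up[key] / total[key] for key in total}


# Hypothetical data: four 5-minute samples per VM in Sep 2016.
start = datetime(2016, 9, 1)
samples = []
for i in range(4):
    ts = start + timedelta(minutes=5 * i)
    samples.append(("vm-a", ts, True))
    samples.append(("vm-b", ts, i != 2))  # vm-b is down for one sample

uptime = monthly_uptime(samples)
```

With equally spaced samples, the share of "up" samples approximates the share of time the VM was up. A VM that was down for one of four samples shows 75% for that month.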

Availability - VM

Are they fast?

The dashboard helps in the following areas:

  1. Is your IaaS serving them well? If not, when does it fail to deliver?
    • If you do not define “well”, you have not quantified “fast”. If you have not defined it, you have not set a measurable expectation. That’s not a position you want to take, unless you enjoy performance troubleshooting 🙂
    • Measurable expectation = Performance SLA. Review this to help you.
  2. Which part of your IaaS business fails to deliver the promise?
    • In IaaS, you are only selling CPU, RAM, Disk and Network.
    • The VMs are consuming these 4 resources. Make sure they get what you promise them.
  3. How is the performance per cluster?
    • vSphere Cluster is the smallest logical building block, due to DRS and HA.

picture1

In the dashboard, the Performance SLA is dynamic. When you select a cluster of a different tier, notice that the SLA changes too. You can adjust the SLA to your actual number.
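
One way to picture a tier-dependent SLA is a lookup table keyed by tier. The sketch below is only an illustration of the idea. The tier names, metric names and threshold values are all made-up assumptions, so replace them with the numbers you have actually agreed with your customers.

```python
# Hypothetical thresholds per tier. These are NOT values from the dashboard.
PERFORMANCE_SLA = {
    "tier1": {"cpu_contention_pct": 1.0, "ram_contention_pct": 0.0, "disk_latency_ms": 10.0},
    "tier3": {"cpu_contention_pct": 5.0, "ram_contention_pct": 2.0, "disk_latency_ms": 30.0},
}


def breaches(tier, observed):
    """Return the metric names where the observed value exceeds the SLA of the tier."""
    sla = PERFORMANCE_SLA[tier]
    return sorted(name for name, limit in sla.items() if observed.get(name, 0.0) > limit)


observed = {"cpu_contention_pct": 3.0, "ram_contention_pct": 0.0, "disk_latency_ms": 12.0}
tier1_breaches = breaches("tier1", observed)
tier3_breaches = breaches("tier3", observed)
```

The same observed numbers fail the stricter tier and pass the looser one, which is why the SLA line moves when you select a cluster of a different tier.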

picture2

You can easily customize the dashboard. In the example below, the dashboard now has Network performance and related objects.

performance-monitoring

Yes, the dashboard is all you need to know if the cluster is performing well or not.

  • If there is no VM affected, it’s good. No need to analyse further.
  • If there are VMs affected, we want to know which ones. The dashboard lists the Top 30 VMs in terms of:
    • CPU Contention
    • RAM Contention
    • Disk Latency
    • Network dropped packets (ensure it is 0)
    • Network latency (this needs NetFlow, which vR Ops cannot do)
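
If you want to reproduce a Top-N list outside the dashboard, the logic is a simple ranking per counter. Here is a small sketch, assuming you already have one value per VM for the counter of interest. The function name and the sample values are hypothetical.

```python
import heapq


def top_n(vm_values, n=30):
    """Return up to n (vm, value) pairs with the highest value, worst first.

    VMs with a value of 0 are dropped, since they are not affected.
    """
    affected = [(vm, value) for vm, value in vm_values.items() if value > 0]
    return heapq.nlargest(n, affected, key=lambda pair: pair[1])


# Hypothetical CPU Contention (%) per VM.
cpu_contention = {"vm-001": 0.0, "vm-002": 7.5, "vm-003": 2.1, "vm-004": 12.0}
worst = top_n(cpu_contention, n=2)
```

An empty result is the good case: no VM is affected, so there is no need to analyse further.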

Who are the victims?

Once you know which cluster has the problem, and the time & type of problem, you can drill down.

Your IaaS can fail to deliver different resources at different times. For example, it has a CPU performance issue at 12:35 pm and a Disk performance issue at 10:40 pm. The performance line chart shows you the correlation, if any. In the above example, the selected cluster has a Storage performance issue, but is doing well on RAM.

During the same time interval, different VMs can be hit by different problems. If your IaaS fails to deliver on CPU and Disk at 12:35 pm, VM 007 can be hit with a CPU problem while VM 747 can be hit with a Disk problem. This is why you need to be able to see each resource (CPU, RAM, Disk, Network) independently.

picture3

[e1: there is a known bug that prevents you from having 4 equal columns]

This dashboard depends on the previous dashboard. You select a cluster, then navigate to this dashboard. It will only show VMs from that cluster. You can see which VMs are hit by what (CPU, RAM, Disk, Network). This lets you take the appropriate action before the VM Owner complains.

Network latency cannot be done in vR Ops 6.5. Use vRealize Network Insight for it.

The packet drop counter can be unreliable if you are not at the right patch level. See this KB. This issue is resolved in:

  • ESXi 6.0, Patch ESXi-6.0.0-20160804001-standard
  • ESXi 5.5, Patch ESXi550-201312401-BG: Updates esx-base,

Who are the villains?

Which VMs were generating excessive workload? When and for how long?

You can see it by tracking the maximum workload generated by any VM on a line chart. The example below shows excessive IOPS. It jumped to 13,212 IOPS when the average did not even touch 15 IOPS.

picture1

A VM can only generate excessive workload on IOPS and Network. It can’t abuse CPU and RAM, as it can’t go beyond its configured size. The dashboard therefore tracks IOPS and Network. Once you see a peak, you use the Top-N to list the VMs.
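
The two steps above (track the maximum across all VMs, then list the Top-N at the peak) can be sketched as follows. This assumes per-VM IOPS exported as a nested dictionary keyed by timestamp. The data layout, function names and all numbers except the 13,212 peak are made up for illustration.

```python
from statistics import mean


def max_and_average(series):
    """series: {timestamp: {vm: iops}} -> {timestamp: (max, average)} across VMs."""
    return {ts: (max(v.values()), mean(v.values())) for ts, v in series.items()}


def villains_at_peak(series, n=3):
    """Find the timestamp with the highest single-VM value, then rank the VMs at that time."""
    peak_ts = max(series, key=lambda ts: max(series[ts].values()))
    ranked = sorted(series[peak_ts].items(), key=lambda kv: kv[1], reverse=True)
    return peak_ts, ranked[:n]


# Hypothetical samples: one VM spikes at 12:35.
iops = {
    "12:30": {"vm-a": 10, "vm-b": 12, "vm-c": 14},
    "12:35": {"vm-a": 11, "vm-b": 13212, "vm-c": 13},
    "12:40": {"vm-a": 9, "vm-b": 15, "vm-c": 12},
}
summary = max_and_average(iops)
peak_ts, villains = villains_at_peak(iops, n=1)
```

Tracking the maximum rather than the average is the point: a single noisy VM barely moves the average in a large cluster, but it always shows up in the maximum.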

picture7

For details on how this dashboard helped a customer who was hit by IOmeter, see this.

When a VM Owner complains

A VM Owner only cares about her VM. The fact that you have 1001 other VMs is irrelevant. As a result, the fact that your VMware cluster is working hard at 100% utilization is also irrelevant. That’s why the following dashboard does not show other VMs or your Infrastructure.

Using the dashboard, a Help Desk operator can search for the VM, or browse the list. For every VM, we show the key properties, such as number of vCPUs, RAM size, CPU Contention, RAM Contention, etc. The columns can be customised.

Once found, the operator simply selects that VM. How well your IaaS platform serves it will be automatically shown. The dashboard uses line charts, and not a single number, so you can see if there is any pattern.

Below is all the dashboard shows! That’s all, because it’s about Monitoring, not Troubleshooting.

The dashboard sports Performance SLA.

  • This is the most important feature of this dashboard, as it allows you to clear performance issues quickly.
  • That SLA line is dynamic. It varies, depending on which tier the VM belongs to.
  • To change the threshold, simply change the value in the super metric named Performance SLA. There is no need to modify the policy.
  • For networking, the general expectation is 0 dropped packets, hence there is no need for an SLA line. We show both TX and RX instead, so you can see more precisely where the issue is.

From the example below, it clearly shows the IaaS is unable to meet its promise on CPU but does well on RAM. It failed for around 20 minutes on Disk. You don’t even have to wait for the VM Owner to complain. You can be proactive and discuss the need for additional hardware or an upgrade.

If you need to see how the dashboard works, here is a short video.

The above dashboard clearly tells you whether you are serving your customer well. It’s suitable for a Help Desk Operator. All they need to see is whether it’s above the threshold or not. Once you have operationalized IaaS, this dashboard is the easy part. You can actually make it self-service if you have a formal agreement.

What if you need to find out why? In other words, you move from monitoring to troubleshooting. From this dashboard, you can navigate to the VM Troubleshooting dashboard.

picture5

Troubleshooting is a world by itself. The diagram below shows a partial list of what can cause performance issues.

A performance problem has only 2 main causes:

  1. The VM itself
  2. The Infra is unable to serve the VM.

For the VM, here are some possible reasons:

  • Utilization is high. It does not have enough capacity.
  • The VM is too big. Processes ping-pong among the vCPUs, so the context switch rate is very high.
  • Configuration is wrong. Examples: NUMA, storage driver, network driver, app configuration.
  • Bug. This can result in high utilization (e.g. memory leak).
  • The app does not scale well. It’s not able to take advantage of all the vCPUs, and its load is concentrated on just a few.

For Infrastructure, look for signs that it’s heavily loaded or too small for the VM:

  • vCPU too big relative to Host cores?
  • Was there vMotion at the time of issue?

Do take note that what is considered “high” is relative. This is where performance troubleshooting is not just science, but also art. Also, not all counters indicate a performance problem. Examples:

  • A high number of processes inside the Guest OS does not correlate to a performance issue if they are mostly idle and do not cause a lot of context switches. On the other hand, you can have a process ping-ponging among the vCPUs even though there aren’t many processes running.
  • VM RAM being ballooned out does not mean the VM experiences performance degradation. A RAM performance problem only happens when the CPU wants to access a page and has to wait for RAM. It has to wait if the page is not available in RAM, because it was ballooned out, swapped, compressed, etc. So track the swap in, not the swap out.

We can’t show all the metrics and possibilities shown above. Here is what we can do. You can customize it to show more. You should also build a custom Log Insight dashboard to complement this.

The VM is selected from the previous dashboard (Single VM Monitoring). Related Datastores are automatically shown, along with their KPIs. The related ESXi cannot be shown automatically, as the ESXi where the VM is running now might not be the ESXi where the VM was running at the time of the issue. On the dashboard, choose the metric Parent Host manually, as shown above. You can then see if the VM was on a different ESXi.

  • Compute: ESXi
    • CPU Contention, RAM Contention
    • CPU Demand, RAM Consumed, RAM Active
  • Storage: Datastore
    • Read Latency, Write Latency
    • Outstanding IO request
  • Network:
    • We are not showing network because Network should have 0 dropped packets, plus in general it’s hard to saturate 2x 10 GE.

Again, line charts are used, and not a single number, because they give you a lot more info.

Do note that VM CPU Workload can exceed 100% as it accounts for CPU contention and overhead.

Over Provisioning disease

If you take all the large VMs in your environment, and plot the maximum utilization among them, what do you expect?

You are right. It depends on whether they are over provisioned or not. If they are, the max among them will be low. The average will be even lower.

In a healthy, right-sized environment, there is bound to be at least 1 VM that has high utilization at any given time. This is especially true in a large environment.

The line charts below show the Max and Average utilizations among the large VMs. We can easily tell the degree of over provisioning.
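
The reasoning behind those two lines can be sketched for a single point in time. The snippet below assumes a list of VM records with hypothetical keys 'name', 'vcpu' and 'cpu_usage_pct', and an arbitrary definition of "large" as 8 vCPUs or more. Both the keys and the cut-off are assumptions for illustration.

```python
def large_vm_summary(vms, min_vcpu=8):
    """Return (max, average) CPU usage among VMs with at least min_vcpu vCPUs.

    Returns None when no VM qualifies as large.
    """
    usage = [vm["cpu_usage_pct"] for vm in vms if vm["vcpu"] >= min_vcpu]
    if not usage:
        return None
    return max(usage), sum(usage) / len(usage)


# Hypothetical snapshot of one collection cycle.
vms = [
    {"name": "big-1", "vcpu": 16, "cpu_usage_pct": 12.0},
    {"name": "big-2", "vcpu": 8, "cpu_usage_pct": 20.0},
    {"name": "small-1", "vcpu": 2, "cpu_usage_pct": 95.0},
]
summary_large = large_vm_summary(vms)
```

Here even the busiest large VM sits at 20%, which is the pattern of an over provisioned environment. The small VM at 95% is ignored because it is not part of the large VM group.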

The line chart does not show the VMs. That’s where the table comes in. It shows the max utilization of each VM in a given period.

The table does not show relative comparison among these large VMs. If you want to expose the largest VMs, the heat map shows that. The larger the VM, the larger the box.

picture1

What about undersized VMs? Generally speaking, this is not your problem. But if you want to answer “Which VMs hit high CPU usage when?”, you can use the following dashboard:

picture3

The above is what you want to see, indicating only 2 VMs had the problem in the past >1 month. In an environment where many VMs are undersized, you will see something like this. Notice this is not 2 months. This is just 6 hours, and each bar is only 10 minutes!

picture4

Right-sizing VM without impacting performance

The previous dashboard gives you the overall situation. To right-size, you need to deal with individual VMs. This gives you the confidence that performance will not be affected.

You can select any of the large VMs, starting from the one with the least utilization. The dashboard below will automatically list the VM utilization.

  • Each vCPU of the VM is listed in a table. It shows the maximum utilisation of each individual vCPU in the time range you are interested in.
  • It shows an analysis of the utilization of the VM. The Forensic chart shows the utilization level that covers 95% of the time. For a right-sized VM you expect that number to be >80%, as a VM should not be spending 95% of the time doing just 20% utilization. The Forensic chart also shows you the remaining 5%, so you can be convinced.
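
If you read the 95% figure as a 95th percentile of the utilization samples, the arithmetic looks like the sketch below. This is one interpretation, not a description of how the Forensic chart is computed internally. The nearest-rank method and the sample values are assumptions chosen for illustration.

```python
def percentile_nearest_rank(values, pct):
    """Return the smallest sample such that at least pct percent of samples are <= it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical CPU utilization (%) of one VM: idle most of the time, one short peak.
samples = [15] * 19 + [90]
p95 = percentile_nearest_rank(samples, 95)
peak = max(samples)
```

In this made-up case the VM spends 95% of the time at 15% or less, and the 90% peak lives entirely in the remaining 5%. That is the kind of evidence a VM Owner usually wants to see before agreeing to a smaller size.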

1-vm

Most VM Owners will ask for a detailed line chart showing each vCPU utilisation. The line chart below will be automatically shown when a VM is selected. It retains a 5-minute granularity.

picture2

RAM right-sizing is more challenging as you need Guest OS metrics, not VM metrics. vR Ops 6.3 sports the ability to pull this data using just VMware Tools.

vm-right-sizing-memory

Are they configured consistently?

Are there any “bad” configurations we need to know about? The dashboard lists VM configurations that need attention:

  • Do I have large VMs? If yes, what are their configurations? We cover CPU, RAM and Disk separately.
  • Do I have VMs connected to >1 network? They can bridge your network, so this should be reserved for Networking VMs only.
  • Do I have VMs with large snapshots? If yes, which VMs and how big?
  • Do I have VMs with old virtual hardware? If yes, which versions and how many?
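
The checks above can also be expressed as a simple rule set over an inventory export. The sketch below assumes one dictionary per VM with hypothetical keys ('vcpu', 'networks', 'snapshot_gb', 'hw_version'). The keys and the default thresholds are illustrative assumptions, not vR Ops property names or recommended values.

```python
def config_findings(vm, large_vcpu=16, max_snapshot_gb=20.0, min_hw_version=11):
    """Return a list of configuration findings for one VM record."""
    findings = []
    if vm["vcpu"] >= large_vcpu:
        findings.append("large VM")
    if len(vm["networks"]) > 1:
        findings.append("connected to >1 network")
    if vm["snapshot_gb"] > max_snapshot_gb:
        findings.append("large snapshot")
    if vm["hw_version"] < min_hw_version:
        findings.append("old virtual hardware")
    return findings


# Hypothetical inventory records.
clean_vm = {"vcpu": 2, "networks": ["prod"], "snapshot_gb": 0.0, "hw_version": 11}
messy_vm = {"vcpu": 24, "networks": ["prod", "dmz"], "snapshot_gb": 120.0, "hw_version": 8}

clean_findings = config_findings(clean_vm)
messy_findings = config_findings(messy_vm)
```

A VM with no findings needs no attention. Anything else goes on the list to discuss with the VM Owner.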

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

Further Reading

Hope you find the blog useful. For more info, you can refer to chapters 4 – 7 in this book.