If your customers are happy, your internal problem is secondary.
To ensure that your customers are happy, there are a few things you must be able to prove:
- Are the VMs up?
  - This is the #1 job. It is more important than security and performance. If the VM is dead, there is nothing to talk about 🙂
- Are they fast?
  - Just because they are up does not mean they are fast 🙂
- Is your IaaS serving them well?
  - If not, which VMs are hit? By what, and when?
  - Who are the victims?
- Who’s causing the problem?
  - Who are the villains?
- When a VM Owner complains, can your Help Desk add value within 1 minute?
- We know we have the Over Provisioning disease.
  - But how bad is it?
- Can you right-size a VM without impacting performance?
Let’s go through the dashboards that answer those questions, starting from Question 1.
Are the VMs up?
The dashboard helps in the following areas:
- What’s the overall uptime? Your CIO may ask you for the overall uptime over time. You can provide a line chart showing the aggregate uptime across all the VMs.
- What’s the Uptime for each VM per month? The table on the dashboard is grouped by month. It’s showing Sep 2016. All VMs are showing 100%, which is what you want to see before you go for lunch or holiday 🙂
- What’s the VM availability now? The heat map provides an easy visualisation. You just expect green for all VMs.
- If a VM Uptime is <100%, when was it down and for how long? You can click on the heat map, and a line chart will be shown automatically. What you want to see is a straight line.
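If you want to sanity-check the numbers behind such a dashboard, the arithmetic is simple. Below is a minimal sketch, not how the dashboard itself is built. It assumes you can export availability samples as (VM name, ISO timestamp, up or down) tuples at a fixed interval. The function names and the sample data are made up for illustration.

```python
from collections import defaultdict

def uptime_by_vm_and_month(samples):
    """Return {(vm, 'YYYY-MM'): uptime percentage} from availability samples.

    Each sample is (vm_name, iso_timestamp, is_up). Samples are assumed to be
    evenly spaced, so the share of "up" samples approximates the uptime.
    """
    up = defaultdict(int)
    total = defaultdict(int)
    for vm, timestamp, is_up in samples:
        key = (vm, timestamp[:7])  # 'YYYY-MM' prefix of an ISO 8601 timestamp
        total[key] += 1
        if is_up:
            up[key] += 1
    return {key: 100.0 * up[key] / total[key] for key in total}

def overall_uptime(per_vm_uptime):
    """Average the per-VM figures into one aggregate number for the CIO."""
    values = list(per_vm_uptime.values())
    return sum(values) / len(values)

# Hypothetical data: vm-b was down for one of its two samples.
samples = [
    ("vm-a", "2016-09-01T00:00", True),
    ("vm-a", "2016-09-01T00:05", True),
    ("vm-b", "2016-09-01T00:00", True),
    ("vm-b", "2016-09-01T00:05", False),
]
report = uptime_by_vm_and_month(samples)
aggregate = overall_uptime(report)
```

A simple average treats every VM equally. If some VMs matter more than others, you may prefer a weighted figure.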
Are they fast?
The dashboard helps in the following areas:
- Is your IaaS serving them well? If not, when does it fail to deliver?
  - If you do not define “well”, you have not defined “fast”. If you have not defined it, you have not set a measurable expectation. That’s not a position you want to take, unless you enjoy performance troubleshooting 🙂
  - Measurable expectation = Performance SLA. Review this to help you.
- Which part of your IaaS business fails to deliver the promise?
  - In IaaS, you are only selling CPU, RAM, Disk and Network. The VMs are consuming these 4 resources. Make sure they get what you promise them.
- How is the performance per cluster?
  - vSphere Cluster is the smallest logical building block, due to DRS and HA.
The Performance SLA is dynamic. When you select a cluster of a different tier, notice that the SLA changes too. You can adjust the SLA to your actual numbers.
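To make the idea of a per-tier SLA concrete, here is a rough sketch of the check. The tier names, the thresholds and the sample series are all hypothetical, so substitute the numbers you actually promise your customers.

```python
# Hypothetical SLA limits per tier (assumed units: CPU, RAM and Network in %,
# Disk latency in ms). These are illustrative values, not recommendations.
SLA_BY_TIER = {
    "tier1": {"cpu": 1.0, "ram": 0.0, "disk": 10.0, "network": 0.0},
    "tier3": {"cpu": 5.0, "ram": 2.0, "disk": 30.0, "network": 0.5},
}

def breach_times(tier, resource, series):
    """Return the timestamps at which `resource` exceeded the SLA of `tier`.

    `series` is a list of (timestamp, value) pairs for one cluster.
    """
    limit = SLA_BY_TIER[tier][resource]
    return [timestamp for timestamp, value in series if value > limit]

# The same disk latency series judged against two different tiers.
disk_latency_ms = [("12:30", 8.0), ("12:35", 14.0), ("12:40", 35.0)]
tier1_breaches = breach_times("tier1", "disk", disk_latency_ms)
tier3_breaches = breach_times("tier3", "disk", disk_latency_ms)
```

The same workload breaches the stricter tier more often, which is why the SLA line moves when you switch to a cluster of another tier.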
Who are the victims?
Your IaaS can fail to deliver different resources at different times. For example, it can have a CPU performance issue at 12:35 pm and a Disk performance issue at 10:40 pm. The performance line chart shows you any correlation. In the above example, the selected cluster has a Storage performance issue, but is doing well on CPU and RAM.
During the same time interval, different VMs can be hit by different problems. If your IaaS fails to deliver on CPU and Disk at 12:35 pm, VM 007 can be hit with a CPU problem while VM 747 can be hit with a Disk problem. This is why you need to be able to see each resource (CPU, RAM, Disk, Network) independently.
This dashboard depends on the previous dashboard. You select a cluster, then navigate to this dashboard. It will only show VMs from that cluster. You can see which VMs are hit by what (CPU, RAM, Disk, Network). This lets you take the appropriate action before the VM Owner complains.
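The logic of finding victims per resource can be sketched in a few lines. This is only an illustration: the limits, the metric values and the extra VM 101 are invented, and the function name is my own.

```python
def victims_by_resource(vm_samples, limits):
    """Map each resource to the VMs whose value exceeds the limit.

    `vm_samples` holds one reading per VM for the same time interval,
    keyed by resource name. `limits` holds the SLA limit per resource.
    """
    victims = {resource: [] for resource in limits}
    for vm, metrics in sorted(vm_samples.items()):
        for resource, limit in limits.items():
            if metrics.get(resource, 0.0) > limit:
                victims[resource].append(vm)
    return victims

# Hypothetical limits and readings for one cluster at 12:35 pm.
limits = {"cpu": 2.0, "ram": 1.0, "disk": 15.0, "network": 0.1}
vm_samples = {
    "VM 007": {"cpu": 6.5, "ram": 0.0, "disk": 4.0, "network": 0.0},
    "VM 747": {"cpu": 0.4, "ram": 0.0, "disk": 32.0, "network": 0.0},
    "VM 101": {"cpu": 0.2, "ram": 0.0, "disk": 3.0, "network": 0.0},
}
hit = victims_by_resource(vm_samples, limits)
```

Keeping the four resources separate is the point. A single blended score would hide the fact that VM 007 and VM 747 need different fixes.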
Who are the villains?
Which VMs were generating excessive workload? When and for how long? You can see it by tracking the maximum workload generated by any VM on a line chart. The example below shows excessive IOPS. It jumped to 13,212 IOPS when the average did not even touch 15 IOPS.
VMs can only generate excessive workload on IOPS and Network. They can’t abuse CPU and RAM, as they can’t go beyond their configuration. The dashboard tracks IOPS and Network. Once you see a peak, you use the Top-N to list the VMs.
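The max-versus-average comparison and the Top-N step can be approximated as follows. This is a sketch under assumptions: the per-VM IOPS readings are invented, and `max_and_average` and `top_n` are hypothetical helper names, not product features.

```python
def max_and_average(iops_by_vm):
    """Return (max, average) of the IOPS generated by the VMs at one point in time."""
    values = list(iops_by_vm.values())
    return max(values), sum(values) / len(values)

def top_n(iops_by_vm, n=3):
    """Return the n VMs generating the most IOPS, highest first."""
    ranked = sorted(iops_by_vm.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Hypothetical readings taken at the moment of a peak.
iops_at_peak = {"vm-a": 12.0, "vm-b": 900.0, "vm-c": 9.0, "vm-d": 15.0}
peak, average = max_and_average(iops_at_peak)
villains = top_n(iops_at_peak, n=2)
```

A large gap between the peak and the average suggests one or two VMs are responsible, and the Top-N list names them.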
When a VM Owner complains
A VM Owner only cares about her VM. The fact that you have 1001 other VMs is irrelevant. As a result, the fact that your VMware cluster is working hard at 100% utilization is also irrelevant. That’s why the following dashboard does not show other VMs or your infrastructure.
A Help Desk operator can simply search for the VM, or browse the list. Once found, he simply selects that VM. How well your IaaS platform serves it will be automatically shown. The dashboard uses line charts, not a single number, so you can see if there is any pattern. From the example below, it clearly shows that the IaaS was unable to meet its promise on CPU but did well on RAM. It failed for around 20 minutes on Disk.
The above dashboard clearly tells you whether you are serving your customer well. It’s suitable for a Help Desk Operator. What if you need to find out why? In other words, what if you move from monitoring to troubleshooting? From this dashboard, you can navigate to the VM Troubleshooting dashboard.
The troubleshooting dashboard provides additional counters. Performance problems have only 2 main causes:
- The VM itself
  - It is too small, uses the wrong driver, or its apps do not leverage the resources.
- The Infra
  - It is heavily loaded (normally due to lots of small VMs).
  - It is unable to cope (normally due to a large VM).
  - Infra means ESXi, Storage and Network.
The dashboard automatically shows the relevant ESXi and Datastore, with their KPIs.
Again, line charts are used, not a single number, because they give you a lot more info.
Over Provisioning disease
If you take all the large VMs in your environment, and plot the maximum utilization among them, what do you expect?
You are right. It depends on whether they are over provisioned or not. If they are, the max among them will be low. The average will be even lower.
In a healthy, right-sized environment, there is bound to be at least 1 VM that has high utilization at any given time. This is especially true in a large environment.
The line charts below show the Max and Average utilizations among the large VMs. We can easily tell the degree of over provisioning.
The line chart does not show the VMs. That’s where the table comes in. It shows the max utilization of each VM in a given period.
The table does not show relative comparison among these large VMs. If you want to expose the largest VMs, the heat map shows that. The larger the VM, the larger the box.
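The Max and Average lines are easy to reproduce if you have the raw utilization. The sketch below assumes a "large" VM is one with 8 or more vCPUs, which is my own cut-off for illustration, and the VM names and numbers are invented.

```python
def large_vm_utilization(vms, min_vcpus=8):
    """Return [(max, average)] per time step across the large VMs.

    Each VM is a dict with "vcpus" and "cpu_pct", a list of utilization
    samples taken at the same time steps for every VM.
    """
    large = [vm["cpu_pct"] for vm in vms if vm["vcpus"] >= min_vcpus]
    # zip(*large) regroups the samples so that each tuple is one time step.
    return [(max(step), sum(step) / len(step)) for step in zip(*large)]

# Hypothetical fleet: two large VMs that are barely used, one busy small VM.
vms = [
    {"name": "db-01", "vcpus": 16, "cpu_pct": [10.0, 20.0, 30.0]},
    {"name": "app-01", "vcpus": 8, "cpu_pct": [20.0, 10.0, 10.0]},
    {"name": "web-01", "vcpus": 2, "cpu_pct": [90.0, 95.0, 85.0]},
]
trend = large_vm_utilization(vms)
```

Here the max among the large VMs never goes above 30%, which is the kind of flat, low line that points to over provisioning. The busy small VM is ignored on purpose.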
What about undersized? Generally speaking, this is not your problem. But if you want to answer “Which VMs hit high CPU usage when?”, you can use the following dashboard:
The above is what you want to see, indicating only 2 VMs had the problem in the past >1 month. In an environment where many VMs are undersized, you will see something like this. Notice this is not 2 months. This is just 6 hours, and each bar is only 10 minutes!
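The bar chart idea, counting how many VMs ran hot in each 10-minute window, can be sketched like this. The 90% threshold, the minute offsets and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

def high_cpu_vm_counts(samples, threshold_pct=90.0, bucket_minutes=10):
    """Count the distinct VMs above the CPU threshold in each time bucket.

    Each sample is (vm_name, minutes_since_start, cpu_pct). The result maps
    a bucket index to the number of VMs that exceeded the threshold in it.
    """
    vms_per_bucket = defaultdict(set)
    for vm, minute, cpu_pct in samples:
        if cpu_pct > threshold_pct:
            vms_per_bucket[minute // bucket_minutes].add(vm)
    return {bucket: len(vms) for bucket, vms in sorted(vms_per_bucket.items())}

# Hypothetical samples covering the first 20 minutes.
samples = [
    ("vm-a", 0, 95.0),
    ("vm-a", 5, 97.0),   # same VM, same bucket: counted once
    ("vm-b", 5, 92.0),
    ("vm-b", 10, 40.0),  # below the threshold: not counted
    ("vm-c", 15, 99.0),
]
bars = high_cpu_vm_counts(samples)
```

Buckets with no hot VM are simply absent from the result, which matches a chart that is mostly empty in a healthy environment.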
Right-sizing VM without impacting performance
The previous dashboard gives you the overall situation. To right-size, you need to deal with individual VMs. This gives you the confidence that performance will not be affected.
You can select any of the large VMs, starting from the one with the least utilization. The dashboard below will automatically list the VM utilization.
- Each vCPU of the VM is listed in a table. It shows the maximum utilisation of each vCPU in the timeline you are interested in.
- It shows an analysis of the utilization of the VM. The Forensic chart shows the 95th percentile of the VM utilization. You expect that number to be >80%, as a right-sized VM should not spend 95% of the time at just 20% utilization. The Forensic chart also shows you the remaining 5%, so you can be convinced.
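One way to read the 95% figure in the list above is as a 95th percentile. Under that reading, the two bullets can be approximated with the sketch below. It uses a simple nearest-rank percentile, which may differ from what the dashboard computes internally, and the utilization numbers are invented.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def max_per_vcpu(per_vcpu_series):
    """Return the maximum utilization seen on each vCPU."""
    return {vcpu: max(series) for vcpu, series in per_vcpu_series.items()}

# Hypothetical VM: 19 quiet samples at 10% and a single spike to 85%.
utilization = [10.0] * 19 + [85.0]
p95 = percentile(utilization, 95)
peak = max(utilization)

# Hypothetical per-vCPU series for the same VM.
vcpu_peaks = max_per_vcpu({"vcpu0": [10.0, 85.0], "vcpu1": [5.0, 7.0]})
```

A VM whose 95th percentile is 10% while its peak is 85% spends almost all of its time idle. The peak alone would have suggested otherwise, which is why both numbers are worth showing to the VM Owner.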
Most VM Owners will ask for a detailed line chart showing each vCPU utilisation. The line chart below will be automatically shown when a VM is selected. It retains a 5-minute granularity.
RAM right-sizing is more challenging, hence it’s not covered here yet. Review this to know more about why it’s not so simple.
Hope you find the blog useful. For more info, you can refer to chapters 4 – 7 in this book.