
SDDC Dashboards: The Kitchen

This post belongs to the Operationalize Your World post. Do read that first to get the context.

There are only 4 parts in IaaS monitoring:

  1. Capacity
  2. Configuration (with Inventory)
  3. Audit and Compliance
  4. Availability

Can you figure out why we do not have Performance in “the kitchen” area of your restaurant business?

The Performance SLA concept explains why. I’ve also applied it to the VDI use case and given an example.

Capacity

The Capacity dashboards below take into account Performance SLA and Availability SLA. Only when these 2 are satisfied do they consider Utilization. Review this series of blogs for extensive coverage of this new model.

The set of dashboards answers questions such as:

  • What’s the capacity of my clusters?
  • What’s the consumption on the clusters?
  • Which clusters are running low?
  • Is the cluster still coping well with demands?
  • Does a cluster consist of mostly large VMs?

Here is the dashboard for Tier 1, where we do not overcommit. As a result, both performance & utilization are irrelevant. Capacity is driven by the Availability SLA.

The vCPU and vRAM remaining are based on the allocation model, which takes the HA setting into account.
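As a rough illustration of the allocation model, the sketch below works out the vCPU remaining on a cluster that is not overcommitted. The function name, the 1:1 vCPU-to-core mapping, and expressing HA as a number of reserved hosts are my assumptions for the example. This is not how vR Ops calculates it internally.

```python
def vcpu_remaining(hosts, cores_per_host, ha_hosts, vcpu_allocated):
    """Allocation-based vCPU remaining for a cluster with no overcommit.

    Assumptions: HA is expressed as a number of hosts held in reserve,
    and 1 vCPU maps to 1 physical core.
    """
    usable_hosts = max(hosts - ha_hosts, 0)
    usable_cores = usable_hosts * cores_per_host
    return usable_cores - vcpu_allocated


# 8 hosts x 24 cores, 1 host reserved for HA, 120 vCPU already allocated
print(vcpu_remaining(hosts=8, cores_per_host=24, ha_hosts=1, vcpu_allocated=120))
```

A negative result means you have already allocated more than the cluster can hold once HA is set aside.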

BTW, the lines in the 2 line charts above do not gradually come down (or up) because this is a lab, not a real-life environment. Your production environment will have line charts that make sense 🙂

Here is the dashboard for Tier 2 or 3. Since we overcommit, we now have to take into account performance, and then utilization.

capacity-tier-2-compute

As you can see from the above dashboard, it has 3 sections:

  • Availability SLA. Do we reach the concentration risk?
  • Performance SLA. Do we serve existing workload well?
  • Utilisation. It uses the net usable capacity as the ceiling. This ceiling takes into account your HA settings and Buffer. The default value for buffer is 10%, which you can change via policy.
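To make the ceiling concrete, here is a minimal sketch of the arithmetic. The function names and the order of deduction (HA first, then the buffer on what is left) are assumptions for illustration; the product may apply them differently.

```python
def net_usable_capacity(total, ha_fraction, buffer_fraction=0.10):
    """Capacity ceiling after setting aside HA and the buffer.

    Assumption: the buffer is applied to what is left after HA.
    """
    after_ha = total * (1 - ha_fraction)
    return after_ha * (1 - buffer_fraction)


def utilisation_pct(demand, total, ha_fraction, buffer_fraction=0.10):
    """Demand as a percentage of the net usable capacity."""
    return 100 * demand / net_usable_capacity(total, ha_fraction, buffer_fraction)


# 1,000 GHz cluster, 25% reserved for HA, default 10% buffer, 540 GHz demand
print(round(net_usable_capacity(1000, ha_fraction=0.25), 1))
print(round(utilisation_pct(540, 1000, ha_fraction=0.25), 1))
```

The point of using the net figure is that 100% utilisation means you have run out of capacity you are allowed to use, not that the hardware is saturated.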

Can you spot a limitation on the capacity dashboards I’ve shown so far?

Yes, it’s hard to compare across clusters. If you have many clusters, you want to know which clusters to check first. This dashboard lets you compare. It’s color coded so it’s easier for you to see.

For implementation details, refer to this post.

The twin-sister of Capacity is Reclamation. What can you reclaim and from which VMs?

reclamation

For implementation details, refer to this post.

Configuration 

In the world of Software-Defined, configurations are easy to change. So consistency and drift become 2 areas you need to watch.

The set of dashboards answers questions such as:

  • Are my ESXi configurations consistent, especially if the hosts are members of the same cluster?
  • Are my ESXi & Clusters configured to follow best practice?
  • Do I have too many combinations, which increases complexity?
  • What have I got?
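The consistency question can be sketched in a few lines. The example below groups hosts by cluster and reports the hosts whose setting differs from the majority. The field names and the rule that the most common value is the standard are assumptions for illustration, not something the dashboards expose in this form.

```python
from collections import Counter, defaultdict


def find_drift(hosts, keys):
    """Return {cluster: {key: [hosts that differ from the majority value]}}.

    Assumption: the most common value in a cluster is treated as the standard.
    """
    by_cluster = defaultdict(list)
    for host in hosts:
        by_cluster[host["cluster"]].append(host)

    drift = {}
    for cluster, members in by_cluster.items():
        for key in keys:
            counts = Counter(h[key] for h in members)
            if len(counts) > 1:
                standard, _ = counts.most_common(1)[0]
                odd = [h["name"] for h in members if h[key] != standard]
                drift.setdefault(cluster, {})[key] = odd
    return drift


# Hypothetical inventory: esx-03 runs a different version from its peers
hosts = [
    {"name": "esx-01", "cluster": "Tier1", "version": "6.0", "ntp": "ntp.example.com"},
    {"name": "esx-02", "cluster": "Tier1", "version": "6.0", "ntp": "ntp.example.com"},
    {"name": "esx-03", "cluster": "Tier1", "version": "5.5", "ntp": "ntp.example.com"},
]
print(find_drift(hosts, ["version", "ntp"]))
```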

The dashboard below is for ESXi:

configuration-esxi

The dashboard below is for Cluster:

configuration-cluster

That’s all you can do in vR Ops 6.4. If you need more details, you need to deploy VCM. The latest release is 5.8.3. For the list of configurations that it can track per object, review this.

Inventory differs from Configuration.

  • Configuration has Standard, and hence Drift. Inventory does not.
  • Configuration has Compliance. Inventory does not. Well, not generally 😉
  • Configuration has values that can be bad (e.g. ESXi has no syslog). Inventory does not.
  • Inventory has a stock take (typically annual). This can trigger work, which impacts Configuration.
  • Inventory is typically reported on a regular basis.

Because of the above, we’ve provided a purpose-built dashboard to track inventory.

inventory

Audit and Compliance

You can check your environment’s compliance with the vSphere Hardening Guide. The dashboard below shows the summary of compliance, with the ability to drill down to each object.

capture

vCenter tasks, events and alarms are 3 areas that you can mine to help answer compliance and audit questions. Log Insight complements vR Ops nicely here. For example, the following screenshot answers this audit question:

  • Who shut down which VM, and when?

compliance

There are many things it can answer, and it’s covered in the workshop.

Availability

Because of HA and DRS, tracking the Cluster makes more sense than tracking each ESXi. A cluster’s uptime remains 100% when 1 host is not available because you have HA. You have catered for that, and as a result, you should not be penalized.
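Here is a minimal sketch of that idea: the cluster only counts as unavailable when it loses more hosts than HA was sized for. The function names and the sample format are assumptions for illustration.

```python
def cluster_available(hosts_up, hosts_total, ha_hosts_tolerated):
    """A cluster counts as available while it has lost no more hosts than HA covers."""
    hosts_down = hosts_total - hosts_up
    return hosts_down <= ha_hosts_tolerated


def availability_pct(samples, hosts_total, ha_hosts_tolerated):
    """Percentage of samples in which the cluster was available.

    `samples` is a list of "hosts up" counts, one per collection interval.
    """
    if not samples:
        raise ValueError("no samples")
    ok = sum(cluster_available(up, hosts_total, ha_hosts_tolerated) for up in samples)
    return 100 * ok / len(samples)


# 4-host cluster that tolerates 1 host failure; one interval loses 2 hosts
print(availability_pct([4, 4, 3, 3, 2, 4, 4, 4], hosts_total=4, ha_hosts_tolerated=1))
```

The two intervals with 3 hosts up do not reduce the availability, which is exactly the "you should not be penalized" point.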

The set of dashboards answers questions such as:

  1. What’s the availability (%) of each cluster in the last 24 hours? Each cluster has its own line chart, and it’s color coded. You expect a green bar, as shown below.
  2. What’s the availability now? The heatmap provides that answer quickly. You can drill down into the cluster if you spot a problem.
  3. Am I containing risk when there is a major outage? How many VMs am I willing to lose when a cluster or datastore goes down?

availability-cluster

The heat map also provides the ESXi uptime. You can toggle between Cluster and ESXi.

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

SDDC Dashboards: The Dining Area

This post continues from the Operationalize Your World post. Do read it first so you get the context.

If your customers are happy, your internal problem is secondary.

To ensure that your customers are happy, there are a few things you must be able to prove:

  1. Are the VMs up?
    • This is the #1 Job. It is more important than security and performance. If the VM is dead, there is nothing to talk about 🙂
  2. Are they fast?
    • Just because they are up does not mean they are fast!
    • Is your IaaS serving them well?
  3. If not, which VMs are hit? By what and when?
    • Who are the victims?
  4. Who’s causing the problem?
    • Who are the villains?
    • We saw a performance degradation on a cluster of 500 VMs when just 1-2 VMs ran IOmeter.
    • VMs with excessive usage hurt the business.
  5. When a VM Owner complains, can your Help Desk add value within 1 minute?
    • Who has time to play corporate ping-pong when there is so much to do?
  6. We know we have the Over Provisioning disease.
    • But how bad is it exactly? It impacts both Performance and Capacity.
    • Can you right-size VMs without impacting performance?
  7. Are there any configuration issues you need to be aware of?

Let’s go through the dashboards that answer those questions, starting from Question 1.

Are the VMs up?

The dashboard helps in the following areas:

  1. What’s the overall uptime? Your CIO may ask for the overall uptime over time. You can provide a line chart showing the aggregate uptime across all the VMs.
  2. What’s the Uptime for each VM per month? The table on the dashboard is grouped by month. It’s showing Sep 2016. All VMs are showing 100%, which is what you want to see before you go for lunch or holiday 🙂
  3. What’s the VM availability now? The heat map provides an easy visualisation. You just expect green for all VMs.
  4. If a VM’s Uptime is <100%, when was it down and for how long? You can click on the heat map, and a line chart will be shown automatically. What you want to see is a straight line.

Availability - VM

Are they fast?

The dashboard helps in the following areas:

  1. Is your IaaS serving them well? If not, when does it fail to deliver?
    • If you do not define “well”, you have not quantified “fast”. If you have not defined it, you have not set a measurable expectation. That’s not a position you want to take, unless you enjoy performance troubleshooting 🙂
    • Measurable expectation = Performance SLA. Review this to help you.
  2. Which part of your IaaS business fails to deliver the promise?
    • In IaaS, you are only selling CPU, RAM, Disk and Network.
    • The VMs are consuming these 4 resources. Make sure they get what you promise them.
  3. How is the performance per cluster?
    • vSphere Cluster is the smallest logical building block, due to DRS and HA.

picture1

In the dashboard, the Performance SLA is dynamic. When you select a cluster of a different tier, notice that the SLA changes too. You can adjust the SLA to your actual number.
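The tier-dependent SLA can be pictured as a simple lookup. In the sketch below, the tier names, metric names and threshold values are all hypothetical placeholders. Replace them with the numbers you have actually agreed with your customers.

```python
# Hypothetical SLA thresholds per tier; replace with your own numbers.
PERFORMANCE_SLA = {
    "Tier 1": {"cpu_contention_pct": 1.0, "ram_contention_pct": 0.0, "disk_latency_ms": 10.0},
    "Tier 2": {"cpu_contention_pct": 3.0, "ram_contention_pct": 5.0, "disk_latency_ms": 20.0},
}


def sla_breaches(tier, metrics):
    """Return the metrics whose value exceeds the SLA of the given tier."""
    sla = PERFORMANCE_SLA[tier]
    return {name: value for name, value in metrics.items()
            if name in sla and value > sla[name]}


# The same VM metrics judged against two different tiers
vm = {"cpu_contention_pct": 2.0, "ram_contention_pct": 0.0, "disk_latency_ms": 8.0}
print(sla_breaches("Tier 1", vm))
print(sla_breaches("Tier 2", vm))
```

The same numbers breach the stricter tier and pass the looser one, which is why the SLA line has to move when you pick a cluster of a different tier.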

picture2

You can easily customize the dashboard. In the example below, the dashboard now has Network performance and related objects.

performance-monitoring

Yes, the dashboard is all you need to know if the cluster is performing well or not.

  • If there is no VM affected, it’s good. No need to analyse further.
  • If there are VMs affected, we want to know which ones. The dashboard lists the Top 30 VMs in terms of:
    • CPU Contention
    • RAM Contention
    • Disk Latency
    • Network dropped packets (ensure it is 0)
    • Network latency (this needs NetFlow, which vR Ops cannot do)
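If you export the numbers and want to reproduce the Top-N list outside the dashboard, a sketch like the following would do. The field names and sample values are made up for the example.

```python
import heapq


def top_n(vms, metric, n=30):
    """Return the n VMs with the highest value of `metric`, worst first."""
    return heapq.nlargest(n, vms, key=lambda vm: vm[metric])


def affected(vms, metric, threshold):
    """VMs whose metric is above the threshold you consider acceptable."""
    return [vm for vm in vms if vm[metric] > threshold]


# Hypothetical sample: only vm-b drops packets
vms = [
    {"name": "vm-a", "cpu_contention_pct": 0.4, "dropped_packets": 0},
    {"name": "vm-b", "cpu_contention_pct": 7.1, "dropped_packets": 12},
    {"name": "vm-c", "cpu_contention_pct": 2.3, "dropped_packets": 0},
]
print([vm["name"] for vm in top_n(vms, "cpu_contention_pct", n=2)])
print([vm["name"] for vm in affected(vms, "dropped_packets", threshold=0)])
```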

Who are the victims?

Once you know which cluster has the problem, and the time & type of problem, you can drill down.

Your IaaS can fail to deliver different resources at different times. For example, it has a CPU performance issue at 12:35 pm and a Disk performance issue at 10:40 pm. The performance line chart shows you the correlation, if any. In the above example, the selected cluster has a Storage performance issue, but is doing well on RAM.

During the same time interval, different VMs can be hit by different problems. If your IaaS fails to deliver on CPU and Disk at 12:35 pm, VM 007 can be hit with CPU problem while VM 747 can be hit with Disk problem. This is why you need to be able to see each resource (CPU, RAM, Disk, Network) independently.

picture3

[e1: there is a known bug that prevents you from having 4 equal columns]

This dashboard depends on the previous dashboard. You select a cluster, then navigate to this dashboard. It will only show VMs from that cluster. You can see which VMs are hit by what (CPU, RAM, Disk, Network). This lets you take the appropriate action before the VM Owner complains.

Network latency cannot be done in vR Ops 6.5. Use vRealize Network Insight for it.

The packet drop counter can be unreliable if you are not at the right patch level. See this KB. This issue is resolved in:

  • ESXi 6.0, Patch ESXi-6.0.0-20160804001-standard
  • ESXi 5.5, Patch ESXi550-201312401-BG: Updates esx-base

Who are the villains?

Which VMs were generating excessive workload? When and for how long?

You can see it by tracking the maximum workload generated by any VM on a line chart. The example below shows an excessive IOPS. It jumped to 13,212 IOPS when the average did not even touch 15 IOPS.

picture1

VMs can only generate excessive workload on IOPS and Network. A VM can’t abuse CPU and RAM, as it can’t go beyond its configuration. The dashboard tracks IOPS and Network. Once you see a peak, you use the Top-N to list the VMs.
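The max-versus-average technique is easy to sketch. The example below builds the two lines from per-VM IOPS, finds the interval with the highest peak, and ranks the VMs in that interval. The data layout and function names are assumptions for illustration.

```python
def max_and_average(iops_by_vm):
    """iops_by_vm: {vm: [iops per interval]}, all lists the same length.

    Returns two lists: the max and the average across VMs per interval.
    """
    series = list(zip(*iops_by_vm.values()))
    maximum = [max(values) for values in series]
    average = [sum(values) / len(values) for values in series]
    return maximum, average


def villains_at_peak(iops_by_vm, n=3):
    """Find the interval with the highest max, then rank the VMs in that interval."""
    maximum, _ = max_and_average(iops_by_vm)
    peak = maximum.index(max(maximum))
    ranked = sorted(iops_by_vm, key=lambda vm: iops_by_vm[vm][peak], reverse=True)
    return peak, ranked[:n]


# Hypothetical sample: vm-b spikes in the second interval
iops = {
    "vm-a": [10, 12, 11],
    "vm-b": [9, 13212, 14],
    "vm-c": [8, 10, 9],
}
print(max_and_average(iops)[0])
print(villains_at_peak(iops, n=1))
```

A large gap between the max line and the average line is the signal that one or two VMs, rather than the whole population, are generating the load.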

picture7

For details on how this dashboard helped a customer who was hit by IOmeter, see this.

When a VM Owner complains

A VM Owner only cares about her VM. The fact that you have 1001 other VMs is irrelevant. As a result, the fact that your VMware cluster is working hard at 100% utilization is also irrelevant. That’s why the following dashboard does not show other VMs or your Infrastructure.

Using the dashboard, a Help Desk operator can search for the VM, or browse the list. For every VM, we show the key properties, such as No of vCPU, RAM size, CPU Contention, RAM Contention, etc. The columns can be customised.

Once found, he simply selects that VM. How well your IaaS platform serves it will be shown automatically. The dashboard uses line charts, not a single number, so you can see if there is any pattern.

Below is all the dashboard shows! That’s all, because it’s about Monitoring, not Troubleshooting.

The dashboard sports Performance SLA.

  • This is the most important feature of this dashboard, as it allows you to clear performance issues quickly.
  • That SLA line is dynamic. It varies, depending on which tier the VM belongs to.
  • To change the threshold, simply change the value in the super metric named Performance SLA. There is no need to modify policy.
  • For networking, the general expectation is 0 dropped packets, hence there is no need for an SLA line. We show both TX and RX instead, so you can see more precisely where the issue is.

The example below clearly shows the IaaS unable to meet its promise on CPU but doing well on RAM. It failed for around 20 minutes on Disk. You don’t even have to wait for the VM Owner to complain. You can be proactive and discuss the need for additional hardware or an upgrade.

If you need to see how the dashboard works, here is a short video.

The above dashboard clearly tells you whether you are serving your customer well. It’s suitable for a Help Desk Operator. All they need to see is whether it’s above the threshold or not. Once you have operationalized IaaS, this dashboard is the easy part. You can actually make it self-service if you have a formal agreement.

What if you need to find out why? In other words, you move from monitoring to troubleshooting. From this dashboard, you can navigate to the VM Troubleshooting dashboard.

picture5

Troubleshooting is a world by itself. The diagram below shows a partial list of what can cause a performance issue.

A performance problem can have only 2 main causes:

  1. The VM itself
  2. The Infra is unable to serve the VM.

For the VM, here are some possible reasons:

  • Utilization is high. It does not have enough capacity.
  • The VM is too big. Processes ping-pong among the vCPUs, so the number of context switches is very high.
  • Configuration is wrong. Examples: NUMA, storage driver, network driver, app configuration.
  • Bug. This can result in high utilization (e.g. memory leak).
  • The app does not scale well. It’s not able to take advantage of all the vCPUs, and the load is concentrated on just a few.

For Infrastructure, look for signs that it’s heavily loaded or too small for the VM:

  • Is the VM’s vCPU count too big relative to the Host cores?
  • Was there a vMotion at the time of the issue?

Do take note that what is considered “high” is relative. This is where performance troubleshooting is not just science, but also art. Also, not all counters indicate a performance problem. Examples:

  • A high number of processes inside the Guest OS does not correlate to a performance issue if they are mostly idle and do not cause a lot of context switches. On the other hand, you can have a process ping-ponging among the vCPUs even though there aren’t many processes running.
  • VM RAM being ballooned out does not mean the VM experiences performance degradation. A RAM performance problem only happens when the CPU wants to access a page and is waiting for RAM. It has to wait if the page is not available in RAM because it was ballooned out, swapped, compressed, etc. So track the swap in, not the swap out.

We can’t show all the metrics and possibilities shown above. Here is what we can do. You can customize it to show more. You should also build a custom Log Insight dashboard to complement this.

The VM is selected from the previous dashboard (Single VM Monitoring). Related Datastores are automatically shown, along with their KPIs. The related ESXi cannot be shown automatically, as the ESXi where the VM is running now might not be the ESXi where it was running when the problem occurred. On the dashboard, choose the metric Parent Host manually, as shown above. You can then see if the VM was on a different ESXi.

  • Compute: ESXi
    • CPU Contention, RAM Contention
    • CPU Demand, RAM Consumed, RAM Active
  • Storage: Datastore
    • Read Latency, Write Latency
    • Outstanding IO request
  • Network:
    • We are not showing network because Network should have 0 dropped packet, plus in general it’s hard to saturate 2x 10 GE

Again, line charts are used, not a single number, because they give you a lot more info.

Do note that VM CPU Workload can exceed 100% as it accounts for CPU contention and overhead.

Over Provisioning disease

If you take all the large VMs in your environment, and plot the maximum utilization among them, what do you expect?

You are right. It depends whether they are over provisioned or not. If they are, the max among them will be low. The average will be even lower.

In a healthy, right-sized environment, there is bound to be at least 1 VM that has high utilization at any given time. This is especially true in a large environment.

The line charts below show the Max and Average utilization among the large VMs. We can easily tell the degree of over-provisioning.

The line chart does not show the VMs. That’s where the table comes in. It shows the max utilization of each VM in a given period.

The table does not show relative comparison among these large VMs. If you want to expose the largest VMs, the heat map shows that. The larger the VM, the larger the box.
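The table view can be approximated with a short filter. In the sketch below, the definition of "large" (8 vCPU or more) and the 50% peak threshold are placeholders I picked for the example; use whatever definition your environment has agreed on.

```python
def oversized_candidates(vms, min_vcpu=8, low_peak_pct=50.0):
    """List large VMs whose peak CPU utilisation stayed low.

    Assumptions: "large" means >= min_vcpu, and a peak below low_peak_pct
    suggests over-provisioning. Both thresholds are placeholders.
    """
    large = [vm for vm in vms if vm["vcpu"] >= min_vcpu]
    rows = [(vm["name"], vm["vcpu"], max(vm["cpu_pct"])) for vm in large]
    low = [row for row in rows if row[2] < low_peak_pct]
    return sorted(low, key=lambda row: row[2])


# Hypothetical sample: two large VMs never get busy, one does, one is small
vms = [
    {"name": "db-01", "vcpu": 16, "cpu_pct": [12, 18, 22]},
    {"name": "app-01", "vcpu": 8, "cpu_pct": [40, 85, 60]},
    {"name": "web-01", "vcpu": 2, "cpu_pct": [5, 6, 4]},
    {"name": "bi-01", "vcpu": 12, "cpu_pct": [3, 9, 7]},
]
print(oversized_candidates(vms))
```

The result is sorted by peak utilisation, lowest first, so the strongest right-sizing candidates come out on top.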

picture1

What about undersized? Generally speaking, this is not your problem. But if you want to answer “Which VMs hit high CPU usage when?”, you can use the following dashboard:

picture3

The above is what you want to see, indicating only 2 VMs had the problem in the past >1 month. In an environment where many VMs are undersized, you will see something like this. Notice this is not 2 months. This is just 6 hours, and each bar is only 10 minutes!

picture4

Right-sizing VM without impacting performance

The previous dashboard gives you the overall situation. To right-size, you need to deal with individual VMs. Looking at each VM individually gives you the confidence that performance will not be affected.

You can select any of the large VMs, starting from the one with the least utilization. The dashboard below will automatically list the VM utilization.

  • Each vCPU of the VM is listed in the table. It shows the maximum utilisation of each individual vCPU in the timeline you are interested in.
  • It shows analysis of the utilization of the VM. The Forensic chart shows 95% of the VM utilization. You expect that number to be >80% as a VM can’t be spending 95% of the time doing just 20% utilization. The Forensic also shows you the remaining 5%, so you can be convinced.
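The two views can be sketched as a per-vCPU maximum plus a percentile of the VM utilisation. I am reading the "95%" above as a 95th percentile, which is my interpretation; the nearest-rank method below is just one common way to compute it, and the function names are made up for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def vcpu_peaks(per_vcpu):
    """per_vcpu: {vcpu index: [utilisation % per interval]} -> max per vCPU."""
    return {vcpu: max(samples) for vcpu, samples in per_vcpu.items()}


# Hypothetical VM: idle for 19 intervals, one short burst to 90%
vm_cpu_pct = [10] * 19 + [90]
print(percentile(vm_cpu_pct, 95))
print(percentile(vm_cpu_pct, 100))
print(vcpu_peaks({0: [5, 80], 1: [3, 4]}))
```

Comparing the 95th percentile with the absolute maximum shows whether the peak is a sustained load or a short burst, which matters when you argue for a smaller VM.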

1-vm

Most VM Owners will ask for a detailed line chart showing each vCPU utilisation. The line chart below will be automatically shown when a VM is selected. It retains a 5-minute granularity.

picture2

RAM right-sizing is more challenging as you need Guest OS metrics, not VM metrics. vR Ops 6.3 sports the ability to pull this data using just VMware Tools.

vm-right-sizing-memory

Are they configured consistently?

Are there any “bad” configurations we need to know about? The dashboard lists VM configurations that need attention:

  • Do I have large VMs? If yes, what are their configurations? We cover CPU, RAM and Disk separately.
  • Do I have VMs connected to >1 network? They can bridge your network, so this should be reserved for Networking VMs only.
  • Do I have VMs with large snapshots? If yes, which VMs and how big?
  • Do I have VMs with old virtual hardware? If yes, which versions and how many?
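These checks boil down to a handful of simple rules per VM. The sketch below shows the idea. The field names, the 20 GB snapshot limit and the minimum hardware version are hypothetical placeholders, not recommendations.

```python
def needs_attention(vm, max_snapshot_gb=20, min_hw_version=10):
    """Return the list of configuration findings for one VM.

    The thresholds are placeholders, not recommendations.
    """
    findings = []
    if len(vm["networks"]) > 1:
        findings.append("connected to more than 1 network")
    if vm["snapshot_gb"] > max_snapshot_gb:
        findings.append("large snapshot")
    if vm["hw_version"] < min_hw_version:
        findings.append("old virtual hardware")
    return findings


# Hypothetical VM that trips two of the three rules
vm = {"name": "app-01", "networks": ["prod", "dmz"], "snapshot_gb": 4, "hw_version": 8}
print(needs_attention(vm))
```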

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

Further Reading

Hope you find the blog useful. For more info, you can refer to chapters 4 – 7 in this book.

Sunny Dua and Simon Eady events

Sunny Dua and Simon Eady have been running a monthly WebEx where they share their knowledge on VMware vRealize Operations. The latest one is coming this Thursday, 25th August. It’s 1:30 PM – 2:45 PM Singapore time. I know it’s not a good time for certain cities. If you cannot make it, it’s recorded.

I’ll join them in the next session. We are hoping to answer questions like the following. We put some answers in light-hearted words as you know it’s a serious question.

Capture

We live in an era where society is hypersensitive to people who are not sensitive. In the example above, I use her but I meant her/his/him.

The session aims to help you monitor performance and capacity. Hopefully, you gain a new perspective, and questions like the following will make sense:

2

3

You will also be able to answer questions like this:

4

See you next week!