Any VM abusing your IaaS by doing excessive workload?

This post continues from the Operationalize Your World post. Do read it first so you get the context.

You provide IaaS to your customers. Typically, you are not given login access to their VMs. As a result, you have no practical control over what runs inside them. They can generate excessive workload at any time, without you knowing it.

I’ve seen just 1% of the VM population do damage to the entire cluster. Yes, that means in a cluster of 300 VMs, it can take only 3 VMs generating excessive workload for many other VMs to suffer.

A VM consumes 5 resources:

  • vCPU
  • vRAM (GB)
  • Disk Space
  • Disk IOPS
  • Network (Mbps)

The first 3 you can cap and control. When you give a VM 4 vCPU, 16 GB vRAM and 100 GB of vDisk, that’s all it can take. The Guest OS can run at full speed, doing as much work as it can, and it will not exceed 4 vCPU, 16 GB vRAM and 100 GB of space.

The last 2 you can also control, but normally you don’t. It takes effort, as the default in VMware vSphere is no control (read: unlimited).

You should control them. In fact, you can turn this into additional revenue. Here are some ideas (with an enforcement sketch after the list):

  • Every VM comes with 500 IOPS, averaged over a 5-minute period, free of charge. But the customer can pay a flat monthly fee for unlimited IOPS.
  • If you operate in a global environment, where WAN links and Internet bandwidth are not unlimited, you should charge for bandwidth. You define your service offering: every VM comes with 100 Mbps free of charge. The customer can pay a flat monthly fee for unlimited bandwidth.
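
If you decide to enforce the IOPS tier at the hypervisor level, vSphere supports a per-disk IOPS limit. Here is a minimal pyVmomi sketch, assuming you already have a connected session and a vim.VirtualMachine object named vm; the 500 IOPS value simply mirrors the offering above. Treat it as a starting point, not a production script.

```python
# Sketch: cap every virtual disk of a VM at a given IOPS limit.
# Assumes an existing pyVmomi session and a vim.VirtualMachine object `vm`.
from pyVmomi import vim

def set_disk_iops_limit(vm, iops_limit=500):
    """Set the per-disk IOPS limit on all virtual disks of `vm`."""
    device_changes = []
    for device in vm.config.hardware.device:
        if not isinstance(device, vim.vm.device.VirtualDisk):
            continue
        # storageIOAllocation.limit is in IOPS; -1 means unlimited (the default)
        device.storageIOAllocation = vim.StorageResourceManager.IOAllocationInfo(
            limit=iops_limit
        )
        device_changes.append(vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
            device=device,
        ))
    spec = vim.vm.ConfigSpec(deviceChange=device_changes)
    return vm.ReconfigVM_Task(spec=spec)  # returns a Task; wait on it as needed
```

For network, the equivalent control is traffic shaping on the port group, or Network I/O Control on a distributed switch.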

The above is to encourage proper usage.

The application team does not normally know how many IOPS or how much network bandwidth they need. They also do not know how much network bandwidth and disk IOPS your IaaS can provide. So they may test it regularly, to ensure your IaaS is good. They may download IOmeter and drive 10K IOPS for 5 minutes every day, to ensure your IaaS can handle the load when they need it. That can hit your IaaS badly. You need to track this excessive usage.

Real-life example

It’s easier to understand how the dashboard helps by using a real case.

This is a story from a customer who was hit by a VM running IOmeter. There were 500 VMs in the cluster, and the customer did not know when the hit took place. They only knew it was recent.

To find this, we plot the maximum IOPS from any VM in the cluster over a 1-week period. The line chart shows the maximum IOPS of a VM. It does not matter which VM. If a VM, any VM, generated excessive IOPS, we would know when and for how long. Even if the VMs take turns generating the IOPS, it will be captured, because the super metric evaluates the formula afresh every 5 minutes. At 8:00 am it can be VM 007; at 8:05 am it can be VM 8888.
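
To make "evaluates the formula afresh every 5 minutes" concrete, here is a small Python sketch of what the Max super metric effectively computes. The VM names and IOPS samples are made up for illustration; in vRealize Operations the super metric does this for you.

```python
# Sketch: per-interval maximum across VMs, as the Max() super metric computes.
# One value per 5-minute interval; all numbers are made-up illustrations.
iops_samples = {
    "vm-007":  [120, 90, 13212, 110],  # bursts in the 3rd interval
    "vm-8888": [80, 9500, 70, 60],     # bursts in the 2nd interval
    "vm-042":  [15, 20, 18, 25],       # a typical quiet VM
}

num_intervals = len(next(iter(iops_samples.values())))
for t in range(num_intervals):
    values_at_t = {vm: series[t] for vm, series in iops_samples.items()}
    top_vm = max(values_at_t, key=values_at_t.get)
    # The metric stores only the maximum value, not the VM's identity,
    # which is why a different VM can "own" the peak in each interval.
    print(f"interval {t}: max = {values_at_t[top_vm]} IOPS (from {top_vm})")
```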

We are tracking the counter at the Virtual Disk level, not lower. That means the IO is coming from the VM itself. It is not vSphere taking a snapshot, vCenter doing a storage vMotion, nor vSAN doing rebalancing.

What number do you expect?

My take is 1,000 IOPS, unless your cluster has a lot of VMs doing heavy IOPS. 1,000 IOPS can do damage, as it’s a 5-minute average. That is actually 1,000 x 5 x 60 = 300K IOs performed in 5 minutes! It’s not normal to issue 300K IOs in 5 minutes flat.

As you can see from the screenshot, we quickly found the problem.

[Figure 1: line charts of Maximum and Average VM IOPS in the cluster over 7 days]

In the preceding 2 line charts, we see a very high peak at 13,212 IOPS. That’s a lot of IO. It is nearly 4 million IOs issued in 5 minutes flat (13,212 x 300 seconds).

I plot 7 days so we can see the extent of the peak relative to normal workload. As you can see, this is not normal workload. It stands out.

The Maximum line takes care of spotting the excessive usage. But what about your environment as a whole? And how many VMs are doing this excessive IOPS?

The second line chart shows the Average. Notice it only went up to 15 IOPS. That means this is not a population-wide issue. The peak is likely the work of 1 VM, as the average remains low.

In general, you should expect the average to be <50 IOPS. Remember, it’s a 5-minute sustained average. A cluster with 300 VMs averaging 100 IOPS each means your storage is hit by 30,000 IOPS sustained for 5 minutes. That’s a lot of IOPS. You need SSDs to handle that load.

If the average is near the maximum, and the maximum is high, that means there are a lot of VMs doing high IOPS. Your infra is being hammered.

Let’s zoom into the peak. We can see that it peaked at around 3:17 am on 24 May. We can find out which VM did this. This is one reason I find vRealize Operations powerful: I can zoom into any period of time and get any info about any object.

[Figure 2: zooming into the peak at around 3:17 am on 24 May]

To list the VMs doing the IOPS at around 3:17 am on 24 May, I use the Top-N widget. I wanted to see not just the top VM, but all the VMs, to verify my earlier suspicion that only 1 VM was doing the excessive IOPS. The Top-N sorts the VMs by IOPS.

Bingo!

We got the culprit. Notice the number (13,212 IOPS) matches the line chart. Notice also that the next VM, at 715 IOPS, is doing far lower IOPS.

[Figure 3: Top-N widget listing the VMs by IOPS]

You need to set the period to 5 minutes, as Top-N takes the average over the selected period. Do not select 1 hour, for example, as it will give the average over the entire hour, which dilutes a short burst. A sketch of this dilution follows.
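
Here is a quick illustration, with made-up numbers, of how a longer period dilutes a burst:

```python
# Sketch: why Top-N over 1 hour hides a 5-minute burst.
# 12 five-minute samples make up 1 hour; the burst occupies one sample.
samples = [10] * 11 + [13212]          # 11 quiet samples, then the burst
hourly_average = sum(samples) / len(samples)
print(f"5-minute peak:  {max(samples)} IOPS")        # 13212
print(f"1-hour average: {hourly_average:.0f} IOPS")  # ~1110, diluted ~12x
```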

Dashboard

This is what the dashboard looks like. We have Storage on the left and Network on the right.

For Network, the number is in KBps, as it follows what you see in vCenter. We convert it into Mbps using a super metric.

Implementation

You can download the dashboard here. Do note that you get all 50 dashboards, as this one is part of a set.

Paul Armenakis gave constructive feedback that I should include the super metric formulas. Thanks Paul! Here they are:

  • Max(${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=virtualDisk|commandsAveraged_average, depth=2})
  • Avg(${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=virtualDisk|commandsAveraged_average, depth=2})
  • Max(${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=net|usage_average, depth=2}) * 8 / 1024
  • Avg(${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=net|usage_average, depth=2}) * 8 / 1024

One thing I like about super metrics is that you can also use them to convert values. The default unit in vSphere is KB/s, and I convert it into Mbps. The depth=2 in the formulas means the super metric, applied at the cluster level, looks 2 levels down (cluster → host → VM).
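
As a quick sanity check of the * 8 / 1024 factor in the formulas above, here is the same conversion in plain Python:

```python
# Sketch: the KB/s -> Mbps conversion used in the network super metrics.
def kbps_to_mbps(kilobytes_per_second):
    # x 8: kilobytes to kilobits; / 1024: kilobits to megabits
    return kilobytes_per_second * 8 / 1024

print(kbps_to_mbps(12800))  # 12,800 KB/s -> 100.0 Mbps
```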

As an example, here is how you create super metrics. For step-by-step instructions on how to create a super metric, please see this.

[Figure: creating the super metric]

For Network, if this number is >1 Gbps most of the time, it is high. Unless you have a VM doing very high traffic, the line chart will not be consistently high. Please note that the maximum is 2 Gbps, as it’s full duplex.

For the Average utilisation, I’d expect this to be <100 Mbps. Remember, it’s a 5-minute sustained average. Just like most VMs are not storage intensive, most VMs are not network intensive either.

Enhancement

The line chart tells you the pattern over time. But it does not tell you, at a glance, the distribution among the VMs. The Top-N widget lists the VMs, but it’s not scalable. You can list the top 20 or 40, but you won’t list the top 2000; it would not make sense anyway. Also, you cannot see whether latency is affected.

This is where the Heat Map widget comes in handy. We can plot all the VMs. You can have >1000 VMs and still be able to tell visually at a glance. Each box represents 1 VM. I’ve shown an example below. This is not from the same time period, and the peak has subsided.

[Figure 4: heat map of VMs, sized by IOPS and colored by latency]

I color the above heat map by latency. You can set the threshold to any value you want. I set 15 ms as red, so I can tell quickly if any VM experiences latency of 15 ms or more. From the above, I do have a few VMs at 15 ms or more, but in general they are good.

If you see a big box, that means you have a VM doing excessive IOPS. I do not have such a scenario in the above. There are some VMs doing IOPS, but the box sizes are quite evenly spread. The tiny boxes are normal, as most VMs should be relatively idle from an IOPS viewpoint.

For Network, what do you use for the Color?

We do not have a Network Latency counter in vSphere. You can use Dropped Packets as the “latency” indicator, so expect to see green, as VMs should not be dropping packets. If you know that your network is healthy, you can be creative and use Network Usage for the color too. Why use Usage for both color and size? It does not seem logical, right? I’ll let you ponder for a while; review the sample below. I use Network Usage for both color and size.

[Figure 5: heat map of VMs, with Network Usage for both size and color]

Managed to figure out why?

Good. The issue with size is that it is relative; it does not tell me the exact, absolute amount. I want to see at a glance who is exceeding my threshold. If I set my threshold at 100 Mbps, then any VM exceeding that number shows up in bright red!
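
In code terms, the distinction looks like this. The VM names, usage numbers and the 100 Mbps threshold below are made up for illustration; the Heat Map widget does the equivalent internally.

```python
# Sketch: relative size vs absolute color in a heat map.
usage_mbps = {"vm-web-01": 140, "vm-db-01": 60, "vm-app-01": 8, "vm-idle-01": 1}

total = sum(usage_mbps.values())
THRESHOLD_MBPS = 100  # the free tier in the service offering

for vm, mbps in usage_mbps.items():
    relative_size = mbps / total                         # size: meaningful only vs peers
    color = "red" if mbps > THRESHOLD_MBPS else "green"  # color: an absolute verdict
    print(f"{vm}: size={relative_size:.0%}, color={color}")
```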

Limitation

Say you run a vSphere farm of 1000 VMs. You know that you have large databases doing heavy IOPS. You also have high-traffic web servers consuming a lot of network. Even if you have just 1 of each of these VMs, they will render your Max super metric useless, as the result is dominated by these VMs.

So what can you do? You exclude them. Create 2 groups:

  1. First group is just these high-usage VMs.
  2. Second group is for the rest.

Once you separate them, it’s easier to manage.
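
Here is a minimal sketch of that separation. The heavy-hitter names are hypothetical; in vRealize Operations you would implement this with custom groups and scope each super metric to its group.

```python
# Sketch: split VMs into known heavy hitters vs the rest, then take the
# max per group so one group's giants do not drown out the other's anomalies.
known_heavy_hitters = {"vm-bigdb-01", "vm-web-frontend-01"}  # hypothetical names

iops_now = {
    "vm-bigdb-01": 9000,         # expected to be high
    "vm-web-frontend-01": 4000,  # expected to be high
    "vm-file-01": 1800,          # unexpected: worth investigating
    "vm-print-01": 30,
}

heavy = {vm: v for vm, v in iops_now.items() if vm in known_heavy_hitters}
rest  = {vm: v for vm, v in iops_now.items() if vm not in known_heavy_hitters}

print("max of heavy hitters:", max(heavy.values()))  # judged against a high baseline
print("max of the rest:     ", max(rest.values()))   # 1800 stands out clearly here
```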

BTW, this dashboard is at the Cluster level. Its purpose is to complement the Performance Monitoring dashboard, which is also based at the Cluster level. It is not for overall monitoring; we have separate dashboards for your Storage Heavy Hitters and Network Top Consumers.

Hope you find the idea useful. Apply it in your environment, and let me know your findings! 🙂
