Does the IaaS serve the VMs well?

That’s a loaded question. If your CIO is asking you that, can you give a convincing answer?

If you have 1000 VMs, how do you prove that the platform serves each and every one of them well? How can you show 1000 VMs on a single screen, no more than 1280 x 1024 pixels? Wouldn’t be awesome to project this on a live screen, where the performance of your platform can be appreciated by your management and customers? You take pride in architecting it, paying attention to details, but sadly infrastructure has becoming a thankless job due to commoditization. The only time they call you is when there is a problem.

Your hard work can be shown. You too can have live show 🙂

But first, you need to define “well“. A mission critical VM carrying the voice traffic won’t tolerate latency that a batch job happily accepts. Have you defined what “performance” exactly is? Because if you have not, how can you measure it? If you cannot measure it, how can you prove it’s good or bad?

Approach

  • Step 1: define IaaS Performance
    • IaaS performance differs to VM performance. Since the nature of IaaS is a service, it is performing well when it’s serving its workload well. I recommend you have 2 or 3 Service Tiers only.
    • Yes, if you have 1000 VMs, you do take 1000 independent measurement. You don’t take the average. You take the worst. Measure this, and agree with your management on what “well” is. Review this on Performance SLA.
  • Step 2: measure if each VM is served well
    • IaaS provides 4 services:
      • CPU
      • RAM
      • Disk
      • Network
    • Now that you have an SLA, you can measure if any VM is not served well.
  • Step 3: aggregate at cluster level
    • This enables you to report at the cluster level, which makes capacity management easier.
    • If you have thousands of VMs, grouping by cluster results in neater visualization. If your cluster has 300 VMs on average, a 60000 VMs will just result in 200 clusters. 200 objects can fit inside a heatmap on the projected screen easily.

Implementation

Here is how I did it. You don’t have to do all the preparation as it’s already created as part of Operationalize Your World. You simply import them. I’m showing how it’s done in this blog.

  1. I created 1 group for each service tier. I called it Tier 1 (Gold), Tier 2 (Silver) and Tier 3 (Bronze)
  2. In each group, I added the clusters and datastores that belong to the tier. For the VMs, I simply specified all VMs that belong in that cluster. This makes future VM automatically tagged.

The implementation uses super metric on top of super metric. In current release, this has to be done via super metric, as you need to do calculation on it. Also, regular metric and property are not user-editable.

Level 1 super metric: Performance SLA

I created super metrics for each tier, to define Performance SLA. 1 set for each service tier. Here is what it looks like in vR Ops 6.6.

I created 3 policies, 1 for each Service Tier. I enabled the respective super metrics in each policy. You don’t need super metric for network as you expect 0 drop packets. The VM vNIC RX may report dropped packet when there isn’t any, so I’m only using VM TX.

I mapped the policy to the respective group.

Level 2 super metric: count the breach for each VM

Once a VM inherits the SLA super metric, you can compare it with the actual performance.

If a VM is being served well, it will have no SLA breach. If not, it will have. This means the formula needs to return 0 or 1. For that, you need to use IF THEN ELSE.

If Actual > SLA Then return 1 Else return 0.

For network, since the expectation is 0 dropped packet, the formula is simpler:

IF Network RX > 0 Then return 1 Else return 0.

I’m not considering RX for reason given earlier.

The entire formula looks like this

Count of SLA breach = CPU SLA breach + RAM SLA breach + Disk SLA breach + Network SLA breach

If all 4 types are not served well, it will have a value of 4.

Here is what the super metric looks like:

Level 3 super metric: count the total SLA breach per cluster as %

This enables you to compare across clusters. You can tell which clusters are struggling to cope with demand, and it’s not meeting its demand.

There are 2 consideration:

  • Do you count per VM or per breach?       If a VM has 4 breaches, do you count 4 or 1? I prefer to count 4, as it tells the severity of the breach. If you have 2 clusters with identical number of VM unserved, you can’t tell which one has bigger problem. Plus, adding an ESXi host adds CPU, RAM and Network capacity.
  • Do you compare absolute or relative?     Say you have many clusters, and they vary in sizes and number of VMs. Everything being equal, the big cluster will appear as if they struggle more since they have more VMs. If you divide the number by the number of VM, you normalize it. You also get a percentage, which makes discussion with management easier. You can say “we’ve experienced 10% breach. As per capacity management policy, this is the time we trigger hardware purchase so IT is ahead of business”. Standardizing the unit into policy enables you to apply the same treatment across clusters.

You may ask, why add VM Disk Latency when it’s part of a datastore, and not cluster? It makes analysis easier. Otherwise, we have to present 2 sets of information. Plus, there is generally a correlation between cluster and datastore.

Here is the formula:

Quiz! Why divide by 4 SLA types?

Got the answer? 🙂

The number can theoretically touch 400%, as a VM can potentially have 4 breaches. To make it easier (100% is so much easier to explain than 400%), I divide the number by 4.

Here is the actual formula

If  you are curious what it looks like in vR Ops, here it is:

Visualising the data

Since you have the data at VM level and cluster level, you can show in 2 different ways

Heat map 1: Are the VMs being served well?

This shows all the VMs. Naturally, grouping will occur. You expect this heat map to be mostly green most of the time. It won’t be green 100% at all times, as you are maximizing your investment.

The color is by SLA breach, where red = 4 (all SLA breached). The size is ideally by service tier, so Tier 1 VMs stand out. You can use vCPU as the proxy. I prefer vCPU over criticality as the large VMs stands out. You can also use vRAM, so you have 2 heat maps.

What can you tell from the above?

  • The distribution of my VMs among my Tier 1, Tier 2 and Tier 3. The VMs are grouped by service tiers, the by clusters within that tier.
  • Majority of VMs are not getting the SLA promised. However, it’s mostly 1 failure. This is likely Network in my case above.

Heat map 2: are the clusters performing well?

This shows all the clusters. Because the units are standardized in %, you can compare clusters easily now. Color is by the Cluster Performance (%), where Green = 100% (no SLA breached).

Size is by the number of VMs, so the cluster serving more VMs stand out. You can also use number of hosts, so you have 2 heat maps.

What’s a guide? To me, 5% is serious enough to warrant a hardware expansion, as it takes >1 month to add. Remember, we divide the number by 4.

Hope that’s useful. Do get the whole suite of dashboards from Operationalize Your World.

Leave a Reply