VMworld 2017 presentations

As promised during my sessions at VMworld 2017, you can find the materials here.

I did 10 sessions/workshops covering 2 topics:

  • Achieve Troubleshooting Mastery with vRealize.
  • Operationalize Your World

The deck contains additional material that didn’t fit in the time given.

I made the presentations available in powerpoint, not PDF. Can you guess why?

Operations differ to Architecture. We can have 1000 VMware customers with identical architecture (e.g. they all use Validated Design), and you will see 1000 unique operations. Operations is like fingerprint. It’s personalized to each customer. That’s why Sunny Dua and I share in editable format. Feel free to change the 2 ugly photos with yours! 🙂

So go ahead to personalize it. Throw away the irrelevant ones, add your custom content, and make it yours. We would be honored to see your version, and how you improve our deck. I shared many times that the reason why customers find the solution resonating because it comes from you all. What we have done is simply collect the experience and wisdom from customers globally. Your war stories, your idea, your feedback, have become an integral part of the deck.

Operationalize Your World 

We provided 2 decks as we were asked to deliver in 1 hour and 2 hour format. If you need to see the full 1-day workshop, you can download it here.

A number of you have asked for a more details explanation on the billing model. I’m working on a deck that will cover that. Allow me to test the deck in VMworld at Barcelona first, as it’s just 1 week away.

Keeping VMware Tools current

Keeping VMware Tools current is one of the best practices of vSphere operations. VMware Tools interfaces with both ESXi and VM (the virtual motherboard or virtual machine). Hence, there are 2 comparisons to consider:

  1. VM Hardware version
  2. ESXi version

From the vSphere API, here is what you get when you query it:

  • Guest Tools Current
  • Guest Tools Not Installed
  • Guest Tools Supported New
  • Guest Tools Supported Old
  • Guest Tools Too Old
  • Guest Tools Unmanaged

What do they mean?

  • Guest Tools Not Installed:
    • Tools are not installed on the VM. You should install it as you get both drivers and visibility.
  • Current
    • Tools version matches with the Tools available with ESXi. Each ESXi has a version of Tools that comes with it. See this for the list. This is the ideal scenario.
  • Supported New
    • Newer than the ESXi VMware tools version, but it is supported.
  • Supported Old
    • The opposite of New. It is also supported. Even it is older by 0.0.1 is considered old. It does not have to far behind.
  • Too Old
    • Tools version is older than the minimum supported version of Tools across all ESXi versions. Minimum supported version is the oldest version of Tools we support. Basically, guest is running unsupported Tools. You should upgrade. As of now for Linux and Windows guests. minimum supported version is the Tools version bundled with ESXi 4.0 which is 8.0.1. Supporting such old versions is challenging. We are planning to change this in future to something newer. In the meantime, you should upgrade as might not work as expected
  • Unmanaged
    • Tools installed in the guest did not come from ESXi, so Tools is not being managed by ESXi host. It may be supported or maybe not, depends on what type of Tools is running in the guest. We support open-vm-tools packaged by Linux vendors and OSPs, which both show up as unmanaged.
    • If a customer builds their own open-vm-tools from source code, we may not support that because we will not know if they have done it correctly or not.

Operationalize Your World has a dashboard that highlights the VMs not running the current or supported new. You should expect the number to be minimal, or ideally none.

Does the IaaS serve the VMs well?

That’s a loaded question. If your CIO is asking you that, can you give a convincing answer?

If you have 1000 VMs, how do you prove that the platform serves each and every one of them well? How can you show 1000 VMs on a single screen, no more than 1280 x 1024 pixels? Wouldn’t be awesome to project this on a live screen, where the performance of your platform can be appreciated by your management and customers? You take pride in architecting it, paying attention to details, but sadly infrastructure has becoming a thankless job due to commoditization. The only time they call you is when there is a problem.

Your hard work can be shown. You too can have live show 🙂

But first, you need to define “well“. A mission critical VM carrying the voice traffic won’t tolerate latency that a batch job happily accepts. Have you defined what “performance” exactly is? Because if you have not, how can you measure it? If you cannot measure it, how can you prove it’s good or bad?


  • Step 1: define IaaS Performance
    • IaaS performance differs to VM performance. Since the nature of IaaS is a service, it is performing well when it’s serving its workload well. I recommend you have 2 or 3 Service Tiers only.
    • Yes, if you have 1000 VMs, you do take 1000 independent measurement. You don’t take the average. You take the worst. Measure this, and agree with your management on what “well” is. Review this on Performance SLA.
  • Step 2: measure if each VM is served well
    • IaaS provides 4 services:
      • CPU
      • RAM
      • Disk
      • Network
    • Now that you have an SLA, you can measure if any VM is not served well.
  • Step 3: aggregate at cluster level
    • This enables you to report at the cluster level, which makes capacity management easier.
    • If you have thousands of VMs, grouping by cluster results in neater visualization. If your cluster has 300 VMs on average, a 60000 VMs will just result in 200 clusters. 200 objects can fit inside a heatmap on the projected screen easily.


Here is how I did it. You don’t have to do all the preparation as it’s already created as part of Operationalize Your World. You simply import them. I’m showing how it’s done in this blog.

  1. I created 1 group for each service tier. I called it Tier 1 (Gold), Tier 2 (Silver) and Tier 3 (Bronze)
  2. In each group, I added the clusters and datastores that belong to the tier. For the VMs, I simply specified all VMs that belong in that cluster. This makes future VM automatically tagged.

The implementation uses super metric on top of super metric. In current release, this has to be done via super metric, as you need to do calculation on it. Also, regular metric and property are not user-editable.

Level 1 super metric: Performance SLA

I created super metrics for each tier, to define Performance SLA. 1 set for each service tier. Here is what it looks like in vR Ops 6.6.

I created 3 policies, 1 for each Service Tier. I enabled the respective super metrics in each policy. You don’t need super metric for network as you expect 0 drop packets. The VM vNIC RX may report dropped packet when there isn’t any, so I’m only using VM TX.

I mapped the policy to the respective group.

Level 2 super metric: count the breach for each VM

Once a VM inherits the SLA super metric, you can compare it with the actual performance.

If a VM is being served well, it will have no SLA breach. If not, it will have. This means the formula needs to return 0 or 1. For that, you need to use IF THEN ELSE.

If Actual > SLA Then return 1 Else return 0.

For network, since the expectation is 0 dropped packet, the formula is simpler:

IF Network RX > 0 Then return 1 Else return 0.

I’m not considering RX for reason given earlier.

The entire formula looks like this

Count of SLA breach = CPU SLA breach + RAM SLA breach + Disk SLA breach + Network SLA breach

If all 4 types are not served well, it will have a value of 4.

Here is what the super metric looks like:

Level 3 super metric: count the total SLA breach per cluster as %

This enables you to compare across clusters. You can tell which clusters are struggling to cope with demand, and it’s not meeting its demand.

There are 2 consideration:

  • Do you count per VM or per breach?       If a VM has 4 breaches, do you count 4 or 1? I prefer to count 4, as it tells the severity of the breach. If you have 2 clusters with identical number of VM unserved, you can’t tell which one has bigger problem. Plus, adding an ESXi host adds CPU, RAM and Network capacity.
  • Do you compare absolute or relative?     Say you have many clusters, and they vary in sizes and number of VMs. Everything being equal, the big cluster will appear as if they struggle more since they have more VMs. If you divide the number by the number of VM, you normalize it. You also get a percentage, which makes discussion with management easier. You can say “we’ve experienced 10% breach. As per capacity management policy, this is the time we trigger hardware purchase so IT is ahead of business”. Standardizing the unit into policy enables you to apply the same treatment across clusters.

You may ask, why add VM Disk Latency when it’s part of a datastore, and not cluster? It makes analysis easier. Otherwise, we have to present 2 sets of information. Plus, there is generally a correlation between cluster and datastore.

Here is the formula:

Quiz! Why divide by 4 SLA types?

Got the answer? 🙂

The number can theoretically touch 400%, as a VM can potentially have 4 breaches. To make it easier (100% is so much easier to explain than 400%), I divide the number by 4.

Here is the actual formula

If  you are curious what it looks like in vR Ops, here it is:

Visualising the data

Since you have the data at VM level and cluster level, you can show in 2 different ways

Heat map 1: Are the VMs being served well?

This shows all the VMs. Naturally, grouping will occur. You expect this heat map to be mostly green most of the time. It won’t be green 100% at all times, as you are maximizing your investment.

The color is by SLA breach, where red = 4 (all SLA breached). The size is ideally by service tier, so Tier 1 VMs stand out. You can use vCPU as the proxy. I prefer vCPU over criticality as the large VMs stands out. You can also use vRAM, so you have 2 heat maps.

What can you tell from the above?

  • The distribution of my VMs among my Tier 1, Tier 2 and Tier 3. The VMs are grouped by service tiers, the by clusters within that tier.
  • Majority of VMs are not getting the SLA promised. However, it’s mostly 1 failure. This is likely Network in my case above.

Heat map 2: are the clusters performing well?

This shows all the clusters. Because the units are standardized in %, you can compare clusters easily now. Color is by the Cluster Performance (%), where Green = 100% (no SLA breached).

Size is by the number of VMs, so the cluster serving more VMs stand out. You can also use number of hosts, so you have 2 heat maps.

What’s a guide? To me, 5% is serious enough to warrant a hardware expansion, as it takes >1 month to add. Remember, we divide the number by 4.

Hope that’s useful. Do get the whole suite of dashboards from Operationalize Your World.