VM Availability Monitoring

VM availability is a common requirement, as the IaaS team is bound by an availability SLA. The challenge is reporting it.

The up time of a VM is more complex than that of a physical machine. Just because the VM is powered on does not mean the Guest OS is up and running. The VM could be stuck at BIOS, Windows could hit a BSOD, or the Guest OS could simply hang. This means we need to check the Guest OS. If we have VMware Tools, we can check for its heartbeat. But what if VMware Tools is not running, or not even installed? Then we need to check for signs of life. Does the VM generate network packets, issue disk IOPS, consume RAM?

Another challenge is the frequency of reporting. If you report every 5 minutes, what if the VM was rebooted within those 5 minutes and came back up before the 5th minute ended? You will miss the fact that it was down within those 5 minutes!

From the above, we can build a logic:

If VM Powered Off then
   Return 0. VM is definitely down.
Else
   Calculate up time within the 300 seconds period

In the above logic, since the VM is powered on, we first need to decide whether the Guest OS is indeed up before we can calculate the up time.

We can deduce that the Guest OS is up if it’s showing any sign of life. We can take any of the following:

  • Heartbeat
  • RAM Usage
  • Network Usage
  • Disk IOPS

Can you guess why we can’t use CPU Usage?

A VM does consume CPU even when it’s stuck at BIOS! We need a counter that shows 0 when the Guest OS is down, not a very low number. An idle VM is up, not down.

So we need to know if the Guest OS is up or down. We are expecting binary, 1 or 0. Can you see the challenge here?

Yes, none of the counters above gives you a binary value. Disk IOPS, for example, can vary from 0.01 to 10000. The “sign of life” does not come as binary.

We need to convert them into 0 or 1. 0 is the easy part, as the counters will be 0 if the Guest OS is down.

I’ll take Network Usage as an example.

  • What if Network Usage is >1? We can use Min (Network Usage, 1) to return 1.
  • What if Network Usage is <1? We can use Round up (Network Usage, 1) to return 1.

So we can combine the above formula to get us 0 or 1.
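
The two steps above can be sketched in Python. This is only an illustration of the round-up-then-cap idea, not the vR Ops syntax. The function name `to_binary` is made up for this example, and it assumes the counter is never negative.

```python
import math


def to_binary(value: float) -> int:
    """Collapse a non-negative counter into 0 (no activity) or 1 (any activity)."""
    # Round up first so that 0.01 becomes 1, then cap at 1 so that 10000 becomes 1.
    # A true 0 stays 0 because rounding up 0 still gives 0.
    return min(math.ceil(value), 1)


# A few sample Network Usage values, from dead to very busy.
for sample in (0, 0.01, 1, 10000):
    print(sample, "->", to_binary(sample))
```

The order of the two operations does not matter for non-negative values. Rounding up handles the idle VM, and the cap handles the busy VM.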

The last part is to account for partial up time, when the VM was rebooted within the 300-second sampling period. The good thing is vR Ops tracks the OS up time in seconds. So every 5 minutes, the value goes up by 300 seconds. As a VM normally runs for more than 5 minutes, you end up with a very large number. Our formula becomes:

If the up time is >300 seconds then return 300 else return it as it is.


Let’s now put the formula together. Here is the logical formula:

(Is VM Powered on?) x 
(Is Guest OS up?) x 
(period it is up within 300 seconds)

“Is VM Powered on” returns 0 or 1. This is perfect, as the whole formula returns 0 when the VM is powered off.

“Is Guest OS up” returns 0 or 1. It returns 1 if there is any sign of life.

We get the Maximum of OS Uptime, Tools Heartbeat, RAM Usage, Disk Usage, and Network Usage. If any of these is not 0, then the Guest OS is up.

We use Minimum (Is Guest OS up, 1) to bring down the number to 1.

Since the VM can be idle, we use Round Up to 1. This will round up 0.0001 to 1 but not round up 0 to 1.

To determine how long the VM is up within the 300 seconds, we simply take Minimum (OS Uptime, 300)

To convert the number into a percentage, we simply divide by 300 then multiply by 100.

Here is what the formula looks like
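
As a rough sketch, the logical formula described above can be written in Python as follows. This is not the super metric syntax. The function and parameter names are invented for illustration, and all inputs are assumed to be non-negative numbers sampled once per 300-second period.

```python
import math

SAMPLE_SECONDS = 300  # the 5-minute collection period used in the text


def vm_uptime_percent(powered_on, os_uptime, heartbeat,
                      ram_usage, disk_usage, network_usage):
    """Return the VM up time for one sampling period, as a percentage."""
    # Is Guest OS up? Take the largest sign of life, round it up, cap it at 1.
    sign_of_life = max(os_uptime, heartbeat, ram_usage, disk_usage, network_usage)
    guest_os_up = min(math.ceil(sign_of_life), 1)

    # How long was it up within this period? Never more than 300 seconds.
    seconds_up = min(os_uptime, SAMPLE_SECONDS)

    # Multiply the three factors, then convert seconds into a percentage.
    return 100 * powered_on * guest_os_up * seconds_up / SAMPLE_SECONDS


# Powered off: the first factor is 0, so everything is 0.
print(vm_uptime_percent(0, 0, 0, 0, 0, 0))
# Long-running VM: the OS Uptime counter is huge, but it is capped at 300.
print(vm_uptime_percent(1, 7368269142, 1, 35.2, 12.0, 4.5))
# Rebooted VM that came back 120 seconds before the sample was taken.
print(vm_uptime_percent(1, 120, 1, 20.0, 3.0, 0.5))
```

Multiplying the factors means any single 0 brings the whole result to 0, which is why no If Then Else is needed.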

Can you write the above formula differently? Yes, you can use If Then Else. I do not use it as it makes the formula harder to read. It’s also more resource intensive.

Let’s now show the above formula as an actual vR Ops super metric formula.

I’m using $This feature as the formula is referring to the VM itself. I’m using metric= and not attribute= as I only need 1 value.


Let’s now take a few scenarios and run through the formula.

Scenario 1: the VM is powered off.

  • This will return 0 since the first result is already 0 and multiplying 0 with anything will give 0.

Scenario 2: ideal scenario. The VM is up, active and has VMware Tools

  • The OS Uptime metric, since it’s in seconds and it’s cumulative, will be much larger than the other counters. The Max () will return 7368269142, but Min (7368269142, 1) will return 1.
  • The Min (300, 7368269142) will return 300.
  • So the result is 1 x 1 x 300 / 300 x 100 = 100. The VM Uptime for that 5-minute period is 100%.

Scenario 3: not ideal scenario. The VM is up, but is idle and has no VMware Tools


Let’s show an example of how the super metric detects an availability issue. Here is a VM that has an availability problem. In this case, the VM was rebooted regularly.

The VM Uptime (%) super metric reports it correctly. It’s 100% when it’s up and 0% when it’s down. The super metric matches the Powered On metric and the OS Uptime metric.

Let’s check if the super metric detects the up time within the 5 minutes. To do that, we can zoom into the time it was down. From the Powered On and OS Uptime metrics, we can see it was down for around 10 – 15 minutes. The super metric detects that. The up time went down to 0 for 10 minutes, then was partially up in the last 5 minutes.

Here is the uptime in the last 5 minutes. The VM came back up within these 5 minutes.


The limitation is within a single 5-minute period. The OS has to be up at the moment the sample is taken at the 5th minute. If any factor in the formula is 0 at that moment, the result is 0. So if the VM is up for 4:59 minutes, then goes down exactly at the 5th minute, Powered On will return 0 and the whole period is reported as down.

From Local to Global

I made a career move from a local role to a global role in the middle of last year. It’s been almost a year, so I’d like to share the pros and cons of both roles. Hopefully it helps you in your career, and in keeping the fun in work!

It’s common for a global corporation to have 4 levels of geographic coverage:

  • Local: cover a city or a small country. In my case, it’s just Singapore.
  • Regional: cover a region or large country. In my case, it’s ASEAN.
  • Continent: Asia Pacific, Europe, Africa, America
  • Global.

In a vendor environment (as opposed to end-user), there are 2 large primary teams:

  • Product team: develop the product. The sub-teams are:
    • Product Management, R&D, QA, Sustaining, UX (focus on the UI), IX (focus on documentation).
    • They focus on releasing the next big thing.
  • Field team: sell, implement, support the product.
    • Sales, SE, Consulting, Technical Account Managers, Support, Education.
    • They focus on the quarterly target, closing large deals.

Of course, they are supported by many smaller supporting teams, such as marketing, pricing, and the CTO Office.

I’d never worked in a Product team, so when an opportunity arose, I took the leap of faith.

  • From Local to Global, bypassing ASEAN and Asia Pacific.
  • From Field to R&D. My boss is no longer in Singapore, ASEAN or Asia Pacific, but directly at our HQ in Palo Alto. A few months ago, there was a re-org and my boss is now in Europe. I have not met him yet!
  • From generalist to specialist. I now do vR Ops full time.

Now that it’s almost a year, I can say with enough confidence that it was a good decision. We all work for 3 reasons. I call them the 3M of work:

  • Money
  • Meaning. It has to fill your spirit, not just your pocket.
  • Merriment. It’s gotta be fun, and you love your work.

The job at the global level is harder, much harder. Instead of thinking for just 1 customer (my job was Account SE), or a few customers, I have to think of the world. While working on a future version, I have to think of the current version and previous versions. Brownfield is much harder than greenfield. I learned from the R&D team that there are many things to consider before adding or removing a feature. The complexity makes the job meaningful. Life is short, and the journey is as important as the destination. I’d never done product development before. Luckily, folks are kind and we get along really well. I work with the R&D teams in Armenia and Palo Alto. They have never, never asked me to accommodate their time zone. I’m truly grateful for that. Folks like Monica Sharma (Director, Product Management) and of course Sunny Dua provide a lot of coaching and guidance.

My perspective was widened. Before, I was just working with a few customers in Singapore, and a bit of ASEAN. Now I work with customers from Europe to the US. What I accepted as the best before has been reset. I’ve seen other regions and customers achieve something better.

I didn’t know there was so much work! The demand for the role I took was apparently untapped. I had no idea, since I was not busy when doing the local role! There were so many requests for help outside Singapore. I do Webex sessions regularly with customers, helping them remotely. They log in to their production environment, and we troubleshoot issues together. I get to see live environments, and gain insight into their operations.

The downside is travel. Controlling the schedule is important, else I could travel non-stop and just spend weekends at home. My travel schedule is practically full 3 months in advance. Again, folks are generally accommodating. I’ve learned that when we explain openly to folks why we can’t be there (they are sponsoring my trip), they are willing to accommodate.

Travel can be sudden. I got a call to help a large customer on Thursday morning, and on Sunday I was already on the plane to see them. If you have young kids, this can be a deal breaker. My 2 kids are 12 and 15 already, but my Mum at 80 needs care.

Speaking of travel, gluten free is a challenge. I am allergic to both dairy and gluten, so keeping these 2 away is difficult when abroad. There is no easy solution today. Singapore Airlines changes the menu every 3 months, so I know in advance exactly what I’m getting 🙂

I hope this sharing is useful for those who are thinking of taking a global role.

Which VMs need more resources?

You can add the following resources to a VM:

  • CPU
  • RAM
  • Storage

Network isn’t something you can add, but you know that already 🙂

You can check which VMs need more resources by building a dashboard like the one below. It’s a simple dashboard, which you can customize and enhance. It lets you review each resource independently.

I’ve marked the above dashboard with numbers, so we can refer to them:

  1. This is a table that lists all VMs, with the highest 1-hour average of CPU Demand and RAM Demand. The table also lists the VM CPU and RAM configuration, so you can see if the VMs are small or large. It also shows the cluster the VMs are located in. The table is sorted by the highest CPU Demand. I’m showing both CPU and RAM in a single table. You can clone the view and split them if that suits your operations better.
  2. This is a table that lists all VMs, but focusing on storage only. With storage, we do not have the complexity of checking peak utilisation. We simply need to check the present situation.
  3. This lists the Top-15 VMs with the highest CPU Demand and RAM Demand in a given period. The list is now split, as they can be different VMs. Do note that the Top-N widget averages the number over the selected period, so a VM with a cyclical workload may not show up. The Top-N is complemented with a distribution chart. Select a VM from the Top-N, and you can see how its utilisation is distributed.
  4. The distribution chart helps you see if the VM is really under-resourced or not. The 95th percentile is marked with a vertical green line. For an under-resourced VM, you expect that line to be at 100%, indicating that the VM hits 100% utilisation frequently. If the 95th percentile is at a low number, and you do not see the number 100 on the x-axis, the VM is not under-resourced.
  5. Storage is easier, as we can simply use the latest data. As a result, we can show a distribution of all the VMs. We use a heat map as it can show 2 dimensions. Every VM is represented as a box. The bigger the box, the more storage the VM is configured with. The color indicates how much of it the VM uses.
    • 0% = Black. Wastage
    • 10% = Green. Balanced usage
    • 100% = Red. Need more space!
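
To see why the Top-N average in item 3 can hide a cyclical workload, and why the 95th percentile in item 4 catches it, here is a small Python sketch. The sample data is made up, and the way the percentile is computed here is only an assumption for illustration. It is not necessarily how the widgets calculate it.

```python
import statistics


def percentile_95(samples):
    """Return the 95th percentile of a list of utilisation samples."""
    # With n=20, quantiles() returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20, method="inclusive")[-1]


# Hypothetical VM: one day of 5-minute CPU samples (288 in total).
# It sits at 10% most of the day but runs at 100% for 2 hours (24 samples).
cyclical_vm = [10] * 264 + [100] * 24

# Another hypothetical VM that only spikes for 1 hour (12 samples).
short_spike_vm = [10] * 276 + [100] * 12

print("average:", statistics.mean(cyclical_vm))
print("95th percentile, 2-hour peak:", percentile_95(cyclical_vm))
print("95th percentile, 1-hour peak:", percentile_95(short_spike_vm))
```

The average of the first VM stays low even though it is saturated for 2 hours every day, while its 95th percentile sits at 100. The second VM spikes for less than 5% of the day, so its 95th percentile stays low.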

The CPU and RAM have limitations. For example, they may show high utilisation during AV backup. You want to ignore those periods. At this moment, the only way is to plot the high usage over a line chart. We use Log Insight for this. The chart below shows VMs that hit high CPU usage in a given period. Every time a VM hits high CPU usage, it will show up here. As you can see, there are only 4 VMs that hit high CPU usage. All other VMs do not need more CPU.

The above is an example from a healthy environment. What about an environment where a lot of VMs are under-sized? You expect to see lots of alarms! That’s what you have below.

Hope the above is useful. If not, drop me an email.