VM Availability Monitoring

VM availability is a common requirement, as the IaaS team is bound by an availability SLA. The challenge is reporting it.

The up time of a VM is more complex than that of a physical machine. Just because the VM is powered on does not mean the Guest OS is up and running. The VM could be stuck at BIOS, Windows could hit a BSOD, or the Guest OS could simply hang. This means we need to check the Guest OS. If we have VMware Tools, we can check for its heartbeat. But what if VMware Tools is not running, or not even installed? Then we need to check for signs of life. Does the VM generate network packets, issue disk IOPS, or consume RAM?

Another challenge is the frequency of reporting. If you report every 5 minutes, what if the VM was rebooted within those 5 minutes and came back up before the 5th minute ended? You will miss the fact that it was down during that period!
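To make the sampling problem concrete, here is a minimal sketch in Python. The timeline is hypothetical (one entry per second, with an 80-second reboot in the middle of a collection period) and the function name is mine, not something from vR Ops.

```python
PERIOD = 300  # collection interval in seconds, as in the text

def sampled_states(timeline, period=PERIOD):
    """Return the up/down state seen at the end of each collection period."""
    return [timeline[i] for i in range(period - 1, len(timeline), period)]

# Hypothetical 10 minutes of per-second data: True means the Guest OS is up.
# The VM reboots between second 320 and second 400.
timeline = [not (320 <= t < 400) for t in range(600)]

samples = sampled_states(timeline)       # states at second 299 and second 599
actual_downtime = timeline.count(False)  # seconds of downtime in the raw data

print(samples)
print(actual_downtime)
```

Both samples report the VM as up, even though the raw timeline contains 80 seconds of downtime.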

From the above, we can build the following logic:

If VM Powered Off then
  Return 0. VM is definitely down.
Else
  Calculate the up time within the 300-second period
End

In the above logic, to calculate the up time, we first need to decide whether the Guest OS is indeed up, given that the VM is powered on.

We can deduce that the Guest OS is up if it’s showing any sign of life. We can use:

  • Heartbeat
  • RAM Usage
  • Network Usage
  • Disk IOPS

Can you guess why we can’t use CPU Usage?

A VM does consume CPU even when it’s stuck at BIOS! We need a counter that shows exactly 0 when the Guest OS is down, not just a very low number. An idle VM is up, not down.

So we need to know whether the Guest OS is up or down. We are expecting a binary value, 1 or 0. Can you see the challenge here?

Yes, none of the counters above gives you a binary value. Disk IOPS, for example, can vary from 0.01 to 10000. The “sign of life” does not come as binary.

We need to convert them into 0 or 1. The 0 is the easy part, as the counters will be 0 when the Guest OS is down.

I’ll take Network Usage as an example.

  • What if Network Usage is >1? We can use Min (Network Usage, 1) to bring it down to 1.
  • What if Network Usage is above 0 but <1? We can use Round up (Network Usage, 1) to bring it up to 1.

So we can combine the two formulas above to get 0 or 1.
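The combination of the two steps can be sketched in Python as follows. The function name is hypothetical, `math.ceil` stands in for the "Round up" step, and the counter is assumed to be non-negative.

```python
import math

def sign_of_life(value):
    """Collapse a non-negative counter to 0 (no activity) or 1 (any activity)."""
    rounded_up = math.ceil(value)  # 0.01 becomes 1, while 0 stays 0
    return min(rounded_up, 1)      # 10000 is brought down to 1

print(sign_of_life(0))      # VM shows no network activity
print(sign_of_life(0.01))   # idle VM, tiny but non-zero activity
print(sign_of_life(10000))  # busy VM
```

Rounding up first and capping second means that an idle VM with a tiny but non-zero value still counts as up, while a true 0 stays 0.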

The last part is to account for partial up time, when the VM was rebooted within the 300-second sampling period. The good thing is that vR Ops tracks the OS up time in seconds, so every 5 minutes the value goes up by 300 seconds. As a VM normally runs for longer than 5 minutes, you end up with a very large number. Our formula becomes:

If the up time is >300 seconds, then return 300; otherwise return it as it is.
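In Python terms, this rule is just a cap at the length of the collection period. The sketch below assumes the OS up time is given in seconds; the function name is hypothetical.

```python
PERIOD = 300  # seconds in one collection period

def seconds_up_in_period(os_uptime, period=PERIOD):
    """Return how many seconds of the last period the Guest OS was up."""
    return min(os_uptime, period)

print(seconds_up_in_period(120))      # rebooted 2 minutes ago
print(seconds_up_in_period(864000))   # up for 10 days
```

A VM that rebooted 120 seconds ago is credited with 120 seconds, while a VM that has been up for days is credited with the full 300.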

Implementation

Let’s now put the formula together. Here is the logical formula:

(Is VM Powered on?) x 
(Is Guest OS up?) x 
(period it is up within 300 seconds)

“Is VM Powered on” returns 0 or 1. This is perfect, as the whole formula returns 0 when the VM is powered off.

“Is Guest OS up” returns 0 or 1. It returns 1 if there is any sign of life.

We get the Maximum of OS Uptime, Tools Heartbeat, RAM Usage, Disk Usage, and Network Usage. If any of these is not 0, then the Guest OS is up.

We use Minimum (Is Guest OS up, 1) to bring the number down to 1.

Since the VM can be idle, with counters well below 1, we use Round Up. This rounds 0.0001 up to 1 but leaves 0 as 0.

To determine how long the VM was up within the 300 seconds, we simply take Minimum (OS Uptime, 300).

To convert the number into a percentage, we simply divide by 300, then multiply by 100.

Here is what the formula looks like
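As a rough sketch of the steps described above, the whole calculation can be written in Python. This is not the vR Ops super metric syntax; the function and parameter names are hypothetical, all counters are assumed to be non-negative, and `math.ceil` stands in for Round Up.

```python
import math

PERIOD = 300  # seconds in one collection period

def vm_uptime_percent(powered_on, os_uptime, heartbeat, ram_usage,
                      disk_usage, network_usage, period=PERIOD):
    """Percentage of the collection period during which the VM counts as up."""
    # Any non-zero counter is a sign of life, so take the strongest one.
    strongest_signal = max(os_uptime, heartbeat, ram_usage,
                           disk_usage, network_usage)
    # Round up (0.0001 -> 1, 0 -> 0), then cap at 1 to get a binary value.
    guest_os_up = min(math.ceil(strongest_signal), 1)
    # Credit at most one full period of up time.
    seconds_up = min(os_uptime, period)
    return powered_on * guest_os_up * seconds_up / period * 100

# Hypothetical input: a VM that rebooted 150 seconds before the collection.
print(vm_uptime_percent(powered_on=1, os_uptime=150, heartbeat=1,
                        ram_usage=20.0, disk_usage=3.5, network_usage=0.2))
```

Because the three parts are multiplied, a 0 in any of them (powered off, no sign of life, or no up time) makes the whole result 0 without any If Then Else.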

Can you write the above formula differently? Yes, you can use If Then Else. I do not use it, as it makes the formula harder to read. It’s also more resource intensive.

Let’s now express the above formula as an actual vR Ops super metric formula.

I’m using the $This feature, as the formula refers to the VM itself. I’m using metric= and not attribute=, as I only need 1 value.

Validation

Let’s now take a few scenarios and run them through the formula.

Scenario 1: the VM is powered off.

  • This will return 0 since the first result is already 0 and multiplying 0 with anything will give 0.

Scenario 2: ideal scenario. The VM is up, active, and has VMware Tools.

  • The OS Uptime metric, since it’s in seconds and cumulative, will be much larger than the other counters. The Max () will return 7368269142, but Min (7368269142, 1) will return 1.
  • The Min (300, 7368269142) will return 300.
  • So the result is 1 x 1 x 300 / 3 = 100 (dividing by 300 and multiplying by 100 is the same as dividing by 3). The VM Uptime for that 5-minute period is 100%.

Scenario 3: non-ideal scenario. The VM is up, but is idle and has no VMware Tools.

Example

Let’s look at an example of how the super metric detects an availability issue. Here is a VM that has an availability problem. In this case, the VM was rebooted regularly.

The VM Uptime (%) super metric reports it correctly. It’s 100% when it’s up and 0% when it’s down. The super metric matches the Powered On metric and the OS Uptime metric.

Let’s check whether the super metric detects the up time within a 5-minute period. To do that, we can zoom into the time it was down. From the Powered On and OS Uptime metrics, we can see it was down for around 10 to 15 minutes. The super metric detects that. The up time went down to 0 for 10 minutes, then was partially up in the last 5 minutes.

Here is the uptime in the last 5 minutes. The VM came back up within this 5-minute period.

Limitation

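The limitation lies within a single 5-minute period. The OS has to be up at the 5th minute, when the values are collected. If any part of the formula is 0 at that point, the whole period is calculated as 0. So if the VM is up for 4:59 minutes, then goes down exactly at the 5th minute, Powered On will return 0 and the period is reported as 0%, even though the VM was up for almost all of it.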
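The size of this error can be illustrated with a small Python calculation. The numbers are hypothetical: a VM that was up for 299 of the 300 seconds but was powered off at the moment of collection.

```python
PERIOD = 300  # seconds in one collection period

seconds_actually_up = 299   # VM ran for 4:59 of the 5 minutes
powered_on_at_sample = 0    # but it was down at the moment of collection

# What the multiplication-based formula reports for this period.
reported = powered_on_at_sample * min(seconds_actually_up, PERIOD) / PERIOD * 100
# What the VM really delivered in this period.
actual = seconds_actually_up / PERIOD * 100

print(reported)
print(round(actual, 1))
```

The formula reports 0% for a period in which the VM was in fact available nearly the whole time.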