Does the IaaS serve the VMs well?

That’s a loaded question. If your CIO is asking you that, can you give a convincing answer?

If you have 1000 VMs, how do you prove that the platform serves each and every one of them well? How can you show 1000 VMs on a single screen, no more than 1280 x 1024 pixels? Wouldn’t it be awesome to project this on a live screen, where the performance of your platform can be appreciated by your management and customers? You take pride in architecting it, paying attention to details, but sadly infrastructure has become a thankless job due to commoditization. The only time they call you is when there is a problem.

Your hard work can be shown. You too can have a live show 🙂

But first, you need to define “well”. A mission-critical VM carrying voice traffic won’t tolerate latency that a batch job happily accepts. Have you defined what “performance” exactly is? Because if you have not, how can you measure it? If you cannot measure it, how can you prove it’s good or bad?

Approach

  • Step 1: define IaaS Performance
    • IaaS performance differs from VM performance. Since IaaS is by nature a service, it is performing well when it’s serving its workload well. I recommend you have only 2 or 3 Service Tiers.
    • Yes, if you have 1000 VMs, you do take 1000 independent measurements. You don’t take the average. You take the worst. Measure this, and agree with your management on what “well” is. Review this on Performance SLA.
  • Step 2: measure if each VM is served well
    • IaaS provides 4 services:
      • CPU
      • RAM
      • Disk
      • Network
    • Now that you have an SLA, you can measure if any VM is not served well.
  • Step 3: aggregate at cluster level
    • This enables you to report at the cluster level, which makes capacity management easier.
    • If you have thousands of VMs, grouping by cluster results in a neater visualization. If your clusters have 300 VMs on average, 60,000 VMs will result in just 200 clusters. 200 objects can easily fit inside a heat map on the projected screen.

Implementation

Here is how I did it. You don’t have to do all the preparation as it’s already created as part of Operationalize Your World. You simply import them. I’m showing how it’s done in this blog.

  1. I created 1 group for each service tier. I called them Tier 1 (Gold), Tier 2 (Silver) and Tier 3 (Bronze).
  2. In each group, I added the clusters and datastores that belong to the tier. For the VMs, I simply specified all VMs that belong to those clusters. This way, future VMs are tagged automatically.

The implementation uses super metric on top of super metric. In the current release, this has to be done via super metrics, as you need to do calculations on the values. Also, regular metrics and properties are not user-editable.

Level 1 super metric: Performance SLA

I created super metrics for each tier to define the Performance SLA, 1 set for each service tier. Here is what it looks like in vR Ops 6.6.

I created 3 policies, 1 for each Service Tier. I enabled the respective super metrics in each policy. You don’t need a super metric for network as you expect 0 dropped packets. The VM vNIC RX may report dropped packets when there aren’t any, so I’m only using VM TX.

I mapped the policy to the respective group.

Level 2 super metric: count the breach for each VM

Once a VM inherits the SLA super metric, you can compare it with the actual performance.

If a VM is being served well, it will have no SLA breach. If not, it will have at least one. This means the formula for each SLA type needs to return 0 or 1. For that, you need to use IF THEN ELSE.

If Actual > SLA Then return 1 Else return 0.

For network, since the expectation is 0 dropped packets, the formula is simpler:

IF Network TX dropped packets > 0 Then return 1 Else return 0.

I’m not considering RX for reason given earlier.

The entire formula looks like this

Count of SLA breach = CPU SLA breach + RAM SLA breach + Disk SLA breach + Network SLA breach

If all 4 types are not served well, it will have a value of 4.
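To make the 0-or-1 logic concrete, here is a minimal Python sketch of the per-VM breach count. It is not vR Ops super metric syntax. The metric names and the threshold values are hypothetical, chosen only for illustration; your own SLA defines the real ones.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    """Hypothetical per-tier thresholds; real values come from your own SLA."""
    cpu_contention_pct: float
    ram_contention_pct: float
    disk_latency_ms: float


@dataclass
class VmSample:
    """Actual values for one VM in one collection cycle (illustrative names)."""
    cpu_contention_pct: float
    ram_contention_pct: float
    disk_latency_ms: float
    tx_dropped_packets: float


def breach(actual: float, limit: float) -> int:
    """IF actual > SLA THEN 1 ELSE 0."""
    return 1 if actual > limit else 0


def sla_breach_count(vm: VmSample, sla: Sla) -> int:
    """Sum of the four breach flags, so the result ranges from 0 to 4."""
    return (
        breach(vm.cpu_contention_pct, sla.cpu_contention_pct)
        + breach(vm.ram_contention_pct, sla.ram_contention_pct)
        + breach(vm.disk_latency_ms, sla.disk_latency_ms)
        + breach(vm.tx_dropped_packets, 0)  # network SLA: zero dropped TX packets
    )


# Assumed Tier 1 thresholds and one assumed sample.
tier1 = Sla(cpu_contention_pct=1.0, ram_contention_pct=0.0, disk_latency_ms=10.0)
vm = VmSample(cpu_contention_pct=2.5, ram_contention_pct=0.0,
              disk_latency_ms=4.0, tx_dropped_packets=3)
print(sla_breach_count(vm, tier1))  # 2: CPU and network are breached
```

Keeping each flag strictly 0 or 1 is what lets the flags be summed and later normalized.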

Here is what the super metric looks like:

Level 3 super metric: count the total SLA breach per cluster as %

This enables you to compare across clusters. You can tell which clusters are struggling to cope with demand.

There are 2 considerations:

  • Do you count per VM or per breach? If a VM has 4 breaches, do you count 4 or 1? I prefer to count 4, as it tells the severity of the breach. If you count per VM and 2 clusters have an identical number of unserved VMs, you can’t tell which one has the bigger problem. Plus, adding an ESXi host adds CPU, RAM and Network capacity.
  • Do you compare absolute or relative? Say you have many clusters, and they vary in size and number of VMs. Everything being equal, the big clusters will appear as if they struggle more since they have more VMs. If you divide the number by the number of VMs, you normalize it. You also get a percentage, which makes discussion with management easier. You can say “we’ve experienced a 10% breach. As per capacity management policy, this is the time we trigger hardware purchase so IT is ahead of business”. Standardizing the unit into policy enables you to apply the same treatment across clusters.

You may ask, why add VM Disk Latency when it’s part of a datastore, and not cluster? It makes analysis easier. Otherwise, we have to present 2 sets of information. Plus, there is generally a correlation between cluster and datastore.

Here is the formula:

Quiz! Why divide by 4 SLA types?

Got the answer? 🙂

The number can theoretically touch 400%, as a VM can potentially have 4 breaches. To make it easier (100% is so much easier to explain than 400%), I divide the number by 4.
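As a rough illustration of the normalization described above, here is a Python sketch (again, not super metric syntax). The function name and the sample clusters are assumptions made up for the example.

```python
from typing import Sequence


def cluster_breach_pct(breach_counts: Sequence[int], sla_types: int = 4) -> float:
    """Total breaches across a cluster's VMs, normalized to 0-100%.

    breach_counts holds one value (0 to sla_types) per VM in the cluster.
    Dividing by the VM count makes clusters of different sizes comparable;
    dividing by sla_types caps the result at 100% instead of 400%.
    """
    if not breach_counts:
        return 0.0  # an empty cluster has nothing to breach
    return 100 * sum(breach_counts) / (len(breach_counts) * sla_types)


# Two hypothetical clusters, each with 2 unserved VMs out of 10,
# but with different severity.
cluster_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
cluster_b = [4, 4, 0, 0, 0, 0, 0, 0, 0, 0]
print(cluster_breach_pct(cluster_a))  # 5.0
print(cluster_breach_pct(cluster_b))  # 20.0
```

Counting per breach rather than per VM is what separates the two clusters here. Counting per VM would report the same figure for both.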

Here is the actual formula

If you are curious what it looks like in vR Ops, here it is:

Visualising the data

Since you have the data at VM level and cluster level, you can show it in 2 different ways.

Heat map 1: Are the VMs being served well?

This shows all the VMs. Naturally, grouping will occur. You expect this heat map to be mostly green most of the time. It won’t be green 100% at all times, as you are maximizing your investment.

The color is by SLA breach, where red = 4 (all SLA breached). The size is ideally by service tier, so Tier 1 VMs stand out. You can use vCPU as the proxy. I prefer vCPU over criticality as the large VMs stand out. You can also use vRAM, so you have 2 heat maps.

What can you tell from the above?

  • The distribution of my VMs among my Tier 1, Tier 2 and Tier 3. The VMs are grouped by service tiers, then by clusters within that tier.
  • The majority of VMs are not getting the SLA promised. However, it’s mostly 1 failure. This is likely Network in my case above.

Heat map 2: are the clusters performing well?

This shows all the clusters. Because the units are standardized in %, you can compare clusters easily now. Color is by the Cluster Performance (%), where Green = 100% (no SLA breached).

Size is by the number of VMs, so the clusters serving more VMs stand out. You can also use the number of hosts, so you have 2 heat maps.

What’s a good guideline? To me, 5% is serious enough to warrant a hardware expansion, as it takes >1 month to add. Remember, we divide the number by 4.

Hope that’s useful. Do get the whole suite of dashboards from Operationalize Your World.

Armenia. Home of vR Ops

Most customers, and a good part of VMware’s own team, do not realize that vRealize Operations is developed out of Armenia. Who would have thought? Armenia is not as well-known as other places in terms of software development. As an IT professional, you may travel to India (Bangalore), China (Shanghai), US (Silicon Valley), but not Armenia. In this post, I hope to raise awareness, as I think it’s worth knowing. On a personal note, it’s worth visiting!

vR Ops is developed in 3 cities: Palo Alto (Silicon Valley), Yerevan (Armenia) and Bangalore (India). Yerevan has the bulk of the team, with over 100 R&D engineers in 1 office location. Bangalore does most of the management packs and Endpoint Operations. The lead of vR Ops product management, Monica Sharma, is based in Palo Alto. She works closely with Sunny Dua, who is based in the same building as her (I think they are in the Prom E building). In the US, we also have our PMM (Product Marketing Manager) and TMM (Technical Marketing Manager) teams. Awesome folks such as Taruna Gandhi and John Diaz are part of this marketing specialist team.

Armenia actually does more than vR Ops. Part of Log Insight is also developed here. Stephen Flanders (whom you should know if you use Log Insight) is a regular visitor to Yerevan.

As part of development, you need QE. vR Ops integrates with many products, so all this testing needs to be automated. The ESO (Engineering Services Organisation) has developed end-to-end automation. This enables the daily build to be tested. The product is automatically deployed and configured. It adds a number of vCenter servers automatically, performs various tests and reports the result. It tests various adapters too.

Karen Aghajanyan is the Director in charge in Armenia, and the sponsor of my Take 2. I wrote sponsor as he paid for it (I’m honored that R&D considers it worthy to fly me there). Take 2 is a cool program in VMware that allows you to take 2 weeks off your current role and do things that you normally do as your night job. In my case, it was a 2-week stint in Armenia, working on the next release, code named Aria. It was intense, working with many developers, brainstorming the best way to improve the product while maintaining backward compatibility. On Day 1, Karen assembled all his management team and asked me to present an overview of the areas for improvement. Once we agreed on the areas of work, it was straight into collaboration with the respective developers. It’s both a pleasure and an honor to work with the developers. They are passionate and have code-level knowledge. If there is a part that they are not sure about, they just open the source code! Years of working with vR Ops helped me, else I’d not be able to keep up with the depth they are expecting. Other than depth, you also need breadth, as you need to know the impact of the proposed changes. In fact, maintaining backward compatibility is more complex than adding the features, and we spent a lot of time deliberating to make sure customers can safely upgrade. I learned about the inner workings of the product, which makes me appreciate it even more.

John Yaralian is the Product Owner, and my host. He was the one requesting my manager’s (Kamau Wanguhu) approval for my 2-week stint. As the Product Owner, John acts as the single contact to the R&D team. If you are hoping to do Take 2 in Armenia, John is also your contact point. John was also the lead for the inaugural vR Ops Boot Camp. It’s a 2-day event, where attendees get to spend time directly with the developers. We ran 5 tracks, each repeated 3x. Each track is capped at <10 people, so you get ample time for discussion. There is also an open house, where booths are set up. The booths cover specific topics, such as UI, super metrics, scalability, and upgrade. You get to ask all those burning questions!

My Take 2 was part of the overall planning session for project Aria. Monica and Sunny were also there. If you use vR Ops, you’ll come to know that Sunny is my fellow partner in crime. We respect each other deeply, and argue a lot! Shannon Klebart from VMware Office of the CTO calls us her Batman and Robin. For 2 weeks, including weekends, we discussed the solution for Aria in depth. We brainstormed many parts of the product. Usability is top of mind for me. As a customer-facing engineer, working closely with paying customers who use the products, I care a lot about usability. IMHO, usability goes beyond the UI. It starts before you even deploy the damn thing. What if sizing is hard? What if deployment architecture is complex? You won’t use the product as you don’t know how to do enterprise deployment.

On the UI part, we also developed mock-ups for the next generation UX. We looked at the dashboard life cycle in detail. We went through each widget one by one and looked at the areas of improvement. We were mindful of keeping them consistent yet unique. Vahan Tadevosyan and Tigran Avagimyants, the 2 lead developers for UI, were passionate about bringing as many enhancements as possible to their work.

I won’t steal the thunder from PM and PMM, so I will just say as an Engineer I’m excited about Aria. Stay tuned!

Working with R&D will naturally involve patents. It’s inspiring to meet someone much younger than you who holds >20 patents! Naira Grigoryan and Nina Karapetyan reviewed a patent idea I shared and we’re collaborating on the submission. For me as an engineer, I think it’s just uber cool to have a patent under my name 🙂

Armenia isn’t just work. I was stunned by the rich history. It’s fascinating to learn the story behind the old churches. The monastery is also located in a remote place, so it offers solitude. You’d enjoy the surrounding view and spaciousness. I came in June, so the weather was good, and the air was fresh. In winter, it will be snowy. Not something I’d personally enjoy 🙂

Other than history, Sunny and I got a chance to see scenery too. This blog post’s profile photo was taken at Jermuk. Picturesque, isn’t it? Robert Mesropyan and Arthur Aghabekyan drove 400 km on Sunday to take us there. Sunny and I were simply blown away by their kindness. It is a humbling experience. What’s even more amazing is that this kindness is displayed by all the developers. 5 of them came out, on weekends, to take us to see their beautiful country! Gagik Manukyan, Arshak Galstyan, Sevak Tsaturyan took 5 of us on Saturday. Talk about hospitality! We did a short video to thank the entire team. You’ve got to see the first part, which has bloopers, so hope you enjoy it.

Here is a place I want to see first hand on my next trip, as I didn’t get to see it this time. It’s Lake Parz at Dilijan.

The city of Yerevan is relatively small. I live in Singapore, a modern metropolis, so I enjoy the striking difference. It’s also on a hill. You can walk up the cool open stairs of the Cascade, and see the city from a high vantage point. Just like Singapore, the city is safe. You’re not worried about pickpockets or personal safety. There is no Uber, but there is a local company called GG that provides a similar service. There is no train, and I didn’t get the chance to try the buses. Next time I’m there, I might rent a car. It’s pretty easy to drive as traffic isn’t bad.

I hope the short blog gives you an appreciation of the software we’ve come to know and love. I’ve been using the software since 1.0 many blue moons ago, and the best is yet to be!

VM Availability Monitoring

VM Availability is a common requirement, as the IaaS team is bound by an Availability SLA. The challenge is reporting this.

The up time of a VM is more complex than that of a physical machine. Just because the VM is powered on does not mean the Guest OS is up and running. The VM could be stuck at BIOS, Windows could hit a BSOD, or the Guest OS could simply hang. This means we need to check the Guest OS. If we have VMware Tools, we can check for heartbeat. But what if VMware Tools is not running or not even installed? Then we need to check for signs of life. Does the VM generate network packets, issue disk IOPS, consume RAM?

Another challenge is the frequency of reporting. If you report every 5 minutes, what if the VM was rebooted within those 5 minutes, and came back up before the 5th minute ends? You will miss the fact that it was down within those 5 minutes!

From the above, we can build a logic:

If VM Powered Off then
   Return 0. VM is definitely down.
Else
  Calculate up time within the 300 seconds period
End

In the above logic, to calculate the up time, we need first to decide if the Guest OS is indeed up, since the VM is powered on.

We can deduce that the Guest OS is up if it’s showing any sign of life. We can take

  • Heartbeat
  • RAM Usage
  • Network Usage
  • Disk IOPS

Can you guess why we can’t use CPU Usage?

A VM does consume CPU even when it’s stuck at BIOS! We need a counter that shows 0, and not a very low number. An idle VM is up, not down.

So we need to know if the Guest OS is up or down. We are expecting binary, 1 or 0. Can you see the challenge here?

Yes, none of the counters above is giving you binary. Disk IOPS for example, can vary from 0.01 to 10000. The “sign of life” is not coming as binary.

We need to convert them into 0 or 1. 0 is the easy part, as they will be 0 if they are down.

I’d take Network Usage as example.

  • What if Network Usage is >1? We can use Min (Network Usage, 1) to return 1.
  • What if Network Usage is <1? We can use Round up (Network Usage, 1) to return 1.

So we can combine the above formula to get us 0 or 1.
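A minimal Python sketch of that combination is below. The function name is hypothetical, and it assumes the counter is never negative.

```python
import math


def sign_of_life(value: float) -> int:
    """Collapse a non-negative counter to 0 (no activity) or 1 (any activity).

    Rounding up turns a tiny value such as 0.01 into 1 but leaves 0 as 0;
    taking the minimum with 1 then caps large values at 1.
    """
    return min(math.ceil(value), 1)


print(sign_of_life(0))      # 0
print(sign_of_life(0.01))   # 1
print(sign_of_life(10000))  # 1
```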

The last part is to account for partial up time, when the VM was rebooted within the 300-second sampling period. The good thing is vR Ops tracks the OS up time in seconds. So every 5 minutes, the value goes up by 300 seconds. As a VM normally runs >5 minutes, you end up with a very large number. Our formula becomes:

If the up time is >300 seconds then return 300 else return it as it is.

Implementation

Let’s now put the formula together. Here is the logical formula:

(Is VM Powered on?) x 
(Is Guest OS up?) x 
(period it is up within 300 seconds)

“Is VM Powered on” returns 0 or 1. This is perfect, as the whole formula returns 0 when the VM is powered off.

“Is Guest OS up” returns 0 or 1. It returns 1 if there is any sign of life.

We get the Maximum of OS Uptime, Tools Heartbeat, RAM Usage, Disk Usage, and Network Usage. If any of these is not 0, then the Guest OS is up.

We use Minimum (Is Guest OS up, 1) to bring down the number to 1.

Since the VM can be idle, we use Round Up to 1. This will round up 0.0001 to 1 but not round up 0 to 1.

To determine how long the VM is up within the 300 seconds, we simply take Minimum (OS Uptime, 300)

To convert the number into a percentage, we simply divide by 300 then multiply by 100.
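Putting the pieces together, the logic can be sketched in Python as follows. This is an illustration of the arithmetic only, not vR Ops super metric syntax; the function and parameter names are assumptions, and all counters are assumed to be non-negative, with OS Uptime in seconds.

```python
import math

PERIOD_SECONDS = 300  # one 5-minute collection cycle


def vm_uptime_pct(powered_on: int, os_uptime: float, heartbeat: float,
                  ram_usage: float, disk_usage: float,
                  network_usage: float) -> float:
    """(Is VM powered on?) x (Is Guest OS up?) x (seconds up in the period), as %."""
    # Any non-zero counter counts as a sign of life.
    strongest_sign = max(os_uptime, heartbeat, ram_usage, disk_usage, network_usage)
    # Round up so 0.0001 becomes 1, then cap at 1 so the result is 0 or 1.
    guest_os_up = min(math.ceil(strongest_sign), 1)
    # A long-running VM has a huge uptime; only the last period matters.
    seconds_up = min(os_uptime, PERIOD_SECONDS)
    return 100 * powered_on * guest_os_up * seconds_up / PERIOD_SECONDS


# Powered off: the first factor is 0, so everything is 0.
print(vm_uptime_pct(0, 0, 0, 0, 0, 0))                 # 0.0
# Long-running, active VM with VMware Tools.
print(vm_uptime_pct(1, 7368269142, 1, 35.2, 120, 48))  # 100.0
# VM that came back from a reboot 120 seconds before collection.
print(vm_uptime_pct(1, 120, 1, 12.0, 30, 5))           # 40.0
```

Multiplying the factors, instead of branching, mirrors the approach in the text: a 0 anywhere in the chain forces the result to 0.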

Here is what the formula looks like

Can you write the above formula differently? Yes, you can use If Then Else. I do not use it as it makes the formula harder to read. It’s also more resource intensive.

Let’s now show the above formula using actual vR Ops super metric formula.

I’m using $This feature as the formula is referring to the VM itself. I’m using metric= and not attribute= as I only need 1 value.

Validation

Let’s now take a few scenarios and run through the formula.

Scenario 1: the VM is powered off.

  • This will return 0 since the first result is already 0 and multiplying 0 with anything will give 0.

Scenario 2: ideal scenario. The VM is up, active and has VMware Tools

  • The OS Uptime metric, since it’s in seconds and cumulative, will be much larger than the other counters. The Max () will return 7368269142, but Min (7368269142, 1) will return 1.
  • The Min (300, 7368269142) will return 300.
  • So the result is 1 x 1 x 300 / 300 x 100 = 100. The VM Uptime for that 5-minute period is 100%.

Scenario 3: not ideal scenario. The VM is up, but is idle and has no VMware Tools

Example

Let’s show an example of how the super metric detects an availability issue. Here is a VM that has an availability problem. In this case, the VM was rebooted regularly.

The VM Uptime (%) super metric reports it correctly. It’s 100% when it’s up and 0% when it’s down. The super metric matches the Powered On metric and the OS Uptime metric.

Let’s check if the super metric detects the up time within the 5 minutes. To do that, we can zoom into the time it was down. From the Powered On and OS Uptime metric, we can see it’s down for around 10 – 15 minutes. The super metric detects that. The up time went down to 0 for 10 minutes, then partially up in the last 5 minutes.

Here is the uptime in the last 5 minutes. The VM came back up within this 5-minute period.

Limitation

The limitation applies within a 5-minute period: the OS has to be up at the 5th minute. If the Powered On value is 0 at collection time, the result is calculated as 0. So if the VM is up for 4:59 minutes, then goes down exactly at the 5th minute, Powered On returns 0 and the whole period is reported as down.