Keeping VMware Tools current

Keeping VMware Tools current is one of the best practices of vSphere operations. VMware Tools interfaces with both ESXi and the VM (the virtual motherboard, so to speak). Hence, there are 2 version comparisons to consider:

  1. VM Hardware version
  2. ESXi version

When you query the vSphere API, these are the possible Tools status values you get:

  • Guest Tools Current
  • Guest Tools Not Installed
  • Guest Tools Supported New
  • Guest Tools Supported Old
  • Guest Tools Too Old
  • Guest Tools Unmanaged
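
These correspond to the toolsVersionStatus2 property of each VM in the vSphere API. If you want to pull the raw values yourself, here is a minimal pyVmomi sketch; the vCenter hostname and credentials are placeholders, and certificate verification is skipped for brevity:

```python
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

# Lab-only shortcut: skip certificate verification. Use proper certificates in production.
ctx = ssl._create_unverified_context()

# Placeholder connection details; adjust for your environment.
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret",
                  sslContext=ctx)
content = si.RetrieveContent()

# Walk every VM in the inventory and print its Tools status.
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
for vm in view.view:
    # toolsVersionStatus2 returns strings such as guestToolsCurrent,
    # guestToolsNotInstalled, guestToolsSupportedNew, guestToolsSupportedOld,
    # guestToolsTooOld and guestToolsUnmanaged.
    print(vm.name, vm.guest.toolsVersionStatus2)

Disconnect(si)
```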

What do they mean?

  • Not Installed
    • Tools are not installed on the VM. You should install them, as you get both drivers and visibility.
  • Current
    • The Tools version matches the Tools version bundled with ESXi. Each ESXi release has a version of Tools that comes with it; see this for the list. This is the ideal scenario.
  • Supported New
    • The Tools version is newer than the one bundled with ESXi, but it is still supported.
  • Supported Old
    • The opposite of Supported New. It is also supported. Even a version that is older by 0.0.1 is considered old; it does not have to be far behind.
  • Too Old
    • The Tools version is older than the minimum supported version of Tools across all ESXi versions. The minimum supported version is the oldest version of Tools we support, so the guest is basically running unsupported Tools and you should upgrade. As of now, for Linux and Windows guests, the minimum supported version is the Tools version bundled with ESXi 4.0, which is 8.0.1. Supporting such old versions is challenging, so we are planning to raise this minimum in the future. In the meantime, you should upgrade, as Tools this old might not work as expected.
  • Unmanaged
    • Tools installed in the guest did not come from ESXi, so Tools is not being managed by the ESXi host. It may or may not be supported, depending on what type of Tools is running in the guest. We support open-vm-tools packaged by Linux vendors and OSPs, which both show up as unmanaged.
    • If a customer builds their own open-vm-tools from source code, we may not support that, because we cannot tell whether they have built it correctly.

Operationalize Your World has a dashboard that highlights the VMs not running Current or Supported New Tools. You should expect the number to be minimal, ideally zero.

Does the IaaS serve the VMs well?

That’s a loaded question. If your CIO is asking you that, can you give a convincing answer?

If you have 1000 VMs, how do you prove that the platform serves each and every one of them well? How can you show 1000 VMs on a single screen, no more than 1280 x 1024 pixels? Wouldn’t it be awesome to project this on a live screen, where the performance of your platform can be appreciated by your management and customers? You take pride in architecting it, paying attention to details, but sadly infrastructure has become a thankless job due to commoditization. The only time they call you is when there is a problem.

Your hard work can be shown. You too can have a live show 🙂

But first, you need to define “well”. A mission critical VM carrying voice traffic won’t tolerate latency that a batch job happily accepts. Have you defined what “performance” exactly is? Because if you have not, how can you measure it? And if you cannot measure it, how can you prove it’s good or bad?

Approach

  • Step 1: define IaaS Performance
    • IaaS performance differs from VM performance. Since the nature of IaaS is a service, it is performing well when it is serving its workload well. I recommend you have only 2 or 3 Service Tiers.
    • Yes, if you have 1000 VMs, you take 1000 independent measurements. You don’t take the average; you take the worst. Measure this, and agree with your management on what “well” means, then review it as a Performance SLA (see the sketch after this list).
  • Step 2: measure if each VM is served well
    • IaaS provides 4 services:
      • CPU
      • RAM
      • Disk
      • Network
    • Now that you have an SLA, you can measure if any VM is not served well.
  • Step 3: aggregate at cluster level
    • This enables you to report at the cluster level, which makes capacity management easier.
    • If you have thousands of VMs, grouping by cluster results in a neater visualization. If your clusters average 300 VMs, 60,000 VMs result in just 200 clusters, and 200 objects easily fit inside a heat map on a projected screen.
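
To illustrate the point in Step 1 about taking the worst measurement rather than the average, here is a small sketch; the latency numbers and the 20 ms Tier 1 threshold are made up:

```python
# Why the worst value, not the average, decides whether the IaaS serves its
# VMs well. The latency values and the 20 ms threshold are made-up examples.
disk_latency_ms = {"vm-app01": 4.1, "vm-db01": 3.2, "vm-voice01": 38.0}

tier1_sla_ms = 20.0                                  # hypothetical Tier 1 SLA
average = sum(disk_latency_ms.values()) / len(disk_latency_ms)
worst = max(disk_latency_ms.values())

print(f"average = {average:.1f} ms")                 # 15.1 ms, looks healthy
print(f"worst   = {worst:.1f} ms")                   # 38.0 ms: vm-voice01 is not served well
print("SLA met" if worst <= tier1_sla_ms else "SLA breached")
```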

Implementation

Here is how I did it. You don’t have to do all the preparation, as it’s already created as part of Operationalize Your World; you simply import it. I’m showing how it’s done in this post.

  1. I created 1 group for each service tier. I called them Tier 1 (Gold), Tier 2 (Silver) and Tier 3 (Bronze).
  2. In each group, I added the clusters and datastores that belong to the tier. For the VMs, I simply specified all VMs that belong to those clusters. This way, future VMs are automatically tagged.

The implementation uses super metrics built on top of super metrics. In the current release this has to be done via super metrics, as you need to do calculations on the values, and regular metrics and properties are not user-editable.

Level 1 super metric: Performance SLA

I created super metrics to define the Performance SLA, 1 set for each service tier. Here is what it looks like in vR Ops 6.6.

I created 3 policies, 1 for each Service Tier, and enabled the respective super metrics in each policy. You don’t need a super metric for network, as you expect 0 dropped packets. The VM vNIC RX counter may report dropped packets when there aren’t any, so I’m only using VM TX.

I mapped the policy to the respective group.
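
To make this concrete, here is a purely hypothetical set of per-tier thresholds. The metric names and numbers below are placeholders, not the actual values used in the Operationalize Your World super metrics; the real numbers should come from the agreement with your management:

```python
# Hypothetical Performance SLA thresholds, one set per service tier.
# The numbers are placeholders; use the values agreed with your management.
PERFORMANCE_SLA = {
    "Tier 1 (Gold)":   {"cpu_ready_pct": 1.0, "ram_contention_pct": 0.0, "disk_latency_ms": 10},
    "Tier 2 (Silver)": {"cpu_ready_pct": 2.5, "ram_contention_pct": 1.0, "disk_latency_ms": 20},
    "Tier 3 (Bronze)": {"cpu_ready_pct": 5.0, "ram_contention_pct": 2.0, "disk_latency_ms": 30},
}
# No network threshold is needed: the expectation is simply 0 dropped TX packets.
```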

Level 2 super metric: count the breach for each VM

Once a VM inherits the SLA super metric, you can compare it with the actual performance.

If a VM is being served well, it will have no SLA breach. If not, it will have one. This means the formula needs to return 0 or 1. For that, you need to use IF THEN ELSE.

If Actual > SLA Then return 1 Else return 0.

For network, since the expectation is 0 dropped packets, the formula is simpler:

If Network TX dropped packets > 0 Then return 1 Else return 0.

I’m not considering RX for the reason given earlier.

The entire formula looks like this:

Count of SLA breach = CPU SLA breach + RAM SLA breach + Disk SLA breach + Network SLA breach

If a VM is not served well on all 4 types, it will have a value of 4.
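
As a sketch in plain code, the per-VM breach count is just the sum of four 0-or-1 comparisons. The metric names below are illustrative placeholders, not actual vR Ops metric keys:

```python
# Illustrative sketch of the Level 2 logic: each of the 4 services contributes
# 0 or 1, so the per-VM breach count ranges from 0 (all good) to 4.
# Metric names are placeholders, not actual vR Ops metric keys.
def sla_breach_count(vm_metrics: dict, sla: dict) -> int:
    cpu_breach  = 1 if vm_metrics["cpu_ready_pct"]      > sla["cpu_ready_pct"]      else 0
    ram_breach  = 1 if vm_metrics["ram_contention_pct"] > sla["ram_contention_pct"] else 0
    disk_breach = 1 if vm_metrics["disk_latency_ms"]    > sla["disk_latency_ms"]    else 0
    net_breach  = 1 if vm_metrics["tx_dropped_packets"] > 0 else 0  # expectation is 0 drops
    return cpu_breach + ram_breach + disk_breach + net_breach
```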

Here is what the super metric looks like:

Level 3 super metric: count the total SLA breach per cluster as %

This enables you to compare across clusters. You can tell which clusters are struggling to cope with the demand placed on them.

There are 2 considerations:

  • Do you count per VM or per breach? If a VM has 4 breaches, do you count 4 or 1? I prefer to count 4, as it reflects the severity of the breach. If you count per VM, 2 clusters with an identical number of unserved VMs look the same, and you can’t tell which one has the bigger problem. Plus, adding an ESXi host adds CPU, RAM and network capacity.
  • Do you compare absolute or relative? Say you have many clusters, and they vary in size and number of VMs. Everything else being equal, the big clusters will appear to struggle more simply because they have more VMs. If you divide the number by the number of VMs, you normalize it. You also get a percentage, which makes discussion with management easier. You can say “we’ve experienced a 10% breach. As per the capacity management policy, this is when we trigger a hardware purchase so IT stays ahead of the business”. Standardizing the unit in the policy enables you to apply the same treatment across clusters.

You may ask, why include VM disk latency when it’s a property of the datastore, not the cluster? It makes analysis easier; otherwise we have to present 2 sets of information. Plus, there is generally a correlation between cluster and datastore.

Here is the formula:

Quiz! Why divide by 4 SLA types?

Got the answer? 🙂

The number can theoretically touch 400%, as a VM can potentially have 4 breaches. To make it easier (100% is so much easier to explain than 400%), I divide the number by 4.
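
A sketch of that normalization, with a made-up cluster of 4 VMs:

```python
# Cluster-level SLA breach as a percentage. breach_counts has one entry per VM,
# each between 0 and 4 (the Level 2 result).
def cluster_breach_pct(breach_counts: list[int]) -> float:
    if not breach_counts:
        return 0.0
    # Divide by 4 SLA types so the theoretical worst case is 100%, not 400%.
    return 100.0 * sum(breach_counts) / (len(breach_counts) * 4)

# A made-up cluster of 4 VMs with 5 breaches in total -> 31.25%
print(cluster_breach_pct([0, 0, 1, 4]))
```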

Here is the actual formula:

If you are curious what it looks like in vR Ops, here it is:

Visualising the data

Since you have the data at both the VM level and the cluster level, you can show it in 2 different ways.

Heat map 1: Are the VMs being served well?

This shows all the VMs. Naturally, grouping will occur. You expect this heat map to be mostly green most of the time. It won’t be 100% green at all times, as you are maximizing your investment.

The color is by SLA breach, where red = 4 (all SLAs breached). The size is ideally by service tier, so Tier 1 VMs stand out. You can use vCPU as a proxy; I prefer vCPU over criticality as the large VMs stand out. You can also use vRAM, so you have 2 heat maps.

What can you tell from the above?

  • The distribution of my VMs among my Tier 1, Tier 2 and Tier 3. The VMs are grouped by service tier, then by cluster within that tier.
  • The majority of VMs are not getting the SLA promised. However, it’s mostly 1 breach per VM, which is likely network in my case above.

Heat map 2: are the clusters performing well?

This shows all the clusters. Because the units are standardized in %, you can compare clusters easily now. Color is by the Cluster Performance (%), where Green = 100% (no SLA breached).

Size is by the number of VMs, so the clusters serving more VMs stand out. You can also use the number of hosts, so you have 2 heat maps.

What’s a good guideline? To me, 5% is serious enough to warrant a hardware expansion, as it takes more than a month to add capacity. Remember, we divide the number by 4.

Hope that’s useful. Do get the whole suite of dashboards from Operationalize Your World.

Armenia. Home of vR Ops

Most customers, and a good part of VMware’s own team, do not realize that vRealize Operations is developed out of Armenia. Who would have thought? Armenia is not as well known as other locations in terms of software development. As an IT professional, you may travel to India (Bangalore), China (Shanghai) or the US (Silicon Valley), but not Armenia. In this post, I hope to raise awareness, as I think it’s worth knowing. On a personal note, it’s worth visiting!

vR Ops is developed in 3 cities: Palo Alto (Silicon Valley), Yerevan (Armenia) and Bangalore (India). Yerevan has the bulk of the team, with over 100 R&D engineers in 1 office location. Bangalore does most of the management packs and Endpoint Operations. The lead of vR Ops product management, Monica Sharma, is based in Palo Alto. She works closely with Sunny Dua, who is based in the same building (I think they are in the Prom E building). In the US, we also have our PMM (Product Marketing Manager) and TMM (Technical Marketing Manager) team. Awesome folks such as Taruna Gandhi and John Diaz are part of this marketing specialist team.

Armenia actually does more than vR Ops. Part of Log Insight is also developed here. Stephen Flanders (whom you should know if you use Log Insight) is a regular visitor to Yerevan.

As part of development, you need QE. vR Ops integrates with many products, so all this testing needs to be automated. The ESO (Engineering Services Organisation) has developed an end-to-end automation framework that enables the daily build to be tested. The product is automatically deployed and configured; it adds a number of vCenter servers automatically, performs various tests and reports the results. It tests various adapters too.

Karen Aghajanyan is the Director in charge in Armenia, and the sponsor of my Take 2. I say sponsor as he paid for it (I’m honored that R&D considered it worthy to fly me there). Take 2 is a cool program in VMware that allows you to take 2 weeks off your current role and do the things you normally do as your night job. In my case, it was a 2-week stint in Armenia, working on the next release, code named Aria. It was intense, working with many developers, brainstorming the best way to improve the product while maintaining backward compatibility. On Day 1, Karen assembled all of his management team and asked me to present an overview of the areas for improvement. Once we agreed on the areas of work, it was straight into collaboration with the respective developers. It’s both a pleasure and an honor to work with the developers. They are passionate and have code-level knowledge; if there is a part they are not sure about, they just open the source code! Years of working with vR Ops helped me, otherwise I’d not have been able to keep up with the depth they expect. Other than depth, you also need breadth, as you need to know the impact of the proposed changes. In fact, maintaining backward compatibility is more complex than adding the features, and we spent a lot of time deliberating to make sure customers can safely upgrade. I learned about the inner workings of the product, which makes me appreciate it even more.

John Yaralian is the Product Owner, and my host. He was the one who requested approval from my manager (Kamau Wanguhu) for my 2-week stint. As the Product Owner, John acts as the single point of contact to the R&D team. If you are hoping to do a Take 2 in Armenia, John is also your contact point. John was also the lead for the inaugural vR Ops Boot Camp. It’s a 2-day event where attendees get to spend time directly with the developers. We ran 5 tracks, each repeated 3x. Each track is capped at <10 people, so you get ample time for discussion. There is also an open house, where booths are set up. Each booth covers a specific topic, such as UI, super metrics, scalability, and upgrades. You get to ask all those burning questions!

My Take 2 was part of the overall planning session for project Aria. Monica and Sunny were also there. If you use vR Ops, you’ll come to know that Sunny is my fellow partner in crime. We respect each other deeply, and argue a lot! Shannon Klebart from the VMware Office of the CTO calls us her Batman and Robin. For 2 weeks, including weekends, we discussed the solution for Aria in depth. We brainstormed many parts of the product. Usability is top of mind for me. As a customer-facing engineer, working closely with paying customers who use the products, I care a lot about usability. IMHO, usability goes beyond the UI. It starts before you even deploy the damn thing. What if sizing is hard? What if the deployment architecture is complex? You won’t use the product if you don’t know how to do an enterprise deployment.

On the UI front, we also developed mock-ups for the next-generation UX. We looked at the dashboard life cycle in detail. We went through each widget one by one and looked at the areas for improvement, mindful of keeping them consistent yet unique. Vahan Tadevosyan and Tigran Avagimyants, the 2 lead developers for the UI, were passionate about bringing as many enhancements as possible to their work.

I won’t steal the thunder from PM and PMM, so I will just say that as an engineer I’m excited about Aria. Stay tuned!

Working with R&D naturally involves patents. It’s inspiring to meet someone much younger than you who holds >20 patents! Naira Grigoryan and Nina Karapetyan reviewed a patent idea I shared and we’re collaborating on the submission. For me as an engineer, I think it’s just uber cool to have a patent under my name 🙂

Armenia isn’t just work. I was stunned by the rich history. It’s fascinating to learn the stories behind the old churches. The monasteries are also located in remote places, so they are solitary; you’d enjoy the surrounding view and spaciousness. I came in June, so the weather was good and the air was fresh. In winter it will be snowy, which is not something I’d personally enjoy 🙂

Other than history, Sunny and I got a chance to see the scenery too. This blog post’s profile photo was taken at Jermuk. Picturesque, isn’t it? Robert Mesropyan and Arthur Aghabekyan drove 400 km on a Sunday to take us there. Sunny and I were simply blown away by their kindness. It is a humbling experience. What’s even more amazing is that this kindness is displayed by all the developers. 5 of them came out, on weekends, to show us their beautiful country! Gagik Manukyan, Arshak Galstyan and Sevak Tsaturyan took 5 of us out on Saturday. Talk about hospitality! We did a short video to thank the entire team. You get to see the first part, which has a blooper, so I hope you enjoy it.

Here is a place I want to see first-hand on my next trip, as I didn’t get to see it this time: Lake Parz at Dilijan.

The city of Yerevan is relatively small. I live in Singapore, a modern metropolis, so I enjoyed the striking difference. It’s also on a hill; you can walk up the cool open stairs of the Cascade and see the city from a high vantage point. Just like Singapore, the city is safe. You’re not worried about pickpockets or personal safety. There is no Uber, but there is a local company called GG that provides a similar service. There is no train, and I didn’t get the chance to try the buses. Next time I’m there, I might rent a car. It’s pretty easy to drive as traffic isn’t bad.

I hope this short post gives you an appreciation of the software we’ve come to know and love. I’ve been using it since 1.0, many blue moons ago, and the best is yet to be!