Keeping VMware Tools current

Keeping VMware Tools current is one of the best practices of vSphere operations. VMware Tools interfaces with both ESXi and the VM (the virtual motherboard or virtual machine). Hence, there are 2 comparisons to consider:

  1. VM Hardware version
  2. ESXi version

From the vSphere API, here is what you get when you query it:

  • Guest Tools Current
  • Guest Tools Not Installed
  • Guest Tools Supported New
  • Guest Tools Supported Old
  • Guest Tools Too Old
  • Guest Tools Unmanaged

What do they mean?

  • Guest Tools Not Installed:
    • Tools are not installed on the VM. You should install them, as you gain both drivers and visibility.
  • Current
    • Tools version matches the Tools version bundled with ESXi. Each ESXi release comes with its own version of Tools. See this for the list. This is the ideal scenario.
  • Supported New
    • Newer than the VMware Tools version bundled with ESXi, but it is supported.
  • Supported Old
    • The opposite of New. It is also supported. Even a version that is older by 0.0.1 is considered old. It does not have to be far behind.
  • Too Old
    • Tools version is older than the minimum supported version of Tools across all ESXi versions. Minimum supported version is the oldest version of Tools we support. Basically, the guest is running unsupported Tools. As of now, for Linux and Windows guests, the minimum supported version is the Tools version bundled with ESXi 4.0, which is 8.0.1. Supporting such old versions is challenging. We are planning to change this in future to something newer. In the meantime, you should upgrade, as Tools this old might not work as expected.
  • Unmanaged
    • Tools installed in the guest did not come from ESXi, so Tools is not being managed by the ESXi host. It may or may not be supported, depending on what type of Tools is running in the guest. We support open-vm-tools packaged by Linux vendors and OSPs, which both show up as unmanaged.
    • If a customer builds their own open-vm-tools from source code, we may not support that because we will not know if they have done it correctly or not.
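
The decision logic behind these statuses can be sketched in a few lines of Python. This is only an illustration of the rules described above, not how ESXi or the vSphere API actually computes the status. The function name, the `from_esxi` flag and the sample version numbers are hypothetical; only the 8.0.1 minimum comes from the text.

```python
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def classify_tools(installed: Optional[Version],
                   bundled: Version,
                   minimum: Version,
                   from_esxi: bool = True) -> str:
    """Map a Tools version to one of the statuses listed above.

    installed: version found in the guest, or None if Tools is absent.
    bundled:   version that ships with the ESXi host running the VM.
    minimum:   oldest Tools version still supported.
    from_esxi: False for open-vm-tools or OSPs (hypothetical flag).
    """
    if installed is None:
        return "Guest Tools Not Installed"
    if not from_esxi:
        return "Guest Tools Unmanaged"
    if installed < minimum:
        return "Guest Tools Too Old"
    if installed < bundled:
        # Even a 0.0.1 difference counts as old.
        return "Guest Tools Supported Old"
    if installed > bundled:
        return "Guest Tools Supported New"
    return "Guest Tools Current"


MINIMUM = (8, 0, 1)   # from the text: Tools bundled with ESXi 4.0
BUNDLED = (10, 0, 9)  # hypothetical version bundled with the host

print(classify_tools((10, 0, 9), BUNDLED, MINIMUM))
print(classify_tools((10, 0, 8), BUNDLED, MINIMUM))
print(classify_tools((7, 0, 0), BUNDLED, MINIMUM))
```

Tuples compare element by element, so (10, 0, 8) is older than (10, 0, 9) without any string parsing.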

Operationalize Your World has a dashboard that highlights the VMs not running Current or Supported New Tools. You should expect the number to be minimal, or ideally none.

Which VMs need more resources?

You can increase the following resources for a VM:

  • CPU
  • RAM
  • Storage

Network isn’t something you can increase, but you know that already 🙂

You can check which VMs need more resources by building a dashboard like the one below. It’s a simple dashboard, which you can customize and enhance. It lets you assess each resource independently.

I’ve marked the above dashboard with numbers, so we can refer to them:

  1. This is a table that lists all VMs. It shows the highest 1-hour average of CPU Demand and RAM Demand, and is sorted by the highest CPU Demand. The table also lists the VM CPU and RAM configuration, so you can see if the VMs are small or large. It also shows the cluster where each VM is located. I’m showing both CPU and RAM in a single table. You can clone the view and split them if that suits your operations better.
  2. This is a table that lists all VMs, but focusing on storage only. With storage, we do not have the complexity of checking peak utilisation. We simply need to check the present situation.
  3. This lists the Top-15 VMs with highest CPU Demand and RAM Demand in a given period. The list is now split, as they can be different VMs. Do note that the Top-N widget will average the number over the selected period, so a VM with a cyclical workload may not show up. The Top-N is complemented with a distribution chart. Select a VM from the Top-N, and you can see where the VM utilisation is.
  4. The distribution chart helps you see if the VM is really under-resourced or not. The 95th percentile is marked with a vertical green line. For an under-resourced VM, you expect that line to be at or near 100%, indicating that the VM hits 100% utilisation frequently. If the 95th percentile is at a low number, and you do not see the number 100 on the x-axis, that means the VM is not under-resourced.
  5. Storage is easier, as we can simply use the last data. As a result, we can show a distribution of all the VMs. We use a heat map as it can show 2 dimensions. Every VM is represented as a box. The bigger the box, the more storage the VM is configured with. The color indicates how much of it the VM uses.
    • 0% = Black. Wastage
    • 10% = Green. Balanced usage
    • 100% = Red. Need more space!
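
To see why the Top-N average (item 3) needs the distribution chart (item 4), here is a minimal Python sketch. It is not how the widgets are implemented. The sample data, the function names and the 95% threshold are assumptions chosen for illustration.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile of a list of utilisation samples (in %)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def looks_under_resourced(samples, pct=95, threshold=95.0):
    """Flag a VM whose 95th percentile sits at or near 100% (assumed threshold)."""
    return percentile(samples, pct) >= threshold


# Hypothetical CPU utilisation samples over the same period.
cyclical = [10.0] * 80 + [100.0] * 20   # idle most of the time, saturated in bursts
steady = [40.0] * 100                   # moderate load, never saturated

for name, data in (("cyclical", cyclical), ("steady", steady)):
    print(name, mean(data), percentile(data, 95), looks_under_resourced(data))
```

Ranked by average, the steady VM comes out ahead of the cyclical one, yet only the cyclical VM has its 95th percentile at 100%. That is the case the distribution chart is there to catch.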

The CPU and RAM widgets have limitations. For example, they may show high utilisation during AV backup. You want to ignore those periods. At this moment, the only way is to plot the high usage over a line chart. We use Log Insight for this. The chart below shows VMs that hit high CPU usage in a given period. Every time a VM hits high CPU usage, it will show up here. As you can see, there are only 4 VMs that hit high CPU usage. All other VMs do not need more CPU.
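
If you export the samples, ignoring a known busy window is straightforward. The following Python sketch is a generic illustration, not a Log Insight query. The 90% threshold, the 01:00 to 03:00 window, the VM names and the sample values are all hypothetical.

```python
from datetime import datetime, time


def in_window(ts, start, end):
    """True if the time of day of ts falls in [start, end); handles windows crossing midnight."""
    t = ts.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def high_cpu_hits(samples, threshold=90.0,
                  ignore_start=time(1, 0), ignore_end=time(3, 0)):
    """Count high-CPU samples per VM, skipping the window to ignore.

    samples: iterable of (vm_name, timestamp, cpu_percent) tuples.
    """
    hits = {}
    for vm, ts, cpu in samples:
        if cpu < threshold or in_window(ts, ignore_start, ignore_end):
            continue
        hits[vm] = hits.get(vm, 0) + 1
    return hits


samples = [
    ("vm-a", datetime(2017, 1, 10, 2, 0), 99.0),    # inside the ignored window
    ("vm-a", datetime(2017, 1, 10, 14, 0), 95.0),   # genuine daytime peak
    ("vm-b", datetime(2017, 1, 10, 2, 30), 98.0),   # only busy during the window
    ("vm-c", datetime(2017, 1, 10, 10, 0), 40.0),   # never high
]
hits = high_cpu_hits(samples)
print(hits)
```

Only vm-a remains, because its daytime peak falls outside the ignored window.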

The above is an example from a healthy environment. What about an environment where a lot of VMs are under-sized? You expect to see lots of alarms! That’s what you have below.

Hope the above is useful. If not, drop me an email.

NOC Dashboards for SDDC – Part 2

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Dashboard: Performance

Is my IaaS performing?

That’s the key question that you need to answer. You need to show if the clusters are coping well. Show how the clusters are performing in terms of CPU, RAM and Disk.

The above dashboard is per Service Tier. Do you know why?

Yes, the threshold differs for each tier. What is acceptable in Tier 3 may not be acceptable in Tier 1.

The good thing about a line chart is that it provides visibility beyond the present time. You can show the last 6 hours and still get good details. Showing >24 hours will result in a visualization that is too static, which is not suitable for the NOC use case.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • If you only have a few clusters, you can show multiple Service Tiers in 1 dashboard. 1 row per tier results in simpler visualisation.
  • In an environment with >10 clusters, we can group them into Service Tiers. Focus more on the highest tier (Tier 1).
  • In an environment with >100 clusters, we need another grouping in between. Group the Tier 1 clusters by physical location.

When a cluster is unable to cope, is it because of high utilization? I show CPU, RAM and Disk here. You can add Network, as you know the physical limit of the ESXi vmnic.

Disk is tricky for 2 reasons:

  • It is not in %. There is no 100% for IOPS or Throughput. The good thing is, when you architected the array or vSAN, you did have a certain IOPS number in mind, right? 😉 Well, now you can get the storage vendor to prove that it does indeed perform as per the PowerPoint 😉 If not, you get free hardware, if they promised performance fast enough to meet the business requirement.
  • You need to show both IOPS and Throughput. While they are related, you can have low IOPS and high throughput, and vice versa.
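
The relationship between the two is simple arithmetic: throughput is IOPS multiplied by the I/O size. The Python sketch below uses hypothetical workload numbers to show how low IOPS can still mean high throughput, and vice versa.

```python
def throughput_mb_per_s(iops, block_size_kb):
    """Throughput in MB/s for a given IOPS and I/O size (1 MB = 1024 KB)."""
    return iops * block_size_kb / 1024


# Hypothetical workloads, for illustration only.
small_block = throughput_mb_per_s(iops=20000, block_size_kb=4)    # many small I/Os
large_block = throughput_mb_per_s(iops=500, block_size_kb=1024)   # few large I/Os

print(small_block)  # 78.125
print(large_block)  # 500.0
```

The second workload issues 40 times fewer I/Os per second yet moves more than 6 times as much data, which is why a single chart of either metric can mislead.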

If the cluster utilization is high, the next dashboard drills into each ESXi.

We can also see if they are unbalanced. In theory, they should not be, if you set DRS to Fully Automated and pick at least the middle level of sensitivity (3). DRS in vSphere 6.5 also considers Network, so CPU/RAM can still end up unbalanced.

With the dashboard above, we can tell if ESXi CPU Utilisation is healthy or not.

  • A low value does not mean the VM performs fast. A VM is concerned with the VM Contention metric, not ESXi Utilization. A low value means we over-invested. It is not healthy, as we waste money and power.
  • A high value means we risk performance (read: contention).

For ESXi, go with a higher core count. You save on licenses if you can reduce the socket count.

We can also tell if ESXi RAM Utilisation is healthy or not.

  • Customers tend to overspend on RAM and underspend on CPU. The reason is this.
  • For RAM, we have 2 metrics:
    • Active RAM
    • Mapped RAM
  • The value you want is somewhere in between Active and Mapped.

In the dashboard, the 3 widgets have different ranges. The ranges I set are 30 – 90, 50 – 100 and 10 – 90.

Why not 0 – 100?

It is not 100% because you want to cater for HA. Your ESXi should not hit 100%: if a host fails and HA restarts its VMs on the remaining hosts, the demand on those hosts would go beyond 100%, meaning performance will be badly affected.
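
The headroom can be estimated with simple arithmetic. The Python sketch below assumes identical hosts and a load that spreads evenly after a failure, which is a simplification. The function names and cluster sizes are hypothetical.

```python
def max_safe_utilisation(hosts, host_failures_to_tolerate=1):
    """Highest average utilisation (%) that still fits after the given number of host failures."""
    if hosts <= host_failures_to_tolerate:
        raise ValueError("cluster must have more hosts than failures to tolerate")
    return 100.0 * (hosts - host_failures_to_tolerate) / hosts


def utilisation_after_failure(current_pct, hosts, failed=1):
    """Average utilisation (%) on the surviving hosts once the load is redistributed."""
    if hosts <= failed:
        raise ValueError("no surviving hosts")
    return current_pct * hosts / (hosts - failed)


print(max_safe_utilisation(4))             # 75.0
print(max_safe_utilisation(10))            # 90.0
print(utilisation_after_failure(90, 4))    # 120.0
```

A 4-host cluster running at 90% would need 120% of the surviving capacity after losing one host, which is why the upper bound in the widget stops short of 100%.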

If the cluster or ESXi utilization is high, is it because there are VMs generating excessive workloads?

The dashboard above answers if we have VMs that dominate the shared environment.

  • CPU: show a heat map showing all VMs, sized by CPU Demand in GHz (not %), color by contention
  • RAM: show a heat map showing all VMs, sized by Active RAM, color by contention
  • Storage: show a heat map showing all VMs, sized by IOPS, color by latency.

At a glance, we can tell the workload distribution among the VMs. We can also tell if they are being served well or not.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • You can change the threshold anytime. If you want a brand new storage from Finance, set the max to 1 ms 😉
  • In a larger environment, group your heat map (e.g. by cluster, host, folder).
  • We can show individual VMs, but we can’t show the history as there is too much data to show.
  • This needs to be done per Tier. 1 dashboard per Tier, as the threshold varies per tier.

Hope you find it useful. For the product-specific implementation, review this blog. To prevent the vROps session from timing out, implement this trick by Sunny.