Tag Archives: counters

vCenter and vRealize Operations – counter fundamentals

This blog post was loosely derived from my book, titled VMware Performance and Capacity Management. It is published by Packt Publishing. You can buy it at Amazon or Packt.

In the book, I compared the metric groups between vCenter and vRealize Operations. We know that vRealize Operations provides a lot more detail, especially for larger objects such as vCenter, data center, and cluster. It also provides information about the distributed switch, which is not displayed in vCenter at all. This makes it useful for the big-picture analysis.

You should spend time understanding vCenter counters and esxtop counters. This blog post is not meant to duplicate the manual. I would encourage you to read the vSphere documentation on this topic, as it gives you the required foundation while working with vRealize Operations. The following are the links to this topic:

  • The link for vSphere 6.5 is here. If this link does not work, visit this and then navigate to vSphere Monitoring and Performance Guide | vSphere Monitoring and Performance | Monitoring Inventory Objects with Performance Charts.
  • The counters are documented in the vSphere API. If this link has changed and no longer works, open the vSphere online documentation and navigate to vSphere API/SDK Documentation | vSphere Management SDK | vSphere Web Services SDK Documentation | VMware vSphere API Reference | Managed Object Types | P. Here, choose Performance Manager from the list under the letter P.
  • The esxtop manual provides good information on the counters.

You should also be familiar with the architecture of ESXi, especially how the scheduler works.

vCenter has a different collection interval (sampling period) depending upon the timeline you are looking at. Most of the time you are looking at the real-time statistic (chart), as other timelines do not have enough counters. You will notice right away that most of the counters become unavailable once you choose a timeline. In the real-time chart, each data point has 20 seconds’ worth of data. That is as accurate as it gets in vCenter. Because all other performance management tools (including vRealize Operations) get their data from vCenter, they are not getting anything more granular than this. As mentioned previously, esxtop allows you to sample down to a minimum of 2 seconds.

Speaking of esxtop, you should be aware that not all counters are exposed in vCenter. For example, if you turn on 3D graphics, there is a separate SVGA thread created for that VM. This can consume CPU and it will not show up in vCenter. The Mouse, Keyboard, Screen (MKS) threads, which give you the VM console, also do not show up in vCenter.

The next screenshot shows how you lose most of your counters if you choose a time span other than real time. In the case of CPU, you are basically left with two counters, as Usage and Usage in MHz cover the same thing. You also lose the ability to monitor per core, as the target objects only list the host now and not the individual cores.

1 day time line

Because the real-time timespan only lasts for 1 hour, the performance troubleshooting has to be done at the present moment. If the performance issue cannot be recreated, there is no way to troubleshoot in vCenter. This is where vRealize Operations comes in, as it keeps your data for a much longer period. I was able to perform troubleshooting for a client on a problem that occurred more than a month ago!

vRealize Operations takes data every 5 minutes. This means it is not suitable to troubleshoot performance that does not last for 5 minutes. In fact, if the performance issue only lasts for 5 minutes, you may not get any alert, because the collection may happen exactly in the middle of those 5 minutes. For example, let’s assume the CPU is idle from 08:00:00 to 08:02:30, spikes from 08:02:30 to 08:07:30, and then again is idle from 08:07:30 to 08:10:00. If vRealize Operations is collecting at exactly 08:00, 08:05, and 08:10, you will not see the spike as it is spread over two data points. This means, for vRealize Operations to pick up the spike in its entirety without any idle data, the spike has to last for 10 minutes or more.

In some metrics, the unit is actually 20 seconds. vRealize Operations averages a set of 20-second data points into a single 5-minute data point.

The Rollups column in the preceding screenshot  is important. Average means the average of 5 minutes in the case of vRealize Operations. The summation value is actually an average for those counters where accumulation makes more sense. An example is CPU Ready time. It gets accumulated over the sampling period. Over a period of 20 seconds, a VM may accumulate 200 milliseconds of CPU ready time. This translates into 1%, which is why I said it is similar to average, as you lose the peak.

Latest, on the other hand, is different. It takes the last value of the sampling period. For example, in the sampling for 20 seconds, it takes the value between 19 and 20 seconds. This value can be lower or higher than the average of the entire 20-second period.

So what is missing here is the peak of the sampling period. In the 5-minute period, vRealize Operations does not collect low, average, and high from vCenter. It takes average only.

Let’s talk about the Units column now. Some common units are milliseconds, MHz, percent, KBps, and KB. Some counters are shown in MHz, which means you need to know your ESXi physical CPU frequency. This can be difficult due to CPU power saving features, which lower the CPU frequency when the demand is low. In large environments, this can be operationally difficult as you have different ESXi hosts from different generations (and hence, are likely to sport a different GHz). This is also the reason why I state that the cluster is the smallest logical building block. If your cluster has ESXi hosts with different frequencies, these MHz-based counters can be difficult to use, as the VMs get vMotion-ed by DRS.

vCenter and vRealize counters – part 3

Storage

If you look at the ESXi and VM metric groups for storage in the vCenter performance chart, it is not clear how they relate to one another at first glance. You have storage network, storage adapter, storage path, datastore, and disk metric groups that you need to check. How do they impact on one another?

I have created the following diagram to explain the relationship. The beige boxes are what you are likely to be familiar with. You have your ESXi host, and it can have NFS Datastore, VMFS Datastore, or RDM objects. The blue colored boxes represent the metric groups.

Book - chapter 4 - 01

NFS and VMFS datastores differ drastically in terms of counters, as NFS is file-based while VMFS is block-based. For NFS, it uses the vmnic, and so the adapter type (FC, FCoE, or iSCSI) is not applicable. Multipathing is handled by the network, so you don’t see it in the storage layer. For VMFS or RDM, you have more detailed visibility of the storage. To start off, each ESXi adapter is visible and you can check the counters for each of them. In terms of relationship, one adapter can have many devices (disk or CDROM). One device is typically accessed via two storage adapters (for availability and load balancing), and it is also accessed via two paths per adapter, with the paths diverging at the storage switch. A single path, which will come from a specific adapter, can naturally connect one adapter to one device. The following diagram shows the four paths:

Book - chapter 4 - 02

A storage path takes data from ESXi to the LUN (the term used by vSphere is Disk), not to the datastore. So if the datastore has multiple extents, there are four paths per extent. This is one reason why I did not use more than one extent, as each extent adds four paths. If you are not familiar with extent, Cormac Hogan explains it well on this blog post.

For VMFS, you can see the same counters at both the Datastore level and the Disk level. Their value will be identical if you follow the recommended configuration to create a 1:1 relationship between a datastore and a LUN. This means you present an entire LUN to a datastore (use all of its capacity).

The following screenshot shows how we manage the ESXi storage. Click on the ESXi you need to manage, select the Manage tab, and then the Storage subtab. In this subtab, we can see the adapters, devices, and the host cache. The screen shows an ESXi host with the list of its adapters. I have selected vmhba2, which is an FC HBA. Notice that it is connected to 5 devices. Each device has 4 paths, so I have 20 paths in total

ESX - Adapter

Let’s move on to the Storage Devices tab. The following screenshot shows the list of devices. Because NFS is not a disk, it does not appear here. I have selected one of the devices to show its properties.

ESXi - device

If you click on the Paths tab, you will be presented with the information shown in the next screenshot, including whether a path is active. Note that not all paths carry I/O; it depends on your configuration and multipathing software. Because each LUN typically has four paths, path management can be complicated if you have many LUNs.

ESXi - device path

The story is quite different on the VM layer. A VM does not see the underlying shared storage. It sees local disks only. So regardless of whether the underlying storage is NFS, VMFS, or RDM, it sees all of them as virtual disks. You lose visibility in the physical adapter (for example, you cannot tell how many IOPSs on vmhba2 are coming from a particular VM) and physical paths (for example, how many disk commands traveling on that path are coming from a particular VM). You can, however, see the impact at the Datastore level and the physical Disk level. The Datastore counter is especially useful. For example, if you notice that your IOPS is higher at the Datastore level than at the virtual Disk level, this means you have a snapshot. The snapshot IO is not visible at the virtual Disk level as the snapshot is stored on a different virtual disk.

Book - chapter 4 - 03

Network

My apology that I cannot publish information on Network as it’s not provided as free pages by the publisher. The information is covered in my book.

vCenter and vRealize counters – part 2

Compute

The following diagram shows how a VM gets its resource from ESXi. Unlike a physical server, a VM has dynamic resources given to it. It is not static. Contention, Demand and Entitlement are concepts that do not exist in the physical world. It is a pretty complex diagram, so let me walk you through it.

Book - chapter 3 - 02

The tall rectangular area represents a VM. Say this VM is given 8 GB of virtual RAM. The bottom line represents 0 GB and the top line represents 8 GB. The VM is configured with 8 GB RAM. We call this Provisioned. This is what the Guest OS sees, so if it is running Windows, you will see 8 GB RAM when you log into Windows.

Unlike a physical server, you can configure a Limit and a Reservation. This is done outside the Guest OS, so Windows or Linux does not know. You should minimize the use of Limit and Reservation as it makes the operation more complex.

Entitlement means what the VM is entitled to. In this example, the hypervisor entitles the VM to a certain amount of memory. I did not show a solid line and used an italic font style to mark that Entitlement is not a fixed value, but a dynamic value determined by the hypervisor. It varies every minute, determined by Limit, Entitlement, and Reservation of the VM itself and any shared allocation with other VMs running on the same host.

Obviously, a VM can only use what it is entitled to at any given point of time, so the Usage counter does not go higher than the Entitlement counter. The green line shows that Usage ranges from 0 to the Entitlement value.

In a healthy environment, the ESXi host has enough resources to meet the demands of all the VMs on it with sufficient overhead. In this case, you will see that the Entitlement, Usage, and Demand counters will be similar to one another when the VM is highly utilized. This is shown by the green line where Demand stops at Usage, and Usage stops at Entitlement. The numerical value may not be identical because vCenter reports Usage in percentage, and it is an average value of the sample period. vCenter reports Entitlement in MHz and it takes the latest value in the sample period. It reports Demand in MHz and it is an average value of the sample period. This also explains why you may see Usage a bit higher than Entitlement in highly-utilized vCPU. If the VM has low utilization, you will see the Entitlement counter is much higher than Usage.

An environment in which the ESXi host is resource constrained is unhealthy. It cannot give every VM the resources they ask for. The VMs demand more than they are entitled to use, so the Usage and Entitlement counters will be lower than the Demand counter. The Demand counter can go higher than Limit naturally. For example, if a VM is limited to 2 GB of RAM and it wants to use 14 GB, then Demand will exceed Limit. Obviously, Demand cannot exceed Provisioned. This is why the red line stops at Provisioned because that is as high as it can go.

The difference between what the VM demands and what it gets to use is the Contention counter. Contention is a special counter that tracks all these competition for resources. It’s a counter that only exists in the virtual world.

So Contention, simplistically speaking, is Demand – Usage. I said simplistically as that’s not the actual formula. The actual formula does not really matter for all practical purpose as it’s all relative to the expectation that you’ve set to your customers (VM Owner).

Contention happens when what the VM demands is more than it gets to use. So if the Contention is 0, the VM can use everything it demands. This is the ultimate goal, as performance will match the physical world. This Contention value is useful to demonstrate that the infrastructure provides a good service to the application team. If a VM owner comes to see you and says that your shared infrastructure is unable to serve his or her VM well, both of you can check the Contention counter.

The Contention counter should become a part of your Performance SLA or Key Performance Indicator (KPI). It is not sufficient to track utilization alone. When there is contention, it is possible that both your VM and ESXi host have low utilization, and yet your customers (VMs running on that host) perform poorly. This typically happens when the VMs are relatively large compared to the ESXi host. Let me give you a simple example to illustrate this. The ESXi host has two sockets and 20 cores. Hyper-threading is not enabled to keep this example simple. You run just 2 VMs, but each VM has 11 vCPUs. As a result, they will not be able to run concurrently. ESXi VMkernel will schedule them sequentially as there are only 20 physical cores to serve 22 vCPUs. Here, both VMs will experience high contention.

Hold on! You might say, “There is no Contention counter in vSphere and no memory Demand counter either.”

This is where vR Ops comes in. It does not just regurgitate the values in vCenter. It has implicit knowledge of vSphere and a set of derived counters with formulae that leverage that knowledge.

You need to have an understanding of how the vSphere CPU scheduler works.
The following diagram shows the various states that a VM can be in:

Book - chapter 3 - 03

The preceding diagram is taken from The CPU Scheduler in VMware vSphere®
5.1: Performance Study. This is a whitepaper that documents the CPU scheduler with
a good amount of depth for VMware administrators. I highly recommend you read this
paper as it will help you explain to your customers (the application team) how your
shared infrastructure juggles all those VMs at the same time. It will also help you pick
the right counters when you create your custom dashboards in vRealize Operations.