This blog post was loosely derived from my book, titled VMware Performance and Capacity Management. It is published by Packt Publishing. You can buy it at Amazon or Packt.
In the book, I compared the metric groups between vCenter and vRealize Operations. We know that vRealize Operations provides a lot more detail, especially for larger objects such as vCenter, data center, and cluster. It also provides information about the distributed switch, which is not displayed in vCenter at all. This makes it useful for the big-picture analysis.
You should spend time understanding vCenter counters and esxtop counters. This blog post is not meant to duplicate the manual. I would encourage you to read the vSphere documentation on this topic, as it gives you the required foundation while working with vRealize Operations. The following are the links to this topic:
- The link for vSphere 6.5 is here. If this link does not work, visit this and then navigate to vSphere Monitoring and Performance Guide | vSphere Monitoring and Performance | Monitoring Inventory Objects with Performance Charts.
- The counters are documented in the vSphere API. If this link has changed and no longer works, open the vSphere online documentation and navigate to vSphere API/SDK Documentation | vSphere Management SDK | vSphere Web Services SDK Documentation | VMware vSphere API Reference | Managed Object Types | P. Here, choose Performance Manager from the list under the letter P.
- The esxtop manual provides good information on the counters.
You should also be familiar with the architecture of ESXi, especially how the scheduler works.
vCenter has a different collection interval (sampling period) depending upon the timeline you are looking at. Most of the time you are looking at the real-time statistic (chart), as other timelines do not have enough counters. You will notice right away that most of the counters become unavailable once you choose a timeline. In the real-time chart, each data point has 20 seconds’ worth of data. That is as accurate as it gets in vCenter. Because all other performance management tools (including vRealize Operations) get their data from vCenter, they are not getting anything more granular than this. As mentioned previously, esxtop allows you to sample down to a minimum of 2 seconds.
Speaking of esxtop, you should be aware that not all counters are exposed in vCenter. For example, if you turn on 3D graphics, there is a separate SVGA thread created for that VM. This can consume CPU and it will not show up in vCenter. The Mouse, Keyboard, Screen (MKS) threads, which give you the VM console, also do not show up in vCenter.
The next screenshot shows how you lose most of your counters if you choose a time span other than real time. In the case of CPU, you are basically left with two counters, as Usage and Usage in MHz cover the same thing. You also lose the ability to monitor per core, as the target objects only list the host now and not the individual cores.
Because the real-time timespan only lasts for 1 hour, the performance troubleshooting has to be done at the present moment. If the performance issue cannot be recreated, there is no way to troubleshoot in vCenter. This is where vRealize Operations comes in, as it keeps your data for a much longer period. I was able to perform troubleshooting for a client on a problem that occurred more than a month ago!
vRealize Operations takes data every 5 minutes. This means it is not suitable to troubleshoot performance that does not last for 5 minutes. In fact, if the performance issue only lasts for 5 minutes, you may not get any alert, because the collection may happen exactly in the middle of those 5 minutes. For example, let’s assume the CPU is idle from 08:00:00 to 08:02:30, spikes from 08:02:30 to 08:07:30, and then again is idle from 08:07:30 to 08:10:00. If vRealize Operations is collecting at exactly 08:00, 08:05, and 08:10, you will not see the spike as it is spread over two data points. This means, for vRealize Operations to pick up the spike in its entirety without any idle data, the spike has to last for 10 minutes or more.
In some metrics, the unit is actually 20 seconds. vRealize Operations averages a set of 20-second data points into a single 5-minute data point.
The Rollups column in the preceding screenshot is important. Average means the average of 5 minutes in the case of vRealize Operations. The summation value is actually an average for those counters where accumulation makes more sense. An example is CPU Ready time. It gets accumulated over the sampling period. Over a period of 20 seconds, a VM may accumulate 200 milliseconds of CPU ready time. This translates into 1%, which is why I said it is similar to average, as you lose the peak.
Latest, on the other hand, is different. It takes the last value of the sampling period. For example, in the sampling for 20 seconds, it takes the value between 19 and 20 seconds. This value can be lower or higher than the average of the entire 20-second period.
So what is missing here is the peak of the sampling period. In the 5-minute period, vRealize Operations does not collect low, average, and high from vCenter. It takes average only.
Let’s talk about the Units column now. Some common units are milliseconds, MHz, percent, KBps, and KB. Some counters are shown in MHz, which means you need to know your ESXi physical CPU frequency. This can be difficult due to CPU power saving features, which lower the CPU frequency when the demand is low. In large environments, this can be operationally difficult as you have different ESXi hosts from different generations (and hence, are likely to sport a different GHz). This is also the reason why I state that the cluster is the smallest logical building block. If your cluster has ESXi hosts with different frequencies, these MHz-based counters can be difficult to use, as the VMs get vMotion-ed by DRS.