vCenter and vRealize counters – part 1

This blog post is adapted from my book, VMware vRealize Operations Performance and Capacity Management, published by Packt Publishing.

vSphere 6 comes with many counters, many more than what a physical server provides. There are new counters that do not have a physical equivalent, such as memory ballooning, CPU latency, and vSphere replication. In addition, some counters have the same name as their physical world counterpart but behave differently in vSphere. Memory usage is a common one, resulting in confusion among system administrators. For those counters that are similar to their physical world counterparts, vSphere may use different units, such as milliseconds.

As a result, experienced IT administrators find it hard to master vSphere counters by building on their existing knowledge. Instead of trying to relate each counter to its physical equivalent, I find it useful to group them according to their purpose. Virtualization formalizes the relationship between the infrastructure team and application team. The infrastructure team changes from the system builder to service provider. The application team no longer owns the physical infrastructure.

The application team becomes a consumer of a shared service: the virtual platform. Depending on the Service Level Agreement (SLA), the application team can be served as if they have dedicated access to the infrastructure, or they can take a performance hit in exchange for a lower price. For SLAs where performance matters, the VM running in the cluster should not be impacted by any other VMs. The performance must be as good as if it were the only VM running on the ESXi host.

Because there are two different counter users, there are two different purposes.

  • The application team (developers and the VM owner) only cares about their own VM.
  • The infrastructure team has to care about both the VM and infrastructure, especially when they need to show that the shared infrastructure is not a bottleneck.

One set of counters is to monitor the VM; the other set is to monitor the infrastructure. The following diagram shows the two different purposes and what we should check for each. By knowing what matters on each layer, we can better manage the virtual environment.

Book - chapter 3 - 01

At the VM layer, we care whether the VM is being served well by the platform. Other VMs are irrelevant from the VM owner’s point of view. A VM owner only wants to make sure his or her VM is not contending for a resource. So the key counter here is contention. Only when we are satisfied that there is no contention can we proceed to check whether the VM is sized correctly or not. Most people check for utilization first because that is what they are used to monitoring in the physical infrastructure. In a virtual environment, we should check for contention first.

At the infrastructure layer, we care whether it serves everyone well. Make sure that there is no contention for resources among all the VMs on the platform. Only when the infrastructure is clear of contention can we troubleshoot a particular VM. If the infrastructure is having a hard time serving the majority of the VMs, there is no point troubleshooting a particular VM.

This two-layer concept is also implemented by vSphere in compute and storage architectures. For example, there are two distinct layers of memory in vSphere. There is the individual VM memory provided by the hypervisor and there is the physical memory at the host level. For an individual VM, we care whether the VM is getting enough memory. At the host level, we care whether the host has enough memory for everyone. Because of the difference in goals, we look for a different set of counters.

In the previous diagram, there are 2 numbers shown in a large font, indicating that there are 2 main steps in monitoring. Each step applies to each layer (the VM layer and infrastructure layer), so there are two numbers for each step:

  1. Step 1 is used for performance management. It is useful during troubleshooting or when checking whether we are meeting performance SLAs or not.
  2. Step 2 is used for capacity management. It is useful as part of long-term capacity planning. The time period for step 2 is typically 3 months, as we are checking for overall utilization and not a one-off spike.
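The order of checks described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not anything vSphere or vRealize Operations provides: the function name `next_check` and its two boolean inputs are hypothetical, and how you decide that a layer is "contended" is left to your own SLA.

```python
def next_check(infra_contended: bool, vm_contended: bool) -> str:
    """Suggest what to look at next, following the contention-first order.

    Both flags are hypothetical inputs: you decide, based on your own SLA,
    whether the infrastructure or the VM counts as contended.
    """
    if infra_contended:
        # The shared platform is struggling, so a single VM is not the place to start.
        return "fix infrastructure contention"
    if vm_contended:
        # The platform is clear, so now a particular VM is worth troubleshooting.
        return "troubleshoot VM contention"
    # No contention on either layer: move on to step 2 (utilization over the long term).
    return "review utilization for sizing and capacity"


if __name__ == "__main__":
    print(next_check(infra_contended=True, vm_contended=True))
    print(next_check(infra_contended=False, vm_contended=True))
    print(next_check(infra_contended=False, vm_contended=False))
```

The point of the ordering is that utilization only gets reviewed once both layers are clear of contention.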

With the preceding concept in mind, we are ready to dive into more detail. Let’s cover compute, network, and storage in the next post.

vRealize Operations 6: Top-N widget

The Top-N widget is used extensively in vRealize Operations 6, including in the default dashboards. For example, the Host Overview dashboard uses it, as shown below:

Top N is average 0

Notice the small “24h” label in the above screenshot. It stands for 24 hours. You can adjust the time period, albeit manually. The value you are seeing in the above Top-N is the average over the entire 24-hour period, so a peak during the period may get flattened. It is also the last 24 hours, not yesterday or today, so checking at 9 am or 6 pm will give you a different result. If you check at 9 am, you’re looking at 9 am yesterday until 9 am today. You are not looking at yesterday (0000 – 2400).
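To make the difference between "the last 24 hours" and "yesterday" concrete, here is a minimal Python sketch. It does not call any vRealize Operations API; the helper names and the example check time are hypothetical and only illustrate how the two windows differ.

```python
from datetime import datetime, timedelta


def last_24_hours(now: datetime) -> tuple[datetime, datetime]:
    """Trailing window: from 24 hours before `now` up to `now`."""
    return now - timedelta(hours=24), now


def calendar_yesterday(now: datetime) -> tuple[datetime, datetime]:
    """Calendar window: from midnight yesterday to midnight today."""
    midnight_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today - timedelta(days=1), midnight_today


if __name__ == "__main__":
    # Hypothetical check time: 9 am on an arbitrary day.
    check_time = datetime(2014, 12, 30, 9, 0)
    print("Last 24 hours:", last_24_hours(check_time))
    print("Yesterday:    ", calendar_yesterday(check_time))
```

Checking at 9 am and again at 6 pm moves the trailing window by nine hours, while the calendar window for "yesterday" stays the same, which is why the Top-N value changes depending on when you look.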

Because the Top-N is an average, you may want to know a bit more detail. This is where the Sparkline widget comes in. Clicking on any of the objects in the Top-N will show the corresponding object in the Sparkline widget.

You can certainly change the value from the last 24 hours to any time period that fits your business needs. Changing the default dashboards does not impact the way vRealize Operations works (e.g. its dynamic threshold calculation). Dashboards are just a way to present information.

You can also create your own dashboards in your own style. For example, I do not use Sparkline as I like to have greater detail; I use Line Chart instead. I also put the Line Chart first and the Top-N second, so it’s the other way around. This is because I prefer to see details. I use Top-N when I need to zoom into a specific time period, to reveal the objects giving me the value in the Line Chart. I use this Line Chart + Top-N combo a lot when working with customers. You will find many examples in my book.

If you are curious whether the Top-N value is really the average of the selected time period, you can easily test it. I created a manual group with only 2 members. They are the 2 VMs shown below.

Top N is average

There are 2 Line Chart widgets in the above screenshot. Each of them has a corresponding Top-N widget below it. The first line chart shows a longer time horizon. I chose 7 days. From here I could see that there are some spikes. The Top-N, however, does not reveal that. This is because it is an average. It has flattened the data.

The second line chart shows a much shorter time. I have zoomed into 29 Dec around 1 pm. The line chart shows that BCDR-Prod-SRM-Server had a spike around 50% and then dropped to 4%.

Since I know the time period, I’m going to configure my Top-N to zoom into that specific time. You can see below that I’ve configured it to 12:55 pm – 1:05 pm. So I’m taking only 2 values.

Top N is average - 2

Since the first value is 50.17% and the second value is 4%, we would expect the Top-N to show their average, (50.17 + 4) / 2 = 27.085%. And you got it right, the Top-N shows 27.085%.


Top N is average - 3
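The arithmetic in this test is easy to reproduce outside the product. The sketch below assumes only what the test above shows, namely that the Top-N value is a plain average of the samples in the selected period. The function name `top_n_value` and the longer sample series are hypothetical; only the two values 50.17 and 4 come from the screenshots.

```python
from statistics import mean


def top_n_value(samples: list[float]) -> float:
    """Average of the samples in the selected period (assumed Top-N behaviour)."""
    return mean(samples)


if __name__ == "__main__":
    # The two samples from the 12:55 pm - 1:05 pm window in the test above.
    zoomed = [50.17, 4.0]
    print("Zoomed-in average:", round(top_n_value(zoomed), 3))

    # Hypothetical longer period: mostly quiet, with the same single spike.
    longer = [4.0] * 11 + [50.17]
    print("Longer-period average:", round(top_n_value(longer), 3))
    print("Longer-period peak:   ", max(longer))
```

The longer the period, the more a single spike is diluted by the quiet samples around it, which is why the 7-day Top-N hides what the line chart makes obvious.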