Tag Archives: metrics

vCenter and vRealize counters – part 2

Compute

The following diagram shows how a VM gets its resource from ESXi. Unlike a physical server, a VM has dynamic resources given to it. It is not static. Contention, Demand and Entitlement are concepts that do not exist in the physical world. It is a pretty complex diagram, so let me walk you through it.

Book - chapter 3 - 02

The tall rectangular area represents a VM. Say this VM is given 8 GB of virtual RAM. The bottom line represents 0 GB and the top line represents 8 GB. The VM is configured with 8 GB RAM. We call this Provisioned. This is what the Guest OS sees, so if it is running Windows, you will see 8 GB RAM when you log into Windows.

Unlike a physical server, you can configure a Limit and a Reservation. This is done outside the Guest OS, so Windows or Linux does not know. You should minimize the use of Limit and Reservation as it makes the operation more complex.

Entitlement means what the VM is entitled to. In this example, the hypervisor entitles the VM to a certain amount of memory. I did not show a solid line and used an italic font style to mark that Entitlement is not a fixed value, but a dynamic value determined by the hypervisor. It varies every minute, determined by Limit, Entitlement, and Reservation of the VM itself and any shared allocation with other VMs running on the same host.

Obviously, a VM can only use what it is entitled to at any given point of time, so the Usage counter does not go higher than the Entitlement counter. The green line shows that Usage ranges from 0 to the Entitlement value.

In a healthy environment, the ESXi host has enough resources to meet the demands of all the VMs on it with sufficient overhead. In this case, you will see that the Entitlement, Usage, and Demand counters will be similar to one another when the VM is highly utilized. This is shown by the green line where Demand stops at Usage, and Usage stops at Entitlement. The numerical value may not be identical because vCenter reports Usage in percentage, and it is an average value of the sample period. vCenter reports Entitlement in MHz and it takes the latest value in the sample period. It reports Demand in MHz and it is an average value of the sample period. This also explains why you may see Usage a bit higher than Entitlement in highly-utilized vCPU. If the VM has low utilization, you will see the Entitlement counter is much higher than Usage.

An environment in which the ESXi host is resource constrained is unhealthy. It cannot give every VM the resources they ask for. The VMs demand more than they are entitled to use, so the Usage and Entitlement counters will be lower than the Demand counter. The Demand counter can go higher than Limit naturally. For example, if a VM is limited to 2 GB of RAM and it wants to use 14 GB, then Demand will exceed Limit. Obviously, Demand cannot exceed Provisioned. This is why the red line stops at Provisioned because that is as high as it can go.

The difference between what the VM demands and what it gets to use is the Contention counter. Contention is a special counter that tracks all these competition for resources. It’s a counter that only exists in the virtual world.

So Contention, simplistically speaking, is Demand – Usage. I said simplistically as that’s not the actual formula. The actual formula does not really matter for all practical purpose as it’s all relative to the expectation that you’ve set to your customers (VM Owner).

Contention happens when what the VM demands is more than it gets to use. So if the Contention is 0, the VM can use everything it demands. This is the ultimate goal, as performance will match the physical world. This Contention value is useful to demonstrate that the infrastructure provides a good service to the application team. If a VM owner comes to see you and says that your shared infrastructure is unable to serve his or her VM well, both of you can check the Contention counter.

The Contention counter should become a part of your Performance SLA or Key Performance Indicator (KPI). It is not sufficient to track utilization alone. When there is contention, it is possible that both your VM and ESXi host have low utilization, and yet your customers (VMs running on that host) perform poorly. This typically happens when the VMs are relatively large compared to the ESXi host. Let me give you a simple example to illustrate this. The ESXi host has two sockets and 20 cores. Hyper-threading is not enabled to keep this example simple. You run just 2 VMs, but each VM has 11 vCPUs. As a result, they will not be able to run concurrently. ESXi VMkernel will schedule them sequentially as there are only 20 physical cores to serve 22 vCPUs. Here, both VMs will experience high contention.

Hold on! You might say, “There is no Contention counter in vSphere and no memory Demand counter either.”

This is where vR Ops comes in. It does not just regurgitate the values in vCenter. It has implicit knowledge of vSphere and a set of derived counters with formulae that leverage that knowledge.

You need to have an understanding of how the vSphere CPU scheduler works.
The following diagram shows the various states that a VM can be in:

Book - chapter 3 - 03

The preceding diagram is taken from The CPU Scheduler in VMware vSphere®
5.1: Performance Study. This is a whitepaper that documents the CPU scheduler with
a good amount of depth for VMware administrators. I highly recommend you read this
paper as it will help you explain to your customers (the application team) how your
shared infrastructure juggles all those VMs at the same time. It will also help you pick
the right counters when you create your custom dashboards in vRealize Operations.

vCenter and vRealize counters – part 1

This blog post is adapted from my book, titled VMware vRealize Operations Performance and Capacity Management. It is published by Packt Publishing

vSphere 6 comes with many counters, many more than what a physical server provides. There are new counters that do not have a physical equivalent, such as memory ballooning, CPU latency, and vSphere replication. In addition, some counters have the same name as their physical world counterpart but behave differently in vSphere. Memory usage is a common one, resulting in confusion among system administrators. For those counters that are similar to their physical world counterparts, vSphere may use different units, such as milliseconds.

As a result, experienced IT administrators find it hard to master vSphere counters by building on their existing knowledge. Instead of trying to relate each counter to its physical equivalent, I find it useful to group them according to their purpose. Virtualization formalizes the relationship between the infrastructure team and application team. The infrastructure team changes from the system builder to service provider. The application team no longer owns the physical infrastructure.

The application team becomes a consumer of a shared service—the virtual platform. Depending on the Service Level Agreement (SLA), the application team can be served as if they have dedicated access to the infrastructure, or they can take a performance hit in exchange for a lower price. For SLAs where performance matters, the VM running in the cluster should not be impacted by any other VMs. The performance must be as good as if it is the only VM running in the ESXi.

Because there are two different counter users, there are two different purposes.

  • The application team (developers and the VM owner) only cares about their own VM.
  • The infrastructure team has to care about both the VM and infrastructure, especially when they need to show that the shared infrastructure is not a bottleneck.

One set of counters is to monitor the VM; the other set is to monitor the infrastructure. The following diagram shows the two different purposes and what we should check for each. By knowing what matters on each layer, we can better manage the virtual environment.

Book - chapter 3 - 01

At the VM layer, we care whether the VM is being served well by the platform. Other VMs are irrelevant from the VM owner’s point of view. A VM owner only wants to make sure his or her VM is not contending for a resource. So the key counter here is contention. Only when we are satisfied that there is no contention can we proceed to check whether the VM is sized correctly or not. Most people check for utilization first because that is what they are used to monitoring in the physical infrastructure. In a virtual environment, we should check for contention first.

At the infrastructure layer, we care whether it serves everyone well. Make sure that there is no contention for resource among all the VMs in the platform. Only when the infrastructure is clear from contention can we troubleshoot a particular VM. If the infrastructure is having a hard time serving majority of the VMs, there is no point troubleshooting a particular VM.

This two-layer concept is also implemented by vSphere in compute and storage architectures. For example, there are two distinct layers of memory in vSphere. There is the individual VM memory provided by the hypervisor and there is the physical memory at the host level. For an individual VM, we care whether the VM is getting enough memory. At the host level, we care whether the host has enough memory for everyone. Because of the difference in goals, we look for a different set of counters.

In the previous diagram, there are 2 numbers shown in a large font, indicating that there are 2 main steps in monitoring. Each step applies to each layer (the VM layer and infrastructure layer), so there are two numbers for each step:

  1. Step 1 is used for performance management. It is useful during troubleshooting or when checking whether we are meeting performance SLAs or not.
  2. Step 2 is used for capacity management. It is useful as part of long-term capacity planning. The time period for step 2 is typically 3 months, as we are checking for overall utilization and not a one off spike.

With the preceding concept in mind, we are ready to dive into more detail. Let’s cover compute, network, and storage in the next post.

Performance and Capacity Management in virtual datacenter

[30 Nov 2015: this was the reason why I wrote the 1st edition. While the 2nd edition has major changes, the high level goals of the 2nd edition remain the same]

Ever wonder what those counters mean in vCenter and vCenter Operations? For example, what does Memory Contention really mean, where does it get its value from, and what values you should expect in a healthy environment?

I was puzzled myself. I’ve been with VMware for 6+ years, and I do not see them being documented. Scott Drummonds did document some of them at communities.vmware.com when he was doing performance. But I have not seen a complete one, especially one written from VMware Admin’s view point (as opposed from a product view point). I want to explain from the view point of the person whose hands are on the keyboard when the VMware platform is not performing as expected.

I’m writing a book to address it. As a field person, I work closely with customers. I see first hand that there are related issue which makes performance and capacity management difficult in virtual world. The gaps explain why customers struggle with virtualisation. Because there are several gaps, I address them “top down“, starting from high level then move to technical matters. There are >100 pages on counters, so this is going to get deep 🙂

Chapter 1 covers the big picture. I aim to correct some architecture and operations mistakes that customers make. It is very rare for customers to master the virtual world, both architecturally and operationally. Part of the reason is the software-Defined Data Center (SDDC) is not yet matured. The biggest reason, however, is the lack of deep understanding of what exactly virtualisation means to IT. As we all know, small leaks sink the ship, so I’m going to expose the details.

Chapter 2 continues the theme, but I dive into Capacity Management. I’m hoping to share how you do capacity management. Once you know what it takes, you can use a tool to automate it.

Chapter 3 dives into the metrics and how CPU, RAM, Disk and Network change once you add a layer called hypervisor. You need to know this, as there are behaviour that completely change.

Chapter 4 – 7 document the counters in vCenter and vC Ops. One chapter for CPU, RAM, Network and Disk. The book goes beyond documenting the metrics. It also applies them so customers can operationalise their virtual platform.

Chapter 8 provides examples of dashboards. The book was practically born in the field, sitting down with customers in creating all those dashboards and performing troubleshooting.

The book is with Packt Publishing. There are many reasons why I chose to work with them. The plan is to have the book out at the same time with the GA of vRealize Operations 6.