I couldn’t find a document that explains VM CPU counters in depth. Having read many sources and consulted folks, here is what I have so far. Corrections are most welcome.
At the most basic level, a VM CPU is either being utilized or not being utilized by the Guest OS.
- When it’s being utilized, the hypervisor must schedule it.
- If VMkernel has no physical CPU to run it, then the VM is placed into Ready state. The VM is ready, but the hypervisor is not. The Ready counter is increased to account for this.
- If VMkernel has all the required physical CPUs to run it, then the VM gets to run. It is placed into Run state. The Run counter is increased to track this.
- If VMkernel has only some of the CPUs, then it will run the VM partially. Eventually, the vCPUs become unbalanced. A VM with >1 CPU may have one of its CPUs stopped if it advances too far ahead of the others. This is why it’s important to right size. The Co-Stop counter is increased to account for this.
- When it’s not being utilized, there are 2 possible reasons:
- The CPU is truly idle. It’s not doing work. The Idle Wait counter accounts for it.
- The CPU is waiting for IO.
- CPU, being faster than RAM, may have to wait for IO to be brought in. The IO Wait counter accounts for this.
With the above explanation, the famous state diagram below makes more sense. I took the diagram from the technical whitepaper “The CPU Scheduler in VMware vSphere® 5.1”. It is mandatory reading for all vSphere professionals, as it is foundational knowledge.
A VM CPU is in one of these 4 states: Run, Ready, Co-Stop and Wait.
- Run means it’s consuming CPU cycles.
- Ready means it is ready to run, but ESXi has no physical cores to run it.
- Co-Stop only applies to vSMP VMs. A VM with >1 CPU may have one of its CPUs stopped if it advances too far ahead of the others. This is why it’s important to right size.
- Wait means the CPU is idle. It can either be waiting for IO, or it really is idle.
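To make the four states concrete, here is a toy Python sketch of the decision described above. It is only a mental model of the counters, not how VMkernel is implemented; the function name and its three boolean inputs are my own assumptions for illustration.

```python
from enum import Enum


class VcpuState(Enum):
    RUN = "Run"
    READY = "Ready"
    CO_STOP = "Co-Stop"
    WAIT = "Wait"


def classify_vcpu(wants_cpu: bool, has_physical_cpu: bool,
                  co_stopped: bool = False) -> VcpuState:
    """Toy classification of one vCPU at one instant (hypothetical helper)."""
    if not wants_cpu:
        # Guest is not using the vCPU: truly idle or waiting for IO.
        return VcpuState.WAIT
    if co_stopped:
        # vSMP only: this vCPU ran too far ahead of its siblings.
        return VcpuState.CO_STOP
    if not has_physical_cpu:
        # The VM is ready, but the hypervisor is not.
        return VcpuState.READY
    return VcpuState.RUN


print(classify_vcpu(wants_cpu=True, has_physical_cpu=False).value)
```

Whichever state is returned is the counter that gets increased for that slice of time.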
If the above were all we needed to know, monitoring VMware vSphere would have been easy. In reality, the following factors must be considered:
- Hyper Threading
- System time
- Power Management
Hyper Threading (HT)
Hyper Threading (HT) is known to deliver a performance boost that is lower than what 2 physical cores deliver. Generally speaking, HT provides a 1.25x performance boost in vSphere. That means if both threads are running, each thread only gets 62.5% of the shared physical core (1.25 / 2). This is a significant drop from the perspective of each VM: it is better for the VM if the second thread is not being used, because the VM can then get 100% performance instead of 62.5%. Because the drop is significant, enabling the latency sensitivity setting results in a full core reservation, and the CPU scheduler will not run any task on the second hyper-thread.
The following diagram shows 2 VMs. Each runs on a thread of a shared core. There are 4 possible combinations of Run and Idle.
Each VM runs for half the time. The CPU Run counter = 50%, because it’s not aware of HT. But is that really what each VM gets, since they have to fight for the same core?
The answer is obviously no. Hence the need for another counter that accounts for this. The diagram below shows what VM A actually gets.
The CPU Used counter takes this into account. In the first part, VM A only gets 62.5% as VM B is also running. In the second part, VM A gets the full 100%. The total for the entire duration is 40.625%. CPU Used will report this number, while CPU Run will report 50%.
If both threads are running all the time, guess what CPU Used and CPU Run will report?
62.5% and 100% respectively.
Big difference. The counter matters.
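The arithmetic behind 40.625% and 62.5% can be reproduced with a few lines of Python. This is a sketch of the HT part only, assuming the 1.25x rule of thumb above and equal-length time slots; the function name and the slot representation are mine, not anything exposed by vSphere.

```python
HT_BOOST = 1.25                      # rule of thumb used in this post
SHARE_WHEN_BOTH_RUN = HT_BOOST / 2   # 0.625 of the core per thread


def run_and_used(slots):
    """Return (run, used) for VM A as fractions of the whole period.

    slots: list of (a_running, b_running) booleans, one per equal time slot.
    """
    total = len(slots)
    # Run is not HT-aware: every slot where A runs counts in full.
    run = sum(1 for a, _ in slots if a) / total
    # Used discounts the slots where the sibling thread is also busy.
    used = sum(SHARE_WHEN_BOTH_RUN if b else 1.0
               for a, b in slots if a) / total
    return run, used


# The 4 combinations from the diagram: both run, A only, B only, both idle.
print(run_and_used([(True, True), (True, False),
                    (False, True), (False, False)]))   # (0.5, 0.40625)

# Both threads busy all the time.
print(run_and_used([(True, True)]))                    # (1.0, 0.625)
```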
The 2nd factor that impacts CPU accounting is CPU speed. The higher the frequency (GHz), the faster the CPU runs. All else being equal, a CPU that runs at 1 GHz is 50% slower than when it runs at 2 GHz. On the other hand, Turbo Mode can kick in, and the CPU clock speed is then higher than the stated frequency. Turbo Boost normally happens together with power saving on the same CPU socket. Some cores are put into sleep mode, and the power saved is used to boost other cores. The overall power envelope within the socket remains the same.
In addition, it takes time to wake up from a deep C-State. For details on P-State and C-State, see Valentin Bondzio and Mark Achtemichuk, VMworld 2017, Extreme Performance Series.
Because of the above, power management must be accounted for. Just like in the Hyper Threading case, CPU Run is not aware of this. CPU Used takes CPU frequency scaling into account.
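As a rough illustration, the sketch below scales Run by the ratio of actual to nominal frequency. The linear scaling and the function name are my assumptions; I have not seen the exact formula VMkernel uses, so treat it as a way to reason about the direction of the effect, not its precise size.

```python
def frequency_adjusted_run(run_pct: float, actual_ghz: float,
                           nominal_ghz: float) -> float:
    """Scale Run by actual/nominal clock speed (assumed linear model)."""
    return run_pct * (actual_ghz / nominal_ghz)


# A 2 GHz core clocked down to 1 GHz: 100% Run delivers half the work.
print(frequency_adjusted_run(100.0, 1.0, 2.0))   # 50.0

# The same core boosted to 2.5 GHz by Turbo: more than nominal.
print(frequency_adjusted_run(100.0, 2.5, 2.0))   # 125.0
```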
Does it mean we should always set power management to maximum?
- No. ESXi uses power management to save power without impacting performance. A VM running at a lower clock speed does not necessarily get less done. You only set it to high performance for latency-sensitive applications, where sub-second performance matters. VDI, VoIP & Telco NFV are some examples that require low latency.
Stolen or Overlap
When ESXi is running a VM, this activity might get interrupted by IO processing (e.g. incoming network packets). If there are no other available cores in ESXi, VMkernel has to schedule the work on a busy core. If that core happens to be running a VM, the work on that VM is interrupted. The Overlap counter accounts for this, although some VMware documentation may refer to Overlap as Stolen. Linux Guest OS tracks this as Stolen time.
A high Overlap indicates the ESXi host is doing heavy IO (Storage or Network). Look at your NSX Edge clusters, and you will see the host has a relatively higher Overlap value.
A VM may execute a privileged instruction, or issue IO. These 2 activities are performed by the hypervisor, on behalf of the VM. vSphere tracks this in a separate counter called System. Since this work is not performed by any of the VM CPUs, it is charged to the VM’s CPU 0. You may see higher Used on CPU 0 than on the others, although CPU Run is balanced across all the vCPUs. So this is not a problem for CPU scheduling. It’s just the way VMkernel does the CPU accounting.
Note that this blog refers to CPU accounting, not Storage accounting. For example, vSphere 6.5 no longer charges the Storage vMotion to the VM.
The relationship among the counters is shown below. To make it simple, each VM is a 1 CPU VM.
CPU Used can differ from CPU Run as it takes into account Hyper Threading and frequency changes, includes System time, and excludes Overlap. Because of this, CPU Used is a better reflection of the actual usage than CPU Run.
The counters above are for each CPU in the VM. A VM with 8 CPUs will have 8 x 100% = 800%. Other than the CPU, a VM world has other ancillary processes (e.g. MKS world, VMX world) in the ESXi kernel, but they are typically negligible.
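Put as code, the relationship reads roughly as follows. This is a sketch, not the VMkernel accounting itself: the base form (Run plus System minus Overlap) follows the description above, while folding HT and frequency into a single `efficiency` multiplier is my simplifying assumption.

```python
def approx_cpu_used(run: float, system: float, overlap: float,
                    efficiency: float = 1.0) -> float:
    """Approximate CPU Used from its components (same unit in and out).

    efficiency is a hypothetical combined HT/frequency factor,
    e.g. 0.625 when the sibling hyper-thread is always busy.
    """
    return (run + system - overlap) * efficiency


# Illustrative numbers only: 50% Run, 5% System, 3% Overlap.
print(approx_cpu_used(50.0, 5.0, 3.0))                    # 52.0

# 100% Run with the sibling thread always busy.
print(approx_cpu_used(100.0, 0.0, 0.0, efficiency=0.625)) # 62.5
```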
What counters are missing from the above diagram? There are 2 key counters, which are critical in Performance and Capacity. Can you name them?
You’re right. They are CPU Demand and CPU Contention.
Let’s talk about CPU Demand first. The diagram below now has CPU Demand. I’ve also added CPU Wait for completeness.
CPU Demand captures the full demand of a VM. It includes CPU Ready and CPU Co-Stop. This is the counter to look at when you want to see the full demand of a VM.
vCenter VM CPU Usage
Before we cover CPU Contention, which is a performance counter, there is 1 more utilization counter we need to check. Can you guess what it is?
Hint: it does not exist in ESXi. It only exists in vCenter.
This counter takes 2 forms: Usage in MHz and Usage in %.
CPU Usage (%) is a rounding of CPU Usage (MHz), not the other way around. The calculation is done first in frequency units (MHz), then converted into %.
Since vCenter is only reporting software, it has to base its counters on ESXi. Mapping to Run or Demand does not seem logical. Mapping to Used makes the most sense. I plotted the 2 counters. They are not identical.
The reason is vCenter CPU Usage includes VMX world. Read this good article to understand it better. VMX world exists for each vCPU.
Guest OS CPU Utilization
Now that you know the hypervisor VM CPU counters, can you suggest how they impact the Guest OS? I consulted Valentin Bondzio, someone I consider the #1 authority on this topic. He said “What happens to you when time is frozen?”
That’s a great way to put it. As far as Guest OS is concerned, time is frozen when it is not scheduled.
- Guest OS experiences frozen time when the hypervisor deschedules it. Time jumps when it’s scheduled again.
- Guest OS CPU Usage isn’t aware of stolen time. For this counter to be aware, its code has to be modified. If you know that Microsoft or Linux has modified this counter, let me know in which version they made the change.
- Guest OS Stolen Time accounts for it. But that’s in Linux, not Windows.
The table below shows the impact of various scheduling events.
The diagram below shows the lack of visibility. Notice most of them are below the VM. The only counter that hypervisor cannot see is the Guest OS Run Queue, which is not counted by the Guest as utilization as it’s still in the queue.
I hope the above helps explain why you should not use Guest OS CPU Utilization counters.
To measure the Guest OS usage, use this formula:
Guest OS Usage = VM CPU Run - Overlap + Ready + CoStop.
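Here is the same formula as a small Python helper, in case you want to compute it outside of a super metric. It assumes all four inputs are already expressed in the same unit (for example % of one vCPU over the same interval); the function name and the sample numbers are made up for illustration.

```python
def guest_os_usage(run: float, overlap: float,
                   ready: float, co_stop: float) -> float:
    """Guest OS Usage = Run - Overlap + Ready + Co-Stop (same unit for all)."""
    return run - overlap + ready + co_stop


# Hypothetical vCPU: 60% Run, 2% Overlap, 5% Ready, 1% Co-Stop.
print(guest_os_usage(run=60.0, overlap=2.0, ready=5.0, co_stop=1.0))  # 64.0
```

Ready and Co-Stop are added back because, from inside the Guest OS, the vCPU was busy during that time even though the hypervisor was not running it.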
Notice I do not use VM CPU Demand or VM CPU Usage counters. Can you guess why?
The problem is both counters are contaminated with components that can give inaccurate readings. The components are:
- CPU System. This workload is not coming from the VM. It’s coming from ESXi, executing on behalf of the VM. This does not run inside the Guest OS threads.
- Frequency scaling and HT. They are not relevant in the context of VM CPU utilization.
- A VM is consuming a CPU at 100%, regardless of whether the 2nd HT runs or not. The fact that the 2nd HT runs at 100% does not mean the VM utilization is 62.5%. The guest is actually running at 100%.
- The same applies to changes in frequency. It makes the VM faster or slower. We need to distinguish utilisation from capacity and performance use cases.
- VMX. This should not be charged.
How does the super metric differ from CPU Usage? We can actually plot it. I took a sample of 480 VMs in my lab. I used the View List widget to list VM Name, VM CPU Usage and VM Guest OS Usage. I exported the list into a spreadsheet, then used a simple formula to compare the 2 values.
The result is interesting.
There are situations where CPU Usage is over-reporting. Take example no 2 below. It’s reporting 85% when it’s only 72%. I’m not too worried about this, as this is simply a classic over-size.
There are situations where CPU Usage is under-reporting. Take the last example. It’s reporting 55%, but the reality is it is 93%. You would have thought the VM is fine, when the VM actually needs more CPU. In this case, you need to ensure that Ready and Co-Stop aren’t a factor.
I plotted all 480 values on a line chart, so I get the big picture. I notice that most of the time, it’s correct. That’s good news. The bulk of the data is <5% difference. The black arrow indicates CPU Usage is over-reporting, while the red arrow indicates it is under-reporting.
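If you prefer a script to a spreadsheet, the comparison can be sketched like this. The 85/72 and 55/93 pairs are the two examples above; the VM names, the third row and the 5-point tolerance are placeholders I chose to mirror the "<5% difference" observation.

```python
def classify(cpu_usage_pct: float, guest_usage_pct: float,
             tolerance: float = 5.0) -> str:
    """Label how vCenter CPU Usage compares with Guest OS Usage."""
    diff = cpu_usage_pct - guest_usage_pct
    if abs(diff) < tolerance:
        return "within tolerance"
    return "over-reporting" if diff > 0 else "under-reporting"


# (VM name, VM CPU Usage %, Guest OS Usage %). Names are hypothetical.
samples = [
    ("vm-a", 85.0, 72.0),
    ("vm-b", 55.0, 93.0),
    ("vm-c", 40.0, 38.0),
]

for name, usage, guest in samples:
    print(f"{name}: {usage - guest:+.1f} -> {classify(usage, guest)}")
```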
You can do the profiling in your environment too, and discover interesting behaviour in production 🙂