Author Archives: Iwan Rahabok

About Iwan Rahabok

A father of 2 little girls, my pride and joy. The youngest one quickly said "I'm the joy!"

VM CPU Counters in vSphere

I couldn’t find a document that explains VM CPU counters in depth. Having read many sources and consulted folks, here is what I have so far. Corrections are most welcome.

The Basics

At the most basic level, a VM CPU is either being utilized or not being utilized by the Guest OS.

  • When it’s being utilized, the hypervisor must schedule it.
    • If VMkernel has no physical CPU to run it, then the VM is placed into Ready state. The VM is ready, but the hypervisor is not. The Ready counter is increased to account for this.
    • If VMkernel has all the required physical CPUs to run it, then the VM gets to run. It is placed into Run state. The Run counter is increased to track this.
    • If VMkernel has only some of the CPUs, then it will run the VM partially. Eventually, the VM’s CPUs become unbalanced, as some advance further than others. A VM with >1 CPU may have a CPU stopped if it advances too far ahead. This is why it’s important to right-size. The Co-Stop counter is increased to account for this.
  • When it’s not being utilized, there are 2 possible reasons:
    • The CPU is truly idle. It’s not doing work. The Idle Wait counter accounts for it.
    • The CPU is waiting for IO.
      • CPU, being faster than RAM, may have to wait for IO to be brought in. The IO Wait counter accounts for this.

With the above explanation, the famous state diagram below makes more sense. I took the diagram from the technical whitepaper “The CPU Scheduler in VMware vSphere® 5.1”. It is mandatory reading for all vSphere professionals, as it is foundational knowledge.

A VM CPU is in one of these 4 states: Run, Ready, Co-Stop and Wait.

  1. Run means it’s consuming CPU cycles.
  2. Ready means it is ready to run, but ESXi has no physical cores to run it.
  3. Co-Stop only applies to vSMP VMs. A VM with >1 CPU may have a CPU stopped if it advances too far ahead. This is why it’s important to right-size.
  4. Wait means the CPU is idle. It can either be waiting for IO, or it really is idle.
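
Since a VM CPU is always in exactly one of the 4 states, the time spent in them should add up to the length of the sampling interval. Here is a minimal sketch of that bookkeeping. The function name, the sample values and the 20,000 ms interval (the length of a vCenter real-time sample) are my assumptions for illustration, not something taken from the whitepaper.

```python
# Hypothetical helper: converts per-state time (ms) for one VM CPU into percentages.
INTERVAL_MS = 20_000  # assumed sampling interval (vCenter real-time charts use 20 s)

def state_percentages(run_ms, ready_ms, costop_ms, wait_ms, interval_ms=INTERVAL_MS):
    """Return the share of the interval spent in each of the 4 states."""
    states = {"run": run_ms, "ready": ready_ms, "costop": costop_ms, "wait": wait_ms}
    total = sum(states.values())
    # A VM CPU is always in exactly one state, so the four values should
    # account for the whole interval (allow a small measurement tolerance).
    if abs(total - interval_ms) > 0.01 * interval_ms:
        raise ValueError(f"states add up to {total} ms, expected about {interval_ms} ms")
    return {name: 100.0 * ms / interval_ms for name, ms in states.items()}

# Made-up sample: mostly running, a little contention, the rest waiting.
sample = state_percentages(run_ms=12_000, ready_ms=1_000, costop_ms=200, wait_ms=6_800)
print(sample)
```

If the four values do not add up to roughly the interval, the inputs were probably taken from different samples or from different CPUs of the VM.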

Advanced Topic

If the above were all we needed to know, monitoring VMware vSphere would have been easy. In reality, the following factors must be considered:

  1. Hyper Threading
  2. Interrupt
  3. System time
  4. Power Management

Hyper Threading (HT)

Hyper Threading (HT) is known to deliver a performance boost that is lower than what 2 physical cores deliver. Generally speaking, HT provides a 1.25x performance boost in vSphere. That means if both threads are running, each thread only gets 62.5% of the shared physical core (1.25 / 2 = 0.625). This is a significant drop from the perspective of each VM. The VM is better off if the second thread is not being used, because it then gets 100% performance instead of 62.5%. Because the drop is significant, enabling the latency sensitivity setting results in a full core reservation. The CPU scheduler will not run any task on the second hyper-thread.

The following diagram shows 2 VMs. Each runs on a thread of a shared core. There are 4 possible combinations of Run and Idle.

Each VM runs for half the time. The CPU Run counter = 50%, because it’s not aware of HT. But is that really what each VM gets, since they have to fight for the same core?

The answer is obviously no. Hence the need for another counter that accounts for this. The diagram below shows what VM A actually gets.

The CPU Used counter takes this into account. In the first part, VM A only gets 62.5% as VM B is also running. In the second part, VM A gets the full 100%. The total for the entire duration is 40.625%. CPU Used will report this number, while CPU Run will report 50%.

If both threads are running all the time, guess what CPU Used and CPU Run will report?

62.5% and 100% respectively.

Big difference. The counter matters.
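
The arithmetic behind both cases can be sketched in a few lines. This is only a model of the example above, not how VMkernel implements it. The 1.25x figure comes from the text, while the function name and the assumption that the 4 combinations each last the same amount of time are mine.

```python
HT_BOOST = 1.25                      # combined throughput of 2 busy threads vs 1 core (from the text)
SHARE_WHEN_CONTENDED = HT_BOOST / 2  # 0.625 of a core for each thread

def run_and_used(slots):
    """slots: list of (a_running, b_running) pairs for equal-length time slices.

    Returns (run_pct, used_pct) for VM A.
    """
    run = 0.0
    used = 0.0
    for a_running, b_running in slots:
        if not a_running:
            continue
        run += 1.0  # Run is not aware of HT
        used += SHARE_WHEN_CONTENDED if b_running else 1.0  # Used is discounted when B also runs
    return 100.0 * run / len(slots), 100.0 * used / len(slots)

# The 4 combinations from the diagram, assumed to last equally long.
four_combinations = [(True, True), (True, False), (False, True), (False, False)]
print(run_and_used(four_combinations))  # (50.0, 40.625)

# Both threads busy all the time.
print(run_and_used([(True, True)]))     # (100.0, 62.5)
```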

Power Management

The 2nd factor that impacts CPU accounting is CPU speed. The higher the frequency (GHz), the faster the CPU runs. All else being equal, a CPU that runs at 1 GHz is 50% slower than when it runs at 2 GHz. On the other hand, Turbo Mode can kick in and raise the CPU clock speed above the stated frequency. Turbo Boost normally happens together with power saving on the same CPU socket. Some cores are put into sleep mode, and the power saved is used to boost the other cores. The overall power envelope within the socket remains the same.

In addition, it takes time to wake up from a deep C-State. For details on P-State and C-State, see Valentin Bondzio and Mark Achtemichuk, VMworld 2017, Extreme Performance Series.

Because of the above, power management must be accounted for. Just like in the Hyper-Threading case, CPU Run is not aware of this. CPU Used takes CPU frequency scaling into account.
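
To make the frequency effect concrete, here is a rough sketch that scales Run by the ratio of actual to nominal clock speed. The function name and the numbers are hypothetical, and the simple linear scaling is my assumption. The real accounting in VMkernel is more detailed.

```python
def frequency_adjusted_pct(run_pct, actual_mhz, nominal_mhz):
    """Scale a Run percentage by how fast the core actually ran.

    Linear scaling is an assumption made for illustration only.
    """
    if nominal_mhz <= 0:
        raise ValueError("nominal_mhz must be positive")
    return run_pct * actual_mhz / nominal_mhz

# Core clocked down to 1 GHz on a 2 GHz CPU: 80% Run delivers only 40% of nominal work.
print(frequency_adjusted_pct(80.0, actual_mhz=1000, nominal_mhz=2000))

# Turbo at 2.4 GHz on a 2 GHz CPU: 50% Run delivers 60% of nominal work.
print(frequency_adjusted_pct(50.0, actual_mhz=2400, nominal_mhz=2000))
```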

Does it mean we should always set power management to maximum?

  • No. ESXi uses power management to save power without impacting performance. Running a VM at a lower clock speed does not mean it gets less done. You only set it to high performance for latency-sensitive applications, where sub-second performance matters. VDI, VoIP & Telco NFV are some examples that require low latency.

Stolen or Overlap

When ESXi is running a VM, this activity might get interrupted by IO processing (e.g. incoming network packets). If there are no other available cores in ESXi, VMkernel has to schedule the work on a busy core. If that core happens to be running a VM, the work on that VM is interrupted. The Overlap counter accounts for this, although some VMware documentation may refer to Overlap as Stolen. A Linux Guest OS tracks this as Stolen time.

A high overlap indicates the ESXi host is doing heavy IO (Storage or Network). Look at your NSX Edge clusters, and you will see the host has relatively higher Overlap value.

System Time

A VM may execute a privileged instruction, or issue IO. These 2 activities are performed by the hypervisor, on behalf of the VM. vSphere tracks this in a separate counter called System. Since this work is not performed by any of the VM CPUs, it is charged to the VM’s CPU 0. As a result, you may see higher Used on CPU 0 than on the others, although CPU Run is balanced across all the vCPUs. So this is not a problem for CPU scheduling. It’s just the way VMkernel does the CPU accounting.

Note that this blog refers to CPU accounting, not Storage accounting. For example, vSphere 6.5 no longer charges the Storage vMotion to the VM.

Conclusion

The relationship among the counters is shown below. To keep it simple, each VM is a 1-CPU VM.

CPU Used can differ from CPU Run as it takes into account Hyper Threading and frequency changes, includes System time, and excludes Overlap. Because of this, CPU Used is a better reflection of the actual usage than CPU Run.
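
Put together, the relationship can be written as a simplified model. This is my reading of the paragraph above, not the VMkernel formula. The function name is hypothetical, the single efficiency factor that lumps HT and frequency together is an assumption, and so is the choice to subtract Overlap before scaling and to add System unscaled.

```python
def estimate_used_ms(run_ms, overlap_ms, system_ms, efficiency=1.0):
    """Simplified model of CPU Used for one VM CPU, in ms.

    efficiency is a hypothetical factor combining the Hyper-Threading
    penalty and frequency scaling (1.0 = a full core at nominal speed).
    """
    guest_work_ms = run_ms - overlap_ms            # exclude time taken away by VMkernel work
    return guest_work_ms * efficiency + system_ms  # include work done on the VM's behalf

# Made-up values for one sample.
used = estimate_used_ms(run_ms=10_000, overlap_ms=200, system_ms=300, efficiency=0.75)
print(used)
```

With efficiency at 1.0 and no Overlap or System time, the model collapses to Used = Run, which matches the basic case described earlier.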

The counters above are for each CPU in the VM. A VM with 8 CPUs will have 8 x 100% = 800%. Other than the CPU, a VM world has other ancillary processes (e.g. MKS world, VMX world) in the ESXi kernel, but they are typically negligible.

What counters are missing from the above diagram? There are 2 key counters, which are critical in Performance and Capacity. Can you name them?

You’re right. They are CPU Demand and CPU Contention.

Let’s talk about CPU Demand first. The diagram below now has CPU Demand. I’ve also added CPU Wait for completeness.

CPU Demand captures the full demand of a VM, including CPU Ready and CPU Co-Stop. This is the counter to use when you want to see the full demand of a VM.

vCenter VM CPU Usage

Before we cover CPU Contention, which is a performance counter, there is 1 more utilization counter we need to check. Can you guess what it is?

Hint: it does not exist in ESXi. It only exists in vCenter.

CPU Usage.

The counter takes 2 forms: Usage in MHz and Usage in %.

CPU Usage (%) is a rounding of CPU Usage (MHz), not the other way around. The calculation is done first in MHz, then converted into %.
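
A sketch of that conversion is below. The denominator (number of VM CPUs multiplied by the nominal core frequency) and the rounding to 2 decimals are my assumptions, and the function name is hypothetical.

```python
def usage_percent(usage_mhz, num_vcpus, core_mhz):
    """Convert CPU Usage (MHz) into CPU Usage (%).

    Assumes the VM's capacity is num_vcpus x nominal core frequency.
    """
    capacity_mhz = num_vcpus * core_mhz
    if capacity_mhz <= 0:
        raise ValueError("capacity must be positive")
    return round(100.0 * usage_mhz / capacity_mhz, 2)

# A 4-CPU VM on 2.4 GHz cores that is using 3,000 MHz.
print(usage_percent(3000, num_vcpus=4, core_mhz=2400))
```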

Since vCenter is only reporting software, it has to base its counters on ESXi counters. Mapping Usage to Run or Demand does not seem logical. Mapping it to Used makes the most sense. I plotted the 2 counters. They are not identical. I will study them more. Luckily, they are pretty close.

Why is Usage higher than Demand, given that Demand = Used + Latency? Based on this, Usage is even higher than Used. Something that can go “down” in Used is not considered in Usage. The suspects are HT and Power Management. It seems like Usage (MHz) is based on a constant CPU clock speed. If that’s the case, then it’s not derived from Used. It’s derived from Run.

Guest OS CPU Utilization

Now that you know the hypervisor VM CPU counters, can you suggest how they impact the Guest OS? I consulted Valentin Bondzio, someone I consider the #1 authority on this topic. He said “What happens to you when time is frozen?”

That’s a great way to put it. As far as Guest OS is concerned, time is frozen when it is not scheduled. 

  • The Guest OS experiences frozen time when the hypervisor deschedules it. Time jumps when it’s scheduled again.
  • Guest OS CPU Usage isn’t aware of stolen time. For this counter to be aware, its code has to be modified. If you know that Microsoft or Linux has modified this counter, let me know in which version they made the change.
  • Guest OS Stolen Time accounts for it. But that’s in Linux, not Windows.
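
On Linux, the stolen time is exposed as the steal field of the aggregate cpu line in /proc/stat. Here is a minimal sketch that computes the steal percentage between 2 readings. The function name and the sample lines are made up, and whether the field is actually populated depends on the kernel and the hypervisor.

```python
def steal_percent(before, after):
    """Steal time as a % of all CPU time between 2 'cpu' lines from /proc/stat."""
    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(value) for value in parts[1:]]

    deltas = [b - a for a, b in zip(fields(before), fields(after))]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already counted inside user and nice, so only sum the first 8.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0

# Made-up readings taken a few seconds apart.
before = "cpu  1000 0 500 8000 100 0 50 200 0 0"
after = "cpu  1600 0 700 9050 120 0 60 320 0 0"
print(steal_percent(before, after))
```

On a real system, the 2 lines would be read from /proc/stat with a short sleep in between.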

The table below shows the impact of various scheduling events.

I hope the above helps explain why you should not use Guest OS CPU Usage counters. Use the VM CPU Used and CPU Demand counters instead.

In a future article, I hope to give guidance on what values are considered good or bad. I need to do some profiling first.

Spectre Meltdown dashboards

First, thank you for the overwhelming response on the Spectre Meltdown monitoring dashboards that I created. Glad that it’s useful.

Technically, they are all simple dashboards. They do not leverage powerful features of vRealize Operations. In fact, some of you have created your own dashboards. That means you can also modify my dashboards after you import them. You need to update the Build Numbers to reflect the changes in VMware’s security recommendations. My dashboards are not meant to be the authority on which builds you should have. They were correct at the time of publishing, but I was told there was an update to the build numbers. So you need to update the filter.

Follow these 3 screenshots on how to do it.

The first screenshot shows that there are 3 filters used in the View widgets. To update the widget, just click as shown.

You get the screenshot below. A dialog box pops up. Wait for it to load. Click edit.

To modify the filter, follow this last screenshot.

Click Save.

That’s all you need to do to update the filter. I said that they are just simple dashboards. So what does a powerful one look like?

Here is one idea. Your CIO cares more about applications. An “Application” spans multiple tiers. A tier has multiple VMs. These “Applications” run hundreds of different pieces of software, but consume the same underlying infrastructure. Since the patch may cause performance and capacity problems, how do we monitor across hundreds of applications?

For that, you can refer to this.

Operationalize Your World: Lite edition

I travel globally meeting VMware customers. A very popular request among customers is a simple set of dashboards that answer these questions:

  1. Are the VMs served well?
    1. If not, which VMs are affected? By what problems (CPU, RAM, Disk, Network)? How bad?
    2. Is it because of Villain VMs, consuming an excessive amount of shared resources? If yes, who are they?
    3. Are the problems spread across clusters, networks, datastores? Or are they isolated to a specific part of my IaaS?
    4. How long has the problem been happening? Is there a pattern?
  2. Is my Infrastructure running hot?
    1. This could be a reason why the VMs were not served well.
    2. If yes, which part? I need to see the 4 IaaS elements (CPU, RAM, Disk, Network), and easily spot where the problems are.
      • Blue = 0% = cold. Not used.
      • Red = 100% = hot. Highly utilized.
    3. Compute: Which clusters are running hot? Is it CPU or RAM?
      1. Are the clusters balanced? Hosts in the cluster should have similar color.
      2. Are the hosts of equal capacity? The bigger the host, the bigger the box, so I can spot it easily.
      3. Select an ESXi, then click dashboard navigation to drill down into Troubleshoot a Host dashboard.
    4. Storage: Which datastores are running hot?
      1. Hot = busy processing lots of IOPS.
      2. Select a datastore, then click dashboard navigation to drill down into Troubleshoot a Datastore dashboard.
    5. Network: Which LAN or VXLAN carries a lot of traffic?
      1. The bigger the network (number of VMs or ports), the bigger the box.
      2. The higher the traffic, the redder the color.
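
The blue-to-red scale used in these heat maps can be pictured as a simple interpolation. This sketch is only an illustration of the idea, not how vRealize Operations renders its heat maps. The function name and the straight linear blend are my assumptions.

```python
def heat_color(utilization_pct):
    """Map 0-100% utilization to a hex color from blue (cold) to red (hot)."""
    pct = max(0.0, min(100.0, utilization_pct))  # clamp out-of-range values
    red = round(255 * pct / 100)
    blue = 255 - red
    return f"#{red:02x}00{blue:02x}"

print(heat_color(0))    # cold
print(heat_color(100))  # hot
```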

Using the dashboard best practices covered here, I translated the above into 4 dashboards. I added a simple VM Reclamation dashboard to complete the functionality. The picture below shows the functional relationship among the dashboards.

The result was 5 simple dashboards. It’s a lite version of Operationalize Your World, which has 50 dashboards. As a result, the import step is much simpler. It’s also upgradeable to the full OYW.

Are the VMs served well?

The above shows the present data. It’s suitable for a live NOC screen, where you can see it from a distance. All you want to see is green! You can customize the thresholds by editing each widget.

Easy to spot the villain VMs. They are the biggest! If you have a large box occupying a relatively large area, that means you have a VM consuming a large percentage of your shared environment.

The above is not well suited to showing The Past. Unlike The Present (which has 1 data point), The Past has many. For that, we need to use a line chart. This is why the next dashboard is required.

Were the VMs served well?

Is my Infra running hot?

Was my Infra running hot?

  • Which clusters had the problem? Is it CPU or RAM? How bad is it?
    • Both max and average lines are shown so you get a better idea.
    • If max is high but average is low, no one may complain yet. This is your proactive window!
  • Which datastores had the problem?
    • How bad is the situation?
    • Is the IO stuck in the queue?
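
The “max is high but average is low” check can also be expressed as a simple filter. The thresholds, the cluster names and the data layout below are hypothetical. Pick values that match your own service tiers.

```python
def proactive_window(clusters, max_threshold=90.0, avg_threshold=60.0):
    """Return the clusters whose peak is high while the average is still low."""
    return sorted(
        name
        for name, stats in clusters.items()
        if stats["max"] >= max_threshold and stats["avg"] < avg_threshold
    )

# Made-up utilization (%) over the observation period.
sample = {
    "cluster-a": {"max": 95.0, "avg": 40.0},  # spiky: act before anyone complains
    "cluster-b": {"max": 97.0, "avg": 85.0},  # already hot most of the time
    "cluster-c": {"max": 55.0, "avg": 30.0},  # healthy
}
print(proactive_window(sample))
```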

What can I easily reclaim?

I focus on powered off VMs and Idle VMs as they are easier than active VMs.

From this dashboard, you can select a VM, then click dashboard navigation to drill down into VM Utilization dashboard.

Implementation

Compared to the Operationalize Your World import step, this is much easier. It does not require preparation, which is time consuming. The strikethrough steps are not required.

  • Plan which clusters & datastores belong to what service tier
  • Create service tier policy
  • Create a Group Type called “VM Types”.
  • Import Groups
  • Import super metrics, using dummy policy
  • Import super metrics. Then enable them on your base policy
  • Enable super metrics on each service-tier policy
  • Create XML interaction, manually
  • Create text widget content, manually
  • Create roles. Assign users to roles
  • Import dashboards & Import views

To get the file, download it from here.