Tag Archives: VMware vSphere

VM CPU Counters in vSphere

I couldn’t find a document that explains VM CPU counters in-depth. Having read many sources and consulted folks, here is what I got so far. Correction is most welcomed.

The Basic

At the most basic level, a VM CPU is either being utilized or not being utilized by the Guest OS.

  • When it’s being utilized, the hypervisor must schedule it.
    • If VMkernel has no physical CPU to run it, then the VM is placed into Ready state. The VM is ready, but the hypervisor is not. The Ready counter is increased to account for this.
    • If VMkernel has all the required physical CPUs to run it, then the VM gets to run. It is placed into Run state. The Run counter is increased to track this.
    • If VMkernel has only some of the CPUs, then it will run the VM partially. Eventually, there will be unbalanced. A VM with >1 CPU may have its CPU stopped if it advances too far. This is why it’s important to right size. The Co-Stop counter is increased to account for this.
  • When it’s not being utilized, there are 2 possible reasons
    • The CPU is truly idle. It’s not doing work. The Idle Wait counter accounts for it.
    • The CPU is waiting for IO.
      • CPU, being faster than RAM, may have to wait for IO to be brought in. The IO Wait counter accounts for this

With the above explanation, the famous state diagram below makes more sense. I took the diagram from the technical whitepaper “The CPU Scheduler in VMware vSphere® 5.1”. It is a mandatory reading for all vSphere professionals, as it’s a foundational knowledge.

A VM CPU is on one of these 4 states: Run, Ready, Co-Stop and Wait.

  1. Run means it’s consuming CPU cycle.
  2. Ready means it is ready to run, but ESXi has no physical cores to run it.
  3. Co-Stop only applies to vSMP VM. A VM with >1 CPU may have its CPU stopped if it advances too far. This is why it’s important to right size.
  4. Wait means the CPU is idle. It can either be waiting for IO, or it really is idle.

Advanced Topic

If the above was all we need to know, monitoring VMware vSphere would have been easy. In reality, the following factors must be considered:

  1. Hyper Threading
  2. Interrupt
  3. System time
  4. Power Management

Hyper Threading (HT)

Hyper Threading (HT) is known to deliver performance boost that is lower than what 2 physical cores deliver. Generally speaking, HT provides 1.25x performance boost in vSphere. That means if both threads are running, each thread only gets 62.5% of the shared physical core. This is a significant drop from the perspective of each VM. From the perspective of each VM, it is better if the second thread is not being used, because the VM can then get 100% performance instead of 62.5%. Because the drop is significant, enabling the latency sensitivity setting will result in a full core reservation. The CPU scheduler will not run any task on the second HT.

The following diagram shows 2 VMs. Each run on a thread of a shared core. There are 4 possible combinations of Run and Idle.

Each VM runs for half the time. The CPU Run counter = 50%, because it’s not aware of HT. But is that really what each VM gets, since they have to fight for the same core?

The answer is obviously no. Hence the need for another counter that accounts for this. The diagram below shows what VM A actually gets.

The CPU Used counter takes this into account. In the first part, VM A only gets 62.5% as VM B is also running. In the second part, VM A gets the full 100%. The total for the entire duration is 40.625%. CPU Used will report this number, while CPU Run will report 50%.

If both threads are running all the time, guest what CPU Used and CPU Run will report?

62.5% and 100% respectively.

Big difference. The counter matters.

Power Management

The 2nd factor that impacts CPU Accounting is CPU speed. The higher the frequency (GHz), the faster the CPU run. All else being equal, a CPU that run at 1 GHz is 50% slower than when it runs at 2 GHz. On the other hand, Turbo Mode can kick in and the CPU clock speed is higher than stated frequency. Turbo Boost normally happens together with power saving on the same CPU socket. Some cores are put to sleep mode, and the power saving is used to turbo mode other cores. The overall power envelope within the socket remain the same.

In addition, it takes time to wake up from a deep C-State. For details on P-State and C-State, see Valentin Bondzio and Mark Achtemichuk, VMworld 2017, Extreme Performance Series.

Because of the above, power management must be accounted for. Just like Hyper-Threading case, CPU Run is not aware of this. CPU Used takes into CPU Frequency Scaling.

Does it mean we should always set power management to maximum?

  • No. ESXi uses power management to save power without impacting performance. A VM running on lower clock speed does not mean it gets less done. You only set it to high performance on latency sensitive applications, where sub-seconds performance matters. VDI, VoIP & Telco NFV are some examples that require low latency.

Stolen or Overlap

When ESXi is running a VM, this activity might get interrupted with IO processing (e.g. incoming network packets). If there is no other available cores in ESXi, VMkernel has to schedule the work on a busy core. If that core happens to be running VM, the work on that VM is interrupted. The counter Overlap accounts for this, although some documentation in VMware may refer to Overlap as Stolen. Linux Guest OS tracks this as Stolen time.

A high overlap indicates the ESXi host is doing heavy IO (Storage or Network). Look at your NSX Edge clusters, and you will see the host has relatively higher Overlap value.

System Time

A VM may execute a privilege instruction, or issue IO. These 2 activities are performed by the hypervisor, on behalf of the VM. vSphere tracks this in separate counter called System. Since this work is not performed by any of the VM CPU, this is charged to the VM CPU 0. The system services are accounted to CPU 0. You may see higher Used on CPU 0 than others, although the CPU Run are balanced for all the VCPUs. So this is not a problem for CPU scheduling. It’s just the way VMKernel does the CPU accounting.

Note that this blog refers to CPU accounting, not Storage accounting. For example, vSphere 6.5 no longer charges the Storage vMotion to the VM.

Conclusion

The relationship among the counters are shown below. To make it simple, each VM is 1 CPU VM.

CPU Used can differ to CPU Run as it takes into account Hyper Threading and frequency changes, includes System time, and excludes Overlap. Because of this, CPU Used is a better reflection of the actual usage than CPU Run.

The counters above is for each CPU in the VM. A VM with 8 CPU will have 8 x 100% = 800%. Other than the CPU, a VM world has other ancillary processes (e.g. MKS world, VMX world) in the ESXi kernel, but they are typically negligible.

What counters are missing from the above diagram? There are 2 key counters, which are critical in Performance and Capacity. Can you name them?

You’re right. They are CPU Demand and CPU Contention.

Let’s talk about CPU Demand first. The diagram below now has CPU Demand. I’ve also added CPU Wait for completeness.

CPU Demand captures the full demand of a VM. It includes the CPU Ready and CPU Co-Stop. This is what you want to see if you want to see the full demand of a VM.

vCenter VM CPU Usage

Before we cover CPU Contention, which is a performance, there is 1 more utilization counter we need to check. Can you guess what is it?

Hint: it does not exist in ESXi. It only exists in vCenter.

CPU Usage.

The counter takes 2 forms: Usage in MHz and Usage in %.

CPU Usage (%) is a rounding of CPU Usage (MHz), not the other way around. The calculation is done first in GHz, then converted into %.

Since vCenter is only a reporting software, it has to base on ESXi. Mapping to Run and Demand do not seem logical. Mapping to Usage makes the most sense. I plotted the 2 counters. They are not identical. I will study them more. Luckily, they are pretty close.

Why Usage is higher than Demand, since Demand = Used + Latency? Based on this, Usage is even higher than Used. Something that can go “down” in Used is not considered in Usage. The suspect is HT and Power Management. It seems like Usage MHz is based on constant CPU clock speed. If that’s the case, then it’s not derived from Used. It’s derived from Run.

Guest OS CPU Utilization

Now that you know the hypervisor VM CPU counters, can you suggest how it impact the Guest OS? I consulted Valentin Bondzio, someone I consider the #1 authority on this topic. He said “What happens to you when time is frozen?”

That’s a great way to put it. As far as Guest OS is concerned, time is frozen when it is not scheduled. 

  • Guest OS experience frozen time when hypervisor deschedules it. Time jumps when it’s scheduled again.
  • Guest OS CPU Usage isn’t aware of stolen time. For this counter to be aware, its code has to be modified. If you know that Microsoft and Linux has modified this counter, let me know in which version they make the change.
  • Guest OS Stolen Time accounts for it. But that’s in Linux, not Windows.

The table below shows the impact of various scheduling events.

I hope the above helps explains why you should not use Guest OS CPU Usage counters. Use VM CPU Used and CPU Demand counters.

In future article, I hope to give guidance on what value considered good or bad. I need to do some profiling first.

VM Availability Monitoring

VM Availability is a common requirement, as IaaS team is bound by Availability SLA. The challenge is reporting this.

The up time of a VM is more complex than that of a physical machine. Just because the VM is powered on, does not mean the Guest OS is up and running. The VM could be stuck at BIOS, Windows hits BSOD or Guest OS simply hang. This means we need to check the Guest OS. If we have VMware Tools, we can check for heartbeat. But what if VMware Tools is not running or not even installed? Then we need to check for sign of life. Does the VM generate network packets, issue disk IOPS, consume RAM?

Another challenge is the frequency of reporting. If you report every 5 minutes, what if the VM was rebooted within that 5 minutes, and it comes back up before the 5th minute ends? You will miss that fact that it was down within that 5 minutes!

From the above, we can build a logic:

If VM Powered Off then
   Return 0. VM is definitely down.
Else
  Calculate up time within the 300 seconds period
End

In the above logic, to calculate the up time, we need first to decide if the Guest OS is indeed up, since the VM is powered on.

We can deduce that Guest OS is up is it’s showing any sign of life. We can take

  • Heartbeat
  • RAM Usage
  • Network Usage
  • Disk IOPS

Can you guess why we can’t use CPU Usage?

VM does generate CPU even though it’s stuck at BIOS! We need a counter that shows 0, and not a very low number. An idle VM is up, not down.

So we need to know if the Guest OS is up or down. We are expecting binary, 1 or 0. Can you see the challenge here?

Yes, none of the counters above is giving you binary. Disk IOPS for example, can vary from 0.01 to 10000. The “sign of life” is not coming as binary.

We need to convert them into 0 or 1. 0 is the easy part, as they will be 0 if they are down.

I’d take Network Usage as example.

  • What if Network Usage is >1? We can use Min (Network Usage, 1) to return 1.
  • What if Network Usage is <1? We can use Round up (Network Usage, 1) to return 1.

So we can combine the above formula to get us 0 or 1.

The last part is to account for partial up time, when the VM was rebooted within the 300 seconds sampling period. The good thing is vR Ops tracks the OS up time every second. So every 5 minute, the value goes up by 300 seconds. As VM normally runs >5 minutes, you end up with a very large number. Our formula becomes:

If the up time is >300 seconds then return 300 else return it as it is.

Implementation

Let’s now put the formula together. Here is the logical formula:

(Is VM Powered on?) x 
(Is Guest OS up?) x 
(period it is up within 300 seconds)

“Is VM Powered on” returns 0 or 1. This is perfect as the whole formula returns 0.

“Is Guest OS up” returns 0 or 1. It returns 1 is there is any sign of life.

We get the Maximum of OS Uptime, Tools Heartbeat, RAM Usage, Disk Usage, and Network Usage. If any of these is not 0, then the Guest OS is up

We use Minimum (Is Guest OS up, 1) to bring down the number to 1.

Since the VM can be idle, we use Round Up to 1. This will round up 0.0001 to 1 but not round up 0 to 1.

To determine how long the VM is up within the 300 seconds, we simply take Minimum (OS Uptime, 300)

To convert the number into percentage, we simply divide by 300 then multiple by 100.

Here is what the formula looks like

Can you write the above formula differently? Yes, you can use If Then Else. I do not use it as it makes the formula harder to read. It’s also more resource intensive.

Let’s now show the above formula using actual vR Ops super metric formula.

I’m using $This feature as the formula is referring to the VM itself. I’m using metric= and not attribute= as I only need 1 value.

Validation

Let’s now take a few scenario and run through the formula.

Scenario 1: the VM is powered off.

  • This will return 0 since the first result is already 0 and multiplying 0 with anything will give 0.

Scenario 2: ideal scenario. The VM is up, active and has VMware Tools

  • The OS Uptime metric, since it’s in seconds and it’s accumulative, will be much larger than other counters. The Max () will return 7368269142, but Min (7368269142, 1) will return 1.
  • The Min (300, 7368269142) will return 300.
  • So the result is 1 x 1 x 300 / 3 = 100. The VM Uptime for that 5 minute period is 100%.

Scenario 3: not ideal scenario. The VM is up, but is idle and has no VMware Tools

Example

Let’s show an example of how the super metric detects the availability issue. Here is a VM that has availability problem. In this case, the VM was rebooted regularly.

The VM Uptime (%) super metric reports it correctly. It’s 100% when it’s up and 0% when it’s down. The super metric matches the Powered On metric and the OS Uptime metric.

Let’s check if the super metric detects the up time within the 5 minutes. To do that, we can zoom into the time it was down. From the Powered On and OS Uptime metric, we can see it’s down for around 10 – 15 minutes. The super metric detects that. The up time went down to 0 for 10 minutes, then partially up in the last 5 minutes.

Here is the uptime in the last 5 minutes. So it went up within this 5 minutes

Limitation

The limitation is within a 5 minute period. The OS has to be up by the 5th minute. If the number is 0, it will be calculated as 0. So if the VM is up for 4:59 minutes, then went down at 5th minute exactly, the Powered On will return 0.

Architecting VMware vSphere for Operations

As an Architect, you take into account many requirements when designing your VMware vSphere environment. As an Operations person, I shall not question your Architecture. I’m sure it is fast, highly available, right on budget, etc. My role is to help you prove to CIO that what you architect actually lives up to its expectation. “Plan meets Actual” is what an Architect wants, because that means your architecture delivers its promise.

The plan of the Architecture exists in some diagram and documents. It's static.
The reality of the Architecture exists in Datacenter. It's live.

When proving, here are the questions we should be able to answer:

  • Availability: Does the IaaS deliver the promised Availability SLA? If not, what was its actual number, when was it breached, how long, which VMs affected? For each VM, when exactly it happened and ended?
  • Performance: Does the IaaS serve all its VMs well? An IaaS platform provides CPU, RAM, Disk and Network as services. If has to deliver these 4 resources when asked by each VM, 24 x 7 x 365 days a year. If not, which VMs were affected by what and when?

Those are simple questions, but are very difficult to answer. Let say you have 10,000 VMs. How do you answer that? How do you provide answer over time (e.g. monthly), proving you handle the peaks well?

To complicate matters, you need to able to answer per Business Units. Business Units A will not care about other business units. Since a Business Unit has >1 apps, you also need to answer per application. An application owner only cares about her applications.

There are a few things you need to do, so you are in the position to prove.

Step 1: Reflect the business in vSphere

Does your vCenter show all the business units? Can you show how the business is mapped into your vSphere environment? You design vSphere so Business can run on it, so where is the Business? A company is made up business units, which may have multiple departments. The structure below is a typical example.

An application typically has multiple tiers. Does your vSphere understand that?

Map the above as folders in your vCenter.

I see many naming convention that is not operations-friendly. It’s impossible to guess what it is. The names are very similar and hard to read, hence it’s easy for operators to make mistake! In some companies, these operators are outsourced or contractors, who are not that familiar or don’t care as much as employee. The naming convention typically originates from mainframe or MS-DOS era, where you cannot have space and have limited characters. Examples are SG1-D01-INS-0001W-PRD. Can you guess what on earth that is? You’re right, you can’t. Imagine there are 1000s of them like that, and you have new operators joining the help desk team.

If you have shared application, you can create a folder for that. Multiple vR Ops applications can point to the same vCenter folder.

Folder, Tags and Annotation

Have you seen a vSphere environment where there are tags and annotation everywhere?

It’s rare to meet customers with a 100% well-thought and documented approach to the 3 features above. They may have general guidelines, but not explicit Do’s and Dont’s. As a result, these 3 features are used wrongly.

Use Tags when the values are discrete, ideally Yes/No. I’d use tag to tag the following:

  • VMs with RDM.
  • VMs with MSCS or Linux Clustering.

Do not use Tags when the values are unlikely to be common. Use annotation for this. Examples are VM Owner Name, Email Address, Mobile Number. In an environment where there are >10K VMs, there can be 1000 VM Owners.

Do not use Tags to tag Service Tier. For Infra objects such as Cluster and Datastore Cluster, that should be clearly reflected in the name itself. I’d prefix all Tier 1 clusters with Tier 1, so the chance of deploying into the wrong tier is minimized.

Step 2: Design Service Tiers into vSphere

Does your vSphere understand that there are different classes of service? Are Tier 1 clusters and datastores clearly labelled?

You should avoid mixing multiple classes of service into a single cluster or datastore. While it is technically possible to segregate, it’s operationally challenging. Resource Pool expects the number of VMs for each pool to be identical among the sibling pool.

For each tier, you need to have both Availability SLA and Performance. For Performance SLA, review this doc.

Step 3: Define and Map Tiers in vR Ops

Now that you’ve considered service tier into your vSphere architecture, time to show it. You cannot show it in vSphere as vCenter does not understand Performance SLA and Availability SLA. You can use vR Ops do this. Follow this step.

Step 4: Map Applications in vR Ops

Use custom groups to create applications. If you have a proper naming convention, it should not be difficult to select members of the applications. All you need is a query that says the names contains XYZ. There should not be a need for regular expression.

Once apps are mapped, you can do something like this.

Step 5: Consider Debug-ability

Things go wrong. Especially in production. Your architecture should lend itself for troubleshooting.

A major area is to ensure the counters are reliable, else it’s hard to troubleshoot performance. The CPU Contention counter, which is the main counter for IaaS Performance SLA, is greatly affected by Power Management. Ensure your ESXi power management follow this guide by Sunny Dua.

Once you have that in place, you will be able to prove that your Architecture lives up to its expectation. Use the dashboards from Operationalize Your World to show that proof!