Author Archives: Iwan Rahabok

About Iwan Rahabok

A father of 2 little girls, my pride and joy. The youngest one quickly said "I'm the joy!"

Purpose-driven Architecture

When you architect IaaS or DaaS, what end goals do you have in mind? I don’t mean the design considerations, such as best practices. I mean the business result that your architecture has to deliver. A sign that your architecture has failed to deliver is you get into this situation:

The goal of IaaS is to ensure the VMs are running well. The goal of DaaS is to ensure End Users are getting good desktop experience. Have you defined well or good?

Let’s discuss IaaS first. Say you’re architecting for 10K VMs in 2 datacenters. You envisage 2K VMs in the first month, then a ramp up to 10K within the first year. Do you know the basic info about each of these 10K VMs, so that you can architect an infra to serve them well?

  • How big are they? vCPU, RAM, Disk
  • How intense are they? CPU utilization, RAM utilization, Disk IOPS, Network throughput
  • Their workload pattern? Daily, weekly, monthly, etc.

You don’t. Even the application team doesn’t know. Their vendors don’t know either, as you’re talking about the future.

So why do you promise that your IaaS will serve them well?

That’s a strategic mistake you make as a Systems Architect. It’s akin to promising that the highway you architect will serve all the cars, buses and motorcycles well, when you have no idea how many there are and how often they will use it.

Can you do something about it? Yes. You simply provide a good set of choices. The principle you share with your customers is the common sense used in every service industry:

You want it cheap, it won't be fast.
You want it fast, it won't be cheap.

You then offer a few classes of service. Give 3 good choices, at different price points. The highest price has the best performance.

  • Your price has to be cheaper than VMware on AWS, else what’s the point? VMware on AWS has an identical architecture to yours, as it’s using the same software and providing the same capabilities. This assures your customers that they are getting a good price.
  • Your performance is well defined. It is not subject to interpretation. You put a Performance SLA on the table, assuring your customers that you’re confident of delivering as promised.

You then architect your IaaS to deliver the above classes of service. The class of service is your business offering. It’s the purpose of your architecture. With class of service clearly defined, the question below becomes easy to answer.
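As a rough illustration of what a class-of-service catalogue might look like when written down, here is a minimal Python sketch. The tier names, the prices and the use of a CPU contention threshold as the SLA metric are all placeholders invented for this example; they are not recommendations and do not come from any VMware product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassOfService:
    name: str
    price_per_vcpu: float          # hypothetical unit price, any currency
    max_cpu_contention_pct: float  # hypothetical SLA threshold; lower is stricter


# Placeholder numbers only, chosen to illustrate the shape of the catalogue.
CATALOG = [
    ClassOfService("Gold", 3.0, 1.0),
    ClassOfService("Silver", 2.0, 5.0),
    ClassOfService("Bronze", 1.0, 10.0),
]


def is_consistent(catalog):
    """Return True if a higher price always buys a stricter performance threshold."""
    ordered = sorted(catalog, key=lambda c: c.price_per_vcpu, reverse=True)
    return all(
        a.max_cpu_contention_pct < b.max_cpu_contention_pct
        for a, b in zip(ordered, ordered[1:])
    )


print(is_consistent(CATALOG))
```

The check simply encodes "you want it fast, it won't be cheap": walking the catalogue from the most expensive tier down, the promised threshold must get looser at every step.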

When you know exactly the quality of service you need to deliver, the operations team will not suffer. You hand over your architecture to them with ease, as it can be operated easily. It has a clear definition of performance and capacity.

Keep the summary below when you are architecting IaaS or DaaS.

For more details, review Operationalize Your World.

A test of your IaaS Operations maturity

What you architect is SDDC. What you hand over as a business result to the CIO is IaaS. We can assess whether the architecture is good based on the actual result in production. Does it result in fire-fighting and blame-storming? Or do you have peaceful operations?

The litmus test below helps you assess the maturity of your IaaS.

Do your customers blame your infrastructure?

  • If the answer is yes, take a step back and ask yourself why. There is a high chance you’re relying on complaints in your operations, so you actually encourage them. No complaint, no problem. A Complaint-based Operations.
  • The reason you rely on complaints is that you don’t have other means. You have not defined the performance of your IaaS.
  • A sign of mature operations is that you have a Performance SLA. It is per-VM, measured every 5 minutes.
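To make "per-VM, measured every 5 minutes" concrete, here is a minimal Python sketch of how compliance could be computed for one VM over a day. The metric being measured, the threshold of 5.0 and the sample data are all assumptions made for illustration.

```python
SAMPLES_PER_DAY = 24 * 60 // 5  # number of 5-minute samples in one day


def sla_compliance_pct(samples, threshold):
    """Percentage of 5-minute samples where the metric stayed at or below the threshold."""
    if not samples:
        raise ValueError("no samples")
    ok = sum(1 for s in samples if s <= threshold)
    return 100.0 * ok / len(samples)


# Hypothetical day for one VM: 280 good samples and 8 bad ones.
day = [0.5] * 280 + [12.0] * 8

print(SAMPLES_PER_DAY)                                    # 288
print(round(sla_compliance_pct(day, threshold=5.0), 2))   # 97.22
```

With a number like this per VM, operations no longer depends on someone complaining before a problem is noticed.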

Is your IaaS cheaper than both VMware on Amazon and Amazon?

  • If not, your CIO may question your business value. The reason for having an in-house architect is so you can bring lower cost, after taking into account your salary.

Does Help Desk provide a good first line of defence?

  • If Help Desk simply passes through to the next level, you need to look at why.
  • Help Desk is your first line of defence. They are not as technical as you are. Equip them with a simple dashboard so they can handle VM Owner complaints:
    • Is the problem caused by IaaS not serving the VM well?
    • If yes, which part of the Infra: CPU, RAM, Disk, Network?
    • If not, how to prove it convincingly?

Can you justify new infrastructure when utilization is not yet high?

  • This is not referring to additional money that comes with new project. This is referring to existing clusters/storage.
  • Capacity is measured on utilization and performance. A cluster’s capacity is full if it can’t serve its VMs well. Since it takes time to buy hardware, you need an early warning to detect this performance degradation.

Do you struggle with many over-provisioned VMs?

  • This is an indicator that you’re operating as a System Builder as opposed to a Service Provider.
  • As a System Builder, you’re meddling with each System (read: Application). You size them, and argue with the application team.
  • As a Service Provider, you’re not “in the way”. IT simply uses an effective pricing model to drive the right behaviour. Does AWS block you when you buy a 40-CPU EC2 VM when you only need 2 CPUs?

Does Troubleshooting mean all hands on deck?

  • Do you have a process that is followed by all teams (network, storage, server, OS, application)? Does that process end with Root Cause Analysis?
  • As part of RCA, do you set up an alert so the issue can be detected faster if it happens again?

There are more questions, but I thought we would start with those. If you want to see details, download this.

VM CPU Counters in vSphere

I couldn’t find a document that explains VM CPU counters in depth. Having read many sources and consulted folks, here is what I have so far. Corrections are most welcome.

The Basic

At the most basic level, a VM CPU is either being utilized or not being utilized by the Guest OS.

  • When it’s being utilized, the hypervisor must schedule it.
    • If VMkernel has no physical CPU to run it, then the VM is placed into Ready state. The VM is ready, but the hypervisor is not. The Ready counter is increased to account for this.
    • If VMkernel has all the required physical CPUs to run it, then the VM gets to run. It is placed into Run state. The Run counter is increased to track this.
    • If VMkernel has only some of the CPUs, then it will run the VM partially. Eventually, the vCPUs become unbalanced. A VM with >1 CPU may have its CPU stopped if it advances too far. This is why it’s important to right size. The Co-Stop counter is increased to account for this.
  • When it’s not being utilized, there are 2 possible reasons:
    • The CPU is truly idle. It’s not doing work. The Idle Wait counter accounts for it.
    • The CPU is waiting for IO.
      • CPU, being faster than RAM, may have to wait for IO to be brought in. The IO Wait counter accounts for this.
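The decision tree above can be written as a small Python sketch. This is only a simplified model of the description in this post, not how the VMkernel scheduler is implemented; the function name and its boolean parameters are invented for the example.

```python
def vcpu_state(guest_wants_cpu, pcpu_available, too_far_ahead=False, waiting_for_io=False):
    """Classify one vCPU into the state (and counter) described in the list above.

    All parameters are hypothetical inputs for illustration:
      guest_wants_cpu : the Guest OS is utilizing this vCPU
      pcpu_available  : VMkernel has a physical CPU to run it on
      too_far_ahead   : this vCPU has advanced too far ahead of its siblings (vSMP only)
      waiting_for_io  : the vCPU is not utilized because it is waiting for IO
    """
    if guest_wants_cpu:
        if too_far_ahead:
            return "Co-Stop"
        return "Run" if pcpu_available else "Ready"
    return "IO Wait" if waiting_for_io else "Idle Wait"


print(vcpu_state(True, True))                       # Run
print(vcpu_state(True, False))                      # Ready
print(vcpu_state(True, True, too_far_ahead=True))   # Co-Stop
print(vcpu_state(False, False))                     # Idle Wait
```

At any moment a vCPU falls into exactly one branch, which is why the counters can be treated as a breakdown of the same elapsed time.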

With the above explanation, the famous state diagram below makes more sense. I took the diagram from the technical whitepaper “The CPU Scheduler in VMware vSphere® 5.1”. It is mandatory reading for all vSphere professionals, as it’s foundational knowledge.

A VM CPU is in one of these 4 states: Run, Ready, Co-Stop and Wait.

  1. Run means it’s consuming CPU cycles.
  2. Ready means it is ready to run, but ESXi has no physical cores to run it.
  3. Co-Stop only applies to vSMP VM. A VM with >1 CPU may have its CPU stopped if it advances too far. This is why it’s important to right size.
  4. Wait means the CPU is idle. It can either be waiting for IO, or it really is idle.

Advanced Topic

If the above were all we needed to know, monitoring VMware vSphere would have been easy. In reality, the following factors must be considered:

  1. Hyper Threading
  2. Interrupt
  3. System time
  4. Power Management

Hyper Threading (HT)

Hyper Threading (HT) is known to deliver a performance boost that is lower than what 2 physical cores deliver. Generally speaking, HT provides a 1.25x performance boost in vSphere. That means if both threads are running, each thread only gets 62.5% of the shared physical core (1.25 / 2 = 0.625). This is a significant drop from the perspective of each VM: it is better for the VM if the second thread is not being used, because the VM can then get 100% performance instead of 62.5%. Because the drop is significant, enabling the latency sensitivity setting will result in a full core reservation. The CPU scheduler will not run any task on the second hyper-thread.

The following diagram shows 2 VMs. Each runs on a thread of a shared core. There are 4 possible combinations of Run and Idle.

Each VM runs for half the time. The CPU Run counter = 50%, because it’s not aware of HT. But is that really what each VM gets, since they have to fight for the same core?

The answer is obviously no. Hence the need for another counter that accounts for this. The diagram below shows what VM A actually gets.

The CPU Used counter takes this into account. In the first part, VM A only gets 62.5% as VM B is also running. In the second part, VM A gets the full 100%. With each part lasting a quarter of the duration, the total for the entire duration is 0.25 × 62.5% + 0.25 × 100% = 40.625%. CPU Used will report this number, while CPU Run will report 50%.

If both threads are running all the time, guess what CPU Used and CPU Run will report?

62.5% and 100% respectively.

Big difference. The counter matters.
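The Hyper Threading arithmetic above can be reproduced with a short Python sketch. It assumes the 1.25x figure quoted in this post and equal-length time slices; the function name and the way the timeline is encoded are made up for the example.

```python
HT_BOOST = 1.25                  # generally quoted HT benefit, as stated in the text
PER_THREAD_SHARE = HT_BOOST / 2  # 0.625 of the core when both threads are busy


def run_and_used(intervals):
    """Return (run_pct, used_pct) for VM A.

    intervals is a list of equal-length time slices, each a tuple
    (a_running, b_running) for the two VMs sharing one physical core.
    """
    n = len(intervals)
    run = sum(1 for a, _ in intervals if a)
    used = sum((PER_THREAD_SHARE if b else 1.0) for a, b in intervals if a)
    return 100.0 * run / n, 100.0 * used / n


# The 4 combinations of Run and Idle, each lasting a quarter of the time.
four_combinations = [(True, True), (True, False), (False, True), (False, False)]
print(run_and_used(four_combinations))  # (50.0, 40.625)

# Both threads running all the time.
print(run_and_used([(True, True)]))     # (100.0, 62.5)
```

Run only counts whether VM A was scheduled. Used also weighs each slice by how much of the core VM A actually got, which is where the gap comes from.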

Power Management

The 2nd factor that impacts CPU accounting is CPU speed. The higher the frequency (GHz), the faster the CPU runs. All else being equal, a CPU that runs at 1 GHz is 50% slower than when it runs at 2 GHz. On the other hand, Turbo Mode can kick in and raise the CPU clock speed above the stated frequency. Turbo Boost normally happens together with power saving on the same CPU socket. Some cores are put into sleep mode, and the power saved is used to boost other cores. The overall power envelope within the socket remains the same.

In addition, it takes time to wake up from a deep C-State. For details on P-State and C-State, see Valentin Bondzio and Mark Achtemichuk, VMworld 2017, Extreme Performance Series.

Because of the above, power management must be accounted for. Just like in the Hyper-Threading case, CPU Run is not aware of this. CPU Used takes CPU frequency scaling into account.

Does it mean we should always set power management to maximum?

  • No. ESXi uses power management to save power without impacting performance. Running a VM at a lower clock speed does not mean it gets less done. You only set it to high performance for latency-sensitive applications, where sub-second performance matters. VDI, VoIP & Telco NFV are some examples that require low latency.

Stolen or Overlap

When ESXi is running a VM, this activity might get interrupted by IO processing (e.g. incoming network packets). If there are no other available cores in ESXi, VMkernel has to schedule the work on a busy core. If that core happens to be running a VM, the work on that VM is interrupted. The Overlap counter accounts for this, although some VMware documentation may refer to Overlap as Stolen. Linux Guest OS tracks this as Stolen time.

A high overlap indicates the ESXi host is doing heavy IO (Storage or Network). Look at your NSX Edge clusters, and you will see the host has relatively higher Overlap value.

System Time

A VM may execute a privileged instruction, or issue IO. These 2 activities are performed by the hypervisor, on behalf of the VM. vSphere tracks this in a separate counter called System. Since this work is not performed by any of the VM CPUs, it is charged to VM CPU 0. You may see higher Used on CPU 0 than on the others, although CPU Run is balanced across all the vCPUs. So this is not a problem for CPU scheduling. It’s just the way VMkernel does the CPU accounting.

Note that this blog refers to CPU accounting, not Storage accounting. For example, vSphere 6.5 no longer charges the Storage vMotion to the VM.


The relationship among the counters is shown below. To make it simple, each VM is a 1-CPU VM.

CPU Used can differ from CPU Run as it takes into account Hyper Threading and frequency changes, includes System time, and excludes Overlap. Because of this, CPU Used is a better reflection of the actual usage than CPU Run.
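As a rough mental model of that relationship, here is a Python sketch. The exact way ESXi combines these terms is not spelled out in this post, so the order of operations below (subtract Overlap, scale by frequency and HT, then add System) is an assumption, and `freq_ratio` and `ht_share` are hypothetical parameter names.

```python
def cpu_used(run, system=0.0, overlap=0.0, freq_ratio=1.0, ht_share=1.0):
    """Approximate CPU Used from CPU Run, in the same unit as run (e.g. %).

    Assumed model, for illustration only:
      freq_ratio : actual clock speed / nominal clock speed (1.0 = no scaling)
      ht_share   : share of the core the vCPU got (0.625 if the sibling thread is busy)
    """
    return (run - overlap) * freq_ratio * ht_share + system


# No HT contention, no frequency change, no System or Overlap: Used equals Run.
print(cpu_used(50.0))                                            # 50.0

# Sibling thread busy the whole time, with some Overlap and System time.
print(cpu_used(50.0, system=3.0, overlap=2.0, ht_share=0.625))   # 33.0
```

Whatever the exact formula, the direction of each term follows the text: HT contention and lower frequency pull Used below Run, Overlap is removed, and System is added on top.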

The counters above are for each CPU in the VM. A VM with 8 CPUs will have 8 x 100% = 800%. Other than the CPUs, a VM has other ancillary processes (e.g. MKS world, VMX world) in the ESXi kernel, but their usage is typically negligible.

What counters are missing from the above diagram? There are 2 key counters, which are critical in Performance and Capacity. Can you name them?

You’re right. They are CPU Demand and CPU Contention.

Let’s talk about CPU Demand first. The diagram below now has CPU Demand. I’ve also added CPU Wait for completeness.

CPU Demand captures the full demand of a VM. It includes CPU Ready and CPU Co-Stop, so this is the counter to look at when you want to see the full demand of a VM.

vCenter VM CPU Usage

Before we cover CPU Contention, which is a performance counter, there is 1 more utilization counter we need to check. Can you guess what it is?

Hint: it does not exist in ESXi. It only exists in vCenter.

CPU Usage.

This counter takes 2 forms: Usage in MHz and Usage in %.

CPU Usage (%) is a rounding of CPU Usage (MHz), not the other way around. The calculation is done first in GHz, then converted into %.

Since vCenter is only reporting software, it has to base its counters on ESXi. Mapping to Run or Demand does not seem logical. Mapping to Used makes the most sense. I plotted the 2 counters. They are not identical.

The reason is vCenter CPU Usage includes VMX world. Read this good article to understand it better. VMX world exists for each vCPU.

Guest OS CPU Utilization

Now that you know the hypervisor VM CPU counters, can you suggest how they impact the Guest OS? I consulted Valentin Bondzio, someone I consider the #1 authority on this topic. He said “What happens to you when time is frozen?”

That’s a great way to put it. As far as Guest OS is concerned, time is frozen when it is not scheduled. 

  • Guest OS experiences frozen time when the hypervisor deschedules it. Time jumps when it’s scheduled again.
  • Guest OS CPU Usage isn’t aware of stolen time. For this counter to be aware, its code has to be modified. If you know that Microsoft or Linux has modified this counter, let me know in which version they made the change.
  • Guest OS Stolen Time accounts for it. But that’s in Linux, not Windows.

The table below shows the impact of various scheduling events.

The diagram below shows the lack of visibility. Notice most of them are below the VM.

I hope the above helps explain why you should not use Guest OS CPU Utilization counters.

To measure the Guest OS usage, use this formula:

Guest OS Usage = VM CPU Run - Overlap + Ready + CoStop.
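The formula translates directly into code. Here is a minimal Python sketch; the input values are made up, and all four counters are assumed to be expressed in the same unit (for example % of one vCPU).

```python
def guest_os_usage(run, overlap, ready, costop):
    """Guest OS Usage = VM CPU Run - Overlap + Ready + CoStop (all in the same unit)."""
    return run - overlap + ready + costop


# Hypothetical counter values for one vCPU, in %.
print(guest_os_usage(run=60.0, overlap=5.0, ready=10.0, costop=2.0))  # 67.0
```

Ready and Co-Stop are added back because the Guest OS wanted to run during that time, even though the hypervisor did not schedule it.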

Notice I do not use VM CPU Demand or VM CPU Usage counters. Can you guess why?

The problem is both counters are contaminated with components that can give inaccurate readings. The components are:

  • CPU System. This workload is not coming from the VM. It’s coming from ESXi, executing on behalf of the VM. This does not run inside the Guest OS threads.
  • Frequency scaling and HT. They are not relevant in the context of VM CPU utilization.
    • A VM is consuming a CPU at 100%, regardless of whether the 2nd HT runs or not. The fact that the 2nd HT runs at 100% does not mean the VM utilization is 62.5%. The guest is actually running at 100%.
    • The same applies to changes in frequency. It makes the VM faster or slower. We need to distinguish utilisation from capacity and performance use cases.
  • VMX. This should not be charged.

How does the supermetric differ from CPU Usage? We can actually plot it. I took a sample of 480 VMs in my lab. I used the View List widget to list VM Name, VM CPU Usage and VM Guest OS Usage. I exported the list into a spreadsheet, then used a simple formula to compare the 2 values.

The result is interesting.

There are situations where CPU Usage is over-reporting. Take example no. 2 below. It’s reporting 85% when it’s only 72%. I’m not too worried about this, as this is simply a classic over-size.

There are situations where CPU Usage is under-reporting. Take the last example. It’s reporting 55%, but the reality is 93%. You would have thought the VM is fine, when the VM actually needs more CPU. In this case, you need to ensure that Ready and Co-Stop aren’t a factor.

I plotted all 480 values on a line chart, so I get the big picture. I notice that most of the time, it’s correct. That’s good news. The bulk of the data is <5% difference. The black arrow indicates CPU Usage is over-reporting, while the red arrow indicates it is under-reporting.
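The spreadsheet comparison can also be done in a few lines of Python. This is a sketch only: the VM names are invented, the first two value pairs are the examples quoted above, and the 5-point tolerance mirrors the "<5% difference" observation.

```python
def classify(cpu_usage, guest_usage, tolerance=5.0):
    """Compare vCenter CPU Usage against the Guest OS Usage supermetric (both in %)."""
    diff = cpu_usage - guest_usage
    if abs(diff) < tolerance:
        return "close"
    return "over-reporting" if diff > 0 else "under-reporting"


# (VM name, CPU Usage %, Guest OS Usage %); names and the third row are hypothetical.
rows = [
    ("vm-a", 85.0, 72.0),
    ("vm-b", 55.0, 93.0),
    ("vm-c", 40.0, 42.0),
]

for name, usage, guest in rows:
    print(name, classify(usage, guest))
# vm-a over-reporting
# vm-b under-reporting
# vm-c close
```

Sorting the exported rows by the difference is a quick way to surface the under-reporting VMs, which are the ones most likely to need attention.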

You can do the profiling in your environment too, and discover interesting behaviour in production 🙂