Allocation Model in vSphere

The allocation model, using vCPU:pCore and vRAM:pRAM ratios, is one of the 2 capacity models used in VMware vSphere. Together with the utilization model, it helps the Infra team manage capacity. The problem with both models is that neither of them measures performance. While they correlate with performance, they are not the counter for it.

As part of Operationalize Your World, we proposed a measurement for performance. We modeled performance and developed a counter for it. For the very first time, performance can be defined and quantified. We also added an availability concept, in the form of a concentration risk ratio. Most businesses cannot tolerate too many critical VMs going down at the same time, especially if they are revenue generating.

Since the debut of Operationalize Your World at VMworld 2015, hundreds of customers have validated this new metric. With performance added, we are in a position to revise VMware vSphere capacity management.

We can now refine Capacity Management and split it into Planning, Monitoring and Troubleshooting.

Planning Stage

At this stage, we do not know what the future workload will be. We can plan that we will deliver a certain level of performance at some level of utilization. We use the allocation ratio at this stage. The allocation ratio directly relates to your cost, hence your price. If a physical core costs $10 per month, and you do 5:1 over-commit, then each vCPU should be priced at least $2 per month. Price it lower and you will make a loss. In practice it has to be higher than $2, unless you can sell all your capacity on Day 1 and keep it sold for the full 3 years.
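
To make the arithmetic concrete, here is a minimal sketch in Python of that break-even calculation. The core cost and ratio are the hypothetical numbers from the example above; the 80% sell-through at the end is my own illustration:

# Break-even vCPU pricing under the allocation model.
core_cost_per_month = 10.0  # $ per physical core per month (example above)
overcommit_ratio = 5        # vCPU : pCore

breakeven_price = core_cost_per_month / overcommit_ratio
print(f"Break-even: ${breakeven_price:.2f} per vCPU per month")  # $2.00

# If you only sell a fraction of the capacity, the price must rise
# to cover the idle remainder (illustrative assumption: 80% sold).
sell_through = 0.8
print(f"Required at 80% sold: ${breakeven_price / sell_through:.2f}")  # $2.50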

We also consider availability at this stage. For example, if the business can only tolerate 100 mission critical VMs going down when a cluster goes down, then we plan our cluster size accordingly. There is no point planning a large cluster when you can only put 100 VMs in it. 100 VMs at an average size of 8 vCPUs is 800 vCPUs, which needs 400 physical cores at 2:1 over-commit. Using 40-core ESXi hosts, that’s only 10 hosts. No point building a cluster of 16.
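
Here is the same sizing arithmetic as a small Python sketch, using the numbers from the example above:

import math

tolerated_vms = 100      # critical VMs the business can afford to lose at once
avg_vcpu_per_vm = 8
overcommit_ratio = 2     # vCPU : pCore
cores_per_host = 40

total_vcpus = tolerated_vms * avg_vcpu_per_vm            # 800 vCPUs
cores_needed = total_vcpus / overcommit_ratio            # 400 physical cores
hosts_needed = math.ceil(cores_needed / cores_per_host)  # 10 ESXi hosts
print(f"{total_vcpus} vCPUs -> {cores_needed:.0f} cores -> {hosts_needed} hosts")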

Monitoring Stage

This is where you check if Plan meets Actual. You have live VMs running, so you have real data, not a spreadsheet 🙂. There are 2 possible situations:

  1. Over-commit
  2. No over-commit.

With no over-commit, the utilization of the cluster will never exceed 100%, hence there is no point measuring utilization. There will be no performance issues either, since none of the VMs compete for resources. No contention means ideal performance, so there is no point measuring performance. The only relevant metrics are availability and allocation.

With over-commit, the opposite happens. The ratio is no longer valid, as we can have performance issues. It’s also not relevant since we have real data. If you plan on 8:1 over-commit, but you already have a performance issue at 4:1, do you keep going? You don’t, even if you make a loss because your financial plan was based on 8:1. You need to figure out why and solve it. If you cannot solve it, then you remain at 4:1. What you learn is that your plan did not pan out as planned 😉

There are 3 reasons why the ratio (read: the allocation model) can be wrong.

Mark Achtemichuk, VMware performance guru, summarizes it well here. Quoting him:

There is no common ratio and in fact, this line of thinking will cause you operational pain.

Troubleshooting Stage

If you have plenty of capacity but you have a performance problem, you enter capacity troubleshooting. A typical cause of poor performance when utilization is not high is contention: the VMs are competing for resources. This is where the Cluster Performance (%) counter comes into play. It gives an early warning, hence acting as a leading indicator.

Summary

You no longer have to build a buffer to ensure performance. You can go higher on the consolidation ratio as you can now measure performance.

If you are a Service Provider, you can now offer premium pricing, as you can back it up with a Performance SLA.

If you are a customer of an SP, then you can demand a performance SLA. You do not need to rely on ratio as a proxy.

VM CPU Counters in vSphere

I couldn’t find a document that explains VM CPU counters in depth. Having read many sources and consulted folks, here is what I have so far. Corrections are most welcome.

The Basics

At the most basic level, a VM CPU is either being utilized or not being utilized by the Guest OS.

  • When it’s being utilized, the hypervisor must schedule it.
    • If VMkernel has no physical CPU to run it, then the VM is placed into Ready state. The VM is ready, but the hypervisor is not. The Ready counter is increased to account for this.
    • If VMkernel has all the required physical CPUs to run it, then the VM gets to run. It is placed into Run state. The Run counter is increased to track this.
    • If VMkernel has only some of the required physical CPUs, then it runs the VM partially. Eventually, the vCPUs become unbalanced; a vCPU that advances too far ahead of its siblings is stopped. This is why it’s important to right-size. The Co-Stop counter is increased to account for this.
  • When it’s not being utilized, there are 2 possible reasons
    • The CPU is truly idle. It’s not doing work. The Idle Wait counter accounts for it.
    • The CPU is waiting for IO.
      • The CPU, being faster than RAM, may have to wait for IO to be brought in. The IO Wait counter accounts for this.

With the above explanation, the famous state diagram below makes more sense. I took the diagram from the technical whitepaper “The CPU Scheduler in VMware vSphere® 5.1”. It is mandatory reading for all vSphere professionals, as it’s foundational knowledge.

A VM CPU is in one of these 4 states: Run, Ready, Co-Stop and Wait.

  1. Run means it’s consuming CPU cycles.
  2. Ready means it is ready to run, but ESXi has no physical cores to run it on.
  3. Co-Stop only applies to vSMP VMs. A VM with >1 CPU may have a vCPU stopped if it advances too far ahead of its siblings. This is why it’s important to right-size.
  4. Wait means the CPU is idle. It can either be waiting for IO, or it really is idle.

Advanced Topic

If the above were all we needed to know, monitoring VMware vSphere would be easy. In reality, the following factors must be considered:

  1. Hyper Threading
  2. Interrupt
  3. System time
  4. Power Management

Hyper Threading (HT)

Hyper Threading (HT) is known to deliver a performance boost that is lower than what 2 physical cores deliver. Generally speaking, HT provides a 1.25x performance boost in vSphere. That means if both threads are running, each thread only gets 62.5% of the shared physical core. This is a significant drop from the perspective of each VM: it is better for a VM if the second thread is not being used, because the VM then gets 100% performance instead of 62.5%. Because the drop is significant, enabling the latency sensitivity setting results in a full core reservation. The CPU scheduler will not run any task on the second HT.

The following diagram shows 2 VMs. Each runs on a thread of a shared core. There are 4 possible combinations of Run and Idle.

Each VM runs for half the time. The CPU Run counter = 50%, because it’s not aware of HT. But is that really what each VM gets, since they have to fight for the same core?

The answer is obviously no. Hence the need for another counter that accounts for this. The diagram below shows what VM A actually gets.

The CPU Used counter takes this into account. In the first part, VM A only gets 62.5% as VM B is also running. In the second part, VM A gets the full 100%. The total for the entire duration is 40.625%. CPU Used will report this number, while CPU Run will report 50%.

If both threads are running all the time, guess what CPU Used and CPU Run will report?

62.5% and 100% respectively.

Big difference. The counter matters.
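
Here is a small Python sketch of that accounting, using the 4-quarter timeline from the diagram and the 62.5% figure discussed above:

# CPU Run vs CPU Used for VM A, assuming each busy thread gets 62.5%
# of the core when its sibling thread is also busy.
timeline = [("run", "run"), ("run", "idle"), ("idle", "run"), ("idle", "idle")]
run = sum(1.0 for a, b in timeline if a == "run") / len(timeline)
used = sum(0.625 if b == "run" else 1.0
           for a, b in timeline if a == "run") / len(timeline)
print(f"CPU Run:  {run:.2%}")   # 50.00%  (blind to HT)
print(f"CPU Used: {used:.3%}")  # 40.625% (discounts the shared core)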

Power Management

The 2nd factor that impacts CPU accounting is CPU speed. The higher the frequency (GHz), the faster the CPU runs. All else being equal, a CPU running at 1 GHz is 50% slower than when it runs at 2 GHz. On the other hand, Turbo Mode can kick in, and the CPU clock speed becomes higher than the stated frequency. Turbo Boost normally happens together with power saving on the same CPU socket: some cores are put into sleep mode, and the power saved is used to boost other cores. The overall power envelope within the socket remains the same.

In addition, it takes time to wake up from a deep C-State. For details on P-States and C-States, see Valentin Bondzio and Mark Achtemichuk’s VMworld 2017 Extreme Performance Series session.

Because of the above, power management must be accounted for. Just like in the Hyper-Threading case, CPU Run is not aware of this. CPU Used takes CPU frequency scaling into account.
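
As an illustration, here is a minimal sketch of that frequency normalization. The exact ESXi implementation may differ; this simply follows the description above:

# CPU Used adjusted for frequency scaling (illustrative, not ESXi source).
nominal_mhz = 2000    # the stated (nominal) clock speed
effective_mhz = 1000  # actual clock after power management
cpu_run_pct = 50.0    # CPU Run for the interval; blind to clock speed
cpu_used_pct = cpu_run_pct * (effective_mhz / nominal_mhz)
print(f"Run: {cpu_run_pct}%  Used: {cpu_used_pct}%")  # Run: 50.0%  Used: 25.0%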

Does it mean we should always set power management to maximum?

  • No. ESXi uses power management to save power without impacting performance. A VM running on a lower clock speed does not necessarily get less done. You only set it to high performance for latency-sensitive applications, where sub-second performance matters. VDI, VoIP & Telco NFV are some examples that require low latency.

Stolen or Overlap

When ESXi is running a VM, this activity might get interrupted by IO processing (e.g. incoming network packets). If there are no other available cores in the ESXi host, VMkernel has to schedule the work on a busy core. If that core happens to be running a VM, the work on that VM is interrupted. The Overlap counter accounts for this, although some VMware documentation may refer to Overlap as Stolen. The Linux Guest OS tracks this as stolen time.

A high Overlap indicates the ESXi host is doing heavy IO (Storage or Network). Look at your NSX Edge clusters, and you will see the hosts have relatively higher Overlap values.

System Time

A VM may execute a privileged instruction, or issue IO. These 2 activities are performed by the hypervisor, on behalf of the VM. vSphere tracks this in a separate counter called System. Since this work is not performed by any of the VM’s vCPUs, it is charged to the VM’s CPU 0. You may see a higher Used on CPU 0 than on the others, although CPU Run is balanced across all the vCPUs. This is not a problem for CPU scheduling; it’s just the way VMkernel does the CPU accounting.

Note that this blog refers to CPU accounting, not Storage accounting. For example, vSphere 6.5 no longer charges the Storage vMotion to the VM.

Conclusion

The relationship among the counters is shown below. To keep it simple, each VM is a 1-vCPU VM.

CPU Used can differ from CPU Run as it takes into account Hyper Threading and frequency changes, includes System time, and excludes Overlap. Because of this, CPU Used is a better reflection of actual usage than CPU Run.

The counters above are per vCPU. A VM with 8 vCPUs will have 8 x 100% = 800%. Other than the vCPUs, a VM has other ancillary worlds (e.g. the MKS world, the VMX world) in the ESXi kernel, but they are typically negligible.
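
Putting the factors together, here is a hedged Python sketch of how CPU Used relates to Run, System and Overlap per the description above. The exact ESXi formula is not published here, so treat this as an approximation:

def cpu_used_pct(run, system, overlap, ht_factor=1.0, freq_factor=1.0):
    # run, system, overlap are percentages for the same interval.
    # ht_factor: e.g. 0.625 when the sibling hyper-thread is busy.
    # freq_factor: effective clock / nominal clock.
    return (run - overlap) * ht_factor * freq_factor + system

# A 1-vCPU VM: 50% Run, 3% System done on its behalf, 2% Overlap,
# sibling thread busy the whole time, no frequency scaling.
print(cpu_used_pct(run=50, system=3, overlap=2, ht_factor=0.625))  # 33.0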

What counters are missing from the above diagram? There are 2 key counters, which are critical in Performance and Capacity. Can you name them?

You’re right. They are CPU Demand and CPU Contention.

Let’s talk about CPU Demand first. The diagram below now has CPU Demand. I’ve also added CPU Wait for completeness.

CPU Demand captures the full demand of a VM. It includes CPU Ready and CPU Co-Stop. This is the counter to check if you want to see the full demand of a VM.

vCenter VM CPU Usage

Before we cover CPU Contention, which is a performance counter, there is 1 more utilization counter we need to check. Can you guess what it is?

Hint: it does not exist in ESXi. It only exists in vCenter.

CPU Usage.

This counter takes 2 forms: Usage in MHz and Usage in %.

CPU Usage (%) is a rounding of CPU Usage (MHz), not the other way around. The calculation is done first in MHz, then converted into %.
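
If Usage (%) is indeed derived from Usage (MHz), the conversion presumably normalizes by the VM’s total clock capacity. A sketch under that assumption (the normalization is my reading, not documented behavior):

usage_mhz = 3200  # vCenter Usage (MHz) for the interval
num_vcpus = 2
core_mhz = 2000   # nominal clock speed of a host core
usage_pct = usage_mhz / (num_vcpus * core_mhz) * 100
print(f"Usage: {usage_pct:.0f}%")  # 80%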

Since vCenter is only reporting software, it has to base this counter on ESXi. Mapping to Run or Demand does not seem logical; mapping to Used makes the most sense. I plotted the 2 counters. They are not identical.

The reason is that vCenter CPU Usage includes the VMX world. Read this good article to understand it better. A VMX world exists for each vCPU.

Guest OS CPU Utilization

Now that you know the hypervisor VM CPU counters, can you suggest how they impact the Guest OS? I consulted Valentin Bondzio, someone I consider the #1 authority on this topic. He said, “What happens to you when time is frozen?”

That’s a great way to put it. As far as the Guest OS is concerned, time is frozen when it is not scheduled.

  • The Guest OS experiences frozen time when the hypervisor deschedules it. Time jumps when it’s scheduled again.
  • Guest OS CPU Usage isn’t aware of stolen time. For this counter to be aware, its code has to be modified. If you know that Microsoft and Linux have modified this counter, let me know in which version they made the change.
  • Guest OS Stolen Time accounts for it. But that’s in Linux, not Windows.

The table below shows the impact of various scheduling events.

The diagram below shows the lack of visibility. Notice most of the counters are below the VM. The only counter that the hypervisor cannot see is the Guest OS run queue, which the Guest does not count as utilization since the work is still in the queue.

I hope the above helps explain why you should not use the Guest OS CPU utilization counters.

To measure the Guest OS usage, use this formula:

Guest OS Usage = VM CPU Run - Overlap + Ready + CoStop.
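
Here is the same formula as a small Python sketch, with hypothetical counter values:

def guest_os_usage_pct(run, overlap, ready, costop):
    # All inputs are percentages for the same interval.
    return run - overlap + ready + costop

# The Guest thinks it is busier than Run suggests, because its frozen
# time (Ready + Co-Stop) passes unnoticed inside the Guest.
print(guest_os_usage_pct(run=60, overlap=2, ready=10, costop=5))  # 73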

Notice I do not use VM CPU Demand or VM CPU Usage counters. Can you guess why?

The problem is both counters are contaminated with components that can give inaccurate readings. The components are:

  • CPU System. This workload is not coming from the VM; it’s coming from ESXi, executing on behalf of the VM. It does not run inside the Guest OS threads.
  • Frequency scaling and HT. They are not relevant in the context of VM CPU utilization.
    • A VM consuming a CPU at 100% is running at 100%, regardless of whether the 2nd HT thread runs or not. The fact that the 2nd thread is also running does not mean the VM utilization is 62.5%.
    • The same applies to changes in frequency. Frequency makes the VM faster or slower; we need to distinguish the utilization use case from the capacity and performance use cases.
  • VMX. This should not be charged.

How does the super metric differ from CPU Usage? We can actually plot it. I took a sample of 480 VMs in my lab. I used the View List widget to list VM Name, VM CPU Usage and VM Guest OS Usage, exported the list into a spreadsheet, then used a simple formula to compare the 2 values.
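
If you prefer code over a spreadsheet, here is a sketch of the same comparison in Python with pandas. The file and column names are hypothetical; adjust them to match your export:

import pandas as pd

df = pd.read_csv("vm_cpu_export.csv")  # columns: Name, CpuUsagePct, GuestOsUsagePct
df["DiffPct"] = df["CpuUsagePct"] - df["GuestOsUsagePct"]
over = df[df["DiffPct"] > 5]    # CPU Usage over-reporting
under = df[df["DiffPct"] < -5]  # CPU Usage under-reporting
print(f"{len(over)} over-reported, {len(under)} under-reported, {len(df)} total")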

The result is interesting.

There are situations where CPU Usage is over-reporting. Take example no. 2 below: it’s reporting 85% when it’s only 72%. I’m not too worried about this, as this is simply a classic case of over-sizing.

There are situations where CPU Usage is under-reporting. Take the last example: it’s reporting 55%, but the reality is 93%. You would have thought the VM is fine, when the VM actually needs more CPU. In this case, you need to ensure that Ready and Co-Stop aren’t a factor.

I plotted all 480 values on a line chart to get the big picture. I noticed that most of the time, it’s correct. That’s good news: the bulk of the data is within a 5% difference. The black arrow indicates where CPU Usage is over-reporting, while the red arrow indicates where it is under-reporting.

You can do the profiling in your environment too, and discover interesting behaviour in production 🙂

VM Availability Monitoring

VM availability is a common requirement, as the IaaS team is bound by an availability SLA. The challenge is reporting it.

The up time of a VM is more complex than that of a physical machine. Just because the VM is powered on does not mean the Guest OS is up and running. The VM could be stuck at BIOS, Windows may have hit a BSOD, or the Guest OS may simply hang. This means we need to check the Guest OS. If we have VMware Tools, we can check for its heartbeat. But what if VMware Tools is not running, or not even installed? Then we need to check for signs of life. Does the VM generate network packets, issue disk IOPS, consume RAM?

Another challenge is the frequency of reporting. If you report every 5 minutes, what if the VM was rebooted within those 5 minutes, and it comes back up before the 5th minute ends? You will miss the fact that it was down within that window!

From the above, we can build a logic:

If VM is Powered Off then
  Return 0. The VM is definitely down.
Else
  Calculate the up time within the 300-second period
End

In the above logic, to calculate the up time, we need first to decide if the Guest OS is indeed up, since the VM is powered on.

We can deduce that the Guest OS is up if it’s showing any sign of life. We can take:

  • Heartbeat
  • RAM Usage
  • Network Usage
  • Disk IOPS

Can you guess why we can’t use CPU Usage?

A VM does generate CPU usage even when it’s stuck at BIOS! We need a counter that shows 0 when the Guest OS is down, not merely a very low number. An idle VM is up, not down.

So we need to know if the Guest OS is up or down. We are expecting binary, 1 or 0. Can you see the challenge here?

Yes, none of the counters above gives you a binary value. Disk IOPS, for example, can vary from 0.01 to 10,000. The “sign of life” does not come as binary.

We need to convert them into 0 or 1. The 0 is the easy part, as the counters will be 0 when the Guest OS is down.

I’ll take Network Usage as an example.

  • What if Network Usage is >1? We can use Min (Network Usage, 1) to return 1.
  • What if Network Usage is <1 but >0? We can use Round Up (Network Usage, 1) to return 1.

So we can combine the above formulas to get 0 or 1, as the sketch below shows.
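
In Python terms, the conversion looks like this minimal sketch:

import math

def sign_of_life(value):
    # Round up first (0.0001 -> 1), then clamp (10000 -> 1); 0 stays 0.
    return min(math.ceil(value), 1)

for v in (0, 0.0001, 0.8, 1, 10000):
    print(v, "->", sign_of_life(v))  # 0 -> 0; everything else -> 1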

The last part is to account for partial up time, when the VM was rebooted within the 300-second sampling period. The good thing is vR Ops tracks the OS up time every second. So every 5 minutes, the value goes up by 300 seconds. As a VM normally runs for more than 5 minutes, you end up with a very large number. Our formula becomes:

If the up time is >300 seconds then return 300 else return it as it is.

Implementation

Let’s now put the formula together. Here is the logical formula:

(Is VM Powered on?) x 
(Is Guest OS up?) x 
(period it is up within 300 seconds)

“Is VM Powered on” returns 0 or 1. This is perfect: if the VM is powered off, the whole formula returns 0.

“Is Guest OS up” returns 0 or 1. It returns 1 if there is any sign of life.

We take the Maximum of OS Uptime, Tools Heartbeat, RAM Usage, Disk Usage, and Network Usage. If any of these is not 0, then the Guest OS is up.

We use Minimum (Is Guest OS up, 1) to bring the number down to 1.

Since the VM can be idle, we use Round Up to 1. This will round up 0.0001 to 1 but not round up 0 to 1.

To determine how long the VM is up within the 300 seconds, we simply take Minimum (OS Uptime, 300)

To convert the number into a percentage, we simply divide by 300 then multiply by 100.
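
Before looking at the actual super metric, here is the whole logic as a Python sketch, with hypothetical metric values as inputs:

import math

def vm_uptime_pct(powered_on, heartbeat, ram, disk, net, os_uptime_sec):
    # powered_on is 0 or 1; the rest are raw metric values for the window.
    alive = min(math.ceil(max(heartbeat, ram, disk, net, os_uptime_sec)), 1)
    up_seconds = min(os_uptime_sec, 300)
    return powered_on * alive * up_seconds / 300 * 100

print(vm_uptime_pct(1, 1, 40.2, 12.5, 0.3, 7368269142))  # 100.0 (Scenario 2)
print(vm_uptime_pct(0, 0, 0, 0, 0, 0))                   # 0.0   (Scenario 1)
print(vm_uptime_pct(1, 0, 3.1, 0, 0, 120))               # 40.0  (rebooted 2 min ago)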

Here is what the formula looks like

Can you write the above formula differently? Yes, you can use If Then Else. I do not use it as it makes the formula harder to read. It’s also more resource intensive.

Let’s now show the above formula using the actual vR Ops super metric syntax. I’ve optimized the last bit to /3; no point multiplying by 100 then dividing by 300 😉

I’m using the $This feature as the formula refers to the VM itself. I’m using metric= and not attribute= as I only need 1 value.

Validation

Let’s now take a few scenarios and run them through the formula.

Scenario 1: the VM is powered off.

  • This will return 0, since the first term is already 0 and multiplying anything by 0 gives 0.

Scenario 2: ideal scenario. The VM is up, active and has VMware Tools

  • The OS Uptime metric, since it’s in seconds and cumulative, will be much larger than the other counters. The Max () will return 7368269142, but Min (7368269142, 1) will return 1.
  • The Min (300, 7368269142) will return 300.
  • So the result is 1 x 1 x 300 / 3 = 100. The VM Uptime for that 5 minute period is 100%.

Scenario 3: a less ideal scenario. The VM is up, but is idle and has no VMware Tools.

Example

Let’s show an example of how the super metric detects an availability issue. Here is a VM that has an availability problem. In this case, the VM was rebooted regularly.

The VM Uptime (%) super metric reports it correctly. It’s 100% when the VM is up and 0% when it’s down. The super metric matches the Powered On metric and the OS Uptime metric.

Let’s check if the super metric detects the up time within the 5 minutes. To do that, we can zoom into the time it was down. From the Powered On and OS Uptime metrics, we can see it was down for around 10–15 minutes. The super metric detects that: the up time went down to 0 for 10 minutes, then was partially up in the last 5 minutes.

Here is the up time in the last 5 minutes. So the VM came back up within those 5 minutes.

Limitation

The limitation is within the 5-minute period: the OS has to be up by the 5th minute. If the value at collection time is 0, the whole period is calculated as 0. So if the VM is up for 4:59, then goes down at exactly the 5th minute, Powered On will return 0 and the whole period is reported as down.