Category Archives: Products

Articles covering specific products, such as VMware vSphere, vCloud Suite, NSX, vRealize, Horizon Suite.

Operationalize Your World with vRealize Operations 6.7

Hard to believe Aria, the code name for vRealize Operations 6.7 is finally here with us. I worked closely with the R&D team and was privileged to see how the man in the arena fought their best. Many folks (Monica, Sunny, Karen, Esther and countless others) worked long hours, including during holiday and in flight. I remember many times we’re having lunch at 3 pm and dinner at 10 pm as we developed and tested the products.

vRealize Operations 6.7 enables changes that makes Operationalize Your World better. I will not repeat the generic product benefits (e.g. scalability, usability, performance). I’d focus on specific benefits that I’ve taken advantage of (meaning it requires changes on my end, and it does not come to Operationalize Your World automatically. My goal is to show you what changes you can make to make your customisation even better. You will see that some of the dashboards change substantially, because the widgets enable me to do things differently. The View List widget ability to drive other widget is one of those features I apply to have a different dashboard experience.

I also use Percentile instead of Maximum. Maximum is good as early warning as it captures the outlier. In capacity use cases, you want to base on something more realistic. A 99th percentile provides additional data that help you make better judgement. In other cases, I replace Average with 95th percentile, as Average is simply not practical in operations.

With that, let’s go through the changes dashboard by dashboard. I won’t explain every single one in details, else you get bored 🙂 My goal is to give you a real world example on the type of customization you can do for your bespoke operations.

At the end of this blog, I list the steps required to upgrade Ops Your World to be compatible with vR Ops 6.7. Without it, some dashboards will not work.

Performance: Overall

  • The table that lists the clusters now sport color coding. Now that it’s using View instead of Object List, the columns can have proper unit. For example, 2.5 TB RAM instead of a long number in KB.
  • Added more insight into Resource Pool, since you need to ensure each pool has equal number of members.
  • I highlight clusters that have too few hosts as that means your HA overhead is high. A cluster with 4 nodes means your HA overhead is 25%.
  • I highlight clusters that have too few physical cores. Your VMware license gives you unlimited cores per socket. Microsoft has stopped giving unlimited cores per socket.
  • I merged the Affected VMs dashboard to make it easier. You no longer have to switch into another dashboard. I rounded the number to 1 decimal point to drive a point that you should not look beyond 1 decimal point.
  • Because Affected VM is gone, it’s 1 less dashboard-sprawl 🙂
  • It cannot pass the VM object when you move to other dashboard. This is because a limitation of View List widget, where it cannot drive a widget on another dashboard. Yes, we are aware of this limitation 🙂

Performance: Villain VMs

  • It’s now an independent dashboard. You now need to select the cluster again.
  • Added CPU and RAM, to complement Disk and Network.
  • I’m not using Cluster average, as unbalance is possible, especially if you have large VMs, resource pool, share and limit.
    • For CPU: I use CPU GHz, so it takes into account the VM CPU Size
    • For RAM: I use RAM Consumed. This is more accurate than Guest OS metrics as this is what ESXi maps to the physical DIMMs.
  • I’m using Health Chart so it can display in color. Because of HA, your ESXi utilization should below, as your HA host is participating.

Performance: Single VM Monitoring

  • I disable the Received Dropped Packet. I think vCenter drops the packet when it’s not intended for the VM, but the counter counts it. So it’s a false positive.

Performance: Single VM Troubleshooting

  • You no longer have to select the VM again. It’s automatically passed from Single VM Monitoring dashboard. This is handy as you may have 1000s of VM.
  • You no longer have to figure out which property to choose to find out the parent ESXi. It’s automatically selected, plus the name of the current ESXi is automatically shown. This is good as some customers could not find the property.
  • You also do not need to clear the metric chart, which is not intuitive.
  • The dashboard has been rearranged. It is neater and easier.

Capacity: Clusters Capacity

  • The table now provides an easier way to see capacity across clusters. The limitation of the health chart was it’s not scalable if you have lots of clusters. You also need to see 3 months data, which can be hard to read since there are thousands of data points.
  • Clicking on the table lets you drill down. I now add current allocation as some customers do not overcommit RAM.
  • I’ve removed 2 dashboards (Overcommit Clusters and Non-Overcommit Clusters) as I found almost all customers mix allocation and utilization. You overcommit CPU but do not overcommit RAM. To me, this is actually the right thing to do. The reason is most VM uses 2 – 8 GB of RAM for each vCPU. A 32 vCPU VM needs somewhere between 64 – 256 GB RAM. While Intel Xeon Platinum has 24 core, you may find the premium too high. I see most ESXi uses the 16-core Xeon, giving a total of 32 cores. So if you do not overcommit CPU, all you need is 64 – 256 GB of RAM. Buying 512 GB is simply a waste.

Capacity: Idle and Powered Off VMs

  • When you upgrade, the number you see may differ. This is due the new capacity engine. I’ve modified the group selection criteria to comply with the new engine.
  • The heatmap has been improved. Easier to see where these VMs are, as the group names are clearer. The color now attracts you to focus on those VMs, giving you more bang for the buck

Capacity: Oversized VM (CPU)

  • For Guest OS CPU utilisation, I’m replacing CPU Demand with a new supermetric. As you can see here, VM CPU Demand counter includes workload that is not coming from the VM. They should not be counted. Demand is also affected by frequency scaling and HT, which is not relevant in the context of VM CPU utilization. A VM is consuming a CPU at 100%, regardless whether the 2nd HT runs or not. The fact that the 2nd HT runs at 100% does not mean the VM utilization is 62.5%. We need to distinguish utilisation from capacity and performance use cases.
  • In the table that lists all the VM utilization, I replaced Average utilization with 95th percentile. It gives you more confidence to right size.

Capacity: Oversized VM (RAM)

  • We know that RAM is used as cached by OS. As a result, the memory consumption tends to be high. I change the dashboard to focus on what you can safely claim, which is the Free RAM.
  • The heatmap now focuses on Free RAM. The larger the box, the more you can claim. The color now indicates how safe it is to claim it.

NOC: Capacity ESXi Utilization

  • I changed ESXi Active to Balloon. It’s a better indicator of ESXi Memory Utilization

Storage: Capacity 

  • I merged the overall and detail dashboards into 1.
  • Limitation: to see the powered off VMs, you need to select the Datastore heatmap instead of Datastore cluster. This can be fixed by using supermetric, but since I’ve created almost 100, I thought I’d give you chance to practice 😉

Easier Import & Export

  • No need to import a dummy policy just to import super metric, and delete policy once imported. Now you know exactly what super metrics are imported.
  • You can also export the super metrics in bulk. Useful when replicating changes to your non-Prod vR Ops instance.
  • No need to manually create XML files for resource interaction. This is now configurable in the UI.

Upgrading Operationalize Your World

If you do not customize Ops Your World dashboards, super metrics, views, etc., then your upgrade is easier. It consists of 2 steps:

  1. Download and Import Ops Your World. Yes, overwrite your existing ones.
  2. Enable the super metrics in your Default policy (it is marked with D)

Enable these metrics & properties. They were disabled as majority of vR Ops users are small companies. Operationalize Your World targets the big deployment.

  • VM: Guest File System | Total Guest File System Free (GB)
    • Needed to show low disk space in absolute amount, as % alone does not tell the full picture. 1% of 2 TB vs 1% of 50 GB aren’t the same
  • VM: CPU Used.
    • Needed to show the individual vCPU. Ideally, you only want this on large VMs. If your environment is >10K VM, you can create a separate policy and enable the supermetric for this group only.
  • VM: Summary | Number of Datastores
    • Once you clean up, you can disable it back. Your IaaS architecture & operations policy should either put all VMDKs of a VM in 1 datastore or intentionally separate them. For ease of operations, put them in 1 datastore.
  • VM: Memory | VM Reservation
  • VM: Heartbeat
    • For better detection of Guest OS uptime. This is kind of redundant, as we already have OS Uptime counter.
  • VM: CPU Run
    • Used in VM Right Sizing. I’ve developed a new model to measure VM CPU Utilization. It uses this counter. I’m keen to get real world feedback from you.
  • VM: CPU Overlap
    • Used in VM Right Sizing.
  • ESXi: Error Packets Received
    • They were disabled as the remediation is likely not at vSphere layer.

You can safely delete the XML configuration files as they are no longer required.

A test of your IaaS Operations maturity

What you architect is SDDC. What you handover as business result to CIO is IaaS. We can assess if the architecture is good or not, based on the actual result in production. Does it result in fire-fighting and blame-storming? Or you have a peaceful operations?

The litmus test below helps you assess the maturity of your IaaS.

Do your customers blame your infrastructure?

  • If the answer is yes, take a step to ask yourself why. There is a high chance you’re relying on complaint in your operations. So you actually encourage it. No complaint, no problem. A Complaint-based Operations.
  • The reason why you rely on complaint is you don’t have other means. You have not defined the performance of your IaaS.
  • A sign of matured operations is you have Performance SLA. It is per-VM, measured every 5 minutes.

Is your IaaS cheaper than both VMware on Amazon and Amazon?

  • If not, your CIO may question your business value. The reason for having an in-house architect is so you can bring lower cost, after taking into account your salary.

 Does Help Desk provide a good first level defense?

  • If Help Desk simply passes through to the next level, you need to look at why.
  • Help Desk is your first line of defence. They are not as technical as you are. Equip them with simple dashboard so they can handle VM Owner complaint:
    • Is the problem caused by IaaS not serving the VM well?
    • If yes, which part of the Infra: CPU, RAM, Disk, Network?
    • If not, how to prove it convincingly?

Can you justify new infrastructure when utilization is not yet high?

  • This is not referring to additional money that comes with new project. This is referring to existing clusters/storage.
  • Capacity is measured on utilization and performance. A cluster capacity is full if it can’t serve its VMs well. Since it takes time to buy hardware, you need to have have early warning to detect this performance degradation.

Do you struggle with many over-provisioned VMs?

  • This is an indicator that you’re operating as a System Builder as opposed to a Service Provider.
  • As a System Builder, you’re meddling with each System (read: Application). You size them, and argue with application team.
  • As a Service Provider, you’re not “on the way”. IT simply uses an effective pricing model to drive the right behaviour. Does AWS block you when you buy 40 CPU EC2 VM when you only need 2 CPU?

Does Troubleshooting mean all hands on deck?

  • Do you have a process that is followed by all teams (network, storage, server, OS, application)? Does that process end with Root Cause Analysis?
  • As part of RCA, do you set up alert so issue can be detected faster if it happens again?

There are more questions, but I thought we start with those first. If you want to see details, download this.

VM CPU Counters in vSphere

I couldn’t find a document that explains VM CPU counters in-depth. Having read many sources and consulted folks, here is what I got so far. Correction is most welcomed.

The Basic

At the most basic level, a VM CPU is either being utilized or not being utilized by the Guest OS.

  • When it’s being utilized, the hypervisor must schedule it.
    • If VMkernel has no physical CPU to run it, then the VM is placed into Ready state. The VM is ready, but the hypervisor is not. The Ready counter is increased to account for this.
    • If VMkernel has all the required physical CPUs to run it, then the VM gets to run. It is placed into Run state. The Run counter is increased to track this.
    • If VMkernel has only some of the CPUs, then it will run the VM partially. Eventually, there will be unbalanced. A VM with >1 CPU may have its CPU stopped if it advances too far. This is why it’s important to right size. The Co-Stop counter is increased to account for this.
  • When it’s not being utilized, there are 2 possible reasons
    • The CPU is truly idle. It’s not doing work. The Idle Wait counter accounts for it.
    • The CPU is waiting for IO.
      • CPU, being faster than RAM, may have to wait for IO to be brought in. The IO Wait counter accounts for this

With the above explanation, the famous state diagram below makes more sense. I took the diagram from the technical whitepaper “The CPU Scheduler in VMware vSphere® 5.1”. It is a mandatory reading for all vSphere professionals, as it’s a foundational knowledge.

A VM CPU is on one of these 4 states: Run, Ready, Co-Stop and Wait.

  1. Run means it’s consuming CPU cycle.
  2. Ready means it is ready to run, but ESXi has no physical cores to run it.
  3. Co-Stop only applies to vSMP VM. A VM with >1 CPU may have its CPU stopped if it advances too far. This is why it’s important to right size.
  4. Wait means the CPU is idle. It can either be waiting for IO, or it really is idle.

Advanced Topic

If the above was all we need to know, monitoring VMware vSphere would have been easy. In reality, the following factors must be considered:

  1. Hyper Threading
  2. Interrupt
  3. System time
  4. Power Management

Hyper Threading (HT)

Hyper Threading (HT) is known to deliver performance boost that is lower than what 2 physical cores deliver. Generally speaking, HT provides 1.25x performance boost in vSphere. That means if both threads are running, each thread only gets 62.5% of the shared physical core. This is a significant drop from the perspective of each VM. From the perspective of each VM, it is better if the second thread is not being used, because the VM can then get 100% performance instead of 62.5%. Because the drop is significant, enabling the latency sensitivity setting will result in a full core reservation. The CPU scheduler will not run any task on the second HT.

The following diagram shows 2 VMs. Each run on a thread of a shared core. There are 4 possible combinations of Run and Idle.

Each VM runs for half the time. The CPU Run counter = 50%, because it’s not aware of HT. But is that really what each VM gets, since they have to fight for the same core?

The answer is obviously no. Hence the need for another counter that accounts for this. The diagram below shows what VM A actually gets.

The CPU Used counter takes this into account. In the first part, VM A only gets 62.5% as VM B is also running. In the second part, VM A gets the full 100%. The total for the entire duration is 40.625%. CPU Used will report this number, while CPU Run will report 50%.

If both threads are running all the time, guest what CPU Used and CPU Run will report?

62.5% and 100% respectively.

Big difference. The counter matters.

Power Management

The 2nd factor that impacts CPU Accounting is CPU speed. The higher the frequency (GHz), the faster the CPU run. All else being equal, a CPU that run at 1 GHz is 50% slower than when it runs at 2 GHz. On the other hand, Turbo Mode can kick in and the CPU clock speed is higher than stated frequency. Turbo Boost normally happens together with power saving on the same CPU socket. Some cores are put to sleep mode, and the power saving is used to turbo mode other cores. The overall power envelope within the socket remain the same.

In addition, it takes time to wake up from a deep C-State. For details on P-State and C-State, see Valentin Bondzio and Mark Achtemichuk, VMworld 2017, Extreme Performance Series.

Because of the above, power management must be accounted for. Just like Hyper-Threading case, CPU Run is not aware of this. CPU Used takes into CPU Frequency Scaling.

Does it mean we should always set power management to maximum?

  • No. ESXi uses power management to save power without impacting performance. A VM running on lower clock speed does not mean it gets less done. You only set it to high performance on latency sensitive applications, where sub-seconds performance matters. VDI, VoIP & Telco NFV are some examples that require low latency.

Stolen or Overlap

When ESXi is running a VM, this activity might get interrupted with IO processing (e.g. incoming network packets). If there is no other available cores in ESXi, VMkernel has to schedule the work on a busy core. If that core happens to be running VM, the work on that VM is interrupted. The counter Overlap accounts for this, although some documentation in VMware may refer to Overlap as Stolen. Linux Guest OS tracks this as Stolen time.

A high overlap indicates the ESXi host is doing heavy IO (Storage or Network). Look at your NSX Edge clusters, and you will see the host has relatively higher Overlap value.

System Time

A VM may execute a privilege instruction, or issue IO. These 2 activities are performed by the hypervisor, on behalf of the VM. vSphere tracks this in separate counter called System. Since this work is not performed by any of the VM CPU, this is charged to the VM CPU 0. The system services are accounted to CPU 0. You may see higher Used on CPU 0 than others, although the CPU Run are balanced for all the VCPUs. So this is not a problem for CPU scheduling. It’s just the way VMKernel does the CPU accounting.

Note that this blog refers to CPU accounting, not Storage accounting. For example, vSphere 6.5 no longer charges the Storage vMotion to the VM.

Conclusion

The relationship among the counters are shown below. To make it simple, each VM is 1 CPU VM.

CPU Used can differ to CPU Run as it takes into account Hyper Threading and frequency changes, includes System time, and excludes Overlap. Because of this, CPU Used is a better reflection of the actual usage than CPU Run.

The counters above is for each CPU in the VM. A VM with 8 CPU will have 8 x 100% = 800%. Other than the CPU, a VM world has other ancillary processes (e.g. MKS world, VMX world) in the ESXi kernel, but they are typically negligible.

What counters are missing from the above diagram? There are 2 key counters, which are critical in Performance and Capacity. Can you name them?

You’re right. They are CPU Demand and CPU Contention.

Let’s talk about CPU Demand first. The diagram below now has CPU Demand. I’ve also added CPU Wait for completeness.

CPU Demand captures the full demand of a VM. It includes the CPU Ready and CPU Co-Stop. This is what you want to see if you want to see the full demand of a VM.

vCenter VM CPU Usage

Before we cover CPU Contention, which is a performance, there is 1 more utilization counter we need to check. Can you guess what is it?

Hint: it does not exist in ESXi. It only exists in vCenter.

CPU Usage.

This counter takes 2 forms: Usage in MHz and Usage in %.

CPU Usage (%) is a rounding of CPU Usage (MHz), not the other way around. The calculation is done first in GHz, then converted into %.

Since vCenter is only a reporting software, it has to base on ESXi. Mapping to Run and Demand do not seem logical. Mapping to Used makes the most sense. I plotted the 2 counters. They are not identical.

The reason is vCenter CPU Usage includes VMX world. Read this good article to understand it better. VMX world exists for each vCPU.

Guest OS CPU Utilization

Now that you know the hypervisor VM CPU counters, can you suggest how it impact the Guest OS? I consulted Valentin Bondzio, someone I consider the #1 authority on this topic. He said “What happens to you when time is frozen?”

That’s a great way to put it. As far as Guest OS is concerned, time is frozen when it is not scheduled. 

  • Guest OS experience frozen time when hypervisor deschedules it. Time jumps when it’s scheduled again.
  • Guest OS CPU Usage isn’t aware of stolen time. For this counter to be aware, its code has to be modified. If you know that Microsoft and Linux has modified this counter, let me know in which version they make the change.
  • Guest OS Stolen Time accounts for it. But that’s in Linux, not Windows.

The table below shows the impact of various scheduling events.

The diagram below shows the lack of visibility. Notice most of them are below the VM.

I hope the above helps explains why you should not use Guest OS CPU Utilization counters.

To measure the Guest OS usage, use this formula:

Guest OS Usage = VM CPU Run - Overlap + Ready + CoStop.

Notice I do not use VM CPU Demand or VM CPU Usage counters. Can you guess why?

The problem is both counters are contaminated with components that can give inaccurate readings. The components are:

  • CPU System. This workload is not coming from the VM. It’s coming from ESXi, executing on behalf of the VM. This does not run inside the Guest OS threads.
  • Frequency scaling and HT. They are not relevant in the context of VM CPU utilization.
    • A VM is consuming a CPU at 100%, regardless whether the 2nd HT runs or not. The fact that the 2nd HT runs at 100% does not mean the VM utilization is 62.5%. The guest is actually running at 100%
    • The same applies to changes in frequency. It makes the VM faster or slower. We need to distinguish utilisation from capacity and performance use cases.
  • VMX. This should not be charged.

How does the supermetric differ to CPU Usage? We can actually plot it. I take a sample of 480 VM in my lab. I use the View List widget to list VM Name, VM CPU Usage and VM Guest OS Usage. I exported into a spreadsheet, then use a simple formula to compare the 2 values.

The result is interesting.

There are situation where CPU Usage is over-reporting. Take example no 2 below. It’s reporting 85% when it’s only 72%. I’m not too worried about this, as this is simply a classic over-size.

There are situation where CPU Usage is under-reporting. Take the last example. It’s reporting 55%, but the reality is it is 93%. You would have thought the VM is fine, when the VM actually need more CPU. In this case, you need to ensure that Ready and Co-Stop aren’t a factor.

I plot the entire 480 values over a line chart, so I get the big picture. I notice that most of the time, it’s correct. That’s a good news. The bulk of the data is <5% difference. The black arrow indicates CPU Usage is over-reporting, while the red arrow indicates it is under-reporting.

You can do the profiling in your environment too, and discover interesting behaviour in production 🙂