20-second Peak Metrics

vRealize Operations 8.3 sports a set of 12 metrics that capture the highest 20-second average in any given 5-minute period (the default collection interval). Why only 12, and how were they chosen?

First, some background. vRealize Operations collects and stores data every 5 minutes. This is good enough for the monitoring use case, but not for troubleshooting. A 300-second average is not granular enough, as a performance problem may not be sustained that long. Even a performance issue that lasts hours may consist of repeated short bursts. On the other hand, 5-minute data produces too much noise for Capacity, where you need the overall trend over 3 months or 1 year. Are you going to increase the VM vCPU size just because it has high utilization for 20 seconds to 1 minute? What if that’s caused by Windows and not real business workload? I’m keen to learn if I’m wrong.

Take a look at the table below. It shows a VM with 2 virtual disks. Each disk has its own read latency and write latency, giving us a total of 4 counters.

While vRealize Operations collects every 300 seconds, it actually grabs 15 data points. Why 15? Those 15 match the 20-second data points that vCenter produces (15 x 20 seconds = 300 seconds). Each 20-second data point is an average of the entire 20 seconds. So all along, vR Ops actually has 20-second visibility. However, it averages these 15 data points, losing the 20-second granularity.

3 different kinds of summary

What vR Ops 8.3 does is add a new metric. It does not change the existing metric, because both have their own purpose. The 5-minute average is better for your SLA and performance guarantee claim. If you guarantee 10 ms disk latency for every single IO, you’d be hard pressed to deliver that service. These new counters act as an early warning. They give you an internal threshold that you use to monitor if your 5-minute SLA is on its way to being breached.

vR Ops 8.3 takes the peak of these 15 data points, and stores it every 5 minutes. It does not store all 15 data points, because that will create a lot more IOPS and consume more storage. It answers the question “Does the VM or Guest OS experience any performance problem in any 20-second period?”
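To make the difference between the two summaries concrete, here is a minimal Python sketch. It is not how vR Ops is implemented; the function name `summarize_cycle` and the latency values are made up for illustration. It collapses one 5-minute cycle of 15 samples into the regular average and the 20-second peak.

```python
from statistics import mean

SAMPLES_PER_CYCLE = 15  # 15 x 20-second averages = one 300-second collection cycle


def summarize_cycle(samples_20s):
    """Collapse one collection cycle into the two values that get stored.

    `samples_20s` is a list of 15 values, each already a 20-second average.
    Hypothetical helper for illustration, not a vR Ops API.
    """
    if len(samples_20s) != SAMPLES_PER_CYCLE:
        raise ValueError("expected 15 samples of 20 seconds each")
    return {"average": mean(samples_20s), "peak": max(samples_20s)}


# Made-up disk latency in ms: one 20-second burst in an otherwise quiet cycle.
latency_ms = [2, 2, 3, 2, 2, 2, 40, 2, 3, 2, 2, 2, 2, 2, 2]
summary = summarize_cycle(latency_ms)
print(summary)
```

In this made-up cycle the 5-minute average stays below 5 ms while the peak shows the 40 ms burst, which is the kind of short problem the new metrics are meant to surface.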

Having all the 20-second data points is more natural to us, as we’re used to 1 second in Windows and 20 seconds in vCenter performance charts. But how do those additional 14 data points change the end remediation action? If the action you take to troubleshoot is the same (e.g. adjust the VM size), why pay the price of storing 15x more data points?

If you need to store them all, vR Ops Cloud does it for you. Note that it’s limited to 7 days, while this technique lets you store for 6 months as it’s just like any other regular metric.

In the case of virtual disk (as opposed to say memory), a VM can have many of them. A database VM with 20 virtual disks will have 40 peak counters. That also means you need to check them one by one. So what vR Ops 8.3 does is take the peak among the reads and writes of all virtual disks. It does the same thing with vCPU. A monster VM with 64 vCPU will only have 1 metric, but this metric is the highest among the 64 virtual CPUs. There is no need to have visibility into each vCPU as the remediation action is the same. Whether it’s vCPU 7 or vCPU 63 that has the problem, it does not change the conclusion of troubleshooting in most cases.
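The "peak among all virtual disks" idea can be sketched as a maximum taken across every disk, both directions, and every 20-second sample. This is only an illustration under assumed names: `vm_peak_latency`, the disk labels and the values are hypothetical, and only 3 samples per series are shown instead of 15 to keep it short.

```python
def vm_peak_latency(per_disk_samples):
    """Return the single highest 20-second latency across all disks.

    `per_disk_samples` maps a disk name to {"read": [...], "write": [...]},
    where each list holds the 20-second averages of one collection cycle.
    Hypothetical helper for illustration, not a vR Ops API.
    """
    return max(
        value
        for counters in per_disk_samples.values()
        for series in counters.values()
        for value in series
    )


# Made-up latency values in ms for a VM with 2 virtual disks (4 counters).
disks = {
    "scsi0:0": {"read": [1.0, 2.0, 1.5], "write": [3.0, 2.5, 2.0]},
    "scsi0:1": {"read": [1.0, 12.0, 1.0], "write": [2.0, 2.0, 2.0]},
}
peak = vm_peak_latency(disks)
print(peak)
```

The result is one number per VM per cycle, regardless of whether the VM has 2 disks or 20. You lose which disk had the problem, which is the trade-off the paragraph above describes.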

Why these 12?

The next question is naturally why we picked the above 12. You notice they are only VM counters. There are no ESXi, Resource Pool, Datastore, Cluster, etc. counters. The reason is that the counters at these “higher-level” objects are mathematically an average of the VMs in the object. A datastore with 10 ms disk latency represents a normalized average of all the VMs in the datastore. In other words, these counters give less visibility than the 12 above, and they can be calculated from the 12. And 1 more reason:

You troubleshoot VM, not infrastructure. If there is no VM, there is no problem :-)

Among the 12 counters, you notice only 1 counter tracks utilization. The other 11 track contention. Utilization is not a counter for performance. It’s a counter for capacity. The higher the utilization, the more work gets done, and hence the better the performance. Utilization at 100% is in fact the best possible performance, so long as there is no contention. Since we can track contention explicitly, the utilization counter becomes a secondary, supporting counter.

Why are Guest OS level metrics provided? Because they do not have a VM equivalent, and they change the course of troubleshooting. If you have a high CPU run queue, you look inside Windows or Linux, not at the underlying ESXi.

For CPU, the complete set of contention counters is provided. There are 6 counters tracking the different types of contention or wait that the CPU experiences.

For Memory, popular metrics such as Consumed, Active, Balloon, Swap, Compress, Granted, etc. are not shown as they do not indicate a performance problem. Memory Contention is the only counter tracking if the VM has a memory problem. VM and Guest OS can have memory problems independently. In future, we should add Guest OS memory performance counters, if we find a good one. Linux and Windows do not track memory latency, which is the main counter for performance. They only track memory disk space consumption, throughput and IOPS.

For Network, vCenter does not have latency and re-transmit counters. It has dropped packets, but unfortunately this is subject to false positives. So we have to resort to a utilization metric. In future, we should add packets per second.

To use, enable them in the policy

The 12 metrics are disabled by default. Let us know if you think we should enable them by default, and what VM metrics we should disable to compensate 🙂

Lastly, just in case you ask why we do not cover Availability (e.g. something goes down): the reason is that this is better covered by events. Log Insight does a better job on this.

OYW: Installation

Deploying the Operationalize Your World kit should only take a few minutes if you understand its components well.

Download the kit from here. Download each file. Yup, it’s just a dropbox link. No login required.

If you do not zip when downloading, you do not need to unzip when importing

Log in to vR Ops with an account that has administrator privileges. It’s easier if you just use the built-in admin account as you’re overwriting contents, and will be sharing these with others.

Import the super metrics. See below on how to do it

Enable all these super metrics in default policy. No need to modify other policies, as they will automatically inherit.

Make sure it’s your Default Policy. It is the last policy. Lowest priority.

To do a mass enable, click Attribute Type, and deselect Metric and Property. This will only show super metrics, which makes it easier to mass select.

Now that only super metrics are shown, you can mass enable them

To mass enable, sort by Object Type. This will push All Object Types to the bottom, which is what we want, as we do not want to select them. If you enable the super metrics for All Object Types, every single type of object will have these super metrics, which is not what you want.

From the Actions menu, choose State \ Enable.

Once you have selected them all, click Save. It takes 5 minutes for the super metrics to start calculating. It may take longer for larger environments and clustered vR Ops.

Enable some existing metrics. You need to enable these metrics

Cluster object:

  • Max VM CPU Ready
  • Max VM CPU Co-Stop
  • Percentage of VMs facing CPU Co-Stop
  • Percentage of VMs facing CPU Ready
  • Percentage of VMs facing Memory Contention

VM Object:

  • CPU Overlap
  • CPU Run
  • System Heartbeat Latest
  • Guest|Context Swap Rate.
    This is the CPU Context Switch, which is useful in identifying application performance issue

Now that you’ve got the data, it’s a matter of visualizing.

Import the views. See screenshot below. Repeat for each views.zip file.

Again, do not forget to overwrite existing.

Import the dashboard. Again, choose to overwrite

That’s it! Wait for around 10 minutes and the dashboards will be initialized.

Go to the Dashboards drop-down menu, and you will see 5 new top-level categories. They are highlighted in green.

Operationalize Your World replaces the existing ones. Think of it like a powerpack! 🙂

This is how the old categories compare with the new categories

If you want to tidy up, use the built-in admin account, and simply disable or delete the dashboards under those categories.

Yes, that’s it! No need to create new policies, custom group, group type, XML, etc.

Bonus step!

If you want, you can overwrite the home page of vSphere Optimization Assessment. This is just for convenience.

VM rightsizing

VM right-sizing is a commonly misunderstood term, because there are actually multiple use cases. It is not one size fits all, hence there is more than 1 formula. Here are 5 popular use cases:

Your App Team asks for extra vCPU. In this case, the hypervisor overhead is irrelevant. When you size NSX edge vCPU, you do not need to add extra CPU for the overhead. It’s done outside Linux.

You’re migrating a VM to an ESXi with 2x the speed. For example, from a 2 GHz ESXi to a 4 GHz one. All else being equal, you can cut the VM size in half. A 16 vCPU VM becomes 8. But you’re worried about causing a queue inside the Guest OS.

You’re bulk migrating many VMs to another cluster, with no changes to their configuration. Consider 2 VMs. Both are running Windows Server 2019 and have 64 vCPU. Both are running hot, but one of them is very heavy on IO. It sends a lot of network packets and does lots of disk IO. This 2nd VM has a very different footprint on the ESXi. It’s much more demanding. All that IO needs to be processed by other physical cores.

Your boss asks you to properly charge customers, accounting for what they are actually demanding. Would you charge the 2 VMs above the same way? You might for practical reasons, quietly distributing the cost equally, but you know you’re not being fair 😉

You’re planning a tech refresh for Cluster X. It has 24 ESXi and 1000 VMs. You are hoping to reduce the infrastructure to 12 ESXi, hence you increase the CPU speed and add cores per socket. Do you consider individual VMs, or do you see how they behave as a group? The answer is the latter, as 1000 VMs will not peak at the same time. Do you consider what happens inside Windows or Linux, or do you see their footprint on your ESXi? The answer is again the latter, as what happens inside is irrelevant.
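The point that VMs do not peak at the same time can be illustrated with a small sketch. The VM names and demand values are made up, and the two helper functions are hypothetical. The sketch compares sizing for the sum of each VM's own peak against sizing for the peak of the group's combined demand.

```python
def sum_of_individual_peaks(demand_by_vm):
    """Add up each VM's own highest demand, ignoring when it happened."""
    return sum(max(series) for series in demand_by_vm.values())


def peak_of_group(demand_by_vm):
    """Highest combined demand of all VMs at any single point in time."""
    return max(sum(point) for point in zip(*demand_by_vm.values()))


# Made-up CPU demand in GHz for 3 VMs at 3 points in time.
# Each VM peaks at 8 GHz, but never at the same moment.
demand_ghz = {
    "vm-a": [2, 8, 2],
    "vm-b": [8, 2, 2],
    "vm-c": [2, 2, 8],
}
individual = sum_of_individual_peaks(demand_ghz)
group = peak_of_group(demand_ghz)
print(individual, group)
```

In this contrived example, sizing per VM asks for 24 GHz while the group never needs more than 12 GHz at once. Real clusters will not be this tidy, but the direction of the gap is why the group view matters for a tech refresh.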

From the above 5 use cases, there are at least 3 different formulas:

  1. Guest OS Sizing. Excludes VM overhead, includes Guest OS Queue
  2. VM Sizing. Includes overhead, excludes Guest OS Queue
  3. VM Sizing. Includes overhead, includes Guest OS Queue

Before I give you the formula, we need to consider another dimension.

Performance vs Capacity

Sizing for capacity considers the long-term cycle. If there is a 1-minute spike to 100%, you won’t immediately adjust the CPU. On the other hand, troubleshooting a performance problem does not even care if performance was fine 1 minute ago. You are simply interested in the utilization at a point in time.

Sizing also considers a buffer, just in case demand goes up in future. Performance does not care about what does not happen. It simply looks at the facts (is there a performance problem? Yes/No).

Now that we’re ready, here is the first formula.

Guest OS Sizing for Capacity

The formula is

Run + Overlap + Ready + CoStop + IO Wait + Swap Wait + (Guest Run Queue)

Run is used because that’s the only counter not affected by both Power Management and Hyper Threading. We are sizing the Guest OS, not the VM. For the Guest, we care about “how much you run” in a given period vs “how fast you run” when you’re running within that given period. Windows is still running at 100%, despite the fact that the underlying VM has to settle for a lower clock speed or compete on a hyper-threaded core.

Usage, Demand and Used account for the efficiency of the run. These are applicable when sizing the VM, not sizing the Guest.

Overlap is added. You know this from your vSphere 101. If not, review this.

Ready, Co Stop, IO Wait and Swap Wait are added. Had there been no contention, Run would have been higher.

Guest OS CPU Run Queue is a counter inside the Guest, indicating that processes are waiting in a queue to be executed. Had Windows or Linux had more vCPU, the queue would have been lower (all else being equal).

We need to consider how VM CPU Ready impacts Guest OS CPU Run Queue. Both are queues, so the lower layer will certainly impact the upper layer. If there is Ready, then CPU Run Queue needs to be adjusted.

From the above number, plot them over time so you can include the peaks. Add headroom as you deem appropriate. Keep it minimal.

Project the above over time to arrive at a single number (on a single point in future).

Once you have a recommended number, adjust for NUMA. This naturally depends on the number of cores per socket in the ESXi.
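The formula and the steps above can be sketched in Python as follows. This is a rough illustration under several assumptions that are not from vR Ops itself: each counter is given in milliseconds summed across all vCPUs per 20-second sample, the Guest Run Queue term has already been converted by the caller into vCPU equivalents (and adjusted for Ready), and headroom is a simple percentage. The function names are hypothetical, and the NUMA adjustment is left out.

```python
import math

INTERVAL_MS = 20_000  # assumption: one sample covers 20 seconds
COUNTERS = ("run", "overlap", "ready", "costop", "io_wait", "swap_wait")


def guest_demand_vcpu(sample, run_queue_vcpu=0.0):
    """Run + Overlap + Ready + CoStop + IO Wait + Swap Wait + (Guest Run Queue).

    `sample` holds each counter in ms summed across all vCPUs (assumption).
    `run_queue_vcpu` is the Guest Run Queue already expressed in vCPU
    equivalents by the caller (assumption).
    """
    total_ms = sum(sample[name] for name in COUNTERS)
    return total_ms / INTERVAL_MS + run_queue_vcpu


def recommended_vcpu(samples, headroom=0.10):
    """Take the peak demand over time, add headroom, round up to whole vCPU."""
    peak = max(guest_demand_vcpu(s) for s in samples)
    return math.ceil(peak * (1 + headroom))


# Made-up samples for a VM over two points in time.
samples = [
    {"run": 60_000, "overlap": 500, "ready": 2_000, "costop": 500,
     "io_wait": 0, "swap_wait": 0},
    {"run": 40_000, "overlap": 200, "ready": 500, "costop": 0,
     "io_wait": 0, "swap_wait": 0},
]
print(recommended_vcpu(samples))
```

The sketch only takes the peak of the history it is given. Projecting the trend to a future point and rounding to a NUMA-friendly size are separate steps that depend on your environment.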

2 more things…

As you can see from the above formula, that’s not what the Guest OS actual CPU utilization is. If you want to see what the Guest OS actually uses, then take the Guest OS CPU usage counter. IMHO, this counter has no purpose. You do not monitor utilization for the sake of utilization. You are either doing capacity or performance, and that drives the formula.

For performance, you need to consider 3 counters

Guest OS Usage + Context Switch + Run Queue

Yes, all the above counters are inside the Guest. You’re asking for Guest performance, not VM performance. They are related but not identical. VM performance has different counters, such as Ready and Co-stop.

VM Sizing for Capacity

VM sizing differs from Guest OS sizing due to overhead (as explained above). All that IO processing is done 2x, once by Windows/Linux, and once by VMkernel.

The CPU System counter accounts for this overhead. This is then charged at the VM level, not to individual vCPUs. VMX should also be included, although it’s negligible most of the time.

Since we’re interested in the VM impact on the infrastructure, we need to consider CPU frequency. This also enables comparison across ESXi hosts with different speeds. A 2-vCPU VM on a 4 GHz ESXi may need 4 vCPU when moved to a 2 GHz ESXi.
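The frequency adjustment is simple arithmetic, sketched below. It assumes linear scaling with clock speed and all else being equal (same CPU generation, same efficiency), which is a simplification. The function name `rescale_vcpu` is hypothetical.

```python
import math


def rescale_vcpu(vcpu, source_ghz, target_ghz):
    """Scale a vCPU count by the ratio of clock speeds, rounding up.

    Assumes performance scales linearly with GHz and nothing else changes.
    """
    return math.ceil(vcpu * source_ghz / target_ghz)


# The two examples from the text.
print(rescale_vcpu(2, source_ghz=4.0, target_ghz=2.0))   # 2 vCPU at 4 GHz moved to 2 GHz
print(rescale_vcpu(16, source_ghz=2.0, target_ghz=4.0))  # 16 vCPU at 2 GHz moved to 4 GHz
```

Rounding up is a conservative choice for the sketch. In practice you would still check the Guest OS run queue after the move.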

HT is automatically accounted for. With lower efficiency, it will simply run longer. Instead of 40% for 5 minutes, it may run 90%. If it exceeds 100%, then it will run longer, and queue will develop inside the Guest OS.

The formula is similar, but we’re using Used instead of Run because Used accounts for CPU speed.

Used + Ready + CoStop + IO Wait + Swap Wait 

VM Sizing for bulk migration

In this use case, you do not care about what happens inside the Guest OS

Used + Ready + CoStop + IO Wait + Swap Wait

Hope that helps you. Let me know your findings in your production environment. Production is always an interesting place, with lots of surprises and weird anomalies.