Which counters can impact a VM’s performance? The diagram may surprise you, as it does not follow popular thinking. Notice the red dotted line. It means that Utilization does not impact performance.
Take VM CPU for example. You do want the VM to run at 100% CPU utilization, as that means the CPU is doing as much work as possible. The VM’s performance is not impacted so long as there is no queue: it is handling 100% of the demand perfectly. Specific to a VM, a queue may form inside the VM (read: Guest OS) or outside the VM (read: hypervisor layer).
Now, a VM running at 100% certainly has no capacity left for additional workload. It is full. If you think the VM needs to run more workload in the future, then add more resources. If not, then adding more CPU can in fact slow down its performance. This is one reason why I was a proponent of dropping the Stress counter in vR Ops. It was based on the Demand counter.
Let’s switch to ESXi. An ESXi host does not have a performance problem if it is not struggling to meet demand. If utilization = 100% but none of its VMs are contending for resources, that is in fact the perfect situation. You’re getting 100% of your money’s worth!
While Capacity and Performance are closely related, they are not identical.
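To make the distinction concrete, here is a minimal Python sketch that judges performance and capacity separately. The field names, the units and the zero thresholds are my own assumptions for illustration; they are not actual vR Ops or vSphere counter names, and real thresholds are a policy decision.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """One observation of a VM. All field names are hypothetical."""
    utilization_pct: float      # how busy the VM's CPU is
    guest_queue: float          # queue inside the VM (Guest OS)
    hypervisor_wait_pct: float  # waiting outside the VM (hypervisor layer)


def assess(sample, queue_limit=0.0, wait_limit=0.0, full_at_pct=100.0):
    """Judge performance and capacity independently.

    Performance looks only at queues (inside or outside the VM).
    Capacity looks only at utilization.
    The default limits are placeholders, not recommendations.
    """
    performance_impacted = (
        sample.guest_queue > queue_limit
        or sample.hypervisor_wait_pct > wait_limit
    )
    capacity_full = sample.utilization_pct >= full_at_pct
    return {
        "performance_impacted": performance_impacted,
        "capacity_full": capacity_full,
    }


# A VM at 100% utilization with no queue: full, but performing fine.
busy_but_healthy = assess(VmSample(100.0, 0.0, 0.0))

# A VM at 40% utilization that is queuing: not full, but impacted.
idle_but_struggling = assess(VmSample(40.0, 3.0, 0.0))
```

The two verdicts never feed into each other, which is the point: a full VM can be healthy, and a lightly used VM can be struggling.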
Now that we’re clear on the difference between VM Performance and VM Capacity, let’s zoom into the counters. The following diagram shows the actual counters on each layer (Guest OS, VM and ESXi) that could impact performance.
Do you agree with the choice of counters? 🙂
Still disagree? Let me know at email@example.com. Michael Ryom emailed me, pointing out that since utilization does not impact performance, I should not put the counters there. Good point, and thank you Michael. I’ve updated the blog so now it states “that could impact performance”.
Some of the above counters are not available unless you install an agent. Some counters, such as Disk Driver queue, are not available unless you install specific tracing software. For the Windows storage driver specifically, there is no API, so an agent won’t help.
Some counters are OS specific. I think Committed % does not impact Linux the way it impacts Windows, due to differences in memory management.
Some counters do not have established or proven guidelines. Outstanding IO is one such counter. Storage specialists told me that so long as latency is low, the OIO matters less. I’m keen to hear your real-world experience.
If you do not want to deploy an agent (a choice I agree with, as agents incur operational overhead), here is what you can monitor today with vRealize Operations + vSphere Tools.
The CPU Context Switch is an interesting one. I have not found guidance on what a bad value is. Read this and let me know your thoughts. For RAM Page-In Rate, I think a percentage is better than an absolute value. My take is that 1% is not good. What’s your take?
Which counters should you use for your IaaS Performance SLA? Here is my recommendation:
- CPU Ready
- RAM Contention
- Network TX Dropped Packet
- Disk Latency
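As a rough illustration, the four counters above can be checked against per-counter limits in a few lines of Python. The dictionary keys and every threshold value below are placeholders I made up for the sketch; they are not official counter names and not recommended SLA values.

```python
from typing import Dict, List

# Placeholder limits: illustrative assumptions only, not values
# from vRealize Operations or from any published guideline.
SLA_THRESHOLDS: Dict[str, float] = {
    "cpu_ready_pct": 2.5,
    "ram_contention_pct": 1.0,
    "network_tx_dropped_packets": 0,
    "disk_latency_ms": 20.0,
}


def sla_breaches(sample: Dict[str, float],
                 thresholds: Dict[str, float] = SLA_THRESHOLDS) -> List[str]:
    """Return the counters in `sample` that exceed their limit.

    Raises KeyError if a counter is missing, so that missing data
    is never silently treated as healthy.
    """
    return [name for name, limit in thresholds.items()
            if sample[name] > limit]


# Hypothetical sample for one VM over one collection interval.
vm_sample = {
    "cpu_ready_pct": 1.0,
    "ram_contention_pct": 0.0,
    "network_tx_dropped_packets": 0,
    "disk_latency_ms": 35.0,
}
breaches = sla_breaches(vm_sample)
```

Keeping the thresholds in a single dictionary means the SLA can be tuned per environment or per service tier without touching the check itself.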
Do you know why I only use CPU Ready, and exclude CPU Co-Stop and CPU Contention? Email me the answer 🙂 It took me years to vrealize the mistake.
IaaS SLA complements Application SLA, which in turn complements Business SLA. Application SLA depends on each application, hence it takes a lot more effort to establish. It also does not answer whether the problem is caused by the infrastructure or the application.
Hope it’s useful. In the next blog, I will share how you can quickly monitor the counters above on thousands of VMs.