Monthly Archives: September 2018

CPU Ready vs CPU Contention

Folks like Daniel in Hong Kong, Sajag in Thailand, and Ramandeep in the US have noticed that I shifted my recommendation for the Performance SLA from CPU Contention to CPU Ready. The reason is essentially Change Management. Moving from complaint-based operations to SLA-based operations is a transformation. It’s not something you do in a week. You need to enlighten your boss and your customers. It’s a paradigm shift that can take months.

As a result, CPU Ready is a better start than CPU Contention. Your IaaS business is not ready for Contention, pun intended.

CPU Ready is more stable than CPU Contention, as it’s not affected by Hyper-Threading. When both hyper-threads on a core are running, they share that core’s CPU cycles. Since HT gives only a 1.25x boost, each thread gets 62.5% of a core when both are running, a reduction of 37.5%. That reduction is accounted for in CPU Contention, which is why it can spike to >35% when Ready is not even 1%. Test this by running 2 large VMs on 1 ESXi host. If the host has 16 cores and 32 threads, run 2 VMs with 16 vCPUs each. Run both at 100%. Set Power Management to Max so that frequency scaling does not affect CPU Contention. Both VMs should experience minimal CPU Ready but high CPU Contention. My guess is CPU Ready will be <1%, while CPU Contention will be >35%.
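The Hyper-Threading arithmetic above can be written out as a small sketch. It is a simplified model under the assumptions stated in the text (a 1.25x HT boost, both threads on every core fully busy), not the formula the VMkernel actually uses for the counter. The function names are made up for illustration.

```python
def per_thread_share(ht_boost: float = 1.25, threads_per_core: int = 2) -> float:
    """Fraction of a full core each hyper-thread gets when all threads are busy.

    Assumption: the core delivers ht_boost times its single-thread throughput,
    split evenly across the busy threads.
    """
    return ht_boost / threads_per_core


def ht_contention_pct(ht_boost: float = 1.25, threads_per_core: int = 2) -> float:
    """Estimated contention (percent) caused by HT sharing alone in this model."""
    return (1.0 - per_thread_share(ht_boost, threads_per_core)) * 100.0


if __name__ == "__main__":
    print(f"Share of a core per thread: {per_thread_share():.1%}")
    print(f"Estimated contention from HT: {ht_contention_pct():.1f}%")
```

With the default 1.25x boost, each thread gets 62.5% of a core, so the model puts contention at 37.5%. That is in line with the >35% guess for the 2x 16 vCPU test, which still needs a real test to confirm.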

CPU Ready is also not affected by Power Management. As you can see here, in general you should take advantage of power savings. The performance degradation is minimal while the savings are substantial. CPU Contention accounts for this frequency drop. My guess is that a frequency drop of 25% will result in CPU Contention of 25%. I say guess because I have not seen a test.
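The guess about frequency drop can be expressed as a one-line model. This is only a sketch of that guess, assuming contention from power management scales linearly with the drop from nominal frequency. The function name and the example frequencies are hypothetical, and nothing here has been validated against a real host.

```python
def frequency_contention_pct(nominal_ghz: float, actual_ghz: float) -> float:
    """Estimated contention (percent) from running below nominal frequency.

    Assumption: contention equals the relative frequency drop, so a 25% drop
    gives 25% contention. Frequencies at or above nominal give 0.
    """
    if nominal_ghz <= 0:
        raise ValueError("nominal_ghz must be positive")
    drop = 1.0 - (actual_ghz / nominal_ghz)
    return max(drop, 0.0) * 100.0


if __name__ == "__main__":
    # Hypothetical example: a 2.4 GHz core running at 1.8 GHz is a 25% drop.
    print(f"{frequency_contention_pct(2.4, 1.8):.1f}%")
```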

Considering the above, Ready is a lot less volatile. This makes it more suitable as an SLA. Operationally, it’s easier to implement. It’s also easier to explain to folks less familiar with the VMkernel CPU scheduler.

If you use CPU Contention as a formal SLA, you may spend a lot of time troubleshooting when the business doesn’t even notice the performance degradation.

Where do you use CPU Contention then? For advanced troubleshooting. You first check CPU Run Queue, CPU Context Switch, “Guest OS CPU Usage”, CPU Ready and CPU Co-Stop. When these 5 counters are good, the other 3 infra elements (RAM, Disk, Network) are good, and you’re convinced that it’s not an application issue, then you should check CPU Contention.
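The troubleshooting order above can be sketched as a simple checklist. The function names are made up for illustration. What counts as "good" for each counter is left to you, so the sketch takes a true/false verdict per check rather than inventing thresholds.

```python
from typing import List, Mapping, Optional

# The five counters and three other infrastructure elements named in the text.
CPU_COUNTERS: List[str] = [
    "CPU Run Queue",
    "CPU Context Switch",
    "Guest OS CPU Usage",
    "CPU Ready",
    "CPU Co-Stop",
]
OTHER_INFRA: List[str] = ["RAM", "Disk", "Network"]


def first_unhealthy(healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first check, in the order listed, that is not known to be good.

    A check missing from the mapping is treated as not yet verified.
    """
    for name in CPU_COUNTERS + OTHER_INFRA:
        if not healthy.get(name, False):
            return name
    return None


def should_check_cpu_contention(
    healthy: Mapping[str, bool], application_ruled_out: bool
) -> bool:
    """True only when every earlier check is good and the application is ruled out."""
    return application_ruled_out and first_unhealthy(healthy) is None


if __name__ == "__main__":
    status = {name: True for name in CPU_COUNTERS + OTHER_INFRA}
    print(should_check_cpu_contention(status, application_ruled_out=True))
    status["CPU Ready"] = False
    print(first_unhealthy(status))
```

The point of the ordering is that CPU Contention comes last: if any of the earlier checks is bad, start there.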

Hope that clarifies. If your observations in production differ from the above, do email me.

VMworld 2018 presentations

You can find my presentations here. They are in PowerPoint, not PDF. Feel free to use the decks, add what’s relevant to you and remove what’s not. Let me know how it helps you!

In the Advanced Performance Bootcamp, Sunny and I covered the counters in depth. You can also find details on VM CPU and VM RAM.

Alok and I discussed vSAN Troubleshooting in a TAM session. It was the last session on Thursday, and I was glad to see customers staying until the event team came in to wrap up the place. The session shows how you can monitor vSAN at scale. We met customers with 3500 clusters, going well beyond 10K clusters. How do you monitor such an environment?

Sajag, a customer from Thailand, shared how he significantly reduced the hardware footprint (servers and storage) while improving performance. The help desk tickets went down by 40% in the last 6 months. The majority of the tickets were related to the Application Team complaining about performance. His company managed to avoid purchasing a new storage array and 1 blade chassis (14 hosts) worth of hardware. More details below.

Look forward to seeing you in Barcelona!