The allocation model, based on vCPU:pCore and vRAM:pRAM ratios, is one of the two capacity models used in VMware vSphere. Together with the utilization model, it helps the infrastructure team manage capacity. The problem with both models is that neither measures performance. While they correlate with performance, they are not the counter for it.
As part of Operationalize Your World, we proposed a measurement for performance. We modeled performance and developed a counter for it. For the very first time, performance can be defined and quantified. We also added an availability concept, in the form of a concentration risk ratio. Most businesses cannot tolerate too many critical VMs going down at the same time, especially if they are revenue generating.
Since the debut of Operationalize Your World at VMworld 2015, hundreds of customers have validated this new metric. With performance added, we are in a position to revise VMware vSphere capacity management.
We can now refine Capacity Management and split it into Planning, Monitoring and Troubleshooting.
At this stage, we do not know what the future workload will be. We can plan to deliver a certain level of performance at some level of utilization. We use the allocation ratio at this stage. The allocation ratio directly relates to your cost, and hence your price. If a physical core costs $10 per month and you do 5:1 over-commit, then each vCPU should be priced at least $2 per month. Any lower and you will make a loss. In practice, it has to be higher than $2, unless you can sell all resources on Day 1 for 3 years.
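The break-even arithmetic above can be sketched as follows. This is an illustrative helper, not part of any vSphere tooling; the function name and parameters are our own:

```python
def breakeven_vcpu_price(core_cost_per_month: float, overcommit: float) -> float:
    """Minimum monthly price per vCPU so a fully sold core covers its own cost.

    One physical core serves `overcommit` vCPUs, so each vCPU must carry
    at least 1/overcommit of the core's cost.
    """
    return core_cost_per_month / overcommit

# $10/month per core at 5:1 over-commit -> $2 break-even per vCPU.
print(breakeven_vcpu_price(core_cost_per_month=10.0, overcommit=5.0))  # 2.0
```

Remember this is the floor at 100% sell-through; unsold capacity pushes the real break-even price higher.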
We also consider availability at this stage. For example, if the business can only tolerate 100 mission-critical VMs going down when a cluster goes down, then we plan our cluster size accordingly. There is no point planning a large cluster when you can only put 100 VMs in it. 100 VMs, at an average size of 8 vCPUs, need 800 vCPUs, which translates to 400 cores at 2:1 over-commit. Using 40-core ESXi hosts, that's only 10 ESXi. There is no point building a cluster of 16.
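The sizing arithmetic above can be written as a small sketch. The function and its inputs are hypothetical, just to make the steps explicit:

```python
import math

def hosts_for_availability_limit(max_vms: int, avg_vcpus: int,
                                 overcommit: float, cores_per_host: int) -> int:
    """Cluster size (in hosts) implied by the availability limit on VM count."""
    vcpus_needed = max_vms * avg_vcpus           # 100 VMs x 8 vCPU = 800 vCPUs
    pcores_needed = vcpus_needed / overcommit    # 800 / 2:1 = 400 pCores
    return math.ceil(pcores_needed / cores_per_host)  # 400 / 40 cores = 10 hosts

print(hosts_for_availability_limit(max_vms=100, avg_vcpus=8,
                                   overcommit=2.0, cores_per_host=40))  # 10
```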
This is where you check whether Plan meets Actual. You have live VMs running, so you have real data, not a spreadsheet 🙂 . There are 2 possible situations:
- No over-commit.
With no over-commit, the utilization of the cluster will never exceed 100%, so there is no point measuring utilization. There will be no performance issue either, since none of the VMs will compete for resources. No contention means ideal performance, so there is no point measuring performance. The only relevant metrics are availability and allocation.
- Over-commit.

With over-commit, the opposite happens. The ratio is no longer valid, as we can now have performance issues. It is also no longer relevant, since we have real data. If you planned on 8:1 over-commit, but at 4:1 you already have performance issues, do you keep going? You don't, even if you make a loss because your financial plan was based on 8:1. You need to figure out why and solve it. If you cannot solve it, then you stay at 4:1. What you learn is that your plan did not pan out as planned 😉
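The decision above reduces to a simple check: grow only while you are below the planned ratio and performance still holds. A minimal sketch, with hypothetical function names and inventory numbers:

```python
def effective_overcommit(total_vcpus: int, total_pcores: int) -> float:
    """Actual consolidation ratio, computed from live inventory data."""
    return total_vcpus / total_pcores

def keep_adding_vms(current: float, planned: float, performance_issue: bool) -> bool:
    """Grow only while below the planned ratio AND performance is still good."""
    return current < planned and not performance_issue

# 1600 vCPUs on 400 pCores -> 4:1, below the 8:1 plan,
# but performance already suffers, so we stop.
ratio = effective_overcommit(total_vcpus=1600, total_pcores=400)  # 4.0
print(keep_adding_vms(ratio, planned=8.0, performance_issue=True))  # False
```

The point is that the live performance signal, not the planned ratio, is what gates further consolidation.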
There are 3 reasons why the ratio (read: the allocation model) can be wrong:
There is no common ratio and, in fact, this line of thinking will cause you operational pain.
If you have plenty of capacity but you have a performance problem, you enter capacity troubleshooting. A typical cause of poor performance when utilization is not high is contention: the VMs are competing for resources. This is where the Cluster Performance (%) counter comes into play. It gives an early warning, hence acting as a leading indicator.
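To illustrate the idea behind such a counter, here is one plausible formulation: the share of VMs whose CPU contention stays within an SLA threshold. This is our own sketch of the concept, not the exact definition of the Cluster Performance (%) counter:

```python
def cluster_performance_pct(vm_contention_pct: list[float],
                            sla_pct: float = 5.0) -> float:
    """Percentage of VMs whose CPU contention is within the SLA threshold.

    100% means every VM is being served well; a falling value warns of
    contention long before utilization looks alarming.
    """
    within = sum(1 for c in vm_contention_pct if c <= sla_pct)
    return 100.0 * within / len(vm_contention_pct)

# 3 of 4 VMs are within a 5% contention SLA -> 75% cluster performance.
print(cluster_performance_pct([1.2, 3.8, 6.5, 0.4], sla_pct=5.0))  # 75.0
```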
You no longer have to build in a buffer to ensure performance. You can go higher on the consolidation ratio because you can now measure performance.
If you are a Service Provider, you can now charge premium pricing, as you can back it up with a Performance SLA.
If you are a customer of an SP, you can demand a Performance SLA. You no longer need to rely on ratio as a proxy.