Allocation Model in vSphere

The Allocation model, which uses the vCPU:pCore and vRAM:pRAM ratios, is one of the 2 capacity models used in VMware vSphere. Together with the Utilization model, it helps the Infra team manage capacity. The problem with both models is that neither of them measures performance. While they correlate with performance, they are not a counter for it.

As part of Operationalize Your World, we proposed a measurement for performance. We modeled performance and developed a counter for it. For the very first time, Performance can be defined and quantified. We also added an availability concept, in the form of a concentration risk ratio. Most businesses cannot tolerate too many critical VMs going down at the same time, especially if they are revenue generating.

Since the debut of Operationalize Your World at VMworld 2015, hundreds of customers have validated this new metric. With performance added, we are in a position to revise VMware vSphere capacity management.

We can now refine Capacity Management and split it into Planning, Monitoring and Troubleshooting.

Planning Stage

At this stage, we do not know what the future workload will be. We can plan that we will deliver a certain level of performance at some level of utilization. We use the allocation ratio at this stage. The Allocation Ratio directly relates to your cost, hence your price. If a physical core costs $10 per month, and you do 5:1 over-commit, then each vCPU should be priced at no less than $2 per month. Lower than this, and you will make a loss. In practice it has to be higher than $2, unless you can sell all resources on Day 1 and keep them sold for 3 years.
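The break-even arithmetic above can be sketched in a few lines. This is only an illustration, not a pricing tool: the function name `break_even_vcpu_price` and the `sold_fraction` parameter (the share of allocatable vCPUs that is actually sold) are assumptions introduced here to show why the price has to sit above $2 when the cluster is not fully sold.

```python
def break_even_vcpu_price(core_cost_per_month, overcommit_ratio, sold_fraction=1.0):
    """Minimum monthly price per vCPU needed to recover the cost of a physical core.

    sold_fraction is a hypothetical parameter: the share of allocatable
    vCPUs that customers are actually paying for (1.0 = everything sold).
    """
    if overcommit_ratio <= 0 or not 0 < sold_fraction <= 1:
        raise ValueError("overcommit_ratio must be > 0 and sold_fraction in (0, 1]")
    return core_cost_per_month / (overcommit_ratio * sold_fraction)


# The example from the text: a $10 core at 5:1 over-commit, fully sold.
full = break_even_vcpu_price(10, 5)
# Same cluster, but only half of the vCPUs are sold (assumed figure).
half = break_even_vcpu_price(10, 5, sold_fraction=0.5)
print(full, half)
```

With everything sold the break-even is $2 per vCPU. At an assumed 50% sold, it doubles.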

We also consider availability at this stage. For example, if the business can only tolerate 100 mission critical VMs going down when a cluster goes down, then we plan our cluster size accordingly. There is no point planning a large cluster when you can only put 100 VMs in it. 100 VMs, at an average size of 8 vCPUs, is 800 vCPUs, which needs 400 cores at 2:1 over-commit. Using 40-core ESXi hosts, that’s only 10 ESXi hosts. No point building a cluster of 16.
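The sizing arithmetic above can be written as a small helper. This is a sketch under stated assumptions: the function name is made up for this example, and it ignores any spare hosts you would add for HA or maintenance, which the text does not discuss.

```python
import math


def hosts_for_concentration_limit(max_vms, avg_vcpus_per_vm, overcommit_ratio, cores_per_host):
    """Return (physical cores needed, ESXi hosts needed) for a cluster capped at max_vms.

    Hypothetical helper: spare capacity for HA or maintenance is not included.
    """
    total_vcpus = max_vms * avg_vcpus_per_vm
    cores_needed = math.ceil(total_vcpus / overcommit_ratio)
    hosts_needed = math.ceil(cores_needed / cores_per_host)
    return cores_needed, hosts_needed


# The example from the text: 100 VMs, 8 vCPUs each, 2:1 over-commit, 40-core hosts.
cores, hosts = hosts_for_concentration_limit(100, 8, 2, 40)
print(cores, hosts)
```

The concentration limit, not the maximum cluster size the platform supports, is what caps the host count here.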

Monitoring Stage

This is where you check if Plan meets Actual. You have live VMs running, so you have real data, not a spreadsheet 🙂 . There are 2 possible situations:

  1. Over-commit
  2. No over-commit.

With no over-commit, the utilization of the cluster will never exceed 100%. Hence there is no point measuring utilization. There will be no performance issue either, since none of the VMs will compete for resources. No contention means ideal performance. So there is no point measuring performance. The only relevant metrics are availability and allocation.

With over-commit, the opposite happens. The Ratio is no longer valid, as we can have performance issues. It’s also no longer relevant, since we have real data. If you plan on 8:1 over-commit, but at 4:1 you have performance issues, do you keep going? You don’t, even if you make a loss because your financial plan was based on 8:1. You need to figure out why and solve it. If you cannot solve it, then you remain at 4:1. What you learn is that your plan did not pan out 😉

There are 3 reasons why ratio (read: allocation model) can be wrong:

Mark Achtemichuk, VMware performance guru, summarizes it well here. Quoting him:

There is no common ratio and in fact, this line of thinking will cause you operational pain.

Troubleshooting Stage

If you have plenty of capacity, but you have a performance problem, you enter capacity troubleshooting. A typical cause of poor performance when utilization is not high is contention. The VMs are competing for resources. This is where the Cluster Performance (%) counter comes into play. It gives an early warning, hence acting as a Leading Indicator.

Summary

You no longer have to build a buffer to ensure performance. You can go higher on the consolidation ratio, as you can now measure performance.

If you are a Service Provider, you can now offer premium pricing, as you can back it up with a Performance SLA.

If you are a customer of an SP, then you can demand a Performance SLA. You do not need to rely on the ratio as a proxy.

Purpose-driven Architecture

When you architect IaaS or DaaS, what end goals do you have in mind? I don’t mean the design considerations, such as best practices. I mean the business result that your architecture has to deliver. A sign that your architecture has failed to deliver is you get into this situation:

The goal of IaaS is to ensure the VMs are running well. The goal of DaaS is to ensure End Users are getting a good desktop experience. Have you defined “well” or “good”?

Let’s zoom in to discuss IaaS. Say you’re architecting for 10K VMs in 2 datacenters. You envisage 2K VMs in the first month, then a ramp up to 10K within the first year. Do you know the basic info about each of these 10K VMs, so that you can architect an infra to serve them well?

  • How big are they? vCPU, RAM, Disk
  • How intense are they? CPU Utilization, RAM utilisation, Disk IOPS, Network throughput?
  • Their workload pattern? Daily, weekly, monthly, etc.

You don’t. Even the applications team doesn’t know. Their vendors don’t know either, as you’re talking about the future.

So why then, do you promise that your IaaS will serve them well?

That’s a mistake you make as a Systems Architect. It’s akin to promising that the highway you architect will serve all the cars, buses and motorcycles well, when you have no idea how many there are and how often they will use it.

Can you do something about it?

Yes. You simply provide a good set of choices. The principle you share with your customers is the common sense used in all service industries:

You want it cheap, it won't be fast.
You want it fast, it won't be cheap.

You then offer a few classes of service. Give 2-3 good choices, at different price points. The highest price has the best performance.

  • Your price has to be cheaper than VMware on AWS, else what’s the point. VMware on AWS has an identical architecture to yours, as it’s using the same software and providing the same capabilities. This assures your customers that they are getting a good price.
  • Your performance is well defined. It is not subject to interpretation. You put a Performance SLA on the table, assuring your customers that you’re confident of delivering as promised.

You then architect your IaaS to deliver the above classes of service. The class of service is your business offering. It’s the purpose of your architecture. With class of service clearly defined, the question below becomes easy to answer.

When you know exactly the quality of service you need to deliver, the operations team will not suffer. You hand over your architecture to them with ease, as it can be operated easily. It has a clear definition of performance and capacity.

Keep the summary below in mind when you are architecting IaaS or DaaS.

For more details, review Operationalize Your World.

A test of your IaaS Operations maturity

What you architect is SDDC. What you hand over as a business result to the CIO is IaaS. We can assess whether the architecture is good or not, based on the actual result in production. Does it result in fire-fighting and blame-storming? Or do you have peaceful operations?

The litmus test below helps you assess the maturity of your IaaS.

Do your customers blame your infrastructure?

  • If the answer is yes, take a step back to ask yourself why. There is a high chance you’re relying on complaints in your operations. So you actually encourage them. No complaint, no problem. A Complaint-based Operations.
  • The reason why you rely on complaints is you don’t have other means. You have not defined the performance of your IaaS.
  • A sign of mature operations is that you have a Performance SLA. It is per-VM, measured every 5 minutes.
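A per-VM check over 5-minute samples, as described above, might look like the sketch below. Everything specific in it is an assumption made for illustration: the contention readings, the 2% threshold, and the 95% compliance target are hypothetical values, not figures from Operationalize Your World or from any VMware product.

```python
def sla_compliance(samples, threshold):
    """Fraction of 5-minute samples whose contention value is within the threshold."""
    if not samples:
        raise ValueError("no samples")
    met = sum(1 for value in samples if value <= threshold)
    return met / len(samples)


# Hypothetical data: one hour of 5-minute contention readings (%) per VM.
readings = {
    "vm-a": [0.5, 0.8, 1.0, 0.7, 0.9, 0.6, 0.4, 0.5, 0.8, 1.1, 0.9, 0.7],
    "vm-b": [0.5, 3.2, 4.1, 0.7, 0.9, 2.8, 0.4, 0.5, 0.8, 1.1, 0.9, 0.7],
}
THRESHOLD_PCT = 2.0   # assumed SLA threshold for a single sample
TARGET = 0.95         # assumed share of samples that must meet the threshold

# VMs that the infrastructure did not serve well enough in this window.
breaches = [vm for vm, s in readings.items()
            if sla_compliance(s, THRESHOLD_PCT) < TARGET]
print(breaches)
```

The point of the sketch is that the SLA is evaluated from data on every VM, so a problem shows up before anyone complains.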

Is your IaaS cheaper than both VMware on Amazon and Amazon?

  • If not, your CIO may question your business value. The reason for having an in-house architect is so you can bring lower cost, after taking into account your salary.

Does Help Desk provide a good first level defense?

  • If Help Desk simply passes through to the next level, you need to look at why.
  • Help Desk is your first line of defence. They are not as technical as you are. Equip them with a simple dashboard so they can handle VM Owner complaints:
    • Is the problem caused by IaaS not serving the VM well?
    • If yes, which part of the Infra: CPU, RAM, Disk, Network?
    • If not, how to prove it convincingly?

Can you justify new infrastructure when utilization is not yet high?

  • This is not referring to the additional money that comes with a new project. This is referring to existing clusters/storage.
  • Capacity is measured on utilization and performance. A cluster’s capacity is full if it can’t serve its VMs well. Since it takes time to buy hardware, you need to have an early warning to detect this performance degradation.
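The idea that a cluster can be full on performance before it is full on utilization can be expressed as a simple rule. This is a sketch only: the function, the input `vms_served_well_pct`, and both default thresholds are hypothetical and would need to be replaced with whatever performance counter and limits you actually use.

```python
def cluster_capacity_status(utilization_pct, vms_served_well_pct,
                            max_utilization_pct=80.0, min_served_well_pct=95.0):
    """Classify cluster capacity using both performance and utilization.

    The thresholds are assumed defaults for illustration, not recommendations.
    Performance is checked first, so contention is flagged even at low utilization.
    """
    if vms_served_well_pct < min_served_well_pct:
        return "full: performance"
    if utilization_pct >= max_utilization_pct:
        return "full: utilization"
    return "has headroom"


print(cluster_capacity_status(55, 90))   # low utilization, but VMs are contending
print(cluster_capacity_status(85, 99))   # VMs are fine, utilization is high
print(cluster_capacity_status(55, 99))   # healthy on both counts
```

The first case is the one that justifies new infrastructure while utilization is still moderate.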

Do you struggle with many over-provisioned VMs?

  • This is an indicator that you’re operating as a System Builder as opposed to a Service Provider.
  • As a System Builder, you’re meddling with each System (read: Application). You size them, and argue with the application team.
  • As a Service Provider, you’re not “in the way”. IT simply uses an effective pricing model to drive the right behaviour. Does AWS block you when you buy a 40-CPU EC2 VM when you only need 2 CPUs?

Does Troubleshooting mean all hands on deck?

  • Do you have a process that is followed by all teams (network, storage, server, OS, application)? Does that process end with Root Cause Analysis?
  • As part of RCA, do you set up alerts so the issue can be detected faster if it happens again?

There are more questions, but I thought we would start with these. If you want to see details, download this.