Category Archives: Operations

Operations related matters. Basically, matters not pertaining to Architecture or Engineering. It covers items such as processes, people, strategy, ITIL, and most of the management disciplines (e.g. financial management, capacity management)

Allocation Model in vSphere

Allocation model, using vCPU:pCore and vRAM:pRAM ratio, is one of the 2 capacity models used in VMware vSphere. Together with the Utilization model, they help Infra team manage capacity. The problem with both models is neither of them measure performance. While they correlate to performance, they are not the counter for it.

As part of Operationalize Your World, we proposed a measurement for performance. We modeled performance and developed a counter for it. For the very first time, Performance can be defined and quantified. We also add an availability concept, in the form of concentration risk ratio. Most business cannot tolerate too many critical VMs going down at the same time, especially if they are revenue generating.

Since the debut of Operationalize Your World at VMworld 2015, hundreds of customers have validated this new metric. With performance added, we are in the position to revise VMware vSphere capacity management.

We can now refine Capacity Management and split it into Planning, Monitoring and Troubleshooting.

Planning Stage

At this stage, we do not know what the future workload will be. We can plan that we will deliver a certain level of performance at some level of utilization. We use the allocation ratio at this stage. Allocation Ratio directly relates to your cost, hence your price. If a physical core costs $10 per month, and you do 5:1 over-commit, then each vCPU should be priced at least $2 per month. Lower than this, and you will make a loss. It has to be higher than $2 unless you can sell all resources on Day 1 for 3 years.

We also consider availability at this stage. For example, if the business can only tolerate 100 mission critical VMs going down when a cluster goes down, then we plan our cluster size accordingly. No point planning a large cluster when you can only put 100 VMs. 100 VMs, at average size of 8 vCPUs, results in 400 cores in 2:1 over-commit. Using 40 core ESXi, that’s only 10 ESXi. No point building a cluster of 16.

Monitoring Stage

This is where you check if Plan meets Actual. You have live VMs running, so you have real data, not spreadsheet 🙂 . There are 2 possible situation:

  1. Over-commit
  2. No over-commit.

With no-overcommit, the utilization of the cluster will never exceed 100%. Hence there is no point measuring utilization. There will be no performance issue too, since none of the VMs will compete for resource. No contention means ideal performance. So there is no point measuring performance. The only relevant metrics are availability and allocation.

With over-commit, the opposite happens. The Ratio is no longer valid, as we can have performance issue. It’s also not relevant since we have real data. If you plan on 8:1 over-commit, but at 4:1 you have performance issue, do you keep going? You don’t, even if you make a loss as your financial plan was based on 8:1. You need to figure out why and solve it. If you cannot solve it, then you remain at 4:1. What you learn is your plan did not pan out as planned 😉

There are 3 reasons why ratio (read: allocation model) can be wrong:

Mark Achtemichuk, VMware performance guru, summaries well here. Quoting him:

There is no common ratio and in fact, this line of thinking will cause you operational pain.

Troubleshooting Stage

If you have plenty of capacity, but you have performance problem, you enter capacity troubleshooting. A typical cause of poor performance at when utilization is not high is contention. The VMs are competing for resource. This is where the Cluster Performance (%) counter comes into play. It gives an early warning, hence acting as Leading Indicator


You no longer have to build buffer to ensure performance. You can go higher on consolidation ratio as you can now measure performance.

If you are Service Provider, you can now offer a premium pricing, as you can back it up with Performance SLA.

If you are customers of an SP, then you can demand a performance SLA. You do not need to rely on ratio as proxy.

Purpose-driven Architecture

When you architect IaaS or DaaS, what end goals do you have in mind? I don’t mean the design considerations, such as best practices. I mean the business result that your architecture has to deliver. A sign that your architecture has failed to deliver is you get into this situation:

The goal of IaaS is to ensure the VMs are running well. The goal of DaaS is to ensure End Users are getting good desktop experience. Have you defined well or good?

Let’s zoom into discuss IaaS. Say you’re architecting for 10K VM in 2 datacenters. You envisage 2K VM in the first month, then ramp up to 10K within the first year. Do you know the basic info about each of these 10K VMs, so that you can architect an infra to serve them well?

  • How big are they? vCPU, RAM, Disk
  • How intense are they? CPU Utilization, RAM utilisation, Disk IOPS, Network throughput?
  • Their workload pattern? Daily, weekly, monthly, etc.

You don’t. Even the applications team don’t know. Their vendors don’t know either, as you’re talking about the future.

So why then, do you promise that your IaaS will serve them well?

That’s a mistake you make as Systems Architect. It’s akin to promising the highway you architect will serve all the cars, buses and motorcycle well, when you have no idea how many they are and how often they will use it.

Can you do something about it?

Yes. You simply provide a good set of choice. The principle you share to your customers are the common sense used in all service industry:

You want it cheap, it won't be fast.
You want it fast, it won't be cheap.

You then offer a few class of service. Give 2-3 good choices, at difference price point. The highest price has the best performance.

  • Your price has to be cheaper than VMware on AWS, else what’s the point. VMware on AWS  has identical architecture to yours, as it’s using the same software and providing same capabilities. This assures your customers that they are getting good price.
  • Your performance is well defined. It is not subject to interpretation. You put a Performance SLA on the table, assuring your customers that you’re confidence of delivering as promised.

You then architect your IaaS to deliver the above classes of service. The class of service is your business offering. It’s the purpose of your architecture. With class of service clearly defined, the question below becomes easy to answer.

When you know exactly the quality of service you need to deliver, the operations team will not suffer. You handover your architecture to them with ease, as it can be operated easily. It has clear definition of performance and capacity.

Keep the summary below when you are architecting IaaS or DaaS.

For more details, review Operationalize Your World.

Capacity Management: it’s not what you think!

If you struggle with Capacity Management, then you’ve approached it with the wrong understanding. The issue is not with your technical skills. The issue is you don’t look at it from your customers viewpoint.

Let’s check your technical skills if you don’t trust me 😊

  1. Can you architect a cluster where the performance matches physical? Easy, just don’t overcommit.
  2. Can you architect a cluster that can handle monster VM? Easy, just get lots of core per socket.
  3. Can you architect with very high availability? Easy, just have more HA host, more vSAN FTT and failure domain.
  4. Can you architect a cluster that can run lots of VMs? Easy, just get lots of big hosts.
  5. Can you optimize the performance? Sure, follow performance best practices and configure for performance.
  6. Can you squeeze the cost? Sure, minimize the hardware, CPU socket, and choose the best bang for the buck. You know all the vendors and their technology. You know the pro and cons of each.

You see, it’s not your technical skills. It’s how you present your solution. Remember this?

“Customers want it good, cheap, and fast. Let them pick any 2”

In IaaS business, this translates into

  • Good = high performance, high availability, deep monitoring.
  • Cheap = low $$
  • Fast = soon. How fast you can get this service.

You want good high performance at cheap price? Wait until next generation Xeon and NVM arrive.

In IaaS, it is a service. Customers should not care about the underlying hardware model and architecture. Whether you’re using NSX or not, they should not and do not care.

So, present the following table. It provides 4 sample tiers for CIO & customers to choose from. Tell them the hardware & software are identical.

You should always start your presentation by explaining Tier 1. That’s the tier they expect for performance. They want it as good as physical. Give customers what they want to hear, else they will go to someone’s else cloud (e.g. Amazon or Azure).

Tier 1 sports performance guarantee. This is only possible because you do not overcommit. To the VM, it’s as good as it’s running alone in the box. No contention. There is no need for reservation, and every VM can run at 100% all day long.

What’s the catch?

Obviously, just like First Class seat, tier 1 is expensive. It’s suitable only for those latency sensitive apps.

Show them the price for Tier 1. If they are happy, end of discussion. You architect for Tier 1, as that’s the requirements. If your customers want to fly first class, then you should not stop them.

What if VM Owners wants something much cheaper, and don’t mind a small drop of performance?

You then offer Tier 2 & Tier 3. Explain that you can cut down the cost to any discount they want. But you need to match the over commitment. If they want 50% discount, then it’s 2:1 overcommit. If they want 67% discount, then it’s 3:1 overcommit. It’s that simple.

Any IT fresh graduate can do the above 🙂 No need seasoned IT Prof with 1-2+ decade of combat experience.

Your professionalism comes in here: The performance drops does not drop as low as the discount. You can achieve 50% at <50% performance drop.

How is that possible?

2 reasons impacting Demand: VM Size and VM Utilization.

You control the VM size. By not having monster VM in the lower tier, the IaaS has higher chance of giving good performance for everyone.

BTW, this is your solution to avoid over-provisioned to begin with.

From experience, we know VMs don’t run at 100% most of the time. This utilization + size helps deliver a good performance.

So we know at 2:1 overcommit, the performance degradation will not 50%. But what it will be? 10%, 30%?

BTW, 10% means that the resource is not available immediately 10% of the time. It does not mean that it’s never available. It’s just that there is a latency in getting the resource (e.g. CPU).

We can’t predict what the degradation will be, as it depends on the total utilization of the VMs, which is not in your control. However, we can monitor the degradation experienced by each VM.

This is where you tell CIO: “What is your comfort level?”

Now, we don’t know the impact to the application when there is latency in infrastructure. That depends on the application. Even on the same identical software, e.g. SQL Server 2016, the impact may differ as it depends on how you use that software. Different nature of business workload (e.g. batch vs OLTP) gets impacted differently even on the identical version of the software.

The good thing is we’re not measuring application. We are measuring infrastructure. Infra that takes the shape of service (meaning VM owners don’t really care the spec) cannot be measured by the hardware spec as that’s irrelevant. So you track how well a VM is served by the IaaS instead.

For example, for a Tier 1 VM, what the VM gets will be very close to what it wants. For example, CPU Contention will be below 0.3%, while Memory Contention will simply be 0%. Disk Latency maybe 5 ms (you need to specify it as it can’t be 0 ms).

A Tier 3 VM, on the other hand, will have worse SLA. The CPU Contention maybe 15% (you decide with CIO), the Disk latency maybe 40 ms (again, this is a business decision).

An SLA isn’t useful if it’s not tracked per VM. You track it every 5 minutes for every single VM. This is part of Operationalize Your World.

I talked earlier about controlling the demand by limiting the VM size. You specify the limit to VM size for each tier. For example, any VM in Tier 2 cannot span a physical socket. You will impact the hypervisor scheduler. Customer who wants monster VM is more than welcome to move to higher tier. You do not act as gatekeeper or government. It’s their money and they don’t appreciate you playing parent.

How do you encourage right-sizing?

By money.

Not by “we try to save company money” motherhood talk. This is business, and both Apps team and Infra team are professional. Avoid playing the government in internal IT. The Application Team expects to be treated as customers, not just colleague.

Develop a pricing model that is compelling for them to go small. Use this as best practice:

The above uses Discount and Tax. As a result, it’s much cheaper to go small. A 32 vCPU VM costs 32x of 4 vCPU, not 8x.

The above gets you the model. How about the actual price?

In business, there is no point if you can’t put a price.

Bad news, your price has been set by leading cloud players (AWS & Azure). It’s a commodity business using commodity software and hardware. The price of DC co-location and admin salary are pretty similar too.

All these means the price can’t differ that much. Using the airline analogy, the price among airlines are similar too.

Here is the price from AWS (as at 31 July 2017).

I use M4 series as that gives balance CPU & RAM. Other series are cheaper but they are using older hardware and does not provide balance combination.

From the above, I take the 1-year contract, 100% paid up front for comparison. In Enterprise IT, you may get budget annually and the budget can be transferred up front at the start of fiscal year.

The price above excludes Storage and Network. Only Compute. It also excludes Support, Monitoring, Reporting, Guest OS update, Network, Backup, Security.

It includes: DC facility rent, IaaS software + maintenance.

How do you calculate your price from above? You can take a comparison, and make it apple to apple.

I took 50 4 vCPU VM and 25 8 vCPU VM, and calculate the 3 year price.

To convert to private cloud, use 2:1 overcommit. AWS counts the HT as core.

Based on the above, you can see the price of AWS is high, as it > $100K per ESXi.

To determine your VM cost, you start by determining your total cost. I put half of AWS as I still think it’s reasonable. $396K for 7 ESXi still give you room for IT team salary, DC colocation, etc.

The above gives you your price for the equivalent AWS or Azure machine.

You should run your own calculation. Use your own overcommit ratio and VM size.

Once done, you should take this price to your internal customers. Have a Pricing discussion. When you order food at restaurant, does the price matter to you?

As the architect of the platform, you know the value of your creation best.

I hope this blog gives you food for thought. Capacity Management does not start when the VM is deployed, or the hardware was purchased. It starts much earlier than that. Go back to that time, as that’s how you can get out of the never ending right-sizing and argument over capacity.