Category Archives: Operations

Operations-related matters. Basically, matters not pertaining to Architecture or Engineering. It covers items such as processes, people, strategy, ITIL, and most of the management disciplines (e.g. financial management, capacity management).

Capacity Management: it’s not what you think!

If you struggle with Capacity Management, then you’ve approached it with the wrong understanding. The issue is not with your technical skills. The issue is you don’t look at it from your customers’ viewpoint.

Let’s check your technical skills if you don’t trust me 😊

  1. Can you architect a cluster where the performance matches physical? Easy, just don’t overcommit.
  2. Can you architect a cluster that can handle monster VMs? Easy, just get lots of cores per socket.
  3. Can you architect with very high availability? Easy, just have more HA hosts, a higher vSAN FTT and more failure domains.
  4. Can you architect a cluster that can run lots of VMs? Easy, just get lots of big hosts.
  5. Can you optimize the performance? Sure, follow performance best practices and configure for performance.
  6. Can you squeeze the cost? Sure, minimize the hardware and CPU sockets, and choose the best bang for the buck. You know all the vendors and their technology. You know the pros and cons of each.

You see, it’s not your technical skills. It’s how you present your solution. Remember this?

“Customers want it good, cheap, and fast. Let them pick any 2”

In IaaS business, this translates into

  • Good = high performance, high availability, deep monitoring.
  • Cheap = low $$
  • Fast = soon. How fast you can get this service.

You want high performance at a cheap price? Wait until the next generation of Xeon and NVM arrives.

In IaaS, it is a service. Customers should not care about the underlying hardware model and architecture. Whether you’re using NSX or not, they should not and do not care.

So, present the following table. It provides 4 sample tiers for CIO & customers to choose from. Tell them the hardware & software are identical.

You should always start your presentation by explaining Tier 1. That’s the tier they expect for performance. They want it as good as physical. Give customers what they want to hear, else they will go to someone else’s cloud (e.g. Amazon or Azure).

Tier 1 sports a performance guarantee. This is only possible because you do not overcommit. To the VM, it’s as good as running alone in the box. No contention. There is no need for reservation, and every VM can run at 100% all day long.

What’s the catch?

Obviously, just like a First Class seat, Tier 1 is expensive. It’s suitable only for latency-sensitive apps.

Show them the price for Tier 1. If they are happy, end of discussion. You architect for Tier 1, as that’s the requirement. If your customers want to fly first class, then you should not stop them.

What if VM Owners want something much cheaper, and don’t mind a small drop in performance?

You then offer Tier 2 & Tier 3. Explain that you can cut the cost down to any discount they want, but the overcommit ratio has to match it. If they want a 50% discount, then it’s 2:1 overcommit. If they want a 67% discount, then it’s 3:1 overcommit. It’s that simple.
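The rule above boils down to discount = 1 − 1/ratio. Here is a minimal Python sketch of that arithmetic. The function names are my own, purely for illustration, and it assumes price scales inversely with the overcommit ratio and nothing else.

```python
def discount_for_overcommit(ratio: float) -> float:
    """Return the price discount (0..1) implied by an N:1 overcommit ratio."""
    if ratio < 1:
        raise ValueError("overcommit ratio must be at least 1 (1:1 = no overcommit)")
    return 1 - 1 / ratio


def overcommit_for_discount(discount: float) -> float:
    """Return the overcommit ratio needed to fund a discount in the range [0, 1)."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 / (1 - discount)


# The three ratios discussed in the text: no overcommit, 2:1 and 3:1.
for ratio in (1, 2, 3):
    print(f"{ratio}:1 overcommit -> {discount_for_overcommit(ratio):.0%} discount")
```

Going the other way, `overcommit_for_discount(0.5)` gives the 2:1 ratio you would quote to a customer asking for half price.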

Any fresh IT graduate can do the above 🙂 There is no need for a seasoned IT professional with 1-2+ decades of combat experience.

Your professionalism comes in here: the performance does not drop as much as the price. You can offer a 50% discount at a performance drop of less than 50%.

How is that possible?

2 reasons impacting Demand: VM Size and VM Utilization.

You control the VM size. By not having monster VMs in the lower tiers, the IaaS has a higher chance of giving good performance for everyone.

BTW, this is your solution to avoid over-provisioning to begin with.

From experience, we know VMs don’t run at 100% most of the time. This utilization + size helps deliver a good performance.

So we know that at 2:1 overcommit, the performance degradation will not be 50%. But what will it be? 10%? 30%?

BTW, 10% means that the resource is not available immediately 10% of the time. It does not mean that it’s never available. It’s just that there is a latency in getting the resource (e.g. CPU).

We can’t predict what the degradation will be, as it depends on the total utilization of the VMs, which is not in your control. However, we can monitor the degradation experienced by each VM.

This is where you tell CIO: “What is your comfort level?”

Now, we don’t know the impact to the application when there is latency in infrastructure. That depends on the application. Even on the same identical software, e.g. SQL Server 2016, the impact may differ as it depends on how you use that software. Different nature of business workload (e.g. batch vs OLTP) gets impacted differently even on the identical version of the software.

The good thing is we’re not measuring the application. We are measuring infrastructure. Infra that takes the shape of a service (meaning VM owners don’t really care about the spec) cannot be measured by the hardware spec, as that’s irrelevant. So you track how well a VM is served by the IaaS instead.

For example, for a Tier 1 VM, what the VM gets will be very close to what it wants. CPU Contention will be below 0.3%, while Memory Contention will simply be 0%. Disk Latency may be 5 ms (you need to specify it as it can’t be 0 ms).

A Tier 3 VM, on the other hand, will have a worse SLA. The CPU Contention may be 15% (you decide with the CIO), and the Disk Latency may be 40 ms (again, this is a business decision).

An SLA isn’t useful if it’s not tracked per VM. You track it every 5 minutes for every single VM. This is part of Operationalize Your World.
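To make the per-VM tracking concrete, here is a small Python sketch that checks one 5-minute sample against a tier’s SLA. The thresholds are the example numbers from the text. The class, the function and the way the sample is passed in are my own assumptions, not a vR Ops API, and I treat each threshold as an inclusive upper limit. The text gives no memory threshold for Tier 3, so that one is left unset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TierSla:
    max_cpu_contention_pct: float
    max_disk_latency_ms: float
    max_mem_contention_pct: Optional[float] = None  # None = not specified in the text


# Example thresholds from the text; agree on your own numbers with the CIO.
SLA = {
    "Tier 1": TierSla(max_cpu_contention_pct=0.3, max_disk_latency_ms=5.0,
                      max_mem_contention_pct=0.0),
    "Tier 3": TierSla(max_cpu_contention_pct=15.0, max_disk_latency_ms=40.0),
}


def sla_breaches(tier: str, cpu_contention_pct: float,
                 mem_contention_pct: float, disk_latency_ms: float) -> List[str]:
    """Return the list of SLA breaches for one VM in one 5-minute sample."""
    sla = SLA[tier]
    breaches = []
    if cpu_contention_pct > sla.max_cpu_contention_pct:
        breaches.append("cpu")
    if (sla.max_mem_contention_pct is not None
            and mem_contention_pct > sla.max_mem_contention_pct):
        breaches.append("memory")
    if disk_latency_ms > sla.max_disk_latency_ms:
        breaches.append("disk")
    return breaches


# The same sample breaches Tier 1 on CPU but is within the Tier 3 SLA.
print(sla_breaches("Tier 1", cpu_contention_pct=2.0, mem_contention_pct=0.0, disk_latency_ms=4.0))
print(sla_breaches("Tier 3", cpu_contention_pct=2.0, mem_contention_pct=0.0, disk_latency_ms=4.0))
```

Run it for every VM on every collection cycle, and the share of samples with an empty list is one simple way to report how well each VM was served.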

I talked earlier about controlling the demand by limiting the VM size. You specify the VM size limit for each tier. For example, a VM in Tier 2 cannot span a physical socket, because a VM that does will impact the hypervisor scheduler. Customers who want a monster VM are more than welcome to move to a higher tier. You do not act as gatekeeper or government. It’s their money and they don’t appreciate you playing parent.

How do you encourage right-sizing?

By money.

Not by “we try to save company money” motherhood talk. This is business, and both the Apps team and the Infra team are professionals. Avoid playing the government in internal IT. The Application Team expects to be treated as customers, not just colleagues.

Develop a pricing model that is compelling for them to go small. Use this as best practice:

The above uses Discount and Tax. As a result, it’s much cheaper to go small. A 32 vCPU VM costs 32x of 4 vCPU, not 8x.

The above gets you the model. How about the actual price?

In business, there is no point if you can’t put a price.

Bad news, your price has been set by leading cloud players (AWS & Azure). It’s a commodity business using commodity software and hardware. The price of DC co-location and admin salary are pretty similar too.

All this means the price can’t differ that much. Using the airline analogy, prices among airlines are similar too.

Here is the price from AWS (as at 31 July 2017).

I use the M4 series as it gives a balanced CPU & RAM combination. Other series are cheaper, but they use older hardware and do not provide a balanced combination.

From the above, I take the 1-year contract, 100% paid up front for comparison. In Enterprise IT, you may get budget annually and the budget can be transferred up front at the start of fiscal year.

The price above covers Compute only. It excludes Storage and Network, as well as Support, Monitoring, Reporting, Guest OS updates, Backup and Security.

It includes: DC facility rent, IaaS software + maintenance.

How do you calculate your price from the above? Take a comparison and make it apples to apples.

I took 50 VMs with 4 vCPU and 25 VMs with 8 vCPU, and calculated the 3-year price.

To convert to private cloud, use 2:1 overcommit. AWS counts the HT as core.

Based on the above, you can see the price of AWS is high, as it is > $100K per ESXi host.

To determine your VM cost, you start by determining your total cost. I put half of the AWS price, as I still think it’s reasonable. $396K for 7 ESXi hosts still gives you room for IT team salary, DC colocation, etc.
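As a rough illustration of how the total turns into a per-VM price, here is a Python sketch using the numbers in the text (50 VMs at 4 vCPU, 25 VMs at 8 vCPU, $396K over 3 years for 7 hosts). It assumes a flat split per vCPU, which is my simplification: it ignores RAM and any discount or tax you apply per VM size.

```python
# Numbers taken from the text; the flat per-vCPU split is an assumption.
VM_MIX = {4: 50, 8: 25}      # vCPU per VM -> number of VMs
TOTAL_COST_3Y = 396_000      # USD over 3 years, for all hosts
HOSTS = 7
YEARS = 3

total_vcpu = sum(size * count for size, count in VM_MIX.items())
cost_per_host = TOTAL_COST_3Y / HOSTS
cost_per_vcpu_year = TOTAL_COST_3Y / total_vcpu / YEARS

print(f"Total vCPU sold: {total_vcpu}")
print(f"Cost per host over {YEARS} years: ${cost_per_host:,.0f}")
for size in VM_MIX:
    print(f"{size} vCPU VM: ${size * cost_per_vcpu_year:,.0f} per year")
```

Swap in your own VM mix, host count and total cost. Once you add a size-based discount or tax, the per-vCPU rate stops being flat, so treat this only as the starting point.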

The above gives you your price for the equivalent AWS or Azure machine.

You should run your own calculation. Use your own overcommit ratio and VM size.

Once done, you should take this price to your internal customers. Have a Pricing discussion. When you order food at restaurant, does the price matter to you?

As the architect of the platform, you know the value of your creation best.

I hope this blog gives you food for thought. Capacity Management does not start when the VM is deployed, or when the hardware is purchased. It starts much earlier than that. Go back to that time, as that’s how you can get out of the never-ending right-sizing and arguments over capacity.

VM Availability Monitoring

VM Availability is a common requirement, as the IaaS team is bound by an Availability SLA. The challenge is reporting this.

The up time of a VM is more complex than that of a physical machine. Just because the VM is powered on does not mean the Guest OS is up and running. The VM could be stuck at BIOS, Windows could hit a BSOD, or the Guest OS could simply hang. This means we need to check the Guest OS. If we have VMware Tools, we can check for a heartbeat. But what if VMware Tools is not running or not even installed? Then we need to check for signs of life. Does the VM generate network packets, issue disk IOPS, consume RAM?

Another challenge is the frequency of reporting. If you report every 5 minutes, what if the VM was rebooted within those 5 minutes, and it comes back up before the 5th minute ends? You will miss the fact that it was down within those 5 minutes!

From the above, we can build a logic:

If VM Powered Off then
   Return 0. VM is definitely down.
Else
   Calculate up time within the 300 seconds period

In the above logic, to calculate the up time, we need first to decide if the Guest OS is indeed up, since the VM is powered on.

We can deduce that the Guest OS is up if it’s showing any sign of life. We can use:

  • Heartbeat
  • RAM Usage
  • Network Usage
  • Disk IOPS

Can you guess why we can’t use CPU Usage?

A VM does consume CPU even when it’s stuck at BIOS! We need a counter that shows 0, and not a very low number, when the Guest OS is down. An idle VM is up, not down.

So we need to know if the Guest OS is up or down. We are expecting binary, 1 or 0. Can you see the challenge here?

Yes, none of the counters above is giving you binary. Disk IOPS for example, can vary from 0.01 to 10000. The “sign of life” is not coming as binary.

We need to convert them into 0 or 1. 0 is the easy part, as they will be 0 if they are down.

I’d take Network Usage as example.

  • What if Network Usage is >1? We can use Min (Network Usage, 1) to return 1.
  • What if Network Usage is <1? We can use Round up (Network Usage, 1) to return 1.

So we can combine the above formula to get us 0 or 1.
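Here is a minimal Python sketch of that combination: round up first, then cap at 1. The function name is mine, and it assumes the counter is never negative.

```python
import math


def sign_of_life(value: float) -> int:
    """Collapse a non-negative counter into 0 (no activity) or 1 (any activity)."""
    # ceil() lifts tiny values such as 0.01 up to 1 but leaves 0 at 0;
    # min(..., 1) then caps large values such as 10000 at 1.
    return min(math.ceil(value), 1)


for usage in (0, 0.01, 1, 10000):
    print(usage, "->", sign_of_life(usage))
```

The order matters less than it looks: capping first and rounding up second gives the same 0 or 1.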

The last part is to account for partial up time, when the VM was rebooted within the 300-second sampling period. The good thing is vR Ops tracks the OS up time every second. So every 5 minutes, the value goes up by 300 seconds. As a VM normally runs for >5 minutes, you end up with a very large number. Our formula becomes:

If the up time is >300 seconds then return 300 else return it as it is.


Let’s now put the formula together. Here is the logical formula:

(Is VM Powered on?) x 
(Is Guest OS up?) x 
(period it is up within 300 seconds)

“Is VM Powered on” returns 0 or 1. This is perfect, as the whole formula returns 0 when the VM is powered off.

“Is Guest OS up” returns 0 or 1. It returns 1 if there is any sign of life.

We take the Maximum of OS Uptime, Tools Heartbeat, RAM Usage, Disk Usage, and Network Usage. If any of these is not 0, then the Guest OS is up.

We use Minimum (Is Guest OS up, 1) to bring down the number to 1.

Since the VM can be idle, we use Round Up to 1. This will round up 0.0001 to 1 but not round up 0 to 1.

To determine how long the VM is up within the 300 seconds, we simply take Minimum (OS Uptime, 300)

To convert the number into a percentage, we simply divide by 300 then multiply by 100.

Here is what the formula looks like

Can you write the above formula differently? Yes, you can use If Then Else. I do not use it as it makes the formula harder to read. It’s also more resource intensive.
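For readers who prefer code to a formula, here is the logical formula as a Python sketch. This is not vR Ops super metric syntax. The parameter names and units are my own assumptions, and all counters are assumed to be non-negative numbers sampled for the same 5-minute period.

```python
import math


def vm_uptime_pct(powered_on: int, os_uptime_s: float, tools_heartbeat: float,
                  ram_usage: float, disk_usage: float, network_usage: float,
                  period_s: int = 300) -> float:
    """Percentage of the sampling period during which the VM was up."""
    # "Is Guest OS up": 1 if any counter shows a sign of life, else 0.
    busiest = max(os_uptime_s, tools_heartbeat, ram_usage, disk_usage, network_usage)
    guest_os_up = min(math.ceil(busiest), 1)
    # Seconds up within the period, capped at the period length.
    seconds_up = min(os_uptime_s, period_s)
    return powered_on * guest_os_up * seconds_up * 100 / period_s


# Powered off: the first term zeroes everything.
print(vm_uptime_pct(0, 7368269142, 1, 10, 5, 3))
# Up for the whole period.
print(vm_uptime_pct(1, 7368269142, 1, 10, 5, 3))
# Rebooted and came back 150 seconds before the sample was taken.
print(vm_uptime_pct(1, 150, 1, 10, 5, 3))
```

Multiplying the three terms is what lets the formula avoid If Then Else: any term at 0 forces the result to 0.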

Let’s now show the above formula using the actual vR Ops super metric formula. I’ve simplified the last bit to /3. No point multiplying by 100 then dividing by 300 😉

I’m using $This feature as the formula is referring to the VM itself. I’m using metric= and not attribute= as I only need 1 value.


Let’s now take a few scenarios and run through the formula.

Scenario 1: the VM is powered off.

  • This will return 0 since the first result is already 0 and multiplying 0 with anything will give 0.

Scenario 2: ideal scenario. The VM is up, active and has VMware Tools

  • The OS Uptime metric, since it’s in seconds and it’s cumulative, will be much larger than the other counters. The Max () will return 7368269142, but Min (7368269142, 1) will return 1.
  • The Min (300, 7368269142) will return 300.
  • So the result is 1 x 1 x 300 / 3 = 100. The VM Uptime for that 5 minute period is 100%.

Scenario 3: not ideal scenario. The VM is up, but is idle and has no VMware Tools


Let’s show an example of how the super metric detects the availability issue. Here is a VM that has availability problem. In this case, the VM was rebooted regularly.

The VM Uptime (%) super metric reports it correctly. It’s 100% when it’s up and 0% when it’s down. The super metric matches the Powered On metric and the OS Uptime metric.

Let’s check if the super metric detects the up time within the 5 minutes. To do that, we can zoom into the time it was down. From the Powered On and OS Uptime metric, we can see it’s down for around 10 – 15 minutes. The super metric detects that. The up time went down to 0 for 10 minutes, then partially up in the last 5 minutes.

Here is the uptime in the last 5 minutes. So it came back up within these 5 minutes.


The limitation is within a single 5-minute period. The OS has to be up at the 5th minute. If the number is 0, it will be calculated as 0. So if the VM is up for 4:59 minutes, then goes down exactly at the 5th minute, Powered On will return 0.

Which VMs need more resources?

You can add the following resources to a VM:

  • CPU
  • RAM
  • Storage

Network isn’t something you can add, but you know that already 🙂

You can check which VMs need more resources by building a dashboard like the one below. It’s a simple dashboard, which you can customize and enhance. It lets you review each resource independently.

I’ve marked the above dashboard with numbers, so we can refer to them:

  1. This is a table that lists all VMs. It’s sorted by the highest 1-hour average of CPU Demand. The table also lists the VM CPU and RAM configuration, so you can see if the VMs are small or large. It also shows the cluster in which the VMs are located. I’m showing both CPU and RAM in a single table. You can clone the view and split them if that suits your operations better.
  2. This is a table that lists all VMs, but focusing on storage only. With storage, we do not have the complexity of checking peak utilisation. We simply need to check the present situation.
  3. This lists the Top-15 VMs with the highest CPU Demand and RAM Demand in a given period. The list is now split, as they can be different VMs. Do note that the Top-N widget will average the number over the selected period. A VM with a cyclical workload may not show up. The Top-N is complemented with a distribution chart. Select a VM from the Top-N, and you can see where the VM utilisation is.
  4. The distribution chart helps you see if the VM is really under-resourced or not. The 95th percentile is marked with a vertical green line. For an under-resourced VM, you expect that line to be at 100%, indicating that the VM hits 100% utilisation frequently. If the 95th percentile is at a low number, and you do not see the number 100 on the x-axis, the VM is not under-resourced.
  5. Storage is easier, as we can simply use the last data. As a result, we can show a distribution of all the VMs. We use a heat map as it can show 2 dimensions. Every VM is represented as a box. The bigger the box, the more storage the VM is configured with. The color indicates whether the VM uses it.
    • 0% = Black. Wastage
    • 10% = Green. Balanced usage
    • 100% = Red. Need more space!
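If you want to reproduce the 95th percentile check from the distribution chart outside the dashboard, here is a Python sketch. It uses the simple nearest-rank method, which may differ slightly from how the widget calculates it. The function names and the sample data are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_under_resourced(utilisation_pct: Sequence[float], threshold: float = 100.0) -> bool:
    """True if the 95th percentile of utilisation reaches the threshold."""
    return percentile(utilisation_pct, 95) >= threshold


# Hypothetical samples: one VM pinned at 100% half the time, one with a single spike.
busy_vm = [100.0] * 10 + [60.0] * 10
quiet_vm = [20.0] * 19 + [100.0]
print(is_under_resourced(busy_vm))
print(is_under_resourced(quiet_vm))
```

The single spike in the second VM is exactly the kind of noise the 95th percentile is meant to filter out.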

The CPU and RAM views have limitations. For example, they may show high utilisation during AV backup. You want to ignore those periods. At this moment, the only way is to plot the high usage over a line chart. We use Log Insight for this. The chart below shows VMs that hit high CPU usage in a given period. Every time a VM hits high CPU usage, it will show up here. As you can see, there are only 4 VMs that hit high CPU usage. All other VMs do not need more CPU.

The above is an example from a healthy environment. What about an environment where a lot of VMs are under-sized? You expect to see lots of alarms! That’s what you have below.

Hope the above is useful. If not, drop me an email.