
Capacity Management: it’s not what you think!

If you struggle with Capacity Management, then you've approached it with the wrong understanding. The issue is not with your technical skills. The issue is that you don't look at it from your customers' viewpoint.

Let’s check your technical skills if you don’t trust me 😊

  1. Can you architect a cluster where the performance matches physical? Easy, just don't overcommit.
  2. Can you architect a cluster that can handle monster VMs? Easy, just get lots of cores per socket.
  3. Can you architect for very high availability? Easy, just have more HA hosts, higher vSAN FTT, and failure domains.
  4. Can you architect a cluster that can run lots of VMs? Easy, just get lots of big hosts.
  5. Can you optimize the performance? Sure, follow performance best practices and configure for performance.
  6. Can you squeeze the cost? Sure, minimize the hardware and CPU sockets, and choose the best bang for the buck. You know all the vendors and their technology. You know the pros and cons of each.

You see, it’s not your technical skills. It’s how you present your solution. Remember this?

“Customers want it good, cheap, and fast. Let them pick any 2”

In the IaaS business, this translates into:

  • Good = high performance, high availability, deep monitoring.
  • Cheap = low $$.
  • Fast = soon. How quickly you can deliver the service.

You want high performance at a cheap price? Wait until the next generation of Xeon and NVM arrives.

IaaS is a service. Customers should not care about the underlying hardware model and architecture. Whether you're using NSX or not, they should not and do not care.

So, present the following table. It provides 4 sample tiers for the CIO & customers to choose from. Tell them the underlying hardware & software are identical across tiers.

You should always start your presentation by explaining Tier 1. That's the tier they expect for performance. They want it as good as physical. Give customers what they want to hear, or they will go to someone else's cloud (e.g. Amazon or Azure).

Tier 1 sports a performance guarantee. This is only possible because you do not overcommit. To the VM, it's as good as running alone in the box. No contention. There is no need for reservations, and every VM can run at 100% all day long.

What’s the catch?

Obviously, just like a First Class seat, Tier 1 is expensive. It's suitable only for latency-sensitive apps.

Show them the price for Tier 1. If they are happy, end of discussion. You architect for Tier 1, as that's the requirement. If your customers want to fly first class, you should not stop them.

What if VM owners want something much cheaper, and don't mind a small drop in performance?

You then offer Tier 2 & Tier 3. Explain that you can cut the cost to any discount they want, but you need to match it with the overcommitment. If they want a 50% discount, then it's 2:1 overcommit. If they want a 67% discount, then it's 3:1 overcommit. It's that simple, as the sketch below shows.
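The mapping is simple arithmetic: at N:1 overcommit, N times the VMs share the same hardware, so each VM can be priced at 1/N of the Tier 1 price. A minimal sketch in Python (the $1,000/month Tier 1 price is just an illustration):

def tier_price(tier1_price: float, overcommit: float) -> float:
    # At N:1 overcommit, N times the VMs share the same hardware,
    # so each VM can be priced at 1/N of the Tier 1 price.
    return tier1_price / overcommit

for overcommit in (1, 2, 3, 4):
    price = tier_price(1000, overcommit)        # assume Tier 1 = $1,000/month
    discount = (1 - 1 / overcommit) * 100
    print(f"{overcommit}:1 -> ${price:,.0f}/month ({discount:.0f}% discount)")

This prints 2:1 at a 50% discount and 3:1 at a 67% discount, matching the numbers above.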

Any fresh IT graduate can do the above 🙂 No need for a seasoned IT professional with 1-2+ decades of combat experience.

Your professionalism comes in here: the performance does not drop as far as the discount. You can deliver a 50% discount with less than a 50% performance drop.

How is that possible?

Two factors impact Demand: VM Size and VM Utilization.

You control the VM size. By not having monster VMs in the lower tiers, the IaaS has a higher chance of delivering good performance for everyone.

BTW, this is your solution to avoid over-provisioning to begin with.

From experience, we know VMs don't run at 100% most of the time. This utilization, combined with the size limit, helps deliver good performance.
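To see why, here's a back-of-envelope sketch. The VM mix and utilization figures are made up for illustration:

# 60 VMs with typical utilization profiles; numbers are illustrative.
vms = [
    {"vcpu": 4, "avg_util": 0.25},   # typical app server
    {"vcpu": 8, "avg_util": 0.15},   # oversized, mostly idle
    {"vcpu": 2, "avg_util": 0.60},   # busy web server
] * 20

provisioned = sum(vm["vcpu"] for vm in vms)            # 280 vCPUs
physical_cores = provisioned / 2                       # 2:1 overcommit -> 140 cores
demand = sum(vm["vcpu"] * vm["avg_util"] for vm in vms)

print(f"{provisioned} vCPUs on {physical_cores:.0f} cores")
print(f"Expected demand: {demand:.0f} cores ({demand / physical_cores:.0%} of capacity)")

Even at 2:1 overcommit, expected demand lands at roughly half of physical capacity, which is why the performance drop is far smaller than the discount.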

So we know that at 2:1 overcommit, the performance degradation will not be 50%. But what will it be? 10%? 30%?

BTW, 10% means that the resource is not available immediately 10% of the time. It does not mean that it's never available. It just means there is latency in getting the resource (e.g. CPU).

We can’t predict what the degradation will be, as it depends on the total utilization of the VMs, which is not in your control. However, we can monitor the degradation experienced by each VM.

This is where you tell CIO: “What is your comfort level?”

Now, we don't know the impact on the application when there is latency in the infrastructure. That depends on the application. Even with identical software, e.g. SQL Server 2016, the impact may differ, as it depends on how you use that software. Different kinds of business workload (e.g. batch vs OLTP) are impacted differently, even on the identical version of the software.

The good thing is we're not measuring the application. We are measuring the infrastructure. Infra that takes the shape of a service (meaning VM owners don't really care about the spec) cannot be measured by the hardware spec, as that's irrelevant. So you track how well a VM is served by the IaaS instead.

For a Tier 1 VM, what the VM gets will be very close to what it wants. For example, CPU Contention will be below 0.3%, while Memory Contention will simply be 0%. Disk Latency may be 5 ms (you need to specify it, as it can't be 0 ms).

A Tier 3 VM, on the other hand, will have a worse SLA. The CPU Contention may be 15% (you decide with the CIO), and the Disk Latency may be 40 ms (again, this is a business decision).

An SLA isn’t useful if it’s not tracked per VM. You track it every 5 minutes for every single VM. This is part of Operationalize Your World.
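The check itself is trivial once the tier SLAs are defined. A minimal sketch follows; the Tier 1 thresholds are the examples above, while the Tier 3 memory threshold is an assumption, as only CPU and disk were specified:

# Per-tier SLA thresholds; the metric values per VM would come from your
# monitoring tool (e.g. vR Ops) every 5 minutes.
SLA = {
    "Tier 1": {"cpu_contention_pct": 0.3, "mem_contention_pct": 0.0, "disk_latency_ms": 5},
    "Tier 3": {"cpu_contention_pct": 15.0, "mem_contention_pct": 5.0, "disk_latency_ms": 40},  # mem value assumed
}

def sla_breaches(tier: str, sample: dict) -> list:
    # Return the metrics in this 5-minute sample that breach the tier's SLA.
    return [m for m, limit in SLA[tier].items() if sample[m] > limit]

sample = {"cpu_contention_pct": 1.2, "mem_contention_pct": 0.0, "disk_latency_ms": 3}
print(sla_breaches("Tier 1", sample))   # ['cpu_contention_pct'] -> breach
print(sla_breaches("Tier 3", sample))   # [] -> within SLA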

I talked earlier about controlling the demand by limiting the VM size. You specify a VM size limit for each tier. For example, no VM in Tier 2 may span a physical socket, as that impacts the hypervisor scheduler. Customers who want monster VMs are more than welcome to move to a higher tier. You do not act as gatekeeper or government. It's their money, and they don't appreciate you playing parent.

How do you encourage right-sizing?

By money.

Not by the "we try to save the company money" motherhood talk. This is business, and both the Apps team and the Infra team are professionals. Avoid playing the government in internal IT. The Application Team expects to be treated as customers, not just colleagues.

Develop a pricing model that makes it compelling for them to go small. Use this as a best practice:

The above uses a Discount and a Tax. As a result, it's much cheaper to go small. A 32 vCPU VM costs 32x the price of a 4 vCPU VM, not 8x.
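One way to implement the Discount and Tax is to scale the per-vCPU rate with VM size. A sketch, with purely illustrative brackets and rates, tuned so a 32 vCPU VM costs 32x a 4 vCPU VM:

BASE_RATE = 25.0   # $/vCPU/month at the smallest size (assumed figure)

def per_vcpu_rate(vcpu: int) -> float:
    if vcpu <= 4:
        return BASE_RATE            # discounted baseline for going small
    elif vcpu <= 8:
        return BASE_RATE * 1.5      # mild tax
    elif vcpu <= 16:
        return BASE_RATE * 2.5
    else:
        return BASE_RATE * 4.0      # heavy tax on monster VMs

def vm_price(vcpu: int) -> float:
    return vcpu * per_vcpu_rate(vcpu)

print(vm_price(4))    # 100.0
print(vm_price(32))   # 3200.0 -> 32x the 4 vCPU price, not 8x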

The above gets you the model. How about the actual price?

In business, there is no point if you can't put a price on it.

Bad news: your price has been set by the leading cloud players (AWS & Azure). It's a commodity business using commodity software and hardware. The prices of DC co-location and admin salaries are pretty similar too.

All this means your price can't differ that much. Using the airline analogy, prices among airlines are similar too.

Here is the price from AWS (as of 31 July 2017).

I use the M4 series as it gives a balance of CPU & RAM. Other series are cheaper, but they use older hardware and do not provide a balanced combination.

From the above, I take the 1-year contract, 100% paid up front, for comparison. In Enterprise IT, you may get your budget annually, and it can be transferred up front at the start of the fiscal year.

The price above covers Compute only. It excludes Storage and Network, as well as Support, Monitoring, Reporting, Guest OS updates, Backup, and Security.

It includes: DC facility rent, IaaS software + maintenance.

How do you calculate your price from the above? Take a comparison, and make it apples to apples.

I took 50 VMs with 4 vCPUs and 25 VMs with 8 vCPUs, and calculated the 3-year price.

To convert to private cloud, use 2:1 overcommit, as AWS counts each hyper-thread as a core.

Based on the above, you can see that the price of AWS is high, as it's more than $100K per ESXi host.

To determine your VM cost, you start by determining your total cost. I put it at half of AWS, as I think that's still reasonable: $396K for 7 ESXi hosts still gives you room for IT team salary, DC colocation, etc.
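Here is the back-of-envelope version of that calculation. The host spec is an assumption (2 sockets x 16 cores per ESXi host); since AWS counts each hyper-thread as a vCPU, a 2:1 vCPU:core overcommit makes the comparison apples to apples:

import math

vcpus_needed = 50 * 4 + 25 * 8                 # 400 vCPUs for the 75 VMs
cores_per_host = 2 * 16                        # assumed host spec
overcommit = 2                                 # 2:1 vCPU:core

hosts = math.ceil(vcpus_needed / (cores_per_host * overcommit))   # 7 hosts
total_cost = 396_000                           # 3-year budget, ~half the AWS price

print(f"{hosts} hosts, ${total_cost / hosts:,.0f} per host over 3 years")
print(f"${total_cost / 75:,.0f} per VM over 3 years")   # ~$5,280 per VM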

The above gives you your price for the equivalent AWS or Azure machine.

You should run your own calculation. Use your own overcommit ratio and VM size.

Once done, take this price to your internal customers and have a pricing discussion. When you order food at a restaurant, does the price matter to you?

As the architect of the platform, you know the value of your creation best.

I hope this blog gives you food for thought. Capacity Management does not start when the VM is deployed, or when the hardware was purchased. It starts much earlier than that. Go back to that point, as that's how you get out of the never-ending right-sizing and arguments over capacity.

Datastore Capacity Management

This post is part of the Operationalize Your World series. Do read that first to get the context.

This is the 2nd installment of Storage Capacity Management. The previous post covered overall storage capacity management, where you can see the big picture and know which datastores are low on capacity. This post drills further and lets you analyze a specific datastore.

Datastore capacity is driven by 2 factors:

  • Performance: If the datastore is unable to serve its existing VMs, are you going to add more VMs? You are right, the datastore is full, regardless of how much space it has left.
  • Utilization: How much capacity is left? Thin provisioning makes this challenging.

This is what the dashboard looks like.

You start by selecting a datastore you want to check. This step is actually optional, as you would have come from the overall dashboard.

When you select a datastore, its Performance and Utilization are automatically shown.

  • Performance
    • Both actual and SLA are shown.
    • You just need to ensure that actual does not breach SLA.
  • Utilization
    • This shows the total capacity, the provisioned capacity (configured to the VM), and what’s actually used (thin provisioned).
    • You want to be careful with thin provisioning, as the VMs can consume the space that's already allocated to them. The line chart has a 30-day projection to help you plan.

The 2 line charts are all you need. They are simple enough, yet detailed enough, and they give you room to make the judgement call. You can decide to ignore a spike because you knew it was a special event.

If you want to analyze further, you can look at the individual VMs. The heatmap shows the VM distribution. You can spot large VMs, as they appear bigger. You can see if any VM is running out of capacity, or if any VM is wasting its allocated capacity.

The heatmap configuration below shows how it’s done.

You can also check if there are VMs that you can delete. Reclamation gives you extra space. The heatmap has a filter for powered-off VMs, so only those are shown.

From there, you can drill further to check that the VM has indeed met your Powered Off definition. It shows the VM's powered-off time (%) over the past 30 days. I've set the threshold at 99%: green means the VM was powered off at least 99% of the past 30 days.

Logic

I hope you agree by now that datastore performance is measured by how well it serves its VMs. We can track this by plotting a line chart showing the maximum storage latency experienced by any VM in the datastore. This maximum has to stay below the SLA you promise at all times.

For Utilization, we will plot a line chart showing the disk capacity left in the datastore cluster.

You should be using Datastore Clusters. Besides their operational benefits, they also make capacity management easier:

  • You need not manually exclude local datastore.
  • You need not manually group the shared datastores, which can be complex if you have multiple clusters.

With vSAN, you only have 1 datastore per cluster and need not exclude local datastores manually. This means it’s even simpler in vSAN.

Include a buffer for snapshots. This can be 20%, depending on your environment. This is why I'm not a fan of many small datastores: you end up with pockets of unusable capacity. The buffer does not have to be hardcoded in your super metric, but you have to be mentally aware of it.
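If you want the buffer explicit in your planning, it's a one-liner (the 20% is the example figure above; adjust to your environment):

def usable_capacity_gb(total_gb: float, snapshot_buffer: float = 0.20) -> float:
    # Capacity you can actually provision once the snapshot buffer is set aside.
    return total_gb * (1 - snapshot_buffer)

print(usable_capacity_gb(10_240))   # a 10 TB datastore cluster -> 8192 GB usable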

Super Metrics

The screenshot below shows the super metric formula to get the maximum latency of all the VMs in the cluster. I've chosen the Virtual Disk level, so it does not matter whether it is VMFS, NFS, or vSAN.

super metric - vDisk

You can copy paste the formula below:

Max ( ${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=virtualDisk|totalLatency, depth=2 } )

The screenshot below shows the super metric formula to get the total disk capacity left in the cluster. This is based on the Thin Provisioning consumption.

You can copy paste the formula below:

sum( ${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|available_space, depth=1} )

For Thick Provision, use the following super metric:

super metric - Disk - space left in datastore cluster - thick

You can copy paste the formula below:

sum
(
${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|total_capacity, depth=1}
) -
sum
(
${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|consumer_provisioned, depth=1}
)

Hope you find it useful. Just in case you’re not aware, you don’t have to implement all these manually. You can import this dashboard, together with 50+ others, from this set.

vSphere Capacity Reclamation

This post continues from the Operationalize Your World post. Do read it first so you get the context.

There are 5 Reclamation levels you can work through. Start with the easiest one first.

Let’s go through the table above:

  • Non-VM is the easiest, because these objects are not owned by someone else. They are yours! Non-VM objects, such as templates and ISOs, should be kept in 1 datastore per physical location. Naturally, you can only reclaim Disk, not CPU & RAM.
  • Orphaned VMs and orphaned vmdks are next, as they are not even registered in vCenter. If they are, they may appear italicized, indicating something is wrong. They may not have owners either. Take note that vR Ops 6.4 cannot check for orphaned vmdks.
  • Powered Off VMs are harder, as there is an owner. You need to deal with the VM Owner before you delete them.
  • Idle VMs are a great target, as you can reclaim CPU and RAM when you power them off. You cannot reclaim disk yet, as you are not deleting them yet.
  • Active VMs are the hardest. Focus on large VMs. Take on CPU and RAM separately; it's easier to tackle when you split them. Divide and conquer.
  • Claiming CPU and RAM from small VMs can be futile, even if they are idle. An idle VM with 1 vCPU cannot be reduced further. It should be powered off. Powering them off first is the safer procedure: you can simply power a VM back on if it turns out to be used.
  • Snapshots. These are actually not as hard as CPU and RAM, hence in the actual dashboard we list them separately.

Why do cars have brakes?

So they can go faster!

Take advantage of Powered Off as the brake for your Idle VMs. If you treat Idle and Powered off as 1 continuum, you can power off the Idle VMs earlier. You get the benefit of CPU and RAM reclamation.

What value is considered Idle?

  • It has to be defined, so it’s measurable and not subjective. Declare it as a formal policy, so you don’t end up arguing with your customers.
  • The default setting in the vR Ops policy is CPU Demand = 100 MHz. A VM using 100 MHz or less of CPU will be considered Idle.
  • While a VM uses CPU, RAM, Disk, and Network, we only use CPU in the definition of Idle. There is no need to consider all 4 and state that all 4 must be idle, because they are inter-related. It takes CPU cycles to process network packets and perform disk activity. Data from the NIC and disk must also be copied to RAM, and the copying requires CPU cycles.
  • How long has it been under that threshold?
    VMs do not use CPU non-stop for months. There are times a VM is idle, and that's normal. A month-end VM that processes payroll can be idle for 29 days! The default value of 90% will miss this.

Because of these month-end VMs, I recommend you change the definition from 90% to 99%. Even 99% over 30 days can still wrongly mark an active VM as Idle: 1% active means it's active for a total of only about 7.2 hours (0.3 days) in 30 days. Notice it's a total, not one continuous stretch; it accumulates over the 30 days.

A VM that was idle for 30 days straight, then turns active, only needs about 8 hours of activity to be marked as non-idle. A VM that does not accumulate that much activity will obviously need more time. Also, the Idle decision logic runs only every 24 hours, so a VM may still be marked idle for days after it has gone active.
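For clarity, here is the idle test expressed in code. This is a minimal sketch of the logic described above, not vR Ops' actual implementation:

IDLE_THRESHOLD_MHZ = 100   # the policy's CPU Demand threshold
IDLE_PCT = 0.99            # the 99% recommendation above

def is_idle(cpu_demand_samples_mhz: list) -> bool:
    # Idle if at least 99% of the 30-day samples are at or below 100 MHz.
    idle = sum(1 for s in cpu_demand_samples_mhz if s <= IDLE_THRESHOLD_MHZ)
    return idle / len(cpu_demand_samples_mhz) >= IDLE_PCT

# 30 days of 5-minute samples = 8,640 samples; 1% = ~86 samples (~7.2 hours).
samples = [50.0] * 8_554 + [800.0] * 86   # right at the 1% boundary
print(is_idle(samples))                   # True; one more active sample flips it to False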

The drawback of setting it at 99% is that we wait the full 30 days before deciding. In some corner cases, the VM may never be marked as idle. Take this scenario:

  • A VM was active and served its purpose for months.
  • After 2 years, the application is decommissioned as a new version is released.
  • As a result, the VM goes idle; it is simply waiting to be deleted. But because we set the value at 99%, the logic will wait for the full 30 days before deciding.
  • The VM keeps consuming CPU/RAM during that period, as basic services like AV and OS patching still run. If this non-app workload adds up to more than 1% (roughly 7 hours) in 30 days, the VM will never be marked as Idle.

Solution: increase the threshold from 100 MHz to an amount you think is suitable. If possible, power off the VM if it's really not used.

Powered Off is simpler than Idle, as it’s binary.

Consider a VM that has been powered off for 15 of the last 30 days: once you power it on, it can take up to 15 more days before it's marked as Powered On again. This creates a problem, as it shows up as reclaimable when it's not.

Solution: add "Is it Powered On now?" into the formula. If a VM is running, it immediately stops being considered powered off.
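In code, the amended test looks like this (a sketch of the logic, not the product's implementation; the 50% is the policy value set below):

POWERED_OFF_PCT = 0.50   # powered-off threshold from the policy change below

def is_reclaimable(pct_off_last_30d: float, powered_on_now: bool) -> bool:
    if powered_on_now:
        return False     # running right now -> never treated as powered off
    return pct_off_last_30d >= POWERED_OFF_PCT

print(is_reclaimable(0.80, powered_on_now=True))    # False: it just woke up
print(is_reclaimable(0.80, powered_on_now=False))   # True: safe to reclaim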

This is where the setting is in vR Ops 6.4.

You need to modify the values in your active policy:

  • Change Idle from 90% to 99%
  • Change Powered Off from 90% to 50%

The above is the first of a set of vR Ops dashboards for Capacity Reclamation. I added a short Read Me for 2 reasons:

  • There are 4 dashboards:
    1. The dashboard above.
    2. Idle VMs and Powered Off VMs. See below.
    3. Active VM: CPU. See this.
    4. Active VM: RAM. See this.
  • Reclamation is quite complex when you look at the details. There are many things we can reclaim.

You can replace the Read Me widget with a picture if you know the target screen resolution. I didn't use an image, as it would make your import harder.

The above is the 2nd dashboard. It shows the Powered Off VMs and Idle VMs.

The summary at the top tells you how much you can reclaim. The table shows where you can reclaim it.

For the powered-off VMs, the widget gives the summary: how many VMs, and how much space. The table provides the details.

The numbers will not be identical, due to rounding: the summary is shown in TB while the table is in GB. Just in case you're wondering, 3.7 TB is the correct rounding for 3769.36 GB. There are 1024 GB in 1 TB, and 3769.36 / 1024 ≈ 3.68 TB, which is actually less than 3.7 TB.