VMC Migration: Before vs After comparison

When you migrate your customers’ workloads to another infrastructure, the onus is on you to prove that you’re not causing problems for the VMs or applications. This is especially true if it’s your idea to migrate, and you’re not giving them a choice.

There are many examples of migration. Popular ones are:

  • From old DC to new DC.
  • From on-prem to VMC.
  • From on-prem to Cloud. The target is typically vSphere-based, as you can simply move the VM without changing it.

In the above, you typically change all the infrastructure: new servers, new network, new storage, new vSphere. You may virtualize the network by adding NSX. You may also virtualize storage by moving to vSAN.

Regardless, your Application Team does not, and should not, care. It’s transparent to them. In fact, it should be better as you’re using faster and bigger hardware. You have more CPU cores, faster RAM, faster storage, more network bandwidth, fewer network hops, etc.

And that’s exactly where the problem might start 😉

A VM that took 8 hours to complete a batch job may now take 2 hours. It completes the same amount of work, consuming the same total Disk, Network, CPU and RAM, in a quarter of the time.

So what happens to the VM IOPS? Yes, it goes up 4x, to 400% of the old value.

What happens to VM CPU Usage? It also goes up 4x. It has to, as it completes the same amount of logic in a quarter of the time. Suddenly, a VM that ran relatively idle at 20% becomes highly utilized at 80%.
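The arithmetic behind this can be sketched in a few lines. The 8-hour and 2-hour durations and the 20% CPU figure come from the example above; the total I/O count and the function name are made up purely for illustration, and the CPU line follows the simplification that the job needs the same amount of CPU work on the new hardware.

```python
def average_rate(total_work: float, duration_hours: float) -> float:
    """Average work rate (e.g. I/O operations per hour) for a fixed amount of work."""
    return total_work / duration_hours

# Hypothetical batch job: the total work is identical before and after migration.
total_io = 28_800_000  # illustrative number of disk operations, not a measurement

rate_before = average_rate(total_io, 8)  # old infrastructure: 8 hours
rate_after = average_rate(total_io, 2)   # new infrastructure: 2 hours

speedup = rate_after / rate_before       # how many times the old rate
increase_pct = (speedup - 1) * 100       # increase relative to the old rate

print(f"rate is {speedup:.0f}x the old rate (an increase of {increase_pct:.0f}%)")

# Simplification from the text: same CPU work squeezed into a quarter of the time.
cpu_before_pct = 20
cpu_after_pct = cpu_before_pct * speedup
print(f"CPU usage: {cpu_before_pct}% -> {cpu_after_pct:.0f}%")
```

Note that running at 4x the old rate is an increase of 300% over it, which is worth keeping straight when you present before vs after numbers.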

All the above is fine, if not for the next factor. Can you guess what it is?

Hint: it’s how you justify the budget to your management.

Yes, you promised higher consolidation. You have more CPU cores and more RAM, so logically you use a higher over-commit ratio. As Mark said, use it carefully.

Since you have to increase the overcommit ratio, how do you then prove that performance will not be affected as you drive utilization higher?
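As a rough illustration of what a higher overcommit ratio means, the sketch below computes the CPU overcommit ratio as provisioned vCPUs divided by physical cores. The vCPU and core counts are invented for the example and do not describe any real environment.

```python
def cpu_overcommit_ratio(total_vcpus: int, physical_cores: int) -> float:
    """vCPU-to-physical-core ratio; 1.0 means no overcommit."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return total_vcpus / physical_cores

# Hypothetical numbers: the same VMs consolidated onto fewer physical cores.
old_ratio = cpu_overcommit_ratio(total_vcpus=640, physical_cores=320)
new_ratio = cpu_overcommit_ratio(total_vcpus=640, physical_cores=160)

print(f"old cluster: {old_ratio:.1f}:1, new cluster: {new_ratio:.1f}:1")
```

The ratio on its own says nothing about whether the VMs are being served well, which is why the performance KPIs matter.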

The answer is to look at which KPIs can impact a VM’s performance. The article here provides the answer. A VM Owner looks at her VM performance, not your IaaS utilization.

The above is for a VM. It does not answer how the IaaS platform copes. This is where the Cluster KPI comes in.

With the above 2 dashboards, you can monitor and prove both the consumer layer (VM) and provider layer (Infra).

Complaint-based Operations

How do you know that the IaaS Platform (be it on-prem or in the cloud) is serving its workload well? If you depend on complaints, then you run complaint-based operations.

Changing from reactive to proactive is unfortunately a complex undertaking, especially in large organisations where there are many roles and people. It requires transformation and resetting behaviour. It is not easy to get customers to agree on an SLA when you’ve promised them “good” for years.

So what can you do?

  1. Measure your actual performance
  2. Improve it if the reality is not what you expect from a decent IaaS. If there is no complaint, this is even better as it means you do not have to panic and rush the improvement.
  3. Get buy-in from your management on the SLA. Establish this as the Internal SLA.

The diagram below shows how the Internal SLA represents the intermediate step. The Formal SLA is typically less stringent, giving yourself a buffer.

You do not need the following in this intermediate step:

  • Class of Service. You don’t have to have Gold, Silver, etc. tiers. You can keep your mixed workloads in the same clusters or datastores. This means in vR Ops, you do not need to create a policy. Just use the base, active policy. This simplifies adoption.
  • Per VM SLA. You’re measuring the Infra for now. If you mix workloads, then a per-VM SLA is impossible to achieve.

Since you’re measuring just the infra, it’s a lot easier to implement. You start by establishing the KPI of a cluster. Below is my recommendation.

The above includes Disk, hence there is no need to look at Datastore. You just need to focus on the cluster object. This also removes complexity when a cluster has multiple datastores, and vice versa.

A Cluster KPI is simply the average of the above metrics. Now that you have a single metric for a cluster that combines its Performance and Utilization, you can aggregate. I’d recommend you count the number of clusters whose KPI falls into the Red zone. If your environment is healthy, the count will be 0. You can repeat this check every 5 minutes. In a performing IaaS, you will get a flat line as shown below. Since it’s a trend line, you can see the performance over time, giving you insight into whether there is a pattern. Since you’re counting the number of clusters in the red zone, this metric scales across thousands of clusters. You certainly expect all your clusters to perform!
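Outside of vR Ops, the averaging and counting logic can be sketched as below. This is not the super metric itself, only an illustration of the idea. The metric names, the 0 to 100 scoring where higher is healthier, and the red threshold of 25 are all assumptions made for the example; substitute whatever your own KPI definition uses.

```python
from statistics import mean
from typing import Mapping

# Assumption: every metric is already normalised to a 0-100 score (higher is
# healthier) and anything below 25 counts as red. Both are illustrative choices.
RED_THRESHOLD = 25.0

def cluster_kpi(metric_scores: Mapping[str, float]) -> float:
    """Cluster KPI as the plain average of its per-metric scores."""
    if not metric_scores:
        raise ValueError("need at least one metric score")
    return mean(metric_scores.values())

def count_red_clusters(clusters: Mapping[str, Mapping[str, float]],
                       threshold: float = RED_THRESHOLD) -> int:
    """Number of clusters whose KPI falls into the red zone."""
    return sum(1 for scores in clusters.values()
               if cluster_kpi(scores) < threshold)

# Hypothetical sample taken at one 5-minute interval.
clusters = {
    "cluster-a": {"cpu": 90.0, "ram": 80.0, "disk": 100.0, "network": 90.0},
    "cluster-b": {"cpu": 10.0, "ram": 30.0, "disk": 20.0, "network": 20.0},
}

kpi_a = cluster_kpi(clusters["cluster-a"])
kpi_b = cluster_kpi(clusters["cluster-b"])
red_count = count_red_clusters(clusters)
print(f"cluster-a KPI={kpi_a}, cluster-b KPI={kpi_b}, red clusters={red_count}")
```

Plotting `red_count` for every interval gives the trend line described above; a healthy environment stays flat at 0.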

The table complements the line chart. It lists each cluster, sorted with the worst performing cluster first. You can click into the cluster, and drill down into the cluster KPI, if you customize the cluster summary page.


The above requires a super metric, my favourite feature in vR Ops. Here is what it looks like:

I removed the disk metrics as I want to avoid double counting when used with vSAN. I also removed 1 network metric.

Below is a preview on one of the clusters, to show you what the result looks like.

Hope it helps you take the first step toward proactive operations.

vSAN Stretched Clusters dashboard

Thanks for all the feedback, and apologies for the delay. Stretched vSAN requires a side-by-side comparison so we can easily see if there is an imbalance.

The dashboard only shows stretched vSAN. There is a filter set on the first table.

You select the cluster you want to see, and it will automatically show the rest.

It’s a simple dashboard, focused on giving overall information. It does not use Group, Super Metric or Policy. You can certainly enhance it to add details. For example, you can create a drill-down into ESXi Host, Disk Group, etc. See this for example.

You can download it from here.