When you migrate your customers’ workloads to another infrastructure, the onus is on you to prove that you’re not causing problems for their VMs or applications. This is especially true if the migration is your idea and you’re not giving them a choice.
There are many examples of migration. Popular ones are:
- From old DC to new DC.
- From on-prem to VMC.
- From on-prem to Cloud. This is typically vSphere to vSphere, as you can simply move the VM without changing it.
In the above, you typically change all the infrastructure: new servers, new network, new storage, new vSphere. You may virtualize the network by adding NSX. You may also virtualize storage by moving to vSAN.
Regardless, your Application Team does not and should not care. It’s transparent to them. In fact, it should be better as you’re using faster and bigger hardware. You have more CPU cores, faster RAM, faster storage, a bigger network, fewer network hops, etc.
And that’s exactly where the problem might start 😉
A VM that takes 8 hours to complete its batch job may now take 2 hours. It completes the same amount of work, doing the same total disk, network, CPU and RAM activity, in a quarter of the time.
So what happens to the VM IOPS? Yes, it goes up 4x (a 300% increase).
What happens to VM CPU Usage? It also goes up 4x. It has to, as the VM executes the same amount of logic in a quarter of the time. Suddenly, a VM that ran relatively idle at 20% becomes highly utilized at 80%.
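The arithmetic behind the batch job example can be sketched in a few lines of Python. This is only an illustration: it assumes the total work is unchanged and that each unit of work costs the same on the new hardware, so the average rate scales with the ratio of old to new duration. The function name `scale_rate` and the starting figure of 500 IOPS are hypothetical, chosen for illustration.

```python
def scale_rate(old_rate: float, old_hours: float, new_hours: float) -> float:
    """Average rate needed to finish the same total work in a different duration.

    Assumption: total work is constant, so rate * duration stays the same.
    """
    if new_hours <= 0:
        raise ValueError("new_hours must be positive")
    return old_rate * old_hours / new_hours


old_hours, new_hours = 8, 2   # batch job duration before and after migration
old_cpu_pct = 20.0            # average CPU usage before migration (from the example)
old_iops = 500.0              # hypothetical average IOPS before migration

new_cpu_pct = scale_rate(old_cpu_pct, old_hours, new_hours)
new_iops = scale_rate(old_iops, old_hours, new_hours)

print(f"CPU usage: {old_cpu_pct}% -> {new_cpu_pct}%")
print(f"IOPS: {old_iops} -> {new_iops}")
```

In practice the new CPUs are usually faster per core, so the real increase in CPU usage may be smaller than this simple ratio suggests. The point is the direction: a shorter run means a busier VM while it runs.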
All of the above is fine, if not for the next factor. Can you guess what it is?
Hint: it’s how you justify the budget to your management.
Yes, you promised higher consolidation. You have more CPU cores and more RAM, so logically you use a higher overcommit ratio. As Mark said, use it carefully.
Since you have to increase the overcommit ratio, how do you then prove that performance will not be affected as you drive utilization higher?
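To make the consolidation promise concrete, here is a minimal sketch of how a CPU overcommit ratio is commonly expressed: total vCPUs allocated to VMs divided by physical cores in the cluster. The function name and all the numbers are hypothetical, used only to show how the ratio can climb after a migration even though the new cluster has more cores.

```python
def cpu_overcommit_ratio(vcpus_allocated: int, physical_cores: int) -> float:
    """vCPU to physical core ratio for a cluster (hypothetical helper)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return vcpus_allocated / physical_cores


# Hypothetical figures: the new cluster has more cores, but hosts far more VMs.
old_ratio = cpu_overcommit_ratio(vcpus_allocated=256, physical_cores=128)
new_ratio = cpu_overcommit_ratio(vcpus_allocated=768, physical_cores=192)

print(f"Old cluster: {old_ratio:.1f}:1, new cluster: {new_ratio:.1f}:1")
```

The ratio alone says nothing about whether VMs are actually suffering. It only tells you how much harder the provider layer is being pushed, which is why it needs to be paired with performance KPIs.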
The answer is to look at which KPIs can impact a VM’s performance. The article here provides the answer. A VM Owner looks at her VM performance, not your IaaS utilization.
The above is for a VM. It does not answer how the IaaS platform copes. This is where the Cluster KPI comes in.
With the above 2 dashboards, you can monitor and prove both the consumer layer (VM) and provider layer (Infra).