Author Archives: Iwan Rahabok

About Iwan Rahabok

A father of 2 little girls, my pride and joy. The youngest one quickly said "I'm the joy!"

Complaint-based Operations

How do you know that the IaaS Platform (be it on-prem or in the cloud) is serving its workload well? If you depend on complaints to find out, then you are running complaint-based operations.

Changing from reactive to proactive is unfortunately a complex undertaking, especially in a large organisation where there are many roles and people. It requires transformation and resetting behaviour. It is not easy to get customers to agree on an SLA when you’ve promised them “good” for years.

So what can you do?

  1. Measure your actual performance
  2. Improve it if the reality is not what you expect from a decent IaaS. If there are no complaints, even better: you do not have to panic and rush the improvement.
  3. Get buy-in from your management on the SLA. Establish this as the Internal SLA.

The diagram below shows how the Internal SLA represents the intermediate step. The Formal SLA is typically less stringent, giving yourself a buffer.

You do not need the following in this intermediate step:

  • Class of Service. You don’t have to have Gold, Silver, etc. tiers. You can keep your mixed workloads in the same clusters or datastores. This means that in vR Ops you do not need to create a policy. Just use the base, active policy. This simplifies adoption.
  • Per VM SLA. You’re measuring the Infra for now. If you mix workloads, then a per VM SLA is impossible to achieve.

Since you’re measuring just the infra, it’s a lot easier to implement. You start by establishing the KPI of a cluster. Below is my recommendation.

The above has Disk, hence there is no need to look at Datastore. You just need to focus on the cluster object. This also removes complexity when a cluster has multiple datastores, and vice versa.

A Cluster KPI is simply the average of the above metrics. Now that you have a single metric for a cluster that combines its Performance and Utilization, you can aggregate. I’d recommend you count the number of clusters whose KPI falls into the Red zone. If your environment is healthy, then the count will be 0. You can repeat this check every 5 minutes. In a performing IaaS, you will get a flat line as shown below. Since it’s a trend line, you can see the performance over time, giving you insight into whether there is a pattern. Since you’re counting the number of clusters in the red zone, this metric scales across thousands of clusters. You certainly expect all your clusters to perform!
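The averaging and counting described above can be sketched in a few lines of Python. This is only an illustration outside vR Ops, not the actual super metric. The metric names, the 0 to 100 scoring where higher is healthier, and the `RED_THRESHOLD` value are all assumptions made for the example, and the function names are hypothetical.

```python
from statistics import mean

# Assumption: each metric is already normalised to a 0-100 score (higher is healthier).
# Assumption: the red zone starts below 25; the real threshold is whatever you define.
RED_THRESHOLD = 25.0


def cluster_kpi(metrics: dict) -> float:
    """Cluster KPI = simple average of the cluster's metric scores."""
    return mean(metrics.values())


def count_red_clusters(clusters: dict, threshold: float = RED_THRESHOLD) -> int:
    """Number of clusters whose KPI falls into the red zone (0 when all is healthy)."""
    return sum(1 for m in clusters.values() if cluster_kpi(m) < threshold)


def worst_first(clusters: dict) -> list:
    """(cluster name, KPI) pairs sorted with the worst performing cluster first."""
    return sorted(
        ((name, cluster_kpi(m)) for name, m in clusters.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical sample data for two clusters.
sample = {
    "cluster-a": {"cpu": 90.0, "memory": 80.0, "disk": 85.0, "network": 95.0},
    "cluster-b": {"cpu": 10.0, "memory": 20.0, "disk": 30.0, "network": 20.0},
}

red_count = count_red_clusters(sample)
ranking = worst_first(sample)
```

Running `count_red_clusters` on every collection cycle gives the single trend line, while `worst_first` corresponds to the table sorted by the worst performing cluster.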

The table complements the line chart. It lists each cluster, sorted with the worst performing cluster first. You can click into a cluster and drill down into its KPI, if you customize the cluster summary page.


The above requires a super metric, my favourite feature in vR Ops. Here is what it looks like:

I removed the disk metrics as I want to avoid double counting when used with vSAN. I also removed 1 network metric.

Below is the result. I did a preview on one of the clusters to show you what it looks like.

Hope it helps you take the first step toward proactive operations.

vSAN Stretched Clusters dashboard

Thanks for all the feedback, and apologies for the delay. Stretched vSAN requires a side-by-side comparison so we can easily see if there is an imbalance.

The dashboard only shows stretched vSAN. There is a filter set on the first table.

You select the cluster you want to see, and it will automatically show the rest.

It’s a simple dashboard, focused on giving overall information. It does not use Group, Super Metric or Policy. You can certainly enhance it to add details. For example, you can create a drill-down into ESXi Host, Disk Group, etc. See this for example.

You can download it from here.

VMworld 2018 presentations

As requested, you can find the decks here. They are in PowerPoint, not PDF.

The first session was Operationalize Your World, where you learn about transforming from reactive, complaint-based operations to proactive, insight-based operations. It uses Performance SLA and KPI. You get KPIs defined for VM, Cluster and Multi-Tier Applications. Below is an example of how a VM KPI is calculated.

The second session was a vSphere counters deep dive. It does not duplicate information generally covered in vSphere documentation or whitepapers. For example, I explained that measuring memory is hard. MS Windows actually includes non-active pages in its In Use counter. This explains why the vCenter VM Active RAM counter is lower.

Hope you found the 2 sessions useful!