Operationalize Your World 7.5

Update 22 Oct 2019: various updates.

Thank you for all the feedback. This particular release took a while as I wanted a longer validation period and more testing by customers. This release leverages new metrics in vRealize Operations 7.5.

You’ll notice that the dashboards are neater. They are also visually consistent, making learning easier. The idea is all dashboards, and not just vSphere + vSAN dashboards, can adopt the design philosophy. After all, they all serve the 7 pillars of operations, which are:

  • Availability Management
  • Performance Management
  • Capacity Management
  • Configuration Management
  • Cost Management
  • Compliance Management
  • Inventory Management

This release also brings back SLA as you guys keep asking for it. However, I realized you’re happy with a single threshold, as it’s better than nothing and much easier for you to operationalize. Your dev/test environment is slower than your mission critical one, and this reality will be reflected because both are benchmarked against the same performance standard. So expect to see your cheaper environment show a lower Performance (%) value.

Let’s tour the dashboards, then we’ll talk about the import. 3 dashboards are provided for each object type:

  • Performance
  • Utilization
  • Capacity

Performance shows 1 day and not just the present data, as you need to see the pattern beyond 5 minutes. So the summary table shows the peak (read: worst) value in the last 24 hours. With this, you can simply check it once a day.
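
As a rough illustration of the "peak in the last 24 hours" idea, here is a minimal Python sketch. It is not how vRealize Operations computes the value internally; the function name, the sample layout, and the contention numbers are all hypothetical. It only shows why a once-a-day check of the 24-hour worst value catches a spike that the latest 5-minute sample would miss.

```python
from datetime import datetime, timedelta

def worst_in_last_24h(samples, now, higher_is_worse=True):
    """Return the worst value among samples taken in the 24 hours before `now`.

    samples: list of (timestamp, value) tuples (hypothetical layout).
    Returns None when no sample falls inside the window.
    """
    cutoff = now - timedelta(hours=24)
    recent = [value for stamp, value in samples if cutoff < stamp <= now]
    if not recent:
        return None
    return max(recent) if higher_is_worse else min(recent)

now = datetime(2019, 10, 22, 12, 0)
# 288 samples, one every 5 minutes, newest first. Contention sits at 1%...
samples = [(now - timedelta(minutes=5 * i), 1.0) for i in range(288)]
# ...except for one spike to 12% several hours ago.
samples[100] = (samples[100][0], 12.0)

latest = samples[0][1]                   # what a "present data" view shows
worst = worst_in_last_24h(samples, now)  # what the summary table idea shows
print(latest, worst)
```

The latest sample looks healthy, while the 24-hour worst value still surfaces the spike.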

VM dashboards

Utilization is shown separately as it impacts Performance and Capacity differently.

The dashboards answer these questions:

  • Are the VMs performing well? If not, which VMs are affected by what problems (CPU, Disk, RAM, Network)?
  • Is the VM performance caused by IaaS not serving it, or by contention within the Guest OS?
  • Are the VMs running high utilization? If yes, which VMs, how high, and what resource (CPU, RAM, Disk, Network)?
  • Are they really high (as that can cause strain in the shared infrastructure!)?
  • Any VMs need to be right-sized? By how much and for which resource? For disk, we need to look inside each partition, not at VM level.

You can also go back to any point in time and ask the same questions. This is important because by the time you have the chance to look at the problem, 5 minutes have passed, or the problem is no longer happening.

The dashboards sport new counters, giving you deeper insight:

  • Guest OS CPU Run Queue
  • Guest OS CPU Context Switch
  • Guest OS Disk Queue Length
  • VM CPU Overlap
  • VM CPU Co-Stop
  • Guest OS RAM Needed (for Capacity use case)
  • Guest OS RAM Free (for Performance use case)
  • Guest OS Page in and Page out

The counters are agentless, hence you need Tools 10.3.5 or higher and the corresponding vSphere that supports it. You need vSphere 6.5 U3 or 6.7 U2 as it includes the new hostd poller.
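
If you want to check an inventory export against these prerequisites, a small Python sketch like the one below may help. The function and parameter names are hypothetical, and the input format (a dotted Tools version string, a release string, and an update number) is an assumption. Releases not mentioned above are reported as unsupported only because this post does not cover them.

```python
def parse_version(text):
    """Turn a dotted version string such as '10.3.5' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))

# Minimums taken from the paragraph above.
MIN_TOOLS = (10, 3, 5)
MIN_VSPHERE_UPDATE = {"6.5": 3, "6.7": 2}  # release -> minimum update level

def supports_new_counters(tools_version, vsphere_release, vsphere_update):
    """Return True when both the Tools and vSphere prerequisites are met."""
    if parse_version(tools_version) < MIN_TOOLS:
        return False
    required_update = MIN_VSPHERE_UPDATE.get(vsphere_release)
    if required_update is None:
        return False  # release not covered by this post; treated as unknown
    return vsphere_update >= required_update

print(supports_new_counters("10.3.5", "6.7", 2))   # meets both minimums
print(supports_new_counters("10.3.10", "6.5", 2))  # vSphere update too low
print(supports_new_counters("10.2.0", "6.7", 3))   # Tools too old
```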

VMs Performance: Can you guess why there is no utilization counter here?

Cluster Dashboards

You’ll notice the dashboards are similar

  • Just like VM, the same 3 dashboards are provided: performance, capacity, utilization
  • They sport a consistent look

The questions being answered are also similar

  • Are the clusters performing well? If not, which clusters are struggling to deliver what services (CPU, Disk, RAM, Network)? Is it because the cluster is running high utilization?
  • Do we have enough capacity? If not, which clusters are running low on what component? What’s the Time Remaining? VM Remaining?
  • Any VMs need to be right-sized? By how much and for which resource? For disk, we need to look inside each partition, not at VM level.

ESXi Dashboards

The ESXi dashboards complement the cluster dashboards by providing the host level details. Imbalance can happen in large or stretched clusters.

Datastore & Datastore Clusters

There are 2 sets, 1 for datastore and 1 for datastore clusters. For the table, we are showing worst (peak) and not 99th percentile as the data is already the average of all the VMs.
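
To see why peak is the safer choice on data that is already averaged, consider this small Python sketch. The latency numbers are made up, and the nearest-rank percentile is just one common definition, not necessarily the one vRealize Operations uses. One VM out of four has a 10-minute latency spike; averaging across VMs already dilutes it, and taking the 99th percentile on top hides it completely.

```python
import math

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value covering p% of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latency (ms) for 4 VMs on one datastore, 288 five-minute samples each.
vm_latency = [[2.0] * 288 for _ in range(4)]
# One VM suffers a 10-minute spike (2 samples) of 42 ms.
vm_latency[0][150] = 42.0
vm_latency[0][151] = 42.0

# Datastore-level series: average across the VMs at each sample.
datastore_avg = [sum(column) / len(column) for column in zip(*vm_latency)]

peak = max(datastore_avg)
p99 = percentile_nearest_rank(datastore_avg, 99)
print(peak, p99)  # 12.0 2.0
```

The 42 ms spike is already down to 12 ms after averaging. The 99th percentile of 288 samples drops the top two values, so it reports 2 ms as if nothing happened.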

Multi-Tier Application

I’ve updated this battle-tested dashboard. It now sports more metrics and is color coded.

Imports

Download the deck to familiarize yourself.

  • Download the files. No need to unzip.
  • Avoid logging in as Admin. If you do, you won’t be able to see the changes easily.
  • Enable the following metrics: VM CPU Overlap
  • Import the super metrics.
  • Go to your Default Policy, and enable them. Do not enable them on All Objects, as you will end up with super metrics for every single type of object.
  • Import the views. Choose overwrite existing.
  • Import the dashboards. Choose overwrite existing.

Take a coffee break! Let the super metrics be calculated. It should take ~10 minutes before they start appearing 🙂

Sustainability Dashboards

Thanks Varghese, Arman, Bella, and Lusine for contributing this!

The 1st dashboard answers the “What has been saved” question. As it focuses on the past, it shows a line chart.

The 2nd dashboard answers the “What could be saved” question. It emphasizes the urgency to take action, hence a heat map is used.

To import, download from here, then:

  1. Import the supermetrics. Enable them.
  2. Import the views
  3. Import the dashboards

That’s all!

Our calculation is conservative. Your actual savings will be more. We are not including things like:

  • Physical buildings and land. With virtualization, you have a smaller footprint. This means fewer physical racks.
  • Network equipment – Fewer physical servers mean fewer network ports. Because firewalls, load balancers, IDS, and IPS can be VMs, you have less equipment.
  • Other components like UPS, Lighting, Cooling and Labour

Assumptions:

  • Power consumption of a small server (1 socket, 10 cores, 32 GB RAM) = 0.1 kW
  • CO2 emission per kWh = 0.744 kg
  • Cost of Power = $0.106 per kWh (change this in the supermetric).
  • Tree offset for CO2 Emission = 36.4 pounds of carbon per tree.
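
To get a feel for how these assumptions combine, here is a back-of-the-envelope Python sketch. It is not the actual super metric formula. The `servers_avoided` input is hypothetical, the kg-to-pound conversion factor is not from this post, and the tree figure is treated as pounds of CO2 offset per tree, which is an assumption on my part.

```python
# Assumptions from the list above.
KW_PER_SERVER = 0.1      # kW drawn by one small physical server
KG_CO2_PER_KWH = 0.744   # kg of CO2 emitted per kWh
COST_PER_KWH = 0.106     # USD per kWh
LB_PER_TREE = 36.4       # treated here as pounds of CO2 offset per tree (assumption)
LB_PER_KG = 2.20462      # unit conversion, not from the post

def savings(servers_avoided, hours):
    """Estimate energy, cost, CO2 and tree-equivalent savings.

    servers_avoided: hypothetical count of small physical servers no longer needed.
    hours: length of the period being measured.
    """
    energy_kwh = servers_avoided * KW_PER_SERVER * hours
    co2_kg = energy_kwh * KG_CO2_PER_KWH
    return {
        "energy_kwh": energy_kwh,
        "cost_usd": energy_kwh * COST_PER_KWH,
        "co2_kg": co2_kg,
        "trees": co2_kg * LB_PER_KG / LB_PER_TREE,
    }

# Example: 100 small servers avoided for one year.
result = savings(servers_avoided=100, hours=24 * 365)
for name, value in result.items():
    print(f"{name}: {value:,.1f}")
```

With these inputs the estimate comes to 87,600 kWh, roughly $9,286 in power cost, roughly 65,174 kg of CO2, and on the order of 3,900 trees.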

More details at this blog by Varghese.

OYW 7.0 Lite Edition

19 April 2019 update: I received questions from customers on what happens to the Full Edition. If you are using it, then it’s forward compatible with the Lite Edition. I have not updated the Full Edition as a huge portion of customers are simply not operationally ready for Performance SLA.

The 7.0 Lite edition replaces the earlier version. You don’t need the old one anymore.

I’ve updated Operationalize Your World based on lessons learned & requirements from customers. You can find the update here.

It also sports simplified import steps. You can get it done in minutes. Download the files from here.

Use the Admin account. The reason is we’re overwriting the default dashboards and summary pages. Since dashboards are personalized, if you do not use the Admin account, it will create new dashboards instead.

  • Import the super metrics.
  • Go to your Default Policy, and enable them. Do not enable them on All Objects, as you will end up with super metrics for every single type of object.
  • Import the views. Choose overwrite existing.
  • Import the dashboards. Choose overwrite existing.
  • Import the Summary Pages. Do not change your default pages yet.

Take a coffee break! Let the super metrics be calculated. To validate that they are calculated, go to the Troubleshoot a Cluster dashboard.

If the dashboards are filled with data, then it’s time to change your default Summary Page. To do it, follow these steps:

Select Manage Summary Dashboards to bring up the dialog box.
Choose the Object summary page that you want to customize. I’ve selected vSphere Cluster.
Choose the replacement Summary Page. I’ve named them Summary Page for ease of identification.

If you are using vSAN, then there are additional steps

  • If you are only using Hybrid or All Flash, but not both types, then simply enable the correct vSAN KPI super metrics. They are named “Ops. Hybrid vSAN KPI” and “Ops. All Flash vSAN KPI”
  • If you are using both Hybrid and All Flash, then you need to create 1 policy for each. Then enable the correct super metric for each policy.

For your convenience, I’ve disabled the Summary Pages. The reason is they are not used as dashboards.

BTW, if you notice an hourglass icon, just wait a while. I’ve noticed this can take up to 1 hour in a large environment.

If you want to share the dashboards with other users, then do the dashboard sharing:

That’s all you need! In future, I will share how the Multi-Tier Application dashboard is done. This needs Group Types and Groups.