Monthly Archives: February 2017

vRealize Operations Troubleshooting webcast

I just got to know that VMware Education is running a free webinar. It’s only 1 hour.

The date is 28 February 2017. There are 3 different sessions, so there should be one that suits your time zone:

  • 08:00 AM – 9:00 AM PST
  • 12:00 PM – 1:00 PM GMT
  • 12:00 PM – 1:00 PM SGT

The topics covered are:

  • Brief Introduction to vRealize Operations
  • vRealize Operations Manager Components Layers Functionality
  • Configuration Files & tools
  • Top Trending Issues
    • Node status stuck at “Waiting for Analytics”
    • Troubleshooting “Double Master” issue
    • Manually removing a node at the time of troubleshooting
    • Adapter collection issue

Your IM questions will be answered throughout the broadcast, plus we’ll finish up with a 10-minute Q&A session.

You can register here. I have registered myself as I can benefit from this session! Registration was fast and the course is complimentary.

Applications Health dashboard

This post is part of the Operationalize Your World post. Do read it first to get the context.

While all production applications are important, some are definitely more important than others. You monitor these critical applications 24 x 7, and want to be ahead of your customers in detecting their health.

There are 3 parts that make up the health of an application:

  • Is it up?
  • Is it fast?
  • Is it secure?

I’ll focus on Performance in this post. To some extent, if the Guest OS is down, the dashboard below will detect it.

This is the logical design of the dashboard. At the top, the list of critical applications is shown. I’m showing 6 below. The dashboard can handle more, limited by your screen real estate.

An application spans multiple VMs. Some large ones can even have 50 VMs. The health of each app is color coded. It is based on the lowest health among its VMs. You can take the average or weighted average if you want to.

If an app is not green, you can click on it. The dashboard will list all its VMs automatically. It plots a line chart, so you can see the history. You can see how long, how bad and how often the problems happen. This is why I prefer a line chart over a single number. A single number hides too many things, and can give a false impression.

For each VM, here is the logic that determines its health. I’m using 5 counters as they are the most common. The logic here is generic, so it can be applied to all applications. Yes, that means it’s not measuring at the application level. Application-level metrics need adapters from Blue Medora.

The RAM Free is based on Guest OS RAM Free.

If you have End Point Operations, you can also add process or website availability.

You can click on the VM that is having a problem. Its detailed KPIs will be plotted automatically.

This is what the dashboard looks like. Notice how the color coding quickly shows you the health of your critical apps. What you want to see is simply green!


Since we have 2 levels of health (Application and VM), we need to create 2 super metrics.

The VM health is a little tricky. How do we define the performance of a VM, from infrastructure viewpoint?

100 = Fast. It’s performing well as far as Infra is concerned.
  0 = Slow. It’s not performing. Need to look at Infra.


  • If any component of a VM is Red, then the VM should be Red, regardless of whether all the other values are green. This is because the situation is serious enough. The VM is definitely performing slower than usual.
  • If no component is Red, then we can take an average.

To implement the above logic, we need to play with the scale. The formula takes the average, but it’s skewed by Red. The Red value is way lower than the rest. If it were not, taking the average could mask a problem. Say disk latency is really bad, but the other 4 counters are perfect. You get a score of 3 + 3 + 3 + 3 + 1 = 13, which means it’s yellow and not red.

To avoid false alarms, it’s important that Red means Red. It has to be bad enough. The thresholds have to reflect that. You take action when you see yellow, and you do not wait until things turn red.

Your dashboard goal is simple: Green.

Here is the actual formula to implement the logic shown previously in the table.

Since CPU, RAM, Disk, etc. have different units, you need to standardise them. I use the following:

  • Green = 100
  • Yellow = 98
  • Orange = 96
  • Red = 0
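To see why the skewed scale matters, here is a minimal Python sketch of the arithmetic. It is only an illustration and not the super metric itself: the function name `average_score` and the linear 3/2/1 comparison scale are assumptions made for this example, while the 100/98/96/0 values come from the list above.

```python
# Scores per color on the skewed scale from the list above.
SKEWED = {"green": 100, "yellow": 98, "orange": 96, "red": 0}
# A plain linear scale, for comparison only (assumed: 3 = green, 2 = yellow, 1 = red).
LINEAR = {"green": 3, "yellow": 2, "red": 1}


def average_score(colors, scale):
    """Average the per-counter scores of one VM on the given scale."""
    return sum(scale[c] for c in colors) / len(colors)


# Four healthy counters and one red counter (say, disk latency).
counters = ["green", "green", "green", "green", "red"]

linear_avg = average_score(counters, LINEAR)  # (3 + 3 + 3 + 3 + 1) / 5
skewed_avg = average_score(counters, SKEWED)  # (100 * 4 + 0) / 5

# Worst possible average when no counter is red: all five are orange.
no_red_floor = average_score(["orange"] * 5, SKEWED)

print(linear_avg)    # 2.6
print(skewed_avg)    # 80.0
print(no_red_floor)  # 96.0
```

With 5 counters, a single Red caps the average at 80, while a VM with no Red can never drop below 96. On this scale, any value below 96 therefore means at least one counter is Red, so the average cannot mask it.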

If the VM is powered off, then the metric will return nothing. The formula below catches that and marks it as 0.

“Hang on!”, you might say. “Not all my VMs have Guest OS RAM. What happens to the super metric when one of the components in the formula has no data?”

You are right, the whole super metric becomes no data. You don’t want that. A workaround is to assume the value is Green. I am aware that the correct solution is to ignore it, and not treat it as green. I’m afraid that’s not possible in vR Ops 6.5.

The first step in the workaround is to return the value -1 when there is no data. If there is data, take the actual data. Since -1 is higher than no data, it will return -1 when there is no data.

Always test your super metric. As you can see above, I picked a VM that had no data and then had data.

Once I had that, it was a matter of inserting the formula to detect -1 in the Guest OS RAM.

I added the blue line. It sets the value to 100 when it’s detecting -1.
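The two-step workaround can be sketched in Python as follows. This is not super metric syntax, only the same logic in a general-purpose language. The function name `guest_ram_score` is hypothetical, and `None` is an assumed stand-in for "no data".

```python
def guest_ram_score(raw_score):
    """Return the Guest OS RAM score, treating a missing value as green.

    raw_score is None when the VM reports no Guest OS RAM data
    (an assumed stand-in for "no data" in a super metric).
    """
    # Step 1: substitute -1 when there is no data, otherwise keep the data.
    value = -1 if raw_score is None else raw_score
    # Step 2: turn the -1 marker into 100 (green).
    return 100 if value == -1 else value


print(guest_ram_score(None))  # 100
print(guest_ram_score(98))    # 98
print(guest_ram_score(0))     # 0
```

The marker works because -1 can never be a real score on the 0 to 100 scale, so a genuine Red (0) is not mistaken for missing data.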

The actual formula looks like this:

The application health is simpler. It’s simply the worst of the VM Health.
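As a rough illustration of "worst of the VM Health", here is a short Python sketch. The function name `application_health` and the VM names are made up for the example. Returning `None` for an empty group is my own assumption, not something the super metric is known to do.

```python
def application_health(vm_health_by_name):
    """Return the application health as the lowest health among its VMs."""
    if not vm_health_by_name:
        return None  # assumption: an empty group has no health value
    return min(vm_health_by_name.values())


app = {"web-01": 100, "app-01": 98.8, "db-01": 80}
print(application_health(app))  # 80
```

Taking the minimum means a single unhealthy VM is enough to change the color of the whole application, which is what you want for critical apps.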

How does vR Ops know all these critical applications?

It does not. You need to create a group for each. I created 3 in the example below.

You can create a static group or dynamic group. I’m showing a dynamic group, leveraging a vCenter folder. You can also use vCenter Annotation or vSphere Tags.

Back to the dashboard. Here is the config for each widget. For the top widget, I simply select the Object Type. In this way, as I add a new app or remove an old app, I do not need to update the dashboard.

For the VM Health chart, I specify the custom range. This has to match the formula, hence I specified 1, 2 and 3 for the range.

For the KPI line chart, I use a custom XML Interaction.

Hope you find it useful. This dashboard is for single-tier applications. For multi-tier applications, refer to this.

You do not have to create this dashboard manually. Download and import it, together with 50+ dashboards from here.

Datastore Capacity Management

This post is part of the Operationalize Your World post. Do read it first to get the context.

This is the 2nd installment of Storage Capacity Management. The previous post covers the overall storage capacity management, where you can see the big picture and know which datastores are low in capacity. This post drills further and lets you analyze a specific datastore.

Datastore capacity is driven by 2 factors:

  • Performance: If the datastore is unable to serve its existing VMs, are you going to add more VMs? You are right, the datastore is full, regardless of how much space it has left.
  • Utilization: How much capacity is left? Thin provisioning makes this challenging.

This is what the dashboard looks like.

You start by selecting a datastore you want to check. This step is actually optional, as you would have come from the overall dashboard.

When you select a datastore, its Performance and Utilization are automatically shown.

  • Performance
    • Both actual and SLA are shown.
    • You just need to ensure that actual does not breach SLA.
  • Utilization
    • This shows the total capacity, the provisioned capacity (configured to the VM), and what’s actually used (thin provisioned).
    • You want to be careful with thin provisioning, as the VM can consume the space, since it is already allocated to it. The line chart has a 30-day projection to help you plan.

The 2 line charts are all you need. They are simple enough, yet detailed enough. They give you room to make the judgement call. You can decide to ignore a spike because you knew it was a special event.

If you want to analyse, you can see the individual VMs. The heatmap shows the distribution of VMs. You can spot the large VMs, because their boxes are bigger. You can see if any VM is running out of capacity, or if any VM is wasting its allocated capacity.

The heatmap configuration below shows how it’s done.

You can also check if there are VMs that you can delete. Reclamation will give you extra space. The heatmap has a filter for powered off VMs, so only powered off VMs are shown.

From there, you can drill further to check that the VM has indeed met your Powered Off definition. It’s showing the VM powered off time (%) in the past 30 days. I’ve set the threshold to be 99%. Green means the VM is at least 99% powered off in the past 30 days.
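The 99% check can be illustrated with a small Python sketch. The hourly sampling, the function names and the `True`/`False` encoding of the power state are assumptions for this example. Only the 30-day window and the 99% threshold come from the text.

```python
def powered_off_percent(samples):
    """Percentage of samples in which the VM was powered off.

    samples is a list of booleans, True meaning powered on.
    """
    off = sum(1 for powered_on in samples if not powered_on)
    return 100 * off / len(samples)


def meets_powered_off_definition(samples, threshold=99):
    """True when the VM was powered off at least `threshold` % of the time."""
    return powered_off_percent(samples) >= threshold


# 30 days of hourly samples (720 in total).
mostly_off = [True] * 5 + [False] * 715     # on for 5 hours
sometimes_on = [True] * 10 + [False] * 710  # on for 10 hours

print(meets_powered_off_definition(mostly_off))    # True
print(meets_powered_off_definition(sometimes_on))  # False
```

A VM that was briefly powered on for patching can still pass, while one that runs for 10 hours a month does not.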


I hope you agree by now that datastore performance is measured by how well it serves its VMs. We can track this by plotting a line chart showing the maximum storage latency experienced by any VM in the datastore. This maximum number has to be lower than the SLA you promise at all times.
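Here is a minimal Python sketch of that idea: take the worst latency across all VMs at each point in time, then compare it with the SLA. The function names, the VM names and the 30 ms SLA are hypothetical values chosen for illustration.

```python
def max_vm_latency(samples):
    """For each time sample, return the highest latency seen by any VM."""
    return [max(per_vm.values()) for per_vm in samples]


def sla_breaches(samples, sla_ms):
    """Return the indexes of the samples where the worst VM latency is not below the SLA."""
    return [i for i, worst in enumerate(max_vm_latency(samples)) if worst >= sla_ms]


# Latency in milliseconds per VM, one dict per collection cycle.
samples = [
    {"vm-a": 3, "vm-b": 5},
    {"vm-a": 4, "vm-b": 31},
    {"vm-a": 2, "vm-b": 6},
]

print(max_vm_latency(samples))    # [5, 31, 6]
print(sla_breaches(samples, 30))  # [1]
```

Using the maximum rather than the average means one suffering VM is enough to show up on the chart, even when every other VM on the datastore is fine.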

For Utilization, we will plot a line chart showing the disk capacity left in the datastore cluster.

You should be using Datastore Cluster. Other than the benefits that you get from using it, it also makes capacity management easier.

  • You need not manually exclude local datastore.
  • You need not manually group the shared datastores, which can be complex if you have multiple clusters.

With vSAN, you only have 1 datastore per cluster and need not exclude local datastores manually. This means it’s even simpler in vSAN.

Include a buffer for snapshots. This can be 20%, depending on your environment. This is why I’m not a fan of many small datastores, as you end up with pockets of unusable capacity. The buffer does not have to be hardcoded in your super metric, but you have to be mentally aware of it.
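As a quick worked example of the buffer, the sketch below subtracts a snapshot reserve from the free space. The function name, the sizes and the choice to compute the reserve from total capacity are assumptions. Adjust the percentage to your own environment.

```python
def space_left_after_buffer(total_gb, free_gb, snapshot_buffer_pct=20):
    """Free space that remains once the snapshot buffer is set aside."""
    reserved_gb = total_gb * snapshot_buffer_pct / 100
    return free_gb - reserved_gb


# A 10 TB datastore cluster with 3 TB free and a 20% snapshot buffer.
print(space_left_after_buffer(10000, 3000))  # 1000.0
```

A negative result means you are already eating into the space you wanted to keep for snapshots.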

Super Metrics

The screenshot below shows the super metric formula to get the maximum latency of all the VMs in the cluster. I’ve chosen the Virtual Disk level, so it does not matter whether the storage is VMFS, NFS or vSAN.


You can copy paste the formula below:

Max ( ${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=virtualDisk|totalLatency, depth=2 } )

The screenshot below shows the super metric formula to get the total disk capacity left in the cluster. This is based on Thin Provisioning consumption.

You can copy paste the formula below:

sum( ${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|available_space, depth=1} )

For Thick Provision, use the following super metric:


You can copy paste the formula below:

sum( ${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|total_capacity, depth=1} ) - sum( ${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|consumer_provisioned, depth=1} )
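To make the difference between the thin and thick calculations concrete, here is a Python sketch of the same arithmetic over the datastores in one datastore cluster. The dictionary keys mirror the attribute names in the formulas. The function names and the sample numbers are made up for this example.

```python
def space_left_thin(datastores):
    """Thin view: add up the space that is still physically available."""
    return sum(ds["available_space"] for ds in datastores)


def space_left_thick(datastores):
    """Thick view: total capacity minus everything provisioned to the VMs."""
    total = sum(ds["total_capacity"] for ds in datastores)
    provisioned = sum(ds["consumer_provisioned"] for ds in datastores)
    return total - provisioned


# Two datastores in one datastore cluster, sizes in GB.
cluster = [
    {"total_capacity": 1000, "available_space": 600, "consumer_provisioned": 900},
    {"total_capacity": 2000, "available_space": 1500, "consumer_provisioned": 1200},
]

print(space_left_thin(cluster))   # 2100
print(space_left_thick(cluster))  # 900
```

The thick number is lower because it counts space that has been promised to VMs but not yet written. It turns negative when the cluster is over-provisioned.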

Hope you find it useful. Just in case you’re not aware, you don’t have to implement all these manually. You can import this dashboard, together with 50+ others, from this set.