Tag Archives: Operationalize Your World

Super Metrics bulk export import

If you have a lot of super metrics, backing them up can be a challenge. You cannot do bulk export to back up. It’s easy to have version control issue if you manually export each.

Replicating in another instance (e.g. your test/dev) is tedious as you need to import one by one.

A workaround is to use Policy as vehicle to bulk export/import.

  • For backup purpose, export is all you need.
  • For restore into the same environment (where you exported it earlier), you can use the same XML file. You don’t need to customise it.
  • For replicating in another environment, you likely need to modify the XML file. This is because the policy file contains other settings, such as alert. It’s safer if your exported policy does not contain all other settings.

I’ll show you how to trim the policy file, so it has only super metrics. In this way, it’s safe to import into any environment, as it won’t modify anything. The XML file only contains super metrics, that’s all.

The policy file is just a long XML file. In the example below, it has >5000 lines! Notice it has Alerts and Custom Profile.

To delete an entire section, simply use the keyboard to highlight them. See below, where I selected Custom Profile.

I’ve also deleted the alert section. The file is shorter now 🙂

We still have some irrelevant lines (line 4 – 1187 in my case). Delete them too. You’ll end up with something like this.

I’ll expand the content so you know what exactly the supermetric section contains.

BTW, do not copy/paste your super metric from vR Ops UI into the XML file. The expression is not 100% identical.

Once done, you can import it safely into another vR Ops instance. The import is much faster too! Here is what it looks like:

And if you go to Super Metrics, they are all there 🙂

To enable them, go your default policy (the one marked with a tiny D on the column), and edit it. Go to section 6, and find your super metrics. In Operationalize Your World case, they are all prefixed with “Ops.

Do not enable for all objects. It will slow down your system. It also makes it more complex unnecessarily as the formula don’t apply to them.

Multi-tier Application Monitoring

This post is part of Operationalize Your World post. Do read it first to get the context.

I covered a single-tier application in this post. Read that first, as this blog builds upon that.

Done?

Great! A lot of you have shared that you want a multi-tier applications. One customer has a mission critical application that spans 5 tiers and 68 VMs. The dashboard I shared earlier does not scale to that level, as you certainly don’t want to check 68 VM one by one!

Application

To make sure we are on the same page, here is an example of multi-tier application:

A multi-tier application can suffer from either horizontal or vertical problem:

  • By horizontal, I mean a tier has problem. When the web tier is slow, it can slow down the entire application. The speed of a convoy is determined by the slowest car. We will use the following formula to determine the application performance:
    Application Performance = Minimum (Tier Performance)
  • By vertical, I mean something that cut across tier. Storage, for example. If the slowness is caused by something common, there is no need to troubleshoot individual VM, as they are simply victim.

That means we need to check both angles when an application had performance problem:

  • Which tier had the problem? Since when? How bad? What was the problem?
  • What infra problem did the app had? Storage, Network, CPU, RAM?

The above check makes a good starting point in your analysis. Don’t zoom into a particular VM until you know the overall picture. No point fire fighting the kitchen if the whole house is on fire.

Application Tier

The health of a tier is the average health of its member. This is because a tier scales out. We are not taking the minimum value. This is not a convoy.

Hold on!” you might say. Since it is scale out, App Team has catered for this. If they only need 3 web servers, they will deploy 4 or even 5. So both performance and availability are not affected. The tier performance has to take into account this extra node, not simply doing an average.

This logic sounds reasonable. But is it correct?

It is not actually. Because this is not about Availability. This is about Performance. All web servers are still up, but if node no 4 is slower, user experience will be affected.

My fellow blogger Luciano Gomes advise that some Load Balancer can detect the performance of a node. This is good, as simply counting the number of session is not a complete measurement. The node measurement is based on this formula. It takes into account how the IaaS platform is serving the Node. So it’s looking beyond the Guest OS. This is because performance cannot be measured from within the Guest OS only. Review this discussion between David Davis and Sunny Dua.

This is why we are doing an average. You want to be informed if there is degradation, since this is performance, not availability.

The health of a VM is simple. We leverage the work we’ve done for a single-tier application here.

Dashboard 

I found that doing a logical design of dashboard actually saves time. Follow best practices to help you.

Here is the logical design of the dashboard. Notice it has 3 levels: App, Tier, VM.

Here is what it looks like

Click on the image below to see explanation of each section:

Hope you find it useful. As usual, you do not have to build this from scratch. This is part of Operationalize Your World, which give you 50+ dashboards.

Applications Health dashboard

This post is part of Operationalize Your World post. Do read it first to get the context.

While all production applications are important, some are definitely more important than others. You monitor these critical applications 24 x 7, and want to be ahead of your customers in detecting their health.

There are 3 parts that make up Health of an application:

  • Is it up?
  • Is it fast?
  • Is it secured?

I’d focus on Performance in this blog. To some extent, if the Guest OS is down, the dashboard below will detect it.

This is the logical design of the dashboard. At the top, the list of critical applications are shown. I’m showing 6 below. The dashboard can handle more, limited by your screen real estate.

An application spans multiple VMs. Some large ones can even have 50 VMs. The health of the apps is color coded. It is based on the lowest health of its VM. You can take the average or weighted average if you want to.

If an app is not green, you can click on it. The dashboard will list all its VMs automatically. It’s plotting a line chart, so you can see the history. You can see how long, how bad and how often the problems happen. This is why I prefer line chart over a single number. A single number hides too many things, and can result in false impression.

For each VM, here is the logic that determines its health. I’m using 5 counters as they are the most common. The logic here is generic, so it can be applied to all applications. Yes, that means it’s not measuring application level. Application level metric needs adapters from Blue Medora.

The RAM Free is based on Guest OS RAM Free.

If you have End Point Operations, you can also add process or website availability.

You can click on the VM that is having problem. Its detailed KPI will be automatically plotted.

This is what the dashboard looks like. Notice the color coded shows you quickly the health of your critical apps. What you want to see is simply green!

Implementation

Since we have 2 levels of health (Application and VM), we need to create 2 super metrics.

The VM health is a little tricky. How do we define the performance of a VM, from infrastructure viewpoint?

100 = Fast. It’s performing well as far as Infra is concerned.
  0 = Slow. It’s not performing. Need to look at Infra.

Approach

  • If any component of a VM is Red, then it should be red, regardless if all other value is green. This is because the situation is serious enough. The VM is definitely performing slower than usual.
  • If no component is Red, then we can take an average.

To implement the above logic, we need to play with the scale. The formula takes average, but it’s skewed by Red. The Red value is way lower than the rest. If not, taking average can mask out a problem. Say disk latency is really bad. But the other 4 counters are perfect. You get a score of 3 + 3 + 3 + 3 + 1 = 13, which means it’s yellow and not red.

To avoid false alarm, it’s important that Red means Red. It has to be bad enough. Threshold has to reflect that. You take action when you see yellow, and you do not wait until things turn to red.

Your dashboard goal is simple: Green.

Here is the actual formula to implement the logic shown previously on the table.

Since CPU, RAM, Disk, etc. have different units, you need to standardise them. I use the following:

  • Green = 100
  • Yellow = 98
  • Orange = 96
  • Red = 0

If the VM is powered off, then the metric will return nothing. The formula below catches that and mark it as 0.

Hang on!“, you might say. Not all my VMs have Guest OS RAM. What happen to the super metric when one of the component in the formula has no data?!

You are right, the whole super metric becomes no data. You don’t want that. A workaround is to assume the value is Green. I am aware that the correct solution is to ignore it, and not treat it as green. I’m afraid that’s not possible in vR Ops 6.5.

The first step in the workaround is to return the value -1 when there is no data. If there is data, take the actual data. Since -1 is higher than no data, it will return -1 when there is no data.

Always test your super metric. You could see from the above, I picked a VM that had no data and then had data.

Once I had that, it’s a matter of inserting the formula to detect -1 in the Guest OS RAM

I added the blue line. It sets the value to 100 when it’s detecting -1.

The actual formula looks like this:

The application health is simpler. It’s simply the worst of the VM Health.

How does vR Ops know all these critical applications?

It does not. You need to create a group for each. I created 3 in the example below.

You can create a static group or dynamic group. I’m showing a dynamic group, leveraging a vCenter folder. You can also use vCenter Annotation or vSphere Tags.

Back to the dashboard. Here is the config for each widget. For the top widget, I simply select the Object Type. In this way, as I add a new app or remove an old app, I do not need to update the dashboard.

For the VM Health chart, I specify the custom range. This has to match the formula, hence I specified 1, 2 and 3 for the range.

For the KPI line chart, I use a custom XML Interaction.

Hope you find it useful. This dashboard is for single-tier application. For multi-tier application, refer to this post.

You do not have to create this dashboard manually. Download and import it, together with 50+ dashboards from here.