This post is part of the Operationalize Your World series. Do read that first to get the context.
While all production applications are important, some are definitely more important than others. You monitor these critical applications 24 x 7, and want to be ahead of your customers in detecting their health.
There are 3 parts that make up Health of an application:
- Is it up?
- Is it fast?
- Is it secure?
I focus on Performance in this post. To some extent, the dashboard below will also detect it if the Guest OS is down.
This is the logical design of the dashboard. At the top, the list of critical applications is shown. I’m showing 6 below. The dashboard can handle more, limited only by your screen real estate.
An application spans multiple VMs. Some large ones can even have 50 VMs. The health of each app is color coded. It is based on the lowest health among its VMs. You can take the average or a weighted average instead if you want to.
If an app is not green, you can click on it. The dashboard will list all its VMs automatically. It plots a line chart, so you can see the history. You can see how long, how bad and how often the problems happen. This is why I prefer a line chart over a single number. A single number hides too many things, and can give a false impression.
For each VM, here is the logic that determines its health. I’m using 5 counters as they are the most common. The logic here is generic, so it can be applied to all applications. Yes, that means it’s not measuring at the application level. Application-level metrics need adapters from Blue Medora.
The RAM Free is based on Guest OS RAM Free.
If you have End Point Operations, you can also add process or website availability.
You can click on the VM that is having a problem. Its detailed KPIs will be plotted automatically.
This is what the dashboard looks like. Notice how the color coding quickly shows you the health of your critical apps. What you want to see is simply green!
Since we have 2 levels of health (Application and VM), we need to create 2 super metrics.
The VM health is a little tricky. How do we define the performance of a VM from the infrastructure viewpoint?
100 = Fast. It’s performing well as far as Infra is concerned.
0 = Slow. It’s not performing. Need to look at Infra.
- If any component of a VM is Red, then the VM should be Red, even if all the other values are Green. This is because the situation is serious enough. The VM is definitely performing slower than usual.
- If no component is Red, then we can take an average.
To implement the above logic, we need to play with the scale. The formula takes an average, but it is skewed by Red: the Red value is far lower than the rest. Without that skew, taking an average can mask a problem. Say disk latency is really bad, but the other 4 counters are perfect. You get a score of 3 + 3 + 3 + 3 + 1 = 13, which means it’s yellow and not red.
To avoid false alarms, it’s important that Red means Red. It has to be bad enough, and the threshold has to reflect that. You take action when you see yellow; you do not wait until things turn red.
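The masking effect is easy to reproduce. This is a minimal Python sketch, not the super metric itself. It assumes a naive scale of Green = 3 and Red = 1, as in the sum above, plus Yellow = 2, which is my assumption for illustration.

```python
# Naive scale. Green = 3 and Red = 1 come from the example above;
# Yellow = 2 is an assumption added for illustration.
NAIVE_SCALE = {"green": 3, "yellow": 2, "red": 1}


def naive_score(colors):
    """Return (sum, average) of the naive scores for a list of counter colors."""
    scores = [NAIVE_SCALE[color] for color in colors]
    return sum(scores), sum(scores) / len(scores)


# Four perfect counters and one really bad disk latency.
total, average = naive_score(["green", "green", "green", "green", "red"])
print(total, average)
```

The one Red counter only pulls the average from 3.0 down to 2.6, so the VM still looks mostly healthy.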
Your dashboard goal is simple: Green.
Here is the actual formula that implements the logic shown earlier in the table.
Since CPU, RAM, Disk, etc. have different units, you need to standardise them. I use the following:
- Green = 100
- Yellow = 98
- Orange = 96
- Red = 0
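Here is a minimal Python sketch of why this scale works with a plain average. It is an illustration, not the vR Ops super metric syntax, and the function name `vm_health` is hypothetical. It assumes each of the 5 counters has already been mapped to one of the four colors.

```python
# Standardised scores from the list above.
SCALE = {"green": 100, "yellow": 98, "orange": 96, "red": 0}


def vm_health(colors):
    """Average the standardised scores of a VM's counters."""
    scores = [SCALE[color] for color in colors]
    return sum(scores) / len(scores)


# Worst case without any Red: all 5 counters are Orange.
all_orange = vm_health(["orange"] * 5)

# Best case with a Red: 4 counters are Green, 1 is Red.
one_red = vm_health(["green"] * 4 + ["red"])

print(all_orange, one_red)
```

With 5 counters, a VM with no Red never averages below 96, while a VM with even one Red never averages above 80. Any Red threshold between those two numbers makes a single Red counter turn the whole VM Red, and the average still works for the other cases.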
If the VM is powered off, then the metric will return nothing. The formula below catches that and marks it as 0.
“Hang on!”, you might say. Not all my VMs have Guest OS RAM. What happens to the super metric when one of the components in the formula has no data?!
You are right, the whole super metric becomes no data. You don’t want that. A workaround is to assume the value is Green. I am aware that the correct solution is to ignore the missing component rather than treat it as green. I’m afraid that’s not possible in vR Ops 6.5.
The first step in the workaround is to return the value -1 when there is no data. If there is data, take the actual data. Since -1 is higher than no data, it will return -1 when there is no data.
Always test your super metric. As you can see from the above, I picked a VM that had no data at first and then had data.
Once I had that, it was a matter of inserting the formula to detect -1 in the Guest OS RAM.
I added the blue line. It sets the value to 100 when it detects -1.
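The two steps of the workaround can be sketched in Python as follows. This is not the super metric syntax. The names `value_or_sentinel`, `guest_ram_score` and `example_score` are hypothetical, and the 512 MB threshold is made up for the demo.

```python
from typing import Callable, Optional

NO_DATA = -1  # sentinel returned when the metric has no data
GREEN = 100   # Green on the standardised scale


def value_or_sentinel(value: Optional[float]) -> float:
    """Step 1: return the actual value if there is data, otherwise -1."""
    return NO_DATA if value is None else value


def guest_ram_score(value: Optional[float],
                    score_fn: Callable[[float], int]) -> int:
    """Step 2: treat the -1 sentinel as Green, otherwise score the value."""
    checked = value_or_sentinel(value)
    if checked == NO_DATA:
        return GREEN
    return score_fn(checked)


def example_score(ram_free_mb: float) -> int:
    """Hypothetical scoring: the 512 MB threshold is an assumption."""
    return 100 if ram_free_mb >= 512 else 0


print(guest_ram_score(None, example_score))   # no Guest OS data
print(guest_ram_score(2048, example_score))   # plenty of free RAM
print(guest_ram_score(100, example_score))    # very little free RAM
```

The sentinel is safe here because free RAM can never be negative, so -1 cannot collide with a real reading.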
The actual formula looks like this:
The application health is simpler. It’s simply the worst VM Health among the application’s VMs.
How does vR Ops know about all these critical applications?
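In Python terms, the roll-up is just a minimum over the member VMs. This is a sketch of the logic only, and `application_health` is a hypothetical name.

```python
def application_health(vm_health_scores):
    """Application health is the worst (lowest) health among its VMs."""
    return min(vm_health_scores)


# Three VMs: one Green, one slightly degraded, one Red.
print(application_health([100, 98.4, 0]))
```

Taking the minimum means one unhealthy VM is enough to change the color of the application, whether it has 5 VMs or 50. Note that `min` raises a `ValueError` on an empty list, so an application with no VMs needs its own handling.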
It does not. You need to create a group for each. I created 3 in the example below.
You can create a static group or dynamic group. I’m showing a dynamic group, leveraging a vCenter folder. You can also use vCenter Annotation or vSphere Tags.
Back to the dashboard. Here is the config for each widget. For the top widget, I simply select the Object Type. In this way, as I add a new app or remove an old app, I do not need to update the dashboard.
For the VM Health chart, I specify the custom range. This has to match the formula, hence I specified 1, 2 and 3 for the range.
For the KPI line chart, I use a custom XML Interaction.
Hope you find it useful. This dashboard is for single-tier applications. For multi-tier applications, refer to this.
You do not have to create this dashboard manually. Download and import it, together with 50+ dashboards from here.