This post is part of Operationalize Your World. Do read that first to get the context.
I covered a single-tier application in an earlier post. Read that first, as this post builds upon it.
Great! A lot of you have shared that you want the same for multi-tier applications. One customer has a mission-critical application that spans 5 tiers and 68 VMs. The dashboard I shared earlier does not scale to that level, as you certainly don’t want to check 68 VMs one by one!
To make sure we are on the same page, here is an example of a multi-tier application:
A multi-tier application can suffer from either a horizontal or a vertical problem:
- By horizontal, I mean a tier has a problem. When the web tier is slow, it can slow down the entire application. The speed of a convoy is determined by the slowest car. We will use the following formula to determine the application performance:
Application Performance = Minimum (Tier Performance)
- By vertical, I mean something that cuts across tiers. Storage, for example. If the slowness is caused by something common, there is no need to troubleshoot individual VMs, as they are simply victims.
That means we need to check both angles when an application has a performance problem:
- Which tier had the problem? Since when? How bad? What was the problem?
- What infra problem did the app have? Storage, Network, CPU, RAM?
The above checks make a good starting point for your analysis. Don’t zoom into a particular VM until you know the overall picture. There is no point fighting a fire in the kitchen if the whole house is on fire.
The health of a tier is the average health of its members. This is because a tier scales out. We are not taking the minimum value. This is not a convoy.
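To make the two roll-up rules concrete, here is a minimal Python sketch. It assumes each VM already has a health score from 0 to 100. The scale, the tier names, the sample scores, and the function names are all my own illustration, not something the dashboard exposes. A tier takes the average of its members, and the application takes the minimum of its tiers.

```python
from statistics import mean


def tier_health(vm_scores):
    """Average health of the VMs in one tier (a tier scales out)."""
    return mean(vm_scores)


def application_performance(tiers):
    """Minimum tier health: the slowest tier sets the pace, like a convoy."""
    return min(tier_health(scores) for scores in tiers.values())


# Hypothetical health scores per VM, grouped by tier.
tiers = {
    "web": [100, 100, 100, 40, 100],
    "app": [90, 90, 90],
    "db": [95, 95],
}

tier_scores = {name: tier_health(scores) for name, scores in tiers.items()}
app_score = application_performance(tiers)
```

With these sample numbers the web tier averages 88, which is lower than the app tier (90) and the database tier (95), so the web tier sets the application score.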
“Hold on!” you might say. Since the tier scales out, the App Team has catered for this. If they only need 3 web servers, they will deploy 4 or even 5, so neither performance nor availability is affected. The tier performance has to take this extra node into account, not simply take an average.
This logic sounds reasonable. But is it correct?
Actually, it is not, because this is about Performance, not Availability. All web servers are still up, but if node no. 4 is slower, user experience will be affected.
My fellow blogger Luciano Gomes advises that some load balancers can detect the performance of a node. This is good, as simply counting the number of sessions is not a complete measurement. The node measurement is based on this formula. It takes into account how the IaaS platform is serving the node, so it looks beyond the Guest OS. This is because performance cannot be measured from within the Guest OS only. Review this discussion between David Davis and Sunny Dua.
This is why we are doing an average. You want to be informed if there is degradation, since this is performance, not availability.
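The difference between the availability view and the performance view can be shown with a small sketch. Everything here is hypothetical: a web tier that needs 3 nodes, runs 5, and has one degraded node, plus the function names and the 0 to 100 score scale.

```python
def tier_is_available(nodes_up, nodes_required):
    """Availability view: enough nodes are up to carry the load."""
    return nodes_up >= nodes_required


def tier_performance(node_scores):
    """Performance view: average health across every node in the tier."""
    return sum(node_scores) / len(node_scores)


# Hypothetical scores: node 4 is slow, but it is still up and serving users.
node_scores = [100, 100, 100, 50, 100]

available = tier_is_available(nodes_up=len(node_scores), nodes_required=3)
performance = tier_performance(node_scores)
```

The availability check still passes, since 5 nodes are up and only 3 are required, while the average drops to 90. That drop is the degradation you want to be informed about.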
The health of a VM is simple. We leverage the work we’ve done for a single-tier application here.
I found that doing a logical design of the dashboard first actually saves time. Follow best practices to help you.
Here is the logical design of the dashboard. Notice it has 3 levels: App, Tier, VM.
Here is what it looks like:
Click on the image below to see an explanation of each section:
Hope you find it useful. As usual, you do not have to build this from scratch. This is part of Operationalize Your World, which gives you 50+ dashboards.