Multi-tier Application Monitoring

This post is part of Operationalize Your World. Do read that first to get the context.

I covered a single-tier application in this post. Read that first, as this blog builds upon that.

Done?

Great! A lot of you have shared that you want a dashboard for multi-tier applications. One customer has a mission-critical application that spans 5 tiers and 68 VMs. The dashboard I shared earlier does not scale to that level, as you certainly don’t want to check 68 VMs one by one!

Application

To make sure we are on the same page, here is an example of a multi-tier application:

A multi-tier application can suffer from either a horizontal or a vertical problem:

  • By horizontal, I mean a tier has a problem. When the web tier is slow, it can slow down the entire application. The speed of a convoy is determined by the slowest car. We will use the following formula to determine the application performance:
    Application Performance = Minimum (Tier Performance)
  • By vertical, I mean something that cuts across tiers. Storage, for example. If the slowness is caused by something common, there is no need to troubleshoot individual VMs, as they are simply victims.
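The convoy rule can be sketched in a few lines of Python. The tier names and scores are made up for illustration, and the 0 to 100 scale is an assumption rather than anything vR Ops mandates.

```python
# Hypothetical tier scores on an assumed 0-100 scale (higher is better).
tier_performance = {
    "web": 92.0,
    "app": 61.0,
    "database": 88.0,
}


def application_performance(tiers):
    """Return (score, slowest tier): the application is as fast as its slowest tier."""
    slowest = min(tiers, key=tiers.get)
    return tiers[slowest], slowest


score, slowest_tier = application_performance(tier_performance)
print(f"Application performance: {score} (limited by the {slowest_tier} tier)")
```

Returning the name of the slowest tier along with the score answers the first question you would ask next: which tier had the problem?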

That means we need to check both angles when an application has a performance problem:

  • Which tier had the problem? Since when? How bad? What was the problem?
  • What infra problem did the app have? Storage, Network, CPU, RAM?

The above check makes a good starting point in your analysis. Don’t zoom into a particular VM until you know the overall picture. No point fire fighting the kitchen if the whole house is on fire.

Application Tier

The health of a tier is the average health of its members. This is because a tier scales out. We are not taking the minimum value. This is not a convoy.

“Hold on!” you might say. “Since it is scale-out, the App Team has catered for this. If they only need 3 web servers, they will deploy 4 or even 5. So both performance and availability are not affected. The tier performance has to take this extra node into account, not simply do an average.”

This logic sounds reasonable. But is it correct?

It is not, actually, because this is not about Availability. This is about Performance. All web servers are still up, but if node number 4 is slower, user experience will be affected.

My fellow blogger Luciano Gomes advises that some load balancers can detect the performance of a node. This is good, as simply counting the number of sessions is not a complete measurement. The node measurement is based on this formula. It takes into account how the IaaS platform is serving the node. So it’s looking beyond the Guest OS. This is because performance cannot be measured from within the Guest OS only. Review this discussion between David Davis and Sunny Dua.

This is why we are doing an average. You want to be informed if there is degradation, since this is performance, not availability.
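Here is a small sketch of why the average is the right signal. The VM names and scores are hypothetical, and treating any score above zero as "up" is my simplification for the availability view.

```python
from statistics import mean

# Hypothetical per-VM performance scores for a 4-node web tier (assumed 0-100 scale).
web_tier = {"web-01": 95.0, "web-02": 96.0, "web-03": 94.0, "web-04": 55.0}


def tier_health(vm_scores):
    """Tier health is the average health of its members."""
    return mean(vm_scores.values())


def nodes_up(vm_scores):
    """Availability view: a node counts as up if it reports a score above zero."""
    return sum(1 for score in vm_scores.values() if score > 0)


health = tier_health(web_tier)
print(f"Nodes up: {nodes_up(web_tier)} of {len(web_tier)}")
print(f"Tier health: {health}")
```

All 4 nodes are up, so an availability check sees nothing wrong, yet the slow fourth node pulls the tier average down. That drop is the degradation you want to be told about.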

The health of a VM is simple. We leverage the work we’ve done for a single-tier application here.

Dashboard 

I found that doing a logical design of the dashboard actually saves time. Follow best practices to help you.

Here is the logical design of the dashboard. Notice it has 3 levels: App, Tier, VM.

Here is what it looks like:

Click on the image below to see an explanation of each section:

Hope you find it useful. As usual, you do not have to build this from scratch. This is part of Operationalize Your World, which gives you 50+ dashboards.

Monitoring NSX Edge, SSL VPN, Firewall and Logical Switch

This blog is contributed by my friend Luciano Gomes, a VMware PSO Senior Consultant in Rio de Janeiro Area, Brazil. Thank you, Lucky!

In this post, I would like to show you how you can monitor NSX Edge, SSL VPN, Firewall and Logical Switch using only one dashboard.

First, let’s get the prerequisites out of the way:

  1. vRealize Operations (Advanced/Enterprise License)
  2. vCenter + NSX
  3. vR Ops Management Pack for NSX

My friend Romain Decker has covered the installation of the Management Pack. Read it here first.

Another friend (life is good when you have many experts as friends!), Lan Nguyen, has documented how to import the dashboard here.

With the above done, download the dashboard to be imported here.

Once done, follow the steps below to configure the Metric Config XML files.

The above will take you to the Manage Metric Config screen.

  1. Click the ReskndMetric folder to expand it.
  2. Click the green plus sign to create a new file.

Name the file exactly as shown below:

Copy and paste this XML below:

<?xml version="1.0" encoding="UTF-8"?>
<AdapterKinds>
  <AdapterKind adapterKindKey="NSX">
    <ResourceKind resourceKindKey="SSLVPNEdgeService">
      <Metric attrkey="clients|clients_active" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|auth_failures" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|utilization" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|workload" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="FirewallEdgeService">
      <Metric attrkey="rule|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="EdgeServicesGateway">
      <Metric attrkey="cpu|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="disk|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|connection_health" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|connected" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|usage_average" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|maxObserved_KBps" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|attached_vms" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|running" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|status" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="mem|used_percent" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="LogicalSwitch">
      <Metric attrkey="port|max" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_packet_pct" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|broadcast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|multicast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|utilization" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_util" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|unicast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|multicast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|broadcast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|unicast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="summary|attached_vms" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
  </AdapterKind>
</AdapterKinds>
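Before pasting, you may want to sanity-check the XML on your own machine. Below is a minimal Python sketch that uses only the standard library. It is not part of vR Ops or the Management Pack. The helper names are hypothetical, and SAMPLE is a trimmed copy of the config above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the config above; for a saved file, use ET.parse(path).getroot() instead.
SAMPLE = """
<AdapterKinds>
  <AdapterKind adapterKindKey="NSX">
    <ResourceKind resourceKindKey="FirewallEdgeService">
      <Metric attrkey="rule|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
  </AdapterKind>
</AdapterKinds>
"""


def metrics_by_resource_kind(xml_text):
    """Map each resourceKindKey to its attrkey list; raises ET.ParseError if malformed."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for kind in root.iter("ResourceKind"):
        keys = [metric.get("attrkey") for metric in kind.findall("Metric")]
        result[kind.get("resourceKindKey")] = keys
    return result


def duplicate_keys(keys):
    """Return attrkey values that appear more than once, in first-seen order."""
    seen, dupes = set(), []
    for key in keys:
        if key in seen and key not in dupes:
            dupes.append(key)
        seen.add(key)
    return dupes


for kind, keys in metrics_by_resource_kind(SAMPLE).items():
    print(kind, len(keys), "metrics, duplicates:", duplicate_keys(keys))
```

A parse error or an unexpected duplicate is much easier to spot here than after the widget silently shows no data.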

That’s it!

To use the Dashboard, see the image below:

Hope you find it useful. Do reach out via LinkedIn and Twitter. Thanks for reading!

Architecting VMware vSphere for Operations

As an Architect, you take into account many requirements when designing your VMware vSphere environment. As an Operations person, I shall not question your Architecture. I’m sure it is fast, highly available, right on budget, etc. My role is to help you prove to the CIO that what you architected actually lives up to expectations. “Plan meets Actual” is what an Architect wants, because that means your architecture delivers on its promise.

The plan of the Architecture exists in some diagrams and documents. It's static.
The reality of the Architecture exists in the datacenter. It's live.

When proving, here are the questions we should be able to answer:

  • Availability: Does the IaaS deliver the promised Availability SLA? If not, what was the actual number, when was the SLA breached, for how long, and which VMs were affected? For each VM, when exactly did the outage start and end?
  • Performance: Does the IaaS serve all its VMs well? An IaaS platform provides CPU, RAM, Disk and Network as services. It has to deliver these 4 resources whenever a VM asks for them, 24 x 7, 365 days a year. If not, which VMs were affected, by what, and when?
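To make the Availability question concrete, here is a rough Python sketch. The outage log, the 30-day period, and the 99.95% target are all hypothetical values picked for illustration. In practice the downtime windows would come from your monitoring data.

```python
from datetime import datetime, timedelta

# Hypothetical outage log for one month: VM name -> list of (start, end) downtime windows.
outages = {
    "vm-a": [(datetime(2016, 6, 3, 2, 0), datetime(2016, 6, 3, 2, 45))],
    "vm-b": [],
}
PERIOD = timedelta(days=30)
SLA_PERCENT = 99.95  # assumed target, not a value from this post


def availability_percent(windows, period):
    """Actual availability over the period, given non-overlapping downtime windows."""
    downtime = sum((end - start for start, end in windows), timedelta())
    return 100.0 * (1 - downtime / period)


for vm, windows in outages.items():
    actual = availability_percent(windows, PERIOD)
    status = "met" if actual >= SLA_PERCENT else "BREACHED"
    print(f"{vm}: {actual:.3f}% ({status})")
```

Keeping the start and end of each window, rather than just a total, is what lets you answer "when exactly did it start and end" per VM.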

Those are simple questions, but they are very difficult to answer. Let’s say you have 10,000 VMs. How do you answer that? How do you provide an answer over time (e.g. monthly), proving you handle the peaks well?

To complicate matters, you need to be able to answer per Business Unit. Business Unit A will not care about other business units. Since a Business Unit has >1 apps, you also need to answer per application. An application owner only cares about her applications.

There are a few things you need to do, so you are in a position to prove it.

Step 1: Reflect the business in vSphere

Does your vCenter show all the business units? Can you show how the business is mapped into your vSphere environment? You design vSphere so Business can run on it, so where is the Business? A company is made up of business units, which may have multiple departments. The structure below is a typical example.

An application typically has multiple tiers. Does your vSphere understand that?

Map the above as folders in your vCenter.

I see many naming conventions that are not operations-friendly. It’s impossible to guess what an object is. The names are very similar and hard to read, so it’s easy for operators to make mistakes! In some companies, these operators are outsourced staff or contractors, who are not as familiar with the environment or don’t care as much as employees. Such naming conventions typically originate from the mainframe or MS-DOS era, when names could not contain spaces and were limited in length. An example is SG1-D01-INS-0001W-PRD. Can you guess what on earth that is? You’re right, you can’t. Imagine there are thousands of names like that, and you have new operators joining the help desk team.

If you have a shared application, you can create a folder for that. Multiple vR Ops applications can point to the same vCenter folder.

Folder, Tags and Annotation

Have you seen a vSphere environment where there are tags and annotation everywhere?

It’s rare to meet customers with a 100% well-thought-out and documented approach to the 3 features above. They may have general guidelines, but not explicit Do’s and Don’ts. As a result, these 3 features are used wrongly.

Use Tags when the values are discrete, ideally Yes/No. I’d use tags for the following:

  • VMs with RDM.
  • VMs with MSCS or Linux Clustering.

Do not use Tags when the values are unlikely to be shared by many objects. Use annotations for this. Examples are VM Owner Name, Email Address, Mobile Number. In an environment where there are >10K VMs, there can be 1000 VM Owners.

Do not use Tags to tag Service Tier. For Infra objects such as Cluster and Datastore Cluster, that should be clearly reflected in the name itself. I’d prefix all Tier 1 clusters with Tier 1, so the chance of deploying into the wrong tier is minimized.

Step 2: Design Service Tiers into vSphere

Does your vSphere understand that there are different classes of service? Are Tier 1 clusters and datastores clearly labelled?

You should avoid mixing multiple classes of service in a single cluster or datastore. While it is technically possible to segregate them, it’s operationally challenging. Resource pool shares are set per pool and divided among the VMs inside it, so they only behave as intended when sibling pools hold the same number of VMs.
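Here is a simplified Python model of the resource pool problem. It assumes full contention, equal demand from every VM, and no reservations, limits or per-VM share overrides. The real scheduler is more nuanced. Pool names, share values and VM counts are hypothetical.

```python
# Simplified model: under full contention, sibling pools split capacity by shares,
# and each pool's slice is then split evenly among its VMs.
pools = {
    "Tier 1": {"shares": 8000, "vms": 40},
    "Tier 3": {"shares": 2000, "vms": 5},
}


def per_vm_fraction(pools):
    """Fraction of contended capacity each VM gets, per pool."""
    total_shares = sum(pool["shares"] for pool in pools.values())
    return {
        name: (pool["shares"] / total_shares) / pool["vms"]
        for name, pool in pools.items()
    }


for name, fraction in per_vm_fraction(pools).items():
    print(f"{name}: {fraction:.1%} of contended capacity per VM")
```

With these numbers, the Tier 1 pool holds 4 times the shares, yet each Tier 3 VM ends up with twice the slice of a Tier 1 VM, because the Tier 1 pool is much more crowded. Keeping that balanced as VMs come and go is the operational challenge.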

For each tier, you need to have both an Availability SLA and a Performance SLA. For Performance SLA, review this doc.

Step 3: Define and Map Tiers in vR Ops

Now that you’ve designed service tiers into your vSphere architecture, it’s time to show them. You cannot show them in vSphere, as vCenter does not understand Performance SLA and Availability SLA. You can use vR Ops to do this. Follow these steps.

Step 4: Map Applications in vR Ops

Use custom groups to create applications. If you have a proper naming convention, it should not be difficult to select the members of an application. All you need is a query that says the name contains XYZ. There should not be a need for regular expressions.
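The idea behind the "name contains" rule can be sketched in Python. This only illustrates the logic and is not how vR Ops evaluates custom groups. The VM names are hypothetical, and the case-insensitive match is my choice for the sketch.

```python
# Hypothetical VM inventory; the names follow an assumed readable naming convention.
vm_names = [
    "Payroll Web 01",
    "Payroll Web 02",
    "Payroll DB 01",
    "Intranet Web 01",
]


def members(names, contains):
    """Select VMs whose name contains the given text (case-insensitive), no regex needed."""
    needle = contains.lower()
    return [name for name in names if needle in name.lower()]


payroll_app = members(vm_names, "Payroll")
payroll_web_tier = members(vm_names, "Payroll Web")
print(payroll_app)
print(payroll_web_tier)
```

With readable names, one plain substring selects the application and a slightly longer one selects a tier. Try doing that with SG1-D01-INS-0001W-PRD.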

Once apps are mapped, you can do something like this.

Step 5: Consider Debug-ability

Things go wrong. Especially in production. Your architecture should lend itself to troubleshooting.

A major area is to ensure the counters are reliable, else it’s hard to troubleshoot performance. The CPU Contention counter, which is the main counter for IaaS Performance SLA, is greatly affected by Power Management. Ensure your ESXi power management follows this guide by Sunny Dua.

Once you have that in place, you will be able to prove that your Architecture lives up to its expectation. Use the dashboards from Operationalize Your World to show that proof!