Dashboards for IT Senior Management

This post is part of Operationalize Your World. Do read that post first to get the context.

The CIO, Head of Global Infrastructure, and other IT Senior Management have different dashboard requirements than technical folks.

Generally they want:

  • big picture, not details.
  • exceptions. Things that need their attention.
  • less technical info. Ideally, present it in business terms, not IT terms.
  • a portal that is easy to access. They may not want to log in to vR Ops. If they do, they may forget their password. (Note: vR Ops 6.5 cannot do login-less access yet.)
  • a UI that is easy to understand. So keep each dashboard to a specific question.
  • a system that is easy to use. So keep the interaction (clicking, zooming, sorting, etc.) minimal.

That’s what they want from you.

What do you want from them?

You show them something so you can get help (e.g. budget, resources). Here are some goals:

  • Show transparency by giving senior management visibility into the live environment.
  • Prove that you do need additional hardware.
  • Prove the wastage you have been talking about for months.

What do you not want to show? Urgent issues are something that you should not display. This is not about hiding information from the CIO. It is about giving you the time and space to do your job. If there is an active fire that requires your full concentration, you do not want to be interrupted by the CIO asking why it’s showing red on the dashboard!

I covered dashboard best practice in this post. Read that first, as this blog builds upon that.

Done? Great!

We take the same approach we did when planning dashboards for specific roles (e.g. Storage team, Network team). We ask a set of questions.

If we implement the above, we will end up with at least 5 dashboards. I’ve combined some of them. I see a wide variety of requirements, so you will customise them anyway 🙂

Basic Visibility

  • How many VMs are in our cloud? What are their CPU, RAM, and Disk allocations? This gives you the size of the environment the IaaS is supporting.
  • How much CPU, RAM, Disk do we have? Is it enough to support the above requirement?
  • You should also give the history of VM growth. What is enough today may not be enough in 3 months.

In the dashboard above, I’ve added Availability information. As VMs can be powered off intentionally by the application team, you should only report it for Tier 1 VMs. Tier 3 VMs, especially those in Test and Dev, can be rebooted frequently and hence will give misleading information.

Performance

The dashboard below shows all VMs. In a large environment, the heat map will automatically combine VMs with the same value (read: color & size).

Every VM is represented by a box. Each box takes a value between 0 and 3.

  • Green = 0. The VM is served well.
  • Yellow = 1. One of the IaaS services is not delivered as per the Performance SLA. We track CPU, RAM and Disk. If your SLA states 10 ms disk latency, then the VM has to get 10 ms.
  • Orange = 2. Two of the IaaS services are not delivered.
  • Red = 3. All 3 services are not delivered.
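The scoring above is just a count of how many of the three services miss the SLA. Here is a minimal sketch of that idea. The threshold values and metric names (`cpu`, `ram`, `disk`) are assumptions for illustration only (apart from the 10 ms disk example), and this is not how vR Ops computes it internally.

```python
# Hypothetical SLA limits. Only the 10 ms disk latency comes from the text;
# the CPU and RAM contention limits are made-up placeholders.
SLA_LIMITS = {
    "cpu": 5.0,    # assumed: max CPU contention (%)
    "ram": 1.0,    # assumed: max RAM contention (%)
    "disk": 10.0,  # max disk latency (ms)
}

# Index = number of services that breach the SLA.
COLORS = ["green", "yellow", "orange", "red"]


def breach_count(vm_metrics, limits=SLA_LIMITS):
    """Return how many of the tracked services exceed their SLA limit (0 to 3)."""
    return sum(1 for service, limit in limits.items() if vm_metrics[service] > limit)


def box_color(vm_metrics, limits=SLA_LIMITS):
    """Map the breach count to the heat map color described above."""
    return COLORS[breach_count(vm_metrics, limits)]


healthy_vm = {"cpu": 1.2, "ram": 0.0, "disk": 10.0}
slow_disk_vm = {"cpu": 1.2, "ram": 0.0, "disk": 25.0}
print(box_color(healthy_vm), box_color(slow_disk_vm))
```

Note that a VM sitting exactly at the limit (10 ms here) still counts as served well in this sketch.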

The VMs are grouped by Datacenter, then Cluster. This lets you see which Datacenter or Cluster isn’t coping well.

The above shows the VMs. What about applications? An Application spans multiple tiers and multiple VMs. Just because a VM does not perform does not mean the whole application is affected. As this is for Senior Management, we’re only showing the Tier 1 applications.

This blog explains the implementation.

Capacity

  • The CIO is not in charge of capacity management. He just needs to know the decision you want him to make (which is to approve a hardware purchase, or get VM Owners to rightsize). For that, he needs to know whether you are running out of capacity, and whether existing capacity is being wasted.
  • How is it growing? This can be taken care of by having a projection. This projection should take into account committed projects too.
  • Capacity is more complex than performance. Just because a vSphere cluster is running at low utilization does not mean it can serve the VMs well. See this for a detailed explanation.
  • Capacity is best presented with a line chart. This enables you to see the trend. For an environment with <10 clusters, you can fit all the clusters on the screen. For a large environment, you need to make a trade-off:
    • show live data. You can be detailed, as you’re only showing one data point per metric.
    • show historical data. You can’t be detailed, as you’re showing more than one data point per metric.

Here is an example with historical data. Notice we cannot show details, and the screen only accommodates <10 clusters.

Here is an example where we only show live data. We can show a lot more clusters, and for each we can show CPU, RAM, Disk and Network.

You may run out of capacity. But if you have a lot of wastage, you may have sufficient capacity after you reclaim it. See this for details.
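To make the reasoning concrete, here is a rough sketch of the arithmetic: project the trend, add committed projects, subtract what you can reclaim, and compare against what you have. It assumes a simple straight-line trend over monthly samples, which is my simplification and not the vR Ops capacity engine. The function names and numbers are hypothetical.

```python
def linear_projection(history, months_ahead):
    """Fit a least-squares straight line to monthly usage and extend it."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two monthly samples to see a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # The last sample sits at x = n - 1, so project from there.
    return intercept + slope * (n - 1 + months_ahead)


def shortfall(history, capacity, months_ahead, committed=0.0, reclaimable=0.0):
    """Projected demand minus capacity. A positive result means you run out."""
    demand = linear_projection(history, months_ahead) + committed - reclaimable
    return demand - capacity


# Hypothetical RAM usage in GB for the last 4 months, 170 GB usable capacity,
# and 20 GB already promised to committed projects.
usage = [100, 110, 120, 130]
print(shortfall(usage, capacity=170, months_ahead=3, committed=20))
print(shortfall(usage, capacity=170, months_ahead=3, committed=20, reclaimable=30))
```

The first call shows a shortfall in 3 months. The second shows the same cluster coping once the wastage is reclaimed, which is exactly the conversation you want to have with the CIO.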

Configuration

  • Do we have “bad” configuration? Examples are old & unsupported versions of Windows, Linux, ESXi, VMware Tools, etc.
  • How uniform is our environment? Complexity is required to optimize cost (hardware, software) and performance. However, there is a cost to complexity.
  • Do we have outdated and unsupported products?

If your CIO does not appreciate the complexity, showing the CIO the complexity is good for you. It will result in appreciation of your expertise & effort, as it’s certainly easier if the complexity is low. Complexity increases when you have a wide variety of things.

Factors impacting complexity:

  • Number of ESXi versions. The more variants, the more complex.
  • Number of ESXi CPU versions
  • Number of brands. The more vendors, the more complex, as you need to learn them and spend time with their teams.
  • Number of cluster node sizes
  • Number of shared Datastore sizes
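Each factor above boils down to counting distinct values. Here is a small sketch, assuming you have exported an inventory as a list of records. The field names and sample values are hypothetical, not vR Ops property keys.

```python
def variety(inventory, attributes):
    """Count the number of distinct values seen for each attribute."""
    return {attr: len({item[attr] for item in inventory}) for attr in attributes}


# Hypothetical ESXi host inventory.
hosts = [
    {"esxi_version": "6.0", "cpu_model": "E5-2680 v3", "vendor": "Vendor A"},
    {"esxi_version": "6.0", "cpu_model": "E5-2680 v4", "vendor": "Vendor A"},
    {"esxi_version": "6.5", "cpu_model": "E5-2680 v4", "vendor": "Vendor B"},
]

print(variety(hosts, ["esxi_version", "cpu_model", "vendor"]))
```

The lower the counts, the more uniform the environment. The same function works for cluster node sizes or datastore sizes if you feed it a list of clusters or datastores instead.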

See the examples provided here for Infrastructure and here for VM.

Multi-tier Application Monitoring

This post is part of Operationalize Your World. Do read that post first to get the context.

I covered a single-tier application in this post. Read that first, as this blog builds upon that.

Done?

Great! A lot of you have shared that you want to monitor multi-tier applications. One customer has a mission critical application that spans 5 tiers and 68 VMs. The dashboard I shared earlier does not scale to that level, as you certainly don’t want to check 68 VMs one by one!

Application

To make sure we are on the same page, here is an example of a multi-tier application:

A multi-tier application can suffer from either a horizontal or a vertical problem:

  • By horizontal, I mean a tier has a problem. When the web tier is slow, it can slow down the entire application. The speed of a convoy is determined by the slowest car. We will use the following formula to determine the application performance:
    Application Performance = Minimum (Tier Performance)
  • By vertical, I mean something that cuts across tiers. Storage, for example. If the slowness is caused by something common, there is no need to troubleshoot individual VMs, as they are simply victims.

That means we need to check both angles when an application has a performance problem:

  • Which tier had the problem? Since when? How bad? What was the problem?
  • What infra problem did the app have? Storage, Network, CPU, RAM?

The above check makes a good starting point in your analysis. Don’t zoom into a particular VM until you know the overall picture. There is no point fighting a fire in the kitchen if the whole house is on fire.

Application Tier

The health of a tier is the average health of its members. This is because a tier scales out. We are not taking the minimum value. This is not a convoy.

“Hold on!” you might say. Since it is scale-out, the App Team has catered for this. If they only need 3 web servers, they will deploy 4 or even 5. So both performance and availability are not affected. The tier performance has to take into account this extra node, not simply do an average.

This logic sounds reasonable. But is it correct?

Actually, it is not, because this is not about Availability. This is about Performance. All web servers are still up, but if node number 4 is slower, user experience will be affected.

My fellow blogger Luciano Gomes advises that some load balancers can detect the performance of a node. This is good, as simply counting the number of sessions is not a complete measurement. The node measurement is based on this formula. It takes into account how the IaaS platform is serving the node. So it’s looking beyond the Guest OS. This is because performance cannot be measured from within the Guest OS only. Review this discussion between David Davis and Sunny Dua.

This is why we are doing an average. You want to be informed if there is degradation, since this is performance, not availability.

The health of a VM is simple. We leverage the work we’ve done for a single-tier application here.
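To tie the three levels together, here is a minimal sketch of the roll-up: VM health is the input, a tier is the average of its VMs, and the application is the minimum of its tiers. The 0 to 100 health scale, the tier names and the sample values are assumptions for illustration. In vR Ops you would build this with super metrics rather than Python.

```python
from statistics import mean


def tier_health(vm_health_values):
    """A tier scales out, so its health is the average of its member VMs."""
    return mean(vm_health_values)


def application_health(tiers):
    """An application is a convoy of tiers, so take the slowest tier."""
    return min(tier_health(vms) for vms in tiers.values())


# Hypothetical application: one slow web node out of four.
app = {
    "web": [100, 100, 100, 60],
    "app": [90, 90],
    "db": [95],
}

print({name: tier_health(vms) for name, vms in app.items()})
print(application_health(app))
```

Notice the slow web node pulls the web tier down to 90 instead of hiding behind the three healthy nodes, yet the application is not reported at 60 as it would be if we took the minimum across all VMs.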

Dashboard 

I found that doing a logical design of the dashboard actually saves time. Follow best practices to help you.

Here is the logical design of the dashboard. Notice it has 3 levels: App, Tier, VM.

Here is what it looks like:

Click on the image below to see an explanation of each section:

Hope you find it useful. As usual, you do not have to build this from scratch. This is part of Operationalize Your World, which gives you 50+ dashboards.

Monitoring NSX Edge, SSL VPN, Firewall and Logical Switch

This blog is contributed by my friend Luciano Gomes, a VMware PSO Senior Consultant in the Rio de Janeiro area, Brazil. Thank you, Lucky!

In this post, I would like to show you how you can monitor NSX Edge, SSL VPN, Firewall and Logical Switch using only one dashboard.

First, let’s get the prerequisites out of the way:

  1. vRealize Operations (Advanced/Enterprise License)
  2. vCenter + NSX
  3. vR Ops Management Pack for NSX

My friend Romain Decker has covered the installation of the Management Pack. Read it here first.

Another friend (life is good when you have many experts as friends!), Lan Nguyen, has documented how to import the dashboard here.

With the above done, go download the Dashboard to be imported here.

Once done, follow the steps below to configure the Metric Config XML files.

The above will take you to the Manage Metric Config screen.

  1. Click the ReskndMetric folder to expand it.
  2. Click the green plus sign to create a new file.

Give it the exact name shown below:

Copy and paste the XML below:

<?xml version="1.0" encoding="UTF-8"?>
<AdapterKinds>
  <AdapterKind adapterKindKey="NSX">
    <ResourceKind resourceKindKey="SSLVPNEdgeService">
      <Metric attrkey="clients|clients_active" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|auth_failures" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|utilization" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|workload" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="FirewallEdgeService">
      <Metric attrkey="rule|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="EdgeServicesGateway">
      <Metric attrkey="cpu|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="disk|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|connection_health" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|connected" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|usage_average" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|maxObserved_KBps" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|attached_vms" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|running" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|status" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="mem|used_percent" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="LogicalSwitch">
      <Metric attrkey="port|max" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_packet_pct" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|broadcast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|multicast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|utilization" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_util" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|unicast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|multicast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|broadcast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|unicast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="summary|attached_vms" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
  </AdapterKind>
</AdapterKinds>

That’s it!
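If you edit the XML by hand (for example, to add your own metrics), it is easy to break it with a stray tag. Here is an optional sanity check you can run on your laptop before pasting. It only uses the Python standard library and only confirms the file is well-formed and lists what it contains. It does not check that the metric keys exist in the NSX Management Pack. The sample is a shortened excerpt of the XML above, and `list_metrics` is a name I made up.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the Metric Config XML, as bytes so the encoding
# declaration is honoured by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<AdapterKinds>
  <AdapterKind adapterKindKey="NSX">
    <ResourceKind resourceKindKey="FirewallEdgeService">
      <Metric attrkey="rule|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
  </AdapterKind>
</AdapterKinds>
"""


def list_metrics(xml_bytes):
    """Parse the config and return {resourceKindKey: [attrkey, ...]}.

    ET.fromstring raises xml.etree.ElementTree.ParseError if the XML
    is not well-formed.
    """
    root = ET.fromstring(xml_bytes)
    result = {}
    for kind in root.iter("ResourceKind"):
        keys = [metric.get("attrkey") for metric in kind.findall("Metric")]
        result[kind.get("resourceKindKey")] = keys
    return result


for resource_kind, metrics in list_metrics(SAMPLE).items():
    print(resource_kind, len(metrics), metrics)
```

To check your real file, read it with `open(path, "rb").read()` and pass the result to `list_metrics`.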

To use the Dashboard, see the image below:

Hope you find it useful. Do reach out via LinkedIn and Twitter. Thanks for reading!