
NOC Dashboards for SDDC – Part 2

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Dashboard: Performance

Is my IaaS performing?

That’s the key question that you need to answer. You need to show if the clusters are coping well. Show how the clusters are performing in terms of CPU, RAM and Disk.

The above dashboard is per Service Tier. Do you know why?

Yes, the threshold differs for each tier. What is acceptable in Tier 3 may not be acceptable in Tier 1.
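As a rough illustration of why each tier gets its own widget, the sketch below maps the same metric value to a different color depending on the tier. The threshold numbers, the `TIER_THRESHOLDS` table and the `tier_color` helper are all assumptions made up for this example; they are not vROps settings or recommended values.

```python
# Hypothetical per-tier thresholds in percent: (yellow_at, red_at).
# The numbers are placeholders for illustration, not recommendations.
TIER_THRESHOLDS = {
    "Tier 1": (1.0, 2.0),
    "Tier 2": (3.0, 5.0),
    "Tier 3": (5.0, 10.0),
}


def tier_color(tier: str, value: float) -> str:
    """Return the display color for a metric value in the given tier."""
    yellow_at, red_at = TIER_THRESHOLDS[tier]
    if value >= red_at:
        return "red"
    if value >= yellow_at:
        return "yellow"
    return "green"


# The same reading is acceptable in Tier 3 but not in Tier 1.
reading = 4.0
colors = {tier: tier_color(tier, reading) for tier in TIER_THRESHOLDS}
```

With a single shared threshold, one of the tiers would always be either too noisy or too quiet on the big screen.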

The good thing about a line chart is that it provides visibility beyond the present moment. You can show the last 6 hours and still get good detail. Showing more than 24 hours results in a visualization that is too static, which does not suit the NOC use case.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • If you only have a few clusters, you can show multiple Service Tiers in 1 dashboard. 1 row per tier results in a simpler visualisation.
  • In an environment with >10 clusters, we can group them by Service Tier. Focus more on the highest tier (Tier 1).
  • In an environment with >100 clusters, we need another grouping in between. Group the Tier 1 clusters by physical location.
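The two-level grouping described above (Service Tier first, then physical location) could be sketched as follows. The cluster names, locations and the `group_clusters` helper are hypothetical and only show the shape of the grouping.

```python
from collections import defaultdict


def group_clusters(clusters):
    """Group (name, tier, location) tuples by (tier, location)."""
    groups = defaultdict(list)
    for name, tier, location in clusters:
        groups[(tier, location)].append(name)
    return dict(groups)


# Hypothetical inventory, for illustration only.
inventory = [
    ("cluster-a", "Tier 1", "Singapore"),
    ("cluster-b", "Tier 1", "Singapore"),
    ("cluster-c", "Tier 1", "Sydney"),
    ("cluster-d", "Tier 3", "Sydney"),
]
grouped = group_clusters(inventory)
```

Each key in `grouped` would then map to one widget or one row on the dashboard.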

When a cluster is unable to cope, is it because of high utilization? I show CPU, RAM and Disk here. You can add Network, as you know the physical limit of the ESXi vmnic.

Disk is tricky for 2 reasons:

  • It is not in %. There is no 100% for IOPS or Throughput. The good thing is that when you architected the array or vSAN, you did have a certain IOPS number in mind, right? 😉 Well, now you can get the storage vendor to prove that it does indeed perform as per the PowerPoint 😉 If not, you get free hardware if they promised fast performance that would meet the business requirement.
  • You need to show both IOPS and Throughput. While they are related, you can have low IOPS and high throughput, and vice versa.
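To see why IOPS and throughput need separate charts, recall that throughput is roughly IOPS multiplied by the I/O size. The sketch below assumes a fixed block size per workload, which real workloads rarely have; the `throughput_mbps` helper and the sample numbers are illustrative only.

```python
def throughput_mbps(iops: float, block_size_kb: float) -> float:
    """Throughput in MB/s for a given IOPS at a fixed block size (1 MB = 1024 KB)."""
    return iops * block_size_kb / 1024


# Many small I/Os: high IOPS, modest throughput.
small_blocks = throughput_mbps(iops=20_000, block_size_kb=4)

# Few large I/Os: low IOPS, high throughput.
large_blocks = throughput_mbps(iops=500, block_size_kb=1024)
```

Here the first workload issues 40 times more I/Os yet moves far less data per second than the second one.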

If the cluster utilization is high, the next dashboard drills into each ESXi.

We can also see if they are unbalanced. In theory, they should not be, if you set DRS to Fully Automated and pick at least the middle sensitivity level (3). DRS in vSphere 6.5 also considers Network, so you can still have unbalanced CPU/RAM.

With the dashboard above, we can tell if ESXi CPU Utilisation is healthy or not.

  • A low value does not mean the VM performs fast. A VM is concerned with the VM Contention metric, not ESXi Utilization. A low value means we over-invested. It is not healthy as we waste money and power.
  • A high value means we risk performance (read: contention).

For ESXi, go with a higher core count. You save on licenses if you can reduce the number of sockets.

We can also tell if ESXi RAM Utilisation is healthy or not.

  • Customers tend to overspend on RAM and underspend on CPU. The reason is this.
  • For RAM, we have 2 metrics:
    • Active RAM
    • Mapped RAM
  • The value you want is somewhere in between Active and Mapped.

In the dashboard, the 3 widgets have different ranges. The ranges I set are 30 – 90, 50 – 100 and 10 – 90.

Why not 0 – 100?

It is not 100% because you want to cater for HA. Your ESXi should not hit 100%: when HA restarts the VMs of a failed host on it, the demand would go beyond 100%, meaning performance will be badly affected.
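The HA headroom argument can be put into numbers. The sketch below assumes identical hosts and that the load of a failed host spreads evenly over the survivors; both are simplifications, and the function names are made up for this example.

```python
def max_safe_utilisation(hosts: int, failures_to_tolerate: int = 1) -> float:
    """Highest average host utilisation (%) that still fits after the failures."""
    if hosts <= failures_to_tolerate:
        raise ValueError("need more hosts than failures to tolerate")
    return 100.0 * (hosts - failures_to_tolerate) / hosts


def utilisation_after_failure(current_pct: float, hosts: int, failures: int = 1) -> float:
    """Average utilisation (%) of the surviving hosts once the load is redistributed."""
    if hosts <= failures:
        raise ValueError("need more hosts than failures")
    return current_pct * hosts / (hosts - failures)


ceiling_8_hosts = max_safe_utilisation(hosts=8)            # 87.5
after_failure = utilisation_after_failure(90.0, hosts=8)   # above 100
```

Under these assumptions, an 8-host cluster running at 90% would need more than 100% of the remaining capacity after losing one host, which is why the upper end of the range stops short of 100.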

If the cluster or ESXi utilization is high, is it because there are VMs generating excessive workloads?

The dashboard above answers if we have VMs that dominate the shared environment.

  • CPU: a heat map of all VMs, sized by CPU Demand in GHz (not %), colored by contention.
  • RAM: a heat map of all VMs, sized by Active RAM, colored by contention.
  • Storage: a heat map of all VMs, sized by IOPS, colored by latency.
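The "size by one metric, color by another" idea can be sketched outside the product as below, using CPU as the example. The `VmSample` type, the sample VMs and the 5% red threshold are assumptions for illustration; this is not how vROps builds its heat map internally.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    name: str
    cpu_demand_ghz: float       # drives the size of the box
    cpu_contention_pct: float   # drives the color of the box


def heatmap_cells(vms, red_at_pct=5.0):
    """Return (name, share_of_total_demand, color) tuples, largest box first."""
    total = sum(vm.cpu_demand_ghz for vm in vms)
    cells = []
    for vm in sorted(vms, key=lambda v: v.cpu_demand_ghz, reverse=True):
        share = vm.cpu_demand_ghz / total if total else 0.0
        color = "red" if vm.cpu_contention_pct >= red_at_pct else "green"
        cells.append((vm.name, round(share, 3), color))
    return cells


# Hypothetical samples: one VM dominates the demand and also suffers contention.
samples = [
    VmSample("web01", 1.5, 0.5),
    VmSample("db01", 8.0, 6.0),
    VmSample("web02", 0.5, 0.2),
]
cells = heatmap_cells(samples)
```

A VM that takes a large share of the total shows up as a large box, so a dominant workload is visible at a glance even before reading any number.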

At a glance, we can tell the workload distribution among the VMs. We can also tell if they are being served well or not.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • You can change the threshold anytime. If you want a brand new storage from Finance, set the max to 1 ms 😉
  • In a larger environment, group your heatmap (e.g. by cluster, host, folder).
  • We can show individual VMs, but we can’t show the history as there is too much data to show.
  • This needs to be done per Tier. 1 dashboard per Tier, as the threshold varies per tier.

Hope you find it useful. For the product-specific implementation, review this blog. To prevent vROps session from timing out, implement this trick by Sunny.

NOC Dashboards for SDDC – Part 1

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Be careful of what you’re showing on the big screen. There are many roles in an organisation. Each will have his/her own viewpoint. What you want to show for your managers is probably different to what you want your customers (App Team) to see.

  • If you create something for the big boss to see, you normally hang the big screen near her office. The complication is a lot of other people can see the info too. Would they appreciate the simplification that you do?
  • If you create for the NOC (Network Operation Center) team to see, think of what they already have. The Help Desk has alerts and large desktop screens already. Complement them; do not duplicate them.
  • If you create for the Infra team to see, think from their viewpoint. Since they are not help desk, they don’t look at the dashboards and get the alerts.

Best Practices

In addition to the general best practices, a NOC dashboard has its own specific guidelines:

No interaction with the screen

  • Avoid having buttons as there is no user clicking on any part of the dashboard. There is no mouse and keyboard either.

KISS (Keep it simple show)

  • Show minimal information, with large numbers.
  • Don’t show detailed charts as they are hard to read from afar. Be aware of the distance from which the info will be read. 9 point Calibri is clear on a laptop, but not on the projector screen.
  • Ideally, all the numbers are in %, with 0 being bad and 100 being perfect.
  • In cases like Utilisation, you should use the following markers:
    • 50% = good, balanced utilization. Ideally, this should be 75%.
    • 0% = wastage
    • 100% = highly utilized.

Use color to classify information.

  • Color is easier than text, as you don’t even need to read.
  • Easier to digest from afar and at a glance.
  • Use key colors (green, yellow, amber, and red).

Remember the 5-second test.

  • A dashboard for NOC should be easy and user friendly. It should not require an explanation.

Choose content that drives action.

  • If you display something that is red most of the time, after a while the viewer will ignore it. This defeats the very purpose of displaying on the big screen.
  • When something on the big screen is red, you want action to be taken. And it’s immediate, not tomorrow.

Dashboards Flow

You should also think of how the dashboards flow. The dashboards are not a collection of unrelated screens. There should be logical continuation, else it can be confusing to the viewers.

There are 5 areas that you can show. For each area, you can show multiple screens so it has some depth. Here are some examples of details:

Availability

  • Show the VM Availability and IaaS Availability.
  • For Infra, you can split into Compute, Storage and Network.

Performance

  • You can split into Tier 1, Tier 2 and Tier 3.

Automatically cycle the dashboards every 1-2 minutes. If you wait too long, viewers will lose interest and move on.

Examples

These are ideas to jumpstart your Big Screen dashboard. Every customer seems to have different requirements, because each does operations a little differently. So what you see here needs to be tailored.

Dashboard: VM Availability

Are the VMs up and running?

The dashboard answers this question: What is the availability among all VMs? Were any of them down in the past 30 days?

Just by looking at the color, you can tell easily if any VM has <100% uptime in the past 30 days. The red color is obvious. Notice the green has different intensity. You can tailor this setting.
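For reference, the uptime figure behind each box is simple arithmetic over the 30-day window. The sketch below assumes downtime is available in minutes and uses a plain green/red split; the graded green intensity mentioned above is left out, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_pct(downtime_minutes: float, window_days: int = 30) -> float:
    """Uptime as a percentage of the reporting window."""
    window = window_days * MINUTES_PER_DAY
    return 100.0 * (window - downtime_minutes) / window


def availability_color(pct: float) -> str:
    """Red for anything below 100% uptime, green otherwise."""
    return "green" if pct >= 100.0 else "red"


# 432 minutes (7.2 hours) of downtime in 30 days is 99% uptime.
example = uptime_pct(432)
```

Even a 99% VM turns red here, which matches the goal of spotting any VM with less than 100% uptime.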

We’re using a heat map because it can scale to >1000 objects. It’s also color coded.

Limitation & customisation

  • This heatmap requires a custom group. If you do not group the VMs, it will include VMs that you intentionally powered off (and hence do not want to show).
  • The group name is Monitored VMs. To exclude a VM, you need to place it under the exclusion list.
  • If you want to separate Tier 1 VMs from lower tiers, you can create 2 separate heatmap widgets.

Dashboard: IaaS Availability

Are the ESXi hosts up and running? Is any of them running at a high temperature? A temperature that runs too high will trigger the BIOS to power off the box.

This heat map can scale to a few hundred hosts, so it’s good enough for most customers. For customers with >500 hosts, group them by service tier. Yes, that means you need separate dashboards, 1 per service tier.

The 2nd widget is based on the vCenter Datacenter object. The logic is that you don’t have a localized heat problem, unless it’s a fan failure. Speaking of fan failure, you should add Blue Medora so you can show hardware, not just ESXi. Show hardware failures, such as power supply, fan and disk.

That’s it for Part 1. Hope you find it useful. Part 2 has been scheduled for Thursday.