SDDC Dashboards: The Kitchen

This post is part of Operationalize Your World post. Do read it first to get the context.

There are only 4 parts in the IaaS Monitoring:

  1. Capacity
  2. Configuration (with Inventory)
  3. Audit and Compliance
  4. Availability

Can you figure out why we do not have Performance in “the kitchen” area of your restaurant business?

Performance SLA concept explains why. I’ve also applied it to VDI use case and give an example.

Capacity

The Capacity dashboards below take into account Performance SLA and Availability SLA. Only when these 2 are satisfied, that it considers Utilization. Review this series of blogs for an extensive coverage on this new model.

The set of dashboards answer questions such as:

  • What’s the capacity of my clusters?
  • What’s the consumption on the clusters?
  • Which clusters are running low?
  • Is the cluster still coping well with demands?
  • Does a cluster consist of mostly large VMs?

Here it the dashboard for Tier 1, where we do not overcommit. As a result, both performance & utilization are irrelevant. It is driven by Availability SLA.

The vCPU and vRAM remaining is based on allocation model. It takes into account HA setting.

BTW, the lines in the 2 line chart above do not gradually come down (or up) because this is a lab, not a real life environment. Your production environment will have line chart that makes sense 🙂

Here it the dashboard for Tier 2 or 3. Since we overcommit, we now have to take into account performance, and then utilization.

capacity-tier-2-compute

As you can see from the above dashboard, it has 3 sections:

  • Availability SLA. Do we reach the concentration risk?
  • Performance SLA. Do we serve existing workload well?
  • Utilisation. It uses the net usable capacity as the ceiling. This ceiling takes into account your HA settings and Buffer. The default value for buffer is 10%, which you can change via policy.

Can you spot a limitation on the capacity dashboards I’ve shown so far?

Yes, it’s hard to compare across clusters. If you have many clusters, you want to know which clusters to check first. This dashboard lets you compare. It’s color coded so it’s easier for you to see.

For implementation details, refer to this post.

The twin-sister of Capacity is Reclamation. What can you reclaim and from which VMs?

reclamation

For implementation details, refer to this post.

Configuration 

In the world of Software-Defined, configurations are easy to change. So consistency and drift become 2 areas you need to watch.

The set of dashboards answer questions such as:

  • Are my ESXi config consistent, especially if they are member of the same cluster?
  • Are my ESXi & Clusters configured to follow best practice?
  • Do I have too many combination, which increase complexity?
  • What have I got?

The dashboard below is for ESXi:

configuration-esxi

The dashboard below is for Cluster:

configuration-cluster

That’s all you can do in vR Ops 6.4. If you need more details, you need to deploy VCM. The latest release is 5.8.3. For the list of configuration that it can track per object, review this.

Inventory differs to Configuration.

  • Configuration has Standard, and hence Drift. Inventory does not.
  • Configuration has Compliance. Inventory does not. Well, not generally 😉
  • Configuration has value that can be bad (e.g. ESXi has no syslog). Inventory does not.
  • Inventory has stock take (typically annual). This can trigger work, which impact Configuration.
  • Inventory is typically reported on regular basis.

Because of the above, we’ve provided a purpose built dashboard to track inventory.

inventory

Audit and Compliance

You can check your environment compliance to vSphere Hardening Guide. The dashboard belows shows the summary of compliance, with ability to drill down to each object.

capture

vCenter tasks, events and alarms are 3 areas that you can mine to help answer compliance and audit. Log Insight complements vR Ops nicely here. For example, the following screenshot answer this audit question

  • Who shutdown what VM and when?

compliance

There are many things it can answer, and it’s covered in the workshop.

Availability

Because of HA and DRS, tracking Cluster makes more sense than tracking each ESXi. A cluster uptime remains 100% when 1 host is not available because you have HA. You have catered for that, and as a result, you should not be penalized.

The set of dashboards answer questions such as:

  1. What’s the availability (%) of each cluster in the last 24 hours? Each cluster has its own line chart, and it’s color coded. You expect a green bar, as shown below.
  2. What’s the availability now? The heatmap provides that answers quickly. You can drill down into the cluster if you spot a problem.
  3. Am I containing risk when there is a major outage. How many VMs am I willing to lose when a cluster or datastore goes down?

availability-cluster

The heat map also provides the ESXi uptime. You can toggle between Cluster and ESXi.

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

3 thoughts on “SDDC Dashboards: The Kitchen

  1. Pingback: Tenant Self-Service Dashboards - virtual red dot

  2. Pingback: Capacity Management: What's wrong with these statements? - virtual red dot

  3. Pingback: Capacity Management based on Business Policy

Leave a Reply