SDDC Dashboards: The Dining Area

If your customers are happy, your internal problem is secondary.

To ensure that your customers are happy, there are a few proof you must be able to show:

  1. Are the VMs up?
    • This is the #1 Job. It is more important than security and performance. If the VM is dead, there is nothing to talk about 🙂
  2. Are they fast?
    • Just because they are up does not mean they are fast 🙂
    • Is your IaaS serving them well?
  3. If not, which VMs are hit? By what and when?
    • Who are the victims?
  4. Who’s causing the problem?
    • Who are the villain?
  5. When a VM Owner complains, can your Help Desk value add, within 1 minute?
  6. We know we have Over Provisioning disease.
    • But how bad is it?
    • Can you Right-sizing VM, without impacting performance?

Let’s go through the dashboards that answer those questions, starting from Question 1.

Are the VMs up?

The dashboard helps in the following area:

  1. What’s the overall uptime? CIO may ask you to give the overall uptime across time. You can provide a line chart, showing the aggregate uptime among all the VMs.
  2. What’s the Uptime for each VM per month? The table on the dashboard is grouped by month. It’s showing Sep 2016. All VMs are showing 100%, which is what you want to see before you go for lunch or holiday 🙂
  3. What’s the VM availability now? The heat map provides an easy visualisation. You just expect green for all VMs.
  4. If a VM Uptime is <100%, when was it down and how long? You can click on the heat map, and a line chart will be shown automatically. What you want to see is a straight line.

Availability - VM

Are they fast?

The dashboard helps in the following area:

  1. Is your IaaS serving them well? If not, when does it fail to deliver?
    • If you do not define well, you have not defined fast. If you have not defined it, you have not set measureable expectation. That’s not a position you want to take, unless you enjoy performance troubleshooting 🙂
    • Measureable expectation = Performance SLA. Review this to help you.
  2. Which part of your IaaS business fails to deliver the promise?
    • In IaaS, you are only selling CPU, RAM, Disk and Network. The VMs are consuming these 4 resources. Make sure they get what you promise them
  3. How is the performance per cluster?
    1. vSphere Cluster is the smallest logical building block, due to DRS and HA.

picture1

The Performance SLA is dynamic. When you select a cluster of different tier, notice the SLA changed too. You can adjust the SLA to your actual number.

picture2

Who are the victims?

Your IaaS can fail to deliver different resources at different time. For example, it has CPU performance issue at 12:35 pm and Disk performance issue at 22:40 pm. The performance line chart shows you any correlation, if any. In the above example, the selected cluster has Storage performance issue, but doing well on CPU and RAM.

During the same time interval, different VMs can be hit by different problems. If your IaaS fails to deliver on CPU and Disk at 12:35 pm, VM 007 can be hit with CPU problem while VM 747 can be hit with Disk problem. This is why you need to be able to see each resource (CPU, RAM, Disk, Network) independently.

This dashboard depends on the previous dashboard. You select a cluster, then navigate to this dashboard. It will only show VMs from that cluster. You can see which VMs are hit by what (CPU, RAM, Disk, Network). This lets you take the appropriate action, before VM Owner complains.

picture3

Who are the villain?

Which VMs were generating excessive workload? When and for how long? You can see it by tracking the maximum workload generated by any VM on a line chart. The example below shows an excessive IOPS. It jumped to 13,212 IOPS when the average did not even touch 15 IOPS.

picture1

VMs can only generate excessive workload on IOPS and Network. It can’t abuse CPU and RAM, as it can’t go beyond the configuration. The dashboard tracks IOPS and Network. Once you see a peak, you use the Top-N to list the VMs.

picture7

When a VM Owner complains

A VM Owner only cares about her VM. The fact that you have 1001 other VMs is irrelevant. As a result, the fact that your VMware cluster is working hard at 100% utilization is also relevant. That’s the following dashboard does not show other VM and your Infrastructure.

A Help Desk operator can simply search for the VM, or browse the list. Once found, he simply selects that VM. How well your IaaS platform serves it will be automatically shown. The dashboard uses line chart, and not a single number, so you can if there is any pattern. From the example below, it’s clearly showing the IaaS unable to meet its promise on CPU but do well on RAM. It failed for around 20 minutes on Disk.

picture4

The above dashboard clearly tells if you are serving your customer well. It’s suitable for Help Desk Operator. What if you need to find out why. Another word, you move from monitoring to troubleshooting. From this dashboard, you can navigate to the VM Troubleshooting dashboard

picture5

The troubleshooting dashboard provides additional counters. Performance problem can be caused by only 2 main reasons:

  1. The VM itself
    1. It is too small, using wrong driver, apps not leveraging resource.
  2. The Infra
    1. It is heavily loaded (normally due to lots of small VMs)
    2. It is unable to cope (normally due to large VM).
    3. Infra means ESXi, Storage and Network.

The dashboard automatically shows the relevant ESXi and Datastore, with their KPI.

Again, line chart is used, and not a single number, because they give you a lot more info.

picture6

Over Provisioning disease

If you take all the large VMs in your environment, and plot the maximum utilization among them, what do you expect?

You are right. It depends whether they are over provisioned or not. If they are, the max among them will be low. The average will be even lower.

In a healthy, right-sized environment, there is bound to be 1 VM who have high utilization at any given time. This is especially true in a large environment.

The line charts below show the Max and Average utilizations among the large VMs. We can tell easily the degree of provisioning.

The line chart does not show the VMs. That’s where the table comes in. It shows the max utilization of each VM in a given period.

The table does not show relative comparison among these large VMs. If you want to expose the largest VMs, the heat map shows that. The larger the VM, the larger the box.

picture1

What about undersized? Generally speaking, this is not your problem. But if you want to answer “Which VMs hit high CPU usage when?”, you can use the following dashboard:

picture3

The above is what you want to see, indicating only 2 VMs had the problem in the past >1 month. In an environment where many VMs are undersized, you will see something like this. Notice this is not 2 months. This is just 6 hours, and each bar is only 10 minutes!

picture4

Right-sizing VM without impacting performance

The previous dashboard give you the overall situation. To right size, you need to deal with individual VM. This gives you the confidence that performance will not be affected.

You can select any of the large VM, starting from the one with the least utilization. The dashboard below will automatically lists the VM utilization.

  • Each vCPU of the VM are listed in table. It shows the maximum utilisation of each vCPU in the timeline you are interested.
  • It shows analysis of the utilization of the VM. The Forensic chart shows 95% of the VM utilization. You expect that number to be >80% as a VM can’t be spending 95% of the time doing just 20% utilization. The Forensic also shows you the remaining 5%, so you can be convinced.

1-vm

Most VM Owners will ask for a detailed line chart showing each vCPU utilisation. The line chart below will be automatically shown when a VM is selected. It retains a 5-minute granularity.

picture2

RAM right sizing is more challenging, hence it’s not covered here yet. Review this to know more why its not so simple.

Conclusion

Hope you find the blog useful. For more info, you can refer to chapters 4 – 7 in this book.

A set of dashboards for SDDC Operations

This post continues from the Operationalize Your World post. Do read it first so you get the context.

A common requirements among customers is to have a set of dashboards to help them manage their VMware IaaS platform. They want a suite of inter-connected dashboards, not individual dashboards.

In the past several years, we have developed around 40 dashboard to help you operate your VMware SDDC. The dashboards form 1 story. We group each dashboard into the 4 pillars of SDDC Operations.

The set of dashboards also go beyond vSphere Admin, and provide dashboards for Storage Team, Network Team and NOC Team. However, they are yet to provide a complete coverage for every role and every purpose. The table below shows the coverage. No means there is no dashboard yet.

blog 2

Different roles in the team are interested in what’s relevant to them first, which is why the dashboards are tailored for each. Here is the dashboards provided, grouped by role and purpose.

There are naturally more dashboards for the Platform Team. The team was known as the Server Team in the old days of physical world. They have evolved into Platform Team, and is typically where VMware Admin and Architect belong.

They have 2 interfaces in the company:

  • upstream: to VM Owners, application team.
  • downstream: to Storage Team, Network Team

In addition, they also deal with IT Management (CIO, etc), Help Desk and Security/Compliance team.

Capture

You may notice in the above picture that some dashboards are in grey. That means they are not available. Need MP means it needs a Management Pack. We have not included MP as part of this solution. You should get vSphere under control first before extending coverage. Need feedback means I’m yet to see a use case for it. Every dashboard answers a question, and has to be complementary to other dashboards.

The tools we use to manage VMware SDDC is vRealize Operations and Log Insight. We do live demo during the events and customers eagerly ask for a copy that they can import into their environment. This blog provides the steps to import.

Here is what they look like in vRealize Operations 6.3. The dashboard works with 6.3 and 6.2. I have not yet used any 6.3 new feature to maintain backward compatibility. You can easily modify them to take advantage of 6.3.

Dashboards

This will be a set of blog posts. We will cover in the following order:

  1. The Dining Area
  2. The Kitchen
  3. Network Team
  4. Storage Team
  5. Help Desk & NOC
  6. Senior IT Management

The series will appear, 1 blog posts a week starting this week. I’ll add the link to this blog, so you have 1 place to see them.

What’s new with vRealize Operations 6.3

This blog complements the official blog that you can find here, and other articles such as this. Bill Roth and team have written a lot of articles there. You should also read the Release Notes. Take note when you have EPO agent.

Super Metric

I use around ~50 super metrics in typical engagement. By and large, it’s good enough for my customers’ requirements. On the other hand, I have seen folks like Ronald Buder and Brandon Gordon, who build very advance formula, and would like to have more capabilities. Here are 3 enhancements that would go a long way in making super metric more useful:

  • Ability to specify a condition
    • Prior to 6.3, super metric applies to every member of the group or the parent object. If you are counting the number of VMs in a cluster, it will give you all VMs.
    • With 6.3, you can add condition. You can count only VMs that are powered on, or VMs with >8 vCPU. Another example, you can count how many VMs in a Datastore which have latency above certain number.
  • Ability to have IF THEN ELSE
    • Prior to 6.3, super metric works in 1 formula. You cannot apply Formula 1 for condition A and Formula 2 if Condition A is not met. A use case here is you are checking VM Uptime. If you have VMware Tools running, you use the Tools heartbeat to decide that the VM is up. If the VMware Tools is not running, you use VM utilization to decide.
    • The IF THEN ELSE can be combined with AND, OR and NOT. This enables you to build a more comprehensive logic.
  • Ability to combine expression
    • You can have AND, OR, NOT. Enough said 🙂
    • Ability to compare. You can have less than, less than or equal to, greater than, greater than or equal to.

The where clause cannot point to another object, but can point to different metric in the same object. For example, you cannot count the number of VMs in a cluster with CPU Contention metric > SLA of that cluster. The phrase “SLA of that cluster” belongs to the cluster object, not VM object.

That right operand must also be a number. It cannot be another super metric or variable.

The where clause cannot be combined with AND, OR, NOT. This means you cannot have “where VM CPU > 4 and VM RAM > 16”. The reason is that ‘where’ clause calculation is running on the vR Ops node where the data is retrieved, while the rest of all operators (AND, OR, NOT) are running on the node where the super metric expression is executed. Other operators are executed when all data has already retrieved. The retrieved data does not contain metric values for each member object but aggregated values of these objects.

As expected, you will find the new operators in the super metric editor, as shown below.

1

The following screenshot, courtesy of Brandon Gordon, shows a brief description of the operators:

2

Example on how to use the where clause

sum(${adaptertype=VMWARE, objecttype=VirtualMachine, attribute=mem|guest_provisioned, depth=5, where = "sys|poweredOn==1" })

Example on how to use the IF THEN ELSE

${this, metric=diskspace|used}>1024 ? max(${this, attribute=virtualDisk|commandsAveraged_average} as IOPS) / ${this, metric=diskspace|used} * 1024 : max(IOPS)

The [x,y,z] array is actually available since earlier release. What you can do now is x, y, z are independent expressions and all their results are put into the array. They are no longer limited to just constant or metric.

Resource Alias

The name of resource is rather long. If you have a lot of resources in the formula, the whole formula can be hard to read. You can now have a name for the resource. Here is an example:

Before 6.3:

(min(
${adapterkind=VMWARE, resourcekind=HostSystem, attribute= cpu|demand|active_longterm_load, 
depth=5, where=”>=0”}) + 0.0001)/
(max(${adapterkind=VMWARE, resourcekind=HostSystem, attribute=cpu|demand|active_longterm_load, 
depth=5, where=”>=0”}) + 0.0001)”

We will name the resource as CPUload. by adding as CPUload in the formula. Once added, we can refer to it in the formula, resulting in a shorter formula.

(min(
${adapterkind=VMWARE, resourcekind=HostSystem, attribute= cpu|demand|active_longterm_load,
depth=5, where=”>=0”} as CPUload) + 0.0001)/
(max(CPUload) + 0.0001)”

Notice that CPUload includes the depth clause and where clause, not just the metric.

VM Memory metrics

I discuss the limitation in sizing VM RAM in this blog. In a nutshell, the hypervisor does not have visibility into how the Guest OS manages its RAM. Some applications, such as JVM and databases, manage their own RAM. The guest OS does not have visibility to how the app manages its RAM. This is why RAM sizing is best done at Guest OS and App levels.

vR Ops 6.3 brings Guest OS RAM metrics. Yes, it is agentless. There is no need to deploy agents on every VM. How does it work then, if there is no network connection to the VM? VMware Tools comes to the rescue! vR Ops talks to vCenter, which in turns talks to the ESXi via management network. The new version of VMware Tools pulls these additional counters. ESXi retrieves them and passes them to vCenter.

This feature was actually available since vSphere 6.0 Update 1. Yes, that means you need a minimum of ESXi 6.0U1, vCenter 6.0U1 and the VM must be running Tools from ESXi 6.0U1. You do not need to upgrade to vSphere Update 2.

Here is the list of metrics. I’m giving in the internal name as it’s handy to know them.

Internal nameDescription
guest|mem.free_latestThis is one the 3 major counters for capacity & performance monitoring. The other 2 counters are Page-in Rate and Commit Ratio.
In Windows, this is the Free Memory counter. This excludes the cached memory. If this number drops to a low number, Windows is running out of Free RAM. While that number varies per application and use case, I’d generally keep this number > 500 MB for server VM and >100 MB for VDI VM. I set a lower number for VDI because they add up. If you have 10K users, that’s 1 TB of RAM.
guest|mem.needed_latestThe amount of memory needed by the Guest OS.
This looks like the Used RAM as the value is the opposite of Free.
guest|page.inRate_latestguest|page.inRate_latest The Rate the Guest OS brings memory back from disk to DIMM per second. A page that was paged out earlier, has to be brought back first before it can be used. This creates performance issue as the application is waiting longer, as disk is much slower than RAM.
guest|page.outRate_latestThe opposite of the above. This is not as important as the above. Just because a block of memory is moved to disk that does not mean the application experiences memory problem. In many cases, the page that was moved out is the idle page.
guest|page.size_latestSize of the page. In Windows, this is 4 KB by default.
This is not the size of the pagefile.sys in c:\.
guest|mem.physUsable_latestPhysically Usable Memory
Based on a sample of 9 VMs (Windows and Linux), this looks like VM Configured RAM - Hardware used. Since Hardware Used is near 0, this value is near the Configured RAM
guest|swap.spaceRemaining_latestRemaining Swap Space
If you know what this means, I’d appreciate if you let me know.
guest|hugePage.size_latestCurrent size of Huge Page.
This should be 2 MB in Windows.
guest|hugePage.total_latestTotal number of Huge Pages.
Windows Performance Monitor and Resource Monitor do not track this, so it’s good to vR Ops providing additional details.
guest|contextSwapRate_latestContext Swap Rate per second in Windows/Linux
guest|mem.activeFileCache_latestActive File Cache Memory.
This seems to be the Cache Bytes in Windows

Let’s compare them with the RAM counters from Windows. The list below is from Windows 10 Performance Monitor.

3

I’m not sure if they are enabled by default. If not, it’s a matter of enabling from the Policy, as shown below:

VM RAM

This is what it looks like in VM object. Finally! 🙂

VM RAM 2

Reduction in Metrics

This is one of my favourite, as I do have customers struggle with the long list of metrics. This should also improve vR Ops scalability. The example below is from ESXi Host. Quite a number of the capacity metrics are now hidden, as they are needed by default.

9

The reduction can be seen in the Self Monitoring, which has improved a lot in 6.3 also. You can see the number of metrics dropped on the following chart.

Reduced Metrics

The reduction translates into less resource utilisation (CPU, Disk, RAM). I’ve added CPU as an example. Notice the load is also less spiky.

CPU reduced in percentage

Drill down via Line chart

One popular use case is the ability to automatically plots all the children value when you select a parent. There are many examples of this, such as:

  • You select a cluster, and you want to automatically have a line chart of all its ESXi CPU Demand. If you have 8 hosts in that cluster, then you get 8 line charts.
  • You select a data center, and want to automatically have a line chart of all its clusters No of VM too have a sense of VM growth among clusters.

See the following screenshot. Can you notice how it’s done?

4

Hint: it’s done a little differently than what you do in other widgets.

The way you do this is by knowing relationship among objects. You choose the metrics you want to display, not the parent. In the following example, I need to show the ESXi CPU contention on all ESXi in a cluster. So I pick the ESXi object, not the cluster object.

You do not have to specify the relationship (parent, child, self, etc.). vRealize Operations actually automatically figures out the relationship. Unlike other widgets, where you must specify, the View Widget has that intelligence built-in. Nice!

Can you spot a performance issue that happened in the past in the selected cluster below?

5a

The above screenshot shows one of the ESXi experienced a spike in CPU Contention. It touched 9%, which is a high number as the number at ESXi level is the average of all its VMs. One of the VM likely experiencing a much higher number, as most VMs have low CPU Contention. The reason why most have low value because your ESXi has enough cores to serve quite a number of VMs.

Heat Map: Zoom and Grouping

I use heat map a lot, especially in Configuration and Capacity use cases. They are also useful in NOC (big screen or projector). They are not so useful in performance as they can only show latest value. Since vR Ops collect data every 5 minutes, that means anything beyond 5 minutes cannot be shown.

The other limitation of Heat Map, which is addressed in 6.3, is scalability. When you have lots of objects, it can be difficult to see. 6.3 groups the objects, and allows you to drill down.

6

I then drilled down into the selected group. It reveals a lot of more objects.

6 (2)

Sportier looking

I’m a big fan of UI and UX. While underlying architecture matters, the human experience is what we see every time we deal with the system. There are 3 UI enhancements that I spotted as I compared 6.2.1 widgets with 6.3 widgets.

Scoreboard widget

The Scoreboard widget now provides more visual themes than just 2 themes. This is useful when you have multiple Scoreboard widgets in 1 screen. You can use 1 theme for VM and another theme for Infrastructure objects. They help in differentiating objects easily.

7

There is a small usability enhancement. When you choose Fixed View, the size controls do not appear as it’s not relevant. Choose Fixed Size and they will appear.

8

Scoreboard Health widget

Here is what it looks like in 6.2. Notice the font for the object name is not so clear. It does not work well if you need to show it on the NOC (big screen projector). The other problem is long name is truncated. Some objects, such as Disk Device and NSX port group, are very long.

9 (2)

Notice the border? Yup, I’m not a big fan either J Personally, I prefer not to see the border. I use this widget to see a lot of objects, so the border does get in the way.

Here is what it looks in 6.3. I definitely find this more usable. Thank you UX team!

10

Forensic widget

I use forensic widget to quickly know where an object spends 95% of its time. The chart below shows that the ESXi has barely any CPU stress. 95% of the time, the value is not even 0.002%. Once you get used to this widget, it’s a great complement to other visualisation.

11

As you can see above, in 6.2 the UI is looking a little dated.

This is what it looks like in 6.3. Notice the grid lines make it easier to read. There is also peak and low, so it’s easier to see the minimum and maximum.

12

GUI Editor for XML interaction

No more manually modifying XML file and figuring out what the metric names are! There is now a wizard that guides you along the way.

the-wizard-guides-you

Once you select the Adapter Kind, the wizard automatically moves into the Resource Kind. No more typing!

the-wizard-guides-you-2

Maintenance Schedule

The maintenance schedule has more flexibility. A few limitations in 6.2 that were addressed in 6.3:

  • You cannot specify the start date. You can only specify the start time.
  • You cannot specify the expiry date on this schedule. Often you want to schedule only for a fixed period, such as a few months or weeks.
  • You cannot specify the number of runs. Sometimes you want to specify that you only need to run this a few times.

As a comparison, here is what the maintenance schedule editor looks like in 6.2:

13

6.3 addresses the above limitation as you can see in the following screenshot.

14

Note: The new Maintenance Scheduler is not backward compatible. All previously created maintenance schedules will no longer be available and should be created again.

New VM properties

VM folder and VM Datastore are now available via the View widget. If a VM has >1 datastore, it will show all of them, separated by commas. If you have a nested folder, it will show all of them too.

15

That’s all folks. Hope it helps and keep in touch at LinkedIn.