What’s new with vRealize Operations 6.3

This blog complements the official blog that you can find here, and other articles such as this and this by Michael Ryom. Bill Roth and team have written a lot of articles there. You should also read the Release Notes. Take note of them if you have the EPO agent.

Super Metric

I use around 50 super metrics in a typical engagement. By and large, that is good enough for my customers’ requirements. On the other hand, I have seen folks like Ronald Buder and Brandon Gordon, who build very advanced formulas, and would like to have more capabilities. Here are 3 enhancements that go a long way in making super metrics more useful:

  • Ability to specify a condition
    • Prior to 6.3, a super metric applies to every member of the group or the parent object. If you are counting the number of VMs in a cluster, it will give you all VMs.
    • With 6.3, you can add a condition. You can count only VMs that are powered on, or VMs with >8 vCPU. As another example, you can count how many VMs in a datastore have latency above a certain number.
  • Ability to have IF THEN ELSE
    • Prior to 6.3, a super metric works with 1 formula. You cannot apply Formula 1 for Condition A and Formula 2 if Condition A is not met. A use case here is checking VM Uptime. If you have VMware Tools running, you use the Tools heartbeat to decide that the VM is up. If VMware Tools is not running, you use VM utilization to decide.
    • The IF THEN ELSE can be combined with AND, OR and NOT. This enables you to build a more comprehensive logic.
    • You can chain it to create IF THEN ELSEIF.
  • Ability to combine expressions
    • You can have AND, OR, NOT. Enough said 🙂
    • Ability to compare. You can have less than, less than or equal to, greater than, greater than or equal to.

The where clause cannot point to another object, but can point to a different metric in the same object. For example, you cannot count the number of VMs in a cluster with CPU Contention metric > SLA of that cluster. The phrase “SLA of that cluster” belongs to the cluster object, not the VM object.

The right operand of the where clause must also be a number. It cannot be another super metric or variable.

The where clause cannot be combined with AND, OR, NOT. This means you cannot have “where VM CPU > 4 and VM RAM > 16”. The reason is that the where clause calculation runs on the vR Ops node where the data is retrieved, while all the other operators (AND, OR, NOT) run on the node where the super metric expression is executed. The other operators are executed when all data has already been retrieved. The retrieved data does not contain metric values for each member object, only the aggregated values of these objects.

As expected, you will find the new operators in the super metric editor, as shown below.

1

The following screenshot, courtesy of Brandon Gordon, shows a brief description of the operators:

2

Example on how to use the where clause

sum(${adaptertype=VMWARE, objecttype=VirtualMachine, attribute=mem|guest_provisioned, depth=5, where = "sys|poweredOn==1" })
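To make the semantics of the where clause concrete, here is a minimal Python sketch of what the formula above computes: a sum over descendant VMs, keeping only those that meet the condition. The VM records and the helper name sum_where are made up for illustration; this is not how vR Ops stores or evaluates anything.

```python
# Hypothetical VM records; in vR Ops these would be the VirtualMachine
# descendants of the object the super metric is attached to.
vms = [
    {"name": "vm-a", "sys|poweredOn": 1, "mem|guest_provisioned": 4096},
    {"name": "vm-b", "sys|poweredOn": 0, "mem|guest_provisioned": 8192},
    {"name": "vm-c", "sys|poweredOn": 1, "mem|guest_provisioned": 2048},
]


def sum_where(objects, attribute, where_metric, where_value):
    """Sum `attribute` over objects whose `where_metric` equals `where_value`."""
    return sum(o[attribute] for o in objects if o[where_metric] == where_value)


# Without a condition, every VM is counted.
total_all = sum(vm["mem|guest_provisioned"] for vm in vms)

# With the where clause, only powered-on VMs are counted.
total_powered_on = sum_where(vms, "mem|guest_provisioned", "sys|poweredOn", 1)

print(total_all, total_powered_on)
```

Note that the filter compares a metric of the same VM against a plain number, which mirrors the restrictions on the where clause described earlier.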

Example on how to use the IF THEN ELSE

${this, metric=diskspace|used}>1024 ? max(${this, attribute=virtualDisk|commandsAveraged_average} as IOPS) / ${this, metric=diskspace|used} * 1024 : max(IOPS)
Max([${this, metric=mem|host_contentionPct}>${this, metric=Super Metric|sm_4a3bd0c0-c897-4baf-a60e-4bea139e537b} ? 1 : 0, ${this, metric=cpu|capacity_contentionPct}>${this, metric=Super Metric|sm_20ff3c62-0185-47a8-9bdc-a96f3081a2a8} ? 1 : 0])

The [x,y,z] array has actually been available since an earlier release. What is new is that x, y and z can be independent expressions, and all their results are put into the array. They are no longer limited to just a constant or a metric.
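The second formula above can be read as "return 1 if either contention metric exceeds its limit, otherwise 0". Here is a rough Python equivalent of that logic. I am assuming the two referenced super metrics act as thresholds (for example an SLA); the function and parameter names are mine, not part of vR Ops.

```python
def breach_flag(value, threshold):
    """Python mirror of `value > threshold ? 1 : 0` in the super metric syntax."""
    return 1 if value > threshold else 0


def any_breach(mem_contention, mem_limit, cpu_contention, cpu_limit):
    """Python mirror of Max([expr1, expr2]): 1 if at least one expression is 1."""
    return max([
        breach_flag(mem_contention, mem_limit),
        breach_flag(cpu_contention, cpu_limit),
    ])


# Memory is within its limit but CPU is not, so the result is 1.
print(any_breach(mem_contention=1.0, mem_limit=2.0,
                 cpu_contention=5.0, cpu_limit=3.0))
```

Each array element is evaluated on its own, which is exactly what the 6.3 change to the [x,y,z] array allows.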

Resource Alias

The name of a resource is rather long. If you have a lot of resources in the formula, the whole formula can be hard to read. You can now give the resource a name. Here is an example:

Before 6.3:

(min(
${adapterkind=VMWARE, resourcekind=HostSystem, attribute=cpu|demand|active_longterm_load,
depth=5, where=">=0"}) + 1)/
(max(${adapterkind=VMWARE, resourcekind=HostSystem, attribute=cpu|demand|active_longterm_load,
depth=5, where=">=0"}) + 1)

We will name the resource CPUload by adding "as CPUload" to the formula. Once added, we can refer to it by that name, resulting in a shorter formula.

(min(
${adapterkind=VMWARE, resourcekind=HostSystem, attribute=cpu|demand|active_longterm_load,
depth=5, where=">=0"} as CPUload) + 1)/
(max(CPUload) + 1)

Notice that CPUload includes the depth clause and where clause, not just the metric.
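If it helps, think of the alias as a variable assignment. The Python sketch below mirrors the aliased formula: the filtered set of values is built once, named, and then reused in both min and max. The host values are invented for illustration.

```python
# Hypothetical cpu|demand|active_longterm_load values for the hosts under a
# cluster. The negative value stands in for a sample the where clause rejects.
host_loads = [0.8, 2.4, -1.0, 1.6]

# Equivalent of `${... where=">=0"} as CPUload`: evaluate once, reuse by name.
cpu_load = [value for value in host_loads if value >= 0]

# Equivalent of (min(CPUload) + 1) / (max(CPUload) + 1).
ratio = (min(cpu_load) + 1) / (max(cpu_load) + 1)

print(round(ratio, 3))
```

As in the formula, the name covers the filter as well as the metric, so min and max both see the same filtered values.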

VM Memory metrics

I discuss the limitation in sizing VM RAM in this blog. In a nutshell, the hypervisor does not have visibility into how the Guest OS manages its RAM. Some applications, such as JVM and databases, manage their own RAM. The guest OS does not have visibility to how the app manages its RAM. This is why RAM sizing is best done at Guest OS and App levels.

vR Ops 6.3 brings Guest OS RAM metrics. Yes, it is agentless. There is no need to deploy agents on every VM. How does it work then, if there is no network connection to the VM? VMware Tools comes to the rescue! vR Ops talks to vCenter, which in turn talks to ESXi via the management network. The new version of VMware Tools pulls these additional counters. ESXi retrieves them and passes them to vCenter.

This feature has actually been available since vSphere 6.0 Update 1. Yes, that means you need a minimum of ESXi 6.0U1 and vCenter 6.0U1, and the VM must be running Tools from ESXi 6.0U1. You do not need to upgrade to vSphere 6.0 Update 2.

The table shows a variety of VMs with the Guest OS data. I’ve added the Active RAM from hypervisor as a comparison.

ram-metric

Here is the list of metrics. I’m using the internal name as the table above already has the friendly name.

Internal name and description:

  • guest|mem.free_latest
    • This is one of the 3 major counters for capacity & performance monitoring. The other 2 counters are Page-in Rate and Commit Ratio.
    • In Windows, this is the Free Memory counter. This excludes the cached memory. If this number drops to a low number, Windows is running out of Free RAM. While that number varies per application and use case, I’d generally keep this number > 500 MB for server VM and > 100 MB for VDI VM. I set a lower number for VDI because they add up. If you have 10K users, that’s 1 TB of RAM.
  • guest|mem.needed_latest
    • The amount of memory needed by the Guest OS. Below this amount, the Guest OS may swap.
    • This is Total RAM - Free RAM. It includes the Cached RAM, which is Standby + Modified.
    • The Standby memory (which can be significant on Windows, less so on Linux) can be split into 3: FreeAndZero, Cold and Hot. MemNeeded will count the hot part of the buffer cache as being required by the OS.
  • guest|page.inRate_latest
    • The rate of reads going through the underlying paging/cache system. It includes not just swapfile I/O, but cacheable reads as well (double, pages/s). This is the rate at which the Guest OS brings memory back from disk to DIMM. A page that was paged out earlier has to be brought back before it can be used. This creates a performance issue, as the application waits longer because disk is much slower than RAM.
    • Windows does not page out any Large Pages. A process can have concurrent mixed usage of Large and non-Large pages in Windows. The page size isn’t a system-wide setting that all processes use. The same is likely true for Linux Huge Pages.
  • guest|page.outRate_latest
    • The opposite of the above. This is not as important as the above. Just because a block of memory is moved to disk does not mean the application experiences a memory problem. In many cases, the page that was moved out is an idle page.
  • guest|page.size_latest
    • Size of the page. In Windows, this is 4 KB by default.
    • This is not the size of the pagefile.sys in c:\.
  • guest|mem.physUsable_latest
    • Physically Usable Memory.
    • Based on a sample of 9 VMs (Windows and Linux), this looks like VM Configured RAM - Hardware Used. Since Hardware Used is near 0, this value is near the Configured RAM.
  • guest|swap.spaceRemaining_latest
    • The amount of swap space remaining, taking into account the possibility of swapfile growth where possible. If the system is configured to run without a swapfile, this will return zero.
  • guest|hugePage.size_latest
    • Current size of Huge Page.
    • This should be 2 MB in Windows.
  • guest|hugePage.total_latest
    • Total number of Huge Pages.
    • This is Linux specific.
  • guest|contextSwapRate_latest
    • Context Swap Rate per second in Windows/Linux.
  • guest|mem.activeFileCache_latest
    • Active File Cache Memory. This is the actively in-use subset of the file cache. Unused file cache and non-file backed anonymous buffers (mallocs etc) are not included.
    • This seems to be the Cache Bytes in Windows.
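Here is a small Python sketch of the arithmetic behind two of the counters above: memory needed as Total RAM minus Free RAM, and my rule-of-thumb floors for Free RAM. The function names and sample values are made up; the thresholds are the ones I mentioned, not product defaults.

```python
def mem_needed_mb(total_mb, free_mb):
    """Memory needed as described above: Total RAM - Free RAM."""
    return total_mb - free_mb


def free_ram_is_low(free_mb, is_vdi):
    """Rule of thumb from the text: keep > 500 MB (server) or > 100 MB (VDI)."""
    floor_mb = 100 if is_vdi else 500
    return free_mb <= floor_mb


# Why the VDI floor is lower: the reserve adds up across many desktops.
vdi_users = 10_000
reserve_tb = vdi_users * 100 / 1_000_000  # decimal units: 1 TB = 1,000,000 MB

print(mem_needed_mb(8192, 1024), free_ram_is_low(300, is_vdi=False), reserve_tb)
```

The same 300 MB of Free RAM is a warning on a server VM but acceptable on a VDI VM, which is the whole point of using two floors.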

Let’s compare them with the RAM counters from Windows. The list below is from Windows 10 Performance Monitor.

3

I’m not sure if they are enabled by default. If not, it’s a matter of enabling them in the Policy, as shown below:

VM RAM

This is what it looks like in the VM object. Finally! 🙂

VM RAM 2

Reduction in Metrics

This is one of my favourites, as I do have customers who struggle with the long list of metrics. This should also improve vR Ops scalability. The example below is from an ESXi Host. Quite a number of the capacity metrics are now hidden, as they are not needed by default.

9

The reduction can be seen in the Self Monitoring, which has improved a lot in 6.3 also. You can see the number of metrics dropped on the following chart.

Reduced Metrics

The reduction translates into less resource utilisation (CPU, Disk, RAM). I’ve added CPU as an example. Notice the load is also less spiky.

CPU reduced in percentage

Drill down via Line chart

One popular use case is the ability to automatically plot the values of all the children when you select a parent. There are many examples of this, such as:

  • You select a cluster, and you want to automatically have a line chart of all its ESXi CPU Demand. If you have 8 hosts in that cluster, then you get 8 line charts.
  • You select a data center, and you want to automatically have a line chart of the number of VMs in each of its clusters, to get a sense of VM growth among clusters.

See the following screenshot. Can you notice how it’s done?

4

Hint: it’s done differently than in other widgets.

The way you do this is by knowing the relationship among objects. You choose the object whose metrics you want to display, not the parent. In the following example, I need to show the ESXi CPU contention on all ESXi hosts in a cluster. So I pick the ESXi object, not the cluster object.

You do not have to specify the relationship (parent, child, self, etc.). vRealize Operations actually automatically figures out the relationship. Unlike other widgets, where you must specify, the View Widget has that intelligence built-in. Nice!

Can you spot a performance issue that happened in the past in the selected cluster below?

5a

The above screenshot shows that one of the ESXi hosts experienced a spike in CPU Contention. It touched 9%, which is a high number, as the number at ESXi level is the average of all its VMs. One of the VMs likely experienced a much higher number, as most VMs have low CPU Contention. Most have a low value because your ESXi has enough cores to serve quite a number of VMs.
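To see why 9% at host level can hide a much worse VM, here is a quick Python illustration. The per-VM numbers are invented, and I am assuming a simple unweighted average across VMs.

```python
from statistics import mean

# Hypothetical CPU Contention (%) for the 10 VMs on one ESXi host:
# nine VMs are low, one is much higher.
vm_contention = [2.0] * 9 + [72.0]

host_average = mean(vm_contention)  # what the host-level number would show
worst_vm = max(vm_contention)       # what a single VM actually experienced

print(host_average, worst_vm)
```

This is why I follow up a host-level spike by looking at the individual VMs on that host.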

Properties now accompany metrics

One widget customers use heavily is the Object List widget. It can list any object along with its metrics. In 6.3, it can also list the object’s properties. This makes it a lot more useful.

property

Heat Map: Zoom and Grouping

I use heat maps a lot, especially in Configuration and Capacity use cases. They are also useful in a NOC (big screen or projector). They are not so useful for performance, as they can only show the latest value. Since vR Ops collects data every 5 minutes, anything older than the latest 5-minute collection cannot be shown.

The other limitation of Heat Map, which is addressed in 6.3, is scalability. When you have lots of objects, it can be difficult to see. 6.3 groups the objects, and allows you to drill down.

6

I then drilled down into the selected group. It reveals a lot more objects.

6 (2)

Sportier looking

I’m a big fan of UI and UX. While underlying architecture matters, the human experience is what we see every time we deal with the system. There are 3 UI enhancements that I spotted as I compared 6.2.1 widgets with 6.3 widgets.

Scoreboard widget

The Scoreboard widget now provides more visual themes than just 2 themes. This is useful when you have multiple Scoreboard widgets in 1 screen. You can use 1 theme for VM and another theme for Infrastructure objects. They help in differentiating objects easily.

7

There is a small usability enhancement. When you choose Fixed View, the size controls do not appear as they are not relevant. Choose Fixed Size and they will appear.

8

Scoreboard Health widget

Here is what it looks like in 6.2. Notice the font for the object name is not so clear. It does not work well if you need to show it in the NOC (big screen or projector). The other problem is that long names are truncated. Some objects, such as Disk Device and NSX port group, have very long names.

9 (2)

Notice the border? Yup, I’m not a big fan either 🙂 Personally, I prefer not to see the border. I use this widget to see a lot of objects, so the border does get in the way.

Here is what it looks like in 6.3. I definitely find this more usable. Thank you UX team!

10

Forensic widget

I use forensic widget to quickly know where an object spends 95% of its time. The chart below shows that the ESXi has barely any CPU stress. 95% of the time, the value is not even 0.002%. Once you get used to this widget, it’s a great complement to other visualisation.

11

As you can see above, in 6.2 the UI is looking a little dated.

This is what it looks like in 6.3. Notice the grid lines make it easier to read. There is also peak and low, so it’s easier to see the minimum and maximum.

12

GUI Editor for XML interaction

No more manually modifying an XML file and figuring out what the metric names are! There is now a wizard that guides you along the way.

the-wizard-guides-you

Once you select the Adapter Kind, the wizard automatically moves into the Resource Kind. No more typing!

the-wizard-guides-you-2

Maintenance Schedule

The maintenance schedule has more flexibility. A few limitations in 6.2 that were addressed in 6.3:

  • You cannot specify the start date. You can only specify the start time.
  • You cannot specify the expiry date on this schedule. Often you want to schedule only for a fixed period, such as a few months or weeks.
  • You cannot specify the number of runs. Sometimes you want to specify that you only need to run this a few times.

As a comparison, here is what the maintenance schedule editor looks like in 6.2:

13

6.3 addresses the above limitations, as you can see in the following screenshot.

14

Note: The new Maintenance Scheduler is not backward compatible. All previously created maintenance schedules will no longer be available and should be created again.

New VM properties

VM folder and VM Datastore are now available via the View widget. If a VM has >1 datastore, it will show all of them, separated by commas. If you have a nested folder, it will show all of them too.

15

That’s all folks. Hope it helps and keep in touch at LinkedIn.

vRealize Operations 6: Top-N widget

The Top-N widget is used extensively in vRealize Operations 6. The default dashboards use it. For example, the Host Overview dashboard uses it, as shown below:

Top N is average 0

Notice in the above screenshot that there is a small label, 24h. That means 24 hours. You can adjust the time period, albeit manually. The value you are seeing in the above Top-N is the average of an entire 24 hour period. So if there is a peak during the period, it may get flattened. It is also the last 24 hours, not yesterday or today. Checking at 9 am or 6 pm will give you a different result. If you check at 9 am, you’re looking at 9 am yesterday until 9 am today. You are not looking at yesterday (0000 – 2400).
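The rolling window is easy to get wrong, so here is a tiny Python sketch of it. The function name and the date are arbitrary choices of mine; the only point is that the window ends at the moment you look, not at midnight.

```python
from datetime import datetime, timedelta


def top_n_window(checked_at, hours=24):
    """Return (start, end) of the rolling window that ends at `checked_at`."""
    return checked_at - timedelta(hours=hours), checked_at


# Checking at 9 am covers 9 am yesterday until 9 am today.
start, end = top_n_window(datetime(2015, 1, 6, 9, 0))
print(start, "->", end)
```

Checking again at 6 pm shifts both ends of the window by 9 hours, which is why the Top-N can show a different result on the same day.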

Because the Top-N is an average, you may want to know a bit more details. This is where the Sparkline widget comes in. Clicking on any of the Top-N will show the corresponding object in the Sparkline widget.

You can certainly change the value from the last 24 hours to any time period that fits your business needs. Changing the default dashboards does not impact the way vRealize Operations works (e.g. its dynamic threshold calculation). Dashboards are just a way to present information.

You can also create your own dashboard and do your own style. For example, I do not use Sparkline as I like to have greater detail. I use Line Chart. I also use Line Chart first, then Top-N second. So it’s the other way around. This is because I have a preference to see details. I use Top-N when I need to zoom into a specific time line, to reveal the objects giving me the value in the Line Chart. I use this Line Chart + Top-N combo a lot when working with customers. You will find many examples in my book.

If you are curious if the Top-N value is really average of the selected time period, you can easily test it. I created a manual group. It has only 2 members. They are the 2 VMs shown below.

Top N is average

There are 2 Line Chart widgets in the above screenshot. Each of them has a corresponding Top-N widget below it. The first line chart shows a longer time horizon. I chose 7 days. From here I could see that there are some spikes. The Top-N, however, does not reveal that. This is because it is an average. It has flattened the data.

The second line chart shows a much shorter time. I have zoomed into 29 Dec around 1 pm. The line chart shows that BCDR-Prod-SRM-Server had a spike around 50% and then dropped to 4%.

Since I know the time period, I’m going to configure my Top-N to zoom into that specific time. You can see below that I’ve configured it to 12:55 pm – 1:05 pm. So I’m taking only 2 values.

Top N is average - 2

Since the first value is 50.17%, and the second value is 4%, we would expect to see a Top-N showing 27.085%. And you got it right, the Top-N shows 27.085%.
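You can check the arithmetic yourself. The snippet below simply averages the two values from the line chart; it says nothing about how vR Ops computes the number internally.

```python
from statistics import mean

# The two samples that fall inside the 12:55 pm to 1:05 pm window.
samples = [50.17, 4.0]

top_n_value = mean(samples)
print(round(top_n_value, 3))
```

With only 2 samples the average sits exactly halfway between the spike and the drop, which matches what the Top-N widget shows.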

 

Top N is average - 3