
vRealize Operations 6.3 Self Monitoring

vRealize Operations 6.3 sports enhanced self-monitoring. Michael Ryom covers it on this blog as part of his what’s new in 6.3, so do read his post first. I will start where he left off.

The screenshots in this blog are taken from the 6.4 release (see what’s new by Roshan). I do not have 6.3, but I believe the self-monitoring feature is the same in 6.3.

vCenter Adapter Details

The vCenter Adapter collects data from vCenter. The dashboard helps answer collection questions from each vCenter, such as:

  • Is there anything wrong in collection? A big drop in the number of objects and metrics can give a clue, especially if you are not removing objects in the associated vCenter.
  • Is collection taking longer than usual?
  • Is collection failing to collect the new objects?


The lab has ~300 VMs and 30 ESXi. I added the number of objects and metrics. On average, I get around 160 metrics per object.

As you can see from the above, I have customised it. It is safe to customise, and I do encourage you to do so. It is best to follow these 2 rules:

  1. Do not use the Admin account. You won’t be able to track what you have changed if you do.
  2. Do not modify the existing objects. Clone them, and prefix the clones with your company name (e.g. MSFT).

If you want to know where the counters come from, go into edit mode. Notice you cannot edit if you are not using the built-in Admin account. That’s a protection, so you do not accidentally modify OOTB objects.


From the above, you can tell the metrics are coming from the vCenter object itself. vSphere World is chosen, as its children are vCenter objects.

Cluster Statistics

The dashboard provides aggregate information at the cluster level, so you can see a summary before going into each node. There are interesting counters such as Objects, Metrics, Alarms and Alerts.

You can click on each scoreboard and the detail line chart will be automatically shown. For example, I clicked on Metric and can see that collection went up on 21 November. If this is not due to new VMs or vSphere infra, then it’s something I’d need to investigate.


You can also get usual information such as CPU, RAM, Disk and Network. I’ve selected the CPU Usage in the example below.


If your vR Ops is slow, you can use the Average IO Transaction Time to tell you if vR Ops is experiencing high disk latency. If the number is much higher here than what you see at the VM level, check if the IO is stuck in the Guest OS.


We can also see the IOPS, which show a daily pattern. There is a daily spike in Writes, peaking at 4K IOPS sustained over 5 minutes. Because each data point is a 300-second average, the actual peak within that window is at least that high and likely higher. There is also a daily spike in Reads, but at a different time.
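
To make the point about averaging concrete, here is a minimal sketch in Python. The per-second numbers are made up for illustration (they are not from the lab); they are chosen so that the 300-second average comes out at 4K IOPS while the real peak is three times higher.

```python
# Hypothetical per-second write IOPS inside one 300-second collection interval:
# a 60-second burst at 12,000 IOPS, then 2,000 IOPS for the remaining 240 seconds.
burst_seconds = 60
interval_seconds = 300
per_second_iops = [12_000] * burst_seconds + [2_000] * (interval_seconds - burst_seconds)

# What a 300-second average reports for the whole interval.
average_iops = sum(per_second_iops) / interval_seconds

# The true peak hidden inside the same interval.
peak_iops = max(per_second_iops)

print(f"300-second average: {average_iops:.0f} IOPS")
print(f"Peak within the interval: {peak_iops} IOPS")
```

The average alone cannot tell you how short or how tall the burst was, only that the peak is no lower than the reported value.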


Performance Details

The detail dashboard covers the individual node. The lab only has 1 node, which is what I’d recommend you deploy. From what I see, a single node with 4 vCPU running on the latest Intel Xeon should be able to handle up to 4000 VMs, assuming you only use the vSphere adapters.


Can you spot the customisation I made to the dashboard?

Yes, I’ve added an extra column. This is how it’s done.


Notice you do not need more than 4 vCPU here.


Can you guess the one time peak around 16 November? Yes, that’s when we upgraded it.


You might want to customize the dashboard further, or build your own. You may also want to set up new alerts. To do that, you need to know 2 things:

  1. The Relationships among objects, such as the hierarchy.
  2. The metrics and properties for each object.

One way to study this is to click on a particular object and look at the All Metrics page. Below is an example for the Collector services. You can see the metrics and properties you can get from this object.


You can see the full list of metrics & properties here.

Before you create a new alert or a new symptom, it’s wise to check whether an existing alert already covers it. For example, here are the symptoms for the collector. Notice there are different object types. You need to know that.


Hope you find it useful. I will expand this post next week once I finish my travel.

What’s new with vRealize Operations 6.3

This blog complements the official blog that you can find here, and other articles such as this and this by Michael Ryom. Bill Roth and team have written a lot of articles there. You should also read the Release Notes. Take note when you have EPO agent.

Super Metric

I use around 75 super metrics in a typical engagement. By and large, that’s good enough for my customers’ requirements. On the other hand, I have seen folks like Ronald Buder and Brandon Gordon, who build very advanced formulas and would like to have more capabilities. Here are 3 enhancements that go a long way in making super metrics more useful:

  • Ability to specify a condition
    • Prior to 6.3, a super metric applies to every member of the group or the parent object. If you are counting the number of VMs in a cluster, it will give you all VMs.
    • With 6.3, you can add a condition. You can count only VMs that are powered on, or VMs with >8 vCPU. As another example, you can count how many VMs in a Datastore have latency above a certain number.
  • Ability to have IF THEN ELSE
    • Prior to 6.3, a super metric works with only 1 formula. You cannot apply Formula 1 if Condition A is met and Formula 2 if Condition A is not met. A use case here is checking VM Uptime. If VMware Tools is running, you use the Tools heartbeat to decide that the VM is up. If VMware Tools is not running, you use VM utilization to decide.
    • The IF THEN ELSE can be combined with AND, OR and NOT. This enables you to build more comprehensive logic.
    • You can chain it to create IF THEN ELSEIF.
  • Ability to combine expressions
    • You can have AND, OR, NOT. Enough said 🙂
    • Ability to compare. You can have less than, less than or equal to, greater than, greater than or equal to.

The where clause cannot point to another object, but it can point to a different metric in the same object. For example, you cannot count the number of VMs in a cluster with CPU Contention metric > SLA of that cluster. The phrase “SLA of that cluster” belongs to the cluster object, not the VM object.

The right operand must also be a number. It cannot be another super metric or variable.

The where clause cannot be combined with AND, OR, NOT. This means you cannot have “where VM CPU > 4 and VM RAM > 16”. The reason is that the ‘where’ clause is evaluated on the vR Ops node where the data is retrieved, while all the other operators (AND, OR, NOT) run on the node where the super metric expression is executed. Those operators are executed after all the data has been retrieved, and the retrieved data does not contain metric values for each member object, only aggregated values of these objects.

As expected, you will find the new operators in the super metric editor, as shown below.


The following screenshot, courtesy of Brandon Gordon, shows a brief description of the operators:


Example on how to use the where clause

sum(${adaptertype=VMWARE, objecttype=VirtualMachine, attribute=mem|guest_provisioned, depth=5, where = "sys|poweredOn==1" })
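
For readers who think better in a general-purpose language, here is a rough Python analogy of what the where clause does to an aggregate. It is only a sketch of the idea, not how vR Ops evaluates super metrics; the VM records, key names and values are hypothetical.

```python
# Hypothetical VMs under a parent object; names and values are made up,
# and the units of guest_provisioned are arbitrary here.
vms = [
    {"name": "vm-a", "powered_on": 1, "guest_provisioned": 4096},
    {"name": "vm-b", "powered_on": 0, "guest_provisioned": 8192},
    {"name": "vm-c", "powered_on": 1, "guest_provisioned": 2048},
]


def sum_where(objects, attribute, condition):
    """Sum an attribute over only the objects that satisfy the condition."""
    return sum(obj[attribute] for obj in objects if condition(obj))


# No condition (the pre-6.3 behaviour): every VM is included.
total_all = sum_where(vms, "guest_provisioned", lambda vm: True)

# With a condition, analogous to where = "sys|poweredOn==1".
total_powered_on = sum_where(
    vms, "guest_provisioned", lambda vm: vm["powered_on"] == 1
)

print(total_all, total_powered_on)
```

Note that the condition compares a metric of the same object against a plain number, which mirrors the restrictions on the where clause described above.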

Example on how to use the IF THEN ELSE

${this, metric=diskspace|used}>1024 ? max(${this, attribute=virtualDisk|commandsAveraged_average} as IOPS) / ${this, metric=diskspace|used} * 1024 : max(IOPS)
Max([${this, metric=mem|host_contentionPct}>${this, metric=Super Metric|sm_4a3bd0c0-c897-4baf-a60e-4bea139e537b} ? 1 : 0, ${this, metric=cpu|capacity_contentionPct}>${this, metric=Super Metric|sm_20ff3c62-0185-47a8-9bdc-a96f3081a2a8} ? 1 : 0])

The [x,y,z] array has actually been available since an earlier release. What is new is that x, y and z can be independent expressions, and all their results are put into the array. They are no longer limited to just constants or metrics.
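
The two IF THEN ELSE examples above can be read as ordinary conditional expressions. The following Python sketch mirrors their shape only; the function and parameter names are hypothetical, and I am not asserting anything about the units vR Ops uses for these metrics.

```python
def scaled_peak_iops(disk_used, iops_samples):
    """Mirror of: used > 1024 ? max(IOPS) / used * 1024 : max(IOPS)."""
    peak_iops = max(iops_samples)
    # Condition ? value_if_true : value_if_false
    return peak_iops / disk_used * 1024 if disk_used > 1024 else peak_iops


def any_threshold_breached(mem_contention, mem_threshold,
                           cpu_contention, cpu_threshold):
    """Mirror of: Max([mem > threshold ? 1 : 0, cpu > threshold ? 1 : 0]).

    Each array element is an independent expression that yields 1 or 0,
    so taking the maximum behaves like a logical OR.
    """
    return max([
        1 if mem_contention > mem_threshold else 0,
        1 if cpu_contention > cpu_threshold else 0,
    ])


print(scaled_peak_iops(2048, [100, 400, 250]))  # used > 1024, so the peak is scaled
print(scaled_peak_iops(512, [100, 400, 250]))   # used <= 1024, so the raw peak is returned
print(any_threshold_breached(5, 3, 1, 2))       # memory breaches its threshold
```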

Resource Alias

The name of a resource is rather long. If you have a lot of resources in the formula, the whole formula can be hard to read. You can now give the resource a name. Here is an example:

Before 6.3:

${adapterkind=VMWARE, resourcekind=HostSystem, attribute=cpu|demand|active_longterm_load, 
depth=5, where=">=0"}) + 1)/
(max(${adapterkind=VMWARE, resourcekind=HostSystem, attribute=cpu|demand|active_longterm_load, 
depth=5, where=">=0"}) + 1)"

We will name the resource CPUload, by adding “as CPUload” in the formula. Once added, we can refer to it by that name, resulting in a shorter formula.

${adapterkind=VMWARE, resourcekind=HostSystem, attribute=cpu|demand|active_longterm_load,
depth=5, where=">=0"} as CPUload) + 1)/
(max(CPUload) + 1)"

Notice that CPUload includes the depth clause and where clause, not just the metric.

Guest OS metrics

Having visibility inside the Guest is critical. I discuss the limitation in sizing VM RAM in this blog. In a nutshell, the hypervisor does not have visibility into how the Guest OS manages its RAM. Some applications, such as JVM and databases, manage their own RAM. The guest OS does not have visibility to how the app manages its RAM. This is why RAM sizing is best done at Guest OS and App levels.

vR Ops 6.3 brings Guest OS metrics. Yes, it is agentless! There is no need to deploy agents on every VM. How does it work then, if there is no network connection to the VM? VMware Tools comes to the rescue! vR Ops talks to vCenter, which in turn talks to the ESXi via the management network. The new version of VMware Tools pulls these additional counters. ESXi retrieves them and passes them to vCenter.

This feature was actually available since vSphere 6.0 Update 1. You need a minimum of ESXi 6.0U1, vCenter 6.0U1 and the VM must be running Tools from ESXi 6.0U1.

Always check Tools Release Notes for enhancement & bug fixes!

The table shows a variety of VMs with the Guest OS data. I’ve added the Active RAM from hypervisor as a comparison.


Here is the list of metrics. I’m using the internal name as the table above already has the friendly name.

Internal name: Description

guest|mem.free_latest: This is one of the 3 major counters for capacity & performance monitoring. The other 2 counters are Page-in Rate and Commit Ratio.
In Windows, this is the Free Memory counter. This excludes the cached memory. If this number drops to a low number, Windows is running out of Free RAM. While that number varies per application and use case, I’d generally keep this number > 500 MB for a server VM and > 100 MB for a VDI VM. I set a lower number for VDI because they add up. If you have 10K users, that’s 1 TB of RAM.
In Linux, read this good article: http://www.chrisjohnston.org/ubuntu/why-on-linux-am-i-seeing-so-much-ram-usage

guest|mem.needed_latest: The amount of memory needed by the Guest OS. Below this amount, the Guest OS may swap.
The formula for Linux is physicalMem - Maximum of (0, (memAvailable - 5% of physicalMem)).
The formula for Windows is memTotal - (coldStandby + free + reservation).
In Linux, memAvailable is an estimate of how much memory is available for starting new applications, without swapping. It is calculated from MemFree, SReclaimable, the size of the file LRU lists, and the low watermarks in each zone. The estimate takes into account that the system needs some page cache to function well, and that not all reclaimable slab will be reclaimable, due to items being in use. Reference: https://superuser.com/questions/980820/what-is-the-difference-between-memfree-and-memavailable-in-proc-meminfo
In Windows, the Standby memory (which can be significant) can be split into 3: FreeAndZero, Cold and Hot. MemNeeded will count the hot part of the buffer cache as being required by the OS.
In Linux, review this: https://www.linuxatemyram.com/
In Tools, the counter is called guest.mem.needed

Example: Say you have 10 GB of RAM. So the Physical RAM = 10 GB.

Situation 1: high memory utilization.
MemAvailable = 2 GB.
Tools will calculate MemNeeded as
= 10 GB - Maximum (0, 2 GB - 5% of 10 GB)
= 10 GB - Maximum (0, 1.5 GB)
= 10 GB - 1.5 GB
= 8.5 GB
You actually still have 2 GB available here, but Tools adds around 5% of physical memory (0.5 GB).

Situation 2: low memory utilization.
MemAvailable = 8 GB.
Tools will calculate MemNeeded as
= 10 GB - Maximum (0, 8 GB - 5% of 10 GB)
= 10 GB - Maximum (0, 7.5 GB)
= 10 GB - 7.5 GB
= 2.5 GB
Again, Tools adds around 5%.

guest|page.inRate_latest: The rate at which the Guest OS brings memory back from disk to DIMM, per second. In other words, the rate of reads going through the paging/cache system. It includes not just swapfile I/O, but cacheable reads as well (double pages/s). A page that was paged out earlier has to be brought back before it can be used. This creates a performance issue because the application waits longer, as disk is much slower than RAM.
The unit is number of pages, not MB. It's not possible to convert due to the mixed use of Large Pages (2 MB) and regular Pages (4 KB).
A process can have concurrent mixed usage of Large and non-Large pages in Windows. The page size isn’t a system-wide setting that all processes use. The same is likely true for Linux Huge Pages.
Linux:
$ cat /proc/vmstat | grep pgpgin
pgpgin 604222959257
Windows: Win32_PerfFormattedData_PerfOS_Memory::PagesInputPersec

guest|page.outRate_latest: The opposite of the above. This is not as important as the above. Just because a block of memory is moved to disk does not mean the application experiences a memory problem. In many cases, the page that was moved out is an idle page. Windows does not page out any Large Pages.

guest|page.size_latest: Size of the page. In Windows, this is 4 KB by default.
This is not the size of the pagefile.sys in c:\.

guest|mem.physUsable_latest: Physically Usable Memory.
Based on a sample of 9 VMs (Windows and Linux), this looks like VM Configured RAM - Hardware Used. Since Hardware Used is near 0, this value is near the Configured RAM.

guest|swap.spaceRemaining_latest: The amount of swap space remaining, taking into account the possibility of swapfile growth where possible. A low remaining value will trigger paging. If the system is configured to run without a swapfile, this will return zero.

guest|hugePage.size_latest: Current size of Huge Page.
This should be 2 MB in Windows.

guest|hugePage.total_latest: Total number of Huge Pages.
This is Linux specific.

guest|mem.activeFileCache_latest: Active File Cache Memory. This is the actively in-use subset of the file cache. Unused file cache and non-file backed anonymous buffers (mallocs etc) are not included.
This seems to be the Cache Bytes in Windows.

guest|contextSwapRate_latest: CPU context switch rate per second in Windows/Linux.
For details, see https://msdn.microsoft.com/en-us/library/aa394279(v=vs.85).aspx and

The last metric is a CPU metric. So now you know if a process performance problem is due to heavy context switching!
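
The two MemNeeded formulas given in the table for guest|mem.needed_latest can be written as a short Python sketch. The function and parameter names are mine, chosen for illustration; the formulas themselves and the two Linux scenarios are taken from the table. The Windows input values are made up.

```python
def mem_needed_linux(physical_mem_gb, mem_available_gb):
    """physicalMem - max(0, memAvailable - 5% of physicalMem)."""
    buffer_gb = 0.05 * physical_mem_gb
    return physical_mem_gb - max(0.0, mem_available_gb - buffer_gb)


def mem_needed_windows(mem_total_gb, cold_standby_gb, free_gb, reservation_gb):
    """memTotal - (coldStandby + free + reservation)."""
    return mem_total_gb - (cold_standby_gb + free_gb + reservation_gb)


# Situation 1 from the table: 10 GB physical, 2 GB available.
print(mem_needed_linux(10, 2))

# Situation 2 from the table: 10 GB physical, 8 GB available.
print(mem_needed_linux(10, 8))

# When memAvailable is below the 5% buffer, max(0, ...) caps the result
# at the physical memory size.
print(mem_needed_linux(10, 0.2))

# Hypothetical Windows inputs, only to show the arithmetic.
print(mem_needed_windows(16, 2, 3, 1))
```

The max(0, ...) term is what keeps MemNeeded from ever exceeding the physical memory when very little memory is available.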

Let’s compare them with the RAM counters from Windows. The list below is from Windows 10 Performance Monitor.


I’m not sure if they are enabled by default. If not, it’s a matter of enabling from the Policy, as shown below:


This is what it looks like in the VM object. Finally! 🙂


Reduction in Metrics

This is one of my favourites, as I do have customers who struggle with the long list of metrics. This should also improve vR Ops scalability. The example below is from an ESXi Host. Quite a number of the capacity metrics are now hidden, as they are not needed by default.


The reduction can be seen in the Self Monitoring, which has improved a lot in 6.3 also. You can see the number of metrics dropped on the following chart.

Reduced Metrics

The reduction translates into less resource utilisation (CPU, Disk, RAM). I’ve added CPU as an example. Notice the load is also less spiky.

CPU reduced in percentage

Drill down via Line chart

One popular use case is the ability to automatically plot all the children’s values when you select a parent. There are many examples of this, such as:

  • You select a cluster, and you want to automatically have a line chart of all its ESXi CPU Demand. If you have 8 hosts in that cluster, then you get 8 line charts.
  • You select a data center, and you want to automatically have a line chart of the number of VMs in each of its clusters, to get a sense of VM growth among clusters.

See the following screenshot. Can you notice how it’s done?


Hint: it’s done differently than in other widgets.

The way you do this is by knowing the relationships among objects. You choose the object whose metrics you want to display, not the parent. In the following example, I need to show the ESXi CPU contention on all ESXi hosts in a cluster. So I pick the ESXi object, not the cluster object.

You do not have to specify the relationship (parent, child, self, etc.). vRealize Operations actually automatically figures out the relationship. Unlike other widgets, where you must specify, the View Widget has that intelligence built-in. Nice!

Can you spot a performance issue that happened in the past in the selected cluster below?


The above screenshot shows that one of the ESXi hosts experienced a spike in CPU Contention. It touched 9%, which is a high number, as the number at the ESXi level is the average of all its VMs. One of the VMs was likely experiencing a much higher number, as most VMs have low CPU Contention. Most have a low value because your ESXi has enough cores to serve quite a number of VMs.
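
A small Python sketch shows why a 9% host-level average can hide a badly affected VM. The numbers are invented, and I am assuming a simple unweighted average across VMs purely for illustration.

```python
from statistics import mean

# Hypothetical CPU Contention (%) for 10 VMs on one ESXi host:
# nine quiet VMs at 2% and a single VM at 72%.
vm_contention_pct = [2.0] * 9 + [72.0]

# Host-level value, assuming it is a plain average of its VMs.
host_level_pct = mean(vm_contention_pct)

# The worst-hit VM, which the average conceals.
worst_vm_pct = max(vm_contention_pct)

print(f"Host-level average: {host_level_pct}%")
print(f"Worst VM: {worst_vm_pct}%")
```

This is why a spike at the ESXi level is a cue to drill down into the individual VMs rather than a measure of how bad things were for any one of them.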

Properties now accompany metrics

One widget customers use heavily is the Object List widget. It can list any object along with its metrics. In 6.3, you can now list its properties too. This makes it a lot more useful.


Heat Map: Zoom and Grouping

I use heat maps a lot, especially in Configuration and Capacity use cases. They are also useful in a NOC (big screen or projector). They are not so useful in performance, as they can only show the latest value. Since vR Ops collects data every 5 minutes, anything older than the latest 5-minute sample cannot be shown.

The other limitation of Heat Map, which is addressed in 6.3, is scalability. When you have lots of objects, it can be difficult to see. 6.3 groups the objects, and allows you to drill down.


I then drilled down into the selected group. It reveals a lot more objects.


Sportier looking

I’m a big fan of UI and UX. While underlying architecture matters, the human experience is what we see every time we deal with the system. There are 3 UI enhancements that I spotted as I compared 6.2.1 widgets with 6.3 widgets.

Scoreboard widget

The Scoreboard widget now provides more visual themes than just 2 themes. This is useful when you have multiple Scoreboard widgets in 1 screen. You can use 1 theme for VM and another theme for Infrastructure objects. They help in differentiating objects easily.


There is a small usability enhancement. When you choose Fixed View, the size controls do not appear, as they are not relevant. Choose Fixed Size and they will appear.


Scoreboard Health widget

Here is what it looks like in 6.2. Notice the font for the object name is not so clear. It does not work well if you need to show it in a NOC (big screen or projector). The other problem is that long names are truncated. Some objects, such as Disk Device and NSX port group, have very long names.


Notice the border? Yup, I’m not a big fan either 🙂 Personally, I prefer not to see the border. I use this widget to see a lot of objects, so the border does get in the way.

Here is what it looks like in 6.3. I definitely find this more usable. Thank you UX team!


Forensic widget

I use the Forensic widget to quickly see where an object spends 95% of its time. The chart below shows that the ESXi has barely any CPU stress. 95% of the time, the value is not even 0.002%. Once you get used to this widget, it’s a great complement to other visualisations.


As you can see above, in 6.2 the UI is looking a little dated.

This is what it looks like in 6.3. Notice the grid lines make it easier to read. There is also peak and low, so it’s easier to see the minimum and maximum.
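
The "95% of the time" reading of the Forensic widget is essentially a percentile. Here is a minimal Python sketch of that idea using the nearest-rank method. The samples are made up, and I am not claiming this is the exact method the widget uses.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample that is >= pct percent of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical CPU stress samples (%): 95 near-idle readings and 5 short spikes.
cpu_stress_pct = [0.001] * 95 + [3.0] * 5

p95 = percentile_nearest_rank(cpu_stress_pct, 95)
peak = max(cpu_stress_pct)

print(f"95% of the time the value is at or below {p95}%")
print(f"Peak: {peak}%")
```

Seeing the 95th percentile next to the peak tells you whether a high maximum is the norm or a rare excursion.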


GUI Editor for XML interaction

No more manually modifying XML files and figuring out what the metric names are! There is now a wizard that guides you along the way.


Once you select the Adapter Kind, the wizard automatically moves into the Resource Kind. No more typing!


Maintenance Schedule

The maintenance schedule has more flexibility. Here are a few limitations in 6.2 that are addressed in 6.3:

  • You cannot specify the start date. You can only specify the start time.
  • You cannot specify the expiry date on this schedule. Often you want to schedule only for a fixed period, such as a few months or weeks.
  • You cannot specify the number of runs. Sometimes you want to specify that you only need to run this a few times.

As a comparison, here is what the maintenance schedule editor looks like in 6.2:


6.3 addresses the above limitations, as you can see in the following screenshot.


Note: The new Maintenance Scheduler is not backward compatible. All previously created maintenance schedules will no longer be available and should be created again.

New VM properties

VM folder and VM Datastore are now available via the View widget. If a VM has >1 datastore, it will show all of them, separated by commas. If you have a nested folder, it will show all of them too.


That’s all folks. Hope it helps and keep in touch at LinkedIn.