Tag Archives: vSphere

Which VMs need more resources?

You can reduce the following resources from a VM:

  • CPU
  • RAM
  • Storage

Network isn’t something you can reduce, but you know that already 🙂

You can check which VMs need more resources by building a dashboard like the one below. It’s a simple dashboard, which you can customize and enhance. It lets you reduce the resources independently.

I’ve marked the above dashboard with numbers, so we can refer to them:

  1. This is a table that lists all VMs. It’s sorted by the highest 1-hour average of CPU Demand and RAM Demand. The table also lists the VM CPU and RAM configuration, so you can see if the VMs are small or large. It also shows the cluster the VMs are located. The table is sorted by the highest CPU Demand. I’m showing both CPU and RAM in a single table. You can clone the view and split them if that suits your operations better.
  2. This is a table that lists all VMs, but focusing on storage only. With storage, we do not have the complexity of checking peak utilisation. We simply need to check the present situation.
  3. This lists the Top-15 VMs with highest CPU Demand and RAM Demand in a given period. The list is now split, as they can be different VMs. Do not that Top-N widget will average the number over the selected period. A VM with cyclical workload may not show up. The Top-N is complemented with a distribution chart. Select a VM from the Top-N, and you can see where the VM utilisation is.
  4. The distribution chart helps you see if the VM is really under resources or not. The 95th percentile is marked with a vertical green line. You expect that line to be at 100%, indicating that the VMs hit 100% utilisation frequently. If the 95th percentile is at a low number, and you do not see the number 100 in the x-axis, that means the VM is not under resourced.
  5. Storage is easier, as we can simply use the last data. As a result, we can show a distribution of all the VMs. We use a heat map as it can show 2 dimensions. Every VM is represented as a box. The bigger the box, the more storage the VM is configured with. The color indicates if the VM use it.
    • 0% = Black. Wastage
    • 10% = Green. Balanced usage
    • 100% = Red. Need more space!

The CPU and RAM have limitations. For example, they may show high utilisation during AV backup. You want to ignore those period. At this moment, the only way is to plot the high usage over a line chart. We use Log Insight for this. The chart below shows VMs that hit high CPU usage in a given period. Every time a VM hits high CPU usage, it will show up here. As you can see, there are only 4 VMs that hit high CPU usage. All other VMs do not need more CPU.

The above is an example from a healthy environment. What about an environment where a lot of VMs are under-sized? You expect to see lots of alarm! That’s what you have below

Hope the above is useful. If not, drop me an email.

NOC Dashboards for SDDC – Part 2

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Dashboard: Performance

Is my IaaS performing?

That’s the key question that you need to answer. You need to show if the clusters are coping well. Show how the clusters are performing in terms of CPU, RAM and Disk.

The above dashboard is per Service Tier. Do you know why?

Yes, the threshold differs for each tier. What is acceptable in Tier 3 may not be acceptable in Tier 1.

The good thing about line chart is it provides visibility beyond present time. You can show the last 6 hours and still get good details. Showing >24 hours will result in visualization that is too static, not suitable for NOC use case.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • If you only have a few clusters, you can show multiple Service Tiers in 1 dashboard. 1 row per tier results in simpler visualisation.
  • In environment with >10 clusters, we can group them into Service Tier. Focus more on the highest tier (Tier 1).
  • In environment with >100 clusters, we need another grouping in between. Group the Tier 1 clusters into physical location.

When a cluster is unable to cope, is it because it’s having high utilization? I show CPU, RAM and Disk here. You can add Network as you know the physical limit of ESXi vmnic.

Disk is tricky for 2 reasons:

  • It is not in %. There is no 100% for IOPS or Throughput. The good thing is when you architected the array or vSAN, you did have certain IOPS number in mind right? 😉 Well, now you can get the storage vendor to prove it that it does indeed perform as per the PowerPoint 😉 If not, you get free hardware if they promise a fast performance that will meet business requirement.
  • You need to show both IOPS and Throughput. While they are related, you can have low IOPS and high throughput, and vice versa.

If the cluster utilization is high, the next dashboard drills into each ESXi.

We can also see if there are unbalanced. In theory, they should not, if you set DRS to Fully Automated, and pick at least the middle level sensitivity (3). DRS in vSphere 6.5 considers Network also, so you can have unbalanced CPU/RAM.

With the dashboard above, we can tell if ESXi CPU Utilisation is healthy or not.

  • Low value does not mean VM performs fast. A VM is concerned with VM Contention metric, not ESXi Utilization. Low value means we over invest. It is not healthy as we waste money and power.
  • High value means we risk performance (read: contention)

For ESXi, go with higher core count. You save license if you can reduce socket.

We can also tell if ESXi RAM Utilisation is healthy or not.

  • Customers tend to overspend on RAM and underspend on CPU. The reason is this.
  • For RAM, we have 2 metrics:
    • Active RAM
    • Mapped RAM
  • The value you want is somewhere in between Active and RAM.

In the dashboard, the 3 widgets have different range. The range I set is 30 – 90, 50 – 100 and 10 – 90.

Why not 0 – 100?

It is not 100% because you want to cater for HA. Your ESXi should not hit 100% as if you have HA, it would be beyond 100, meaning performance will be badly affected.

If the cluster or ESXi utilization is high, is it because there are VMs generating excessive workloads?

The dashboard above answers if we have VMs that dominate the shared environment.

  • CPU: show a heat map showing all VMs, sized by CPU Demand in GHz (not %), color by contention
  • RAM: show a heat map showing all VMs, sized by Active RAM, color by contention
  • Storage: show a heat map showing all VMs, sized by IOPS color by latency.

At a glance, we can tell the workload distribution among the VMs. We can also tell if they being served well or not.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • You can change the threshold anytime. If you want a brand new storage from Finance, set the max to 1 ms 😉
  • In larger environment, group your heatmap (e.g. by cluster, host, folder).
  • We can show individual VM, but we can’t show the history as there are too much data to show.
  • This needs to be done per Tier. 1 dashboard per Tier, as the threshold varies per tier.

Hope you find it useful. For the product-specific implementation, review this blog. To prevent vROps session from timing out, implement this trick by Sunny.

What’s new with vRealize Operations 6.3

This blog complements the official blog that you can find here, and other articles such as this and this by Michael Ryom. Bill Roth and team have written a lot of articles there. You should also read the Release Notes. Take note when you have EPO agent.

Super Metric

I use around ~50 super metrics in typical engagement. By and large, it’s good enough for my customers’ requirements. On the other hand, I have seen folks like Ronald Buder and Brandon Gordon, who build very advance formula, and would like to have more capabilities. Here are 3 enhancements that would go a long way in making super metric more useful:

  • Ability to specify a condition
    • Prior to 6.3, super metric applies to every member of the group or the parent object. If you are counting the number of VMs in a cluster, it will give you all VMs.
    • With 6.3, you can add condition. You can count only VMs that are powered on, or VMs with >8 vCPU. Another example, you can count how many VMs in a Datastore which have latency above certain number.
  • Ability to have IF THEN ELSE
    • Prior to 6.3, super metric works in 1 formula. You cannot apply Formula 1 for condition A and Formula 2 if Condition A is not met. A use case here is you are checking VM Uptime. If you have VMware Tools running, you use the Tools heartbeat to decide that the VM is up. If the VMware Tools is not running, you use VM utilization to decide.
    • The IF THEN ELSE can be combined with AND, OR and NOT. This enables you to build a more comprehensive logic.
    • You can chain it to create IF THEN ELSEIF.
  • Ability to combine expression
    • You can have AND, OR, NOT. Enough said 🙂
    • Ability to compare. You can have less than, less than or equal to, greater than, greater than or equal to.

The where clause cannot point to another object, but can point to different metric in the same object. For example, you cannot count the number of VMs in a cluster with CPU Contention metric > SLA of that cluster. The phrase “SLA of that cluster” belongs to the cluster object, not VM object.

That right operand must also be a number. It cannot be another super metric or variable.

The where clause cannot be combined with AND, OR, NOT. This means you cannot have “where VM CPU > 4 and VM RAM > 16”. The reason is that ‘where’ clause calculation is running on the vR Ops node where the data is retrieved, while the rest of all operators (AND, OR, NOT) are running on the node where the super metric expression is executed. Other operators are executed when all data has already retrieved. The retrieved data does not contain metric values for each member object but aggregated values of these objects.

As expected, you will find the new operators in the super metric editor, as shown below.

1

The following screenshot, courtesy of Brandon Gordon, shows a brief description of the operators:

2

Example on how to use the where clause

sum(${adaptertype=VMWARE, objecttype=VirtualMachine, attribute=mem|guest_provisioned, depth=5, where = "sys|poweredOn==1" })

Example on how to use the IF THEN ELSE

${this, metric=diskspace|used}>1024 ? max(${this, attribute=virtualDisk|commandsAveraged_average} as IOPS) / ${this, metric=diskspace|used} * 1024 : max(IOPS)
Max([${this, metric=mem|host_contentionPct}>${this, metric=Super Metric|sm_4a3bd0c0-c897-4baf-a60e-4bea139e537b} ? 1 : 0, ${this, metric=cpu|capacity_contentionPct}>${this, metric=Super Metric|sm_20ff3c62-0185-47a8-9bdc-a96f3081a2a8} ? 1 : 0])

The [x,y,z] array is actually available since earlier release. What you can do now is x, y, z are independent expressions and all their results are put into the array. They are no longer limited to just constant or metric.

Resource Alias

The name of resource is rather long. If you have a lot of resources in the formula, the whole formula can be hard to read. You can now have a name for the resource. Here is an example:

Before 6.3:

(min(
${adapterkind=VMWARE, resourcekind=HostSystem, attribute= cpu|demand|active_longterm_load, 
depth=5, where=”>=0”}) + 1)/
(max(${adapterkind=VMWARE, resourcekind=HostSystem, attribute=cpu|demand|active_longterm_load, 
depth=5, where=”>=0”}) + 1)”

We will name the resource as CPUload. by adding as CPUload in the formula. Once added, we can refer to it in the formula, resulting in a shorter formula.

(min(
${adapterkind=VMWARE, resourcekind=HostSystem, attribute= cpu|demand|active_longterm_load,
depth=5, where=”>=0”} as CPUload) + 1)/
(max(CPUload) + 1)”

Notice that CPUload includes the depth clause and where clause, not just the metric.

VM Memory metrics

I discuss the limitation in sizing VM RAM in this blog. In a nutshell, the hypervisor does not have visibility into how the Guest OS manages its RAM. Some applications, such as JVM and databases, manage their own RAM. The guest OS does not have visibility to how the app manages its RAM. This is why RAM sizing is best done at Guest OS and App levels.

vR Ops 6.3 brings Guest OS RAM metrics. Yes, it is agentless. There is no need to deploy agents on every VM. How does it work then, if there is no network connection to the VM? VMware Tools comes to the rescue! vR Ops talks to vCenter, which in turns talks to the ESXi via management network. The new version of VMware Tools pulls these additional counters. ESXi retrieves them and passes them to vCenter.

This feature was actually available since vSphere 6.0 Update 1. Yes, that means you need a minimum of ESXi 6.0U1, vCenter 6.0U1 and the VM must be running Tools from ESXi 6.0U1. You do not need to upgrade to vSphere Update 2.

The table shows a variety of VMs with the Guest OS data. I’ve added the Active RAM from hypervisor as a comparison.

ram-metric

Here is the list of metrics. I’m using the internal name as the table above already has the friendly name.

Internal nameDescription
guest|mem.free_latestThis is one the 3 major counters for capacity & performance monitoring. The other 2 counters are Page-in Rate and Commit Ratio.
In Windows, this is the Free Memory counter. This excludes the cached memory. If this number drops to a low number, Windows is running out of Free RAM. While that number varies per application and use case, I’d generally keep this number > 500 MB for server VM and >100 MB for VDI VM. I set a lower number for VDI because they add up. If you have 10K users, that’s 1 TB of RAM.
guest|mem.needed_latestThe amount of memory needed by the Guest OS. Below this amount, the Guest OS may swap.
This is Total RAM - Free RAM. It includes the Cached RAM, which Standby + Modified.
The Standby memory (which can be significant on Windows, less so on Linux) can be split into 3: FreeAndZero, Cold and Hot. MemNeeded will count the hot part of the buffer cache as being required by the OS.
guest|page.inRate_latestThe rate of reads going through the underlying paging/cache system. It includes not just swapfile I/O, but cacheable reads as well (double, pages/s). The Rate the Guest OS brings memory back from disk to DIMM per second. A page that was paged out earlier, has to be brought back first before it can be used. This creates performance issue as the application is waiting longer, as disk is much slower than RAM.
Windows does not page out any Large Pages. A process can have concurrent mixed usage of Large and non-Large page in Windows. The page size isn’t a system-wide setting that all processes use. The same is likely true for Linux Huge Pages
guest|page.outRate_latestThe opposite of the above. This is not as important as the above. Just because a block of memory is moved to disk that does not mean the application experiences memory problem. In many cases, the page that was moved out is the idle page.
guest|page.size_latestSize of the page. In Windows, this is 4 KB by default.
This is not the size of the pagefile.sys in c:\.
guest|mem.physUsable_latestPhysically Usable Memory
Based on a sample of 9 VMs (Windows and Linux), this looks like VM Configured RAM - Hardware used. Since Hardware Used is near 0, this value is near the Configured RAM
guest|swap.spaceRemaining_latestThe amount of swap space remaining, taking into account the possibility of swapfile growth where possible. If the system is configured to run without a swapfile, this will return zero
guest|hugePage.size_latestCurrent size of Huge Page.
This should be 2 MB in Windows.
guest|hugePage.total_latestTotal number of Huge Pages.
This is Linux specific.
guest|contextSwapRate_latestContext Swap Rate per second in Windows/Linux
guest|mem.activeFileCache_latestActive File Cache Memory. This is the actively in-use subset of the file cache. Unused file cache and non-file backed anonymous buffers (mallocs etc) are not included.
This seems to be the Cache Bytes in Windows

Let’s compare them with the RAM counters from Windows. The list below is from Windows 10 Performance Monitor.

3

I’m not sure if they are enabled by default. If not, it’s a matter of enabling from the Policy, as shown below:

VM RAM

This is what it looks like in VM object. Finally! 🙂

VM RAM 2

Reduction in Metrics

This is one of my favourite, as I do have customers struggle with the long list of metrics. This should also improve vR Ops scalability. The example below is from ESXi Host. Quite a number of the capacity metrics are now hidden, as they are needed by default.

9

The reduction can be seen in the Self Monitoring, which has improved a lot in 6.3 also. You can see the number of metrics dropped on the following chart.

Reduced Metrics

The reduction translates into less resource utilisation (CPU, Disk, RAM). I’ve added CPU as an example. Notice the load is also less spiky.

CPU reduced in percentage

Drill down via Line chart

One popular use case is the ability to automatically plots all the children value when you select a parent. There are many examples of this, such as:

  • You select a cluster, and you want to automatically have a line chart of all its ESXi CPU Demand. If you have 8 hosts in that cluster, then you get 8 line charts.
  • You select a data center, and want to automatically have a line chart of all its clusters No of VM too have a sense of VM growth among clusters.

See the following screenshot. Can you notice how it’s done?

4

Hint: it’s done differently than in other widgets.

The way you do this is by knowing relationship among objects. You choose the metrics you want to display, not the parent. In the following example, I need to show the ESXi CPU contention on all ESXi in a cluster. So I pick the ESXi object, not the cluster object.

You do not have to specify the relationship (parent, child, self, etc.). vRealize Operations actually automatically figures out the relationship. Unlike other widgets, where you must specify, the View Widget has that intelligence built-in. Nice!

Can you spot a performance issue that happened in the past in the selected cluster below?

5a

The above screenshot shows one of the ESXi experienced a spike in CPU Contention. It touched 9%, which is a high number as the number at ESXi level is the average of all its VMs. One of the VM likely experiencing a much higher number, as most VMs have low CPU Contention. The reason why most have low value because your ESXi has enough cores to serve quite a number of VMs.

Property now accompany metrics

One widget customers use heavily is the Object List widget. It can list any objects along with its metrics. In 6.3, you can now list its property. This makes it a lot more useful.

property

Heat Map: Zoom and Grouping

I use heat map a lot, especially in Configuration and Capacity use cases. They are also useful in NOC (big screen or projector). They are not so useful in performance as they can only show latest value. Since vR Ops collect data every 5 minutes, that means anything beyond 5 minutes cannot be shown.

The other limitation of Heat Map, which is addressed in 6.3, is scalability. When you have lots of objects, it can be difficult to see. 6.3 groups the objects, and allows you to drill down.

6

I then drilled down into the selected group. It reveals a lot of more objects.

6 (2)

Sportier looking

I’m a big fan of UI and UX. While underlying architecture matters, the human experience is what we see every time we deal with the system. There are 3 UI enhancements that I spotted as I compared 6.2.1 widgets with 6.3 widgets.

Scoreboard widget

The Scoreboard widget now provides more visual themes than just 2 themes. This is useful when you have multiple Scoreboard widgets in 1 screen. You can use 1 theme for VM and another theme for Infrastructure objects. They help in differentiating objects easily.

7

There is a small usability enhancement. When you choose Fixed View, the size controls do not appear as it’s not relevant. Choose Fixed Size and they will appear.

8

Scoreboard Health widget

Here is what it looks like in 6.2. Notice the font for the object name is not so clear. It does not work well if you need to show it on the NOC (big screen projector). The other problem is long name is truncated. Some objects, such as Disk Device and NSX port group, are very long.

9 (2)

Notice the border? Yup, I’m not a big fan either J Personally, I prefer not to see the border. I use this widget to see a lot of objects, so the border does get in the way.

Here is what it looks in 6.3. I definitely find this more usable. Thank you UX team!

10

Forensic widget

I use forensic widget to quickly know where an object spends 95% of its time. The chart below shows that the ESXi has barely any CPU stress. 95% of the time, the value is not even 0.002%. Once you get used to this widget, it’s a great complement to other visualisation.

11

As you can see above, in 6.2 the UI is looking a little dated.

This is what it looks like in 6.3. Notice the grid lines make it easier to read. There is also peak and low, so it’s easier to see the minimum and maximum.

12

GUI Editor for XML interaction

No more manually modifying XML file and figuring out what the metric names are! There is now a wizard that guides you along the way.

the-wizard-guides-you

Once you select the Adapter Kind, the wizard automatically moves into the Resource Kind. No more typing!

the-wizard-guides-you-2

Maintenance Schedule

The maintenance schedule has more flexibility. A few limitations in 6.2 that were addressed in 6.3:

  • You cannot specify the start date. You can only specify the start time.
  • You cannot specify the expiry date on this schedule. Often you want to schedule only for a fixed period, such as a few months or weeks.
  • You cannot specify the number of runs. Sometimes you want to specify that you only need to run this a few times.

As a comparison, here is what the maintenance schedule editor looks like in 6.2:

13

6.3 addresses the above limitation as you can see in the following screenshot.

14

Note: The new Maintenance Scheduler is not backward compatible. All previously created maintenance schedules will no longer be available and should be created again.

New VM properties

VM folder and VM Datastore are now available via the View widget. If a VM has >1 datastore, it will show all of them, separated by commas. If you have a nested folder, it will show all of them too.

15

That’s all folks. Hope it helps and keep in touch at LinkedIn.