vSphere visibility for Storage Team

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Ask any Storage Team and Platform Team whether the collaboration between them could be improved by a mile, and you are likely to get a nod. One reason for this is the lack of common visibility. The Storage Team does not always get access to vSphere. Even if they do, vCenter is not designed for the Storage Team; its UI is designed for the VMware Admin.

vRealize Operations and Log Insight can bridge that gap by providing a set of read-only, purpose-built dashboards.

  • When a VM Owner complains, can we determine within 1 minute whether it’s a storage issue?
    • No ping-pong between VM Owner, vSphere Admin, and Storage Admin
  • Is the storage serving all the VMs well?
    • If not, who is affected, when, and how badly? Is it Read or Write?
    • The answer has to be tier-based, as a Tier 1 VM expects lower latency than a Tier 3 VM
  • What’s the total demand hitting the array? Is it growing fast?
    • Who are the heavy hitters among the VMs?
  • When & where are we running out of capacity?
    • How much disk space can be reclaimed? From which VMs?
  • What configuration do we have?
    • Is it consistent?

The questions above cover the typical areas of SDDC Operations: performance, capacity, configuration, and availability. I’ll show some example dashboards to get you going.


This set of dashboards answers questions such as:

  • What’s the overall performance, for each cluster and datastore?
  • When a cluster is not performing, do we know when and which VMs were affected?
  • What’s the total demand hitting the storage system? Who are the heavy hitters?


As this is for the Storage Team, we can drill down into a specific datastore. It provides detailed line charts of the datastore’s latency, throughput, outstanding IO, and IOPS.

It also shows the VMs in the datastore, and whether any of them is generating a lot of IOPS (a villain VM).
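If you want to pull the same raw numbers yourself, the sketch below is a minimal pyVmomi example that queries a VM’s recent per-datastore read latency from vCenter’s PerformanceManager. It assumes `si` is an already-connected ServiceInstance and `vm` is a vim.VirtualMachine you have looked up earlier; the counter name is a standard vSphere counter, but treat the rest as a sketch, not production code.

```python
# Minimal pyVmomi sketch: recent per-datastore read latency for one VM.
# Assumes `si` (ServiceInstance) and `vm` (vim.VirtualMachine) already exist.
from pyVmomi import vim

perf = si.content.perfManager

# Map human-readable counter names ("group.name.rollup") to counter IDs.
by_name = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
           for c in perf.perfCounter}

metric = vim.PerformanceManager.MetricId(
    counterId=by_name["datastore.totalReadLatency.average"],
    instance="*")  # one series per datastore the VM touches

# intervalId=20 requests real-time (20-second) samples; maxSample=15 ~ 5 min.
spec = vim.PerformanceManager.QuerySpec(
    entity=vm, metricId=[metric], intervalId=20, maxSample=15)

for result in perf.QueryPerf(querySpec=[spec]):
    for series in result.value:
        print(series.id.instance, series.value)  # latency samples, in ms
```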


The heavy hitters dashboard distinguishes between a Bursty Hitter and a Sustained Hitter. You can see who hit you and for how long.
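The idea behind that distinction fits in a few lines of Python. The sketch below is conceptual only, not the dashboard’s actual logic, and the thresholds are made-up examples:

```python
# Conceptual sketch only (not the dashboard's actual logic): classify a VM
# by how long it stays above a heavy-IOPS line. Thresholds are examples.
def classify_hitter(iops_samples, heavy_iops=2000, sustained_fraction=0.5):
    if not iops_samples:
        return "idle"
    heavy = sum(1 for s in iops_samples if s >= heavy_iops)
    if heavy == 0:
        return "normal"
    # Sustained: above the line for at least half of the samples.
    ratio = heavy / len(iops_samples)
    return "sustained hitter" if ratio >= sustained_fraction else "bursty hitter"

print(classify_hitter([100, 90, 5000, 120, 80, 95]))        # bursty hitter
print(classify_hitter([2500, 2600, 2400, 2550, 2450, 90]))  # sustained hitter
```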


The Storage Team normally wants better granularity than 5 minutes. You can complement the view above with data from the VMkernel logs. I’ve posted how here, so follow that link.


This set of dashboards answers questions such as:

  • What’s the overall capacity? How is it used? Where is our over-subscription?
  • Is any datastore running low on capacity? Are we using the datastores equally?
  • Are the VMs equally distributed among the datastores?
  • For each shared datastore, what is the capacity?


You should use Datastore Clusters as part of your design. If you do, you also get visibility into their capacity.


You can drill down into each datastore and review its capacity.


Capacity: Space Reclamation

There are 4 places where you can reclaim storage, listed from the easiest to the hardest (a reporting sketch follows the list):

  1. Orphaned VMs and orphaned VMDKs. They are the easiest, as they are not even owned.
  2. Snapshots.
  3. Powered-off VMs.
  4. Idle VMs.
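To get you started on your own reclamation report, here is a minimal pyVmomi sketch that finds two of the easier targets: powered-off VMs and VMs carrying snapshots. The connection details are hypothetical, and orphaned VMDKs and idle VMs are left out, as they need more work (datastore browsing and utilization history, respectively).

```python
# Minimal pyVmomi sketch (hypothetical credentials): list powered-off VMs
# and VMs with snapshots, two of the easier reclamation targets.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certs in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.runtime.powerState == "poweredOff":
            print("Powered off:", vm.name)
        if vm.snapshot is not None:
            print("Has snapshot(s):", vm.name)
    view.Destroy()
finally:
    Disconnect(si)
```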


The dashboard does not list active VMs. Don’t bother trying 🙂

Single VM

When a VM Owner complains, can we rule out within 1 minute whether storage is the issue?


Hope you find the material useful. If you do, go back to the Main Page.

Right-sizing VM Memory without using an agent

The much-needed visibility into Guest OS memory is finally possible in vSphere. As part of the new features in vR Ops 6.3, you can now get Guest OS RAM metrics without using an agent. So long as you have vSphere 6.0 U1 or later, and the VM is running Tools 10.0.0 or later, you are set. Thanks to Gavin Craig for pointing this out. The specific feature needed in Tools is called the Common Agent Framework, which removes the need for multiple agents in a VM.

As a result, we can now update the guidance for RAM right-sizing (codified in the sketch after the list):

  • For Apps that manage their own RAM, use metrics from the Apps.
  • For others, use metrics from the Guest OS.
  • Use vR Ops Demand if you have no Guest OS visibility. Do not use vCenter Active.
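That guidance is easy to codify. A minimal sketch (the function and parameter names are illustrative):

```python
# Illustrative codification of the right-sizing guidance above.
def ram_sizing_source(app_manages_ram, guest_os_visible):
    """Pick the metric source for right-sizing a VM's RAM."""
    if app_manages_ram:      # e.g. JVM, database
        return "application metrics"
    if guest_os_visible:     # Tools 10.0.0+ on vSphere 6.0 U1+
        return "Guest OS metrics"
    return "vR Ops Demand (not vCenter Active)"
```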

Examples of applications that manage their own RAM are JVMs and databases. If you use the Guest OS counter for these, you can end up with the wrong size and make the situation worse. Manny Sidhu provides a real example here: the application vendor asked for 64 GB of RAM when the application was only actively using 16 GB, as he shared in the vCenter screenshot below.

For apps that do not manage their own RAM, you should use Guest OS data. The table below compares 63 VMs running a variety of Microsoft Windows versions. A good proportion of them are just idle, as this is a lab, not real-life production.

  1. What conclusion do you arrive at? I’ve added a summary at the bottom of the list.
  2. How do VM Consumed, VM Active, and Guest OS Used compare?


And the table below shows comparison for Linux.

What do you spot? What’s your conclusion? How does this change your capacity planning? 😉


Here is the summary for both OSes. The total is 101 VMs, not a bad sample size. I’ve also added a comparison. Notice that something does not add up?


To help you compare further, here is a vR Ops heatmap showing all the VMs.


I created a super metric that compares the Guest OS metric with VM Active. As expected, Guest OS is higher, as it takes cache into account. It’s not just Used, and Windows does use RAM as cache (I think Linux does too, but I’m not 100% sure).

The super metric is a ratio: Guest OS divided by VM Active. I set 0 as black, 5 as yellow, and 10 as red. Nothing is black, as VM Active is lower than Guest OS in all samples.
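Conceptually, the super metric boils down to the plain-Python sketch below (this is not vR Ops super metric syntax), including the color bands above:

```python
# Conceptual version of the super metric: Guest OS (Used + Cache) divided by
# VM Active, with the 0=black, 5=yellow, 10=red bands from the text.
def guest_to_active_ratio(guest_os_gb, vm_active_gb):
    return guest_os_gb / vm_active_gb if vm_active_gb else float("inf")

def heatmap_band(ratio):
    if ratio >= 10:
        return "red"
    if ratio >= 5:
        return "yellow-to-red"
    if ratio > 0:
        return "black-to-yellow"
    return "black"

# Example: Guest OS reports 8 GB while VM Active shows 2 GB -> ratio 4.
r = guest_to_active_ratio(8, 2)
print(r, heatmap_band(r))  # 4.0 black-to-yellow
```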


  • VM Consumed is always near 100%, even on VMs that have been idle for days. This is expected, given its nature as a cache. Do not use it for right-sizing.
  • Windows memory management differs from Linux’s. Notice its VM Consumed is higher (94%) than Linux’s (82%). I guess Windows writing zeroes during boot creates this.
  • VM Active can be too aggressive, as it does not take cache into account. vR Ops adds the Demand counter, which makes the number less aggressive.
  • Guest OS Used + Cache is much greater than VM Active or VM Demand. It’s 69% vs 15% vs 31%.
  • Guest OS Used + Cache + Free does not add up to 100%. In the sample, it only adds up to 83%.

Based on the above data, I’d prefer to use Guest OS, as it takes into account cache.

Side reading, if you need more info: refer to this for Windows 7 metrics, and this for Windows 2008 metrics. This is a simple test to understand Windows 7 memory behaviour.

You can develop a simple vR Ops dashboard like the one below to help you right-size based on Guest OS data. The dashboard excludes all VMs that do not have Guest OS RAM data.


Since not all VMs have Guest OS RAM data, the first step is to create a group that only contains VMs with the data. Use the example below.


Notice the Group Type is VM Types. Follow that exactly, including the case!

Once you have created the group type and group, the next step is to download the following:

  • Super metrics. Don’t forget to enable them!
  • Views
  • Dashboard

I’ve zipped them into 1 file. You can get it here. Import them into your vR Ops. If you are not sure how to import, review this.

You can customize the dashboard. Do not be afraid to experiment with it. It does not modify any actual metric or object, as a dashboard is just a presentation layer.

Take, for example, the scoreboard. We can add color coding to quickly tell you the amount of RAM wasted. If you have > 1 TB of RAM wasted, you want it to show red.


To do that, it’s a matter of editing the scoreboard widget. I’ve added thresholds, so it changes from green to yellow when I cross 500 GB, to orange when I cross 750 GB, and to red when I cross 1 TB.
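Those thresholds translate directly into code. A minimal sketch (the function is illustrative, not a vR Ops API):

```python
# The scoreboard thresholds from the text, as plain code.
def scoreboard_color(ram_wasted_gb):
    if ram_wasted_gb > 1024:  # > 1 TB
        return "red"
    if ram_wasted_gb > 750:
        return "orange"
    if ram_wasted_gb > 500:
        return "yellow"
    return "green"

print(scoreboard_color(600))   # yellow
print(scoreboard_color(1200))  # red
```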


Hope that helps. If you find this dashboard useful, you should consider getting the complete set. It’s delivered as part of the Operationalize Your World program.


SDDC Dashboards: The Kitchen

This post continues from the Operationalize Your World post. Do read it first so you get the context.

There are only 4 parts in IaaS monitoring:

  1. Capacity
  2. Configuration
  3. Audit and Compliance
  4. Availability

Can you figure out why we do not have Performance in “the kitchen” area of your restaurant business?

The Performance SLA concept explains why. I’ve also applied it to the VDI use case and given an example.

The dashboards are based on vRealize Operations 6.2. You can easily add 6.3 features.


The Capacity dashboards below take into account the Performance SLA and the Availability SLA. Only when these 2 are satisfied does it consider Utilization. Review this series of blog posts for extensive coverage of this new model.
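The ordering matters: the 2 SLAs come first, utilization last. Here is a minimal sketch of the model; all the threshold values are illustrative:

```python
# Illustrative capacity check: a cluster only has usable headroom when the
# Availability SLA and the Performance SLA are met; utilization comes last.
def cluster_has_headroom(availability_pct, latency_ms, utilization_pct,
                         availability_sla=99.9, latency_sla_ms=20.0,
                         utilization_limit=80.0):
    if availability_pct < availability_sla:
        return False  # availability SLA breached: no headroom at all
    if latency_ms > latency_sla_ms:
        return False  # performance SLA breached
    return utilization_pct < utilization_limit
```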

This set of dashboards answers questions such as:

  • What’s the capacity of my clusters?
  • What’s the consumption on the clusters?
  • Which clusters are running low?
  • Is the cluster still coping well with demands?

Here is the dashboard for Tier 1, where we do not overcommit. As a result, both performance and utilization are irrelevant.


The lines do not gradually come down (or up) because this is a lab, not a real-life environment. Your production environment will have line charts that make sense 🙂

Here is the dashboard for Tier 2 or 3. Since we overcommit, we now have to take performance into account first, and then utilization. In this example, I have also smoothed the line, which hides any 5-minute spike.


I do not recommend you hide the 5-minute spike, because that makes your capacity planning inconsistent with your Performance SLA.


In the software-defined world, configurations are easy to change, so consistency becomes an area you need to watch.

This set of dashboards answers questions such as:

  • Are my ESXi configs consistent, especially for hosts in the same cluster? (See the sketch after this list.)
  • Are my ESXi hosts configured to follow best practices?
  • Do I have too many combinations, which increases complexity?
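As an illustration of the first question, here is a minimal pyVmomi sketch that flags clusters whose hosts disagree on ESXi build or NTP servers. It reuses the `si` connection from the earlier sketch and uses only standard vSphere properties; treat it as a starting point, not a full compliance check:

```python
# Minimal pyVmomi sketch (reuses `si` from the earlier example): flag
# clusters whose connected hosts differ in ESXi build or NTP servers.
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
for cluster in view.view:
    hosts = [h for h in cluster.host
             if h.runtime.connectionState == "connected"]
    builds = {h.config.product.build for h in hosts}
    ntp = {tuple(h.config.dateTimeInfo.ntpConfig.server or []) for h in hosts}
    if len(builds) > 1:
        print(cluster.name, "has mixed ESXi builds:", builds)
    if len(ntp) > 1:
        print(cluster.name, "has inconsistent NTP configuration")
view.Destroy()
```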

The dashboard below is for ESXi:


The dashboard below is for Cluster:


Audit and Compliance

vCenter tasks, events, and alarms are 3 areas that you can mine to help answer compliance and audit questions. Log Insight complements vR Ops nicely here. For example, the following screenshot answers this audit question; a pyVmomi alternative follows the example:

  • Who shut down which VM, and when?
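If you prefer the vSphere API over Log Insight for this one, a minimal pyVmomi sketch (again reusing `si`) can mine the vCenter events directly:

```python
# Minimal pyVmomi sketch (reuses `si`): who powered off which VM, and when,
# straight from vCenter events. QueryEvents returns up to 1000 events.
from pyVmomi import vim

spec = vim.event.EventFilterSpec(eventTypeId=["VmPoweredOffEvent"])
for ev in si.content.eventManager.QueryEvents(spec):
    who = ev.userName or "system"
    print(ev.createdTime, who, "powered off", ev.vm.name)
```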


There are many things it can answer, and it’s covered in the workshop.


Because of HA and DRS, tracking the cluster makes more sense than tracking each ESXi host. A cluster’s uptime remains 100% when 1 host is unavailable, because you have HA. You have catered for that, and as a result, you should not be penalized.
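That availability rule is simple to state in code. A sketch, where the tolerated-failure count matches whatever your HA admission control is sized for:

```python
# Illustrative: the cluster counts as "up" for an interval as long as no
# more hosts are down than HA is sized to tolerate.
def cluster_is_up(hosts_up, hosts_total, ha_tolerated_failures=1):
    return hosts_up >= hosts_total - ha_tolerated_failures

print(cluster_is_up(7, 8))  # True: losing 1 of 8 hosts is catered for
print(cluster_is_up(5, 8))  # False: beyond what HA was sized for
```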

This set of dashboards answers questions such as:

  1. What’s the availability (%) of each cluster in the last 24 hours? Each cluster has its own line chart, and it’s color coded. You expect a green bar, as shown below.
  2. What’s the availability now? The heatmap provides that answer quickly. You can drill down into the cluster if you spot a problem.
  3. Am I containing risk when there is a major outage? How many VMs am I willing to lose when a cluster or datastore goes down?


The heatmap also provides the ESXi uptime. You can toggle between Cluster and ESXi.

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations.