I travel globally meeting VMware customers. A very popular request among customers is a simple set of dashboards that answer these questions:
- Are the VMs served well?
  - If not, which VMs are affected? By what problems (CPU, RAM, Disk, Network)? How bad?
  - Is it because of Villain VMs consuming an excessive amount of shared resources? If yes, which VMs are they?
  - Are the problems spread across clusters, networks, and datastores? Or are they isolated to a specific part of my IaaS?
  - How long has the problem been happening? Is there a pattern?
- Is my Infrastructure running hot?
  - This could be a reason why the VMs were not served well.
  - If yes, which part? I need to see the 4 IaaS elements (CPU, RAM, Disk, Network), and easily spot where the problems are.
    - Blue = 0% = cold. Not used.
    - Red = 100% = hot. Highly utilized.
  - Compute: Which clusters are running hot? Is it CPU or RAM?
    - Is the cluster balanced? Hosts in the cluster should have a similar color.
    - Are the hosts of equal capacity? The bigger the host, the bigger the box, so I can spot it easily.
    - Select an ESXi host, then click dashboard navigation to drill down into the Troubleshoot a Host dashboard.
  - Storage: Which datastores are running hot?
    - Hot = busy processing lots of IOPS.
    - Select a datastore, then click dashboard navigation to drill down into the Troubleshoot a Datastore dashboard.
  - Network: Which LAN or VXLAN carries a lot of traffic?
    - The bigger the network (number of VMs or ports), the bigger the box.
    - The higher the traffic, the redder the color.
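The color and size rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the heat map widget is implemented: the function names `utilization_color` and `box_shares` are hypothetical, the host names and numbers are made up, and a plain two-color linear gradient from blue to red is assumed.

```python
def utilization_color(percent: float) -> tuple[int, int, int]:
    """Return an (R, G, B) color: pure blue at 0 %, pure red at 100 %."""
    # Clamp so out-of-range readings still map to a valid color.
    fraction = max(0.0, min(100.0, percent)) / 100.0
    return (round(255 * fraction), 0, round(255 * (1.0 - fraction)))


def box_shares(sizes: dict[str, float]) -> dict[str, float]:
    """Return each object's share of the total area, proportional to its size."""
    total = sum(sizes.values())
    if total <= 0:
        return {name: 0.0 for name in sizes}
    return {name: size / total for name, size in sizes.items()}


if __name__ == "__main__":
    # Hypothetical hosts: capacity drives the box size, utilization drives the color.
    capacity_ghz = {"esxi-01": 40.0, "esxi-02": 40.0, "esxi-03": 80.0}
    utilization = {"esxi-01": 35.0, "esxi-02": 40.0, "esxi-03": 95.0}
    shares = box_shares(capacity_ghz)
    for host in capacity_ghz:
        print(host, f"{shares[host]:.0%}", utilization_color(utilization[host]))
```

In this made-up cluster, esxi-03 would get a box twice as large as the others and a color much closer to red, which is exactly the kind of imbalance the heat map is meant to expose.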
Using the dashboard best practices covered here, I translated the above into 4 dashboards. I added a simple VM Reclamation dashboard to complete the functionality. The picture below shows the functional relationship among the dashboards.
The result was 5 simple dashboards. It’s a lite version of Operationalize Your World, which has 50 dashboards. As a result, the import step is much simpler. It’s also upgradeable to the full OYW.
Are the VMs served well?
The above shows the present data. It is suitable for a live NOC screen, where you need to read it from a distance. All you want to see is green! To customize the thresholds, simply edit each widget.
Villain VMs are easy to spot. They are the biggest boxes! A box occupying a relatively large area means that VM is consuming a large percentage of your shared environment.
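The same "biggest box" idea can be expressed numerically: compute each VM's share of the total consumption and flag the ones above a cutoff. The sketch below is a rough illustration only. The function name `find_villains`, the sample VM names, and the 20% cutoff are all assumptions, so pick a cutoff that makes sense for your environment.

```python
def find_villains(usage_by_vm: dict[str, float],
                  share_threshold: float = 0.2) -> list[tuple[str, float]]:
    """Return (vm_name, share) pairs whose share of total usage meets the threshold.

    The result is sorted with the largest consumer first.
    """
    total = sum(usage_by_vm.values())
    if total <= 0:
        return []  # Nothing is consuming anything, so there are no villains.
    villains = [
        (name, used / total)
        for name, used in usage_by_vm.items()
        if used / total >= share_threshold
    ]
    return sorted(villains, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical IOPS per VM on one shared datastore.
    iops = {"vm-app-01": 100.0, "vm-app-02": 100.0, "vm-db-01": 800.0}
    for name, share in find_villains(iops):
        print(f"{name} consumes {share:.0%} of the shared resource")
```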
The above is less suited to showing the past. Unlike the present (which has 1 data point), the past has many. For that, we need a line chart. This is why the next dashboard is required.
Were the VMs served well?
Is my Infra running hot?
Was my Infra running hot?
- Which clusters had the problem? Is it CPU or RAM? How bad is it?
- Both max and average lines are shown so you get a better idea.
- If max is high but average is low, no one may complain yet. This is your proactive window!
- Which datastores had the problem?
- How bad is the situation?
- Is the IO stuck in the queue?
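The max versus average reasoning above can be written down as a small classification. This is a minimal sketch under stated assumptions: the function name `classify_heat`, the labels it returns, and the 80% threshold are hypothetical, and the samples are simply a list of utilization percentages over the period you are reviewing.

```python
from statistics import mean


def classify_heat(samples: list[float], hot_threshold: float = 80.0) -> str:
    """Classify a utilization history using its peak and its average."""
    if not samples:
        raise ValueError("at least one sample is required")
    peak = max(samples)
    average = mean(samples)
    if average >= hot_threshold:
        return "sustained"          # Hot most of the time: users likely feel it.
    if peak >= hot_threshold:
        return "proactive window"   # Spikes only: act before anyone complains.
    return "ok"


if __name__ == "__main__":
    # Hypothetical cluster CPU utilization (%) sampled over a day.
    print(classify_heat([20.0, 30.0, 95.0, 25.0]))
    print(classify_heat([85.0, 90.0, 95.0]))
    print(classify_heat([10.0, 20.0]))
```

The middle case is the interesting one: the peak crosses the threshold while the average stays low, which is the window where you can still fix things before complaints arrive.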
What can I easily reclaim?
I focus on powered-off VMs and idle VMs, as they are easier to reclaim than active VMs.
From this dashboard, you can select a VM, then click dashboard navigation to drill down into the VM Utilization dashboard.
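To make the two reclamation buckets concrete, here is a small sketch that sorts an inventory into powered-off and idle VMs. It only illustrates the logic, not how the dashboard computes it: the `VM` class, the `reclaim_candidates` function, and the 2% average CPU cutoff for "idle" are assumptions.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    powered_on: bool
    avg_cpu_percent: float  # Average CPU usage over the review period.


def reclaim_candidates(vms: list[VM],
                       idle_cpu_percent: float = 2.0) -> dict[str, list[str]]:
    """Split VMs into powered-off and idle buckets; active VMs are left out."""
    powered_off = [vm.name for vm in vms if not vm.powered_on]
    idle = [
        vm.name
        for vm in vms
        if vm.powered_on and vm.avg_cpu_percent < idle_cpu_percent
    ]
    return {"powered_off": powered_off, "idle": idle}


if __name__ == "__main__":
    # Hypothetical inventory.
    inventory = [
        VM("vm-old-test", powered_on=False, avg_cpu_percent=0.0),
        VM("vm-forgotten", powered_on=True, avg_cpu_percent=0.5),
        VM("vm-web-01", powered_on=True, avg_cpu_percent=45.0),
    ]
    print(reclaim_candidates(inventory))
```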
Compared to the Operationalize Your World import step, this is much easier. It does not require preparation, which is time consuming. The steps marked (not required) can be skipped.
- Plan which clusters & datastores belong to what service tier (not required)
- Create service tier policy (not required)
- Create a Group Type called “VM Types”.
- Import Groups
- Import super metrics, using dummy policy (not required)
- Import super metrics. Then enable them on your base policy
- Enable super metrics on each service-tier policy (not required)
- Create XML interaction, manually (not required)
- Create text widget content, manually (not required)
- Create roles. Assign users to roles (not required)
- Import dashboards & Import views
To get the file, download it from here.