Hard to believe that Aria, the code name for vRealize Operations 6.7, is finally here with us. I worked closely with the R&D team and was privileged to see how the men and women in the arena fought their best. Many folks (Monica, Sunny, Karen, Esther and countless others) worked long hours, including during holidays and in flight. I remember the many times we had lunch at 3 pm and dinner at 10 pm as we developed and tested the products.
vRealize Operations 6.7 enables changes that make Operationalize Your World better. I will not repeat the generic product benefits (e.g. scalability, usability, performance). I’ll focus on the specific benefits that I’ve taken advantage of (meaning they require changes on my end and do not come to Operationalize Your World automatically). My goal is to show you what changes you can make to improve your own customisation. You will see that some of the dashboards change substantially, because the widgets enable me to do things differently. The View List widget’s ability to drive other widgets is one of the features I use to create a different dashboard experience.
I also use Percentile instead of Maximum. Maximum is good as an early warning, as it captures the outliers. In capacity use cases, you want to base your decision on something more realistic. The 99th percentile gives you an additional data point that helps you make a better judgement. In other cases, I replace Average with the 95th percentile, as Average is simply not practical in operations.
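To see why the choice of statistic matters, here is a minimal sketch in Python. The sample data and the `percentile` helper (a simple nearest-rank implementation) are made up for illustration; this is not how vR Ops computes it internally.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number percentiles give exact ranks.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical utilisation samples (%): mostly quiet, a short busy period, one spike.
samples = [40] * 90 + [70] * 9 + [100]

print("Maximum:", max(samples))
print("99th percentile:", percentile(samples, 99))
print("95th percentile:", percentile(samples, 95))
print("Average:", sum(samples) / len(samples))
```

With this data, Maximum reports the single spike (100), the 99th and 95th percentiles both report the busy period (70), and Average (43.3) hides it.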
With that, let’s go through the changes dashboard by dashboard. I won’t explain every single one in detail, or you’d get bored 🙂 My goal is to give you real-world examples of the type of customization you can do for your bespoke operations.
At the end of this blog, I list the steps required to upgrade Ops Your World to be compatible with vR Ops 6.7. Without them, some dashboards will not work.
- You can now see the live performance of each cluster at the top.
- The table that lists the clusters now sports color coding. Now that it’s using View instead of Object List, the columns can have proper units. For example, 2.5 TB RAM instead of a long number in KB.
- Added more insight into Resource Pool, since you need to ensure each pool has an equal number of members.
- I highlight clusters that have too few hosts as that means your HA overhead is high. A cluster with 4 nodes means your HA overhead is 25%.
- I highlight clusters that have too few physical cores. Your VMware license gives you unlimited cores per socket. Microsoft has stopped giving unlimited cores per socket.
- I merged the Affected VMs dashboard into this one to make it easier. You no longer have to switch to another dashboard. I rounded the numbers to 1 decimal place to drive home the point that you should not look beyond 1 decimal place.
- Because Affected VMs is gone, that’s 1 less dashboard in the sprawl 🙂
- It cannot pass the VM object when you move to another dashboard. This is because of a limitation of the View List widget: it cannot drive a widget on another dashboard. Yes, we are aware of this limitation 🙂
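The HA overhead mentioned in the bullet on small clusters is simple arithmetic. Here is a minimal sketch, assuming identical hosts and that you reserve one host’s worth of capacity for failover; the function name and the `failures_to_tolerate` parameter are hypothetical.

```python
def ha_overhead_pct(hosts, failures_to_tolerate=1):
    """Share of cluster capacity (%) set aside for HA, assuming identical hosts."""
    if hosts <= 0:
        raise ValueError("hosts must be positive")
    if not 0 <= failures_to_tolerate <= hosts:
        raise ValueError("failures_to_tolerate must be between 0 and hosts")
    return 100 * failures_to_tolerate / hosts

for hosts in (2, 4, 8, 16):
    print(f"{hosts} hosts: {ha_overhead_pct(hosts):.1f}% reserved for HA")
```

A 4-host cluster gives 25%, matching the figure above, while a 16-host cluster brings it down to 6.25%.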
Performance: Villain VMs
- It’s now an independent dashboard. You now need to select the cluster again.
- Added CPU and RAM, to complement Disk and Network.
- I’m not using Cluster average, as imbalance is possible, especially if you have large VMs, resource pools, shares and limits.
- For CPU: I use CPU GHz, so it takes into account the VM CPU size.
- For RAM: I use RAM Consumed. This is more accurate than Guest OS metrics as this is what ESXi maps to the physical DIMMs.
- I’m using Health Chart so it can display in color. Because of HA, your ESXi utilization should be low, as your HA host is participating.
Performance: Single VM Monitoring
- I disable the Received Dropped Packet. I think vCenter drops the packet when it’s not intended for the VM, but the counter counts it. So it’s a false positive.
Performance: Single VM Troubleshooting
- You no longer have to select the VM again. It’s automatically passed from the Single VM Monitoring dashboard. This is handy as you may have 1000s of VMs.
- You no longer have to figure out which property to choose to find out the parent ESXi. It’s automatically selected, plus the name of the current ESXi is automatically shown. This is good as some customers could not find the property.
- You also do not need to clear the metric chart, which is not intuitive.
- The dashboard has been rearranged. It is neater and easier.
Capacity: Clusters Capacity
- The table now provides an easier way to see capacity across clusters. The limitation of the health chart was that it does not scale if you have lots of clusters. You also need to see 3 months of data, which can be hard to read since there are thousands of data points.
- Clicking on the table lets you drill down. I’ve now added current allocation, as some customers do not overcommit RAM.
- I’ve removed 2 dashboards (Overcommit Clusters and Non-Overcommit Clusters) as I found almost all customers mix allocation and utilization: they overcommit CPU but do not overcommit RAM. To me, this is actually the right thing to do. The reason is that most VMs use 2 – 8 GB of RAM for each vCPU, so a 32 vCPU VM needs somewhere between 64 and 256 GB of RAM. While Intel Xeon Platinum has 24 cores, you may find the premium too high. I see most ESXi hosts use the 16-core Xeon, giving a total of 32 cores across 2 sockets. So if you do not overcommit CPU, all you need is 64 – 256 GB of RAM. Buying 512 GB is simply a waste.
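The RAM sizing reasoning in the last bullet can be written down as a quick calculation. This is only a sketch of that rule of thumb; the function and parameter names are hypothetical, and the 2 – 8 GB per vCPU range is the one quoted above.

```python
def ram_range_gb(vcpus, gb_per_vcpu_low=2, gb_per_vcpu_high=8):
    """Return the (low, high) RAM estimate in GB for a given number of vCPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return vcpus * gb_per_vcpu_low, vcpus * gb_per_vcpu_high

# A host with 2 sockets x 16 cores and no CPU overcommit carries at most 32 vCPUs.
physical_cores = 2 * 16
low, high = ram_range_gb(physical_cores)
print(f"{physical_cores} cores, no CPU overcommit: {low} - {high} GB RAM")
```

This prints a range of 64 - 256 GB, which is why 512 GB on such a host would mostly sit unused.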
Capacity: Idle and Powered Off VMs
- When you upgrade, the numbers you see may differ. This is due to the new capacity engine. I’ve modified the group selection criteria to comply with the new engine.
- The heatmap has been improved. It’s easier to see where these VMs are, as the group names are clearer. The color now draws your attention to those VMs, giving you more bang for the buck.
Capacity: Oversized VM (CPU)
- For Guest OS CPU utilisation, I’m replacing CPU Demand with a new supermetric. As you can see here, the VM CPU Demand counter includes workload that does not come from the VM, which should not be counted. Demand is also affected by frequency scaling and HT, which are not relevant in the context of VM CPU utilization. A VM is consuming a CPU at 100%, regardless of whether the 2nd HT runs or not. The fact that the 2nd HT runs at 100% does not mean the VM utilization is 62.5%. We need to distinguish utilisation from capacity and performance use cases.
- In the table that lists all the VM utilization, I replaced Average utilization with 95th percentile. It gives you more confidence to right size.
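If you wonder where the 62.5% above comes from, it appears to follow from the common accounting in which two busy hyper-threads on one core together deliver about 1.25x the throughput of a single thread. Here is a minimal sketch of that arithmetic; the 1.25 factor is an assumption I’m making for illustration, and the names are hypothetical.

```python
# Assumption: two busy hyper-threads on a core deliver ~1.25x one thread's throughput.
HT_THROUGHPUT_FACTOR = 1.25

def per_thread_share_pct(sibling_thread_busy):
    """Share of a physical core (%) credited to one thread under the assumption above."""
    if sibling_thread_busy:
        return 100 * HT_THROUGHPUT_FACTOR / 2
    return 100.0

print(per_thread_share_pct(False))  # the thread has the core to itself
print(per_thread_share_pct(True))   # the sibling thread is also busy
```

That share is a capacity and performance view. From the Guest OS point of view, the vCPU is 100% busy in both cases, which is the number you want for right sizing.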
Capacity: Oversized VM (RAM)
- We know that RAM is used as cache by the OS. As a result, memory consumption tends to be high. I changed the dashboard to focus on what you can safely claim, which is the Free RAM.
- The heatmap now focuses on Free RAM. The larger the box, the more you can claim. The color now indicates how safe it is to claim it.
NOC: Capacity ESXi Utilization
- I changed ESXi Active to Balloon. It’s a better indicator of ESXi Memory Utilization.
- I merged the overall and detail dashboards into 1.
- Limitation: to see the powered off VMs, you need to select the Datastore heatmap instead of the Datastore cluster. This can be fixed by using a supermetric, but since I’ve created almost 100, I thought I’d give you a chance to practice 😉
Easier Import & Export
- No need to import a dummy policy just to import super metrics, and then delete the policy once imported. You now know exactly which super metrics are imported.
- You can also export the super metrics in bulk. Useful when replicating changes to your non-Prod vR Ops instance.
- No need to manually create XML files for resource interaction. This is now configurable in the UI.
Upgrading Operationalize Your World
If you do not customize Ops Your World dashboards, super metrics, views, etc., then your upgrade is easier. It consists of 2 steps:
- Download and Import Ops Your World. Yes, overwrite your existing ones.
- Enable the super metrics in your Default policy (it is marked with D)
Enable these metrics & properties. They were disabled as the majority of vR Ops users are small companies. Operationalize Your World targets large deployments.
- VM: Guest File System | Total Guest File System Free (GB)
  - Needed to show low disk space as an absolute amount, as % alone does not tell the full picture. 1% of 2 TB and 1% of 50 GB aren’t the same.
- VM: CPU Used
  - Needed to show the individual vCPU. Ideally, you only want this on large VMs. If your environment has more than 10K VMs, you can create a separate policy and enable the supermetric for this group only.
- VM: Summary | Number of Datastores
  - Once you have cleaned up, you can disable it again. Your IaaS architecture & operations policy should either put all VMDKs of a VM in 1 datastore or intentionally separate them. For ease of operations, put them in 1 datastore.
- VM: Memory | VM Reservation
- VM: Heartbeat
  - For better detection of Guest OS uptime. This is somewhat redundant, as we already have the OS Uptime counter.
- VM: CPU Run
  - Used in VM Right Sizing. I’ve developed a new model to measure VM CPU Utilization, and it uses this counter. I’m keen to get real-world feedback from you.
- VM: CPU Overlap
  - Used in VM Right Sizing.
- VM: Guest | Context Swap Rate
  - This is the CPU Context Switch, which is useful in identifying application performance issues.
- ESXi: Temperature metrics
- ESXi: Error Packets Received
  - These ESXi metrics were disabled as the remediation is likely not at the vSphere layer.
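To illustrate why the Guest File System free space metric in the list above is needed in absolute terms, here is a small sketch. The function name and the disk sizes are hypothetical, and 2 TB is treated as 2048 GB.

```python
def free_space_gb(capacity_gb, free_pct):
    """Convert a free-space percentage into an absolute amount in GB."""
    if capacity_gb < 0 or not 0 <= free_pct <= 100:
        raise ValueError("capacity must be >= 0 and free_pct between 0 and 100")
    return capacity_gb * free_pct / 100

# Two hypothetical guest disks that both report 1% free.
disks = {"2 TB disk": 2048, "50 GB disk": 50}
for name, capacity_gb in disks.items():
    print(f"{name}: 1% free = {free_space_gb(capacity_gb, 1):.1f} GB")
```

The same 1% is roughly 20 GB on the large disk but only half a GB on the small one, so the absolute number tells you which one is about to become a problem.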
You can safely delete the XML configuration files as they are no longer required.