I mentioned earlier that vRealize Operations does not simply regurgitate what vCenter has. Some of the vSphere-specific characteristics are not properly understood by traditional management tools. Partial understanding can lead to misunderstanding. vRealize Operations starts by fully understanding the unique behavior of vSphere, then simplifying it by consolidating and standardizing the counters. For example, vRealize Operations creates derived counters such as Contention and Workload, then applies them to CPU, RAM, disk, and network.
Let’s take a look at one example of how partial information can be misleading in a troubleshooting scenario. It is common for customers to invest in an ESXi host with plenty of RAM. I’ve seen hosts with 256 to 512 GB of RAM. One reason behind this is the way vCenter displays information. In the following screenshot, vCenter is giving me an alert. The host is running on high memory utilization. I’m not showing the other host, but you can see that it has a warning, as it is high too. The screenshots are all from vCenter 5.0 and vCenter Operations 5.7, but the behavior is still the same in vCenter 5.5 Update 2 and vRealize Operations 6.0.
I’m using vSphere 5.0 and vCenter Operations 5.x for the screenshots because they illustrate the point I made in Chapter 1 of the book: vCloud Suite and VMware SDDC products change rapidly. I have verified that the behavior is still the same in vSphere 5.5 U2.
The first step is to check if someone has modified the alarm by reducing the threshold. The next screenshot shows that utilization above 95% will trigger an alert, while utilization above 90% will trigger a warning. The threshold has to be breached for at least 5 minutes. The alarm thresholds are suitably high, so we will assume the alert genuinely indicates high utilization on the host.
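To make the alarm rule concrete, here is a minimal Python sketch of how such a threshold could be evaluated. It is not vCenter’s actual implementation; the function name, the one-sample-per-minute interval, and the rule that every sample in the window must exceed the threshold are my assumptions for illustration.

```python
from typing import Sequence


def alarm_state(samples: Sequence[float],
                warn: float = 90.0,
                alert: float = 95.0,
                hold: int = 5) -> str:
    """Classify the most recent utilization samples.

    samples: memory utilization in percent, oldest first,
             assumed to be one sample per minute (hypothetical interval).
    hold:    number of consecutive samples that must breach a threshold.
    """
    recent = list(samples[-hold:])
    if len(recent) < hold:
        # Not enough history to say the threshold was breached long enough.
        return "normal"
    if all(s > alert for s in recent):
        return "alert"
    if all(s > warn for s in recent):
        return "warning"
    return "normal"


print(alarm_state([96.0, 96.5, 97.0, 96.2, 98.1]))
print(alarm_state([96.0, 94.0, 97.0, 96.2, 98.1]))
```

The first call stays above 95% for all five samples and is classified as an alert; the second dips to 94% once, so it only qualifies as a warning.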
Let’s verify the memory utilization. I’m checking both the hosts as there are two of them in the cluster. Both are indeed high. The utilization for vmsgesxi006 has gone down in the time taken to review the Alarm Settings tab and move to this view, so both hosts are now in the Warning status.
Now we will look at the vmsgesxi006 specification. From the following screenshot, we can see it has 32 GB of physical RAM, and RAM usage is 30747 MB. It is at 93.8% utilization.
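The 93.8% figure can be checked with simple arithmetic. The sketch below assumes that 32 GB means exactly 32 × 1024 MB; the physical memory that vCenter actually reports may differ slightly, and the function name is mine.

```python
MB_PER_GB = 1024


def utilization_pct(used_mb: float, total_gb: float) -> float:
    """Return memory utilization as a percentage of total physical RAM."""
    return 100.0 * used_mb / (total_gb * MB_PER_GB)


# Figures from the host summary: 30747 MB used on a 32 GB host.
print(round(utilization_pct(30747, 32), 1))  # 93.8
```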
Since all the numbers shown in the preceding screenshot are refreshed within minutes, we need to check with a longer timeline to make sure this is not a one-time spike. So let’s check for the last 24 hours. The next screenshot shows that the utilization was indeed consistently high. For the entire 24-hour period, it has consistently been above 92.5%, and it hits 95% several times. So this ESXi host was indeed in need of more RAM.
Deciding whether to add more RAM is complex; there are many factors to be considered. There will be downtime on the host, and you need to do it for every host in the cluster since you need to maintain a consistent build cluster-wide. There will be lots of vMotion as each host needs to be shut down. Because the ESXi is highly utilized, I should increase the RAM significantly so that I can support more VMs or larger VMs. Buying bigger DIMMs may mean throwing away the existing DIMMs, as there are rules restricting the mixing of DIMMs. Mixing DIMMs also increases management complexity. The new DIMM may require a BIOS update, which may trigger a change request. Alternatively, the large DIMM may not be compatible with the existing host, in which case I have to buy a new box. So a RAM upgrade may trigger a host upgrade, which is a larger project.
Before jumping into a procurement cycle to buy more RAM, let’s double-check our findings. It is important to ask two questions: what is the host used for, and who is using it?
In this example scenario, we examined a lab environment, the VMware ASEAN lab. Let’s check out the memory utilization again, this time with the context in mind. The preceding graph shows high memory utilization over a 24-hour period, yet no one was using the lab in the early hours of the morning! I am aware of this as I have been the lab administrator for the past 6+ years!
We will now turn to vCenter Operations for an alternative view. The following screenshot from vCenter Operations 5 tells a different story. CPU, RAM, disk, and network are all in the healthy range. Specifically for RAM, it has 97% utilization but 32% demand. Note that the Memory chart is divided into 2 parts. The upper one is at the ESXi level, while the lower one shows individual VMs in that host.
The upper part is in turn split into 2. The green rectangle (Demand) sits on top of a grey rectangle (Usage). The green rectangle shows a healthy figure at around 10 GB. The grey rectangle is much longer, almost filling the entire area.
The lower part shows the hypervisor and the VMs’ memory utilization. Each little green box represents 1 VM.
On the bottom left, note the KEY METRICS section. vCenter Operations 5 shows that Memory | Contention is 0%. This means none of the VMs running on the host is contending for memory. They are all being served well!
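The gap between the grey and green rectangles can be put into rough numbers. This is only a sketch using the figures read off the screenshot (32 GB host, 97% usage, about 10 GB demand); the function and field names are hypothetical and are not vCenter Operations counters.

```python
def memory_breakdown(total_gb: float, usage_pct: float, demand_gb: float) -> dict:
    """Split host RAM into memory in use and memory actually demanded."""
    used_gb = total_gb * usage_pct / 100.0
    return {
        "used_gb": used_gb,                           # the grey rectangle (Usage)
        "demand_pct": 100.0 * demand_gb / total_gb,   # the green rectangle (Demand)
        "mapped_but_idle_gb": used_gb - demand_gb,    # mapped, but not demanded
    }


result = memory_breakdown(total_gb=32, usage_pct=97, demand_gb=10)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these inputs, demand works out to roughly 31% of the host, close to the 32% shown in the screenshot, while about 21 GB is mapped to VMs without being demanded.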
I shared earlier that the behavior remains the same in vCenter 5.5. So, let’s take a look at how memory utilization is shown in vCenter 5.5.
The next screenshot shows the counters provided by vCenter 5.5. This is from a different ESXi host, as I want to provide you with a second example. It is not related to the first example at all. This host has 48 GB of RAM. About 26 GB has been mapped to VM or VMkernel, which is shown by the Consumed counter (the highest line in the chart; notice that the value is almost constant). The Usage counter shows 52% because it is derived from Consumed. The active memory is a lot lower, as you can see from the line at the bottom. Notice that the Active line is not a simple straight line, as the value goes up and down, while Usage stays almost flat like Consumed. This shows that the Usage counter follows the Consumed counter, not the Active counter.
Notice that the ballooning is 0, so there is no memory pressure for this host. Can you explain why? It is because Consumed is below the threshold.
At this point, some readers might wonder whether that’s a bug in vCenter. No, it is not. ESXi is simply taking full advantage of the extra RAM. Windows 7 and Windows Server 2008 behave in a similar way.
There are situations in which you want to use the consumed memory and not the active memory. In fact, some applications may not run properly if you use active memory. I will cover this when we discuss memory counters in further chapters. Also, technically, it is not a bug as the data it gives is correct. It is just that additional data will give a more complete picture since we are at the ESXi level and not at the VM level. vRealize Operations distinguishes between the active memory and consumed memory and provides both types of data. vCenter uses the Consumed counter for utilization for the ESXi host.
As you will see later in this book, vCenter uses the Active counter for VM utilization. So the Usage counter has a different formula in vCenter depending on the object. This makes sense as they are at different levels. In my book I explain these 2 levels.
vRealize Operations uses the Active counter for utilization. Just because a physical DIMM on the motherboard is mapped to a virtual DIMM in the VM, it does not mean it is used (read or write). You can use that DIMM for other VMs and you will not incur (for practical purposes) performance degradation. It is common for Microsoft Windows to initialize pages upon boot with zeroes, but never use them subsequently.
Having said all the above, none of the counters and information above should be used in isolation. You should always check the VMs. It’s a common mistake to look at ESXi RAM counters (be it Consumed, Active, Usage, or Workload) and conclude there is a RAM performance or RAM capacity issue. You should always check the VMs first. Are they having Memory Contention? If not, there is no RAM performance issue. What about Capacity? It’s the same principle. Take into account performance first, before you look at utilization. See this series of articles on what super metric you need.
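The order of checks described above can be sketched in a few lines of Python. This is an illustration of the reasoning, not a vRealize Operations feature: the class, the function, and the 1% contention limit are all hypothetical, and you should pick a limit that suits your own environment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmMemory:
    name: str
    contention_pct: float  # Memory Contention of the VM, in percent


def triage_host(vms: List[VmMemory],
                host_usage_pct: float,
                contention_limit: float = 1.0) -> str:
    """Check VM contention first; look at host utilization only afterwards."""
    contended = sorted(vm.name for vm in vms if vm.contention_pct > contention_limit)
    if contended:
        return "performance issue on: " + ", ".join(contended)
    if host_usage_pct > 90.0:
        # High Usage with no contention: the RAM is mapped, not necessarily needed.
        return "no performance issue; high usage alone does not justify more RAM"
    return "no performance issue"


lab_vms = [VmMemory("vm-a", 0.0), VmMemory("vm-b", 0.0)]
print(triage_host(lab_vms, host_usage_pct=97.0))
```

For a host like the lab example, with 97% usage and 0% contention, the sketch reports no performance issue, which matches the conclusion reached from the vCenter Operations view.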
For further information on this topic, I would recommend reviewing Kit Colbert’s presentation on Memory in vSphere at VMworld, 2012. The content is still relevant for vSphere 5.x. The title is Understanding Virtualized Memory Performance Management and the session ID is INF-VSP1729. If the link has changed, here is the link to the full list of VMworld 2012 sessions.
Not all performance management tools understand this vCenter-specific characteristic. They would have given you a recommendation to buy more RAM.
Now that you know your ESXi does not need so much RAM, you can consider a single socket ESXi.
[20 March 2016: added 3rd example]
User Jengl at VMware communities posted a question here. He wanted to know why, when “we patched the half of our cluster we had high consumed/usage of memory and the vCenter started to initiate balloon, compress and finally swap on the ESXi Hosts. I was not expecting that, because active memory was only a small percentage of consumed.”
He shared his screenshot, which I copied here so I can explain it.
In the above screenshot, the object looks like an ESXi host (and not a cluster or VM). We can see that there was a spike at around 10:50 am. The Memory Usage counter hit 98.92%. That’s high, as that’s an average over 5 minutes. It probably touched 100% within those 5 minutes. Memory Workload, which is based on Active, was low at 19%. This is possible. The VMs were not actively using the RAM. Contention rose to 1.82%, which indicates some VMs were contending for RAM.
The whole situation went back to normal. Contention disappeared pretty quickly, and Memory Usage went down. It’s hard to see from the chart, but it looks like it dropped below 95%.
Based on the above, I’d expect balloon to have kicked in at 10:50 am too. It would then taper off. Because the VMs were not actively using the RAM, I’d not expect the value to drop to 0.
Guess what? The screenshot below confirmed that.
I hope that’s clear.