
Right sizing VM Memory without using an agent

This post continues from the Operationalize Your World post. Do read it first so you get the context.

The much-needed visibility into Guest OS memory is finally possible in vSphere. As part of the new features in vR Ops 6.3, you can now get Guest OS RAM metrics without using an agent. As long as you have vSphere 6.0 U1 or later, and the VM is running Tools 10.0.0 or later, you are set. Thanks to Gavin Craig for pointing this out. The specific feature needed in Tools is called the Common Agent Framework, which removes the need for multiple agents in a VM.

As a result, we can now update the guidance for RAM Right Sizing:

For Apps that manage their own RAM, use metrics from the Apps.
For others, use metrics from the Guest OS.
Use vR Ops Demand if you have no Guest OS visibility. Do not use vCenter Active.
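
To make that decision flow concrete, here is a minimal Python sketch of the logic. The function name and its inputs are illustrative assumptions, not vR Ops or vCenter API names.

```python
# Illustrative sketch of the RAM right-sizing guidance above.
# Names and inputs are placeholders, not actual vR Ops or vCenter fields.

def pick_memory_counter(manages_own_ram: bool, has_guest_os_metrics: bool) -> str:
    """Return which memory counter to base right sizing on."""
    if manages_own_ram:
        # JVM heap, database buffer pool, etc.: ask the app, not the OS.
        return "application metrics (heap, buffer pool, and so on)"
    if has_guest_os_metrics:
        # vSphere 6.0 U1+ with Tools 10.0.0+ exposes Guest OS memory counters.
        return "Guest OS memory metrics"
    # No in-guest visibility: fall back to vR Ops Demand, not vCenter Active.
    return "vR Ops Memory Demand"

print(pick_memory_counter(manages_own_ram=False, has_guest_os_metrics=True))
```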

Examples of applications that manage their own RAM are the JVM and databases. If you use the Guest OS counter for these, you can end up with the wrong size and make the situation worse. Manny Sidhu provides a real example here: the application vendor asked for 64 GB RAM when the application was only actively using 16 GB, as he shared in the vCenter screenshot below.

For apps that do not manage their own RAM, you should use Guest OS data. The table below compares 63 VMs running a variety of Microsoft Windows versions. A good proportion of them are just idle, as this is a lab, not a real-life production environment.

  1. What conclusion do you arrive at? I’ve added a summary at the bottom of the list.
  2. How do VM Consumed, VM Active, and Guest OS Used compare?

[Image: comparison-windows]

And the table below shows the comparison for Linux.

What do you spot? What’s your conclusion? How does this change your capacity planning? 😉

[Image: comparison-linux]

Here is the summary for both OSes. The total is 101 VMs, not a bad sample size. I’ve also added a comparison. Notice that something does not add up?

[Image: total]

To help you compare further, here is a vR Ops heatmap showing all the VMs.

[Image: compare]

I created a super metric that compares the Guest OS metric with VM Active. As expected, Guest OS is higher, as it takes cache into account. It’s not just Used; Windows does use RAM as cache (Linux does too, via its page cache).

The super metric is a ratio: Guest OS divided by VM Active. I set 0 as black, 5 as yellow, and 10 as red. Nothing is black, as VM Active is lower than Guest OS in all samples.
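
If you want to sanity-check the ratio outside vR Ops, here is a small sketch of the same calculation. The sample values are made up, and the color bands simply mirror the 0/5/10 breakpoints above; inside vR Ops this is a super metric plus heatmap configuration, not code.

```python
# Ratio of Guest OS memory to VM Active, with the heatmap breakpoints above.
# Sample values are made up; in vR Ops this is a super metric, not code.

def guest_vs_active_ratio(guest_os_mb: float, vm_active_mb: float) -> float:
    return guest_os_mb / vm_active_mb if vm_active_mb else float("inf")

def heatmap_band(ratio: float) -> str:
    # 0 maps to black, 5 to yellow, 10 (and above) to red.
    if ratio >= 10:
        return "red"
    if ratio >= 5:
        return "yellow to red"
    return "black to yellow"

ratio = guest_vs_active_ratio(guest_os_mb=6500, vm_active_mb=1200)
print(f"ratio = {ratio:.1f}, band = {heatmap_band(ratio)}")
```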

Conclusion

  • VM Consumed is always near 100%, even on VMs that have been idle for days. This is expected, given its nature as a cache. Do not use it for right sizing.
  • Windows memory management differs from Linux’s. Notice its VM Consumed is higher (94%) than Linux’s (82%). My guess is that Windows writing zeroes to its memory during boot creates this.
  • VM Active can be too aggressive, as it does not take cache into account. vR Ops adds the Demand counter, which makes the number less aggressive.
  • Guest OS Used + Cache is much greater than VM Active or VM Demand: 69% vs 15% vs 31%, respectively.
  • Guest OS Used + Cache + Free does not add up to 100%. In the sample, it only adds up to 83%.

Based on the above data, I’d prefer to use Guest OS, as it takes cache into account.

  • Side reading, if you need more info:
    Refer to this for Windows 7 metrics, and this for Windows 2008 metrics. 
    This is a simple test to understand Windows 7 memory behaviour.

You can develop a vR Ops dashboard like the one below to help you right-size based on Guest OS data. Notice it takes a similar approach to the dashboard for right-sizing CPU.

[Image: vm-right-sizing-memory]

The dashboard answers the following questions:

  • How many large VMs do I have? What’s the total RAM among them?
    • Answered by the scoreboard widget. It only counts large VMs (the default is >24 GB RAM) that are powered on and have the Guest OS metric; see the sketch after this list.
  • Are the large VMs utilizing the RAM given to them?
    • Answered by the 2 line charts:
      • Maximum Guest OS Used (%) in the group
      • Average Guest OS Used (%) in the group
    • In general, Guest OS Used can hit 100% as Windows/Linux takes advantage of RAM as cache. Hence, you see that the peak of Used is high.
  • Where are these large VMs located?
    • Answered by the heat map.
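
Here is a minimal sketch of the filtering behind the scoreboard, assuming an inventory you have already pulled out of vR Ops. The field names and sample VMs are made up for illustration.

```python
# Sketch of the scoreboard's logic: large, powered-on VMs that report
# Guest OS memory. Field names and sample data are illustrative only.

LARGE_VM_GB = 24  # the dashboard's default threshold

vms = [
    {"name": "db01",  "ram_gb": 64, "powered_on": True, "guest_used_gb": 18.0},
    {"name": "app01", "ram_gb": 32, "powered_on": True, "guest_used_gb": None},
    {"name": "web01", "ram_gb": 8,  "powered_on": True, "guest_used_gb": 3.5},
]

large_vms = [
    vm for vm in vms
    if vm["ram_gb"] > LARGE_VM_GB
    and vm["powered_on"]
    and vm["guest_used_gb"] is not None  # only VMs with Guest OS data
]

total_ram_gb = sum(vm["ram_gb"] for vm in large_vms)
print(f"{len(large_vms)} large VM(s), {total_ram_gb} GB RAM configured among them")
```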

The dashboard excludes all VMs that do not have Guest OS RAM data. Since not every VM reports that data, the first step is to create a group that only contains VMs with the data. Use the example below.

[Image: group]

You should also manually exclude apps that manage their own memory.

Notice the Group Type is VM Types. Follow that exactly, including the case!

Once you have created the group type and the group, the next step is to download the following:

  • Super metrics. Don’t forget to enable them!
  • Views
  • Dashboard

You should download the dashboard, views, super metrics, and the rest of the Operationalize Your World package.

You can customize the dashboard. Do not be afraid to experiment with it. It does not modify any actual metric or object, as the dashboard is just a presentation layer.

Take the scoreboard, for example. We can add color coding to quickly show the amount of RAM wasted. If you have >1 TB of RAM wasted, you want it to show red.

[Image: customize]

To do that, it’s a matter of editing the scoreboard widget. I’ve added thresholds, so it changes from green to yellow when I cross 500 GB, to orange when I cross 750 GB, and to red when I cross 1 TB.

[Image: scoreboard]
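
As a rough illustration of that threshold logic, here is a sketch; the helper function and the wasted-RAM figures are hypothetical, only the 500 GB / 750 GB / 1 TB breakpoints come from the widget settings above.

```python
# Sketch of the scoreboard color thresholds described above.
# The wasted-RAM figures below are placeholders, not real measurements.

def scoreboard_color(wasted_ram_gb: float) -> str:
    if wasted_ram_gb > 1024:   # more than 1 TB wasted
        return "red"
    if wasted_ram_gb > 750:
        return "orange"
    if wasted_ram_gb > 500:
        return "yellow"
    return "green"

for wasted in (300, 600, 800, 1200):
    print(f"{wasted} GB wasted -> {scoreboard_color(wasted)}")
```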

Hope that helps. I’m keen to know how it helps you right-size with confidence, now that you have in-guest visibility.

Capacity management at the VM Level

This blog post is loosely adapted from my book, VMware Performance and Capacity Management, published by Packt Publishing.

In the previous post, I shared that capacity management in SDDC needs to be split into two:

  1. VM level
  2. Infrastructure level

In this post, I will cover capacity management at the VM level. I shared earlier that it should be done by the application team (which is the customer of the infrastructure team).

There are some tips you can give to your customers and policies you can set to keep things simple. For a start, keep the building blocks simple: one VM, one OS, one application, and one instance, as shown in the following diagram. So avoid having one OS running the web, app, and DB servers, or one Microsoft SQL Server running five database instances. The workload becomes harder to predict when you cannot isolate it. For a production environment, size for the peak workload: a month-end VM needs to be sized based on the month-end workload. For a non-production environment, you may want to tell the application team to opt for a smaller VM, because the vSphere cluster where the VM is running is oversubscribed. A large VM may not get the CPU it asks for if it asks for too much.

[Image: 1 VM, 1 OS, 1 App diagram]

Be careful with VMs that have two distinct peaks: one for CPU resources and another for memory resources. I have seen this with a telecommunications client running Oracle Hyperion. For example, the first peak needs 8 vCPUs and 12 GB vRAM, and the second peak needs 2 vCPUs and 48 GB vRAM. In this case, the application team’s tendency is to size for 8 vCPUs and 48 GB vRAM. This results in an unnecessarily large VM, which can result in poor performance for both peaks. It is likely that there are two different workloads running in the VM, which should be split into two VMs.
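
A quick back-of-the-envelope comparison makes the point, using the Hyperion figures from the example above; the variable names are just for illustration.

```python
# The two peaks from the example above, versus sizing for the worst of both.
peak_cpu = {"vcpu": 8, "vram_gb": 12}   # CPU-heavy peak
peak_ram = {"vcpu": 2, "vram_gb": 48}   # memory-heavy peak

oversized = {
    "vcpu": max(peak_cpu["vcpu"], peak_ram["vcpu"]),           # 8 vCPUs
    "vram_gb": max(peak_cpu["vram_gb"], peak_ram["vram_gb"]),  # 48 GB
}

print("single-VM sizing:", oversized)
print("unused vRAM during the CPU peak:", oversized["vram_gb"] - peak_cpu["vram_gb"], "GB")
print("idle vCPUs during the memory peak:", oversized["vcpu"] - peak_ram["vcpu"])
```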

Size correctly. Educate the application team that oversizing results in slower performance in the virtual world. Although I encourage standardizing VM sizes to keep life simple, you should be flexible for large or extra-large cases. For example, once you pass 8 vCPUs, you need to consider every additional vCPU carefully: ensure the VM really needs it, and ensure the application can indeed take advantage of the extra threads. You also need to verify that the underlying ESXi host has sufficient physical cores, as it will affect your consolidation ratio, and hence your capacity management. You may see an ESXi host that is largely idle while the VMs on it are not performing, which undermines your confidence about adding VMs.

At the VM level, you need to monitor the following five components for the infrastructure portion:

  • Virtual CPU
  • Virtual RAM
  • Virtual network
  • Virtual disk IOPS
  • Usable disk capacity left in the Guest OS.

Getting vCPU and vRAM into a healthy range requires finding a balance. Undersizing leads to poor performance, and oversizing leads to monetary waste as well as poor performance. The actual healthy range depends on your expected utilization, and it normally varies from tier to tier. It also depends on the nature of the workload (online versus batch). For example, in tier 1 (the highest tier), you will have a lower range for an OLTP type of workload, as you do not want to hit 100 percent at peak. The overall utilization will be low because you are catering for a spike. For batch workloads, you normally tolerate a higher range for long-running batch jobs, as they tend to consume all the resources given to them. In a non-production environment, you normally tolerate a higher range, as the business expectation is lower (because they are paying a lower price).

Generally speaking, virtual network is not something that you need to worry about from a capacity point of view. You can create a super metric in vRealize Operations that tracks the maximum of all of your vNIC utilization from all VMs. If the maximum is, say, 80 percent, then you know that the rest of the VMs are lower than that. You can then plot a chart that shows this peak utilization in the last three months. We will cover this in more detail in one of the use cases discussed in the final chapter.
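A tiny sketch of the idea, with made-up utilization numbers standing in for the super metric's input:

```python
# "Maximum vNIC utilization across all VMs" as plain Python.
# In vR Ops this is a super metric; the sample data here is made up.

vnic_utilization_pct = {
    "web01": 12.0,
    "db01": 35.5,
    "app01": 80.0,  # the busiest vNIC in this collection cycle
}

peak_vm, peak_pct = max(vnic_utilization_pct.items(), key=lambda kv: kv[1])
print(f"Peak vNIC utilization: {peak_pct}% on {peak_vm}")
# If the peak is only 80 percent, every other VM is below it, so the virtual
# network is unlikely to be a capacity constraint at the VM level.
```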

You should monitor the usable disk capacity left inside the Guest OS. Although vCenter does not provide this information, vRealize Operations does—provided your VM has VMware Tools installed (which it should have as a part of best practice).

You should use Reservation sparingly, as it impacts the HA slot size, increases management complexity, and prevents you from oversubscribing. In tier 1, where there is no oversubscription because you are guaranteeing resources to every VM, reservation becomes unnecessary from a capacity management point of view. You may still use it if you want a faster boot, but that is not a capacity consideration. In tier 3, where cost is the number-one factor, using Reservation will prevent you from oversubscribing, which negates the purpose of tier 3 in the first place.

You should avoid using Limit as it leads to unpredictable performance. The Guest OS does not know that it is artificially limited.

I hope you find this useful. I will cover capacity management at the infrastructure level in the next post.