
Right-sizing Virtual Machines

This post is part of the Operationalize Your World program. Do read that post first to get the context.

Over-provisioning is a common malpractice in real-life SDDCs. P2V is a common cause, as the VM was simply matched to the physical server's size. Another cause is conservative sizing by the vendor, which is then padded further by the application team.

I’ve seen large enterprise customers try to do a mass adjustment, downsizing many VMs, only to have the effort backfire when performance suffers.

Since performance is critical, you should address right-sizing from that angle. Educate VM owners that right-sizing actually improves performance. The carrot is a lot more effective than the stick, especially for those with money. Saving money is a weak argument in most cases, as the VM owners have already paid for their VMs.

Why oversized VMs are bad

Use the points below to explain why an oversized VM is bad for its owner:

  • Boot time.
    • If a VM has no memory reservation, vSphere creates a swap file the size of the configured RAM at power-on. The bigger the VM, the longer it takes to create this file, especially on slow storage.
  • vMotion
    • The bigger the VM, the longer it takes to do vMotion and Storage vMotion.
  • NUMA
    • This happens when the VM cannot fit into a single socket.
    • It also happens when the VM’s active vCPUs exceed the cores available on a single socket at that point in time. For example:
      • You have a 12-vCPU VM on a host with 12-core sockets. That is fine if it is the only VM running on the box. In reality, other VMs compete for resources. If the 12 vCPUs want to run but Socket 0 has only 6 free cores and Socket 1 has 6 free cores, the VM will be spread across both sockets. A small sizing check is sketched after this list.
  • Co-Stop and Ready
    • A large VM experiences higher Co-Stop and Ready time. Even if the application does not use all the vCPUs, the Guest OS still expects all of them to be scheduled by the hypervisor.
  • Snapshot
    • Snapshots take longer, especially if the memory state is included.
  • Processes
    • The Guest OS is not aware of the NUMA layout of the physical motherboard and thinks it has a uniform memory structure. It may move processes among its vCPUs, assuming there is no performance impact. If the vCPUs are spread across different NUMA nodes, for example a 20-vCPU VM on a 2-socket box with 20 cores in total, it can experience this ping-pong effect.
  • Visibility
    • There is a lack of performance visibility at the individual vCPU or virtual core level. The majority of counters are at the VM level, which is an aggregate of all its vCPUs, regardless of whether you use virtual sockets or virtual cores.
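
To make the NUMA point concrete, here is a minimal sketch of the sizing check described above. It is plain Python, not a VMware tool, and the host values are assumptions used purely for illustration:

```python
# Minimal sketch: does a proposed vCPU count fit inside one NUMA node?
# The host values below are assumptions for illustration only.

def fits_single_numa_node(vcpus: int, cores_per_socket: int) -> bool:
    """Return True if all vCPUs can, in principle, be scheduled on one socket."""
    return vcpus <= cores_per_socket

# Assumed host: 2 sockets x 10 cores (20 cores, 40 threads with hyper-threading)
cores_per_socket = 10

for vcpus in (8, 10, 12, 20):
    verdict = ("fits in one NUMA node"
               if fits_single_numa_node(vcpus, cores_per_socket)
               else "spans NUMA nodes")
    print(f"{vcpus:2d}-vCPU VM: {verdict}")
```

Even a VM that fits on paper can still be spread across sockets at run time when other VMs are competing for cores, which is exactly the 12-vCPU scenario described above.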

Impacts of Large VMs

  • Large VMs are bad not just for themselves; they can impact other VMs, large or small. The ESXi VMkernel scheduler has to find available cores for all of their vCPUs, even when those vCPUs are idle. As a result, other VMs may be migrated from core to core, or socket to socket. There is a counter in esxtop that tracks these migrations.
  • Large VMs tend to perform more slowly, as ESXi may not have enough free cores to schedule all of their vCPUs at the same time. The CPU Co-Stop counter tracks this.
  • Large VMs reduce the consolidation ratio. You can pack more vCPUs with small VMs than with big ones.

For a Service Provider, this actually hits your bottom line. Unless you have progressive pricing, you make more money with smaller VMs because you can sell more vCPUs. For example, if you have a host with 2 sockets, 20 cores and 40 threads, you can have either:

  • 1x 20-vCPU VM with moderately high utilization
  • 40x 1-vCPU VMs with moderately high utilization

In the above example, you sell 20 vCPUs versus 40 vCPUs.

Approach to Right-sizing

Focus on large VMs

  • Every downsize is a battle because you are changing the paradigm to “less is more”. Plus, it requires downtime.
  • Downsizing from 4 vCPUs to 2 does not buy much nowadays with >20-core Xeons.
  • No one likes to give up what they have been given, especially if they were given little. By focusing on the large VMs, you spend 20% of the effort to get 80% of the result.

Focus on CPU first, then RAM

  • Do not change both at the same time.
    • It’s hard enough to ask the apps team to reduce CPU, so asking for both will be even harder.
    • If there is a performance issue after you reduce both CPU and RAM, you have to bring both back up, even though the issue may have been caused by just one of them.
  • RAM is generally plentiful because RAM is cheap.
  • RAM is hard to measure, even with agents. If the application manages its own memory, you need application-specific counters. Applications such as the Java VM and databases do not expose to Windows/Linux how they manage their RAM.

Disk right-sizing needs to be done at the Guest OS partition level

  • VM owners won’t agree to you resizing their partitions.
  • Windows and Linux use different partition names, which can make reporting across OSes difficult.

Technique

The technique we use for CPU and RAM is the same. I’ll use CPU as the example.

The first thing you need to do is create a dynamic group that captures all the large VMs. Create one group for CPU and one for RAM.

Once you have created the group, the next step is to create super metrics. You should create two:

  1. Maximum CPU Workload among these large VMs
    • You expect this number to hover around 80%, as it only takes one busy VM among all the large VMs for the line chart to spike.
    • If you have many large VMs, one of them tends to have high utilization at any given time.
    • If this number is low, that means severe wastage!
  2. Average CPU Workload among these large VMs
    • You expect this number to hover around 40%, indicating the sizing was done correctly.
    • If this line stays below 25% for an entire month, then the large VMs as a group are oversized.

You do not need to create the Minimum; there is bound to be a VM that is idle at any given time.
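
To illustrate the logic behind these two super metrics, here is a minimal sketch in plain Python. It assumes you have per-VM CPU Workload samples taken at the same points in time (the VM names and values are made up); in vRealize Operations you would express the same thing as super metrics over the group, not as code:

```python
# Sketch of the "Maximum / Average CPU Workload among large VMs" logic.
# The sample data is made up for illustration.

cpu_workload = {
    # VM name: CPU Workload (%) at four consecutive sample times
    "big-db-01":  [35, 82, 40, 30],
    "big-app-02": [20, 25, 78, 22],
    "big-erp-03": [15, 18, 20, 90],
}

# Group the samples by time slot, then take the max and average per slot
samples = list(zip(*cpu_workload.values()))

max_series = [max(slot) for slot in samples]
avg_series = [round(sum(slot) / len(slot), 1) for slot in samples]

print("Maximum CPU Workload:", max_series)   # healthy: spikes toward 80-100%
print("Average CPU Workload:", avg_series)   # healthy: hovers around 40%
```

Plot both series as line charts on the dashboard; the gap between them shows how much headroom the group really has.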

In a perfect world, if all the large VMs were right-sized, which scenario would you see?

All good. The two line charts show us the degree of over-provisioning. Can you spot a limitation?

It lies in the counter itself. We cannot distinguish whether the CPU usage is due to real demand or not. Real demand comes from the application; non-real demand comes from the infrastructure, such as:

  • Guest OS reboot
  • AV full scan.
  • A hung process. This can result in 100% CPU demand. How do you distinguish a runaway process?

If your Maximum line is constantly ~100%, you may have a runaway process.

Now that you’ve got the technique, you are ready to implement it. Follow these two blog posts:

  1. CPU right-sizing
  2. RAM right-sizing

Which VDI User needs more CPU or RAM – Part 2?

In the previous post, I shared how we can quickly answer fundamental questions such as:

  • Are there any users out there who need more RAM or CPU?
  • If yes, who are they and how short are they? At what time, and how often, did this happen?

We covered CPU; let’s cover RAM now 🙂

RAM is not so simple. As you can see here, Cached Memory and Free Memory are not visible outside the Guest OS. This means the counter you use should ideally come from the Guest, not from the hypervisor. The post here shows that the two can differ.

The good thing in VDI is that Horizon View comes with the agent out of the box. The vRealize Operations for Horizon agent has been integrated into the base Horizon View agent, so there is no need to deploy the vRealize Operations End Point agent.

Now, there are two ways we can determine when a user needs more RAM:

  • RAM usage is high.
  • Available RAM is low

I’m using the second one as it’s easier to interpret. If I show that RAM usage is 13574 MB, you still need to know the total configured RAM (e.g. 16 GB) and then do the subtraction. Well, that just takes you back to Available RAM 🙂

Since we have lots of VDI users, the first thing we need to do is ensure that no one’s utilization is too high and that no one runs out of available RAM. A super metric comes in handy here. To find out if anyone is running out of Available RAM, you can create the super metric below.

Available RAM
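
As a rough illustration of what that super metric computes, here is the same “does anyone run out of Available RAM?” check in plain Python. The desktop names, values and threshold are made up for illustration:

```python
# Sketch of the "minimum Available RAM across all VDI desktops" logic.
# Desktop names, values and the threshold are made up for illustration.

available_ram_mb = {
    "vdi-user-01": 5200,
    "vdi-user-02": 900,    # this desktop is running low
    "vdi-user-03": 3100,
}

threshold_mb = 1024  # assumed "running low" threshold

floor_vm, floor_value = min(available_ram_mb.items(), key=lambda kv: kv[1])
print(f"Lowest Available RAM: {floor_value} MB on {floor_vm}")

low_on_ram = {vm: mb for vm, mb in available_ram_mb.items() if mb < threshold_mb}
print("Desktops below threshold:", low_on_ram or "none")
```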

Once you do that, it’s a matter of showing them as a line chart on the dashboard.

Available RAM 2

You do the same thing for Committed Bytes. Why do I use Committed Bytes and not Memory In Use? Can you guess why?

Memory In Use can easily be determined. It is just Total RAM – Available RAM.

Committed Bytes, on the other hand, does not always go hand in hand with memory usage. See this blog for the explanation. So we complement our Available RAM (MB) with Committed Memory (%). vRealize Operations for Horizon has that metric too.
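
As a quick worked example of why the two counters differ (all numbers below are made up; Committed Bytes and the commit limit are the standard Windows memory counters):

```python
# Made-up Windows counter values, in MB, to show why Committed Bytes
# does not track Memory In Use.
total_ram     = 16384   # configured RAM
available_ram = 2810    # Guest OS "Available" memory
committed     = 18200   # Committed Bytes can exceed physical RAM (backed by the page file)
commit_limit  = 24576   # physical RAM + page file

in_use_mb     = total_ram - available_ram        # what "Memory In Use" reports
committed_pct = committed / commit_limit * 100   # "% Committed Bytes In Use"

print(f"Memory In Use : {in_use_mb} MB of {total_ram} MB")
print(f"Committed     : {committed_pct:.0f}% of the commit limit")
```

A desktop can look comfortable on Memory In Use while its commit charge creeps toward the limit, which is why the two metrics complement each other.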

The two super metrics provide a good overview of the entire environment. We can just look at the two line charts and know at a glance whether everyone is doing well. If not, the list next to them tells us which users are affected. The list uses the standard View widget, which I covered in the previous post.

RAM 1

V4V 6.2 lets you map the user name to the VM name and the Windows name.

Memory

Hope that helps you make sure your VDI users are happy and productive! 🙂

Who snapshot what VM and when

I got a request from a customer to track VM snapshot operations. They need to track creation and deletion; basically, who snapshotted what VM and when. So I tried it in the lab: I created a snapshot, waited a few seconds, then deleted it. You can see the activity in the vSphere Web Client below.

Notice that the snapshot name is not shown in the vCenter task list. In a production environment, you should use a meaningful snapshot name. If you have a naming pattern, you can build a Log Insight query based on it. Let’s see if Log Insight captures the name of the snapshot!

Who snapshot what VM and when: VM snapshot

Where do they show up? Well, the awesome folks behind Log Insight have created an out-of-the-box dashboard for you. Just go to “Virtual Machine – Snapshots” as I did below. Notice that Log Insight has categorized the two events nicely.

VM snapshot - 1

You can drill down into Interactive Analytics. Here is what the events look like. In this example, I’ve modified the chart to make it simpler.

VM snapshot - 2

If you want to know the actual query, here is what it looks like. Yup, just two variables are all you need. In the example below, I’ve also extended the timeline to the past 7 days, as I was curious whether anyone else had taken any snapshots. Good to know no one did.

VM snapshot - create 001

Now… can you guess the snapshot name? It’s in the log above. Hint: I was singing an old song by The Beatles. OK, it wasn’t technically singing; it was a bad attempt at singing 🙂

Wait a minute! We have not yet shown the user who made the changes. To do that, use the vc_username field and add the word Snapshot in the text field. To make it easier to see, use the Field Table. I’ve provided an example below.

VM snapshot - 5
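
If you ever need to do the same filtering outside Log Insight, for example on events you have exported to a file, here is a rough Python sketch of the same idea. The field names (vc_username, vc_vm_name, text) are assumptions modeled on the Log Insight fields mentioned above, and the sample records are made up:

```python
# Rough offline equivalent of the Log Insight filter: who snapshotted
# which VM and when. Field names and sample records are assumptions.

sample_events = [
    {"timestamp": "2016-03-01T10:02:11", "vc_username": "LAB\\administrator",
     "vc_vm_name": "win7-gold", "text": "Task: Create virtual machine snapshot"},
    {"timestamp": "2016-03-01T10:02:45", "vc_username": "LAB\\administrator",
     "vc_vm_name": "win7-gold", "text": "Task: Remove snapshot"},
    {"timestamp": "2016-03-01T10:05:00", "vc_username": "LAB\\svc-backup",
     "vc_vm_name": "db-01", "text": "Task: Power On virtual machine"},
]

# Keep only events whose text mentions a snapshot and that carry a username
for event in sample_events:
    if "snapshot" in event["text"].lower() and event.get("vc_username"):
        print(f'{event["timestamp"]}  {event["vc_username"]:<20}'
              f'{event["vc_vm_name"]:<12}{event["text"]}')
```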

There you go. Now you know who snapshotted what VM and when. Have fun combing the logs with Log Insight. Easier than grep, right 😉 (just kidding!)