
Right-sizing Virtual Machines

This post is part of the Operationalize Your World program. Do read it first to get the context.

Over-provisioning is a common malpractice in real-life SDDCs. P2V migration is a common cause, as the VM was simply matched to the physical server size. Another cause is conservative sizing by the vendor, which was then padded further by the Application Team.

I’ve seen large enterprise customers try to do a mass adjustment, downsizing many VMs, only to have the effort backfire when performance suffers.

Since performance is critical, you should address right-sizing from this angle. Educate VM Owners that right-sizing actually improves performance. The carrot is a lot more effective than the stick, especially for those with money. Saving money is a weak argument in most cases, as VM Owners have already paid for their VMs.

Why oversized VMs are bad

Use the picture below to explain why they are bad for the VM Owner:

  • Boot time.
    • If a VM does not have a reservation, vSphere will create a swap file the size of the configured RAM. The bigger the VM, the longer it takes to create this file, especially on slow storage.
  • vMotion
    • The bigger the VM, the longer it takes to do vMotion and Storage vMotion.
  • NUMA
    • This happens when the VM cannot fit into a single socket.
    • It also happens when the VM’s active vCPUs exceed what is available at that point in time. Example:
      • You have a 12-vCPU VM on a 12-core socket. This is fine if it’s the only VM running on the box. In reality, you have other VMs competing for resources. If all 12 vCPUs want to run, but Socket 0 has 6 free cores and Socket 1 has 6 free cores, the VM will be spread across both sockets.
  • Co-Stop and Ready
    • It will experience higher Co-Stop and Ready time. Even if not all vCPUs are used by the application, the Guest OS still demands that all the vCPUs be provided by the hypervisor.
  • Snapshot
    • Snapshots take longer, especially if a memory snapshot is included.
  • Processes
    • The Guest OS is not aware of the NUMA nature of the physical motherboard, and thinks it has a uniform memory structure. It may move processes between its own CPUs, assuming there is no performance impact. If the vCPUs are spread across different NUMA nodes, for example a 20-vCPU VM on a 2-socket, 20-core box, it can experience this ping-pong effect.
  • Visibility
    • Lack of performance visibility at the individual vCPU or virtual core level. The majority of counters are at the VM level, which is an aggregate of all its vCPUs. It does not matter whether you use virtual sockets or virtual cores.

Impacts of Large VMs

  • Large VMs are also bad for other VMs, not just for themselves. They can impact other VMs, large or small. The ESXi VMkernel scheduler has to find available cores for all their vCPUs, even when those vCPUs are idle. Other VMs may be migrated from core to core, or socket to socket, as a result. There is a counter in esxtop that tracks this migration.
  • Large VMs tend to have slower performance, as ESXi may not have enough free physical cores for all of them at once. They are slower because all their vCPUs have to be scheduled together. The CPU Co-Stop counter tracks this.
  • Large VMs reduce the consolidation ratio. You can pack more vCPUs with smaller VMs than with big VMs.

As a Service Provider, this actually hits your bottom line. Unless you have progressive pricing, you make more money with smaller VMs because you can sell more vCPUs. For example, on a host with 2 sockets, 20 cores and 40 threads, you can have either:

  • 1x 20-vCPU VM with moderately high utilisation
  • 40x 1-vCPU VMs with moderately high utilisation

In the above example, you sell 20 vCPUs vs 40 vCPUs.

Approach to Right-sizing

Focus on large VMs

  • Every downsize is a battle, because you are changing the paradigm to “Less is More”. Plus, it requires downtime.
  • Downsizing from 4 vCPUs to 2 does not buy much nowadays with >20-core Xeons.
  • No one likes to give up what they were given, especially if they were given little. By focusing on the large ones, you spend 20% of the effort to get 80% of the result.

Focus on CPU first, then RAM

  • Do not change both at the same time.
    • It’s hard enough to ask the apps team to reduce CPU, so asking for both will be even harder.
    • If there is a performance issue after you reduce both CPU and RAM, you have to bring both back up, even though the issue was caused by just one of them.
  • RAM in general is plentiful, as RAM is cheap.
  • RAM is hard to measure, even with agents. If the app manages its own memory, you need an application-specific counter. Apps like the Java VM and databases do not expose to Windows/Linux how they manage their RAM.

Disk right-sizing needs to be done at the Guest OS partition level

  • VM Owners won’t agree to you resizing their partitions.
  • Windows and Linux have different partition names. This can make reporting difficult across OSes.

Technique

The technique we use for CPU and RAM is the same. I’ll use CPU as the example.

The first thing you need to do is create a dynamic group that captures all the large VMs. Create one group for CPU and one for RAM.

Once you have created the group, the next step is to create super metrics. You should create two:

  1. Maximum CPU Workload among these large VMs
    • You expect this number to hover around 80%, as it only takes 1 VM among all the large VMs for the line chart to spike.
    • If you have many large VMs, one of them tends to have high utilisation at any given time.
    • If this number is low, that means severe wastage!
  2. Average CPU Workload among these large VMs
    • You expect this number to hover around 40%, indicating sizing was done correctly.
    • If this chart stays below 25% for an entire month, then all the large VMs are oversized.

You do not need to create the Minimum. There is bound to be a VM that is idle at any given time.
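To make the two super metrics concrete, here is a minimal Python sketch of the same calculation, assuming the 5-minute CPU Workload (%) samples of each large VM have been exported; the VM names and numbers are purely illustrative, not part of the product.

```python
# Illustrative only: 5-minute CPU Workload (%) samples for each VM
# in the "Large VMs" group (e.g. exported from vRealize Operations).
workload_samples = {
    "vm-sql-01":  [35, 42, 81, 60, 38],
    "vm-erp-01":  [12, 15, 11, 18, 14],
    "vm-olap-01": [55, 70, 66, 90, 73],
}

# Regroup the data so each entry holds every large VM's workload
# for one 5-minute interval.
per_interval = list(zip(*workload_samples.values()))

# Super metric 1: Maximum CPU Workload among the large VMs.
# Expect this to hover around 80%; a consistently low value means severe wastage.
max_workload = [max(interval) for interval in per_interval]

# Super metric 2: Average CPU Workload among the large VMs.
# Expect this to hover around 40%; below ~25% for a whole month means
# the group as a whole is oversized.
avg_workload = [sum(interval) / len(interval) for interval in per_interval]

print("Max per interval:", max_workload)
print("Avg per interval:", [round(v, 1) for v in avg_workload])
```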

In a perfect world, if all the large VMs are right sized, which scenario will you see?

All good. The two line charts show us the degree of over-provisioning. Can you spot a limitation?

It lies in the counter itself. We cannot distinguish whether the CPU usage is due to real demand or not. Real demand comes from the app. Non-real demand comes from the infrastructure, such as:

  • Guest OS reboot
  • AV full scan.
  • Process hang. This can result in 100% CPU Demand. How do you distinguish a runaway process?

If your Maximum line is constantly ~100%, you may have a runaway process.
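As a rough heuristic, not a vRealize Operations feature, you could flag a runaway-process candidate when the metric stays pinned near 100% for many consecutive 5-minute samples. A sketch, with the 95% threshold and 30-minute streak chosen arbitrarily:

```python
def looks_like_runaway(samples_pct, threshold=95.0, min_consecutive=6):
    """Return True if the CPU metric stays above `threshold` percent for at
    least `min_consecutive` consecutive 5-minute samples (6 = 30 minutes).
    Heuristic only; a genuinely sustained workload would trip it too."""
    streak = 0
    for value in samples_pct:
        streak = streak + 1 if value >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False

# A VM pinned at ~100% for more than half an hour.
print(looks_like_runaway([40, 99, 100, 100, 98, 100, 99, 100, 97, 35]))  # True
```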

Now that you’ve got the technique, we are ready to implement it. Follow these 2 blogs:

  1. CPU right sizing.
  2. RAM right sizing

Which VDI User needs more CPU or RAM?

VDI workload differs from Server workload. As a result, we cannot use the same approach to right-size them. You probably know this well, so let me just highlight some of the differences:

  • Usage is spiky, not predictable.
    • A server, generally speaking, has a nice predictable pattern in any given 5 minutes. Its CPU and RAM do not bounce from 5% to 95% and back within 1 minute.
  • Humans do not work non-stop.
    • Typing time, thinking time, meetings, coffee breaks, travel, public holidays, sick leave, etc. A server, being a machine, has none of this 🙂
    • In this age of mobile cloud, there are no fixed “working hours”. Each user has his or her own work schedule. We cannot average across a long time period; 1 hour is probably as long as you want when it comes to averaging.
  • There is non-user workload
    • Weekly Windows AV scans, Windows updates and patches, Horizon events (e.g. recompose, rebalance), vSAN events (e.g. rebalance, staging)
    • Large application install
  • Issues like a runaway process can chew up CPU for >10 minutes.
    • I’ve seen this on a web based application.
  • Users do not want to wait for hours when a performance issue happens
    • You are a user too. How long are you willing to wait for an application to launch? Yup, 1 minute 🙂

For the time being, I’d ignore Disk (IOPS) and Network, and just focus on Compute (CPU and RAM).

As I have shared in this blog, RAM behaves differently from CPU. As a result, we need different counters for CPU and RAM.

For CPU, we should use the data from outside the Guest. 
For RAM, we should use the data from inside the Guest.

Picking the right counter is critical. As you can see here, choosing the wrong counter can result in the wrong decision.

Setting aside the technology and tools, when should we give a user more CPU or more RAM?

  • Well… when she needs more.

How do we define “more”?

We certainly need to look at her workload.

  • If we want to be less generous, we consider the workload over the past week.
  • If we want to provide a high-performance, snappy VDI experience, then any given day is enough to warrant an upsize. We don’t wait for a week of unacceptable performance.

OK, how do we get insight into the workload over the past day? There is no point in getting the average of the last 24 hours, as she likely only generates workload for 8 hours. Maybe even less, as she may have meetings, phone calls, or may not even be in the office. As we said, the average will be low.

What we need is the Max of any given 5 minutes. This gives us insight into whether she demanded more resources. 5 minutes is a good, balanced window: going to 1 minute would be too sensitive, and 10 minutes is too long for a user to wait.
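To illustrate why the 24-hour average hides what the user actually felt, here is a small sketch with made-up numbers comparing the daily average against the peak 5-minute sample:

```python
# Made-up 5-minute CPU Workload (%) samples for one VDI user over 24 hours:
# a long idle stretch plus a short busy spell in the afternoon (288 samples).
samples = [5] * 192 + [20, 35, 60, 85, 90, 70, 40] + [5] * 89

daily_average = sum(samples) / len(samples)   # what a 24-hour average reports
peak_5_min = max(samples)                     # what the user actually felt

print(f"24-hour average: {daily_average:.1f}%")   # low, looks perfectly fine
print(f"Peak 5-min max : {peak_5_min:.1f}%")      # 90%, the moment she waited
```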

vRealize Operations provides this via its View widget. The following screenshot shows that it can display the Maximum during the sample period.

[Screenshot: CPU 1]

Besides the Maximum, what else do you notice?

Yup, I show the Standard Deviation. I’ve shown below how you add it. Michael Ryom has written an excellent explanation here; please read it first.

[Screenshot: CPU 2]

I’ll use a simple example. Say user Marie’s average CPU Workload over the past day is 50%, and the standard deviation is 10%. Assuming a roughly normal distribution, 95% of the data falls within 2 standard deviations, so in the past 24 hours 95% of her workload falls between 30% and 70%. If the max is 95%, she only hit that kind of workload in about 5% of the past day. That’s still 72 minutes, a long time from her viewpoint. 3 standard deviations takes us to 99.7%: 99.7% of the time her CPU workload falls between 20% and 80%, and that remaining 0.3% translates into about 4 minutes in the last 24 hours. So, as Michael said, the devil is in the detail, and now you have the details 🙂
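The arithmetic in Marie’s example is easy to verify with a few lines of Python; the 95% and 99.7% bands assume a roughly normal distribution:

```python
mean_pct = 50.0         # Marie's average CPU Workload over the past day
std_pct = 10.0          # standard deviation of that workload
minutes_in_day = 24 * 60

# ~95% of samples fall within 2 standard deviations of the mean.
band_2sd = (mean_pct - 2 * std_pct, mean_pct + 2 * std_pct)   # (30.0, 70.0)
outside_2sd_minutes = 0.05 * minutes_in_day                   # ~72 minutes

# ~99.7% fall within 3 standard deviations.
band_3sd = (mean_pct - 3 * std_pct, mean_pct + 3 * std_pct)   # (20.0, 80.0)
outside_3sd_minutes = 0.003 * minutes_in_day                  # ~4.3 minutes

print(band_2sd, round(outside_2sd_minutes), "minutes outside 2 SD")
print(band_3sd, round(outside_3sd_minutes, 1), "minutes outside 3 SD")
```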

Let’s now take a real example. Notice that the first one has an average of 21.52% and a Standard Deviation of only 2.24%, yet the Maximum is a whopping 96%. That is way outside the range, so we can tell quickly that it is not normal. Since the sample period below is 24 hours (1,440 minutes), this is a one-off data point in vRealize Operations.

[Screenshot: CPU 3]

Zooming into the VM to plot the entire 24 hours, we can see it is indeed a one-off. Bingo! 🙂

[Screenshot: CPU 4]

Now that you have this insight, you can confidently decide whether that was a one-off instance or something that does need an upsize. BTW, since this is VDI, your starting point is probably 2 vCPUs (I’d avoid going down to 1 vCPU), and you should only increment 1 vCPU at a time. In other words, I won’t jump from 2 vCPUs to 4, 6, or 8; I’d go 2, 3, 4, 5, as larger jumps hurt my consolidation ratio.

Yes, that’s all you need to find out which user needs more CPU. Simple, yet accurate. Sometimes, as engineers, we over-engineer a solution 🙂

What about RAM?

Well… that’s a topic for another blog. I want you to review this first. Let me know which counter to use!

Hint: it is not Memory Usage, Memory Consumed, Memory Workload, or Memory Active.

You can find the answer here.

Are those Large VMs using the resources given to them?

This post is part of the Operationalize Your World program. Do read it first to get the context.

In the previous post, I covered the reasons why over-provisioned VMs are bad. We also talked about the technique. Let’s discuss the implementation for CPU in this blog.

Create a dynamic group that captures all the large VMs. Depending on your environment, you can grab VMs with either 8 or more vCPUs, or 6 or more. In the screenshot below, I’m using 8 vCPUs.
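If you prefer scripting to clicking, the group’s membership criterion is easy to express outside the product as well. Here is a minimal pyVmomi sketch that lists VMs configured with 8 or more vCPUs; the host name and credentials are placeholders, and this merely mirrors the dynamic group’s criterion, it is not how vRealize Operations builds it:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details; adjust for your environment.
ctx = ssl._create_unverified_context()  # lab use only; verify certs in production
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    # The group criterion from this post: 8 or more configured vCPUs.
    large_vms = [vm.name for vm in view.view
                 if vm.config and vm.config.hardware.numCPU >= 8]
    print(sorted(large_vms))
finally:
    Disconnect(si)
```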

What do you notice about the CPU Utilization in the following screenshot?

[Screenshot 1]

The Large VMs, as a group, are only using 7.61% at most!

That means not a single one of them used more than ~8% CPU over a 24-hour period. This is an example of severe over-provisioning.
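If you want to turn that observation into a starting proposal for a new size, a simple proportional rule works: scale the configured vCPU count by the ratio of observed peak to your target utilisation. A sketch; the 80% target and the 2-vCPU floor are my assumptions, not something the product prescribes:

```python
import math

def suggested_vcpus(configured_vcpus, peak_utilisation_pct,
                    target_pct=80.0, min_vcpus=2):
    """Shrink the vCPU count so the observed peak would land near the target
    utilisation. Heuristic only; validate with the application owner first."""
    needed = configured_vcpus * (peak_utilisation_pct / target_pct)
    return max(min_vcpus, math.ceil(needed))

# The group above peaked at 7.61%.
print(suggested_vcpus(8, 7.61))    # -> 2
print(suggested_vcpus(32, 7.61))   # -> 4
```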

Once you have the super metrics, you can display them on the dashboard. You can use a line chart or a View. I use a View as I do not need to show them as 2 separate charts. What do you see in the following example?

  • The area marked 1 is not what you want to see. None of the Large VMs is doing any work. This means they are all oversized.
  • The area marked 2 is healthier. At any given moment, one of the VMs is doing work. The Demand counter can go above 100% as the hypervisor performs IO (storage or network) using another core.
  • The average remains low the entire time. This is over a 1-month period, with 5-minute granularity. It shows the majority of the VMs are oversized.

Now, the above gives a good overall picture. But it’s missing something. Can you guess what?

Yes, it’s missing the VMs themselves. What if upper management wants to see all the VMs’ utilisation at a glance?

We can create a table that has a data transformation. The table complements the line chart by listing all the VMs. From the list, you can see which VM is the most over-provisioned, because the list is sorted. You can sort by the 5-minute or 1-hour peak.
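If you want the same “sorted by peak” list outside the dashboard, it is straightforward once you have the per-VM samples. A sketch with illustrative data:

```python
# Illustrative 5-minute CPU Workload (%) samples per large VM.
samples = {
    "vm-app-01": [28, 31, 26, 30, 33],
    "vm-dwh-01": [62, 58, 71, 90, 65],
    "vm-web-01": [9, 12, 8, 11, 10],
}

# Sort by the highest single 5-minute sample, most over-provisioned first.
for peak, name in sorted((max(v), name) for name, v in samples.items()):
    print(f"{name}: peak 5-minute workload {peak}%")
```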

What’s the limitation of the table?

  • It does not show the VM distribution. Where are the large VMs? Do they live in a cluster where they are not supposed to be?

This is where the Heat Map comes in.

  • We group them by Cluster, then by ESXi host, so we can see where they are. You want to see them well spread across your clusters, and not concentrated in just 1 host.
  • The heat map is sized by vCPU configuration. This way, a bigger VM gets a bigger box. A 32-vCPU VM will have a box 4x larger than an 8-vCPU VM, so it will stand out. You can see in the following example that some large VMs are much larger than the rest; I have monster VMs in this environment.

A great feature of the heat map is color. It’s very visual. We know that both under-provisioning and over-provisioning are bad, so I set the color spectrum accordingly. I choose:

  • black for 0
  • red for 100
  • green for 50

If I do the right-sizing, I’d see mostly green. If I under-provision, I’d see mostly red. If I over-provision, which you can expect in most environments, guess what? They are black!

[Screenshot 3]

That’s all you need to see the overall picture.

A VM Owner does not care about your overall picture. She just cares about her VM. That means we need to drill down into individual VMs.

To facilitate that, we need a list of VMs. I use a Top-N as it enables me to sort the VMs. The good thing about Top-N is that you can go back to any period in time; the heat map only lets you see current data.

The timeline in the Top-N is set to just 1 hour. There is no point setting it longer, as it will average the data. What you want is already provided by the View list; use that to pick the VM to downsize. The Top-N is merely there to drive the other widgets.

We also add a table. It shows the individual vCPU peak utilisation. It is shown in seconds, following the vCenter real-time chart: 20 seconds = 100%.
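Converting that counter into a percentage is simple arithmetic, following the 20 seconds = 100% relationship stated above:

```python
SAMPLE_INTERVAL_S = 20  # vCenter real-time charts sample every 20 seconds

def used_seconds_to_percent(used_seconds):
    """20 seconds of CPU used within a 20-second sample = 100% for that vCPU."""
    return used_seconds / SAMPLE_INTERVAL_S * 100

print(used_seconds_to_percent(20))  # 100.0
print(used_seconds_to_percent(5))   # 25.0
```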

The table does not answer quickly what the CPU utilisation is 95% of the time. This is where the Forensic widget comes in. It shows the 95th percentile. You expect that green vertical line to sit at the 80% mark, indicating the VM is correctly sized.
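The 95th percentile the Forensic widget shows is a standard percentile calculation. A minimal NumPy sketch with made-up data:

```python
import numpy as np

# Made-up 5-minute CPU utilisation (%) samples for one VM over 24 hours.
rng = np.random.default_rng(0)
samples = np.clip(rng.normal(loc=55, scale=15, size=288), 0, 100)

p95 = np.percentile(samples, 95)
print(f"95th percentile utilisation: {p95:.1f}%")
# For a correctly sized VM you would expect this to land near the 80% mark.
```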

The table and the Forensic widget are useful. What’s their limitation?

  • They are not as user friendly as a line chart.
  • Plus, the VM Owner wants to see the utilisation of each vCPU. This lets her see clearly whether a specific peak was genuine demand or not.

The chart is busy, as the 5-minute granularity is maintained with no roll-up. You can zoom into any specific time of interest.

I’m only showing the first 16 vCPUs; you can configure the widget to show the rest. My screen is simply not big enough to show them all. If yours is not big enough either, or you need to show more than 16, create multiple View widgets.

How do they fit together on the dashboard? Here is how they look.

I hope you found it useful. Happy rightsizing!