Monthly Archives: January 2017

Right-sizing Virtual Machines

This post is part of the Operationalize Your World program. Do read it first to get the context.

Over-provisioning is a common malpractice in real-life SDDC. P2V is a common reason, as the VM was simply matched to the physical server size. Another reason is conservative sizing by the vendor, which the Application Team then pads further.

I’ve seen large enterprise customers try to do a mass adjustment, downsizing many VMs, only to have the effort backfire when performance suffers.

Since performance is critical, you should address it from this angle. Educate VM Owners that right-sizing actually improves performance. A carrot is a lot more effective than a stick, especially for those with money. Saving money is a weak argument in most cases, as VM Owners have already paid for the VMs.

Why oversized VMs are bad

Use the picture below to explain why they are bad for VM Owner

  • Boot time.
    • If a VM does not have a memory reservation, vSphere will create a swap file the size of the configured RAM. The bigger the VM, the longer it takes to create the file, especially on slow storage.
  • vMotion
    • The bigger the VM, the longer it takes to do vMotion and Storage vMotion.
  • NUMA
    • The NUMA effect happens when the VM cannot fit into a single socket.
    • It also happens when the VM’s active vCPUs outnumber the cores that are free on a socket at that point in time. Example:
      • You have a 12-vCPU VM on a 12-core socket. This is fine if it’s the only VM running on the box. In reality, you have other VMs competing for resources. If the 12-vCPU VM wants to run, but Socket 0 has 6 free cores and Socket 1 has 6 free cores, the VM will be spread across both sockets.
  • Co-Stop and Ready
    • It will experience higher co-stop and ready time. Even if not all vCPUs are used by the application, the Guest OS will still demand that all the vCPUs be provided by the hypervisor.
  • Snapshot
    • It takes longer to snapshot, especially if a memory snapshot is included.
  • Processes
    • The Guest OS is not aware of the NUMA nature of the physical motherboard, and thinks it has a uniform structure. It may move processes among its own CPUs, as it assumes this has no performance impact. If the vCPUs are spread across different NUMA nodes, for example a 20-vCPU VM on a 2-socket box with 20 cores, it can experience the ping-pong effect.
  • Visibility
    • There is a lack of performance visibility at the individual vCPU or virtual core level. The majority of counters are at the VM level, which is an aggregate of all of its vCPUs. It does not matter whether you use virtual sockets or virtual cores.

Impacts of Large VMs

  • Large VMs are also bad for other VMs, not just for themselves. They can impact other VMs, large or small. The ESXi VMkernel scheduler has to find available cores for all the vCPUs, even though they are idle. Other VMs may be migrated from core to core, or socket to socket, as a result. There is a counter in esxtop that tracks this migration.
  • Large VMs tend to have slower performance. ESXi may not have enough free cores for all their vCPUs at the same time, and a large VM slows down because all of its vCPUs have to be scheduled. The CPU Co-Stop counter tracks this.
  • Large VMs reduce the consolidation ratio. You can pack more vCPUs with small VMs than with big VMs.

If you are a Service Provider, this actually hits your bottom line. Unless you have progressive pricing, you make more money with smaller VMs as you sell more vCPUs. For example, if you have 2 sockets, 20 cores and 40 threads, you can have either:

  • 1x 20-vCPU VM with moderately high utilisation
  • 40x 1-vCPU VMs with moderately high utilisation

In the above example, you sell 20 vCPUs vs 40 vCPUs.

Approach to Right-sizing

Focus on large VMs

  • Every downsize is a battle because you are changing the paradigm with “Less is More”. Plus, it requires downtime.
  • Downsizing from 4 vCPUs to 2 does not buy much nowadays with >20-core Xeons.
  • No one likes to give up what they are given, especially if they are given little. By focusing on the large ones, you spend 20% effort to get 80% result.

Focus on CPU first, then RAM

  • Do not change both at the same time.
    • It’s hard enough to ask the apps team to reduce CPU, so asking for both will be even harder.
    • If there is a performance issue after you reduced both CPU and RAM, you have to bring both back up, even though it was caused by just one of them.
  • RAM in general is plentiful as RAM is cheap.
  • RAM usage is hard to measure, even with agents. If the app manages its own memory, it needs an application-specific counter. Apps such as the Java VM and databases do not expose to Windows/Linux how they manage their RAM.

Disk right-sizing needs to be done at the Guest OS partition level

  • VM Owners won’t agree to you resizing their partitions.
  • Windows and Linux have different partition names. This can make reporting difficult across OSes.


The technique we use for both CPU and RAM is the same. I’ll use CPU as an example.

The first thing you need is to create a dynamic group that captures all the large VMs. Create 1 group for CPU, and 1 for RAM.

Once you create a group, the next step is to create a super metric. You should create 2 super metrics:

  1. Maximum CPU Workload among these large VMs
    • You expect this number to be hovering around 80%, as it only takes 1 VM among all the large VMs for the line chart to spike.
    • If you have many large VMs, one of them tends to have high utilisation at any given time.
    • If this number is low, that means a severe wastage!
  2. Average CPU Workload among these large VMs
    • You expect this number to hover around 40%, indicating sizing was done correctly.
    • If this chart is below 25% all the time for an entire month, then the large VMs as a group are oversized.

You do not need to create the Minimum. There is bound to be a VM that is idle at any given time.
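The two super metrics above can be sketched outside the product as follows. This is a minimal illustration in Python, not vRealize Operations super metric syntax; the function names, the VM names and the sample values are all made up, and only the 25% cut-off comes from the text.

```python
from statistics import mean


def group_super_metrics(workload_by_vm):
    """Return (max_series, avg_series) across all VMs in the group.

    workload_by_vm maps a VM name to its CPU Workload (%) samples.
    All VMs are assumed to have the same number of samples.
    """
    series = list(workload_by_vm.values())
    # zip(*series) yields, for each collection cycle, one value per VM.
    max_series = [max(sample) for sample in zip(*series)]
    avg_series = [mean(sample) for sample in zip(*series)]
    return max_series, avg_series


def looks_oversized(avg_series, threshold=25.0):
    """True if the group average stays below the threshold in every sample."""
    return all(value < threshold for value in avg_series)


# Hypothetical group of large VMs, three collection cycles each.
large_vms = {
    "vm-a": [10.0, 80.0, 20.0],
    "vm-b": [30.0, 20.0, 10.0],
    "vm-c": [20.0, 20.0, 30.0],
}

max_series, avg_series = group_super_metrics(large_vms)
print(max_series)                   # [30.0, 80.0, 30.0]
print(avg_series)                   # [20.0, 40.0, 20.0]
print(looks_oversized(avg_series))  # False
```

Note how a single busy VM (vm-a in the second sample) is enough to lift the Maximum line, while the Average moves far less.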

In a perfect world, if all the large VMs are right sized, which scenario will you see?

All good. The 2 line charts show us the degree of over-provisioning. Can you tell a limitation?

It lies in the counter itself. We cannot distinguish if the CPU usage is due to real demand or not. Real demand comes from the app. Non-real demand comes from the infra, such as:

  • Guest OS reboot
  • AV full scan.
  • Process hang. This can result in 100% CPU Demand. How to distinguish a runaway process?

If your Maximum line is constantly ~100%, you may have a runaway process.

Now that you’ve got the technique, we are ready to implement it. Follow these 2 blogs:

  1. CPU right sizing.
  2. RAM right sizing

How to prove your IaaS Cluster is fast

This post is part of the Operationalize Your World program. Do read it first so you get the context.

You provide IaaS to your customers. A cluster, be it a VMware vSphere Cluster or a Microsoft Hyper-V Cluster, is a key component of this service offering. In the hyper-converged era, your cluster does storage too. You take time to architect it, making sure all best practices are considered. You also consider performance and ensure it’s not a slow system.

Why is it then, when a VM owner complains that her VM is slow and she blames your infrastructure, you start troubleshooting? Doesn’t that show your lack of confidence in your own creation? If your cluster is indeed fast, why can’t you just show it and be done in 60 seconds?

Worth pondering, isn’t it? 😉

It’s hard to reach a formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂

You need to be able to show something like this.

In the above chart, there is a threshold. It defines the acceptable level of performance. It quantifies exactly what you mean when you promise “fast”. It is your Performance SLA. Review this if you need more details.

You assure them it will be fast, and you back it up with measurable metrics. You prove that with the 2nd line. That’s the actual performance. Fast or not is no longer debatable.

You measure performance every 5 minutes, not every hour. In a month, that is 12 x 24 x 30 = 8640 proofs. Having that many data points backing you up helps in showing that you’re doing your job.
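As a quick check of that arithmetic, and of how those samples could be turned into a single compliance figure, here is a minimal Python sketch. The function names and the sample latencies are hypothetical, and treating a sample as compliant when it is at or below the threshold is an assumption, not something the product defines.

```python
SAMPLE_INTERVAL_MINUTES = 5
SAMPLES_PER_HOUR = 60 // SAMPLE_INTERVAL_MINUTES  # 12


def samples_per_month(days=30):
    """Number of 5-minute data points collected in a month of `days` days."""
    return SAMPLES_PER_HOUR * 24 * days


def sla_compliance(samples, threshold):
    """Fraction of samples at or below the SLA threshold (assumed rule)."""
    if not samples:
        raise ValueError("need at least one sample")
    within = sum(1 for value in samples if value <= threshold)
    return within / len(samples)


print(samples_per_month())  # 8640

# Hypothetical Max (VM Disk Latency) samples in ms against a 5 ms SLA.
print(sla_compliance([2.0, 4.0, 6.0, 3.0], threshold=5.0))  # 0.75
```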

Now that you’ve got the Performance SLA, how do you implement it in vRealize Operations?

I’ll take disk latency as an example, as it’s easy to understand.

The chart below shows the various disk latency among 6 VMs, from 9:00 am until 9:35 am. What do you spot?

The average is good. They are mostly below 5 ms.

The worst is bad. It is hovering around 20 ms. It is 4x higher than the average, indicating that at least one VM is hit. The storage subsystem is unable to serve all VMs. It’s struggling to deliver.

Let’s plot a line along the worst (highest) disk latency. The bold red line is the maximum among the disk latencies of all the VMs. We call this Max (VM Disk Latency) in the cluster.

A cluster typically has a lot more VMs than 6. It’s common to see >100 VMs. Plotting >100 lines will make the chart unreadable. Plus, at this juncture, you’re interested in the big picture first. You want to know if the cluster is performing fast.

This is the power of a super metric. It tracks the maximum among all VMs, creating its own metric as a result. You lose the information on which VM produced each value, as the super metric is made of more than 1 VM.
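To make that trade-off concrete, the sketch below computes the per-sample maximum and, unlike the super metric, keeps the name of the VM behind each value. It is plain Python with made-up VM names and latencies, not anything vRealize Operations exposes.

```python
def max_latency_with_source(latency_by_vm):
    """For each sample, return (worst_latency_ms, vm_name).

    latency_by_vm maps a VM name to its disk latency samples in ms.
    All VMs are assumed to have the same number of samples.
    """
    names = list(latency_by_vm)
    result = []
    for sample in zip(*(latency_by_vm[name] for name in names)):
        worst_value = max(sample)
        # index() returns the first VM holding the worst value in this sample.
        worst_vm = names[sample.index(worst_value)]
        result.append((worst_value, worst_vm))
    return result


# Hypothetical latencies (ms) for three VMs over three collection cycles.
latency = {
    "vm-1": [3.0, 4.0, 2.0],
    "vm-2": [20.0, 5.0, 3.0],
    "vm-3": [4.0, 19.0, 21.0],
}

worst = max_latency_with_source(latency)
print(worst)  # [(20.0, 'vm-2'), (19.0, 'vm-3'), (21.0, 'vm-3')]

# The super metric keeps only the first element of each pair.
print([value for value, _ in worst])  # [20.0, 19.0, 21.0]
```

The maximum stays around 20 ms throughout even though the VM being hit changes, which is exactly the detail the single line hides.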

The next chart has all the details removed. We just have the Maximum and the Average. It’s clear now that the maximum is much higher than the average.

We added 3 dotted lines in the above chart. They are the 3 possible outcomes. If your Maximum is:

  • below the threshold, then you are good. The cluster is serving all its VMs well.
  • near the threshold, then your capacity is full. Do not add more VMs.
  • above the threshold, then your cluster is full. Move VMs out to reduce demand before VM Owners complain.
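The three outcomes above can be sketched as a small classifier. How close counts as "near" is not stated in the post, so the 10% margin below is purely an assumption, as are the function and parameter names.

```python
def classify_cluster(max_value, sla_threshold, near_margin=0.10):
    """Classify the Maximum line against the Performance SLA threshold.

    near_margin is a hypothetical band: values within 10% below the
    threshold are treated as "near".
    """
    if max_value > sla_threshold:
        return "above"  # move VMs out to reduce demand
    if max_value >= sla_threshold * (1 - near_margin):
        return "near"   # do not add more VMs
    return "below"      # the cluster is serving all its VMs well


# Hypothetical Max (VM Disk Latency) readings in ms against a 5 ms SLA.
for reading in (2.0, 4.8, 7.0):
    print(reading, classify_cluster(reading, sla_threshold=5.0))
```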

Can you see the importance of the Performance SLA?

It’s there to protect your job. Without the line, your reputation is at risk. Say you’ve been delivering Disk Latency at <1 ms on your all-flash array. Everyone is happy. Of course! 🙂

You then do storage maintenance for just 1 hour. During that period, disk latency goes up to 4 ms. It is still a respectable number. In fact, it’s a pretty damn good number. But you get a complaint. It happens to coincide with the time you did the maintenance.

Can you guess who is responsible for the slowness experienced by the business?

You bet. Your fault 🙁

But if you have established a Performance SLA, you’re protected. Say you promise 5 ms. You will be able to say “Yes, I knew it would go up as we’re doing maintenance. I’ve catered for this in my design. I knew we could still deliver as per agreed SLA.”

Let’s now show a real example. This is what it actually looks like in vR Ops 6.4.

Notice the Maximum is >10x higher than the average, and the average is very stable. Once the Cluster is unable to cope, you’d see a pattern like this. Almost all VMs can be served, but 1-2 are not served well. The maximum is high because there is always 1 VM that isn’t served.

Only when the Cluster is unable to serve ~50% of the VMs will the average become high too.

BTW, do you notice the metric names differ?

  • The Max is a super metric.
  • The Average is a regular metric.

This is because metrics at higher-level objects (e.g. Cluster, Host) are all averages. None of them is the real peak. Review this “when is a peak not a true peak” article.

The above is for Disk. IaaS consists of providing the following as a service:

  1. CPU
  2. RAM
  3. Disk
  4. Network

Hence we need to display 4 line charts, showing that each service is delivered well.

As the performance of every Service Tier is different, you need to show it per service tier. A Gold Tier delivers faster performance than a Silver Tier, but if its measured value is worse than its own SLA threshold, it’s still not performing. Performance is relative to what you promise.

Since VMs move around in a cluster due to DRS and HA, we need to track at the Cluster level. Tracking at the Resource Pool level is operationally challenging. Do not mix service tiers in a cluster, as Tier 3 performance can impact Tier 1. The only way you can protect the higher tier is with Reservation, which has its own operational complications.

Once I know what to display, I’d normally do a whiteboard, often with customers. It helps me to think clearly.

This is what the dashboard looks like. It starts with a list of clusters. Selecting a cluster will automatically show its performance. It shows CPU, RAM and Disk. Network dropped packets should be 0 at all times, hence they are not shown. You can track them at the data center level rather than per cluster.

The final dashboard can be seen here. As performance has to be considered in capacity, we show how it’s done in a series of posts here.

NOC Dashboards for SDDC – Part 2

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Dashboard: Performance

Is my IaaS performing?

That’s the key question that you need to answer. You need to show if the clusters are coping well. Show how the clusters are performing in terms of CPU, RAM and Disk.

The above dashboard is per Service Tier. Do you know why?

Yes, the threshold differs for each tier. What is acceptable in Tier 3 may not be acceptable in Tier 1.

The good thing about a line chart is that it provides visibility beyond the present time. You can show the last 6 hours and still get good details. Showing >24 hours will result in a visualization that is too static and not suitable for the NOC use case.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • If you only have a few clusters, you can show multiple Service Tiers in 1 dashboard. 1 row per tier results in simpler visualisation.
  • In an environment with >10 clusters, we can group them by Service Tier. Focus more on the highest tier (Tier 1).
  • In an environment with >100 clusters, we need another grouping in between. Group the Tier 1 clusters by physical location.

When a cluster is unable to cope, is it because it has high utilization? I show CPU, RAM and Disk here. You can add Network, as you know the physical limit of the ESXi vmnic.

Disk is tricky for 2 reasons:

  • It is not in %. There is no 100% for IOPS or Throughput. The good thing is, when you architected the array or vSAN, you did have a certain IOPS number in mind, right? 😉 Well, now you can get the storage vendor to prove that it does indeed perform as per the PowerPoint 😉 If not, you get free hardware if they promised fast performance that would meet the business requirement.
  • You need to show both IOPS and Throughput. While they are related, you can have low IOPS and high throughput, and vice versa.

If the cluster utilization is high, the next dashboard drills into each ESXi.

We can also see if they are unbalanced. In theory, they should not be, if you set DRS to Fully Automated and pick at least the middle sensitivity level (3). DRS in vSphere 6.5 also considers Network, so you can still see unbalanced CPU/RAM.

With the dashboard above, we can tell if ESXi CPU Utilisation is healthy or not.

  • A low value does not mean the VMs perform fast. A VM is concerned with the VM Contention metric, not ESXi Utilization. A low value means we over-invested. It is not healthy as we waste money and power.
  • A high value means we risk performance (read: contention).

For ESXi, go with a higher core count. You save on licenses if you can reduce the number of sockets.

We can also tell if ESXi RAM Utilisation is healthy or not.

  • Customers tend to overspend on RAM and underspend on CPU. The reason is this.
  • For RAM, we have 2 metrics:
    • Active RAM
    • Mapped RAM
  • The value you want is somewhere in between Active and Mapped.

In the dashboard, the 3 widgets have different ranges. The ranges I set are 30 – 90, 50 – 100 and 10 – 90.

Why not 0 – 100?

It is not 100% because you want to cater for HA. Your ESXi hosts should not hit 100% because, once HA restarts VMs from a failed host on the surviving hosts, demand would go beyond 100%, meaning performance will be badly affected.
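A rough way to see the effect of HA on the upper bound is the sketch below. It assumes identical hosts and that the load of failed hosts spreads evenly over the survivors, which real clusters only approximate; the function names and the 4-host example are hypothetical.

```python
def utilization_after_failure(current_pct, hosts, failed=1):
    """Per-host utilisation (%) once `failed` hosts are lost.

    Assumes identical hosts and an even spread of the displaced load.
    """
    survivors = hosts - failed
    if survivors <= 0:
        raise ValueError("at least one host must survive")
    return current_pct * hosts / survivors


def max_safe_utilization(hosts, failed=1):
    """Highest per-host utilisation (%) that stays at or below 100% after failure."""
    if hosts - failed <= 0:
        raise ValueError("at least one host must survive")
    return 100.0 * (hosts - failed) / hosts


# Hypothetical 4-host cluster running at 80% per host.
print(round(utilization_after_failure(80.0, hosts=4), 1))  # 106.7
print(max_safe_utilization(hosts=4))                       # 75.0
```

Under these assumptions, a 4-host cluster at 80% would land above 100% after losing one host, which is why the top of the range is set below 100.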

If the cluster or ESXi utilization is high, is it because there are VMs generating excessive workloads?

The dashboard above answers if we have VMs that dominate the shared environment.

  • CPU: a heat map showing all VMs, sized by CPU Demand in GHz (not %), colored by contention.
  • RAM: a heat map showing all VMs, sized by Active RAM, colored by contention.
  • Storage: a heat map showing all VMs, sized by IOPS, colored by latency.

At a glance, we can tell the workload distribution among the VMs. We can also tell if they are being served well or not.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • You can change the threshold anytime. If you want brand new storage from Finance, set the max to 1 ms 😉
  • In a larger environment, group your heat map (e.g. by cluster, host, folder).
  • We can show individual VMs, but we can’t show the history as there is too much data to show.
  • This needs to be done per Tier. 1 dashboard per Tier, as the threshold varies per tier.

Hope you find it useful. For the product-specific implementation, review this blog. To prevent vROps session from timing out, implement this trick by Sunny.