Author Archives: Iwan Rahabok

About Iwan Rahabok

A father of 2 little girls, my pride and joy. The youngest one quickly said "I'm the joy!"

VM Key Performance Indicators

What are the counters that can impact a VM's performance? The diagram may surprise you, as it does not follow popular thinking. Notice the red dotted line: it means that Utilization does not impact performance.

Take VM CPU, for example. You do want the VM to run at 100% CPU utilization, as that means the CPU is doing as much work as possible. The VM's performance is not impacted so long as it has no queue: it is handling 100% of the demand perfectly. Specific to a VM, a queue may form inside the VM (read: Guest OS) or outside the VM (read: hypervisor layer).

Now, a VM running at 100% certainly has no more capacity for additional workload. It is full. If you think the VM needs to run more workload in the future, then add more resources. If not, then adding more CPU will in fact slow down its performance. This is one reason why I was a proponent of dropping the Stress counter in vR Ops: it was based on the Demand counter.

Let's switch to ESXi. An ESXi host does not have a performance problem if it's not struggling to meet demand. If utilization is 100% but none of its VMs are contending for resources, that's in fact the perfect situation. You're getting 100% of your money's worth!

While Capacity and Performance are closely related, they are not identical.

Now that we’re clear on the difference between VM Performance and VM Capacity, let’s zoom into the counters. The following diagram shows the actual counters on each layer (Guest OS, VM and ESXi) that could impact performance.

Do you agree with the choice of counters? 🙂

Have you worked out why CPU Demand is not there? Notice I'm using Run – Overlap. If not, review the VM CPU counter and VM RAM counter posts.

Still disagree? Let me know at e1@vmware.com. Michael Ryom emailed me, pointing out that since utilization does not impact performance, I should not put those counters there. Good point, and thank you Michael. I've updated the blog so it now states "that could impact performance".

Some of the above counters are not available unless you install an agent. Some, such as the disk driver queue, are not available unless you install specific tracing software. Specific to the Windows storage driver, there is no API, so an agent won't help.

Some counters are OS specific. I think Committed % does not impact Linux the way it impacts Windows, due to differences in memory management.

Some counters do not have established or proven guidelines. Outstanding IO (OIO) is one such counter. Storage specialists told me that so long as latency is low, the OIO matters less. I'm keen to hear your real-world experience.

If you do not want to deploy an agent (which I agree with, as agents incur operational overhead), here is what you can monitor today with vRealize Operations + vSphere Tools.

The CPU Context Switch counter is an interesting one. I have not found guidance on what a bad value is. Read this and let me know your thoughts. For RAM Page-In Rate, I think a percentage is better than an absolute value. My take is that 1% is not good. What's yours?
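To make the percentage idea concrete, here is a small Python sketch of converting an absolute Page-In Rate into a percentage of guest RAM. The function name, the 5-minute interval and the sample numbers are mine, purely for illustration:

```python
# Hypothetical illustration: express Guest OS Page-In Rate as a
# percentage of guest RAM rather than an absolute KB/s number.
# The 1% threshold mentioned above is a suggested starting point,
# not an established standard.

def page_in_percent(page_in_kbps: float, guest_ram_mb: int,
                    interval_s: int = 300) -> float:
    """Pages brought in over the sampling interval, as % of guest RAM."""
    paged_in_kb = page_in_kbps * interval_s
    return 100.0 * paged_in_kb / (guest_ram_mb * 1024)

# A VM with 8 GB RAM paging in 30 KB/s over a 5-minute interval:
pct = page_in_percent(30, 8 * 1024)
print(f"{pct:.2f}% of RAM paged in per interval")
```

Expressed this way, the same 30 KB/s means very different things on a 2 GB VM versus a 128 GB VM, which is exactly why I prefer a percentage.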

Which counters should you use for your IaaS Performance SLA? Here is my recommendation:

  • CPU Ready
  • RAM Contention
  • Network TX Dropped Packet
  • Disk Latency
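If you want to see how these four counters could drive an SLA check, here is a minimal sketch. The metric names and threshold values are placeholders I made up; set them per your own SLA tiers:

```python
# A minimal sketch of checking one VM sample against the four
# IaaS Performance SLA counters recommended above.
# All names and thresholds below are illustrative placeholders.

SLA_THRESHOLDS = {
    "cpu_ready_pct": 5.0,        # CPU Ready, % per vCPU
    "ram_contention_pct": 1.0,   # RAM Contention, %
    "net_tx_dropped": 0,         # Network TX dropped packets
    "disk_latency_ms": 20.0,     # Disk latency, ms
}

def sla_breaches(sample: dict) -> list:
    """Return the names of counters that breach their SLA thresholds."""
    return [name for name, limit in SLA_THRESHOLDS.items()
            if sample.get(name, 0) > limit]

sample = {"cpu_ready_pct": 7.2, "ram_contention_pct": 0.3,
          "net_tx_dropped": 0, "disk_latency_ms": 4.1}
print(sla_breaches(sample))  # only CPU Ready breaches here
```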

Do you know why I only use CPU Ready, and exclude CPU Co-Stop and CPU Contention? Email me the answer 🙂 It took me years to vrealize the mistake.

The IaaS SLA complements the Application SLA, which in turn complements the Business SLA. The Application SLA depends on each app, hence it takes a lot more effort to establish. It also does not answer whether a problem is caused by Infra or Apps.

Hope it’s useful. In the next blog, I will share how you can quickly monitor the counters above on thousands of VMs.

VCDX, meet VCOX

Thanks for the positive feedback on the articles The Rise and Fall of Infrastructure Architect and Purpose-driven Architecture. Do read them first as this post builds from there.

I see Architecture and Operations as 2 equally large realms. While we certainly consider Operations when designing, it is not a part of Architecture. They complete each other, like Yin and Yang. They impact each other, like night and day. While I subscribe to the school of thought that the same person can be good at both Architecture and Operations, I've yet to meet such a person. I'm not a VCDX, so I will focus on VCOX. Inspired by VCDX, I created this term to acknowledge the size of this world. To be 100% clear, VCOX is just a term I created; it has nothing to do with VCDX.

Architecture is Day 1, Operations is Day 2. Day 2 impacts Day 0, which is Planning. Why?

Because we begin with the end in mind. The End State drives your Plan. Your Plan drives your Architecture. So it's 2 → 0 → 1, not 0, 1, 2.

I'll use an example to illustrate how Day 2 impacts Day 1. Say you are an internal cloud provider, and you plan to charge per VM (e.g. $1 per vCPU per month). You plan to have 2 classes of offerings:

  • Gold: suitable for production workloads. Performance optimized.
  • Silver: suitable for non-production. Cost optimized.

For Gold, you don't overcommit CPU and RAM. Now, if 1 vCPU typically uses 4 GB RAM, then a 40-core ESXi host will only need 160 GB. If you buy 1 TB of RAM, you won't be able to sell the remaining 864 GB, as you have no vCPUs left to sell. This means your hardware spec is impacted.
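The arithmetic above applies to any core count and RAM size, so it is worth writing down. A quick sketch (the function name is mine):

```python
# Gold-tier sizing check: with no CPU/RAM overcommit and a typical
# ratio of 4 GB RAM per vCPU, how much RAM is stranded (unsellable)?

def stranded_ram_gb(cores: int, ram_per_core_gb: int,
                    installed_ram_gb: int) -> int:
    """RAM you cannot sell because you run out of vCPUs first."""
    sellable = cores * ram_per_core_gb
    return max(installed_ram_gb - sellable, 0)

# A 40-core host with 1 TB (1024 GB) of RAM:
print(stranded_ram_gb(40, 4, 1024))  # 864 GB stranded
```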

You also promise the concept of an Availability Zone. In the event of a cluster failure, you cap the number of VMs affected. If you cap it at, say, 200 production VMs, then your cluster size cannot be too big.

In your service offering, you include the ability for customers to check their own VM health, and how their VMs are served by the underlying platform. This means your architecture needs to know how to associate tenants with their VMs.

Your CIO wants live information projected for his peers to see on how IT is serving the business. This requires you to think about KPIs. How do you know NSX is performing fast enough for its consumers?

I hope the above provides examples of why Day 2 is where you want to start.

Operations covers the following pillars of management (planning, monitoring, troubleshooting):

  • Budget: Costing, Pricing and the business of IT
  • Capacity: it's highly related to cost. Insufficient budget → Overcommit → Capacity Management.
  • Performance: focus on proactive and early warning. Performance SLA
  • Availability
  • Configuration: drift management
  • Compliance: security compliance, internal audit compliance
  • Inventory: license, hardware
  • Management Reporting

Notice a big pillar is missing above?

Yes, I did not cover Automation. IMHO, that's part of Architecture. You should not automate what you cannot operate, so I see automation not as part of operations but as a feature of your Architecture. It's like an automatic car: that's a feature of the car. How you operate the car so passengers arrive at the destination on time, that's operations.

VCOX answers questions such as:

  • Prove that the IaaS is cheaper than a comparable IaaS. If it's VMware SDDC, then prove that it's cheaper than VMware on AWS. If it's not, then the business case is weakened from the CFO's viewpoint.
  • Prove that Actual meets Plan. The architecture is built for a purpose. Quantify that purpose, and prove that it’s met.
  • The architecture carries a set of KPIs. This enables its performance to be monitored. What are these KPIs? For each metric, what are the thresholds?
  • Is Operations performing? Signs of poor operations are lots of alerts, fire-fighting, blamestorming, and hectic, intense days. The team is under stress as it struggles to operate the architecture.

From my interactions with customers, I notice that the Architect is not leading Day 0. They provide input to the Planning stage, but are not the lead architect driving it. The Architect tends to focus on the technical, something that CFOs and CIOs value less (hence they spend less time on it).

That's my observation from travelling 200 days a year meeting customers, partners and internal teams. I hope it's useful to you. Let me know what you think, as we are in the early days of VCOX.

vRealize Operations 6.7 VM Memory counters

The accuracy of Guest OS memory counters has been debated for a long time. I've been at VMware for a decade and remember the debate between Active RAM and Consumed. Along came visibility into the Guest OS via vR Ops, which was a major step forward. It is not the final solution yet, however, as we still have to consider applications.

Take a look at the following utilization diagram:

The first bar is generic guidance. The second bar is specific to RAM.

  1. When you spend on infrastructure, you want to use it well. After all, you pay for the whole box. So ideally, utilization is 100%. The first bar above shows the utilization range (0% – 100%). Green is where you want the utilization to be. Below 50% is blue, symbolizing cold: the company is wasting money if utilization is below 50%. So below the green zone, there is a wastage zone. On the other hand, going higher than 75% opens the risk that performance may be impacted, hence the yellow and red thresholds. The green zone is actually a narrow band.
  2. In general, an application works on only a portion of its working set at any given time. The process is not touching all its memory all the time; as a result, the rest becomes cache. This is why it's fine for active + cache to go beyond 95%. If your ESXi is showing 98%, do not panic. Windows and Linux do this too: the modern OS is doing its job caching all those pages for you. So you want to keep the Free pages low.

Cache is an integral part of memory management: the more you cache, the lower your chance of a cache miss. This makes sense. RAM is much faster than disk, so if you have it, why not use it? Remember when Windows XP introduced prefetch, and Windows subsequently introduced SuperFetch? If not, here is a good read.

As you can see from the SuperFetch article, memory management is a complex topic. There are many techniques involved. Unfortunately, this is simplified in the UI. All you see is something like this:

Linux and the VMkernel also have their fair share of simplifying this information. The Linux Ate My RAM site is pretty famous. For ESXi, a common misperception is "we are short on RAM, but fine on CPU", when it is actually the other way around! To prove it, check the Max VM CPU Contention and Max VM RAM Contention counters for each cluster.

Windows Memory Counters

  • In Use: this is what Windows needs to operate. Not all of it is actively in use, though, which is why the VM Active memory counter from the hypervisor can be lower than this. Notice that Windows compresses its active RAM even when it has plenty of Free RAM available. This is different behaviour to ESXi, which does not compress unless it's running low on Free. The formula is: In Use = Total − (Modified + Standby + Free).
  • Committed: the currently committed virtual memory, although not all of it has been written to the pagefile yet. Commit can go up without In Use going up, as Brandon shares here.
  • Commit Limit: physical RAM + the size of the pagefile. Since the pagefile is normally configured to match the physical RAM, the Commit Limit tends to be 2x the RAM.
  • Modified: pages that were modified but are no longer in use. They must be written to disk before they can be reused, so they are counted as part of Cached rather than Available. The API name is guest.mem.modifiedPages (Win32_PerfRawData_PerfOS_Memory#ModifiedPageListBytes)
  • Standby: Windows has 3 levels of standby. They are:
    • guest.mem.standby.core (Win32_PerfRawData_PerfOS_Memory#StandbyCacheCoreBytes)
    • guest.mem.standby.normal (Win32_PerfRawData_PerfOS_Memory#StandbyCacheNormalPriorityBytes)
    • guest.mem.standby.reserve  (Win32_PerfRawData_PerfOS_Memory#StandbyCacheReserveBytes)
  • Free: guest.mem.free (Win32_PerfRawData_PerfOS_Memory#FreeAndZeroPageListBytes)
  • Cached = Modified + Standby
  • Available = Free + Standby
  • Paged pool: this is part of Cache Bytes. Based on this great article, Cache Bytes includes Pool Paged Resident Bytes, System Cache Resident Bytes, System Code Resident Bytes and System Driver Resident Bytes.
  • Non-paged pool: this is kernel RAM that cannot be paged out. It's part of In Use.

As a result, determining what Windows actually uses is difficult. And if the UI is not clear, the API is even more challenging: the names in the API do not map to the names in the UI.
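To keep the relationships straight, here is the Windows arithmetic above in code form. The raw values are hypothetical sample readings (in MB), not from a real system:

```python
# The three Windows memory formulas from the list above:
#   In Use    = Total - (Modified + Standby + Free)
#   Cached    = Modified + Standby
#   Available = Free + Standby
# All values below are made-up sample readings in MB.

total    = 10240   # physical RAM
modified = 120     # ModifiedPageListBytes
standby  = 4096    # sum of the three standby lists
free     = 64      # FreeAndZeroPageListBytes

in_use    = total - (modified + standby + free)
cached    = modified + standby
available = free + standby

print(in_use, cached, available)  # 5960 4216 4160
```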

Linux Memory Counters

As you can guess from the above, Linux does it differently 🙂 The counters we're interested in are:

  • Total: guest.mem.total (/proc/meminfo#MemTotal)
  • Buffers: guest.mem.buffers (/proc/meminfo#Buffers)
  • Cached: guest.mem.cached (/proc/meminfo#Cached)
  • Slab reclaimable: guest.mem.slabReclaim (/proc/meminfo#SReclaimable)
  • Available: guest.mem.available (/proc/meminfo#MemAvailable since Linux 3.14)
  • Free: guest.mem.free (/proc/meminfo#MemFree)
Used = total - free - buffers - cache 
Cache = guest.mem.cached + guest.mem.slabReclaim

The above mapping and calculation are based on the latest Linux documentation and source code. Older Linux versions used a different formula, and it may change again in the future.
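If you want to verify the formulas yourself, here is a sketch that parses /proc/meminfo directly (Linux only) and applies the Used and Cache calculations above. The function names are mine:

```python
# Parse /proc/meminfo and apply the formulas above:
#   Cache = Cached + SReclaimable
#   Used  = MemTotal - MemFree - Buffers - Cache

def read_meminfo(path="/proc/meminfo"):
    """Return /proc/meminfo as a dict of {field: value-in-kB}."""
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are in kB
    return info

def used_and_cache(info):
    cache = info["Cached"] + info["SReclaimable"]
    used = info["MemTotal"] - info["MemFree"] - info["Buffers"] - cache
    return used, cache
```

On a Linux box, `used, cache = used_and_cache(read_meminfo())` should roughly match what the `free` command reports.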

vR Ops Counters

How do the above Guest OS metrics appear in vR Ops? VMware vSphere Tools comes to the rescue! It provides a set of counters (details here). The counter Memory Needed considers Standby Cache Normal Priority as part of needed memory. As you can see in this example, that is a whopping 40% of the 10 GB RAM. Notice that Free and Zero is left with just 1 MB. Windows does not include this, hence Windows reports a lower number than vR Ops.

vR Ops can't change the value, as it comes from Tools. We do not have a metric for Cache. We are working closely as a team to provide a more detailed mapping to the Guest OS. In the meantime, you can use the Endpoint Operations module of vR Ops. Do note that this requires agent deployment, so keep it only for your critical VMs.

The change in vR Ops 6.7 is in the mapping of VM counters. We did not come up with a new counter: all the counters have been available since vR Ops 6.3, and I explained them in this post. In 6.7, the Memory Workload and Memory Usage counters now map to Guest OS instead of Active. If the Guest OS metric is not available, they fall back to Consumed. This is why you start seeing a jump in memory usage. It's not that vR Ops 6.7 chooses the wrong counter; it's merely using a different counter.

vR Ops 6.6 uses VM Active
vR Ops 6.7 uses Guest OS, then VM Consumed
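The fallback logic is simple enough to express in a few lines. This is just my sketch of the 6.7 behaviour described above; the metric names are illustrative, not the actual vR Ops metric keys:

```python
# Sketch of the vR Ops 6.7 fallback: prefer the Guest OS metric,
# fall back to VM Consumed when Tools data is absent.
# Metric names here are made up for illustration.

def memory_usage(metrics: dict) -> float:
    """Return Guest OS memory if reported, else VM Consumed."""
    guest = metrics.get("guest_os_needed")
    return guest if guest is not None else metrics["vm_consumed"]

print(memory_usage({"guest_os_needed": 6.5, "vm_consumed": 9.8}))  # 6.5
print(memory_usage({"guest_os_needed": None, "vm_consumed": 9.8}))  # 9.8
```

The jump people see after upgrading is exactly this: Consumed (or Guest OS) is typically much higher than Active, so the same VM reports a bigger number.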

If you want to go back to Active, which I'd not recommend, here are the counters you need:

For Alerts, you can simply change the metrics used in the symptom. For Dashboards, you can create a custom dashboard that use these metrics instead.

What counters to use?

It depends on the context. Are you troubleshooting performance, or analysing for right-sizing? Performance is about "Does the application get what it needs?" The long term or big picture is less relevant. Right-sizing is about "Overall, what is the right amount needed?" You look at the long term: a 5-minute burst should not dictate your overall sizing recommendation.

  • For Performance:
    • VM Memory Contention. No point increasing/shrinking the Guest if the problem is with the hypervisor.
    • Guest OS Page-In Rate. A heavy page-in impacts performance. Page-In is not a metric for capacity due to prefetch.
    • Guest OS Memory Needed.
  • For Capacity:
    • For best performance: Guest OS Memory Needed
    • For lowest cost: Guest OS Used (need agent), Guest OS Cache (need agent)

I hope the article helps to clarify things. What counters do you want to see in a future version of vR Ops? Discuss with your TAM or SE, as they will champion you internally in VMware. If you do not have one, let me know at e1@vmware.com.