Large Scale vSAN Monitoring

Large scale VMware vSAN operations raise the need for easier and faster monitoring. With many large vSAN clusters, monitoring and troubleshooting become more challenging. To illustrate, let’s take a single vSAN cluster with the following setup:

Here are some of the questions you want to ask in day-to-day operations:

  • Is any ESXi host running high CPU utilization?
  • Is any ESXi host running high Memory utilization?
  • Is any NIC running high utilization?
    • With 4 NIC per ESXi, you have 40 TX + 40 RX metrics.
  • Is vSAN vmkernel network congested?
  • Is the Read Cache used?
  • Is the Write Buffer sufficient?
  • Is the Cache Tier performing fast?
    • Each disk has 4 metrics: Read Cache Read Latency, Read Cache Write Latency, Write Buffer Write Latency, Write Buffer Read Latency
    • Since there are 20 disks, you need to check 80 counters
  • Are the Capacity Disks performing fast?
    • Check both Read and Write latency.
    • Total 120 x 2 = 240 counters.
  • Is any of the Disk Groups running low on space?
  • Is any of the Disk Groups facing congestion?
    • You want to check both the max and the number of occurrences > 60.
  • Is there outstanding IO on any of the Disk Groups?

If you add up the above, you are looking at 530 metrics for this vSAN cluster. And that’s just 1 point in time. In 1 month, at one sample every 5 minutes, you’re looking at 530 x 8766 = 4.6+ million data points!
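The arithmetic can be checked with a short sketch. The cluster shape below (10 hosts, 20 disk groups, 120 capacity disks) and the per-question metric counts are assumptions inferred from the numbers in the list, not taken from the original setup diagram. The 5-minute collection interval is also an assumption; it happens to give 8766 samples in an average month.

```python
# Assumed cluster shape, inferred from the counts quoted in the list above.
HOSTS = 10
NICS_PER_HOST = 4
DISK_GROUPS = 20        # one cache disk per disk group
CAPACITY_DISKS = 120

# One hypothetical breakdown of metrics per question.
metrics = {
    "esxi_cpu": HOSTS,
    "esxi_memory": HOSTS,
    "nic_tx_rx": HOSTS * NICS_PER_HOST * 2,
    "vmkernel_network": HOSTS,
    "read_cache": DISK_GROUPS,
    "write_buffer": DISK_GROUPS,
    "cache_tier_latency": DISK_GROUPS * 4,
    "capacity_disk_latency": CAPACITY_DISKS * 2,
    "disk_group_space": DISK_GROUPS,
    "disk_group_congestion": DISK_GROUPS,
    "disk_group_outstanding_io": DISK_GROUPS,
}
total = sum(metrics.values())

# Samples in an average month (365.25 / 12 days) at an assumed 5-minute interval.
samples_per_month = round(365.25 / 12 * 24 * 60 / 5)
data_points = total * samples_per_month

print(total, samples_per_month, data_points)
```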

How do you monitor millions of data so you can be proactive?

vRealize Operations 6.7 sports vSAN KPIs. We collapsed each of those questions, so you only have 12 metrics to check instead of 530, without losing any insight. In fact, you get better early warning, as we do not rely on the average, which hides the outliers. Early warning is critical, as buying hardware is more than a trip to the local DIY hardware store.

The KPIs achieve this simplification by using supermetrics:

Using Min, Max, Count, it picks the early warning.
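As a rough illustration of the idea (plain Python, not vRealize Operations super metric syntax), the sketch below collapses many per-object readings into a worst value plus a count of breaches. The function names, sample values and the free-space threshold are hypothetical.

```python
def collapse_high_is_bad(readings, threshold):
    """Return (max, number of readings above threshold)."""
    values = list(readings)
    return max(values), sum(1 for v in values if v > threshold)


def collapse_low_is_bad(readings, threshold):
    """Return (min, number of readings below threshold)."""
    values = list(readings)
    return min(values), sum(1 for v in values if v < threshold)


# Hypothetical disk group congestion readings, one per disk group.
congestion = [0, 0, 12, 75, 3, 0, 61, 0]
worst, breaches = collapse_high_is_bad(congestion, threshold=60)
average = sum(congestion) / len(congestion)

# Hypothetical free space (%) per disk group; here a low value is the problem.
free_space = [45.0, 38.5, 9.0, 52.0]
lowest, low_count = collapse_low_is_bad(free_space, threshold=20.0)

print(worst, breaches, average)
print(lowest, low_count)
```

In this made-up data the average congestion looks harmless, while the max and the count both show that two disk groups are above 60.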

The KPI has been a hit with customers. But it falls short when you have many vSAN clusters. If you have say 25 hybrid clusters and 25 All Flash clusters, you need to check 50 clusters. While you can click 50x, what you want is to see all 50 at the same time.

This means we need to aggregate the metrics further. There should be 1 and only 1 metric per cluster.

The challenge is the KPI has different units and scale. How do we normalize them into Green, Yellow, Orange and Red?

We do it by defining a normalization table. We need 1 table for All Flash and 1 for Hybrid, as they have different KPIs and thresholds. Here is the table for All Flash:

Read Cache Hit Rate (%) is missing from the above as it’s not applicable to All Flash. It does not have dedicated Read Cache.

I’m setting CPU Ready and CPU Co-Stop at 1%, so we can catch early warning. For RAM, as most ESXi hosts sport 512 GB RAM, I set the RAM Contention at 0%.

The metric that I’m not sure about is the Disk Group Congestion. It’s based on 60, which I think is a good starting point in general.

Here is the table for Hybrid:

Do you know why I do not have Utilization counter (e.g. CPU Utilization) there?

Utilization does not impact performance. An ESXi host running at 99% is not slower than one running at 1%, so long as there is no contention or latency. This is vSAN KPI, not vSAN KUI (Key Utilization Indicators). Yes, vSAN KUI needs its own table.

Once you have the table, you can map each KPI value to a threshold band. I use Green = 100, Yellow = 67, Orange = 33, Red = 0. I use a 0 – 100 scale so it’s easier to see the relative movement. If you don’t want to be confused with %, you can use 0 – 10 or 0 – 50.

vSAN Performance is the average of all these. We are not taking the worst to prevent 1 value from keeping it red all the time. If you take the worst, the value will likely remain constant. That’s not good, as pattern is important in monitoring. The relative movement can be more important than the absolute value.
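The mapping and averaging can be sketched in plain Python as below. This is a minimal sketch, not the actual super metric formula. The Green/Yellow/Orange/Red scores come from the text; the metric names and the band limits are hypothetical placeholders for the real normalization table, and only "higher is worse" KPIs are handled.

```python
SCORES = (100, 67, 33, 0)  # Green, Yellow, Orange, Red

# Hypothetical band limits per KPI: (green_max, yellow_max, orange_max).
# A value above orange_max is Red.
BANDS = {
    "cpu_ready_pct": (1.0, 2.0, 4.0),
    "cpu_co_stop_pct": (1.0, 2.0, 4.0),
    "ram_contention_pct": (0.0, 1.0, 2.0),
    "disk_group_congestion": (60.0, 70.0, 80.0),
}


def normalize(value, limits):
    """Map a raw KPI value to 100 / 67 / 33 / 0."""
    for score, limit in zip(SCORES, limits):
        if value <= limit:
            return score
    return SCORES[-1]


def cluster_performance(kpis):
    """Average of the normalized scores, so 1 bad KPI does not pin the result."""
    scores = [normalize(kpis[name], BANDS[name]) for name in BANDS]
    return sum(scores) / len(scores)


# Hypothetical readings for one cluster.
readings = {
    "cpu_ready_pct": 0.5,
    "cpu_co_stop_pct": 0.2,
    "ram_contention_pct": 0.0,
    "disk_group_congestion": 75.0,
}
performance = cluster_performance(readings)
worst = min(normalize(readings[n], BANDS[n]) for n in BANDS)
print(performance, worst)
```

With these made-up readings the worst KPI sits in Orange, while the average still moves as the other KPIs change, which is the pattern you want to see on a dashboard.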

You implement the above using super metrics. You need 2 super metrics, 1 for Hybrid and 1 for All Flash. For simplicity, I’d not use Policy but rather apply both super metrics to all my vSAN clusters. I then use the correct metric when building the dashboard.

Hope you find it useful.

VM Key Performance Indicators

What are the counters that can impact a VM performance? The diagram may surprise you as it does not follow popular thinking. Notice the red dotted line. It means that Utilization does not impact performance.

Take VM CPU for example. You do want the VM to run at 100% CPU utilization, as that means the CPU is doing as much work as possible. The VM performance is not impacted so long as it has no queue. It’s handling 100% of the demand perfectly. Specific to VM, a queue may happen inside the VM (read: Guest OS) or outside the VM (read: hypervisor layer).

Now, a VM running at 100% certainly has no more capacity for additional workload. It is full. If you think the VM needs to run more workload in the future, then add more resource. If not, then adding more CPU will in fact slow down its performance. This is one reason why I was a proponent of dropping the Stress counter in vR Ops. It was based on Demand counter.

Let’s switch to ESXi. An ESXi host does not have a performance problem if it’s not struggling to meet demand. If utilization = 100%, but none of its VMs are contending for resources, that’s in fact the perfect situation. You’re getting 100% of your money!

While Capacity and Performance are closely related, they are not identical.

Now that we’re clear on the difference between VM Performance and VM Capacity, let’s zoom into the counters. The following diagram shows the actual counters on each layer (Guest OS, VM and ESXi) that could impact performance.

Do you agree with the choice of counters? 🙂

Notice I’m using Run – Overlap. Have you worked out why CPU Demand is not there? If not, review VM CPU counter and VM RAM counter.

Still disagree? Let me know. Michael Ryom emailed me, pointing out that since utilization does not impact performance, I should not put the counters there. Good point, and thank you Michael. I’ve updated the blog so now it states “that could impact performance”.

Some of the above counters are not available unless you install an agent. Some counters, such as Disk Driver queue, are not available unless you install specific tracing software. Specific to the Windows storage driver, there is no API, so an agent won’t help.

Some counters are OS specific. I think Committed % does not impact Linux the way it impacts Windows, due to difference in Memory Management.

Some counters do not have established or proven guidelines. Outstanding IO is one such counter. Storage specialists told me that so long as latency is low, the OIO matters less. I’m keen to hear your real world experience.

If you do not want to deploy an agent (which I agree with, as agents incur operational overhead), here is what you can monitor today with vRealize Operations + VMware Tools.

The CPU Context Switch is an interesting one. I have not found guidance on what a bad value is. Read this and let me know your thoughts. For RAM Page-In Rate, I think a percentage is better than an absolute value. My take is 1% is not good. What’s your take?

Which counters should you use for your IaaS Performance SLA? Here is my recommendation:

  • CPU Ready
  • RAM Contention
  • Network TX Dropped Packets
  • Disk Latency

Do you know why I only use CPU Ready, and exclude CPU Co-Stop and CPU Contention? Email me the answer 🙂 It took me years to vrealize the mistake.

IaaS SLA complements Application SLA, which in turn complements Business SLA. Application SLA depends on each app, hence it takes a lot more effort to establish. It also does not answer whether the problem is caused by Infra or Apps.
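One possible way to turn the four recommended IaaS counters into a single SLA number is sketched below. It assumes the SLA is measured as the percentage of collection samples in which all four counters stay within their limits. That definition, the field names and every limit value are hypothetical and only there to make the sketch runnable.

```python
# Hypothetical SLA limits per counter; pick your own per class of service.
SLA_LIMITS = {
    "cpu_ready_pct": 2.5,
    "ram_contention_pct": 1.0,
    "net_tx_dropped_packets": 0,
    "disk_latency_ms": 10.0,
}


def sample_meets_sla(sample):
    """A sample passes only if every counter is within its limit."""
    return all(sample[name] <= limit for name, limit in SLA_LIMITS.items())


def sla_compliance_pct(samples):
    """Percentage of samples in which the VM was served within the SLA."""
    samples = list(samples)
    passed = sum(1 for s in samples if sample_meets_sla(s))
    return 100.0 * passed / len(samples)


# Four hypothetical samples for one VM; the third breaches CPU Ready.
vm_samples = [
    {"cpu_ready_pct": 0.4, "ram_contention_pct": 0.0, "net_tx_dropped_packets": 0, "disk_latency_ms": 2.1},
    {"cpu_ready_pct": 1.1, "ram_contention_pct": 0.0, "net_tx_dropped_packets": 0, "disk_latency_ms": 3.0},
    {"cpu_ready_pct": 6.0, "ram_contention_pct": 0.0, "net_tx_dropped_packets": 0, "disk_latency_ms": 2.5},
    {"cpu_ready_pct": 0.9, "ram_contention_pct": 0.0, "net_tx_dropped_packets": 0, "disk_latency_ms": 4.2},
]
compliance = sla_compliance_pct(vm_samples)
print(compliance)
```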

Hope it’s useful. In the next blog, I will share how you can quickly monitor the counters above on thousands of VMs.


Thanks for the positive feedback on the articles The Rise and Fall of Infrastructure Architect and Purpose-driven Architecture. Do read them first as this post builds from there.

I see Architecture and Operations as 2 equally large realms. While we certainly consider Operations when designing, it is not a part of Architecture. They complete each other, like Yin and Yang. They impact each other, like night and day. While I subscribe to the school of thought that the same person can be good in both Architecture and Operations, I’m yet to meet such a person. I’m not a VCDX, so I will focus on what I call VCOX. Inspired by VCDX, I created this term to acknowledge the size of this world. To be 100% clear, VCOX is just a term I created. It has nothing to do with VCDX.

Architecture is Day 1, Operations is Day 2. Day 2 impacts Day 0, which is Planning. Why?

Because we begin with the end in mind. The End State drives your Plan. Your Plan drives your Architecture. So it’s 2 –> 0 –> 1, not 0, 1, 2.

I’ll use an example to illustrate how Day 2 impacts Day 1. Say you are an internal cloud provider, and you plan to charge per VM (e.g. $1 per vCPU per month). You plan to have 2 classes of offerings:

  • Gold: suitable for production workload. Performance optimized.
  • Silver: suitable for non-production workload. Cost optimized.

For Gold, you don’t overcommit CPU and RAM. Now, if 1 vCPU typically uses 4 GB RAM, then a 40-core ESXi host will only need 160 GB. If you buy 1 TB of RAM, then you won’t be able to sell 864 GB, as you have no vCPU left to sell. This means your hardware spec is impacted.

You also promise the concept of Availability Zone. In the event of a cluster failure, you cap the number of VMs affected. If you cap it at, say, 200 production VMs, then your cluster size cannot be too big.
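The stranded RAM figure can be worked out with a few lines. The numbers are the ones from the example; the function name is hypothetical, and 1 TB is taken as 1024 GB.

```python
def stranded_ram_gb(cores, gb_per_vcpu, installed_gb):
    """RAM that cannot be sold when CPU is not overcommitted (1 vCPU per core)."""
    sellable_gb = cores * gb_per_vcpu
    return max(installed_gb - sellable_gb, 0)


sellable = 40 * 4                                   # 160 GB sold alongside 40 vCPU
stranded = stranded_ram_gb(cores=40, gb_per_vcpu=4, installed_gb=1024)
print(sellable, stranded)
```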

In your service offering, you include the ability for customer to check her own VM health, and how her VM is served by the underlying platform. This means your architecture needs to know how to associate tenants with their VMs.

Your CIO wants live information projected for his peers, showing how IT is serving the business. This requires you to think about the KPIs. How do you know NSX is performing fast enough for its consumers?

I hope the above examples show that Day 2 is where you want to start.

Operations cover the following pillars of management (planning, monitoring, troubleshooting):

  • Budget: Costing, Pricing and the business of IT
  • Capacity: it’s highly related to cost. Insufficient budget –> Overcommit –> Capacity Management.
  • Performance: focus on proactive and early warning. Performance SLA
  • Availability
  • Configuration: drift management
  • Compliance: security compliance, internal audit compliance
  • Inventory: license, hardware
  • Management Reporting

Notice a big pillar is missing above?

Yes, I did not cover Automation. IMHO, that’s part of Architecture. You should not automate what you cannot operate. So I see automation as not part of operations. Automation is a feature of your Architecture. It’s like an automatic car. That’s a feature of the car. How you operate the car so passengers arrive at the destination on time, that’s operations.

VCOX answers questions such as:

  • Prove that the IaaS is cheaper than comparable IaaS. If it’s VMware SDDC, then prove that it’s cheaper than VMware on AWS. If it’s not, then the business case is weakened from CFO viewpoint.
  • Prove that Actual meets Plan. The architecture is built for a purpose. Quantify that purpose, and prove that it’s met.
  • The architecture carries a set of KPIs. This enables its performance to be monitored. What are these KPIs? For each metric, what are the thresholds?
  • Is Operations performing? Signs of poor operations are lots of alerts, fire-fighting, blamestorming, and hectic, intense days. The team is under stress as they struggle to operate the architecture.

From my interactions with customers, I notice that the Architect is not leading Day 0. They provide input to the Planning stage, but they are not the lead driving it. The Architect tends to focus on the technical side, something that the CFO and CIO value less (hence they spend less time on it).

That’s my observation from travelling 200 days a year meeting customers, partners and internal teams. I hope it’s useful to you. Let me know what you think, as we are in the early days of VCOX.