
Large Scale vSAN Monitoring

Large-scale VMware vSAN operations raise the need for easier and faster monitoring. With many large vSAN clusters, monitoring and troubleshooting become more challenging. To illustrate, let's take a single vSAN cluster with 10 ESXi hosts, each with 4 NICs and 2 disk groups, giving 20 cache disks and 120 capacity disks in total.

Here are some of the questions you want to ask in day to day operations:

  • Is any of the ESXi hosts running high CPU utilization?
  • Is any of the ESXi hosts running high Memory utilization?
  • Is any of the NICs running high utilization?
    • With 4 NICs per ESXi host, you have 40 TX + 40 RX metrics.
  • Is vSAN vmkernel network congested?
  • Is the Read Cache used?
  • Is the Write Buffer sufficient?
  • Is the Cache Tier performing fast?
    • Each disk has 4 metrics: Read Cache Read Latency, Read Cache Write Latency, Write Buffer Write Latency, Write Buffer Read Latency
    • Since there are 20 cache disks, you need to check 80 counters.
  • Are the Capacity Disks performing fast?
    • Check both Read and Write latency.
    • With 120 capacity disks, that's 120 x 2 = 240 counters.
  • Is any of the Disk Groups running low on space?
  • Is any of the Disk Groups facing congestion?
    • You want to check both the maximum and the number of occurrences above 60.
  • Is there outstanding IO on any of the Disk Groups?

If you add up the above, you are looking at 530 metrics for this vSAN cluster. And that's just one point in time. At a 5-minute collection interval, one month means 530 x 8766 = 4.6+ million data points!

How do you monitor millions of data points so that you can be proactive?

vRealize Operations 6.7 sports vSAN KPIs. We collapsed each of those questions into a single KPI, so you only have 12 metrics to check instead of 530, without losing any insight. In fact, you get better early warning, as we hide the average. Early warning is critical, as buying hardware involves more than a trip to the local DIY hardware store.

The KPIs achieve this simplification by using supermetrics:

Using Min, Max and Count, each supermetric picks up the early warning instead of averaging it away.
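To make the collapsing concrete, here is a minimal Python sketch of the idea. The real implementation is a vRealize Operations supermetric; the disk names, latency values and 3 ms threshold below are illustrative assumptions. It reduces the per-disk cache-tier latencies to a single early-warning number by looking at the worst value and the number of breaches, rather than the average.

```python
# Minimal sketch: collapse many per-disk counters into one early-warning KPI.
# Disk names, values and the 3 ms threshold are illustrative placeholders,
# not the actual vRealize Operations supermetric definition.

cache_read_latency_ms = {
    "host01-cache1": 0.4, "host01-cache2": 0.6,
    "host02-cache1": 3.8,   # one outlier that an average would hide
    "host02-cache2": 0.5,
}

THRESHOLD_MS = 3.0  # assumed early-warning threshold

average = sum(cache_read_latency_ms.values()) / len(cache_read_latency_ms)
worst = max(cache_read_latency_ms.values())
breaches = sum(1 for v in cache_read_latency_ms.values() if v > THRESHOLD_MS)

print(f"average = {average:.2f} ms  (looks healthy, hides the outlier)")
print(f"max     = {worst:.2f} ms, {breaches} disk(s) above {THRESHOLD_MS} ms")
```

Apply the same Max/Count treatment across the 80 cache-tier counters in the example cluster and you get a single KPI line instead of 80.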

The KPIs have been a hit with customers, but they fall short when you have many vSAN clusters. If you have, say, 25 Hybrid clusters and 25 All Flash clusters, you need to check 50 clusters. While you could click 50 times, what you really want is to see all 50 at the same time.

This means we need to aggregate the metrics further. There should be one, and only one, metric per cluster.

The challenge is that the KPIs have different units and scales. How do we normalize them into Green, Yellow, Orange and Red?

We do it by defining a normalization table. We need one table for All Flash and one for Hybrid, as they have different KPIs and thresholds. Here is the table for All Flash:

Read Cache Hit Rate (%) is missing from the above as it's not applicable to All Flash, which does not have a dedicated Read Cache.

I'm setting the CPU Ready and CPU Co-Stop thresholds at 1%, so we can catch early warning. For RAM, as most ESXi hosts sport 512 GB of RAM, I set the RAM Contention threshold at 0%.

The metric I'm less sure about is Disk Group Congestion. Its threshold is based on 60, which I think is a good starting point in general.

Here is the table for Hybrid:

Do you know why I do not have Utilization counters (e.g. CPU Utilization) there?

Utilization does not impact performance. An ESXi host running at 99% is not slower than one running at 1%, so long as there is no contention or latency. These are vSAN KPIs (Key Performance Indicators), not vSAN KUIs (Key Utilization Indicators). Yes, vSAN KUI needs its own table.

Once you have the table, you can map each band into a score. I use Green = 100, Yellow = 67, Orange = 33, Red = 0. I use a 0 – 100 scale so it's easier to see the relative movement. If you don't want it to be confused with a percentage, you can use 0 – 10 or 0 – 50.

vSAN Performance is then the average of all these scores. We do not take the worst, to prevent one value from keeping it red all the time. If you take the worst, the value will likely remain constant. That's not good, as the pattern is important in monitoring; the relative movement can be more important than the absolute value.
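Here is a rough Python sketch of that normalization and averaging, under stated assumptions: only the 1% CPU Ready/Co-Stop, 0% RAM Contention and 60 Disk Group Congestion figures come from the text above; the other band boundaries, KPI names and sample values are placeholders, and the real implementation is a vRealize Operations supermetric rather than Python.

```python
# Sketch of normalizing mixed-unit KPIs into one 0-100 score per cluster.
# Bands marked "assumed" are illustrative placeholders; only the 1% CPU
# Ready/Co-Stop, 0% RAM Contention and 60 congestion figures come from the post.

GREEN, YELLOW, ORANGE, RED = 100, 67, 33, 0

def score(value, bands):
    """bands = [(upper_bound, score), ...] ordered from best to worst."""
    for upper, s in bands:
        if value <= upper:
            return s
    return RED

ALL_FLASH_BANDS = {
    "cpu_ready_pct":      [(1, GREEN), (2, YELLOW), (4, ORANGE)],       # 1% from the text
    "cpu_costop_pct":     [(1, GREEN), (2, YELLOW), (4, ORANGE)],       # 1% from the text
    "ram_contention_pct": [(0, GREEN), (1, YELLOW), (2, ORANGE)],       # 0% from the text
    "dg_congestion":      [(60, GREEN), (120, YELLOW), (180, ORANGE)],  # 60 from the text
    "write_buffer_latency_ms": [(2, GREEN), (5, YELLOW), (10, ORANGE)], # assumed
}

def vsan_performance(kpi_values):
    """Average the per-KPI scores; averaging (not taking the worst) keeps
    the line moving so the pattern stays visible."""
    scores = [score(kpi_values[name], bands) for name, bands in ALL_FLASH_BANDS.items()]
    return sum(scores) / len(scores)

sample = {"cpu_ready_pct": 0.4, "cpu_costop_pct": 0.1, "ram_contention_pct": 0,
          "dg_congestion": 75, "write_buffer_latency_ms": 3}
print(vsan_performance(sample))   # one number per cluster, on the 0-100 scale
```

The same structure with a second band table (including Read Cache Hit Rate) gives the Hybrid score.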

You implement the above using supermetrics. You need two supermetrics, one for Hybrid and one for All Flash. For simplicity, I would not use a Policy, but rather apply both supermetrics to all my vSAN clusters, and then pick the correct metric when building the dashboard.

Hope you find it useful.

The Rise of the Single-Socket ESXi Host

When I first started doing VMware back in mid-2008, it was common to see 4-socket ESXi hosts. A cluster of 8 nodes means 32 sockets, which means 32 licenses of vSphere. On top of that, customers also had to pay for the Guest OS; some had to pay for both Red Hat and Windows.

With each passing year, Intel delivered more and more cores. I have customers who ended up with excess vSphere licenses as they went from dual-core to 12-core CPUs over several years.

Fast forward to the present. Intel launched the 18-core Xeon E5-2699 v3 in Q4 2014, followed by the Xeon E5-2699 v4 in Q1 2016, and then the Xeon Platinum 8176 in July 2017, which sports 28 cores! AMD has also joined in with EPYC.

The VMmark results show near-linear performance compared with older Xeons. Some of my customers have managed to reduce the number of ESXi hosts. This is good news for them, as it means:

  • less power consumption
  • less UPS capacity
  • less data center space
  • less air-conditioning
  • a smaller VMware environment (for large customers, this makes management easier)
  • fewer vSphere licenses (which means they can use the maintenance budget to get NSX, vSAN and vRealize)
  • fewer Windows Datacenter licenses, as each gives unlimited VMs
    • Note: this does not apply to a single socket. See below.
  • fewer RHEL licenses, as each gives unlimited VMs
  • fewer licenses for software that charges per physical socket. For example, if you run Oracle software or Microsoft SQL Server, the savings will be more than the infrastructure savings.

Going forward, I see customers with large ESXi farms further increasing their consolidation ratios in the next 1–2 years. I see 20:1 becoming common. This means:

  • 15:1 for Tier 1 workload
  • 30:1 for Tier 2 workload
  • 60:1 for Tier 3 workload (double Tier 2, as the price is also half)

At the other end of the scale, I see customers with a very small number of VMs going down to single-socket ESXi hosts. This actually opens up possibilities for use cases that vSAN, NSX or vSphere could not address due to cost. Remote branches (ROBO) are one such use case. Here, a 4-node vSAN cluster of dual-socket hosts may not make financial sense, as that means 8 licenses. By going single-socket, the cost is reduced by 50%.

Thanks to Patrick Terlisten (@PTerlisten) and Flemming Riis (@FlemmingRiis), who corrected me on Twitter: Windows Datacenter Edition comes with a 2-physical-socket entitlement, which cannot be split across 2 separate physical servers. A Microsoft document titled “Volume Licensing reference guide Windows Server 2012 R2”, dated November 2013, states it clearly on page 10:

Can I split my Windows Server 2012 R2 license across multiple servers?
No. Each license can be assigned only to a single physical server.

The extra cores also support converged architectures. Storage and networking can now run in software, so we can use the extra cores for services such as:

  • vSAN
  • vSphere Replication
  • vSphere Data Protection
  • NSX distributed router
  • NSX Edge
  • Trend Micro antivirus
  • F5 load balancer
  • Palo Alto Networks firewall
  • etc.

With a single socket, the form factor has to be 1RU at most; 2RU would be considered too space-consuming. In some cases, such as VxRail, Supermicro and Nutanix, a 2RU form factor actually hosts 4 nodes, making each node 0.5RU, so to speak.

In the 1RU form factor, you do not have to compromise on storage. For example, the Dell PowerEdge R630 takes 24 x 1.8” hot-plug SATA SSDs, giving you up to 23 TB of raw capacity with 0.96 TB drives.

You also do not need to compromise on RAM. Review this post to see that it's a common mistake to configure an ESXi host with excess RAM.

We know that distributed storage works much better on 10 Gb networking than on 1 Gb. A pair of 24-port 10 Gb switches can be cost-prohibitive. The good thing is that there are vendors who supply 12-port switches, such as the NETGEAR XS series and ZyXEL.

I'd like to end this post by getting it validated by practitioners, folks who actually run technology like vSAN in production, to see if this single-socket idea makes sense. I discussed it with Neil Cresswell, CEO of Indonesian Cloud, and this is what he had to say:

“With the rapidly increasing performance of CPUs, once again the discussion of scale up (fewer big servers) or scale out (more small servers) comes back into play. I remember a client a few years ago that bought an IBM 8 Socket Server and were running 100 VMs on it; at that time I thought they were crazy; why do that vs having 2x 4 socket servers with 50 VMs each! Well, that same argument is now true for 2 socket vs single socket. You can have a dual socket server running 50VMs or 2x single socket servers each running 25 VMs.
As a big believer in risk mitigation, I for one would always prefer to have more servers to help spread my risk of failure (smaller failure domain).”

What's your take? Do you see customers (or your own infrastructure) adopting single-socket ESXi hosts? Do you see distributed storage at ROBO becoming viable with this smaller config?

I'm keen to hear your thoughts.