Large-scale VMware vSAN operations raise the need for easier and faster monitoring. As vSAN clusters grow in number and size, monitoring and troubleshooting become more challenging. To illustrate, let’s take a single vSAN cluster with the following setup:
Here are some of the questions you want to ask in day-to-day operations:
- Is any ESXi host running high CPU utilization?
- Is any ESXi host running high memory utilization?
- Is any NIC running high utilization?
  - With 4 NICs per ESXi host, you have 40 TX + 40 RX metrics.
- Is the vSAN vmkernel network congested?
- Is the Read Cache used?
- Is the Write Buffer sufficient?
- Is the Cache Tier performing fast?
  - Each disk has 4 metrics: Read Cache Read Latency, Read Cache Write Latency, Write Buffer Write Latency, and Write Buffer Read Latency.
  - Since there are 20 disks, you need to check 80 counters.
- Are the Capacity Disks performing fast?
  - Check both Read and Write latency.
  - Total: 120 x 2 = 240 counters.
- Is any Disk Group running low on space?
- Is any Disk Group facing congestion?
  - You want to check both the max and the number of occurrences > 60.
- Is there outstanding IO on any Disk Group?
If you add up the above, you are looking at 530 metrics for this vSAN cluster. And that’s just 1 point in time. In 1 month, with one sample every 5 minutes, you’re looking at 530 x 8766 = 4.6+ million data points!
How do you monitor millions of data points so you can be proactive?
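To make the arithmetic concrete, here is a rough tally in Python. The host count (10) is inferred from the 40 TX + 40 RX figure, and the disk group count (20) from the 20 cache disks. The per-question counts marked "assumed" are not stated in the post; they are simply the split that adds up to 530. The 5-minute collection cycle is also an assumption that happens to yield 8766 samples in an average month.

```python
# Cluster shape inferred from the counts quoted in the list above.
HOSTS = 10            # 40 TX metrics / 4 NICs per host
NICS_PER_HOST = 4
DISK_GROUPS = 20      # one cache disk per disk group
CAPACITY_DISKS = 120

# Counters to check per question. "assumed" = not stated in the post.
counters = {
    "cpu_utilization": HOSTS,                      # assumed: 1 per host
    "memory_utilization": HOSTS,                   # assumed: 1 per host
    "nic_utilization": HOSTS * NICS_PER_HOST * 2,  # 40 TX + 40 RX
    "vmkernel_network": HOSTS,                     # assumed: 1 per host
    "read_cache_used": DISK_GROUPS,                # assumed: 1 per disk group
    "write_buffer": DISK_GROUPS,                   # assumed: 1 per disk group
    "cache_tier_latency": DISK_GROUPS * 4,         # 20 disks x 4 metrics
    "capacity_disk_latency": CAPACITY_DISKS * 2,   # read + write latency
    "disk_group_space": DISK_GROUPS,               # assumed: 1 per disk group
    "disk_group_congestion": DISK_GROUPS,          # assumed: 1 per disk group
    "disk_group_outstanding_io": DISK_GROUPS,      # assumed: 1 per disk group
}

total_metrics = sum(counters.values())

# Assumed: one sample every 5 minutes over an average month (365.25 / 12 days).
samples_per_month = round(365.25 / 12 * 24 * 60 / 5)
data_points = total_metrics * samples_per_month

print(total_metrics, samples_per_month, data_points)
```

Changing the host or disk counts at the top shows how quickly the number of counters grows with cluster size.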
vRealize Operations 6.7 sports vSAN KPIs. We collapsed the counters behind each of those questions, so you only have 12 metrics to check instead of 530, without losing any insight. In fact, you get better early warning, as we leave the average out. Early warning is critical, as buying hardware is more than a trip to the local DIY hardware store.
The KPIs achieve this simplification by using super metrics:
Using Min, Max, and Count, each KPI picks up the early warning.
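A minimal sketch of the idea in Python, not the actual super metric formula: collapse one reading per disk group into Min, Max, and a count above a threshold. The congestion values are invented for illustration, and `summarize` is a hypothetical helper. The threshold of 60 is the one mentioned in the question list.

```python
from statistics import mean

# Invented congestion readings, one per disk group (20 disk groups).
congestion = [5, 8, 3, 0, 12, 7, 4, 6, 2, 9, 1, 0, 3, 5, 85, 4, 6, 2, 7, 70]
THRESHOLD = 60  # count occurrences above 60


def summarize(values, threshold):
    """Collapse many readings into a handful of early-warning numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "count_over": sum(1 for v in values if v > threshold),
        "avg": mean(values),  # shown only for comparison
    }


kpi = summarize(congestion, THRESHOLD)
print(kpi)
```

With these sample values the average stays well below the threshold even though two disk groups are above it, which is why the max and the count make a better early warning than the average.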
The KPIs have been a hit with customers, but they fall short when you have many vSAN clusters. If you have, say, 25 hybrid clusters and 25 All Flash clusters, you need to check 50 clusters. While you can click 50 times, what you want is to see all 50 at the same time.
This means we need to aggregate the metrics further. There should be one, and only one, metric per cluster.
The challenge is that the KPIs have different units and scales. How do we normalize them into Green, Yellow, Orange and Red?
We do it by defining a normalization table. We need 1 table for All Flash and 1 for Hybrid, as they have different KPIs and thresholds. Here is the table for All Flash:
I’m including Utilization even though it does not impact performance. An ESXi host running at 99% is not slower than one running at 1%, so long as there is no contention or latency. The reason is convenience, as it’s hard to monitor when there is more than 1 counter. You need to bring it down to 1 counter.
I’m setting CPU Ready, CPU Co-Stop and RAM Contention at low numbers, so we can catch early warnings. You can adjust them after you import.
Here is the table for Hybrid. It has Read Cache Hit Rate (%).
Once you have the tables, you can map them into thresholds.
vSAN Performance is the average of all these. We are not taking the worst, as a single bad value would keep it red all the time. If you take the worst, the value will likely remain constant. That’s not good, as the pattern is important in monitoring. The relative movement can be more important than the absolute value.
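Here is a rough Python sketch of the normalize-then-average idea. It is not the actual super metric implementation. The KPI names, the threshold values, and the score given to each color are all placeholders I made up for illustration; the real normalization tables differ. It also only handles KPIs where higher is worse, so a KPI such as Read Cache Hit Rate would need the comparison inverted.

```python
from bisect import bisect_right

COLORS = ["Green", "Yellow", "Orange", "Red"]
# Assumed score per color; the post does not state these numbers.
SCORES = {"Green": 100, "Yellow": 75, "Orange": 50, "Red": 25}

# Hypothetical normalization table: KPI -> (yellow, orange, red) boundaries.
# Higher is worse for every KPI listed here.
TABLE = {
    "cpu_ready_pct": (1.0, 2.0, 4.0),
    "disk_latency_ms": (5.0, 10.0, 20.0),
    "congestion": (20.0, 40.0, 60.0),
}


def color_for(kpi, value):
    """Return the color band a KPI value falls into."""
    # bisect_right counts how many boundaries the value has reached.
    return COLORS[bisect_right(TABLE[kpi], value)]


def performance(readings):
    """Map each KPI to a color, then average the color scores."""
    colors = {kpi: color_for(kpi, value) for kpi, value in readings.items()}
    score = sum(SCORES[c] for c in colors.values()) / len(colors)
    return colors, score


readings = {"cpu_ready_pct": 0.5, "disk_latency_ms": 12.0, "congestion": 70.0}
colors, score = performance(readings)
worst = min(SCORES[c] for c in colors.values())
print(colors, round(score, 1), worst)
```

In this example the worst-case score is pinned at the Red value by congestion alone, while the average still moves when CPU Ready or disk latency changes. That movement is the pattern you want to see on a chart.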
You implement the above using super metrics. Yup, heaps of them 🙂
I hope you find it useful. I will share how the above is implemented in a future post.