Tag Archives: vSphere

Large Scale vSAN Monitoring

Large-scale VMware vSAN operations raise the need for easier and faster monitoring. With many large vSAN clusters, monitoring and troubleshooting become more challenging. To illustrate, let’s take a single vSAN cluster with the following setup:

Here are some of the questions you want to ask in day-to-day operations:

  • Is any ESXi host running at high CPU utilization?
  • Is any ESXi host running at high Memory utilization?
  • Is any NIC running at high utilization?
    • With 4 NICs per ESXi host, you have 40 TX + 40 RX metrics.
  • Is the vSAN vmkernel network congested?
  • Is the Read Cache used?
  • Is the Write Buffer sufficient?
  • Is the Cache Tier performing fast?
    • Each disk has 4 metrics: Read Cache Read Latency, Read Cache Write Latency, Write Buffer Write Latency, Write Buffer Read Latency
    • Since there are 20 cache disks, you need to check 80 counters.
  • Are the Capacity Disks performing fast?
    • Check both Read and Write latency.
    • With 120 capacity disks, that is 120 x 2 = 240 counters.
  • Is any Disk Group running low on space?
  • Is any Disk Group facing congestion?
    • You want to check both the max and the number of occurrences above 60.
  • Is there outstanding IO on any Disk Group?

If you add up the above, you are looking at 530 metrics for this vSAN cluster. And that’s just 1 point in time. In 1 month, at one sample every 5 minutes, you’re looking at 530 x 8766 = 4.6+ million data points!
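The arithmetic can be checked with a short sketch. It assumes a 5-minute collection interval and an average month of 365.25 / 12 days; both are my assumptions to reproduce the 8766 figure, and the function names are purely illustrative.

```python
# Assumption: every metric is sampled once per 5 minutes.
COLLECTION_INTERVAL_MINUTES = 5
# Assumption: an "average" month, i.e. 365.25 days / 12 = 30.4375 days.
AVERAGE_DAYS_PER_MONTH = 365.25 / 12


def samples_per_month(interval_minutes: int = COLLECTION_INTERVAL_MINUTES) -> int:
    """Number of samples one metric produces in an average month."""
    samples_per_day = 24 * 60 // interval_minutes
    return round(samples_per_day * AVERAGE_DAYS_PER_MONTH)


def data_points_per_month(metric_count: int,
                          interval_minutes: int = COLLECTION_INTERVAL_MINUTES) -> int:
    """Total data points for a given number of metrics in an average month."""
    return metric_count * samples_per_month(interval_minutes)


if __name__ == "__main__":
    print(samples_per_month())         # 8766
    print(data_points_per_month(530))  # 4645980
```

With a different collection interval the total changes proportionally, which is why the interval is kept as a parameter.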

How do you monitor millions of data points so you can be proactive?

vRealize Operations 6.7 sports vSAN KPIs. We collapsed each of those questions into a single KPI, so you only have 12 metrics to check instead of 530, without losing any insight. In fact, you get better early warning, because the KPIs do not rely on the average, which hides outliers. Early warning is critical, as buying hardware is more than a trip to the local DIY hardware store.

The KPIs achieve this simplification by using super metrics:

By using Min, Max and Count instead of the average, each KPI surfaces the early warning.
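To make the Min, Max and Count idea concrete, here is a minimal Python sketch. It is not the actual super metric formula; the `KpiSummary` and `summarize` names and the sample values are hypothetical. Only the congestion threshold of 60 comes from the questions listed earlier.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class KpiSummary:
    highest: float    # Max: the worst case when higher is worse (latency, congestion)
    lowest: float     # Min: the worst case when lower is worse (free space, hit rate)
    count_over: int   # Count: how many members breach the threshold


def summarize(values: Sequence[float], threshold: float) -> KpiSummary:
    """Collapse one metric across many objects into Min, Max and Count."""
    if not values:
        raise ValueError("need at least one value")
    return KpiSummary(
        highest=max(values),
        lowest=min(values),
        count_over=sum(1 for v in values if v > threshold),
    )


if __name__ == "__main__":
    # Hypothetical congestion readings for 5 disk groups.
    congestion = [10, 0, 75, 61, 60]
    print(summarize(congestion, threshold=60))
```

An average of these readings would be about 41 and look healthy, while the max (75) and the count above 60 (2) show that two disk groups already need attention.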

The KPIs have been a hit with customers. But they fall short when you have many vSAN clusters. If you have, say, 25 Hybrid clusters and 25 All Flash clusters, you need to check 50 clusters. While you can click 50 times, what you want is to see all 50 at the same time.

This means we need to aggregate the metrics further. There should be one and only one metric per cluster.

The challenge is that the KPIs have different units and scales. How do we normalize them into Green, Yellow, Orange and Red?

We do it by defining a normalization table. We need 1 table for All Flash and 1 for Hybrid, as they have different KPIs and thresholds. Here is the table for All Flash:

Read Cache Hit Rate (%) is missing from the above as it’s not applicable to All Flash, which does not have a dedicated Read Cache.

I’m setting CPU Ready and CPU Co-Stop at 1%, so we can catch early warning. For RAM, as most ESXi hosts sport 512 GB of RAM, I set the RAM Contention at 0%.

The metric I’m not sure about is Disk Group Congestion. Its threshold is based on 60, which I think is a good starting point in general.

Here is the table for Hybrid:

Do you know why I do not have a Utilization counter (e.g. CPU Utilization) there?

Utilization does not impact performance. An ESXi host running at 99% is not slower than one running at 1%, so long as there is no contention or latency. This is vSAN KPI, not vSAN KUI (Key Utilization Indicators). Yes, vSAN KUI needs its own table.

Once you have the table, you can map each KPI value to a threshold score. I use Green = 100, Yellow = 67, Orange = 33, Red = 0. I use a 0 – 100 scale so it’s easier to see the relative movement. If you don’t want it to be confused with %, you can use 0 – 10 or 0 – 50.

vSAN Performance is the average of all these scores. We are not taking the worst score, because a single bad KPI would then keep the cluster red all the time and the value would likely remain constant. That’s not good, as pattern is important in monitoring. The relative movement can be more important than the absolute value.
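The mapping and averaging can be sketched in Python as follows. The scores (100, 67, 33, 0) are the ones given above. The KPI names and the threshold bands are hypothetical placeholders, not the values from the normalization tables, and the sketch only covers KPIs where a higher value is worse.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Mapping

GREEN, YELLOW, ORANGE, RED = 100, 67, 33, 0


@dataclass(frozen=True)
class Band:
    """Lower bounds at which a higher-is-worse KPI turns yellow, orange and red."""
    yellow: float
    orange: float
    red: float


# Hypothetical bands for illustration only; substitute your own table.
HYPOTHETICAL_BANDS: Mapping[str, Band] = {
    "cpu_ready_pct": Band(yellow=1.0, orange=2.5, red=5.0),
    "disk_group_congestion": Band(yellow=20.0, orange=40.0, red=60.0),
    "write_latency_ms": Band(yellow=2.0, orange=5.0, red=10.0),
}


def score(value: float, band: Band) -> int:
    """Normalize one KPI value into the 0 - 100 color score."""
    if value >= band.red:
        return RED
    if value >= band.orange:
        return ORANGE
    if value >= band.yellow:
        return YELLOW
    return GREEN


def cluster_performance(kpis: Mapping[str, float],
                        bands: Mapping[str, Band] = HYPOTHETICAL_BANDS) -> float:
    """Average the normalized scores into one number per cluster."""
    return mean([score(value, bands[name]) for name, value in kpis.items()])


if __name__ == "__main__":
    sample = {"cpu_ready_pct": 0.5, "disk_group_congestion": 45.0, "write_latency_ms": 3.0}
    print(round(cluster_performance(sample), 1))
```

In this sample the three KPIs score 100, 33 and 67, so the cluster lands at about 66.7. Taking the minimum instead would pin it at 33 for as long as congestion stays in the orange band, which is the flat line the averaging is meant to avoid.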

You implement the above using super metrics. You need 2 super metrics, 1 for Hybrid and 1 for All Flash. For simplicity, I’d not use Policy but rather apply both super metrics to all my vSAN clusters. I then pick the correct super metric for each cluster when building the dashboard.

Hope you find it useful.

Purpose-driven Architecture

When you architect IaaS or DaaS, what end goals do you have in mind? I don’t mean the design considerations, such as best practices. I mean the business result that your architecture has to deliver. A sign that your architecture has failed to deliver is that you get into this situation:

The goal of IaaS is to ensure the VMs are running well. The goal of DaaS is to ensure End Users are getting a good desktop experience. Have you defined “well” or “good”?

Let’s zoom in to discuss IaaS. Say you’re architecting for 10K VMs in 2 datacenters. You envisage 2K VMs in the first month, then a ramp up to 10K within the first year. Do you know the basic info about each of these 10K VMs, so that you can architect an infra to serve them well?

  • How big are they? vCPU, RAM, Disk
  • How intense are they? CPU utilization, RAM utilization, Disk IOPS, Network throughput
  • What is their workload pattern? Daily, weekly, monthly, etc.

You don’t. Even the applications team doesn’t know. Their vendors don’t know either, as you’re talking about the future.

So why, then, do you promise that your IaaS will serve them well?

That’s a mistake you make as a Systems Architect. It’s akin to promising that the highway you architect will serve all the cars, buses and motorcycles well, when you have no idea how many there are and how often they will use it.

Can you do something about it?

Yes. You simply provide a good set of choices. The principle you share with your customers is the common sense used in all service industries:

You want it cheap, it won't be fast.
You want it fast, it won't be cheap.

You then offer a few classes of service. Give 2-3 good choices, at different price points. The highest price has the best performance.

  • Your price has to be cheaper than VMware on AWS, else what’s the point? VMware on AWS has an identical architecture to yours, as it’s using the same software and providing the same capabilities. This assures your customers that they are getting a good price.
  • Your performance is well defined. It is not subject to interpretation. You put a Performance SLA on the table, assuring your customers that you’re confident of delivering as promised.

You then architect your IaaS to deliver the above classes of service. The class of service is your business offering. It’s the purpose of your architecture. With class of service clearly defined, the question below becomes easy to answer.

When you know exactly the quality of service you need to deliver, the operations team will not suffer. You hand over your architecture to them with ease, as it can be operated easily. It has a clear definition of performance and capacity.

Keep the summary below in mind when you are architecting IaaS or DaaS.

For more details, review Operationalize Your World.

Keeping VMware Tools current

Keeping VMware Tools current is one of the best practices of vSphere operations. VMware Tools interfaces with both ESXi and VM (the virtual motherboard or virtual machine). Hence, there are 2 comparisons to consider:

  1. VM Hardware version
  2. ESXi version

From the vSphere API, here are the Tools version statuses you can get when you query it:

  • Guest Tools Current
  • Guest Tools Not Installed
  • Guest Tools Supported New
  • Guest Tools Supported Old
  • Guest Tools Too Old
  • Guest Tools Unmanaged

What do they mean?

  • Not Installed
    • Tools is not installed on the VM. You should install it, as you get both drivers and visibility.
  • Current
    • Tools version matches the Tools version available with ESXi. Each ESXi has a version of Tools that comes with it. See this for the list. This is the ideal scenario.
  • Supported New
    • Newer than the Tools version that comes with ESXi, but it is supported.
  • Supported Old
    • The opposite of Supported New. It is also supported. Even a version that is older by 0.0.1 is considered old; it does not have to be far behind.
  • Too Old
    • Tools version is older than the minimum supported version of Tools across all ESXi versions. The minimum supported version is the oldest version of Tools we support, so the guest is running unsupported Tools. As of now, for Linux and Windows guests, the minimum supported version is the Tools version bundled with ESXi 4.0, which is 8.0.1. Supporting such old versions is challenging, and we are planning to change the minimum to something newer in the future. In the meantime, you should upgrade, as Tools this old might not work as expected.
  • Unmanaged
    • Tools installed in the guest did not come from ESXi, so Tools is not being managed by the ESXi host. It may or may not be supported, depending on what type of Tools is running in the guest. We support open-vm-tools packaged by Linux vendors and OSPs, both of which show up as unmanaged.
    • If a customer builds their own open-vm-tools from source code, we may not support that, because we will not know if they have done it correctly or not.

Operationalize Your World has a dashboard that highlights the VMs whose Tools status is neither Current nor Supported New. You should expect the number to be minimal, or ideally none.
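If you want a quick check outside the dashboard, the same filter can be sketched in Python. This is an illustration, not how the dashboard is built. It assumes you have already collected a Tools version status string per VM, and that the strings are the camelCase forms of the labels above (for example `guestToolsCurrent`); verify the exact values against your vSphere API version. The VM names are hypothetical.

```python
from typing import List, Mapping

# Assumption: status strings are the camelCase forms of the labels in this post.
ACCEPTABLE_STATUSES = frozenset({"guestToolsCurrent", "guestToolsSupportedNew"})


def needs_attention(status: str) -> bool:
    """True when Tools is neither Current nor Supported New."""
    return status not in ACCEPTABLE_STATUSES


def vms_needing_attention(inventory: Mapping[str, str]) -> List[str]:
    """Return the names of VMs to review, sorted for stable reporting."""
    return sorted(name for name, status in inventory.items() if needs_attention(status))


if __name__ == "__main__":
    # Hypothetical inventory: VM name -> Tools version status.
    inventory = {
        "web-01": "guestToolsCurrent",
        "db-01": "guestToolsSupportedOld",
        "app-01": "guestToolsNotInstalled",
        "lnx-01": "guestToolsUnmanaged",
        "web-02": "guestToolsSupportedNew",
    }
    for name in vms_needing_attention(inventory):
        print(name, inventory[name])
```

Unmanaged VMs are flagged here on purpose. As explained above, they may well be supported (open-vm-tools from a Linux vendor), so treat them as a list to review rather than a list to fix.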