Tag Archives: Performance

How to prove your IaaS Cluster is fast

This post is part of the Operationalize Your World program. Do read it first so you get the context.

You provide IaaS to your customers. A cluster, be it a VMware vSphere Cluster or a Microsoft Hyper-V Cluster, is a key component of this service offering. In the hyper-converged era, your cluster does storage too. You take time to architect it, making sure all best practices are considered. You also consider performance and ensure it’s not a slow system.

Why is it then, when a VM owner complains that her VM is slow and she blames your infrastructure, you start troubleshooting? Doesn’t that show your lack of confidence in your own creation? If your cluster is indeed fast, why can’t you just show it and be done in 60 seconds?

Worth pondering, isn’t it? 😉

It’s hard to reach formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂

You need to be able to show something like this.

In the above chart, there is a threshold. It defines what an acceptable level of performance is. It quantifies what exactly you mean when you promise “fast”. It is your Performance SLA. Review this if you need more details.

You assure them it will be fast, and you back it up with measurable metrics. You prove that with the 2nd line. That’s the actual performance. Fast or not is no longer debatable.

You measure performance every 5 minutes, not every hour. In a 30-day month, that is 12 x 24 x 30 = 8640 proofs. Having that many data points backing you up helps in showing that you’re doing your job.
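The arithmetic is easy to check. Here is a minimal Python sketch; the function name and the 30-day month are my own assumptions for illustration.

```python
# Count how many 5-minute samples are collected in a month.
SAMPLES_PER_HOUR = 60 // 5   # one sample every 5 minutes -> 12 per hour
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30          # assumption: a 30-day month


def samples_per_month(days: int = DAYS_PER_MONTH) -> int:
    """Return the number of 5-minute data points collected over `days` days."""
    return SAMPLES_PER_HOUR * HOURS_PER_DAY * days


print(samples_per_month())    # 8640
print(samples_per_month(31))  # 8928
```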

Now that you’ve got the Performance SLA, how do you implement it in vRealize Operations?

I’ll take disk latency as an example, as it’s easy to understand.

The chart below shows the disk latency of 6 VMs, from 9:00 am until 9:35 am. What do you spot?

The average is good. They are mostly below 5 ms.

The worst is bad. It is hovering around 20 ms. It is 4x higher than average, indicating a VM is hit. The storage subsystem is unable to serve all VMs. It’s struggling to deliver.

Let’s plot a line along the worst (highest) disk latency. The bold red line is the maximum among the disk latencies of all the VMs. We call this Max (VM Disk Latency) in the cluster.

A cluster typically has a lot more than 6 VMs. It’s common to see >100 VMs. Plotting >100 lines will make the chart unreadable. Plus, at this juncture, you’re interested in the big picture first. You want to know if the cluster is performing fast.

This is the power of a super metric. It tracks the maximum among all VMs, creating its own metric as a result. You lose the information on which VM produced the maximum, as the super metric is made of more than 1 VM.
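To make the idea concrete, here is a rough Python sketch of what such a roll-up computes. It is not vRealize Operations super metric syntax; the VM names and latency values are made up for illustration.

```python
from statistics import mean

# Hypothetical disk latency in ms: one list per VM, one value per 5-minute interval.
latency_ms = {
    "vm-01": [2.0, 3.0, 2.5],
    "vm-02": [4.0, 21.0, 3.0],
    "vm-03": [1.5, 2.0, 19.0],
}


def rollup(samples):
    """Return a (maximum, average) pair across all VMs for each interval."""
    per_interval = zip(*samples.values())  # regroup the values by interval
    return [(max(values), mean(values)) for values in per_interval]


for worst, avg in rollup(latency_ms):
    print(f"max={worst:.1f} ms  avg={avg:.2f} ms")
```

Note that the result carries only the numbers. The VM name behind each maximum is gone, which is the trade-off described above.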

The next chart has all the details removed. We just have the Maximum and the Average. It’s clear now that the max is much higher than average.

We added 3 dotted lines in the above chart. They are the 3 possible outcomes. If your Maximum is:

  • below the threshold, then you are good. The cluster is serving all its VMs well.
  • near the threshold, then your capacity is full. Do not add more VMs.
  • above the threshold, then your cluster is full. Move VMs to reduce demand before VM owners complain.
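The three outcomes can be sketched as a small Python function. What counts as “near” is not defined in this post, so the 20% margin below is purely my assumption; pick whatever margin suits your environment.

```python
def classify(max_latency_ms: float, threshold_ms: float, margin: float = 0.2) -> str:
    """Classify the cluster-wide maximum against the Performance SLA threshold.

    `margin` is an assumed definition of "near": within 20% below the threshold.
    """
    if max_latency_ms > threshold_ms:
        return "above"  # cluster is full: move VMs to reduce demand
    if max_latency_ms >= threshold_ms * (1 - margin):
        return "near"   # capacity is full: do not add more VMs
    return "below"      # all VMs are served well


for value in (4.0, 9.0, 20.0):
    print(value, classify(value, threshold_ms=10.0))
```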

Can you see the importance of the Performance SLA?

It’s there to protect your job. Without the line, your reputation is at risk. Say you’ve been delivering Disk Latency at <1 ms on your all flash SSD array. Everyone is happy. Of course! 🙂

You then do storage maintenance for just 1 hour. During that period, disk latency goes up to 4 ms. It is still a respectable number. In fact, it’s a pretty damn good number. But you get a complaint. It happens to coincide with the time you did the maintenance.

Can you guess who is responsible for the slowness experienced by the business?

You bet. Your fault 🙁

But if you have established a Performance SLA, you’re protected. Say you promise 5 ms. You will be able to say “Yes, I knew it would go up as we’re doing maintenance. I’ve catered for this in my design. I knew we could still deliver as per the agreed SLA.”
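Here is a minimal Python sketch of that argument. The 5 ms SLA and the 4 ms maintenance latency come from the example above; the 0.8 ms baseline and the function name are assumptions for illustration.

```python
SLA_MS = 5.0  # the promised threshold from the example


def sla_compliance(samples, sla_ms):
    """Return (number of breaches, percentage of samples within the SLA)."""
    breaches = sum(1 for s in samples if s > sla_ms)
    return breaches, 100.0 * (len(samples) - breaches) / len(samples)


# One day of 5-minute samples: an assumed 0.8 ms baseline, then 4 ms
# during the 1-hour maintenance window (12 samples).
day = [0.8] * 276 + [4.0] * 12

breaches, pct = sla_compliance(day, SLA_MS)
print(f"{len(day)} samples, {breaches} breaches, {pct:.1f}% within SLA")
```

Latency rose 5x during the maintenance, yet not a single sample crossed the agreed line.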

Let’s now show a real example. This is what it actually looks like in vR Ops 6.4.

Notice the Maximum is >10x higher than the average, and the average is very stable. Once the Cluster is unable to cope, you’d see a pattern like this. Almost all VMs can be served, but 1-2 are not served well. The maximum is high because there is always 1 VM that isn’t served.

Only when the Cluster is unable to serve ~50% of the VMs will the average become high too.
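A quick Python sketch shows why the average hides the problem. The cluster size and latency values are hypothetical; only the pattern matters.

```python
from statistics import mean


def cluster_latency(total_vms, slow_vms, normal_ms=2.0, slow_ms=30.0):
    """Return (max, average) latency when `slow_vms` out of `total_vms` are hit."""
    values = [slow_ms] * slow_vms + [normal_ms] * (total_vms - slow_vms)
    return max(values), mean(values)


# Compare a healthy cluster, 2 VMs hit, and half of the VMs hit.
for slow in (0, 2, 50):
    worst, avg = cluster_latency(100, slow)
    print(f"{slow:>3} slow VMs: max={worst:.1f} ms  avg={avg:.2f} ms")
```

With 2 slow VMs out of 100, the maximum jumps to 30 ms while the average barely moves from 2 ms. The average only becomes alarming once around half the VMs are affected.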

BTW, do you notice the metric names differ?

  • The Max is a super metric.
  • The Average is a regular metric

This is because metrics at higher-level objects (e.g. cluster, host) are all averages. None of them is the real peak. Review this “when is a peak not a true peak” article.

The above is for Disk. IaaS consists of providing the following as a service:

  1. CPU
  2. RAM
  3. Disk
  4. Network

Hence we need to display 4 line charts, showing that each service is delivered well.

As every Service Tier performs differently, you need to show it per service tier. A Gold Tier delivers faster performance than a Silver Tier, but if its actual value is higher than its own SLA threshold, it’s still not performing. Performance is relative to what you promise.

Since VMs move around in a cluster due to DRS and HA, we need to track at Cluster level. Tracking at Resource Pool level is operationally challenging. Do not mix service tiers, as Tier 3 performance can impact Tier 1. The only way you can protect the higher tier is with Reservation, which has its own operational complications.

Once I know what to display, I’d normally do a whiteboard, often with customers. It helps me to think clearly.

This is what the dashboard looks like. It starts with a list of clusters. Selecting a cluster will automatically show its performance. It shows CPU, RAM and Disk. Network dropped packets should be 0 at all times, hence they are not shown. You can track them at data center level, not cluster.

The final dashboard can be seen here. As performance has to be considered in capacity, we show how it’s done in a series of posts here.

Proving your VDI is performing well

How do we know a user is getting good performance on her/his VDI session? When she calls the help desk and says that her Windows is slow, how can the help desk quickly determine where the root cause is?

I shared the metrics you need, and showed a sample dashboard here. There were a few requests to add the threshold line. This makes it easier for the Help Desk team. They just need to know that Actual should be better than Threshold.

So here is the updated dashboard. I’ve added a Threshold metric for each counter. You do this by creating a super metric, with a constant value. For example, if the CPU Contention you tolerate is 2%, then simply specify 2 as the super metric formula. No need to add the %.
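The comparison the Help Desk makes can be sketched in Python as follows. The counter names and the disk latency threshold are hypothetical; only the 2% CPU Contention threshold comes from the example above. I also assume that lower is better for every counter shown.

```python
# Constant thresholds, one per counter (the same idea as a constant super metric).
thresholds = {"cpu_contention_pct": 2.0, "disk_latency_ms": 10.0}

# Hypothetical actual values for one VDI session.
session_actual = {"cpu_contention_pct": 3.5, "disk_latency_ms": 4.0}


def breached_counters(actual, threshold):
    """Return the names of counters whose actual value is worse than the threshold."""
    return sorted(name for name, value in actual.items() if value > threshold[name])


print(breached_counters(session_actual, thresholds))  # ['cpu_contention_pct']
```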


Compared to the previous example, you will notice it’s neater. I’ve organised the widgets, so Network widgets are grouped together, CPU widgets are grouped together, etc. I’ve also renamed the metrics to just Actual and Threshold. Lastly, I’ve removed the units and displayed them in the widget title bar.

The help desk just needs 3 steps:

  1. Search for the user. We use the MS AD Login ID. I did a search above. It’s showing 2 results as this user ID has 2 concurrent active sessions (on different Horizon pools).
  2. Select a session. All the counters from the Horizon session (from the V4H adapter) are automatically shown.
  3. Click on the VM object. This is to display the VM counters (from the vSphere adapter).

Hope it helps!

Released – VMware Performance and Capacity Management

Glad to share that the 2nd edition is finally out. It is now available for order. If you use Amazon, it’s here.

From both the amount of effort, and the resultant book, this to me is more like 2.0 than 1.1. Page wise, it is 500+ pages, doubling the 1st edition. The existing 8 chapters have been expanded and reorganized, resulting in 15 chapters.

It now has 3 distinct parts, whereas the 1st edition had none. The 3 parts are structured specifically in that order to make it easier for you to see the big picture. You will find the key changes versus the 1st edition below.

It’s a surprise how much things changed in just 14 months. I certainly did not expect some of the changes back in Jan 2015!

  • Major improvement in monitoring beyond vSphere.
    • vRealize and its ecosystem have improved hugely in Storage, Network and Application monitoring. This includes newer technology such as VSAN and NSX.
    • Many adapters (management packs) and content packs were released for both vRealize Operations and vRealize Log Insight. I’m glad to see a thriving ecosystem. Blue Medora especially has moved ahead very fast.
  • Rapid adoption of NSX and VSAN, so much so that I had to add them. They were not part of the original plan for the 2nd edition.
  • Rapid adoption of VDI monitoring using vRealize. I had to include VDI use cases.
  • Adoption by customers, partners and internal teams has increased.
    • In the original plan, I wasn’t planning on asking any partners to contribute. So I’m surprised that 2 partners agreed right away.
    • It is much easier to ask for review, as people are interested and want to help.
  • vSphere 6.0, 6.0 U1 and U2 were released.
    • Since the book focuses on Operations (and not architecture), the impact of these releases is very minimal.
    • Very few counters have changed since vSphere 5.5.
  • vRealize Operations had 6.1 and 6.2 releases. Log Insight had many releases too.
    • Again, this has minimal impact, since the book is not a product book.

You can find more details of the book here.