vRealize Operations tutorial videos

This post is part of the Operationalize Your World post. Do read it first to get the context.

I try not to duplicate videos already made by others, and will link to theirs instead if I find theirs to be relevant. In general, the videos are applicable to all 6.x versions as the features I used are fairly basic.

My videos have no sound, for 2 reasons: I tend to mispronounce words, and my English isn’t exactly clear. I added music from YouTube as compensation; hope you like it 🙂

The workshop does not cover Installation & Configuration, as there is plenty of material covering it, plus I’m only given around 4 hours to cover both vRealize Operations and Log Insight 🙂

Here are the videos so far:

  1. How to determine if a VM’s slowness is not caused by your shared infrastructure
    1. This is probably the #1 request I get.
    2. This demonstrates how useful Performance SLA is. Without it, you’re defenceless!
    3. I’ve added a variant to the dashboard, where you can also show the VM utilization.
  2. vSphere overall performance
    1. How is your IaaS serving the VMs?
    2. This shows that you need to define what you mean by Performance. It cannot be subjective!
  3. How are the VMs using your shared Storage and Network?
    1. Is there anyone abusing the shared IaaS?
    2. The challenge with Network IO and Storage IO is that you generally do not cap them. So it’s possible for a VM to use them excessively. The question is: who and when?
  4. How to create a super metric
    • I created my Top 3 most frequently asked super metrics. They are the first 3 I created in almost all my engagements.
    • I created multiple in 1 video so you can see that some steps can be done together. You can also import them, as per what Sunny has shared here.
  5. How to create a view
    1. I use the View widget heavily as it allows data transformation. This video also includes the steps for using it in a report.
  6. Tag and Custom Dashboard
    • Matthias Eisner shows you how to use Tags to group objects. He shows how to create tags, and he creates a custom dashboard to show an application. Justin, our UI Architect, also covers Custom Dashboard.
  7. How to create a multi-tenant structure
    • One of my customers enhanced this by having 2 levels: 1 tenant has many Apps, and 1 App has many VMs.
  8. How to create an alert for a specific customer
  9. How to create a new alert definition
    • Justin, our UI Architect, shares how it is done.

BTW, if you want to see them as 1 link, see this.

VMware Performance SLA

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Google “performance SLA” VMware, and you will find only a few relevant articles. The string performance SLA has to be in quotes, as I am not after performance and SLA separately, but Performance SLA. Yes, I’m after web pages with the words Performance SLA together. You will get many irrelevant results if you simply google VMware Performance SLA without the quotes.

I just tried it again (2 April 2017). It’s only 6000 results, up from 2330 results in Nov 2016, and 1640 results in Oct 2015. The first 10 are shown below. Notice 8 of them are actually from my blog, book or event. If you ask your peers, you will find that not many customers have a Performance SLA.

I didn't change the screenshot below, as it's similar to what I got in April 2017.


I checked beyond the first 10 results. Other than my own articles, Google returned only 5 relevant articles. The rest are not actually relevant. One example of a relevant article is by my former colleague and good friend, Michael Webster. All the relevant articles are good and informative, and they do mention Performance SLA. They just do not define and quantify what Performance SLA is. If something is not quantified, it is subjective. It’s hard to reach formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂

A former colleague, Scott Drummonds, covered Performance SLA in his old blog back in 2010. It is unsurprising to me, knowing Scott, that he had thought about it years ago! However, what he covered was the Application layer, not the IaaS layer. He also did not provide a counter to measure. Certainly, it was virtually impossible to provide that years ago, considering the maturity of IaaS at that time.

Availability SLA protects you when there is downtime. Performance SLA protects you when there is a performance issue. How?

If you are within your SLA, you are safe.

Below is an example of a Performance SLA. Describe (or define) the service for each of the 4 infrastructure components (CPU, RAM, Disk, and Network).

Take note:

  • Do not set the SLA at the individual vCPU level. Set it at the whole VM level. It’s much harder to comply and monitor at the per-vCPU level.
  • Do not set separate SLAs for Read latency and Write latency. Set one for the aggregate.
  • All numbers are measured as 5-minute averages. If a spike only lasts for 1 minute, then calms down for the remaining 4 minutes, it may not show up.
  • To change the power management settings, see this KB (VM application runs slower than expected in ESXi)
  • The SLA impacts your architecture. It poses a constraint as it sets a formal threshold.
    • Example: how do we ensure Tier 3 storage does not impact Tier 1, since they are on the same array (spindles, CPU, etc.)? Some storage arrays have shares internally. Even on vSAN, this is challenging.
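
To illustrate the note above about 5-minute averages, here is a minimal Python sketch. It assumes, purely for illustration, that latency is sampled every 20 seconds (15 samples per 5-minute window); the sample values are made up.

```python
# Hypothetical disk latency samples (ms) for one 5-minute window,
# assuming one sample every 20 seconds (15 samples in total).
# Three samples (1 minute) spike to 100 ms; the other 12 sit at 5 ms.
samples_ms = [5.0] * 12 + [100.0] * 3

average_ms = sum(samples_ms) / len(samples_ms)  # what a 5-minute average reports
peak_ms = max(samples_ms)                       # what the VM actually experienced

print(f"5-minute average: {average_ms:.1f} ms, peak: {peak_ms:.1f} ms")
```

The 1-minute spike to 100 ms is reported as a 24 ms average, which is why a short spike can go unnoticed when the SLA is defined on 5-minute averages.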

The above is only an example, as your policy as an IaaS provider may vary. For each component, list all the properties that impact the quality of the service.

Notice what’s missing in the table?

Something that you normally have if you are doing Capacity Management based on spreadsheet 🙂

Yup, it’s the Consolidation Ratio.

It’s not there because it’s not relevant. In fact, the ratio can be misleading as it does not take into account VM utilization, VM size, ESXi power management, DRS, backup period, etc. It is definitely a good guide for initial planning. Once you are in production, you need to monitor based on what’s happening in production. Performance experts like Mark Achtemichuk have explained it well here. I recommend you read it first.


Great! Let’s dive deeper. I will take one component, Storage, as it’s the easiest to understand.

  • VM Disk Latency for Tier 1: 5 ms
  • VM Disk Latency for Tier 2: 15 ms
  • VM Disk Latency for Tier 3: 25 ms
  • All values measured as 5-minute average.
  • The SLA is breached when the value exceeds the threshold in any given 5-minute period, 365 days a year.
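
The breach rule for the tiers listed above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample data are hypothetical, and it assumes you already have the 5-minute averages for a VM exported as a list of numbers.

```python
# Disk latency thresholds per tier, in ms (5-minute average), from the list above.
DISK_LATENCY_SLA_MS = {1: 5.0, 2: 15.0, 3: 25.0}

def breached_intervals(tier, five_min_averages_ms):
    """Return the indexes of the 5-minute intervals whose average exceeds the tier's SLA."""
    threshold = DISK_LATENCY_SLA_MS[tier]
    return [i for i, value in enumerate(five_min_averages_ms) if value > threshold]

# Hypothetical 5-minute averages (ms) for a VM on Tier 2 storage.
tier2_vm = [8.2, 12.0, 15.0, 17.5, 9.1]
print(breached_intervals(2, tier2_vm))
```

In this sketch a value exactly at the threshold (15.0 ms on Tier 2) is treated as within SLA, since the rule says the value has to exceed it. Only the fourth interval (17.5 ms) counts as a breach.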

When a VM owner complains that her VM is slow because of storage, and that VM resides on Tier 2 storage, both of you can see the VM disk latency. If it’s below 15 ms, it’s not your fault. Perhaps her application needs faster storage, and she can pay more and upgrade to Tier 1. If it is higher than 15 ms, you as the IaaS provider do not even have to wait until she complains 🙂 Better still, do something before she notices.

What number should you set?

  • If you have no data, the above is a good starting point.
  • If you have vR Ops running, you can set a number based on your actual data.
    • There is no point in setting something much higher or lower than what you actually have.
    • Use the super metric preview as shown below.
    • I plotted 1 month of data (roughly 8,640 data points, at one per 5 minutes over 30 days). That’s more than enough.
    • Take the maximum. That’s your baseline.
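
As a rough sketch of the last two steps, assuming the month of 5-minute averages has been exported from vR Ops into a plain Python list (the helper names and the data below are hypothetical):

```python
SAMPLES_PER_DAY = 24 * 60 // 5  # 288 five-minute intervals per day

def expected_samples(days):
    """Number of 5-minute data points collected over the given number of days."""
    return days * SAMPLES_PER_DAY

def baseline_ms(history_ms):
    """The baseline is simply the worst 5-minute average seen in the period."""
    return max(history_ms)

# Hypothetical month of disk latency: mostly 4 ms, with two worse intervals.
history = [4.0] * (expected_samples(30) - 2) + [11.5, 9.0]

print(expected_samples(30))
print(baseline_ms(history))
```

A 30-day month works out to 8,640 five-minute samples, and the baseline for this made-up history is 11.5 ms. You would then set the SLA around that number rather than far above or below it.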


Now… you have thousands of VMs under your management. I guess what you want is to be alerted if any of them breaches the SLA you promised.

Yes, vRealize Operations can alert you, so you can proactively do something before VM owner complains. Brandon Gordon, Integration Architect at VMware, showed me how we can achieve the above in vRealize Operations.

See the screenshot below, courtesy of Brandon.

[Screenshot: alert 4]

In order to get such alerts for each VM, you need to create and define the alerts. Brandon has defined them for CPU, RAM and Disk. He has also defined them for each tier.

[Screenshot: alert 2]

Hope you find it useful, just like many of my customers have. If not, drop me a note.

Presentation from VMware vForum Singapore

My good buddy Sunny Dua and I had the joy of co-presenting at VMware vForum Singapore. We had 2 sessions, but sadly he was only able to make it for 1 due to his engagement.

The first session was a 90-minute workshop, capped at just 40 people. In hindsight, we should have allocated more seats, as it filled up fast and folks were forming a long queue! The room was full.

The second session was a high-level session of just 30 minutes, with an audience of 200+. It was open to all.

There were a lot of questions during the 90-minute workshop, as the audience realized (pun intended) that they need to change their mindset. In all the years I’ve been doing performance and capacity management, it’s amazing how many customers are still not clear on the difference. This is not surprising, because there is no difference between performance and capacity in HDDC.

You can get the full deck from here. It builds from our VMworld deck, and we added more depth as we had more time.

Feel free to use it, and let us know how it has helped you. One thing that keeps me going in sharing the knowledge is the many emails, WhatsApp and LinkedIn messages I get from customers and partners on how changing their paradigm has helped them manage their SDDC better. They’d been managing it like a HDDC all along without realising it.

All the best in SDDC Operations!