VMware Performance SLA

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Google “performance SLA” VMware, and you will find only a few relevant articles. The string “performance SLA” has to be in quotes, because the search is not for performance and SLA separately, but for the exact phrase Performance SLA. Yes, I’m after web pages with the words Performance SLA together. You will get many irrelevant results if you simply google VMware Performance SLA without the quotes.

I just tried it again (2 April 2017). It returned only 6000 results, up from 2330 results in Nov 2016 and 1640 results in Oct 2015. The first 10 are shown below. Notice 8 of them are actually from my blog, book or event. If you ask your peers, you will find that not many customers have a Performance SLA.

I didn't change the screenshot below, as it's similar to what I got in April 2017.

[Screenshot: performance-sla-nov-2016]

I checked beyond the first 10 results. Other than my own articles, Google returned only 5 relevant articles. The rest are not relevant. One example of a relevant article is by Michael Webster, a former colleague and a good friend. All the relevant articles are good and informative, and they do mention Performance SLA. They just do not define and quantify what a Performance SLA is. If something is not quantified, it is subjective. It’s hard to reach formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂

A former colleague, Scott Drummonds, covered Performance SLA in his old blog back in 2010. It is unsurprising to me, knowing Scott, that he had thought about it years ago! However, what he covered was the Application layer, not the IaaS layer. He also did not provide a counter to measure it with. Certainly, it was virtually impossible to provide that years ago, considering the maturity of IaaS at that time.

Availability SLA protects you when there is downtime. Performance SLA protects you when there is a performance issue. How?

If you are within your SLA, you are safe.

Below is an example of a Performance SLA. Describe (or define) the service for each of the 4 infrastructure components (CPU, RAM, Disk, and Network).

Take note:

  • Do not set the SLA at the individual vCPU level. Set it at the whole VM level. It’s much harder to comply with and monitor at the per-vCPU level.
  • Do not set separate SLAs for Read latency and Write latency. Set one at the aggregate.
  • All numbers are measured as 5-minute averages. If a spike lasts only 1 minute and then calms down for the remaining 4 minutes, it is averaged out and may not show up as a breach.
  • To change the power management settings, see this KB (VM application runs slower than expected in ESXi).
  • The SLA impacts your architecture. It poses a constraint as it sets a formal threshold.
    • Example: how do we ensure Tier 3 storage does not impact Tier 1, since they are on the same array (spindles, CPU, etc.)? Some storage arrays have shares internally. Even on vSAN, this is challenging.
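
To make the 5-minute averaging point concrete, here is a minimal Python sketch. The sample values and the 15 ms threshold are made up for illustration, and it assumes one latency sample per minute, which is not necessarily how your monitoring tool collects its data.

```python
from statistics import mean

# Hypothetical per-minute disk latency samples (ms) within one 5-minute window:
# a 1-minute spike to 60 ms followed by four calm minutes at 3 ms.
window_ms = [60.0, 3.0, 3.0, 3.0, 3.0]

# Hypothetical SLA threshold applied to the 5-minute average (ms).
threshold_ms = 15.0

average_ms = mean(window_ms)
peak_ms = max(window_ms)

print(f"peak = {peak_ms} ms, 5-minute average = {average_ms} ms")
print("breach" if average_ms > threshold_ms else "within SLA")
```

The peak is 60 ms, yet the average is 14.4 ms, so this window still counts as within SLA. That is the trade-off of measuring on a 5-minute average.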

The above is an example, as your policy as an IaaS provider may vary. For each component, list all the properties that impact the quality of the service.

Notice what’s missing in the table?

Something that you normally have if you are doing Capacity Management based on a spreadsheet 🙂

Yup, it’s the Consolidation Ratio.

It’s not there because it’s not relevant. In fact, the ratio can be misleading as it does not take into account VM utilization, VM size, ESXi power management, DRS, backup period, etc. It is definitely a good guide for initial planning. Once you are in production, you need to monitor based on what’s happening in production. Performance experts like Mark Achtemichuk have explained it well here. I recommend you read it first.

Done?

Great! Let’s dive deeper. I will take one component, Storage, as it’s the easiest to understand.

  • VM Disk Latency for Tier 1: 5 ms
  • VM Disk Latency for Tier 2: 15 ms
  • VM Disk Latency for Tier 3: 25 ms
  • All values measured as 5-minute average.
  • The SLA is breached when the value exceeds the threshold in any given 5-minute period, 365 days a year.
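
The breach rule above can be sketched in Python as follows. The thresholds come from the list above. The function name and the sample series are hypothetical, and the input is assumed to already be a series of 5-minute averages.

```python
# Disk latency SLA per tier, in ms (5-minute average), from the list above.
DISK_LATENCY_SLA_MS = {1: 5.0, 2: 15.0, 3: 25.0}


def breached_intervals(tier, latencies_ms):
    """Return the indexes of the 5-minute intervals that exceed the tier's SLA."""
    limit = DISK_LATENCY_SLA_MS[tier]
    # A value equal to the threshold is treated as compliant; only values above it breach.
    return [i for i, value in enumerate(latencies_ms) if value > limit]


# Hypothetical 5-minute averages for one Tier 2 VM.
samples_ms = [4.2, 16.1, 14.9, 15.0, 22.7]
print(breached_intervals(2, samples_ms))  # [1, 4]
```

The same series checked against Tier 3 returns no breaches, which is why the tier a VM sits on has to be part of the conversation with the VM owner.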

When a VM owner complains that her VM is slow because of storage, and that VM resides on Tier 2 storage, both of you can see the VM disk latency. If it’s below 15 ms, it’s not your fault. Perhaps her application needs faster storage, and she can pay more and upgrade to Tier 1. If it is higher than 15 ms, you as the IaaS provider do not even have to wait until she complains 🙂 Better still, do something before she notices.

What number should you set?

  • If you have no data, the above is a good starting point.
  • If you have vR Ops running, you can set a number based on your actual data.
    • There is no point in setting something much higher or lower than what you actually have.
    • Use the super metric preview as shown below.
    • I plotted 1 month of data (30 days of 5-minute samples is about 8640 data points). That’s more than enough.
    • Take the maximum. That’s your baseline.
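
If you export the history instead of using the super metric preview, the same idea can be sketched in Python. This is not vR Ops super metric syntax. The function names and the short sample history are hypothetical, and the input is assumed to be a list of 5-minute averages in ms.

```python
def baseline_latency_ms(history_ms):
    """Return the worst (maximum) 5-minute average seen in the history."""
    if not history_ms:
        raise ValueError("no history; fall back to the starting-point numbers")
    return max(history_ms)


def headroom_ms(candidate_sla_ms, history_ms):
    """Candidate SLA minus baseline; a negative result means the history already breaches it."""
    return candidate_sla_ms - baseline_latency_ms(history_ms)


# Hypothetical history, shortened to a handful of points for readability.
history_ms = [3.1, 4.8, 12.4, 6.0, 9.7]

print(baseline_latency_ms(history_ms))          # 12.4
print(round(headroom_ms(15.0, history_ms), 1))  # 2.6
print(round(headroom_ms(5.0, history_ms), 1))   # -7.4
```

With this history, a 15 ms SLA sits just above what the environment actually delivers, while a 5 ms SLA would have been breached already.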

Alerts

Now… you have thousands of VMs under your management. I guess what you want is to be alerted if any of them breaches the SLA you promised.

Yes, vRealize Operations can alert you, so you can proactively do something before the VM owner complains. Brandon Gordon, Integration Architect at VMware, showed me how we can achieve the above in vRealize Operations.

See the screenshot below, courtesy of Brandon.

[Screenshot: alert 4]

In order to get such alerts for each VM, you need to create and define the alerts. Brandon has defined them for CPU, RAM and Disk, and he has done so for each tier.

[Screenshot: alert 2]
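
The per-tier, per-component logic behind such alerts can be sketched in Python. This is not how vRealize Operations defines alerts or symptoms; it only illustrates the idea. Only the disk latency thresholds come from this post. The CPU and RAM metric names and numbers are placeholders I made up, so replace them with your own SLA.

```python
from dataclasses import dataclass

# Thresholds per tier and metric. Disk values are from this post;
# the CPU and RAM entries are hypothetical placeholders, not recommendations.
SLA = {
    1: {"disk_latency_ms": 5.0, "cpu_contention_pct": 1.0, "ram_contention_pct": 1.0},
    2: {"disk_latency_ms": 15.0, "cpu_contention_pct": 3.0, "ram_contention_pct": 3.0},
    3: {"disk_latency_ms": 25.0, "cpu_contention_pct": 5.0, "ram_contention_pct": 5.0},
}


@dataclass
class VmSample:
    name: str
    tier: int
    metrics: dict  # metric name -> latest 5-minute average


def evaluate(samples):
    """Return one alert message per VM metric that exceeds its tier's threshold."""
    alerts = []
    for vm in samples:
        for metric, limit in SLA[vm.tier].items():
            value = vm.metrics.get(metric)  # a metric that was not collected is skipped
            if value is not None and value > limit:
                alerts.append(
                    f"{vm.name}: {metric}={value} exceeds Tier {vm.tier} SLA of {limit}"
                )
    return alerts


# Hypothetical fleet of three VMs.
fleet = [
    VmSample("vm-a", 1, {"disk_latency_ms": 6.2, "cpu_contention_pct": 0.4}),
    VmSample("vm-b", 2, {"disk_latency_ms": 9.0, "cpu_contention_pct": 4.5}),
    VmSample("vm-c", 3, {"disk_latency_ms": 11.0}),
]

for line in evaluate(fleet):
    print(line)
```

Here vm-a is flagged for disk latency and vm-b for CPU, while vm-c stays quiet because 11 ms is well within Tier 3.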

Hope you find this useful, just like many of my customers have. If not, drop me a note.