VMware Performance SLA

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Google “performance SLA” VMware, and you will find only a few relevant articles. The phrase “performance SLA” has to be in quotes, because I’m not after pages that mention performance and SLA separately. I’m after web pages where the words Performance SLA appear together. You will get many irrelevant results if you simply google VMware Performance SLA without the quotes.

I just tried it again (2 April 2017). It returned only 6000 results, up from 2330 results in Nov 2016 and 1640 results in Oct 2015. The first 10 are shown below. Notice that 8 of them are actually from my blog, book or event. If you ask your peers, you will find that not many customers have a Performance SLA.

I didn't change the screenshot below, as it's similar to what I got in April 2017.

performance-sla-nov-2016

I checked beyond the first 10 results. Other than my own articles, Google returned only 5 relevant articles. The rest are not relevant. One example of a relevant article is by a former colleague and good friend, Michael Webster. All the relevant articles are good and informative, and they do mention Performance SLA. They just do not define and quantify what a Performance SLA is. If something is not quantified, it is subjective. It’s hard to reach formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂

A former colleague, Scott Drummonds, covered Performance SLA in his old blog back in 2010. It is unsurprising to me, knowing Scott, that he had thought about it years ago! However, what he covered was the Application layer, not the IaaS layer. He also did not provide a counter to measure. To be fair, it was virtually impossible to provide one back then, considering the maturity of IaaS at that time.

An Availability SLA protects you when there is downtime. A Performance SLA protects you when there is a performance issue. How?

If you are within your SLA, you are safe.

Below is an example of a Performance SLA. Describe (or define) the service for each of the 4 infrastructure components (CPU, RAM, Disk, and Network).

Take note:

  • Do not set the SLA at the individual vCPU level. Set it at the whole VM level. It’s much harder to comply with and monitor an SLA per vCPU.
  • Do not set separate SLAs for Read and Write latency. Set one SLA on the aggregate.
  • All numbers are measured as 5-minute averages. If a spike lasts only 1 minute and then calms down for the remaining 4 minutes, it gets diluted in the average and may not show up.
  • To change the power management settings, see this KB (VM application runs slower than expected in ESXi)
  • The SLA impacts your architecture. It poses a constraint as it sets a formal threshold.
    • Example: how do we ensure Tier 3 storage does not impact Tier 1 when they are on the same array (spindles, CPU, etc.)? Some storage arrays have shares internally. Even on vSAN, this is challenging.

The above is only an example, as your policy as an IaaS provider may vary. For each component, list all the properties that impact the quality of the service.

Notice what’s missing in the table?

Something that you normally have if you are doing Capacity Management based on a spreadsheet 🙂

Yup, it’s the Consolidation Ratio.

It’s not there because it’s not relevant. In fact, the ratio can be misleading, as it does not take into account VM utilization, VM size, ESXi power management, DRS, backup period, etc. It is definitely a good guide for initial planning. Once you are in production, you need to monitor based on what’s happening in production. Performance experts like Mark Achtemichuk have explained it well here. I recommend you read it first.

Done?

Great! Let’s dive deeper. I will take one component, Storage, as it’s the easiest to understand.

  • VM Disk Latency for Tier 1: 5 ms
  • VM Disk Latency for Tier 2: 15 ms
  • VM Disk Latency for Tier 3: 25 ms
  • All values measured as 5-minute average.
  • The SLA is breached when the value exceeds the threshold in any 5-minute period, 365 days a year.
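To make the 5-minute rule concrete, here is a minimal Python sketch. It assumes you already have per-minute disk latency samples for a VM. The sample values and the function names are made up for illustration, and this is not how vRealize Operations computes anything internally. It simply shows how a 1-minute spike can hide inside a 5-minute average, while a sustained problem does not.

```python
from statistics import mean

# Tier thresholds from the list above, in milliseconds.
SLA_MS = {1: 5.0, 2: 15.0, 3: 25.0}

def five_minute_averages(per_minute_ms):
    """Collapse 1-minute latency samples into 5-minute averages.

    Trailing samples that do not fill a whole 5-minute window are ignored.
    """
    usable = len(per_minute_ms) - len(per_minute_ms) % 5
    return [mean(per_minute_ms[i:i + 5]) for i in range(0, usable, 5)]

def breaches(per_minute_ms, tier):
    """Return the indexes of 5-minute windows whose average exceeds the tier SLA."""
    limit = SLA_MS[tier]
    averages = five_minute_averages(per_minute_ms)
    return [i for i, avg in enumerate(averages) if avg > limit]

# Hypothetical Tier 2 VM: a 1-minute spike to 40 ms in the first window,
# then a sustained 20 ms in the second window.
samples = [8, 40, 8, 8, 8] + [20, 20, 20, 20, 20]

print(five_minute_averages(samples))  # first window averages 14.4 ms, second 20 ms
print(breaches(samples, tier=2))      # only the second window breaches 15 ms
```

The first window averages 14.4 ms, so the 40 ms spike does not breach the Tier 2 SLA. The second window does. If that VM were on Tier 3 storage (25 ms), neither window would be a breach.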

When a VM owner complains that her VM is slow because of storage, and that VM resides on Tier 2 storage, both of you can see the VM disk latency. If it’s below 15 ms, it’s not your fault. Perhaps her application needs faster storage, and she can pay more and upgrade to Tier 1. If it is higher than 15 ms, you as the IaaS provider do not even have to wait until she complains 🙂 Better still, do something before she notices.

What number should you set?

  • If you have no data, the above is a good starting point.
  • If you have vR Ops running, you can set a number based on your actual data.
    • There is no point in setting something much higher or lower than what you actually have.
    • Use the super metric preview as shown below.
    • I plotted 1 month of data (roughly 8640 data points at 5-minute intervals for a 30-day month). That’s more than enough.
    • Take the maximum. That’s your baseline.
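If you export the history instead of eyeballing the chart, the same reasoning can be sketched in a few lines of Python. This is only an illustration: the `history` list below is synthetic data I made up, and `breach_ratio` is a hypothetical helper, not a vRealize Operations function. It shows how many 5-minute samples a month contains, how to take the maximum as the baseline, and how to check how often a candidate SLA would have been breached.

```python
SAMPLES_PER_DAY = 24 * 60 // 5  # 288 five-minute samples per day

def samples_in(days):
    """Number of 5-minute data points in the given number of days."""
    return days * SAMPLES_PER_DAY

def baseline_ms(history_ms):
    """The worst 5-minute average seen in the period."""
    return max(history_ms)

def breach_ratio(history_ms, candidate_sla_ms):
    """Fraction of 5-minute samples that would have breached a candidate SLA."""
    over = sum(1 for value in history_ms if value > candidate_sla_ms)
    return over / len(history_ms)

# Synthetic 30-day history for illustration only: latency cycles between 4 and 10 ms.
history = [4 + (i % 7) for i in range(samples_in(30))]

print(samples_in(30))             # 8640
print(baseline_ms(history))       # 10
print(breach_ratio(history, 15))  # 0.0, so a 15 ms SLA is comfortably met
print(breach_ratio(history, 5))   # a 5 ms SLA would be breached most of the time
```

With data like this, promising 5 ms would be setting yourself up to fail, while 15 ms is something you already deliver.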

Alerts

Now… you have thousands of VMs under your management. I guess what you want is to be alerted if any of them breaches the SLA you promised.

Yes, vRealize Operations can alert you, so you can proactively do something before the VM owner complains. Brandon Gordon, Integration Architect at VMware, showed me how we can achieve the above in vRealize Operations.

See the screenshot below, courtesy of Brandon.

alert 4

To get such alerts for each VM, you need to create and define the alerts. Brandon has defined them for CPU, RAM and Disk, and for each tier.

alert 2
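The logic behind such per-tier alerts is simple, and a rough Python sketch may help if you want to reason about it outside the product. Everything here is an assumption for illustration: the `VmReading` class, the VM names and the latency values are invented, and this is not Brandon’s alert definition. Only the disk thresholds come from the example SLA earlier in this post. You would add CPU and RAM thresholds from your own SLA table.

```python
from dataclasses import dataclass

# Disk latency SLA per tier in milliseconds (from the example SLA in this post).
SLA_MS = {1: 5.0, 2: 15.0, 3: 25.0}

@dataclass
class VmReading:
    name: str
    tier: int
    disk_latency_ms: float  # latest 5-minute average

def sla_breaches(readings):
    """Return the VMs above their tier's SLA, worst offender first.

    Severity is the ratio of measured latency to the tier threshold,
    so a Tier 1 VM at 9 ms ranks above a Tier 2 VM at 16.5 ms.
    """
    over = [r for r in readings if r.disk_latency_ms > SLA_MS[r.tier]]
    return sorted(over, key=lambda r: r.disk_latency_ms / SLA_MS[r.tier], reverse=True)

# Hypothetical fleet snapshot.
fleet = [
    VmReading("app-01", 1, 4.2),
    VmReading("db-01", 1, 9.0),
    VmReading("web-07", 2, 16.5),
    VmReading("batch-03", 3, 22.0),
]

for r in sla_breaches(fleet):
    print(f"{r.name}: {r.disk_latency_ms} ms exceeds the Tier {r.tier} SLA of {SLA_MS[r.tier]} ms")
```

Sorting by the ratio rather than the raw latency keeps the list fair across tiers: 22 ms on Tier 3 is fine, while 9 ms on Tier 1 is a serious breach.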

I hope you find this useful, just like many of my customers have. If not, drop me a note.

12 KPIs for high performance VDI

How do we know a user is getting good performance on her/his VDI session? When she calls the help desk and says that her Windows is slow, how can the help desk quickly determine where the root cause is?

I covered how VDI workload differs from server workload here. In this context, the PCoIP metrics certainly play a key role in determining a good user experience.

There are 4 basic elements of infrastructure (CPU, RAM, Network and Storage). For each, there are multiple metrics that can impact the user experience. I came up with the following 12 metrics. For each, I give my 2 cents on what I think a healthy value should be for a “snappy Windows”: something that matches the experience you have on a US $1000 PC (with a 27″ monitor).

VDI high performance

[12 Jan 2016: V4V 6.2 can monitor the Disk Queue Length]

You should certainly establish your own threshold. The good thing is you can use vRealize Operations super metric preview to see your historical data.

By checking 12 metrics, you are far more comprehensive than by checking, say, 4 metrics.

The above counters require vRealize Operations for Horizon View (V4V for short), because they rely on an agent inside the Guest OS (Windows in the VDI case). You do not get PCoIP metrics from standard vRealize Operations, even if you install the End Point agent. Those metrics come from the View agent, which includes the V4V agent.

Can you spot which metrics are missing?

There are no Memory Consumed or Memory Active metrics. See this for the explanation.

Take note that V4V 6.2 cannot measure Disk Queue Length. It’s not a major limitation as Disk Queue Length typically develops if the disk latency is high, or there is an issue with the driver.

Once you have the above metrics, it is pretty easy to create a dashboard for the Help Desk. Here is what it looks like. I’m only showing the top part of the dashboard.

VDI high performance 2

The help desk needs just 2 steps:

  1. Search for the user. We use the MS AD login ID. In the search above, only 1 result is shown because I typed the full user ID and there is only 1 match. Once you select the user, all the counters from the Horizon session (from the V4V adapter) are shown automatically.
  2. Click on the VM object. This displays the VM counters (from the vSphere adapter).