Monthly Archives: January 2016

VMware Performance SLA

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Google “performance SLA” VMware, and you will find only a few relevant articles. The string “performance SLA” has to be in quotes, because I’m after web pages that contain the exact phrase Performance SLA, not pages that mention performance and SLA separately. You will get many irrelevant results if you simply google VMware Performance SLA without the quotes.

I just tried it again (2 April 2017). It’s only 6000 results, up from 2330 results in Nov 2016, and 1640 results in Oct 2015. The first 10 are shown below. Notice 8 of them are actually from my blog, book or event. If you ask your peers, you will find that not many customers have a Performance SLA.

I didn't change the screenshot below, as it's similar to what I got in April 2017.

performance-sla-nov-2016

I checked beyond the first 10 results. Other than my own articles, Google returned only 5 relevant articles. The rest are not relevant. An example of a relevant article is one by Michael Webster, a former colleague and a good friend. All the relevant articles are good and informative, and they do mention Performance SLA. They just do not define and quantify what a Performance SLA is. If something is not quantified, it is subjective. It’s hard to reach formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂

A former colleague, Scott Drummonds, covered Performance SLA in his old blog back in 2010. It is unsurprising to me, knowing Scott, that he had thought about it years ago! However, what he covered was the Application layer, not the IaaS layer. He also did not provide a counter to measure. Certainly, it was virtually impossible to provide that years ago, considering the maturity of IaaS at that time.

An Availability SLA protects you when there is downtime. A Performance SLA protects you when there is a performance issue. How?

If you are within your SLA, you are safe.

Below is an example of a Performance SLA. Describe (or define) the service for each of the 4 infrastructure components (CPU, RAM, Disk, and Network).

Take note:

  • Do not set the SLA at the individual vCPU level. Set it at the whole VM level. It’s much harder to comply with and monitor at the per-vCPU level.
  • Do not set separate SLAs for Read latency and Write latency. Set one on the aggregate.
  • All numbers are measured as 5-minute averages. If a spike lasts only 1 minute and then calms down for the remaining 4 minutes, it gets averaged out and may not show up.
  • To change the power management settings, see this KB (VM application runs slower than expected in ESXi)
  • The SLA impacts your architecture. It poses a constraint as it sets a formal threshold.
    • Example: how do we ensure Tier 3 storage does not impact Tier 1 when they are on the same array (spindles, CPU, etc.)? Some storage arrays have shares internally. Even on vSAN, this is challenging.

The above is an example; your policy as an IaaS provider may vary. For each component, list all the properties that impact the quality of the service.

Notice what’s missing in the table?

Something that you normally have if you are doing Capacity Management based on a spreadsheet 🙂

Yup, it’s the Consolidation Ratio.

It’s not there because it’s not relevant. In fact, the ratio can be misleading, as it does not take into account VM utilization, VM size, ESXi power management, DRS, backup period, etc. It is definitely a good guide for initial planning. Once you are in production, you need to monitor based on what’s happening in production. Performance experts like Mark Achtemichuk have explained it well here. I recommend you read it first.

Done?

Great! Let’s dive deeper. I will take one component, Storage, as it’s the easiest to understand.

  • VM Disk Latency for Tier 1: 5 ms
  • VM Disk Latency for Tier 2: 15 ms
  • VM Disk Latency for Tier 3: 25 ms
  • All values measured as 5-minute average.
  • The SLA is breached when the value exceeds the threshold in any 5-minute period, 365 days a year.

When a VM owner complains that her VM is slow because of storage, and that VM resides on Tier 2 storage, both of you can see the VM disk latency. If it’s below 15 ms, it’s not your fault. Perhaps her application needs faster storage, and she can pay more and upgrade to Tier 1. If it is higher than 15 ms, you as the IaaS provider do not even have to wait until she complains 🙂 Better still, do something before she notices.
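To make the 5-minute rule concrete, here is a minimal Python sketch of the check described above. The tier thresholds are the example values from the list; the sample latencies, the function names, and the one-minute sampling interval are assumptions made up for illustration, not something vRealize Operations exposes in this form.

```python
from statistics import mean

# Example thresholds from this post, in milliseconds, keyed by storage tier.
SLA_MS = {1: 5.0, 2: 15.0, 3: 25.0}


def five_minute_averages(samples_ms, samples_per_window):
    """Average consecutive samples into windows; an incomplete tail is dropped."""
    full = len(samples_ms) - len(samples_ms) % samples_per_window
    return [
        mean(samples_ms[i:i + samples_per_window])
        for i in range(0, full, samples_per_window)
    ]


def breached_windows(window_averages_ms, tier):
    """Return the indexes of windows whose average exceeds the tier's SLA."""
    limit = SLA_MS[tier]
    return [i for i, value in enumerate(window_averages_ms) if value > limit]


# Hypothetical one-minute disk latency samples for a Tier 2 VM (10 minutes).
one_minute_ms = [4, 4, 40, 4, 4, 20, 22, 18, 21, 19]

windows = five_minute_averages(one_minute_ms, samples_per_window=5)
print(windows)
print(breached_windows(windows, tier=2))
```

In this made-up data, the first window contains a one-minute spike of 40 ms, yet its 5-minute average stays under the Tier 2 limit, so only the second window counts as a breach.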

What number should you set?

  • If you have no data, the above is a good starting point.
  • If you have vR Ops running, you can set a number based on your actual data.
    • There is no point in setting something much higher or lower than what you actually have.
    • Use the super metric preview as shown below.
    • I plotted 1 month of data (that’s 8650 data points). That’s more than enough.
    • Take the maximum. That’s your baseline.
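If you export the data instead of eyeballing the preview, the baseline step is a one-liner. The sketch below assumes you already have the 5-minute averages as a plain list of numbers; both function names are hypothetical, and the export itself is not shown.

```python
def expected_points(days, window_minutes=5):
    """Number of data points in the period at one point per window."""
    return days * 24 * 60 // window_minutes


def baseline_ms(five_minute_latency_ms):
    """Baseline = the worst 5-minute average seen over the observation period."""
    if not five_minute_latency_ms:
        raise ValueError("need at least one data point")
    return max(five_minute_latency_ms)


# Hypothetical month of data, shortened to a handful of points here.
history_ms = [3.1, 2.8, 4.4, 9.7, 3.0, 5.2]

print(expected_points(30))
print(baseline_ms(history_ms))
```

A 30-day month at 5-minute granularity gives 8,640 points, which is in line with the figure quoted above.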

Alerts

Now… you have thousands of VMs under your management. I guess what you want is to be alerted if any of them breaches the SLA you promised.

Yes, vRealize Operations can alert you, so you can proactively do something before VM owner complains. Brandon Gordon, Integration Architect at VMware, showed me how we can achieve the above in vRealize Operations.

See the screenshot below, courtesy of Brandon.

alert 4

To get such alerts for each VM, you need to create and define the alerts. Brandon has defined them for CPU, RAM and Disk. He has also defined them for each tier.

alert 2

Hope you find this useful, just like many of my customers have. If not, drop me a note.

VMware vCenter Server 6.0 Update 1b

The VMware vSphere team released Update 1b a few days ago. The build number is 3343019.

As usual, it’s wise to review the Release Notes before making changes to a live environment. You need to review both the vCenter Server release notes and the ESXi release notes.

If you are already on Update 1, and you are using the vCenter appliance, the update to 1b is pretty straightforward. If you are not yet on Update 1, then there are more steps required. Others have documented the steps well, and some good examples are here and here.

The steps are identical for vCenter appliance and the PSC appliance.

Let’s start with the PSC. I don’t think the order matters since this is just a minor update. However, the manual says that “Before you update a vCenter Server with an external PSC, you must apply the patches to the PSC and its replicating partners, if any in the vCenter SSO domain.”

The VMware ASEAN Lab has 2 external PSC VMs in a single domain. Since this is the setup, let’s start with the first PSC, before we do its replicating partner.

You can do the update via CLI or UI. I’ll do the UI one here to give you the screenshots. Log in as root (not administrator@vsphere.local) at the address https://Your-PSC-address:5480/

PSC 01

I had configured my vCenter to automatically check for updates when I updated it to Update 1. So it’s a pleasant experience to see that it has detected the update. Notice the build number matches. There is a KB article linked to it, which gives a bit more info, such as the size of the update (1.5 GB). At 1.5 GB, this will take a while to complete.

To update, simply click on the Install Updates link and follow the wizard.

PSC 02

Examples of third-party products are JRE, tcServer, and SLES OS components. Proceed to update, and you will see the familiar progress below. Click on Show Details to see the actual commands executed. The latest status is shown at the top, so if you want to read from the beginning, scroll down. The Stage Packages step took 5 minutes in my case for a PSC and 28 minutes for vCenter. It is safe to click the browser refresh button.

PSC 03

The longest step is the Pre install scripts. In my case, it had been running for more than 10 minutes when I took the screenshot below.

PSC 04

I had to go and pick up my wife at the airport, so I left the upgrade running. When I got back, it was already done. This is what it should look like. Notice the build number and release date match the release notes.

PSC 05

And that’s it! You then repeat for the other replicating PSC in the domain. Once done, you do the same steps for the vCenter.

You might be curious how the update impacts the load on the PSC VM. This VM has 2 vCPUs. As you can see below, the spike is minimal. The CPU Run counter hit 10348 ms in a 20-second sample, which is around 25% of the maximum for a 2 vCPU VM (40000 ms).

PSC 30
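The 25% figure above can be reproduced with a few lines of Python. This sketch assumes the CPU Run counter is reported in milliseconds per 20-second sample, as in the vCenter real-time chart; the function name is hypothetical.

```python
def cpu_run_percent(run_ms, vcpus, interval_s=20):
    """Convert a CPU Run summation (ms) into a percentage of the VM's capacity.

    Each vCPU can run for at most interval_s * 1000 ms within one sample,
    so a 2 vCPU VM tops out at 40000 ms per 20-second sample.
    """
    capacity_ms = vcpus * interval_s * 1000
    return 100.0 * run_ms / capacity_ms


# Values from the PSC update: 10348 ms on a 2 vCPU VM.
print(round(cpu_run_percent(10348, vcpus=2), 2))
```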

Let’s look at Storage. The spike at 5:20 pm is the time I did the update. It’s below 300 IOPS for each of read and write.

PSC 31

If you want to configure the auto update, simply click on the Settings button. It checks for updates weekly, which is reasonable in most cases.

PSC 26

BTW, the PSC also has the https://Your-PSC-address/psc address, while the vCenter only has the https://Your-PSC-address:5480/ address. The /psc page requires the administrator@vsphere.local account, not root. When you log in, you get the screen below.

PSC 1

From here, there is a link to the https://Your-PSC-address:5480/ address.

PSC 2

Happy updating! FYI: the 2 vCenters and 2 PSCs were updated successfully. I didn’t hit any errors.