An approach to defining SLA

How do you measure your SLA if you have different classes of service? For example, you offer higher availability for Gold and lower availability for Silver. This is common and expected.

Say you offer 99.99% for Gold and 99.9% for Silver. Both are measured against the same benchmark: the ideal of perfect availability (No Downtime).
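To make the two numbers concrete, here is a minimal sketch that turns an availability SLA into a downtime budget. The function name and the 30-day measurement period are my assumptions for illustration, not part of any product. Both tiers are measured against the same 100% benchmark; only the allowed shortfall differs.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Downtime budget, in minutes, for an availability SLA over `days` days.

    The 30-day default is an assumption; use your own reporting period.
    """
    total_minutes = days * 24 * 60
    # The budget is simply the gap between the SLA and perfect availability.
    return total_minutes * (100.0 - sla_percent) / 100.0


for tier, sla in (("Gold", 99.99), ("Silver", 99.9)):
    print(f"{tier}: {allowed_downtime_minutes(sla):.2f} minutes per 30 days")
```

Under these assumptions, Gold gets roughly 4.3 minutes of downtime per 30 days and Silver roughly 43 minutes.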

Just because something is up does not mean it’s fast. A VM can be up, but if it’s slow enough, it is as good as dead. So you need another kind of SLA to complement the Availability SLA. You need a Performance SLA.

Another reason you need a second SLA is that availability is a given. It does not matter what the number is. If something is down, you had better hurry to bring it back up!

The Performance SLA needs to follow an approach consistent with the other SLAs. The higher the class of service, the higher the SLA. The tiers can’t share the same number, or else it’s confusing.

So it will look something like this:

Gold: Performance SLA is 99.9%
Silver: Performance SLA is 99%

In other words, a VM in the Silver environment should expect to get what it demands less often than a VM in Gold. If the VM owner wants better or more consistent performance, they can simply pay more and upgrade to the Gold cluster.
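Here is a rough sketch of how such a Performance SLA could be computed: count how often a VM stays at or below one shared threshold, then compare that percentage against the tier's target (99.9% for Gold, 99% for Silver). The threshold value, the sample data, and the function names are all hypothetical, chosen only to illustrate the idea.

```python
# SLA targets from the text, in percent of samples that must meet the threshold.
SLA_TARGETS = {"Gold": 99.9, "Silver": 99.0}

# Hypothetical single contention threshold (percent), shared by every tier.
THRESHOLD = 1.0


def performance_sla(samples, threshold):
    """Percentage of samples that are at or below the threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    met = sum(1 for value in samples if value <= threshold)
    return 100.0 * met / len(samples)


def meets_sla(tier, samples, threshold):
    """True if the achieved percentage reaches the tier's target."""
    return performance_sla(samples, threshold) >= SLA_TARGETS[tier]


# Hypothetical data: 1000 samples, 5 of which breach the threshold.
samples = [0.2] * 995 + [3.0] * 5

print(performance_sla(samples, THRESHOLD))      # 99.5
print(meets_sla("Gold", samples, THRESHOLD))    # False
print(meets_sla("Silver", samples, THRESHOLD))  # True
```

The same measurement passes in Silver and fails in Gold. The yardstick is identical; only the tolerance for missing it changes.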

This approach is easier than setting a different threshold for each tier. Take, for example:

Gold: VM Memory Contention: 0.5%
Silver: VM Memory Contention: 1.0%

You notice the problem already?

That’s right! It’s hard to explain why 0.5% and 1.0%, and not some other numbers. It’s also hard to explain the gap between them.

There is a second problem. If you set different standards, it is possible for Silver to look better than Gold, simply because it is held to a lower standard!

It’s much easier to set one high standard (similar to the No Downtime situation) and just measure the failure to meet it. You expect Silver to fail more often.
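The second problem is easy to reproduce with made-up numbers. In the sketch below, the contention samples are hypothetical, and the Silver VM is clearly worse off than the Gold VM. Measured with the per-tier thresholds above (0.5% for Gold, 1.0% for Silver), Silver still reports the better score. Measured against one shared threshold (I assume 0.5% here purely for illustration), the ranking comes out the way you would expect.

```python
def compliance(samples, threshold):
    """Percentage of samples that are at or below the threshold."""
    met = sum(1 for value in samples if value <= threshold)
    return 100.0 * met / len(samples)


# Hypothetical memory contention samples, in percent.
gold_vm = [0.2] * 98 + [0.8] * 2     # contended rarely
silver_vm = [0.2] * 90 + [0.9] * 10  # contended more often, and more heavily

# Per-tier thresholds: each tier is judged by its own standard.
gold_per_tier = compliance(gold_vm, 0.5)      # 98.0
silver_per_tier = compliance(silver_vm, 1.0)  # 100.0, Silver "beats" Gold

# Single threshold: both tiers are judged by the same standard.
gold_single = compliance(gold_vm, 0.5)      # 98.0
silver_single = compliance(silver_vm, 0.5)  # 90.0, Silver fails more often

print(gold_per_tier, silver_per_tier)
print(gold_single, silver_single)
```

With a single threshold, the report reflects what the VMs actually experienced, and the difference between tiers shows up only in how often each one misses the mark.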

Operationally, a single threshold is easier to set up. There is no need to play with vRealize Operations policies. You can also have mixed classes of VM in the same cluster, as the SLA threshold is the same.

I hope this explains why Operationalize Your World applies a single threshold.

BTW, I encourage you not to modify the threshold. It’s more important to establish the baseline and watch its relative movement over time. The reason is that infrastructure performance does not correlate perfectly with the business. It’s more important to know your performance pattern than to aim for a perfect number. Don’t be obsessed with the number.
