Monthly Archives: March 2017

Architecting VMware vSphere for Operations

As an Architect, you take into account many requirements when designing your VMware vSphere environment. As an Operations person, I shall not question your Architecture. I’m sure it is fast, highly available, right on budget, etc. My role is to help you prove to the CIO that what you architected actually lives up to expectations. “Plan meets Actual” is what an Architect wants, because that means your architecture delivers on its promise.

The plan of the Architecture exists in diagrams and documents. It's static.
The reality of the Architecture exists in the datacenter. It's live.

When proving, here are the questions we should be able to answer:

  • Availability: Does the IaaS deliver the promised Availability SLA? If not, what was the actual number, when was the SLA breached, for how long, and which VMs were affected? For each VM, when exactly did the outage start and end?
  • Performance: Does the IaaS serve all its VMs well? An IaaS platform provides CPU, RAM, Disk and Network as services. It has to deliver these 4 resources whenever each VM asks for them, 24 x 7, 365 days a year. If not, which VMs were affected, by what, and when?
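The Availability question can be sketched in a few lines of Python. This is only an illustration: the outage records, the VM names and the 99.9% SLA are made-up assumptions, and the sketch assumes the outages of a VM do not overlap. It is not how vR Ops computes availability.

```python
from datetime import datetime

# Hypothetical outage records: (vm_name, outage_start, outage_end).
OUTAGES = [
    ("app-web-01", datetime(2017, 3, 4, 2, 0), datetime(2017, 3, 4, 2, 45)),
    ("app-db-01", datetime(2017, 3, 10, 13, 0), datetime(2017, 3, 10, 13, 10)),
    ("app-web-01", datetime(2017, 3, 20, 23, 30), datetime(2017, 3, 21, 0, 30)),
]

def availability_report(outages, period_start, period_end, sla_percent):
    """Per-VM availability for the period. VMs with no outage are not listed."""
    period_seconds = (period_end - period_start).total_seconds()
    windows = {}
    for vm, start, end in outages:
        # Clip each outage to the reporting period.
        start, end = max(start, period_start), min(end, period_end)
        if end > start:
            windows.setdefault(vm, []).append((start, end))
    report = {}
    for vm, vm_windows in windows.items():
        down = sum((e - s).total_seconds() for s, e in vm_windows)
        percent = 100.0 * (1 - down / period_seconds)
        report[vm] = {
            "availability": percent,
            "breached": percent < sla_percent,  # did this VM miss the SLA?
            "outages": vm_windows,              # when each outage started and ended
        }
    return report

report = availability_report(
    OUTAGES, datetime(2017, 3, 1), datetime(2017, 4, 1), sla_percent=99.9
)
for vm, row in sorted(report.items()):
    print(f"{vm}: {row['availability']:.3f}% breached={row['breached']}")
```

The same report can be grouped by month to show the trend over time.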

Those are simple questions, but they are very difficult to answer. Let's say you have 10,000 VMs. How do you answer that? How do you provide the answer over time (e.g. monthly), proving you handle the peaks well?

To complicate matters, you need to be able to answer per Business Unit. Business Unit A will not care about other business units. Since a Business Unit has >1 apps, you also need to answer per application. An application owner only cares about her applications.

There are a few things you need to do, so you are in a position to prove it.

Step 1: Reflect the business in vSphere

Does your vCenter show all the business units? Can you show how the business is mapped into your vSphere environment? You design vSphere so the Business can run on it, so where is the Business? A company is made up of business units, which may have multiple departments. The structure below is a typical example.

An application typically has multiple tiers. Does your vSphere understand that?

Map the above as folders in your vCenter.
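As a rough sketch of that mapping, the Python below turns a business hierarchy into folder paths. The inventory, the field names and the business unit, department, application and tier levels are illustrative assumptions; use whatever levels your company actually has.

```python
from collections import defaultdict

# Hypothetical inventory: where each VM sits in the business.
VMS = [
    {"name": "Payroll Web 01", "bu": "Finance", "dept": "HR Systems",
     "app": "Payroll", "tier": "Web"},
    {"name": "Payroll DB 01", "bu": "Finance", "dept": "HR Systems",
     "app": "Payroll", "tier": "Database"},
    {"name": "Trading App 01", "bu": "Markets", "dept": "Equities",
     "app": "Trading", "tier": "App"},
]

def folder_path(vm):
    """Folder path that mirrors business unit / department / application / tier."""
    return "/".join([vm["bu"], vm["dept"], vm["app"], vm["tier"]])

def build_folders(vms):
    """Group VM names by the folder they belong in."""
    folders = defaultdict(list)
    for vm in vms:
        folders[folder_path(vm)].append(vm["name"])
    return dict(folders)

folders = build_folders(VMS)
for path in sorted(folders):
    print(path, "->", folders[path])
```

With the business reflected this way, answering per Business Unit or per application becomes a matter of picking the right folder.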

I see many naming conventions that are not operations-friendly. It’s impossible to guess what an object is. The names are very similar and hard to read, hence it’s easy for operators to make mistakes! The naming convention typically originates from the mainframe or MS-DOS era, where you could not use spaces and had a limited number of characters. An example is SG1-D01-INS-0001W-PRD. Can you guess what on earth that is? You’re right, you can’t. Imagine there are 1000s of them like that, and you have new operators joining the help desk team.

Folder, Tags and Annotation

Have you seen a vSphere environment where there are tags and annotations everywhere?

It’s rare to meet customers with a 100% well-thought-out and documented approach to the 3 features above. They may have general guidelines, but not explicit Do’s and Don’ts. As a result, these 3 features are used wrongly.

Use Tags when the values are discrete, ideally Yes/No. I’d use tags to mark the following:

  • VMs with RDM.
  • VMs with MSCS or Linux Clustering.

Do not use Tags when a value is unlikely to be shared by many objects. Use annotations for this. Examples are VM Owner Name, Email Address, Mobile Number. In an environment where there are >10K VMs, there can be 1000 VM Owners.

Do not use Tags to tag Service Tier. For Infra objects such as Cluster and Datastore Cluster, that should be clearly reflected in the name itself. I’d prefix all Tier 1 clusters with Tier 1, so the chance of deploying into the wrong tier is minimized.
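The guidance above can be written down as a simple rule of thumb. The function below is a hypothetical helper, and the cut-off of 5 distinct values is my own assumption for illustration, not a vSphere limit.

```python
def suggest_feature(attribute, values, max_tag_values=5):
    """Suggest where an attribute should live: name prefix, tag, or annotation.

    max_tag_values is an assumed cut-off for "discrete" values.
    """
    if attribute == "Service Tier":
        # Service tier belongs in the object name itself.
        return "name prefix"
    distinct = {v for v in values if v is not None}
    # Few shared values (ideally Yes/No) suit a tag; many unique ones do not.
    return "tag" if len(distinct) <= max_tag_values else "annotation"

print(suggest_feature("Has RDM", ["Yes", "No", "No", "Yes"]))
print(suggest_feature("VM Owner Name", [f"Owner {i}" for i in range(1000)]))
print(suggest_feature("Service Tier", ["Tier 1", "Tier 2"]))
```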

Step 2: Design Service Tiers into vSphere

Does your vSphere understand that there are different classes of service? Are Tier 1 clusters and datastores clearly labelled?

You should avoid mixing multiple classes of service in a single cluster or datastore. While it is technically possible to segregate them, it’s operationally challenging. Resource Pool shares are set per pool and divided among the VMs inside it, so the design expects sibling pools to hold a similar number of VMs.
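A quick bit of arithmetic shows why the VM count matters. The share values and VM counts below are made-up numbers, and the sketch simplifies by assuming every VM in a pool has equal shares and all VMs are contending at once.

```python
def shares_per_vm(pool_shares, vm_count):
    """Shares each VM effectively gets when a pool's shares are split evenly."""
    return pool_shares / vm_count

# Hypothetical sibling pools: (shares assigned to the pool, VMs in the pool).
pools = {
    "Tier 1": (8000, 40),
    "Tier 3": (2000, 5),
}

per_vm = {name: shares_per_vm(s, n) for name, (s, n) in pools.items()}
for name, value in per_vm.items():
    print(f"{name}: {value:.0f} shares per VM")
```

In this example the Tier 1 pool has 4x the shares, yet each Tier 1 VM ends up with half the shares of a Tier 3 VM, because the pool is far more crowded. You would have to keep adjusting the pool shares every time the VM counts change.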

For each tier, you need to have both an Availability SLA and a Performance SLA. For Performance SLA, review this doc.

Step 3: Define and Map Tiers in vR Ops

Now that you’ve designed service tiers into your vSphere architecture, it is time to show them. You cannot show them in vSphere, as vCenter does not understand Performance SLA and Availability SLA. You can use vR Ops to do this. Follow this step.

Step 4: Map Applications in vR Ops

Use custom groups to create applications. If you have a proper naming convention, it should not be difficult to select the members of an application. All you need is a query that says the name contains XYZ. There should be no need for regular expressions.
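To illustrate why a good naming convention makes this easy, here is a small Python sketch of a "name contains" rule. The VM names are hypothetical, and the match here is case sensitive by assumption; it only mimics the idea of the membership criteria, it is not the vR Ops engine.

```python
def members_of(keyword, vm_names):
    """Select VMs whose name contains the application keyword (plain substring)."""
    return [name for name in vm_names if keyword in name]

# Hypothetical VM names that follow a readable convention.
VM_NAMES = [
    "Payroll Web 01",
    "Payroll DB 01",
    "Trading App 01",
    "Trading DB 01",
]

payroll = members_of("Payroll", VM_NAMES)
print(payroll)
```

A plain substring is enough because the application name is spelled out in every VM name.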

Once apps are mapped, you can do something like this.

Step 5: Consider Debug-ability

Things go wrong, especially in production. Your architecture should lend itself to troubleshooting.

A major area is ensuring the counters are reliable, else it’s hard to troubleshoot performance. The CPU Contention counter, which is the main counter for the IaaS Performance SLA, is greatly affected by Power Management. Ensure your ESXi power management follows this guide by Sunny Dua.

Once you have that in place, you will be able to prove that your Architecture lives up to its expectation. Use the dashboards from Operationalize Your World to show that proof!

Operationalize Your World: Import steps for older versions

Glad to see that vRealize Operations 6.5 sports the ability to import & export Groups. This makes it much easier for you to replicate the Operationalize Your World dashboards into your environment. The entire chunk of group creation is no longer required!

Starting from 6.5, you can also import & export multiple groups. No need to create them one by one. The import dialog box is similar to other import dialog boxes.

The group description is actually pretty simple. Let’s use an example to show it. This is what you see on the UI. It’s a group called Idle VMs that grabs all VMs whose CPU Idle Time is greater than 99%.

Here is the JSON file. I’ve highlighted some parts of it, so you can see how it maps to the UI.

I’ve been using the 6.5 release and have generally found it stable. There are also a number of bug fixes in this release.

As not all customers are on vRealize Operations 6.5 yet, I’m keeping the instructions to create the groups in this article. Use them as the guide if you are running 6.3 or 6.4. If you are on 6.5 or higher, refer to the simplified steps here.

Video Instruction

I have also simplified the steps by manually creating a dummy policy to help you do a bulk import. As a result, the video guide is no longer required. I’m keeping the videos here just in case you need them.

Group Creation

Create all of the following groups. Note that the names are case sensitive! If you do not use these names exactly, your dashboards will show an hourglass icon.

  • Under the group type Class of Service
    • Tier 1 (Gold)
    • Tier 2 (Silver)
    • Tier 3 (Bronze)
  • Under the group type Function
    • Datastores (Shared)
    • Datastores (Local)
  • Under the group type VM Types
    • Idle VMs
    • Large VMs (CPU)
    • Large VMs (RAM) -> do note this requires vSphere 6 U1
    • Powered On VMs
    • Powered Off VMs
    • VM with no VMware Tools
    • VM with VMware Tools installed
  • Under the group type Tenants
    • Tenant ABC, etc.
    • [e1: This part is optional. Create 1 group per tenant. See this for details]

Once created, do not rename the object.
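Since the names must match exactly, a quick check like the one below can catch typos before you import the dashboards. It is only a sketch: how you obtain the list of existing groups is up to you, and the `existing` dictionary here is hand-typed example data. The optional Tenants groups are left out.

```python
# Group names required by the dashboards, keyed by group type.
REQUIRED_GROUPS = {
    "Class of Service": ["Tier 1 (Gold)", "Tier 2 (Silver)", "Tier 3 (Bronze)"],
    "Function": ["Datastores (Shared)", "Datastores (Local)"],
    "VM Types": [
        "Idle VMs",
        "Large VMs (CPU)",
        "Large VMs (RAM)",
        "Powered On VMs",
        "Powered Off VMs",
        "VM with no VMware Tools",
        "VM with VMware Tools installed",
    ],
}

def missing_groups(existing):
    """Return (group_type, name) pairs that are absent. Comparison is exact
    and case sensitive, so 'Idle VMS' does not satisfy 'Idle VMs'."""
    return [
        (group_type, name)
        for group_type, names in REQUIRED_GROUPS.items()
        for name in names
        if name not in existing.get(group_type, set())
    ]

# Hypothetical example: one group was typed with the wrong case.
existing = {t: set(names) for t, names in REQUIRED_GROUPS.items()}
existing["VM Types"].discard("Idle VMs")
existing["VM Types"].add("Idle VMS")

print(missing_groups(existing))
```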

For each tier, ensure you select the right Clusters, VMs and Datastores that you planned earlier. Do not do impromptu planning. That’s an oxymoron 🙂

You need to select these 3 objects. If you do not select the object, you cannot apply the Performance SLA.

[Image: Group - Service Tier]

For datastores, exclude local datastores unless they are part of your official Service Tier. vR Ops 6.3 has a property for that. Pretty cool!

[Image: local]

In any group, always do a preview before you save. Take note of the total number of members.

[Image: group - always do preview]

For the Idle VMs group, define what suits your operations. I use >99% CPU Idle Time. This is based on a 30-day period (the default setting in the policy). 30 days is 720 hours, so a VM in this group has been busy for a maximum of 7.2 hours in the last 30 days.

[Image: idle-vms]
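The arithmetic behind that idle threshold is simple enough to check in Python. The function name is my own; the 30-day window and the 99% figure come from the text above.

```python
def max_busy_hours(idle_threshold_percent, window_days=30):
    """Most hours a VM can be non-idle and still exceed the idle threshold."""
    window_hours = window_days * 24
    return window_hours * (100 - idle_threshold_percent) / 100

busy_99 = max_busy_hours(99)
print(f"{busy_99:.1f} hours")
```

If you pick a different threshold, rerun the function to see how many busy hours you are tolerating.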

The default setting in the policy is 90%, which differs from what I use in the group. Does it mean you do not have to worry about the Policy Settings for Idle?

You are right! Don’t worry about the policy. I do not use the Is Idle metric. I use the CPU Idle Time metric. There is no need to modify the default settings as you’re not using them.

For the Large VMs groups, you can change the definition to suit your needs. I’d recommend changing the threshold below from 4 vCPU to at least 7. If you have a lot of 8 vCPU VMs, then change it to 8 so they are not included. Focus on the big ones.

[Image: Large VM - CPU]

Make sure you only choose powered on VMs too, else powered off VMs get added! See below for how to add this condition. You can use this metric, or use the Summary | Running metric. For RAM, I use an in-guest metric as it’s more accurate. Just because a VM is powered on does not mean the OS is running.

[Image: large-vm-ram]
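The two conditions for the Large VMs (CPU) group can be sketched as a filter. The VM records and field names are hypothetical, and I am assuming the rule is "greater than" the threshold, which matches the advice to use 8 when you want to leave out 8 vCPU VMs.

```python
def large_cpu_vms(vms, vcpu_threshold=7):
    """Names of powered on VMs with more vCPUs than the threshold."""
    return [
        vm["name"]
        for vm in vms
        if vm["powered_on"] and vm["vcpus"] > vcpu_threshold
    ]

# Hypothetical inventory.
VMS = [
    {"name": "Small 01", "vcpus": 4, "powered_on": True},
    {"name": "Medium 01", "vcpus": 8, "powered_on": True},
    {"name": "Big 01", "vcpus": 16, "powered_on": True},
    {"name": "Big 02 (off)", "vcpus": 16, "powered_on": False},
]

print(large_cpu_vms(VMS, vcpu_threshold=7))
print(large_cpu_vms(VMS, vcpu_threshold=8))
```

Without the powered on condition, "Big 02 (off)" would land in the group even though it consumes nothing.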

For the Powered Off VMs group, I define them as VMs that were off for >50% of the past 30 days and are powered off at the moment. This is conservative, as the VM needs to be powered off for a total of more than 15 days in the past 30 days.

[Image: Powered Off VM]
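Here is the Powered Off VMs definition as a small Python predicate. The function and parameter names are illustrative assumptions; the 50% and 30-day values are the ones described above.

```python
def is_powered_off_vm(hours_off, currently_off, window_days=30, min_off_fraction=0.5):
    """True if the VM is off now AND was off for more than the given
    fraction of the window (more than 360 hours for 50% of 30 days)."""
    threshold_hours = window_days * 24 * min_off_fraction
    return currently_off and hours_off > threshold_hours

print(is_powered_off_vm(hours_off=400, currently_off=True))   # off long enough, off now
print(is_powered_off_vm(hours_off=400, currently_off=False))  # back on, so excluded
print(is_powered_off_vm(hours_off=300, currently_off=True))   # off now, but not long enough
```

Requiring both conditions keeps out VMs that were merely rebooted or shut down for a short maintenance window.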

For the VM with VMware Tools installed group and the VM with no VMware Tools group, use the property shown below. I cloned the group and simply changed the condition from is to is not.

[Image: group - VMware Tools]