Architecting VMware vSphere for Operations

As an Architect, you take into account many requirements when designing your VMware vSphere environment. As an Operations person, I shall not question your Architecture. I’m sure it is fast, highly available, right on budget, etc. My role is to help you prove to the CIO that what you architected actually lives up to its expectations. “Plan meets Actual” is what an Architect wants, because that means your architecture delivers on its promise.

The plan of the Architecture exists in diagrams and documents. It's static.
The reality of the Architecture exists in the datacenter. It's live.

When proving, here are the questions we should be able to answer:

  • Availability: Does the IaaS deliver the promised Availability SLA? If not, what was the actual number, when was it breached, for how long, and which VMs were affected? For each VM, when exactly did the outage start and end?
  • Performance: Does the IaaS serve all its VMs well? An IaaS platform provides CPU, RAM, Disk and Network as services. It has to deliver these 4 resources when asked by each VM, 24 x 7, 365 days a year. If not, which VMs were affected, by what, and when?

Those are simple questions, but they are very difficult to answer. Let’s say you have 10,000 VMs. How do you answer that? How do you provide the answer over time (e.g. monthly), proving you handle the peaks well?
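To make the Availability question concrete, here is a minimal sketch of how a per-VM availability number could be derived from outage records and compared against a target. The VM names, the outage data, the 30-day period and the 99.9% target are all made up for illustration. In practice the outage records would come from your monitoring tool.

```python
from datetime import datetime

def availability_pct(outages, period_start, period_end):
    """Return uptime as a percentage of the reporting period.

    outages: list of (start, end) datetime pairs for one VM.
    Outages are clipped to the period. This sketch assumes they do not overlap.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        start = max(start, period_start)
        end = min(end, period_end)
        if end > start:
            down_seconds += (end - start).total_seconds()
    return 100.0 * (period_seconds - down_seconds) / period_seconds

# Hypothetical reporting period of 30 days and hypothetical outage records.
period_start = datetime(2017, 1, 1)
period_end = datetime(2017, 1, 31)
outages_by_vm = {
    "vm-a": [(datetime(2017, 1, 10, 2, 0), datetime(2017, 1, 10, 5, 0))],  # 3 hours down
    "vm-b": [],
}

SLA_TARGET = 99.9  # assumed target, in percent

report = {vm: availability_pct(o, period_start, period_end)
          for vm, o in outages_by_vm.items()}
breached = [vm for vm, pct in report.items() if pct < SLA_TARGET]
```

The same per-VM numbers can then be rolled up per application or per Business Unit, which is why the grouping steps that follow matter.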

To complicate matters, you need to be able to answer per Business Unit. Business Unit A will not care about other business units. Since a Business Unit has more than one app, you also need to answer per application. An application owner only cares about her applications.

There are a few things you need to do, so that you are in a position to prove it.

Step 1: Reflect the business in vSphere

Does your vCenter show all the business units? Can you show how the business is mapped into your vSphere environment? You design vSphere so Business can run on it, so where is the Business? A company is made up of business units, each of which may have multiple departments. The structure below is a typical example.

An application typically has multiple tiers. Does your vSphere understand that?

Map the above as folders in your vCenter.

Have a naming convention that can easily separate one Application from another. Avoid a complicated and cryptic naming convention, where it’s easy to mistake one VM for another. Make each name different enough.

Step 2: Design Service Tiers into vSphere

Does your vSphere understand that there are different classes of service? Are Tier 1 clusters and datastores clearly labelled?

You should avoid mixing multiple classes of service in a single cluster or datastore. While it is technically possible to segregate them, it’s operationally challenging. With Resource Pools, for example, a pool’s shares are divided among the VMs inside it, so sibling pools only behave as intended when each pool holds the same number of VMs.
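A small worked example shows why the VM count per pool matters. This is a simplified sketch: the share values and VM counts are illustrative, and it assumes all VMs in a pool are the same size and all are contending for the resource at the same time.

```python
# Illustrative pool settings, not taken from any real environment.
pools = {
    "Tier 1": {"shares": 8000, "vms": 40},
    "Tier 3": {"shares": 2000, "vms": 5},
}

def shares_per_vm(pool):
    # Under contention, the pool's shares are split among its VMs
    # (assuming identical VMs with identical share settings).
    return pool["shares"] / pool["vms"]

per_vm = {name: shares_per_vm(pool) for name, pool in pools.items()}
# Tier 1 has 4x the shares of Tier 3, but 8x the VMs,
# so each Tier 1 VM ends up with half the entitlement of a Tier 3 VM.
```

Under these assumptions the "higher" tier gives each VM less priority than the "lower" tier, which is the opposite of what the design intended.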

For each tier, you need to have both an Availability SLA and a Performance SLA. For Performance SLA, review this doc.

Step 3: Define and Map Tiers in vR Ops

Now that you’ve designed service tiers into your vSphere architecture, it is time to show it. You cannot show it in vSphere, as vCenter does not understand Performance SLA and Availability SLA. You can use vR Ops to do this. Follow this step.

Step 4: Map Applications in vR Ops

Use custom groups to create applications. If you have a proper naming convention, it should not be difficult to select the members of each application. All you need is a query that says the name contains XYZ. There should not be a need for regular expressions.
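The "name contains XYZ" idea can be sketched outside the product in a few lines. This is not how vR Ops evaluates group membership internally. The VM names and the token are hypothetical, and ignoring case is a choice made for this sketch.

```python
def members_by_name(vm_names, token):
    """Select VMs whose name contains the token (case-insensitive in this sketch)."""
    token = token.lower()
    return [name for name in vm_names if token in name.lower()]

# Hypothetical names following a <business unit>-<application>-<tier>-<number> convention.
vm_names = ["hr-payroll-web-01", "hr-payroll-db-01", "fin-ledger-app-01"]

payroll = members_by_name(vm_names, "payroll")
```

With names like these, a plain substring match is enough to pick out one application, which is why a clean naming convention removes the need for regular expressions.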

Once apps are mapped, you can do something like this.

Once you have that in place, you will be able to prove that your Architecture lives up to its expectation. Use the dashboards from Operationalize Your World to show that proof!

Operationalize Your World: Import steps for older versions

Glad to see that vRealize Operations 6.5 sports the ability to import & export Groups. This makes it much easier for you to replicate the Operationalize Your World dashboards into your environment. The entire chunk of group creation is no longer required!

You can also import & export multiple groups. No need to do it one by one. The import dialog box is similar to other import dialog boxes.

The group description is actually pretty simple. Let’s use an example to show it. This is what you see on the UI. It’s a group called Idle VMs that grabs all VMs whose CPU Idle Time is greater than 99%.

Here is the JSON file. I’ve highlighted some of them, so you can see how it’s mapped.


I’ve been using the 6.5 release and have generally found it stable. It also includes a number of bug fixes.

As not all customers are on vRealize Operations 6.5 yet, I’m keeping the instructions to create the groups here. Use them as the guide if you are running 6.3 or 6.4. If you are on 6.5, refer to the simplified step here.

Video Instruction

I have also simplified the steps by manually creating a dummy policy to help you do bulk import. As a result, the video guide is no longer required. I’m keeping the videos here just in case you need them.

Group Creation

Create the following groups. All of them. Note that the names are case sensitive! If you do not use these names exactly, your dashboards will show an hourglass icon.

  • Under the group type Class of Service
    • Tier 1 (Gold)
    • Tier 2 (Silver)
    • Tier 3 (Bronze)
  • Under the group type Function
    • Datastores (Shared)
    • Datastores (Local)
  • Under the group type VM Types
    • Idle VMs
    • Large VMs (CPU)
    • Large VMs (RAM) [note: this requires vSphere 6U1]
    • Powered On VMs
    • Powered Off VMs
    • VM with no VMware Tools
    • VM with VMware Tools installed
  • Under the group type Tenants
    • Tenant ABC, etc.
    • [e1: this part is optional. Create 1 group per tenant. See this for details]

Once created, do not rename the groups.

For each tier, ensure you select the right Clusters, VMs and Datastores that you planned earlier. Do not do impromptu planning. That’s an oxymoron 🙂

You need to select all 3 object types. If you do not select an object, you cannot apply the Performance SLA to it.

Group - Service Tier

For datastores, exclude local datastores unless they are part of your official Service Tier. vR Ops 6.3 has a property for that. Pretty cool!

local

In any group, always do a preview before you save. Take note of the total number of members.

group - always do preview

For the Idle VMs group, define what suits your operations. I use >99% CPU Idle Time. This is based on a 30-day period (the default setting in the policy), so it translates into a maximum of 7.2 hours of non-idle time in the last 30 days.
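The 7.2-hour figure is simple arithmetic: 1% of a 30-day window. Here is a short sketch of it, in case you want to try a different threshold or window. The function name is just for illustration.

```python
def max_non_idle_hours(window_days, idle_pct):
    """Hours a VM may be non-idle while still exceeding idle_pct over the window."""
    window_hours = window_days * 24
    return window_hours * (100 - idle_pct) / 100

# 30 days x 24 hours = 720 hours; 1% of 720 hours = 7.2 hours.
budget_99 = max_non_idle_hours(30, 99)

# The policy default of 90% would allow 10% of 720 hours = 72 hours.
budget_90 = max_non_idle_hours(30, 90)
```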

idle-vms

The default setting in the policy is 90%, which differs from what I use in the group. Does it mean you do not have to worry about the Policy Settings for Idle?

You are right! Don’t worry about the policy. I do not use the Is Idle metric. I use the CPU Idle Time metric. No need to modify the default settings as you’re not using them.

For the Large VMs groups, you can change the definition to suit your needs. I’d recommend changing the threshold below from 4 vCPUs to at least 7. If you have a lot of 8 vCPU VMs, then change it to 8 so they are not included. Focus on the big ones.

Large VM - CPU

Make sure you only choose powered on VMs too, else powered off VMs get added! See below on how to add this condition. You can use this metric, or use the Summary | Running metric. For RAM, I use an in-guest metric as it’s more accurate. Just because a VM is powered on does not mean the OS is running.
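The two conditions for the Large VMs (CPU) group can be sketched as a simple filter. This is an illustration of the logic only, not the vR Ops group definition. The VM class, field names and sample VMs are hypothetical, and the strict "greater than" comparison is assumed from the advice above that a threshold of 8 excludes 8 vCPU VMs.

```python
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    vcpus: int
    powered_on: bool

def large_cpu_vms(vms, threshold=7):
    """Return names of powered on VMs with more vCPUs than the threshold."""
    return [vm.name for vm in vms if vm.powered_on and vm.vcpus > threshold]

# Hypothetical inventory.
inventory = [
    VM("db-01", 16, True),
    VM("app-01", 8, True),
    VM("old-db", 16, False),  # large, but powered off, so it is excluded
    VM("web-01", 2, True),
]

with_threshold_7 = large_cpu_vms(inventory, threshold=7)
with_threshold_8 = large_cpu_vms(inventory, threshold=8)
```

Raising the threshold from 7 to 8 drops the 8 vCPU VM from the group, which keeps the focus on the big ones.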

large-vm-ram

For the Powered Off VMs group, I define them as VMs that have been off for >50% of the past 30 days and are powered off at the moment. This is conservative, as the VM needs to have been powered off for a total of more than 15 days in the past 30 days.
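Both conditions of this definition have to hold together. Here is a minimal sketch of the logic, with hypothetical function and parameter names:

```python
WINDOW_DAYS = 30

def is_powered_off_vm(days_off_in_window, powered_on_now, window_days=WINDOW_DAYS):
    """True if the VM was off for more than 50% of the window and is off right now."""
    off_fraction = days_off_in_window / window_days
    return off_fraction > 0.5 and not powered_on_now

# Off for 20 of the last 30 days and still off: a member of the group.
long_off = is_powered_off_vm(20, powered_on_now=False)

# Off for 20 days but powered on again today: not a member.
back_on = is_powered_off_vm(20, powered_on_now=True)

# Off for exactly 15 days is 50%, which does not exceed the threshold.
borderline = is_powered_off_vm(15, powered_on_now=False)
```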

Powered Off VM

For the VM with VMware Tools installed group and the VM with no VMware Tools group, use the property shown below. I cloned the group, and simply changed is to is not.

group - VMware Tools

vRealize Operations Troubleshooting webcast

I just got to know that VMware Education is running a free webinar. It’s only 1 hour.

The date is 28 February 2017. There are 3 different sessions, so there should be one that suits your time zone:

  • 08:00 AM – 9:00 AM PST
  • 12:00 PM – 1:00 PM GMT
  • 12:00 PM – 1:00 PM SGT

The topic covers:

  • Brief Introduction to vRealize Operations
  • vRealize Operations Manager Components Layers Functionality
  • Configuration Files
  • vcopsClusterManager.py & vcopsConfigureRoles.py tools
  • Top Trending Issues
    • Node status stuck at “Waiting for Analytics”
    • Troubleshooting “Double Master” issue
    • Manually removing a node at the time of troubleshooting
    • Adapter collection issue

Your IM questions will be answered throughout the broadcast, plus we’ll finish up with a 10-minute Q&A session.

You can register here. I have registered myself as I can benefit from this session! Registration was fast and the course is complimentary.