How to upgrade to vRealize Operations 7.0

This post was contributed by someone I respect for his dedication to vR Ops. Varghese Philipose, Staff TAM, Dubai, has done many upgrades. He is part of the awesome TAM team in the Middle East, which has successfully done more than 100 upgrades. I’ve been to Dubai several times and met their customers. The customers' enthusiasm about how vR Ops has helped them is proof of the TAM team's dedication.

Phase 1 : Running Pre-upgrade Assessment tool
This phase is required only if you are on 6.6.1 or older. 6.7 does not need it.

  • Download upgrade assessment tool: vRealize Operations 7.0 – Upgrade Assessment Tool (File Type: pak, File size: 5.97 MB)
  • Launch a browser. Go to
    https://master-node-FQDN-or-IP-address/admin, where the address is the master node Administrator interface.
  • Click Software Update in the left panel.
  • Click Install a Software Update in the main panel.
  • Follow the steps in the wizard to locate and install your PAK file. Select the "Install the PAK file even if it is already installed" option.
  • Install the Upgrade Assessment Tool.
  • Wait for the software update to complete.

Note: If a cluster fails and its status changes to offline while a PAK file update is being installed, some nodes become unavailable. To fix this, go to the Administrator interface, manually take the cluster offline, and click Finish Installation to continue the installation process.

The next step is to review the pre-upgrade assessment report. Follow these steps:

  • Navigate to the Support > Support Bundles tab.
  • Download the light support bundle that was generated from the installation of the Upgrade Assessment Tool.
  • In the downloaded support bundle, open the cluster_timestamp_nodeaddress/nodeaddress_timestamp_nodeaddress/apuat-data/report/index.html file.
  • A list of all potentially impacted user content is displayed in the Impacted Components Summary page.
  • Make the recommended changes. In most cases, you just need to replace metrics with the recommended ones.
  • If you have a management pack that you do not use, remove it.
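
Finding the report inside the extracted bundle can be tedious because the top-level folder names contain timestamps and node addresses. Below is a minimal sketch, assuming you have already extracted the support bundle to a local folder. The function name and the folder argument are hypothetical; only the apuat-data/report/index.html suffix comes from the steps above.

```python
from pathlib import Path

# Path suffix taken from the step above; everything before it varies per bundle.
REPORT_SUFFIX = "apuat-data/report/index.html"


def find_assessment_reports(bundle_root):
    """Return every assessment report found under an extracted support bundle."""
    root = Path(bundle_root)
    # "**" matches any number of intermediate folders (cluster_..., nodeaddress_...).
    return sorted(root.glob(f"**/{REPORT_SUFFIX}"))


if __name__ == "__main__":
    # "." is a placeholder; point this at your extracted bundle folder.
    for report in find_assessment_reports("."):
        print(report)
```

Open the printed path in a browser to see the Impacted Components Summary page.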

Phase 2 : Upgrading to vR Ops 7.0

Download the following upgrade files using the download link on the My VMware portal:

File Name                                             Build #
vRealize Operations – Virtual Appliance OS upgrade    10098133
vRealize Operations – Virtual Appliance upgrade       10098133

  1. Make sure root logins to all the vR Ops nodes are available.
  2. Make sure all the vR Ops nodes have sufficient disk space for the upgrade (minimum 5% free disk space required). Check with the command “df -h” from an SSH session to each node.
  3. Optional step: If you have a large vR Ops cluster with more than 4 nodes, you can speed up the upgrade by pre-copying the upgrade files to all nodes except the master node. Follow the steps in KB 2127895.
  4. Take the vR Ops cluster offline from the Admin UI and shut down all the nodes.
  5. Optional but highly recommended: take a snapshot of all the nodes.
  6. Power on the vR Ops nodes in this sequence: master node, replica node, data nodes, then remote collectors.
  7. Take the cluster offline if it comes online after powering on. This step reduces the time taken for the upgrade.
  8. Initiate the vR Ops Virtual Appliance OS upgrade from the Admin UI. This upgrades the base SUSE Linux OS.
  9. Once the OS upgrade is completed, take the cluster offline and start the Virtual Appliance upgrade.
  10. Once the upgrade is completed, verify functionality. If everything works as expected, take the cluster offline, shut down the nodes, and delete the snapshots. Once the snapshots are removed, power on the nodes in sequence (the same order as in step 6).
  11. Once all nodes are powered on, bring the vR Ops cluster online.
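
If you would rather script the disk space check from step 2 than read the “df -h” output by eye, here is a minimal sketch. It assumes Python 3 is available wherever you run it, and that the 5% minimum applies to each mount point you pass in. The function names and the list of mount points are hypothetical; adjust them to the partitions that “df -h” reports on your nodes.

```python
import shutil

MIN_FREE_FRACTION = 0.05  # the 5% minimum from step 2


def free_fraction(total_bytes, free_bytes):
    """Return free space as a fraction of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def check_mount(path, minimum=MIN_FREE_FRACTION):
    """Return (ok, fraction) for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    fraction = free_fraction(usage.total, usage.free)
    return fraction >= minimum, fraction


if __name__ == "__main__":
    # Hypothetical list; replace it with the mount points on your nodes.
    for mount in ["/"]:
        ok, fraction = check_mount(mount)
        status = "OK" if ok else "LOW"
        print(f"{mount}: {fraction:.1%} free [{status}]")
```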

VMC Migration: Before vs After comparison

When you are migrating your customers' workloads to another infrastructure, the onus is on you to prove that you’re not causing problems for the VMs or applications. This is especially true if it’s your idea to migrate, and you’re not giving them a choice.

There are many examples of migration. Popular ones are:

  • From old DC to new DC.
  • From on-prem to VMC.
  • From on-prem to cloud. The target is typically vSphere-based, as you can simply move the VM without changing it.

In the above, you typically change all the infrastructure. New server, new network, new storage, new vSphere. You may virtualize the network by adding NSX. You may also virtualize storage by moving to vSAN.

Regardless, your application team does not and should not care. It’s transparent to them. In fact, it should be better as you’re using faster and bigger hardware. You have more CPU cores, faster RAM, faster storage, a bigger network, fewer network hops, etc.

And that’s exactly where the problem might start 😉

A VM that took 8 hours to complete a batch job may now take 2 hours. It completes the same amount of work, with the same total disk, network, CPU, and RAM activity, in a quarter of the time.

So what happens to the VM IOPS? Yes, it goes up 4x, to 400% of the previous level.

What happens to VM CPU usage? It also goes up 4x. It has to, as the VM executes the same amount of logic in a quarter of the time. Suddenly, a VM that ran relatively idle at 20% becomes highly utilized at 80%.
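
The arithmetic is worth spelling out. The sketch below assumes the total amount of work is fixed, so the average rate scales inversely with duration. The function names are mine, and the cap at 100% is a simplification because a VM cannot use more CPU than it has been given.

```python
def speedup(old_hours, new_hours):
    """How many times faster the same job completes after migration."""
    return old_hours / new_hours


def projected_utilization(old_percent, old_hours, new_hours):
    """Average utilization after migration, assuming the same total work."""
    return min(100.0, old_percent * speedup(old_hours, new_hours))


if __name__ == "__main__":
    factor = speedup(8, 2)
    cpu = projected_utilization(20, 8, 2)
    print(f"Rate multiplier: {factor:.0f}x")
    print(f"CPU usage: 20% before, {cpu:.0f}% after")
```

The same multiplier applies to any rate-based counter, such as IOPS or network throughput, for as long as the job runs.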

All the above is fine, if not for the next factor. Can you guess what it is?

Hint: it’s how you justify the budget to your management.

Yes, you promised higher consolidation. You have more CPU cores and more RAM, so logically you use a higher overcommit ratio. As Mark said, use it carefully.

Since you have to increase the overcommit ratio, how do you then prove that performance will not be affected as you drive utilization higher?

The answer is to look at which KPIs can impact a VM's performance. The article here provides the answer. A VM owner looks at her VM's performance, not your IaaS utilization.

The above is for a VM. It does not answer how the IaaS platform copes. This is where the Cluster KPI comes in.

With the above 2 dashboards, you can monitor and prove both the consumer layer (VM) and provider layer (Infra).

Complaint-based Operations

How do you know that the IaaS platform (be it on-prem or in the cloud) is serving its workload well? If you depend on complaints, then you are running complaint-based operations.

Changing from reactive to proactive is unfortunately a complex undertaking, especially in large organisations where there are many roles and people. It requires transformation and resetting behaviour. It is not easy to get customers to agree on an SLA when you’ve promised them “good” for years.

So what can you do?

  1. Measure your actual performance
  2. Improve it if the reality is not what you expect from a decent IaaS. If there is no complaint, this is even better as it means you do not have to panic and rush the improvement.
  3. Get buy-in from your management on the SLA. Establish this Internal SLA.

The diagram below shows how the Internal SLA represents the intermediate step. The Formal SLA is typically less stringent, giving yourself a buffer.

You do not need the following in this intermediate step:

  • Class of Service. You don’t have to have Gold, Silver, etc. tiers. You can keep your mixed workloads in the same clusters or datastores. This means that in vR Ops you do not need to create a policy. Just use the base, active policy. This simplifies adoption.
  • Per VM SLA. You’re measuring the infra for now. If you mix workloads, then a per-VM SLA is impossible to achieve.

Since you’re measuring just the infra, it’s a lot easier to implement. You start by establishing the KPI of a cluster. Below is my recommendation.

The above includes disk metrics, hence there is no need to look at the datastore. You just need to focus on the cluster object. This also removes complexity when a cluster has multiple datastores, and vice versa.

A Cluster KPI is simply the average of the above metrics. Now that you have a single metric for a cluster that combines its performance and utilization, you can aggregate. I’d recommend you count the number of clusters whose KPI falls into the red zone. If your environment is healthy, then the count will be 0. You can repeat this check every 5 minutes. In a performing IaaS, you will get a flat line as shown below. Since it’s a trend line, you can see the performance over time, giving you insight into whether there is a pattern. Since you’re counting the number of clusters in the red zone, this metric scales across thousands of clusters. You certainly expect all your clusters to perform!
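
To make the aggregation concrete, here is a minimal sketch in Python rather than in vR Ops itself. It assumes each metric has already been normalized to a 0 to 100 score where higher is healthier, and that the red zone is anything below a threshold. The threshold value, the cluster names, and the scores are hypothetical and are not taken from vR Ops.

```python
from statistics import mean

RED_THRESHOLD = 25.0  # hypothetical red-zone boundary on a 0-100 scale


def cluster_kpi(metric_scores):
    """Cluster KPI: the plain average of the cluster's metric scores."""
    return mean(metric_scores)


def count_red_clusters(clusters, threshold=RED_THRESHOLD):
    """Number of clusters whose KPI falls into the red zone."""
    return sum(1 for scores in clusters.values() if cluster_kpi(scores) < threshold)


def worst_first(clusters):
    """Cluster names sorted from the worst KPI to the best."""
    return sorted(clusters, key=lambda name: cluster_kpi(clusters[name]))


if __name__ == "__main__":
    # Made-up sample data: one score per metric, per cluster.
    sample = {
        "cluster-a": [90, 80, 100],
        "cluster-b": [10, 20, 30],
    }
    print("Clusters in red:", count_red_clusters(sample))
    print("Worst first:", worst_first(sample))
```

Running the count on every collection cycle gives the single trend line described above, and the sorted list gives the accompanying table.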

The table complements the line chart. It lists each cluster, sorted with the worst performing cluster first. You can click into a cluster and drill down into its KPI, if you customize the cluster summary page.

Implementation

The above requires a super metric, my favourite feature in vR Ops. Here is what it looks like:

I removed the disk metrics as I want to avoid double counting when used with vSAN. I also removed 1 network metric.

Below is a preview I ran on one of the clusters to show you the result.

Hope it helps you take the first step toward proactive operations.