
Do I need to upsize my vRealize Operations?

One common mistake I see in the field is an oversized vRealize Operations deployment. I guess the "bigger is better" thinking is hard to let go. It does not help that the official sizing guide is conservative. It is conservative for a good reason: there is a wide range of permutations of vR Ops deployments.

So if your deployment is a simple one, with no management packs and no End Point Operations agents, there is a good chance that you are better off with a smaller deployment. So how do you check?

I’ll use an actual example and run through my thought process. The example below is from a real production environment, not a lab. The environment is mid-size, around 3,000 VMs on 300 ESXi hosts.

The environment is heavy on vCenter folders, vR Ops custom groups, super metrics and alerts. It also has integration with a ticketing system. The result is 6,000 objects and 10 million metrics. The actual collection is 5,500 objects.
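If you prefer to verify the object count programmatically rather than through the UI, a minimal sketch against the Suite API could look like the one below. The hostname and credentials are placeholders, and the exact response fields can differ by version, so treat it as a starting point rather than a finished script.

```python
# Sketch: count the objects (resources) known to this vR Ops via the Suite API.
# Assumptions: Suite API reachable at VROPS_HOST, a local user account, and a
# JSON response whose pageInfo.totalCount holds the overall resource count.
import requests

VROPS_HOST = "vrops.example.com"   # hypothetical hostname
USERNAME = "admin"                 # hypothetical credentials
PASSWORD = "changeme"

BASE = f"https://{VROPS_HOST}/suite-api/api"
HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}

# Acquire a token (verify=False only because self-signed certs are common in labs).
auth = requests.post(f"{BASE}/auth/token/acquire",
                     json={"username": USERNAME, "password": PASSWORD},
                     headers=HEADERS, verify=False)
HEADERS["Authorization"] = f"vRealizeOpsToken {auth.json()['token']}"

# Ask for a single-item page; pageInfo.totalCount gives the total number of objects.
resp = requests.get(f"{BASE}/resources", params={"pageSize": 1},
                    headers=HEADERS, verify=False)
print("Total objects known to this vR Ops:", resp.json()["pageInfo"]["totalCount"])
```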

To see the above breakdown, go to the Cluster Management screen, as shown below. What can you tell from it?

This vR Ops instance has 5 nodes. It's clustered and well balanced. Each node handles around 1,100 objects. It also uses a Remote Collector to offload the 5-minute collection processing. As you'll see later, that strategy pays off well.

You can see the breakdown of objects being collected. The number 211 is made up of 205 + 6.

Now that we know what the deployment is, we can look at each node. From the screen below, you can see again that the Remote Collector runs a subset of a full node's services. It only has 4 main modules (Collector, Suite API, Watchdog and Admin UI). There are no Persistence or Database services there.

From the above, we can see the full metrics and properties of each node. We can also drill down into each module.

Remember the 5,500 objects being collected? Let's see the history. I'm plotting from Day 1. This is a new vR Ops deployment, so it only goes back to 1 March.

Notice it starts from 0, as that's when we deployed it. It was a phased deployment. We registered more vCenter Servers, so the number of VMs, objects and metrics went up. CPU Usage didn't jump accordingly, indicating the cluster has more than enough CPU to handle the extra load. In other words, the additional load was too small to make a difference.

Since the 5 nodes are well balanced, let's take 1 of them so we can dive deeper. I added Guest OS RAM this time around.

We see a similar jump in objects and metrics. That's expected by now. The impact on CPU was also minimal.

The spikes you see in CPU actually recur daily. We will show later that they happen at midnight. The daily spike eventually became higher. I'm not sure exactly what it is, but it's a daily calculation (e.g. capacity or Dynamic Thresholds). It's not super metrics or groups, as those are calculated every 5 minutes.

The additional load was actually substantial. It was in fact a 2x load, as you can see below. I used a more detailed chart, and you can see the sharp jump as we added a few vCenter Servers. Each vCenter in turn brings in all its objects.

The sharp jump made only a tiny difference in CPU Usage. From the pattern below, you wouldn't believe there was a 2x load. To me, the extra load was absorbed by the Remote Collector.

The RAM pattern was puzzling; I don't know why it looks like this. By the way, this counter is from the Guest OS, not from the VM level. I do expect memory to be fully used, as it's just a form of cache. I just don't know why Free RAM went up ahead of the addition.

Let's look at Network. The pattern matches CPU. The absolute number was low though: 10K KBps = 10 MBps = 80 Mbps.

Let's look at Storage. The pattern matches CPU. Read is higher, because at night vR Ops does its capacity and Dynamic Threshold calculations, and that means it's reading a lot of data. The absolute number was low though: 1,000 IOPS for 1,100 objects means roughly 1 IOPS per object.
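Just to make the unit conversions in the last two paragraphs explicit, here is the arithmetic written out:

```python
# Network: vR Ops reports KBps; convert to MBps and then to Mbps (decimal units).
kbps = 10_000                      # 10K KBps from the chart
mbps_bytes = kbps / 1000           # = 10 MBps (megabytes per second)
mbps_bits = mbps_bytes * 8         # = 80 Mbps (megabits per second)
print(f"{kbps} KBps = {mbps_bytes:.0f} MBps = {mbps_bits:.0f} Mbps")

# Storage: roughly 1,000 IOPS spread over ~1,100 objects on this node.
iops, objects = 1000, 1100
print(f"~{iops / objects:.2f} IOPS per object")   # about 0.9, i.e. roughly 1 IOPS each
```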

I said earlier we would dive into the CPU. Here is a 7-day chart. You can see there is a daily peak at midnight. But what about the 2nd peak, the one I marked with a "?" on the chart?

To answer that, we have to zoom into that period. Here is what it looks like. It turned out there was a problem. Notice there was no collection. So when we rectified the problem, vR Ops had to catch up.

From the chart, we can also see that the daily calculation does not last more than 15 minutes. The burst was short.

Hope that helps you in right-sizing your vR Ops.

vSphere Capacity Reclamation

This post continues from the Operationalize Your World post. Do read it first so you get the context.

There are 5 Reclamation levels you can do. Start from the easiest one first.

Let's go through the table above (a short code summary follows the list):

  • Non VM is the easiest, because these objects are not owned by someone else. They are yours! Non VM objects, such as templates and ISOs, should be kept in 1 datastore per physical location. Naturally, you can only reclaim Disk, and not CPU & RAM.
  • Orphaned VMs and orphaned vmdk files are next, as they are not even registered in vCenter. If they are, they may appear italicized, indicating something is wrong. They may not have owners either. Take note that vR Ops 6.4 cannot check for orphaned vmdk files.
  • Powered Off VMs are harder, as there is now an owner of the VM. You need to deal with the VM owner before you delete them.
  • Idle VMs are a great target, as you can reclaim CPU and RAM when you power them off. You cannot reclaim disk yet, as you are not deleting them yet.
  • Active VMs are the hardest. Focus on large VMs. Take on CPU and RAM separately; they are easier to tackle when you split them. Divide and conquer.
  • Claiming CPU and RAM from small VMs can be futile, regardless of whether they are idle. An idle VM with 1 vCPU cannot be reduced further. It should be powered off. Powering it off first is a safer procedure: you can simply power it back on if the VM is actually being used.
  • Snapshots. These are actually not as hard as CPU and RAM, hence in the actual dashboard we list them separately.
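To see the levels side by side, here is the same list expressed as a small data structure. This is my own summary of the table above, not something exported from vR Ops:

```python
# Summary of the reclamation levels above, ordered from easiest to hardest.
# Each entry lists what you can realistically reclaim at that level.
RECLAMATION_LEVELS = [
    {"level": "Non VM (templates, ISOs)",    "reclaim": ["Disk"]},
    {"level": "Orphaned VM / orphaned vmdk", "reclaim": ["Disk"]},
    {"level": "Powered Off VM",              "reclaim": ["Disk"]},        # delete after owner sign-off
    {"level": "Idle VM (power it off)",      "reclaim": ["CPU", "RAM"]},
    {"level": "Active VM (right-size)",      "reclaim": ["CPU", "RAM"]},
]
# Snapshots are listed separately in the dashboards, as they are easier than CPU/RAM.

for entry in RECLAMATION_LEVELS:
    print(f'{entry["level"]:<32} -> {", ".join(entry["reclaim"])}')
```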

Why do cars have brakes?

So they can go faster!

Take advantage of Powered Off as the brake for your Idle VMs. If you treat Idle and Powered Off as 1 continuum, you can power off the Idle VMs earlier. You get the benefit of CPU and RAM reclamation.

What value is considered Idle?

  • It has to be defined, so it’s measurable and not subjective. Declare it as a formal policy, so you don’t end up arguing with your customers.
  • The default setting in the vR Ops policy is CPU Demand = 100 MHz. A VM using 100 MHz or less CPU will be considered Idle.
  • While a VM uses CPU, RAM, Disk and Network, we only use CPU as the definition for Idle. I think there is no need to consider all 4 and state that all 4 must be idle, because they are inter-related. It takes CPU cycles to process network packets and perform disk activity. Data from the NIC and disk must also be copied to RAM, and the copying requires CPU cycles.
  • How long has it been under that threshold?
    A VM does not use CPU non-stop for months. There are times it's idle, and that's normal. A month-end VM that processes payroll can be idle for 29 days! The default value of 90% will miss this.

Because of these month-end VMs, I recommend you change the definition from 90% to 99%. Even 99% over 30 days can still wrongly mark an active VM as Idle. 1% active means it's only active for a total of about 7 hours (0.3 days) in 30 days. Notice it's a total, not 1 continuous block of 7 hours. It's accumulated within the 30 days.

A VM that has been idle for 30 days straight and then becomes active will need to accumulate only about 7 hours of activity to be marked as non-idle. A VM that accumulates those hours more slowly will obviously need more time. The Idle decision logic runs only every 24 hours, so the VM may still be marked Idle for days after it has gone active.
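To make the arithmetic concrete, here is a small sketch of how the 99% rule plays out over a 30-day window. The function is my own illustration of the logic described above, not vR Ops' actual implementation; the 100 MHz threshold is the policy default mentioned earlier.

```python
# Illustrative only: how a 99% "time under threshold" rule translates into an
# activity budget, and how accumulated active samples clear the Idle flag.
WINDOW_DAYS = 30
IDLE_PCT = 0.99              # VM must be below the CPU threshold 99% of the window
CPU_THRESHOLD_MHZ = 100      # policy default for "Idle"

budget_hours = (1 - IDLE_PCT) * WINDOW_DAYS * 24
print(f"Activity budget: {budget_hours:.1f} hours in {WINDOW_DAYS} days")   # ~7.2 hours

def is_idle(cpu_demand_mhz_samples, sample_minutes=5):
    """Return True if total active time within the window stays under the budget.

    cpu_demand_mhz_samples: 5-minute CPU Demand samples covering the 30-day window.
    """
    active_samples = sum(1 for s in cpu_demand_mhz_samples if s > CPU_THRESHOLD_MHZ)
    active_hours = active_samples * sample_minutes / 60
    return active_hours <= budget_hours

# A month-end payroll VM: idle for 29 days, then busy for ~10 hours at month end.
samples = [20] * (29 * 24 * 12) + [800] * (10 * 12)
print(is_idle(samples))   # False: 10 active hours exceed the ~7.2 hour budget
```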

The drawback of setting it at 99% is that we wait the full 30 days before deciding. In some corner cases, the VM may never be marked as Idle. Take this scenario:

  • A VM was active and served its purpose for months.
  • After 2 years, the application is decommissioned as a new version is released.
  • As a result, the VM goes idle, as it is simply waiting to be deleted. But because we set the threshold at 99%, the logic will wait for the full 30 days before deciding.
  • It keeps consuming CPU/RAM during this period, as basic services like antivirus and OS patching still run. If this non-app workload adds up to more than about 7 hours of activity in 30 days, the VM will never be marked as Idle.

Solution: increase the threshold from 100 MHz to an amount you think is suitable. If possible, power off the VM if it's really not used.
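As a rough illustration of why the 100 MHz default can keep such a decommissioned VM out of the Idle list (the scan duration and CPU numbers below are made up for the example):

```python
# Hypothetical decommissioned VM: only AV scans and OS patching remain, but a
# daily 30-minute scan pushes CPU Demand above the 100 MHz default.
scan_minutes_per_day, window_days = 30, 30
active_hours = scan_minutes_per_day / 60 * window_days   # 15 hours in 30 days
budget_hours = 0.01 * window_days * 24                    # ~7.2 hours at the 99% setting
print(active_hours > budget_hours)   # True: never marked Idle at a 100 MHz threshold

# Raising the threshold above the scan's CPU Demand (say to 300 MHz, if the scan
# peaks at ~250 MHz) keeps those samples below the line, so the VM can be flagged.
```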

Powered Off is simpler than Idle, as it’s binary.

A VM that has been powered off for at least 15 days will take another 15 days after being powered back on before it stops being marked as Powered Off. This creates a problem, as it's no longer a VM you can reclaim.

Solution: add "Is it powered on now?" into the formula. If a VM is running, it immediately stops being considered Powered Off.
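Here is a minimal sketch of what that extra check could look like. Again, this is my own illustration of the logic, not the product's internal formula.

```python
# Illustrative logic: a VM only counts as a reclaimable "Powered Off" VM if it was
# off for at least the policy percentage of the window AND it is still off right now.
WINDOW_DAYS = 30
POWERED_OFF_PCT = 0.50   # the recommended setting (changed from the 90% default)

def is_reclaimable_powered_off(off_hours_in_window, powered_on_now):
    off_fraction = off_hours_in_window / (WINDOW_DAYS * 24)
    if powered_on_now:
        return False                      # running VMs drop off the list immediately
    return off_fraction >= POWERED_OFF_PCT

# Off for 20 of the last 30 days, still off today -> reclaim candidate.
print(is_reclaimable_powered_off(off_hours_in_window=20 * 24, powered_on_now=False))  # True
# Same history, but someone powered it back on -> no longer a candidate.
print(is_reclaimable_powered_off(off_hours_in_window=20 * 24, powered_on_now=True))   # False
```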

This is where the setting is in vR Ops 6.4.

You need to modify the values in your active policy:

  • Change idle from 90% to 99%
  • Change powered off from 90% to 50%

The above is the first of a set of vR Ops dashboards for Capacity Reclamation. I added a short Read Me for 2 reasons:

  • There are 4 dashboards.
    1. The dashboard above
    2. Idle VMs and Powered Off VMs. See below.
    3. Active VM: CPU. See this
    4. Active VM: RAM. See this.
  • Reclamation is quite complex when you look at the details. There are many things we can reclaim.

You can replace the Read Me widget with a picture if you know the target screen resolution. I didn't use an image as it would make your import harder.

The above is the 2nd dashboard. It shows the Powered Off VMs and Idle VMs.

The summary at the top tells you how much you can reclaim. The table shows where you can reclaim it.

For the powered off VMs, the widget gives the summary. It tells you how many VMs, and how much space. The table provides details.

The numbers will not be identical due to rounding: the summary is shown in TB while the table is in GB. Just in case you're wondering, 3.7 TB is the correct rounding for 3,769.36 GB, as there are 1,024 GB in 1 TB. 3,769/1,024 is actually slightly less than 3.7 TB.
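The conversion, written out:

```python
# GB to TB using binary units (1 TB = 1,024 GB), matching the dashboard's rounding.
gb = 3769.36
tb = gb / 1024            # 3.6810... TB
print(f"{gb} GB = {tb:.4f} TB, shown as {tb:.1f} TB")   # 3.6810 TB -> 3.7 TB
```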