Tag Archives: vRealize Operations

Monitoring Active Directory with vRealize Operations

This blog post was contributed by my friend Luciano Gomes, a VMware TAM in the Rio de Janeiro area, Brazil. Thank you, Lucky!

In this post, I would like to show how you can monitor Microsoft Active Directory.

First, let’s get the prerequisites out of the way:

  1. vRealize Operations (Enterprise edition, not Advanced; AD is considered an application, not infrastructure)
  2. Endpoint Operations Agent running on each AD machine you are monitoring.
  3. AD Solution for vR Ops (here)

With the above done, download the dashboard I created. Import it into vR Ops.

Once done, follow the steps below to configure the Metric Config XML Files. This is required to drive the widgets, so they show the correct metrics.

The above will take you to the Manage Metric Config screen.

  1. Click the ReskndMetric folder to expand it.
  2. Click the green plus sign to create a new file.

You will need to repeat this step 4 times, once per file. Name each file exactly as listed below:

ad-server.xml

<?xml version="1.0" encoding="UTF-8"?>
 <AdapterKinds>
 <AdapterKind adapterKindKey="EP Ops Adapter">
 <ResourceKind resourceKindKey="Active Directory">
 <Metric attrkey="AVAILABILITY|ResourceAvailability" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="THROUGHPUT|DSClientBindsperMinute" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="THROUGHPUT|DSDirectorySearchesperMinute" label="" unit="" yellow="" orange="" red=""/>
 </ResourceKind>
 </AdapterKind>
 </AdapterKinds>

ad-ldap.xml

<?xml version="1.0" encoding="UTF-8"?>
 <AdapterKinds>
 <AdapterKind adapterKindKey="EP Ops Adapter">
 <ResourceKind resourceKindKey="Active Directory">
 <Metric attrkey="Active Directory LDAP:LDAP|AVAILABILITY|ResourceAvailability" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="Active Directory LDAP:LDAP|THROUGHPUT|LDAPSearchesperMinute" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="Active Directory LDAP:LDAP|THROUGHPUT|LDAPNewConnectionsperMinute" label="" unit="" yellow="" orange="" red=""/>
 </ResourceKind>
 </AdapterKind>
 </AdapterKinds>

ad-authentication.xml

<?xml version="1.0" encoding="UTF-8"?>
 <AdapterKinds>
 <AdapterKind adapterKindKey="EP Ops Adapter">
 <ResourceKind resourceKindKey="Active Directory">
 <Metric attrkey="Active Directory Authentication:Authentication|AVAILABILITY|ResourceAvailability" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="Active Directory Authentication:Authentication|THROUGHPUT|NTLMAuthenticationsperMinute" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="Active Directory Authentication:Authentication|THROUGHPUT|KerberosAuthenticationsperMinute" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="Active Directory Authentication:Authentication|THROUGHPUT|KDCTGSRequestsperMinute" label="" unit="" yellow="" orange="" red=""/>
 </ResourceKind>
 </AdapterKind>
 </AdapterKinds>

VM-OS-AD-metrics.xml

<?xml version="1.0" encoding="UTF-8"?>
 <AdapterKinds>
 <AdapterKind adapterKindKey="EP Ops Adapter">
 <ResourceKind resourceKindKey="Windows">
 <Metric attrkey="FileServer Logical Disk:C:|UTILIZATION|Avg.Disksec/Transfer" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="FileServer Mount:C:\ (local/NTFS)|UTILIZATION|UsePercent" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="FileServer Physical Disk:0 C:|UTILIZATION|%DiskTime" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="FileServer Physical Disk:0 C:|UTILIZATION|Avg.DiskQueueLength" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="FileServer Physical Disk:0 C:|UTILIZATION|CurrentDiskQueueLength" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="FileServer Physical Disk:0 C:|UTILIZATION|DiskReadBytes/sec" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="FileServer Physical Disk:0 C:|UTILIZATION|DiskWriteBytes/sec" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="NetworkServer Interface:Network Interface eth10 (ethernet)|THROUGHPUT|BitsReceivedperSecond" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="NetworkServer Interface:Network Interface eth10 (ethernet)|THROUGHPUT|BitsTransmittedperSecond" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="UTILIZATION|CpuUsage" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="UTILIZATION|PercentUsedMemory" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="UTILIZATION|PercentUsedSwap" label="" unit="" yellow="" orange="" red=""/>
 </ResourceKind>
 </AdapterKind>
 <AdapterKind adapterKindKey="VMWARE">
 <ResourceKind resourceKindKey="VirtualMachine">
 <Metric attrkey="cpu|capacity_contentionPct" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="cpu|usage_average" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="diskspace|actual.capacity.normalized" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="diskspace|underusedpercent" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="mem|host_contentionPct" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="mem|usage_average" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="storage|totalReadLatency_average" label="" unit="" yellow="" orange="" red=""/>
 <Metric attrkey="storage|totalWriteLatency_average" label="" unit="" yellow="" orange="" red=""/>
 </ResourceKind>
 </AdapterKind>
 </AdapterKinds>
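
If you would rather not hand-type four near-identical files, the structure above is regular enough to generate. Below is a minimal sketch using only the Python standard library. Note that `build_metric_config` is a hypothetical helper name of my own, not something vR Ops provides, and the output still has to be pasted into a new file in the Manage Metric Config screen.

```python
import xml.etree.ElementTree as ET


def build_metric_config(adapter_kind, resource_kind, attr_keys):
    """Return a Metric Config XML string for one adapter kind and resource kind.

    Hypothetical helper for illustration; it only mirrors the layout of the
    files listed in this post.
    """
    root = ET.Element("AdapterKinds")
    adapter = ET.SubElement(root, "AdapterKind", adapterKindKey=adapter_kind)
    resource = ET.SubElement(adapter, "ResourceKind", resourceKindKey=resource_kind)
    for key in attr_keys:
        # Empty label/unit/threshold attributes, same as the hand-written files.
        ET.SubElement(resource, "Metric", attrkey=key, label="", unit="",
                      yellow="", orange="", red="")
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root, space=" ")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Rebuild ad-server.xml from its three attribute keys.
ad_server_xml = build_metric_config(
    "EP Ops Adapter",
    "Active Directory",
    [
        "AVAILABILITY|ResourceAvailability",
        "THROUGHPUT|DSClientBindsperMinute",
        "THROUGHPUT|DSDirectorySearchesperMinute",
    ],
)
print(ad_server_xml)
```

Generating the files this way also guarantees they are well-formed, which matters because a single unclosed tag is enough to leave a widget empty.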

That’s it!

To use the dashboard, watch the 45-second video below:

PS: If you like the soundtrack, the music is Plain Truth, offered for free by Gunnar Olsen. Nice, right? 🙂

Hope you find it useful. Do reach out via LinkedIn and Twitter. Thanks for reading!

Which VMs need more resources?

You can add the following resources to a VM:

  • CPU
  • RAM
  • Storage

Network isn’t something you can increase this way, but you know that already 🙂

You can check which VMs need more resources by building a dashboard like the one below. It’s a simple dashboard, which you can customize and enhance. It lets you review each resource independently.

I’ve marked the above dashboard with numbers, so we can refer to them:

  1. This is a table that lists all VMs, with the highest 1-hour average of CPU Demand and RAM Demand for each. It is sorted by the highest CPU Demand. The table also lists each VM’s CPU and RAM configuration, so you can see if the VMs are small or large, and the cluster where each VM is located. I’m showing both CPU and RAM in a single table. You can clone the view and split them if that suits your operations better.
  2. This is a table that lists all VMs, but focusing on storage only. With storage, we do not have the complexity of checking peak utilisation. We simply need to check the present situation.
  3. This lists the Top-15 VMs with the highest CPU Demand and RAM Demand in a given period. The list is now split, as they can be different VMs. Do note that the Top-N widget averages the number over the selected period, so a VM with a cyclical workload may not show up. The Top-N is complemented with a distribution chart. Select a VM from the Top-N, and you can see where the VM’s utilisation sits.
  4. The distribution chart helps you see if the VM is really under-resourced or not. The 95th percentile is marked with a vertical green line. For an under-resourced VM, you expect that line to be at or near 100%, indicating that the VM hits 100% utilisation frequently. If the 95th percentile is at a low number, and you do not see the number 100 on the x-axis, the VM is not under-resourced.
  5. Storage is easier, as we can simply use the latest data. As a result, we can show a distribution of all the VMs. We use a heat map as it can show 2 dimensions. Every VM is represented as a box. The bigger the box, the more storage the VM is configured with. The color indicates how much of it the VM uses.
    • 0% = Black. Wastage
    • 10% = Green. Balanced usage
    • 100% = Red. Need more space!
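
To make points 3 and 4 concrete, here is a small sketch in plain Python (not anything vR Ops runs internally) that compares the average with the 95th percentile for two made-up VMs. The sample data and the nearest-rank `percentile` helper are my own assumptions for illustration.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers (pct between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical CPU utilisation samples (%), 100 collection cycles each.
cyclical_vm = [10] * 90 + [100] * 10   # idle most of the time, pegged 10% of it
steady_vm = [40] * 100                 # constant moderate load

for name, samples in (("cyclical", cyclical_vm), ("steady", steady_vm)):
    print(name,
          "average:", statistics.mean(samples),
          "95th percentile:", percentile(samples, 95))
```

By average, the steady VM ranks above the cyclical one, which is why a Top-N list alone can miss the VM that actually runs out of CPU. The 95th percentile puts them the other way round, which is what the distribution chart lets you see.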

The CPU and RAM views have limitations. For example, they may show high utilisation during AV backup. You want to ignore those periods. At this moment, the only way is to plot the high usage on a line chart. We use Log Insight for this. The chart below shows VMs that hit high CPU usage in a given period. Every time a VM hits high CPU usage, it shows up here. As you can see, only 4 VMs hit high CPU usage. All other VMs do not need more CPU.
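
If you export the samples and want to do the "ignore those periods" step yourself, the logic is simple. The sketch below is plain Python, not a Log Insight query; the function names, the 95% threshold and the 01:00 to 03:00 window are all assumptions made for the example.

```python
from datetime import datetime, time


def in_window(ts, start, end):
    """True if ts falls inside a daily window; the window may wrap past midnight."""
    t = ts.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def high_cpu_counts(samples, threshold, ignore_start, ignore_end):
    """Count high-CPU samples per VM, skipping the daily window to ignore."""
    counts = {}
    for vm, ts, cpu in samples:
        if in_window(ts, ignore_start, ignore_end):
            continue  # e.g. the nightly AV/backup window
        if cpu >= threshold:
            counts[vm] = counts.get(vm, 0) + 1
    return counts


# Hypothetical samples: (VM name, timestamp, CPU usage %).
samples = [
    ("vm-a", datetime(2016, 3, 1, 1, 30), 98),   # inside the ignored window
    ("vm-a", datetime(2016, 3, 1, 14, 0), 40),
    ("vm-b", datetime(2016, 3, 1, 10, 15), 97),
    ("vm-b", datetime(2016, 3, 1, 10, 20), 99),
    ("vm-c", datetime(2016, 3, 1, 2, 0), 100),   # inside the ignored window
]

counts = high_cpu_counts(samples, threshold=95,
                         ignore_start=time(1, 0), ignore_end=time(3, 0))
print(counts)
```

In this made-up data only vm-b is flagged, because the spikes on vm-a and vm-c fall entirely inside the ignored window.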

The chart above is an example from a healthy environment. What about an environment where a lot of VMs are under-sized? You expect to see lots of alarms! That’s what you have below.

Hope the above is useful. If not, drop me an email.

Do I need to upsize my vRealize Operations?

One common mistake I see in the field is oversized vRealize Operations. I guess the “bigger is better” thinking is hard to let go. It does not help that the official sizing guide is conservative. It is conservative for a good reason: vR Ops deployments vary widely.

So if your deployment is a simple one, with no management packs or Endpoint Operations, there is a good chance that you are better off with a smaller deployment. So how do you check?

I’ll use an actual example and run through my thought process. The example below is from a real production environment, not a lab. The environment is mid-size, around 3000 VMs on 300 ESXi hosts.

The environment is heavy on vCenter folders, vR Ops custom groups, super metrics and alerts. It also has integration with a ticketing system. The result is 6000 objects and 10 million metrics. The actual collection is 5500 objects.

To see the above breakdown, go to the Cluster Management screen, as shown below. What can you tell from it?

The vR Ops cluster has 5 nodes. It’s clustered and well balanced. Each node handles around 1100 objects. It’s also using a Remote Collector to offload the 5-minute processing. As you’ll see later, that strategy pays off well.

You can see the breakdown of Objects being collected. The number 211 is made up of 205 + 6.

Now that we know what the deployment is, we can look at each node. From the screen below, you can see again that the Remote Collector has a subset of the full node. It only has 4 main modules (Collector, Suite API, Watchdog and Admin UI). There is no Persistence or Database there.

From the above, we can see the full metrics and properties of each node. We can also drill down into each module.

Remember the 5500 objects being collected? Let’s see the history. I’m plotting since Day 1. This is a new vR Ops, so it only goes back to 1 March.

Notice it starts from 0, as that’s when we deployed it. It was a phased deployment. We registered more vCenter servers, so the number of VMs, objects and metrics went up. The CPU Usage didn’t jump accordingly, indicating it has more than enough CPU to handle the extra load. In other words, the additional load was too small to make a difference.

Since the 5 nodes are well balanced, let’s take 1 of them, so we can dive deeper. I added Guest OS RAM this time around.

We see a similar jump in objects and metrics. That’s expected by now. The impact on CPU was also minimal.

The spike you see in CPU is actually a daily occurrence. We will show later on that it happens at midnight. The daily spike eventually became higher. I’m not sure exactly what causes it, but it’s likely a daily calculation (e.g. capacity or DT). It’s not super metrics or groups, as these are calculated every 5 minutes.

The additional load was actually decent. It was in fact 2x load, as you can see below. I used a more detailed chart, and you can see here the sharp jump as we added a few vCenter servers. Each vCenter in turn brings all its objects.

The sharp jump makes a tiny difference in CPU Usage. From the pattern below, you won’t believe that there was 2x load. To me, the extra load was absorbed by the Remote Collector.

The RAM pattern was puzzling. I don’t know why. BTW, this counter is from Guest OS, not from VM level. I do expect memory to be fully used, as it’s just a form of cache. I just don’t know why Free RAM went higher ahead of the addition.

Let’s look at Network. The pattern matches CPU. The absolute number was low though. 10K KBps = 10 MBps = 80 Mbps.
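
The conversion is easy to get wrong when you mix bytes and bits, so here it is spelled out as a tiny Python sketch. It assumes decimal units (1 MB = 1000 KB), which is what the arithmetic above uses; the function name is just for illustration.

```python
def kbytes_per_sec_to_mbits_per_sec(kbytes_per_sec):
    """Convert KBps (kilobytes/s) to Mbps (megabits/s), using decimal units."""
    mbytes_per_sec = kbytes_per_sec / 1000   # KBps -> MBps
    return mbytes_per_sec * 8                # bytes -> bits


peak_kbps = 10_000  # the 10K KBps peak read off the chart
print(kbytes_per_sec_to_mbits_per_sec(peak_kbps), "Mbps")
```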

Let’s look at Storage. The pattern matches CPU. Read is higher, because at night vR Ops does its capacity and DT calculations, and that means it’s reading a lot of data. The absolute number was low though. 1000 IOPS for 1100 objects means roughly 1 IOPS per object.

I said earlier we would dive into the CPU. Here is a 7-day chart. You can see there is a daily peak at midnight. But what about the 2nd peak, the one I marked with “?”?

To answer that, we have to zoom into that period. Here is what it looks like. It turned out there was a problem. Notice there was no collection. So when we rectified the problem, vR Ops had to catch up.
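
You can spot this kind of collection gap without eyeballing a chart. The sketch below is plain Python over a list of sample timestamps; it assumes the default 5-minute collection cycle, and the timestamps are invented for the example rather than taken from this environment.

```python
from datetime import datetime, timedelta


def find_collection_gaps(timestamps, expected=timedelta(minutes=5)):
    """Return (last_sample_before_gap, first_sample_after_gap) pairs.

    A gap is any pair of consecutive samples further apart than `expected`.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > expected:
            gaps.append((prev, cur))
    return gaps


# Hypothetical sample times: regular collection, then a 30-minute hole.
start = datetime(2016, 3, 10, 8, 0)
timestamps = [start + timedelta(minutes=m) for m in (0, 5, 10, 40, 45)]

gaps = find_collection_gaps(timestamps)
for gap_start, gap_end in gaps:
    print("no collection between", gap_start, "and", gap_end)
```

A CPU peak that immediately follows a gap like this is the catch-up work, not a sign that the node is undersized.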

From the chart, we can also see that the daily calculation does not last more than 15 minutes. The burst was short.
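
If you want to verify the burst length from the raw numbers, a sketch like the one below will do. It is plain Python, assumes 5-minute samples, and approximates duration as the number of consecutive samples above a threshold times the interval. The threshold and the sample values are made up.

```python
def longest_burst_minutes(cpu_samples, threshold, interval_minutes=5):
    """Length in minutes of the longest run of samples above the threshold."""
    longest = current = 0
    for value in cpu_samples:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest * interval_minutes


# Hypothetical CPU usage (%) around midnight, one value per 5-minute cycle.
around_midnight = [20, 22, 21, 85, 90, 23, 20]

print(longest_burst_minutes(around_midnight, threshold=60), "minutes")
```

With 5-minute granularity, this can only tell you the burst lasted about 2 collection cycles, which is enough to confirm it stays under the 15-minute mark.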

Hope that helps you in right-sizing your vR Ops.