vRealize Operations 6.3 Self Monitoring

vRealize Operations 6.3 sports enhanced self-monitoring. Michael Ryom covered this on this blog as part of his what’s new in 6.3 post, so I will start where he left off. Do read his post first.

The screenshots in this blog are taken from the 6.4 release (see what’s new by Roshan). I do not have 6.3, but I think the self-monitoring feature is the same in 6.3.

vCenter Adapter Details

The vCenter Adapter collects data from vCenter. The dashboard helps answer collection questions from each vCenter, such as:

  • Is there anything wrong in collection? A big drop in the number of objects and metrics can give a clue, especially if you are not removing objects in the associated vCenter.
  • Is collection taking longer than usual?
  • Is collection failing to collect the new objects?
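The first question above (a big drop in the number of objects or metrics) is easy to check outside the dashboard too, if you export the counts. Here is a minimal sketch in Python. The function name, the 20% threshold and the sample numbers are my own assumptions for illustration, not something vR Ops ships with.

```python
def find_collection_drops(counts, drop_ratio=0.2):
    """Return the indexes of samples that fell sharply versus the previous sample.

    counts     : object (or metric) counts per collection cycle, oldest first
    drop_ratio : hypothetical threshold; 0.2 flags a drop of more than 20%
    """
    drops = []
    for i in range(1, len(counts)):
        previous, current = counts[i - 1], counts[i]
        # Skip empty cycles to avoid dividing by zero.
        if previous > 0 and (previous - current) / previous > drop_ratio:
            drops.append(i)
    return drops


# Made-up object counts for four collection cycles.
object_counts = [330, 331, 250, 251]
print(find_collection_drops(object_counts))  # prints [2]
```

A flagged cycle is only a clue. If you removed objects in the associated vCenter around that time, the drop is expected.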

vcenter-adapter-details

The lab has ~300 VMs and 30 ESXi hosts. I added the number of objects and metrics. On average, I get around 160 metrics per object.

As you can see from the above, I have customised it. It is safe to customize, and I do encourage you to do so. Best to follow these 2 rules:

  1. Do not use the Admin account. You won’t be able to track what you have changed if you do.
  2. Do not modify the existing objects. Clone them, and prefix the clones with your company name (e.g. MSFT).

If you want to know where the counters come from, go into edit mode. Notice you cannot edit if you are not using the built-in Admin account. That’s a protection, so you do not accidentally modify OOTB objects.

vcenter-adapter-details-where-it-comes-from

From the above, you can tell the metrics are coming from the vCenter object itself. vSphere World is chosen, as its children are vCenter objects.

Cluster Statistics

The dashboard provides aggregate information at the cluster level, so you can see a summary before going into each node. There are interesting counters such as Object, Metrics, Alarms and Alerts.

You can click on each scoreboard and the detail line chart will be automatically shown. For example, I clicked on Metric and can see that collection went up on 21 November. If this is not due to new VMs or vSphere infra, then it’s something I’d need to investigate.

self-cluster-statistics

You can also get the usual information such as CPU, RAM, Disk and Network. I’ve selected the CPU Usage in the example below.

self-cluster-statistics-cpu

If your vR Ops is slow, you can use the Average IO Transaction Time to tell you if vR Ops is experiencing high disk latency. If the number is much higher here than what you see at the VM level, check if the IO is stuck in the Guest OS.

self-cluster-statistics-latency

We can also see the IOPS. From here we can see there is a daily pattern. There is a daily spike in Writes. The peak hit 4K IOPS sustained over 5 minutes. The actual peak inside that window is likely higher, as each data point is a 300-second average. There is also a daily spike in Reads, but at a different time.
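To see why a 300-second average understates the real peak, here is a small Python sketch. The 60-second burst and the 1000 IOPS baseline are hypothetical numbers I picked for illustration. Only the 4K IOPS average and the 300-second window come from the chart.

```python
def implied_burst_iops(average_iops, window_seconds, burst_seconds, baseline_iops):
    """Estimate the IOPS during a burst, given the average over the whole window.

    Assumes a simple two-level shape: the disk runs at baseline_iops, except
    for one burst of burst_seconds. Real workloads are messier than this.
    """
    if not 0 < burst_seconds <= window_seconds:
        raise ValueError("burst_seconds must be within (0, window_seconds]")
    total_ios = average_iops * window_seconds
    baseline_ios = baseline_iops * (window_seconds - burst_seconds)
    return (total_ios - baseline_ios) / burst_seconds


# 4000 IOPS average over 300 seconds, hypothetical 60-second burst
# on top of a hypothetical 1000 IOPS baseline.
print(implied_burst_iops(4000, 300, 60, 1000))  # prints 16000.0
```

Under those assumptions the storage had to serve 16K IOPS for a minute, even though the chart only shows 4K.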

daily-iops-patterns

Performance Details

The detail dashboard covers the individual node. The lab only has 1 node, which is what I’d recommend you deploy. From what I see, a single node with 4 vCPU running on the latest Intel Xeon should be able to handle up to 4000 VMs. I’m assuming you only use the vSphere Adapters.

performance-node

Can you spot the customisation I made to the dashboard?

Yes, I’ve added an extra column. This is how it’s done.

performance-customize

Notice you do not need more than 4 vCPU here.

performance-node-cpu

Can you guess the one time peak around 16 November? Yes, that’s when we upgraded it.

Customising

You might want to customize the dashboard further, or build your own. You may also want to set up new alerts. To do that, you need to know 2 things:

  1. The Relationships among objects, such as the hierarchy.
  2. The metrics and properties for each object.

One way to study them is to click on a particular object and look at its All Metrics page. Below is an example. This one is for the Collector services. You can see the metrics and properties you can get from this object.

collector

Before you create a new alert or a new symptom, it’s wise to check whether an existing alert already covers it. For example, here are the symptoms for collector. Notice there are different object types. You need to know that.

symptom

Hope you find it useful. I will expand this post next week once I finish my travel.

vSphere visibility for Network Team

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Similar to the problem faced between the Storage Team and the Platform Team, the VMware Admin needs to reach out to the Network Team. vRealize Operations can aid in that effort with a set of purpose-built dashboards.

  • Basic Visibility
    • What have we got?
    • Network Team may not have the full picture of how Distributed vSwitches, Distributed Port Groups, Datacenters, Clusters, ESXi hosts, VMs, etc. are related.
  • Errors
    • Do we have any errors in our networks? If yes, which VMs and ESXi are affected?
    • Do we have too many special packets? Broadcast, multicast and unknown packets should be kept minimal in the network.
  • Utilization
    • Is any VM or ESXi near its peak?
    • Who are the top consumers for each physical datacenter?
    • How is the workload distributed?
  • Performance
    • When VM Owner thinks Network is the culprit, can both Network Team and Platform verify that quickly?
  • Configuration
    • Is the config consistent? Does it follow best practice?

I’ll show some sample dashboards to get you going.

Overview

The first dashboard provides overall visibility to Network team. It gives insight into the SDDC by showing relevant objects.

  • It shows the total environment at a glance. A Network Admin can see how the virtual network maps to the virtual environment (ESXi, vCenter Datacenter, vCenter).
  • It quickly shows the structure of the virtual network. For each Distributed vSwitch, you can see what its port groups are and which ESXi hosts are connected to it. You see the config of both objects (port group and ESXi).
  • The heatmap quickly shows all port groups by size, so you can find out your largest ones easily. The color code also lets you see which ones are used the most.

overview

Once you know your environment, you are in a position to do monitoring. The next dashboard quickly shows if there are dropped packets or error packets in your network. The line chart is color coded, and you should expect to see green.

If you do not see green, you can drill down into each Distributed Switch. As a virtual switch can span thousands of ports, it helps if you can drill down by Port Group and ESXi host. The dashboard automatically shows the relevant Port Group and ESXi of the distributed switch.

If there is a need to, you can even drill down to an individual VM. The table at the bottom is collapsed by default. Expand it and you’ll see all VMs with their dropped packets information.

performance

Other than dropped packets and error packets, Network Admin can also check for multicast and broadcast. The same concept is applied, so the Network team only needs to learn it once.

special-packets

Another factor that can impact performance is high utilization. Network team can see the total utilization of the network. Line charts are used instead of a single number, so you can see the pattern over time. In fact, 2 line charts are provided: detailed and big picture.

Based on the line charts, you can drill down into a specific time period where the peak was high. The Top-N lists the ESXi hosts with the highest usage. Click on one, and its detailed utilization will be shown. You can see if any of its vmnics is near the physical limit. The super metric takes into account both RX and TX, and you can hit a limit on either.
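To illustrate why RX and TX both matter, here is a rough Python sketch. It is not the actual super metric formula from the dashboard. It assumes a full-duplex link, where each direction has its own limit, so the busier direction decides how close the vmnic is to saturation. The function name and the sample numbers are hypothetical.

```python
def vmnic_utilization(rx_mbps, tx_mbps, link_speed_mbps):
    """Return utilization (0.0 to 1.0) of the busier direction of a vmnic.

    Assumes a full-duplex link: RX and TX are limited independently,
    so we compare the larger of the two against the link speed.
    """
    if link_speed_mbps <= 0:
        raise ValueError("link_speed_mbps must be positive")
    return max(rx_mbps, tx_mbps) / link_speed_mbps


# Hypothetical 10 Gb vmnic: light on receive, heavy on transmit.
print(vmnic_utilization(rx_mbps=2000, tx_mbps=9000, link_speed_mbps=10000))  # prints 0.9
```

Averaging the two directions would report this vmnic at 55%, which hides the fact that transmit is almost full.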

capacity

If you see that one ESXi host is saturated while others are barely used, your workload is not well distributed. Note that vSphere 6.0 DRS does not consider network, so imbalance is possible.

You can also see the network utilization per datacenter. You can change the threshold (limitation: same limit for every DC) to suit your need. Click on the datacenter, and all the ESXi and port groups connected to it will be shown. Click on an ESXi, and you can drill down into a VM.

workload-distribution

Another reason for poor performance is that some VMs consume excessive network resources. Or you have a peak period, where the total is simply much higher than in the low period. The next dashboard provides 2 line charts. Again, line charts are used so you can see the pattern.

The table provides a list of VMs, sorted by their peak utilization. You can find out who the bursty users are (5 minute highest), and who the sustained users are (1 hour highest and 1 day highest).
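The difference between a bursty user and a sustained user shows up when you average the same data over different windows. Below is a minimal Python sketch, assuming 5-minute samples and a sliding window. This is my own illustration, and the way vR Ops rolls up the data internally may differ. The function name and sample data are hypothetical.

```python
def highest_window_average(samples, window):
    """Return the highest average over any run of `window` consecutive samples.

    samples : utilization values at a fixed interval (assumed 5 minutes here)
    window  : number of samples per window (1 = 5 minutes, 12 = 1 hour)
    """
    if window <= 0 or len(samples) < window:
        raise ValueError("window must be between 1 and len(samples)")
    running = sum(samples[:window])
    best = running
    for i in range(window, len(samples)):
        # Slide the window by one sample: add the new one, drop the oldest.
        running += samples[i] - samples[i - window]
        best = max(best, running)
    return best / window


# A made-up bursty VM: idle for 2 hours except one 5-minute spike of 1200.
bursty = [0] * 11 + [1200] + [0] * 12
print(highest_window_average(bursty, 1))   # 5 minute highest: prints 1200.0
print(highest_window_average(bursty, 12))  # 1 hour highest: prints 100.0
```

A VM like this tops the 5-minute list but drops far down the 1-hour list. A sustained user would show similar numbers in both.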

top-consumers

Lastly, it’s all about Service, not System or Architecture. When a VM Owner complains that IaaS is causing the problem, Network Admin and VMware Admin can quickly see the same dashboard to agree whether network is the culprit or not.

vm-troubleshooting

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

SDDC Operations Dashboards for SMB environment

This post continues from the Operationalize Your World post. Do read it first so you get the context.

The SMB segment is a world of its own. There are things that are mandatory in Enterprise segment, but not relevant in SMB segment. As a result, products should be tailored for that market segment.

IMHO, there are actually 4 different market segments when it comes to SDDC Operations. I use the number of VMs as the marker for each segment. Each of the following segments requires different dashboards and reports.

  1. 100 VM
  2. 1000 VM
  3. 10000 VM
  4. 100000 VM

Now, it will be difficult to create a product with 4 sets of vROps dashboards & reports. I make a compromise on the above, and use this one instead:

  1. 250 VM: SMB market
  2. 2500 VM: Enterprise market
  3. 25000 VM: <give me a name here folks> market

I hope the above is acceptable.

Let’s dive into the 250 VM segment. What are the unique characteristics?

  • 1-2 guys doing everything. No silos in the team. You and your best friends take care of the whole darn IT.
  • You only have a few clusters. Each cluster only has a few ESXi hosts.
  • You know your environment very well because it’s small. They all fit into 1 rack. Architecture is simple. You have a mental picture of it in your head.
  • You don’t buy hardware or VMware every quarter. Likely it’s every 2 years. Capacity planning and monitoring are simple.
  • The workload is quite stable. You are not adding/removing/changing VM every day.
  • Service Tier is overkill as you only have 1-2 clusters for all workloads.

Which of the above points apply to a large environment?

You are right. None.

As a result, we’ve developed a purpose-built set of dashboards for the SMB market. It only has the following:

  1. Home
  2. Availability
  3. Performance
  4. Capacity
  5. Reclaimable Capacity
  6. When a VM Owner complains

Home is your main dashboard. It’s the first dashboard you check, likely on a daily basis as part of your cadence. It answers the no 1 question: is everything healthy?

This is what it looks like in vR Ops 6.3. I’ve added explanation so you can easily see that it’s layered into 4 areas.

home

The first element of Health is Availability. If a VM or ESXi is down, there is no need to talk about performance or capacity as the damn thing is dead 🙂 The Availability dashboard gives you detailed info. You can answer questions such as “When did it go down? For how long?”

availability

Just because something is up does not mean it’s fast. The Performance dashboard provides the info here. The dashboard sports the new concept of Performance, which you can review here. It does not apply the formal SLA, as that’s not applicable in SMB. Even without SLA, you can use it to prove your innocence, or justify a new hardware purchase.

performance

Generally speaking, a Performance problem happens because demand is not being met by supply. The Capacity dashboard gives detailed info on the supply side.

capacity

One common issue in a virtual environment is VM sprawl. Some of these VMs end up not being used. You can reclaim CPU, RAM and Disk from these VMs.

  • The easiest to reclaim is from orphaned VMs, as they are not even registered in vCenter.
  • The second easiest is snapshots. You should only keep a snapshot for 1 day or less.

Once the above is reclaimed, you need to look at Powered Off VMs and Idle VMs.

  • CPU and RAM are reclaimed from running VMs, as powered off VMs are no longer consuming the resource.
  • CPU: reclaim from large VMs (e.g. 8 vCPU or more). Avoid reclaiming from 2 vCPU VMs unless you’ve completed the large VMs.
  • RAM: reclaim from large VMs (e.g. 16 GB RAM or more) that have Guest OS metrics. They are more accurate than hypervisor metrics.
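The CPU and RAM rules above can be sketched as a simple filter and sort, for example in Python against an exported VM list. The VM class, its field names and the sample inventory are hypothetical. Only the 8 vCPU and 16 GB thresholds come from the examples above.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    vcpus: int
    ram_gb: int
    has_guest_metrics: bool  # True if Guest OS memory metrics are collected


def cpu_reclaim_candidates(vms, min_vcpus=8):
    """Large VMs first: only VMs at or above min_vcpus, biggest on top."""
    large = [vm for vm in vms if vm.vcpus >= min_vcpus]
    return sorted(large, key=lambda vm: vm.vcpus, reverse=True)


def ram_reclaim_candidates(vms, min_ram_gb=16):
    """Large VMs with Guest OS metrics, biggest on top."""
    large = [vm for vm in vms if vm.ram_gb >= min_ram_gb and vm.has_guest_metrics]
    return sorted(large, key=lambda vm: vm.ram_gb, reverse=True)


# Made-up inventory of running VMs.
inventory = [
    VM("web-01", vcpus=2, ram_gb=8, has_guest_metrics=True),
    VM("db-01", vcpus=16, ram_gb=64, has_guest_metrics=True),
    VM("app-01", vcpus=8, ram_gb=32, has_guest_metrics=False),
]
print([vm.name for vm in cpu_reclaim_candidates(inventory)])  # prints ['db-01', 'app-01']
print([vm.name for vm in ram_reclaim_candidates(inventory)])  # prints ['db-01']
```

The list only tells you where to look first. Whether a VM is actually oversized still depends on its utilization.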

The Reclaimable dashboard lists all the VMs that have been idle or powered off. It also lists the orphaned VMs and large snapshots.

reclaimable

Last but not least, your job is actually about making sure the VM is being served well. It’s a service. Your customers don’t care about your infrastructure. So when they complain that their VM has a problem, you need a dashboard that quickly proves whether the problem is at your end or their end. TTI is not Time to Investigate, but Time to Innocence 😉

The Troubleshoot a VM dashboard is built exactly for that!

troubleshoot-a-vm

This dashboard is quite long, as it lets you check underlying ESXi and datastore. You can collapse the widget, as shown below, to see more.

troubleshoot-a-vm-2

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.