Monitoring NSX Edge, SSL VPN, Firewall and Logical Switch

This post is contributed by my friend Luciano Gomes, a VMware PSO Senior Consultant in the Rio de Janeiro area, Brazil. Thank you, Lucky!

In this post, I would like to show you how you can monitor NSX Edge, SSL VPN, Firewall and Logical Switch using only one dashboard.

First, let’s get the prerequisites out of the way:

  1. vRealize Operations (Advanced/Enterprise License)
  2. vCenter + NSX
  3. vR Ops Management Pack for NSX

My friend Romain Decker has covered the installation of the Management Pack. Read it here first.

Another friend (life is good when you have many experts as friends!), Lan Nguyen, has documented how to import the dashboard here.

With the above done, download the dashboard to be imported here.

Once done, follow the steps below to configure the Metric Config XML files.

The above will take you to the Manage Metric Config screen.

  1. Click the ReskndMetric folder to expand it.
  2. Click the green plus sign to create a new file.

Name the file exactly as shown below:

Copy and paste the XML below:

<?xml version="1.0" encoding="UTF-8"?>

<AdapterKinds>
  <AdapterKind adapterKindKey="NSX">
    <ResourceKind resourceKindKey="SSLVPNEdgeService">
      <Metric attrkey="clients|clients_active" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|auth_failures" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|utilization" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="clients|workload" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="FirewallEdgeService">
      <Metric attrkey="rule|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="EdgeServicesGateway">
      <Metric attrkey="cpu|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="disk|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|connection_health" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="interface:Uplink|connected" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|usage_average" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|used_percent" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="net|maxObserved_KBps" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|attached_vms" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|running" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|status" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="mem|used_percent" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
    <ResourceKind resourceKindKey="LogicalSwitch">
      <Metric attrkey="port|max" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_packet_pct" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|broadcast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|multicast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|utilization" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_util" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|maxobserved_tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|unicast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|rx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|multicast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|dropped_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|broadcast_rx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|unicast_tx_packets" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="port|tx_traffic" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="summary|attached_vms" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
  </AdapterKind>
</AdapterKinds>
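Because the file is pasted by hand, it can be worth checking that it is still well-formed XML before saving it. The snippet below is only a sketch of such a check using Python's standard library. It is not part of the management pack, the function name `summarize_metric_config` is made up for illustration, and the sample is a shortened copy of the file above.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the metric config file above.
SAMPLE = """
<AdapterKinds>
  <AdapterKind adapterKindKey="NSX">
    <ResourceKind resourceKindKey="FirewallEdgeService">
      <Metric attrkey="rule|used" label="" unit="" yellow="" orange="" red=""/>
      <Metric attrkey="status|service_status" label="" unit="" yellow="" orange="" red=""/>
    </ResourceKind>
  </AdapterKind>
</AdapterKinds>
"""


def summarize_metric_config(xml_text):
    """Parse the config and return {resourceKindKey: [attrkey, ...]}.

    Raises xml.etree.ElementTree.ParseError if the XML is malformed,
    and ValueError if a ResourceKind lists the same attrkey twice.
    """
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    summary = {}
    for kind in root.iter("ResourceKind"):
        kind_key = kind.get("resourceKindKey")
        keys = [metric.get("attrkey") for metric in kind.findall("Metric")]
        duplicates = sorted({k for k in keys if keys.count(k) > 1})
        if duplicates:
            raise ValueError(f"duplicate attrkey in {kind_key}: {duplicates}")
        summary[kind_key] = keys
    return summary


if __name__ == "__main__":
    for kind_key, keys in summarize_metric_config(SAMPLE).items():
        print(kind_key, len(keys), keys)
```

A copy-and-paste slip such as a missing closing tag surfaces as a `ParseError` here, rather than as an empty widget later.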

That’s it!

To use the Dashboard, see the image below:

Hope you find it useful. Do reach out via LinkedIn and Twitter. Thanks for reading!

NOC Dashboards for SDDC – Part 2

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Dashboard: Performance

Is my IaaS performing?

That’s the key question that you need to answer. You need to show if the clusters are coping well. Show how the clusters are performing in terms of CPU, RAM and Disk.

The above dashboard is per Service Tier. Do you know why?

Yes, the threshold differs for each tier. What is acceptable in Tier 3 may not be acceptable in Tier 1.

The good thing about a line chart is that it provides visibility beyond the present moment. You can show the last 6 hours and still get good detail. Showing more than 24 hours results in a visualisation that is too static, which does not suit the NOC use case.

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • If you only have a few clusters, you can show multiple Service Tiers in 1 dashboard. 1 row per tier results in a simpler visualisation.
  • In an environment with more than 10 clusters, group them by Service Tier. Focus more on the highest tier (Tier 1).
  • In an environment with more than 100 clusters, you need another grouping in between. Group the Tier 1 clusters by physical location.

When a cluster is unable to cope, is it because its utilization is high? I show CPU, RAM and Disk here. You can add Network, as you know the physical limit of the ESXi vmnic.

Disk is tricky for 2 reasons:

  • It is not in %. There is no 100% for IOPS or Throughput. The good thing is that when you architected the array or vSAN, you had a certain IOPS number in mind, right? 😉 Well, now you can get the storage vendor to prove that it does indeed perform as per the PowerPoint 😉 If not, you get free hardware, if they promised fast performance that meets the business requirement.
  • You need to show both IOPS and Throughput. While they are related, you can have low IOPS and high throughput, and vice versa.
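To see why both numbers are needed, recall that throughput is IOPS multiplied by the average I/O size. The sketch below works through that arithmetic. The two workloads and their block sizes are made-up figures for illustration, not measurements.

```python
def throughput_mbps(iops, io_size_kb):
    """Throughput in MB/s for a given IOPS and average I/O size in KB.

    Uses 1 MB = 1024 KB.
    """
    return iops * io_size_kb / 1024


# Hypothetical workloads, chosen only to show the two extremes.
backup_job = throughput_mbps(iops=500, io_size_kb=1024)  # few, large I/Os
database = throughput_mbps(iops=20000, io_size_kb=4)     # many, small I/Os

if __name__ == "__main__":
    print(f"backup job: 500 IOPS   -> {backup_job} MB/s")
    print(f"database:   20000 IOPS -> {database} MB/s")
```

In this example the backup job has 40 times fewer IOPS than the database, yet it pushes more than 6 times the throughput. A chart that shows only one of the two metrics would miss one of these workloads.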

If the cluster utilization is high, the next dashboard drills into each ESXi.

We can also see if the hosts are unbalanced. In theory, they should not be, if you set DRS to Fully Automated and pick at least the middle sensitivity level (3). DRS in vSphere 6.5 also considers Network, so you can still see unbalanced CPU/RAM.

With the dashboard above, we can tell if ESXi CPU Utilisation is healthy or not.

  • A low value does not mean the VM performs fast. A VM is concerned with the VM Contention metric, not ESXi Utilization. A low value means we over-invested. It is not healthy, as we waste money and power.
  • A high value means we risk performance (read: contention).

For ESXi, go with a higher core count. You save on licenses if you can reduce the socket count.

We can also tell if ESXi RAM Utilisation is healthy or not.

  • Customers tend to overspend on RAM and underspend on CPU. The reason is this.
  • For RAM, we have 2 metrics:
    • Active RAM
    • Mapped RAM
  • The value you want is somewhere in between Active and Mapped.

In the dashboard, the 3 widgets have different ranges. The ranges I set are 30 – 90, 50 – 100 and 10 – 90.

Why not 0 – 100?

The upper limit is not 100% because you want to cater for HA. Your ESXi hosts should not hit 100%: when HA restarts the VMs of a failed host on the surviving hosts, the load would go beyond 100%, meaning performance will be badly affected.
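The HA headroom can be estimated with simple arithmetic. The sketch below assumes identical hosts and that the load of a failed host spreads evenly across the survivors. Real clusters and HA admission control policies are more nuanced, and both function names are made up for illustration.

```python
def max_safe_utilization(hosts, host_failures_to_tolerate=1):
    """Highest average host utilization (%) that still fits after failures.

    Assumes identical hosts and an even spread of the failed hosts' load.
    """
    if hosts <= host_failures_to_tolerate:
        raise ValueError("need more hosts than failures to tolerate")
    return 100.0 * (hosts - host_failures_to_tolerate) / hosts


def utilization_after_failure(current_pct, hosts, failed_hosts=1):
    """Average utilization (%) of the surviving hosts after a failure."""
    if hosts <= failed_hosts:
        raise ValueError("no surviving hosts")
    return current_pct * hosts / (hosts - failed_hosts)


if __name__ == "__main__":
    print(max_safe_utilization(8))           # 8 hosts, tolerate 1 failure
    print(max_safe_utilization(4))           # 4 hosts, tolerate 1 failure
    print(utilization_after_failure(90, 4))  # 4 hosts at 90%, 1 host fails
```

Under these assumptions an 8-host cluster can run at up to 87.5% and a 4-host cluster at up to 75%. A 4-host cluster running at 90% would need 120% of the remaining capacity after losing one host, which is why the widget range stops short of 100.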

If the cluster or ESXi utilization is high, is it because there are VMs generating excessive workloads?

The dashboard above answers if we have VMs that dominate the shared environment.

  • CPU: show a heat map showing all VMs, sized by CPU Demand in GHz (not %), color by contention
  • RAM: show a heat map showing all VMs, sized by Active RAM, color by contention
  • Storage: show a heat map showing all VMs, sized by IOPS, color by latency.

At a glance, we can tell the workload distribution among the VMs. We can also tell if they are being served well or not.
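The size-by-demand, color-by-contention idea can be sketched outside the product too. The example below is a rough illustration for the CPU heat map only. The VM names, the numbers and the yellow/red thresholds are all hypothetical, and as noted the thresholds would differ per Service Tier.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    name: str
    cpu_demand_ghz: float      # drives the size of the heat map cell
    cpu_contention_pct: float  # drives the color of the heat map cell


def heatmap_cells(vms, yellow=2.0, red=5.0):
    """Return (name, share_of_total_demand, color) per VM, largest first.

    The yellow/red contention thresholds are illustrative defaults only.
    """
    total = sum(vm.cpu_demand_ghz for vm in vms)
    cells = []
    for vm in sorted(vms, key=lambda v: v.cpu_demand_ghz, reverse=True):
        share = vm.cpu_demand_ghz / total if total else 0.0
        if vm.cpu_contention_pct >= red:
            color = "red"
        elif vm.cpu_contention_pct >= yellow:
            color = "yellow"
        else:
            color = "green"
        cells.append((vm.name, round(share, 3), color))
    return cells


if __name__ == "__main__":
    sample = [
        VmSample("web01", 1.5, 0.5),
        VmSample("db01", 8.0, 6.0),
        VmSample("app01", 0.5, 3.0),
    ]
    for cell in heatmap_cells(sample):
        print(cell)
```

In this made-up sample, db01 takes 80% of the total demand and is also the VM suffering contention. A large red box like that is exactly the pattern the heat map is meant to surface.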

Limitation & customisation:

  • You need 1 widget per Service Tier.
  • You can change the threshold anytime. If you want brand new storage from Finance, set the max to 1 ms 😉
  • In a larger environment, group your heat map (e.g. by cluster, host, folder).
  • We can show individual VMs, but we can’t show the history, as there is too much data to show.
  • This needs to be done per Tier. 1 dashboard per Tier, as the threshold varies per tier.

Hope you find it useful. For the product-specific implementation, review this blog. To prevent the vR Ops session from timing out, implement this trick by Sunny.

Tenant Self-Service Dashboards

This post continues from the Operationalize Your World post. Do read it first so you get the context.

A common request among VMware Admins is to give their customers self-service access to their own VMs. The VM Owners should be given a simple portal, where they can easily see all their VMs and their performance. The solution in this blog is inspired by the work done in this video and this paper. We’ve reduced the visibility and supplied a custom dashboard with super metrics.

Naturally, VM Owners do not have access to vSphere, as that’s too deep into the kitchen. We are also not assuming that you have vRealize Automation or vCloud Director in place. So this is just using vR Ops.

Requirements:

  • Tenant can only see her own VMs.
  • Tenant cannot see the underlying infrastructure. It is both irrelevant and not something you’re comfortable disclosing.

This is what the dashboard looks like. It has a simple ReadMe to guide the tenant.

[Screenshot: dashboard]

We’ve added visibility into how the IaaS is serving the VM. Provide that transparency to your customer, and you have a major advantage over the public cloud.

The tenant has very limited access to vR Ops. The following 3 screenshots show what has been removed.

[Screenshots: dashboard-2, dashboard-3, vm-drilldown]

Implementation

As you can guess, there are 2 parts to the implementation:

  • One-time setup.
    • A general setup that applies to all tenants.
    • You develop your dashboards here.
  • Per Tenant work.
    • A tenant-specific setup that you need to do for each tenant.
    • Things like creating an account for that tenant belong here.

One-Time Setup

Create a role called Tenants. The purpose is to limit which features it can access.

The tenant needs only 2 permissions, as shown below:

Create a group called Tenants. The purpose is to limit which objects it can see. To recap, the Role limits which features can be seen, while the Group limits the objects.

Click Objects, then select the Tenants role (which you created earlier). Do not provide any more access, so none of the object hierarchy is selected.

Create a group type. Call it Tenants. Each tenant will have 1 group and 1 group only.

Download the files. Import the dashboard and super metrics.

Create the Text Widget file and Resource Kind file. See the screenshot below as a guide. The names have to be identical, and they are case sensitive.

[2 Nov 2016: Thank you Patrick Nganga for spotting that I missed 2 files. There are 4 that you need]

[Screenshot: files]

That’s all you need as the base. All the work below is now per tenant. So if you have 10 tenants, you need to repeat 10x. I know…

Per-Tenant Setup

Create a group that contains all the VMs of a single Tenant. It is best to use the Tenant Name as the group name. If you organize the VMs properly in vCenter, by using vSphere Tags or Folders, you can take advantage of that. The example below uses a vSphere Folder.

[Screenshot: do not hardcode the VMs in the group]
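The reason to prefer a membership rule over a hardcoded VM list can be shown with a small model. This is plain Python, not the vR Ops API. The inventory, folder names and the `group_members` helper are all hypothetical.

```python
def group_members(inventory, folder):
    """Return the names of all VMs whose folder matches the rule."""
    return sorted(vm["name"] for vm in inventory if vm["folder"] == folder)


# Hypothetical inventory: each VM carries the vSphere folder it lives in.
inventory = [
    {"name": "acme-web01", "folder": "Tenant-ACME"},
    {"name": "acme-db01", "folder": "Tenant-ACME"},
    {"name": "globex-app01", "folder": "Tenant-Globex"},
]

if __name__ == "__main__":
    print(group_members(inventory, "Tenant-ACME"))

    # A new VM placed in the tenant's folder joins the group automatically.
    inventory.append({"name": "acme-web02", "folder": "Tenant-ACME"})
    print(group_members(inventory, "Tenant-ACME"))
```

With a hardcoded list, the second VM would stay invisible to the tenant until someone remembered to edit the group.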

Once created, the group will appear under the Tenants group type. I’ve created 2 examples. Ensure the number of VMs matches what it should be.

[Screenshot: 1 group for each tenant]

Create an account for each tenant. Give it full Administrator access, just temporarily.

Log in using this newly created account. I’d use another browser, as I do not want to log out from my administrator account. Go to Dashboard, select all dashboards except the one you want to show, and remove them from Home. See how it’s done below. Once done, Visible on Home will indicate that they are hidden.

[Screenshot: hide all the other dashboards]

Log out of the tenant account, or simply close the browser.

Switch back to your administrator account. Remove the tenant administrator privilege, and map it to the Tenants role, as shown below.

Map the account to the associated group, and only to this group. This limits the visibility. Yes, this is how the “security” is done. I’m not sure if this is honoured by the API, but you can block the tenant ID from accessing the API.

[Screenshot: select the right group for each tenant]
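Conceptually, the tenant restriction is two independent checks: the role decides which features are available, and the group decides which objects are visible. The model below is only an illustration of that idea in plain Python. It does not reflect how vR Ops implements access control, and the feature names and VM names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    features: frozenset  # what the account may do


@dataclass(frozen=True)
class Account:
    name: str
    role: Role
    visible_objects: frozenset  # what the account may see (its group)


def can_access(account, feature, obj):
    """Access needs both the feature (role) and the object (group)."""
    return feature in account.role.features and obj in account.visible_objects


# Hypothetical role and tenant; the feature names are illustrative only.
tenants_role = Role("Tenants", frozenset({"view_dashboard", "view_object"}))
acme = Account("acme", tenants_role, frozenset({"acme-web01", "acme-db01"}))

if __name__ == "__main__":
    print(can_access(acme, "view_dashboard", "acme-web01"))    # own VM
    print(can_access(acme, "view_dashboard", "globex-app01"))  # other tenant
    print(can_access(acme, "edit_policy", "acme-web01"))       # not in role
```

Only the first call is allowed in this model. Another tenant's VM fails the group check, and a feature outside the Tenants role fails the role check.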

Limitation

  • Tenant can only have 1 group. The Total is based on a super metric that sums per group. It cannot add up multiple groups, as it does not know which groups to select.
  • Alerts are not implemented yet.
  • Tenant cannot change the alerts. For example, they cannot change their own threshold.
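The single-group limitation follows from how a per-group total works: the sum is evaluated over the members of one group at a time. The sketch below mimics that in plain Python. It is not super metric syntax, and the groups, VM names and values are hypothetical.

```python
def group_total(groups, metric_by_vm, group_name):
    """Sum one metric over the members of a single group."""
    return sum(metric_by_vm[vm] for vm in groups[group_name])


# Hypothetical groups and a per-VM metric (e.g. CPU demand in GHz).
groups = {
    "ACME": ["acme-web01", "acme-db01"],
    "Globex": ["globex-app01"],
}
cpu_demand_ghz = {"acme-web01": 1.5, "acme-db01": 8.0, "globex-app01": 2.0}

if __name__ == "__main__":
    for name in groups:
        print(name, group_total(groups, cpu_demand_ghz, name))
```

Each total belongs to exactly one group. Nothing in the calculation says which groups make up a tenant, so a tenant spread over several groups would see several separate totals rather than one.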

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.