Monthly Archives: October 2016

SDDC Operations Dashboards for SMB environment

This post continues from the Operationalize Your World post. Do read it first so you get the context.

The SMB segment is a world of its own. There are things that are mandatory in the Enterprise segment but not relevant in the SMB segment. As a result, products should be tailored for that market segment.

IMHO, there are actually 4 different market segments when it comes to SDDC Operations. I use the number of VMs as the marker for each segment. Each of the following segments requires different dashboards and reports:

  1. 100 VM
  2. 1000 VM
  3. 10000 VM
  4. 100000 VM

Now, it will be difficult to create a product with 4 sets of vROps dashboards & reports. I make a compromise on the above, and use this one instead:

  1. 400 VM: SMB market
  2. 4000 VM: Enterprise market
  3. 40000 VM: <give me a name here folks> market

I hope the above is acceptable. As each segment covers a very wide range, I’d take the following reference points:

  1. SMB market: 250 VM
  2. Enterprise market: 2500 VM
  3. Huge Private Cloud: 25000 VM

Let’s dive into the 250 VM segment. What are the unique characteristics?

  • 1-2 guys doing everything. No silos in the team. You and your best friends take care of the whole darn IT.
  • You only have a few clusters. Each cluster only has a few ESXi hosts.
  • You know your environment very well because it’s small. They all fit into 1 rack. Architecture is simple. You have a mental picture of it in your head.
  • You don’t buy hardware or VMware every quarter. Likely it’s every 2 years. Capacity planning and monitoring are simple.
  • The workload is quite stable. You are not adding/removing/changing VM every day.
  • Service Tier is overkill as you only have 1-2 clusters for all workloads.

Which of the above points apply to a large environment?

You are right. None.

As a result, SMB needs purpose-built dashboards. They cover the following:

  1. Availability
  2. Performance
  3. Capacity
  4. Reclaimable Capacity
  5. Compliance
  6. When a VM Owner complains

Home

Your main dashboard. It’s the first dashboard you check, likely on a daily basis as part of your cadence. It answers the no 1 question: is everything healthy?

This is what it looks like in vR Ops 6.3. I’ve added explanation so you can easily see that it’s layered into 4 areas.

home

Availability

The first element of Health is Availability. If a VM or ESXi is down, there is no need to talk about performance or capacity as the damn thing is dead 🙂

The Availability dashboard gives you detailed info. You can answer questions such as “When did it go down? For how long?”

availability

The dashboard is also useful when you need to report uptime. You do need to create a report and customize it though. If you need it, email me your requirement.
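To make the uptime arithmetic concrete, here is a minimal Python sketch of how "When did it go down? For how long?" could be answered from a series of up/down samples. It is only an illustration, not how vR Ops computes availability. The sample format (a time-ordered list of timestamp and is_up pairs) and the function names are my assumptions.

```python
from datetime import datetime, timedelta

def find_outages(samples):
    """Return (start, end) outages from time-ordered (timestamp, is_up) samples.

    An outage starts at the first 'down' sample and ends at the next 'up'
    sample. An outage still open at the last sample ends at that sample.
    """
    outages = []
    down_since = None
    for ts, is_up in samples:
        if not is_up and down_since is None:
            down_since = ts
        elif is_up and down_since is not None:
            outages.append((down_since, ts))
            down_since = None
    if down_since is not None:
        outages.append((down_since, samples[-1][0]))
    return outages

def uptime_percent(samples):
    """Percentage of the observed period that falls outside any outage."""
    total = (samples[-1][0] - samples[0][0]).total_seconds()
    if total <= 0:
        return 100.0
    down = sum((end - begin).total_seconds() for begin, end in find_outages(samples))
    return 100.0 * (total - down) / total

# Hypothetical 5-minute samples for one VM (assumed data, not real output).
start = datetime(2016, 10, 1, 0, 0)
states = [True, True, False, False, True, True, True]
samples = [(start + timedelta(minutes=5 * i), up) for i, up in enumerate(states)]

for begin, end in find_outages(samples):
    print(f"Down from {begin} for {end - begin}")
print(f"Uptime: {uptime_percent(samples):.1f}%")
```

With 5-minute samples, an outage shorter than 5 minutes can be missed entirely, so treat the result as an approximation.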

Performance

Just because something is up does not mean it’s fast. The Performance dashboard provides the info here. The dashboard sports the new concept of Performance, which you can review here. It does not apply the formal SLA, as that’s not applicable in SMB. Even without SLA, you can use it to prove your innocence, or justify a new hardware purchase.

Line charts are used because a performance problem might have started earlier, or it may no longer be happening and you’re doing a root cause analysis.

If the performance issue is caused by a villain VM, the dashboard lets you find the VM. Change the timeline in the Top-N widget to the time when the performance problem occurred.

BTW, if you like the ability to find out which VM was causing the problem, send your thank you to Matthew Hurley.

Capacity

Generally speaking, a performance problem happens because demand is not being met by supply. The Capacity dashboard gives detailed info on the supply side. As there are only a few clusters, capacity management is much simpler.

capacity

Notice it takes into account performance.

If you mix Prod and Non Prod, capacity management becomes harder. Since the hardware is shared, we need to monitor at the overall cluster level. Since the Production VMs have a more stringent SLA, naturally their number reflects that. As a result, we need to show Prod and Non Prod differently. Let me know if you need it, as to me that complicates operations. This is another reason why I advocate separate cluster for Prod and Non Prod.

One common issue in a virtual environment is VM sprawl. Some of these VMs end up not being used. You can reclaim CPU, RAM and Disk from these VMs.

  • The easiest to reclaim is from orphaned VMs, as they are not even registered in vCenter.
  • The second easiest is snapshots. You should only keep a snapshot for 1 day or less.

Once the above is reclaimed, you need to look at Powered Off VMs and Idle VMs

  • CPU and RAM are reclaimed from running VMs, as powered off VMs are no longer consuming the resource.
  • CPU: reclaim from large VMs (e.g. 8 vCPU or more). Avoid reclaiming from 2 vCPU VMs unless you’ve completed the large VMs.
  • RAM: reclaim from large VMs (e.g. 16 GB RAM or more) that have Guest OS metrics. They are more accurate than hypervisor metrics.
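The CPU and RAM rules above can be sketched in a few lines of Python. This is a rough illustration, not the logic of the actual dashboard: the VM record, its field names and the sample VMs are hypothetical, and the 8 vCPU and 16 GB cut-offs are simply the examples from the list. It only shortlists large running VMs; deciding whether each one is really oversized is still your call.

```python
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    vcpu: int
    ram_gb: int
    powered_on: bool
    has_guest_metrics: bool  # True if Guest OS memory metrics are collected

def cpu_reclaim_candidates(vms, min_vcpu=8):
    """Running VMs with at least min_vcpu vCPUs, largest first."""
    picks = [vm for vm in vms if vm.powered_on and vm.vcpu >= min_vcpu]
    return sorted(picks, key=lambda vm: vm.vcpu, reverse=True)

def ram_reclaim_candidates(vms, min_ram_gb=16):
    """Running VMs with at least min_ram_gb of RAM and Guest OS metrics, largest first."""
    picks = [
        vm for vm in vms
        if vm.powered_on and vm.ram_gb >= min_ram_gb and vm.has_guest_metrics
    ]
    return sorted(picks, key=lambda vm: vm.ram_gb, reverse=True)

# Hypothetical inventory.
vms = [
    VM("db01", 16, 64, True, True),
    VM("app01", 8, 16, True, False),   # no Guest OS metrics, so skipped for RAM
    VM("web01", 2, 4, True, True),     # too small to bother with
    VM("old01", 12, 32, False, True),  # powered off, so no CPU/RAM to reclaim
]

print([vm.name for vm in cpu_reclaim_candidates(vms)])
print([vm.name for vm in ram_reclaim_candidates(vms)])
```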

The Reclaimable dashboard lists all the VMs that have been idle or powered off. It also lists the orphaned VMs and large snapshots.

reclaimable

Configuration

If you apply the vSphere Hardening Guide, and your infra and VMs comply with it, you will see all green in the dashboard below. If not, you can see exactly which VM or infra is not complying. You can customize the default threshold, although it’s better that you customize the symptoms & alerts instead.

You can see compliance for Network and vCenter too, under the vSphere Compliance widget. There is a drop-down there that is not shown.

IaaS

Last but not least, your job is actually about making sure the VM is being served well. It’s a service. Your customers don’t care about your infrastructure. So when they complain that their VM has a problem, you need a dashboard that quickly proves whether the problem is at your end or their end. TTI is not Time to Investigate, but Time to Innocence 😉

The Troubleshoot a VM dashboard is built exactly for that!

troubleshoot-a-vm

This dashboard is quite long, as it lets you check underlying ESXi and datastore. You can collapse the widget, as shown below, to see more.

troubleshoot-a-vm-2

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

Tenant Self-Service Dashboards

This post continues from the Operationalize Your World post. Do read it first so you get the context.

A common request among VMware Admins is to give their customers self-service access to their own VMs. The VM Owners should be given a simple portal, where they can easily see all their VMs and their performance. The solution in this blog is inspired by the work done in this video and this paper. We’ve reduced the visibility and supplied a custom dashboard with super metrics.

Naturally, VM Owners do not have access to vSphere, as that’s too deep into the kitchen. We are also not assuming that you have vRealize Automation or vCloud Director in place. So this is just using vR Ops.

Requirements:

  • Tenant can only see her own VMs.
  • Tenant cannot see the underlying infrastructure. It is both irrelevant and not something you’re comfortable disclosing.

This is what the dashboard looks like. It has a simple ReadMe to guide the tenant.

dashboard

We’ve added visibility into how the IaaS is serving the VM. Provide that transparency to your customer, and you have a major advantage over the public cloud.

The tenant has very limited access to vR Ops. The following 3 screenshots show what has been removed.

dashboard-2

dashboard-3

vm-drilldown

Implementation

As you can guess, there are 2 parts to the implementation:

  • One-time setup.
    • A general set up you do that is applicable to all tenants
    • You develop your dashboards here.
  • Per Tenant work.
    • A tenant-specific setup that you need to do for each tenant.
    • Things like creating an account for that tenant belongs here.

One-Time Setup

Create a role called Tenants. Its purpose is to limit what features it can access.

1

The tenant needs only 2 permissions, as shown below:

2

Create a group called Tenants. Its purpose is to limit what objects it can see. To recap, the Role limits what features can be seen, while the Group limits the objects.

3

Click the Objects, then select the Tenants role (which you created earlier). Do not provide any more access. So none of the object hierarchy is selected.

4

Create a group type. Call it Tenants. Each tenant will have 1 group and 1 group only.

5

Download the files. Import the dashboard and super metrics.

6

Create the Text Widget file and Resource Kind file. See the screenshot below as a guide. The name has to be identical, and it is Case Sensitive.

[2 Nov 2016: Thank you Patrick Nganga for spotting that I missed 2 files. There are 4 that you need]

7-files

That’s all you need as the base. All the work below is now per tenant. So if you have 10 tenants, you need to repeat 10x. I know…

Per-Tenant Setup

Create a group that contains all the VMs of a single Tenant. Best is to use the Tenant Name as the group name. If you organize the VMs properly in vCenter, by using vSphere Tags or Folders, you can take advantage of that. The example below is using vSphere Folder.

do-not-hardcode-the-vm-in-the-group
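The reason for not hard-coding the VMs is that a rule keeps the group correct as VMs come and go. Here is a small Python sketch of that idea. It is purely conceptual and does not use the vR Ops API: the inventory structure, folder paths and VM names are all made up for illustration.

```python
def group_members(inventory, folder):
    """Return the sorted names of VMs located in `folder` or any sub-folder.

    Matching on the full path segment avoids picking up a sibling folder
    whose name merely starts with the same text.
    """
    root = folder.rstrip("/")
    return sorted(
        vm["name"]
        for vm in inventory
        if vm["folder"] == root or vm["folder"].startswith(root + "/")
    )

# Hypothetical inventory: one dict per VM with its folder path.
inventory = [
    {"name": "acme-web01", "folder": "Tenants/Acme"},
    {"name": "acme-db01", "folder": "Tenants/Acme/Databases"},
    {"name": "globex-app01", "folder": "Tenants/Globex"},
    {"name": "acme2-web01", "folder": "Tenants/Acme2"},
]

acme = group_members(inventory, "Tenants/Acme")
print(acme, len(acme))
```

A new VM dropped into the tenant folder joins the group with no extra work, and the member count gives you a quick sanity check against what you expect for that tenant.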

Once created, the group will appear under the Tenants group type. I’ve created 2 examples. Ensure the number of VMs matches what it should be.

1-group-for-each-tenant-2

Create an account for each tenant. Give it full Administrator access. This is just temporary.

Log in using this newly created account. I’d use another browser, as I do not want to log out from my administrator account. Go to Dashboard, select all dashboards except the one you want to show, and remove them from Home. See how it’s done below. Once done, the Visible on Home column will show that they are not to be shown.

hide-all-the-other-dashboards

Log out of the tenant account. Or simply close the browser.

Switch back to your administrator account. Remove the tenant administrator privilege, and map it to the Tenants role, as shown below.

Map the account to the associated group, and only to this group. This limits the visibility. Yes, this is how the “security” is done. I’m not sure if this is honoured by the API, but you can block the tenant ID from accessing via API.

select-the-right-group-for-each-tenant

Limitation

  • Tenant can only have 1 group. The Total is based on a super metric that adds per group. It cannot add multiple groups as it does not know which groups to select.
  • Alerts are not implemented yet.
  • Tenant cannot change the alerts. For example, they cannot change their own threshold.

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

vSphere visibility for Storage Team

This post is part of Operationalize Your World. Do read that post first to get the context.

Ask any Storage Team and Platform Team whether the collaboration between them can be improved by a mile, and you are likely to get a nod. One reason for this issue is the lack of common visibility. You need to see the same thing if you want to collaborate. The Storage Team does not always get access to vSphere. Even if they do, the vCenter UI is not designed for the Storage Team. It is designed for the VMware Admin.

vRealize Operations and Log Insight can bridge that gap by providing a set of read-only, purpose-built dashboards that answer common questions such as:

  • When a VM Owner complains, can we confirm or rule out a storage issue within 1 minute?
    • No ping pong between VM Owner, vSphere Admin, Storage Admin
  • Is the Storage serving all the VMs well?
    • If not, who are affected, when and how bad? Read or Write?
    • The answer has to be tier based, as Tier 1 VM expects lower latency than Tier 3
  • What’s the total demand hitting the array? Are they growing fast?
    • Who are the heavy hitters among the VMs?
  • When & where are we running out of capacity?
    • How much disk space can be reclaimed? From which VMs?
  • What have we got?
    • Are they consistently configured?

The questions above cover the main areas of SDDC Operations, such as performance, capacity, configuration and availability. They enable joint troubleshooting, capacity planning, performance monitoring. For better collaboration, add Blue Medora TVS, so you can analyze physical arrays and fabrics, and then correlate back with vSphere.

Overview

The first dashboard provides overall visibility to Storage team. It gives insight into the SDDC by showing relevant objects.

  • It quickly shows the summary of key information.
  • It shows VM, datastore, datastore clusters, compute cluster, and datacenter. It shows their relationship, which you can interact with and drill down into.
  • It shows all the VMs, where they are located, how much space they are allocated, and how they are using it.

overview

Limitations & Customisation:

  • No RDM. Customers are moving away, and not many are using it to begin with.
  • Physical array Availability. This requires Storage MP or Blue Medora MP

Performance

The purpose of the vSphere platform is running VMs. So long as it is providing good service to the VMs, we don’t have to explain the thing underneath. Whether 10,000 IOPS at VM level translates into 8,000 at hypervisor level due to caching at the host is not as important as whether the VMs are being served well (as defined by SLA).

So we need to find out the latency of every single VM in the vCenter. This is near impossible to establish in vCenter, as you have to go through a lot of VMs. vR Ops helps by using super metrics.
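Conceptually, the roll-up looks at every VM and keeps the worst value, then checks each VM against what its tier expects. Below is a hedged Python sketch of that idea. It is not super metric syntax and not the dashboard’s actual formula. The per-tier latency limits, the field names and the sample values are assumptions I made up for illustration.

```python
# Hypothetical latency limits per service tier, in milliseconds.
TIER_LATENCY_LIMIT_MS = {1: 10.0, 2: 20.0, 3: 30.0}

def latency_summary(vms):
    """Summarise disk latency across VMs.

    Returns the worst VM and the names of VMs whose latency exceeds the
    limit of their own tier.
    """
    worst = max(vms, key=lambda vm: vm["latency_ms"])
    breaching = sorted(
        vm["name"]
        for vm in vms
        if vm["latency_ms"] > TIER_LATENCY_LIMIT_MS[vm["tier"]]
    )
    return {
        "worst_vm": worst["name"],
        "worst_latency_ms": worst["latency_ms"],
        "breaching": breaching,
    }

# Hypothetical VMs in one cluster.
vms = [
    {"name": "sql01", "tier": 1, "latency_ms": 14.0},
    {"name": "web01", "tier": 3, "latency_ms": 22.0},
    {"name": "app01", "tier": 2, "latency_ms": 6.5},
]

summary = latency_summary(vms)
print(summary)
```

In this made-up data the VM with the highest latency (web01, Tier 3) is still within its limit, while the Tier 1 VM is not. That is why the answer has to be tier based.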

Overall Performance

The set of dashboards answers questions such as:

  • What’s the overall performance, for each cluster and datastore? No point troubleshooting a VM or ESXi if the overall array is heavily hit.
  • What’s the total demand hitting the storage system? Who are the heavy hitters?
  • When a cluster is not performing, do we know when and which VMs were affected? Looking at cluster is useful as that’s where the demand comes from.

performance-overall

Total Latency is not Read + Write latency. In IP Storage, it is not Tx and Rx. It is both Tx as the ESXi host is sending the packets.

Datastore Performance

As this is for the Storage Team, we can drill down to a specific datastore. It provides detailed line charts of the datastore latency, throughput, outstanding IO and IOPS.

It also shows the VMs in the datastore, and if any of them is generating a lot of IOPS (villain VM).

performance-troubleshooting

Heavy Hitters

The performance problem could be caused by high overall loads. The dashboard shows you the total IOPS and total Throughput. If the number is high, you can drill down to see if there are Heavy Hitters.

What is a Heavy Hitter? It has to be defined.

  1. What: IOPS or Throughput?
    • A VM with large block size can generate high throughput without doing excessive IOPS.
  2. How long: 5 minutes, 1 day, 1 week?
    • When we say heavy, how long before it becomes heavy?
    • The heavy hitters dashboard distinguishes between Bursty Hitter and Sustained Hitter.
  3. When: Today or yesterday? Any pattern?
    • The heavy hitters of yesterday may not be the heavy hitters of today.

It uses line charts so you can see patterns.
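One possible way to tell a Bursty Hitter from a Sustained Hitter is sketched below in Python. This is my own simplification, not the dashboard’s actual logic: a VM is "sustained" if its average IOPS over any run of consecutive samples reaches a threshold, and "bursty" if only a single sample spikes. The thresholds, the window length and the sample data are all assumptions.

```python
def classify_hitter(iops, burst_threshold, sustained_threshold, window):
    """Classify one VM from its IOPS samples (e.g. one sample per 5 minutes).

    'sustained': the average of some `window` consecutive samples is at or
                 above sustained_threshold.
    'bursty':    not sustained, but at least one sample is at or above
                 burst_threshold.
    'normal':    neither of the above.
    """
    if len(iops) >= window:
        # Sliding-window sum so each sample is added and removed once.
        window_sum = sum(iops[:window])
        best = window_sum
        for i in range(window, len(iops)):
            window_sum += iops[i] - iops[i - window]
            best = max(best, window_sum)
        if best / window >= sustained_threshold:
            return "sustained"
    if iops and max(iops) >= burst_threshold:
        return "bursty"
    return "normal"

# Hypothetical 5-minute samples; a window of 12 samples is 1 hour.
spike = [100] * 10 + [5000] + [100] * 10
steady = [2500] * 24
quiet = [100] * 24

for name, series in [("spike", spike), ("steady", steady), ("quiet", quiet)]:
    print(name, classify_hitter(series, 3000, 2000, 12))
```

The same function works for throughput instead of IOPS; only the thresholds change.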

heavy-hitters

Limitation and Customisation:

  • Storage Team normally wants sharper granularity than 5 minutes. You can complement the above view with VMkernel. I’ve posted about it here, so follow that link.
  • VM-level IOPS are different to Infra-level IOPS
    • Front-End IOPS are different to Back-End IOPS.
    • Distributed Storage has its own set of back-end IOPS, such as synchronisation and replication.
  • We should know the cause of the poor storage performance.
    • The array?
      • Is it near its peak? Low cache, not enough spindle, etc.
      • Is it uneven (one SP doing the bulk of the work)? SP trespass?
      • Does it have hot spots?
    • The network or fabric?
      • Higher chance if you’re on IP Storage.
      • Happens if you don’t turn on Network QoS. There could be storage vMotion happening.
    • The ESX?
      • This could be bad configuration, like wrong multipath or insufficient HBA queue depth.
    • The VM?
      • Why does it generate high IOPS? Which process inside the Guest OS is doing that?
      • A snapshot not removed?
      • A developer running IOmeter?

Capacity

The set of dashboards answers questions such as:

  • What’s the overall capacity? How is it used? Where are we on over-subscription?
  • Is any datastore running low on capacity? Are we using the datastores equally?
  • Are the VMs equally distributed among the datastores?
  • For each shared datastore, what is the capacity?

capacity-overall

The heat maps are useful in comparing across datastores:

  • Size by capacity
    • This shows if datastores are consistent in size. You should not have too much variance in size.
  • Color by capacity remaining.
    • Red = running out of capacity. Do not deploy more VMs.
    • Black = wastage. You’re not using it.
    • You should expect green.
  • Group by vCenter, then Datastore Cluster.
    • You know quickly where they are.

Combining the above 3 pieces of info into a single heat map tells you if you are using your storage properly.
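The colour rule can be written down in a few lines. The Python sketch below is only an illustration of the red / green / black idea. The 10% and 50% cut-offs and the datastore names are placeholders I picked, not the thresholds used in the actual heat map, so adjust them to your own policy.

```python
def capacity_color(remaining_pct, low=10.0, high=50.0):
    """Map capacity remaining (in percent) to a heat map colour.

    red   = running out of capacity, do not deploy more VMs
    black = wastage, the space is not being used
    green = the expected, healthy range in between
    """
    if remaining_pct < low:
        return "red"
    if remaining_pct > high:
        return "black"
    return "green"

# Hypothetical datastores with their capacity remaining in percent.
datastores = {"ds-gold-01": 6.0, "ds-gold-02": 28.0, "ds-bronze-01": 83.0}
colors = {name: capacity_color(pct) for name, pct in datastores.items()}
print(colors)
```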

You should use Datastore Cluster as part of your design. If you do, you also get visibility into their capacity, as shown below:

capacity-overall-datastore-cluster

You can drill down into each datastore and determine their capacity.

capacity-details-datastore

Limitation and Customisation:

  • Use datastore cluster. This should be part of your architecture anyway.
    • A limitation here is distributed storage, as it has no datastore clusters. You should exclude them and deal with them separately as the monitoring metrics are different.
  • If you have too many datastore clusters, then use Service Tier to group them

Capacity: Space Reclamation

There are 4 places where you can reclaim storage, from easiest to the hardest

  1. Orphaned VMs and orphaned VMDK. They are the easiest as they are not even owned.
  2. Snapshot.
  3. Powered Off VM
  4. Idle VM

capacity-reclamation

The dashboard does not list active VM because politically it’s hard to reclaim. Don’t bother trying 🙂

Storage as a Service

When a VM owner complains, can we rule out within 1 minute whether Storage is the issue?

Using the following dashboard, you select or browse for the VM in question. Its key storage properties and KPI will be automatically shown. We are using line charts as the problem might have happened in the past and is no longer present. You can also verify if it’s a one-off issue or a regular issue.

The VM’s datastore will be shown automatically. The VM in the screenshot has its VMDK files in 3 different datastores. You can click on each, and the performance will be automatically shown. This lets you verify if the underlying datastore was able to cope or not.

performance-single-vm-troubleshooting

Limitation and Customisation:

  • The dashboard does not show Throughput. Throughput matters more with large block sizes. A 4 – 32K block size should not be a problem when IOPS is low.
  • The dashboard does not show Outstanding IO. This is useful to tell if the underlying infra is unable to process the IO.
  • Add snapshot. Latency for snapshot will be higher as it has to go through multiple operations.
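The point about Throughput and block size is simple arithmetic: throughput is IOPS multiplied by block size. A quick Python sketch, using made-up IOPS figures and taking 1 MB as 1024 KB:

```python
def throughput_mb_per_s(iops, block_size_kb):
    """Throughput in MB/s for a given IOPS and block size (1 MB = 1024 KB)."""
    return iops * block_size_kb / 1024

# The same hypothetical 200 IOPS at three block sizes.
small = throughput_mb_per_s(200, 4)      # 4 KB blocks
medium = throughput_mb_per_s(200, 32)    # 32 KB blocks
large = throughput_mb_per_s(200, 1024)   # 1 MB blocks

print(small, medium, large)
```

At the same low IOPS, the 4 KB and 32 KB cases stay in single-digit MB/s, while the 1 MB case reaches 200 MB/s. That is why a VM can be a heavy hitter on throughput without looking busy on IOPS.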

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.