Tag Archives: SDDC

What’s new with vRealize Operations 6.4

Apology that this is coming late. Please read the following first, as I will start from where they end:

I’d focus on the new dashboards, since I was the one designing them. If the dashboards are not meeting your requirements, you now know exactly where to complain 😉 The dashboards were reviewed extensively by Product Managers (Monica Sharma, Ronit Halachmi Bekel) and Sunny Dua.

The dashboards in 6.4 is a subset of the dashboards in Operationalize Your World program. Around 20% made it. They are also simplified. The reason is we wanted the dashboards to pass The 5-second Test, and be applicable to SMB segment. We also wanted to have more feedback from real life environment, before bringing additional & more advanced dashboards. So do let me know at e1@vmware.com.

The following screenshot shows the 2 sets when you are running 6.4. You end up with both sets. They can co-exist, and the cost is some metrics are duplicated. As part of porting the dashboards to 6.4, we converted the super metrics into regular metrics so it’s simpler for you.

2016-12-10

Do we remove the old dashboards in 6.3? Nope, we did not. We simply moved them. Can you guess where they are on the screenshot above?

Yes, we moved them under “Other” folder. In future, we might deprecate them as we enhance the UI and dashboards.

You will notice we have grouped the dashboards into Infrastructure and VM. This is in-line with the Dining Area and Kitchen shared in Operationalize Your World. We wanted to drive your attention that you should ensure the VMs are served well, before you look at the kitchen.

We’ve placed some dashboards outside the folder for your convenience. We’ve also created a Read Me dashboard. We call it Getting Started. It explains the new dashboards.

getting-started

Technically, the dashboard has only 1 Text Widget. The adventurous among you will ask me if you can clone and tailor it for your company. The answer is yes. No, it is not supported. All the texts and images are in this directory:

ContentPack/VMWARE/conf/pages/getting_started/index.html?locale=%locale%

Operations Overview dashboard

We designed this dashboard to answer a few frequently asked questions on your day to day operations.

  • What have we got? If this number change, or not what you expect, you want to probe why.
  • What’s the Health of my environment? Environment can vary in size, so we group them by vSphere Data Center. The dashboard lists all your data centers. Select one, and you can see its uptime and alerts. You expect the uptime to be 100% and the Alerts to be below your normal operations.
  • Just because your vSphere is healthy does not mean the VMs are being served well. This is where the Top-N comes in. As this dashboard is your daily dashboard, you should expect the number to be within your expectation.

ops-overview

Cluster Performance dashboard

If the VMs are not being served well, you need to investigate why. This is where the following dashboard comes into play. It lets you see which clusters are not performing well. The heatmap shows the cluster by alert. Start with the reddest cluster.

Select the cluster you want to probe. Its performance counters will be automatically shown. You can see if it’s serving its VMs well. We are using line chart, so you can see the past and check if there is any strange spike.

cluster-performance

Heavy Hitter VMs dashboard

One possible cause of performance is you have Villain VM. Your vSphere environment is a shared environment. It can take as little as 1-2 VM to create performance problem in a cluster with 500 VMs.

This dashboard answers if there is any abnormal spike generated by any VM. It tracks both Storage and Network. From the following example, we can easily see there are both excessive storage and very high network throughput. They happened on different time and were caused by different VMs. The dashboard quickly shows the 2 villain VMs. You can see their workload is >10x to the second highest VM.

heavy-hitter-vms

In the Operationalize Your World, we enhance this dashboard by adding details, and split it into 2 (Storage and Network). This is to facilitate collaboration with your peers.

Datastore Performance dashboard

Cluster covers compute. What about storage? The Datastore Performance dashboard lets you see the performance of all your datastores. You can use the view to select a datastore. Its performance charts will be automatically shown. From the line chart, you could see if the datastore was having difficulty serving its VMs.

You can drill down to see each VM in the datastore. Select a VM, and its IOPS and Latency will be automatically plotted.

datastore-performance

VM Performance Troubleshooting dashboard

There are 2 spectrum of performance problem:

  • Whole house on fire
  • A small number of VMs were hit

The first few dashboards help you answer the first use case. This dashboard helps you look at a single VM. It’s a big dashboard, so I’ve added visual sections. It has 3 sections.

  • Section 1
    • This is where you select the VM. You can search, filter or simply browse.
    • The selected VM key properties, alerts and how it fits into the larger environment are automatically shown. If a VM is part of Resource Pool, check if the Resource Pool is limiting it.
  • Section 2
    • We display both the VM KPI and the IaaS KPI.
    • IaaS KPI is the 4 key metrics that shows how the IaaS serves this VM. If this is high, there is a good chance your IaaS capacity is full. It is struggle to serve all its VMs.
  • Section 3
    • You are verifying if the underlying IaaS was able to serve the VM.
    • Can you guess why we show Cluster instead of ESXi?
    • Hint: the performance problem may happened in the past.

vm-troubleshoot

In the Operationalize Your World, we expanded this dashboard into a set of dashboard.

VM Usage dashboard

A common request from VM Owner is to get his VM utilization and property. This is what this simple dashboard is for. You simply select the VM, and the key information is automatically shown. We are using line chart again, as the data that the VM Owner wants to know can be in the past.

If you need something more advanced, with self service, review the tenant dashboard here.

vm-usage

Capacity Dashboards

The twin brother of performance is capacity. While they are different, they are closely related. We’ve provided 2 dashboards to get you going:

  • Cluster capacity
  • Datastore capacity

The Cluster capacity lets you see a cluster utilization in 3 areas: CPU, RAM and Disk.

The model we use here is based on utilization. It does not take into account Availability Policy and Performance SLA. It is also based on Demand model, which is not suitable if you are doing Tier 1 cluster.

capacity-overview

The datastore capacity lets you see quickly which datastore is running out of space, and which datastores are hardly used. It uses the red color to show low capacity, and dark grey to show wastage. What you want to see is balanced usage across all datastores.

datastore-capacity

Configuration Dashboards

The last set of dashboards cover Configuration. We focus on configuration that need attention, rather than simply listing all configuration. Take the VM configuration dashboard, shown below. It highlights is you have lrge VMs, how large they are, and how many for each size.

You can also customize the filter. Simply edit the view widget.

It also highlight configuration that you need to watch. A VM with > vNIC should get your attention that it can bridge your network.

vm-config

We apply the same principle to the ESXi Configuration dashboard. For example, it shows the BIOS version. You want to keep the version consistent and minimal.

esxi-conf

The Cluster Configuration highlight inconsistent config among members ESXi in the cluster.

cluster-conf

The Network configuration lets your peers in the Network team to quickly understand the virtual network. It lists all the distributed virtual switch. Once you select one, it automatically lists all the port groups and ESXi in that switch. It also lists all the VMs. You can control and customise all these lists.

network-config

Hope you found them useful. We have intentionally kept them simple in 6.4. If you are running 6.3 or later, and you need a more advanced dashboards, download from here.

Tenant Self-Service Dashboards

This post continues from the Operationalize Your World post. Do read it first so you get the context.

A common request among VMware Admin is to give their customers a self service access to their own VMs. The VM Owners should be given a simple portal, where they can easily see all their VMs and its performance. The solution in this blog is inspired by the work done in this video and this paper. We’ve reduced the visibility and supply a custom dashboard with super metric.

Naturally, VM Owners do not have access to vSphere, as that’s to deep into the kitchen. We are also not assuming that you have vRealize Automation or vCloud Director in-place. So this is just using vR Ops.

Requirements:

  • Tenant can only see her own VMs.
  • Tenant cannot see the underlying infrastructure. It is both irrelevant and not something you’re comfortable disclosing.

This is what the dashboard looks like. It has a simple ReadMe to guide tenant.

dashboard

We’ve added visibility into how the IaaS is serving the VM. Provide that transparency to your customer, and you have a major advantage over the public cloud.

The tenant has very limited access to vR Ops. The following 3 screenshots show what have been removed.

dashboard-2

dashboard-3

vm-drilldown

Implementation

As you can guess, there are 2 parts to the implementation:

  • One-time setup.
    • A general set up you do that is applicable to all tenants
    • You develop your dashboards here.
  • Per Tenant work.
    • A tenant-specific setup that you need to do for each tenant.
    • Things like creating an account for that tenant belongs here.

One-Time Setup

Create a role called Tenants. Purpose is to limit what features it can access

1

The tenant needs only 2 access, as shown below:

2

Create a group called Tenants. Only share the “My Virtual Machines” dashboard to this group. As a result, this group can’t see any other dashboards.

Ensure the group Everyone has 0 dashboards

Purpose is to limit what objects it can see. To recap, the Roles limits what features can be seen, while Group limits the objects.

3

Click the Objects, then select the Tenants role (which you created earlier). Do not provide any more access. So none of the object hierarchy is selected.

4

Create a group type. Call it Tenants. Each tenant will have 1 group and 1 group only.

5

Download the files. Import the dashboard and super metrics.

6

Create the Text Widget file and Resource Kind file. See the screenshot below as guide. The name has to be identical, and it is Case Sensitive.

[2 Nov 2016: Thank you Patrick Nganga for spotting that I miss 2 files. There are 4 that you need]

7-files

That’s all you need as the base. All the work below is now per tenant. So if you have 10 tenants, you need to repeat 10x. I know…

Per-Tenant Setup

Create a group that contains all the VMs of a single Tenant. Best is to use the Tenant Name as the group name. If you organize the VMs properly in vCenter, by using vSphere Tags or Folders, you can take advantage of that. The example below is using vSphere Folder.

do-not-hardcode-the-vm-in-the-group

Once created, the group will appear under the Tenants group type. I’ve created 2 examples. Ensure the no of VM matches what it should be.

1-group-for-each-tenant-2

Create an account for each tenant. Give is full Administrator access. Just for temporary.

Login using this newly created account. I’d use another browser, and I do not want to logout from my administrator account. Go to Dashboard, select all dashboards except the one you want to show, and remove them from Home. See how it’s done below. Once done, the Visible on Home will show it’s not to be shown.

hide-all-the-other-dashboards

Log out the tenant account. Or simply close the browser.

Switch back to your administrator account. Remove the tenant administrator privilege, and map it to the Tenants role, as shown below.

Map the account to the associated group, and only to this group. This limits the visibility. Yes, this is how the “security” is done. I’m not sure if this is honoured by the API, but you can block the tenant ID from accessing via API.

select-the-right-group-for-each-tenant

Limitation

  • Tenant can only have 1 group. The Total is based on super metric that adds per group. It cannot add multiple groups as it does not know which groups to select.
  • Alerts are not implemented yet.
  • Tenant cannot change the alerts. For example, they cannot change their own threshold.

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

SDDC Dashboards: The Kitchen

This post is part of Operationalize Your World post. Do read it first to get the context.

There are only 4 parts in the IaaS Monitoring:

  1. Capacity
  2. Configuration (with Inventory)
  3. Audit and Compliance
  4. Availability

Can you figure out why we do not have Performance in “the kitchen” area of your restaurant business?

Performance SLA concept explains why. I’ve also applied it to VDI use case and give an example.

Capacity

The Capacity dashboards below take into account Performance SLA and Availability SLA. Only when these 2 are satisfied, that it considers Utilization. Review this series of blogs for an extensive coverage on this new model.

The set of dashboards answer questions such as:

  • What’s the capacity of my clusters?
  • What’s the consumption on the clusters?
  • Which clusters are running low?
  • Is the cluster still coping well with demands?
  • Does a cluster consist of mostly large VMs?

Here it the dashboard for Tier 1, where we do not overcommit. As a result, both performance & utilization are irrelevant. It is driven by Availability SLA.

The vCPU and vRAM remaining is based on allocation model. It takes into account HA setting.

BTW, the lines in the 2 line chart above do not gradually come down (or up) because this is a lab, not a real life environment. Your production environment will have line chart that makes sense 🙂

Here it the dashboard for Tier 2 or 3. Since we overcommit, we now have to take into account performance, and then utilization.

capacity-tier-2-compute

As you can see from the above dashboard, it has 3 sections:

  • Availability SLA. Do we reach the concentration risk?
  • Performance SLA. Do we serve existing workload well?
  • Utilisation. It uses the net usable capacity as the ceiling. This ceiling takes into account your HA settings and Buffer. The default value for buffer is 10%, which you can change via policy.

Can you spot a limitation on the capacity dashboards I’ve shown so far?

Yes, it’s hard to compare across clusters. If you have many clusters, you want to know which clusters to check first. This dashboard lets you compare. It’s color coded so it’s easier for you to see.

For implementation details, refer to this post.

The twin-sister of Capacity is Reclamation. What can you reclaim and from which VMs?

reclamation

For implementation details, refer to this post.

Configuration 

In the world of Software-Defined, configurations are easy to change. So consistency and drift become 2 areas you need to watch.

The set of dashboards answer questions such as:

  • Are my ESXi config consistent, especially if they are member of the same cluster?
  • Are my ESXi & Clusters configured to follow best practice?
  • Do I have too many combination, which increase complexity?
  • What have I got?

The dashboard below is for ESXi:

configuration-esxi

The dashboard below is for Cluster:

configuration-cluster

That’s all you can do in vR Ops 6.4. If you need more details, you need to deploy VCM. The latest release is 5.8.3. For the list of configuration that it can track per object, review this.

Inventory differs to Configuration.

  • Configuration has Standard, and hence Drift. Inventory does not.
  • Configuration has Compliance. Inventory does not. Well, not generally 😉
  • Configuration has value that can be bad (e.g. ESXi has no syslog). Inventory does not.
  • Inventory has stock take (typically annual). This can trigger work, which impact Configuration.
  • Inventory is typically reported on regular basis.

Because of the above, we’ve provided a purpose built dashboard to track inventory.

inventory

Audit and Compliance

You can check your environment compliance to vSphere Hardening Guide. The dashboard belows shows the summary of compliance, with ability to drill down to each object.

capture

vCenter tasks, events and alarms are 3 areas that you can mine to help answer compliance and audit. Log Insight complements vR Ops nicely here. For example, the following screenshot answer this audit question

  • Who shutdown what VM and when?

compliance

There are many things it can answer, and it’s covered in the workshop.

Availability

Because of HA and DRS, tracking Cluster makes more sense than tracking each ESXi. A cluster uptime remains 100% when 1 host is not available because you have HA. You have catered for that, and as a result, you should not be penalized.

The set of dashboards answer questions such as:

  1. What’s the availability (%) of each cluster in the last 24 hours? Each cluster has its own line chart, and it’s color coded. You expect a green bar, as shown below.
  2. What’s the availability now? The heatmap provides that answers quickly. You can drill down into the cluster if you spot a problem.
  3. Am I containing risk when there is a major outage. How many VMs am I willing to lose when a cluster or datastore goes down?

availability-cluster

The heat map also provides the ESXi uptime. You can toggle between Cluster and ESXi.

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.