Tag Archives: SDDC Operations

Dashboards for IT Senior Management

This post is part of the Operationalize Your World post. Do read it first to get the context.

The CIO, Head of Global Infrastructure, and other IT senior management have different dashboard requirements than technical folks.

Generally they want:

  • big picture, not details.
  • exceptions. Things that need their attention.
  • less technical info. Ideally, present it in business terms, not IT terms.
  • a portal that is easy to access. They may not want to log in to vR Ops. If they do, they may forget their password. (Note: vR Ops 6.5 cannot do login-less access yet.)
  • a UI that is easy to understand. So keep each dashboard to a specific question.
  • a system that is easy to use. So keep the interaction (clicking, zooming, sorting, etc.) minimal.

That’s what they want from you.

What do you want from them?

You show them something so you can get help (e.g. budget, resource). Here are some goals:

  • Show transparency. Give senior management visibility into the live environment.
  • Prove that you do need additional hardware.
  • Prove the wastage you have been talking about for months.

What do you not want to show? Urgent issues are something that you should not display. This is not about hiding information from the CIO. It is about giving you the time and space to do your job. If there is an active fire that requires your full concentration, you do not want to be interrupted by the CIO asking why it’s showing red on the dashboard!

I covered dashboard best practice in this post. Read that first, as this blog builds upon that.

Done? Great!

We take the same approach we did when planning dashboards for specific roles (e.g. Storage team, Network team). We ask a set of questions.

If we implement the above, we will end up with at least 5 dashboards. I’ve combined some of them. I see a wide variety of requirements, so you will customise them anyway 🙂

Basic Visibility

  • How many VMs are in our cloud? What’s their CPU, RAM and Disk allocation? This gives you the size of the environment the IaaS is supporting.
  • How much CPU, RAM, Disk do we have? Is it enough to support the above requirement?
  • You should also give the history of VM growth. What is enough today may not be enough in 3 months.

In the dashboard above, I’ve added Availability information. As VMs can be powered off intentionally by the application team, you should only report availability for Tier 1 VMs. Tier 3 VMs, especially those in Test and Dev, can be rebooted frequently and hence will give misleading information.

Performance

The dashboard below shows all VMs. In a large environment, the heat map will automatically combine VMs with the same value (read: color & size).

Every VM is represented by a box. The box can take on a value between 0 and 3.

  • Green = 0. The VM is served well.
  • Yellow = 1. One of the IaaS services is not delivered as per the Performance SLA. We track CPU, RAM and Disk. If your SLA states 10 ms disk latency, then the VM has to get 10 ms or better.
  • Orange = 2. Two of the IaaS services are not delivered.
  • Red = 3. All 3 services are not delivered.
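
As a rough illustration, the box value can be thought of as a count of breached services. The Python sketch below is not the vR Ops super metric itself. The metric names and threshold values are assumptions made up for this example, so substitute the numbers from your own Performance SLA.

```python
# Hypothetical thresholds for illustration only; use your own Performance SLA.
SLA = {
    "cpu_contention_pct": 2.0,
    "ram_contention_pct": 1.0,
    "disk_latency_ms": 10.0,
}

COLORS = {0: "green", 1: "yellow", 2: "orange", 3: "red"}


def sla_score(observed):
    """Count how many of the three services (CPU, RAM, Disk) exceed their threshold."""
    return sum(1 for name, limit in SLA.items() if observed[name] > limit)


def box_color(observed):
    """Map the 0..3 score to the heat map color described above."""
    return COLORS[sla_score(observed)]


# A VM whose disk latency breaches the SLA while CPU and RAM are fine.
vm = {"cpu_contention_pct": 0.5, "ram_contention_pct": 0.0, "disk_latency_ms": 14.0}
print(sla_score(vm), box_color(vm))
```

A VM that sits exactly on a threshold (for example 10 ms disk latency) is treated as meeting the SLA in this sketch.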

The VMs are grouped by Datacenter, then Cluster. This lets you see which Datacenter or Cluster isn’t coping well.

The above shows the VMs. What about applications? An Application spans multiple tiers and multiple VMs. Just because a VM does not perform does not mean the whole application is affected. As this is for Senior Management, we’re only showing the Tier 1 applications.

This blog explains the implementation.

Capacity

  • The CIO is not in charge of capacity management. He just needs to know the decision you want him to make (which is to approve a hardware purchase, or get VM Owners to rightsize). For that, he needs to know if you are running out of capacity, and that existing capacity is not wasted.
  • How is it growing? This can be taken care of by having a projection. This projection should take into account committed projects too.
  • Capacity is more complex than performance. Just because a vSphere cluster is running low on utilization does not mean it can serve the VMs well. See this for a detailed explanation.
  • Capacity is best presented with a line chart. This enables you to see the trend. For an environment with <10 clusters, you can fit all the clusters on the screen. For a large environment, you need to make a trade-off:
    • show live data. You can be detailed, as you’re only showing the current value.
    • show historical data. You can’t be detailed, as you’re showing many values over time.
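
To make the projection point concrete, here is a minimal Python sketch of a straight-line projection that also adds committed projects. It is only an illustration and not how vR Ops computes its own projection. The function names, the sample numbers and the assumption of linear growth are all hypothetical.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for evenly spaced samples (x = 0, 1, 2, ...)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_full(monthly_usage, capacity, committed=0.0):
    """Months from the last sample until trend + committed demand reaches capacity.

    Returns 0.0 if capacity is already exceeded, None if usage is not growing.
    """
    slope, intercept = linear_fit(monthly_usage)
    current = intercept + slope * (len(monthly_usage) - 1) + committed
    if current >= capacity:
        return 0.0
    if slope <= 0:
        return None
    return (capacity - current) / slope


# Hypothetical cluster: RAM usage in GB for the last 6 months, 800 GB usable,
# and 100 GB already promised to projects that have not landed yet.
usage_gb = [400, 420, 440, 460, 480, 500]
print(months_until_full(usage_gb, capacity=800, committed=100))
```

Without the committed 100 GB the same cluster would appear to have 15 months left instead of 10, which is the kind of difference the CIO needs to see before approving a purchase.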

Here is an example with historical data. Notice we cannot show details, and the screen only accommodates <10 clusters.

Here is an example where we only show live data. We can show a lot more clusters, and for each we can show CPU, RAM, Disk and Network.

You may run out of capacity. But if you have a lot of wastage, you may have sufficient capacity after you reclaim it. See this for details.

Configuration

  • Do we have “bad” configuration? Examples are old & unsupported versions of Windows, Linux, ESXi, VMware Tools, etc.
  • How uniform is our environment? Complexity is required to optimize cost (hardware, software) and performance. However, there is a cost to complexity.
  • Do we have outdated and unsupported products?

If your CIO does not appreciate the complexity, showing it to the CIO is good for you. It will result in appreciation of your expertise & effort, as the job is certainly easier when the complexity is low. Complexity increases when you have a wide variety of things.

Factors impacting complexity:

  • Number of ESXi versions. The more variants, the more complex.
  • Number of ESXi CPU versions
  • Number of brands. The more vendors, the more complex, as you need to learn them and spend time with their teams.
  • Number of cluster node sizes
  • Number of shared Datastore sizes
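
One simple way to put a number on these factors is to count the distinct variants of each attribute. The Python sketch below assumes you have exported a host inventory as a list of records. The field names and values are hypothetical and do not come from any vCenter or vR Ops API.

```python
from collections import Counter

# Hypothetical inventory export; field names and values are illustrative only.
hosts = [
    {"name": "esxi-01", "esxi_version": "6.0", "cpu_generation": "gen-1", "vendor": "Vendor A", "cluster_nodes": 8},
    {"name": "esxi-02", "esxi_version": "6.0", "cpu_generation": "gen-1", "vendor": "Vendor A", "cluster_nodes": 8},
    {"name": "esxi-03", "esxi_version": "6.5", "cpu_generation": "gen-2", "vendor": "Vendor B", "cluster_nodes": 4},
    {"name": "esxi-04", "esxi_version": "5.5", "cpu_generation": "gen-2", "vendor": "Vendor B", "cluster_nodes": 4},
]

FIELDS = ["esxi_version", "cpu_generation", "vendor", "cluster_nodes"]


def variants(items, field):
    """How many objects carry each distinct value of a field."""
    return Counter(item[field] for item in items)


def complexity_report(items, fields):
    """Number of distinct variants per field. More variants means more complexity."""
    return {field: len(variants(items, field)) for field in fields}


report = complexity_report(hosts, FIELDS)
for field, count in report.items():
    print(f"{field}: {count} variants {dict(variants(hosts, field))}")
```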

See the examples provided here for Infrastructure and here for VM.

Architecting VMware vSphere for Operations

As an Architect, you take into account many requirements when designing your VMware vSphere environment. As an Operations person, I shall not question your Architecture. I’m sure it is fast, highly available, right on budget, etc. My role is to help you prove to the CIO that what you architect actually lives up to its expectation. “Plan meets Actual” is what an Architect wants, because that means your architecture delivers on its promise.

The plan of the Architecture exists in diagrams and documents. It's static.
The reality of the Architecture exists in the Datacenter. It's live.

When proving, here are the questions we should be able to answer:

  • Availability: Does the IaaS deliver the promised Availability SLA? If not, what was its actual number, when was it breached, for how long, and which VMs were affected? For each VM, when exactly did the outage start and end?
  • Performance: Does the IaaS serve all its VMs well? An IaaS platform provides CPU, RAM, Disk and Network as services. It has to deliver these 4 resources when asked by each VM, 24 x 7, 365 days a year. If not, which VMs were affected, by what, and when?
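
To show the kind of answer the Availability question calls for, here is a small Python sketch that turns per-VM outage windows into an availability percentage and lists the VMs that breach the SLA. The data layout, VM names, dates and the 99.9% target are assumptions for the example. It also assumes the outage windows of a VM do not overlap.

```python
from datetime import datetime


def availability_pct(outages, period_start, period_end):
    """Percentage of the period the VM was up, given (start, end) outage windows."""
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Clip each outage to the reporting period.
        s, e = max(start, period_start), min(end, period_end)
        if e > s:
            down += (e - s).total_seconds()
    return 100.0 * (period - down) / period


def sla_breaches(outages_by_vm, period_start, period_end, sla_pct):
    """Return {vm: (availability, outage windows)} for VMs below the SLA."""
    report = {}
    for vm, outages in outages_by_vm.items():
        pct = availability_pct(outages, period_start, period_end)
        if pct < sla_pct:
            report[vm] = (pct, outages)
    return report


# Hypothetical month: one VM was down for 90 minutes, the other not at all.
june = (datetime(2017, 6, 1), datetime(2017, 7, 1))
outages_by_vm = {
    "vm-a": [(datetime(2017, 6, 10, 2, 0), datetime(2017, 6, 10, 3, 30))],
    "vm-b": [],
}
breaches = sla_breaches(outages_by_vm, *june, sla_pct=99.9)
for vm, (pct, windows) in breaches.items():
    print(f"{vm}: {pct:.3f}% available, outages: {windows}")
```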

Those are simple questions, but they are very difficult to answer. Let’s say you have 10,000 VMs. How do you answer that? How do you provide the answer over time (e.g. monthly), proving you handle the peaks well?

To complicate matters, you need to be able to answer per Business Unit. Business Unit A will not care about other business units. Since a Business Unit has >1 apps, you also need to answer per application. An application owner only cares about her applications.

There are a few things you need to do, so you are in a position to prove it.

Step 1: Reflect the business in vSphere

Does your vCenter show all the business units? Can you show how the business is mapped into your vSphere environment? You design vSphere so Business can run on it, so where is the Business? A company is made up of business units, which may have multiple departments. The structure below is a typical example.

An application typically has multiple tiers. Does your vSphere understand that?

Map the above as folders in your vCenter.

I see many naming conventions that are not operations-friendly. It’s impossible to guess what an object is. The names are very similar and hard to read, hence it’s easy for operators to make mistakes! The naming convention typically originates from the mainframe or MS-DOS era, where you could not have spaces and had limited characters. An example is SG1-D01-INS-0001W-PRD. Can you guess what on earth that is? You’re right, you can’t. Imagine there are 1000s of them like that, and you have new operators joining the help desk team.

Folder, Tags and Annotation

Have you seen a vSphere environment where there are tags and annotation everywhere?

It’s rare to meet customers with a 100% well-thought-out and documented approach to the 3 features above. They may have general guidelines, but not explicit Do’s and Don’ts. As a result, these 3 features are used wrongly.

Use Tags when the values are discrete, ideally Yes/No. I’d use tags for the following:

  • VMs with RDM.
  • VMs with MSCS or Linux Clustering.

Do not use Tags when the values are unlikely to be common. Use annotations for these. Examples are VM Owner Name, Email Address, Mobile Number. In an environment where there are >10K VMs, there can be 1000 VM Owners.

Do not use Tags to tag Service Tier. For Infra objects such as Cluster and Datastore Cluster, that should be clearly reflected in the name itself. I’d prefix all Tier 1 clusters with Tier 1, so the chance of deploying into the wrong tier is minimized.

Step 2: Design Service Tiers into vSphere

Does your vSphere understand that there are different classes of service? Are Tier 1 clusters and datastores clearly labelled?

You should avoid mixing multiple classes of service in a single cluster or datastore. While it is technically possible to segregate them, it’s operationally challenging. Resource Pool shares are split among the VMs inside the pool, so they only behave as intended when sibling pools hold a similar number of VMs.
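
The resource pool problem is easiest to see with arithmetic. The Python sketch below uses a deliberately simplified model: the cluster is fully contended, every VM is demanding, and VMs inside a pool split the pool's entitlement equally. The pool names, share values and VM counts are illustrative assumptions.

```python
def per_vm_fraction(pools):
    """pools maps pool name -> (pool_shares, vm_count).

    Returns the fraction of the contended resource each VM in a pool receives,
    assuming equal VM shares within a pool and every VM demanding.
    """
    total_shares = sum(shares for shares, _ in pools.values())
    return {
        name: shares / total_shares / vm_count
        for name, (shares, vm_count) in pools.items()
    }


# Hypothetical cluster mixing two tiers as sibling resource pools.
pools = {"Tier 1": (8000, 40), "Tier 3": (2000, 5)}
fractions = per_vm_fraction(pools)
for name, fraction in fractions.items():
    print(f"{name}: each VM gets {fraction:.1%} of the contended resource")
```

In this example the Tier 1 pool has 4x the shares, yet each Tier 3 VM ends up with twice the entitlement of a Tier 1 VM, simply because the Tier 1 pool holds 8x more VMs. Keeping that ratio correct as VMs come and go is the operational challenge.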

For each tier, you need to have both an Availability SLA and a Performance SLA. For the Performance SLA, review this doc.

Step 3: Define and Map Tiers in vR Ops

Now that you’ve designed service tiers into your vSphere architecture, it’s time to show them. You cannot show them in vSphere, as vCenter does not understand Performance SLA and Availability SLA. You can use vR Ops to do this. Follow this step.

Step 4: Map Applications in vR Ops

Use custom groups to create applications. If you have a proper naming convention, it should not be difficult to select the members of an application. All you need is a query that says the name contains XYZ. There should not be a need for regular expressions.
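
The membership rule itself is defined in the vR Ops UI, not in code. The Python sketch below only mimics the "name contains XYZ" idea so you can check a naming convention against a list of VM names before building the group. The VM names are made up, and the case-insensitive match is an assumption of this sketch.

```python
def members(vm_names, contains):
    """Select VMs whose name contains the given text (case-insensitive in this sketch)."""
    needle = contains.lower()
    return [name for name in vm_names if needle in name.lower()]


# Hypothetical VM names following an operations-friendly convention.
vm_names = ["Billing Web 01", "Billing DB 01", "Payroll Web 01", "billing app 01"]
billing = members(vm_names, "Billing")
print(billing)
```

If a plain substring like this picks up the wrong VMs, that is usually a sign the naming convention needs work, not that you need a regular expression.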

Once apps are mapped, you can do something like this.

Step 5: Consider Debug-ability

Things go wrong. Especially in production. Your architecture should lend itself to troubleshooting.

A major area is to ensure the counters are reliable, else it’s hard to troubleshoot performance. The CPU Contention counter, which is the main counter for IaaS Performance SLA, is greatly affected by Power Management. Ensure your ESXi power management follows this guide by Sunny Dua.

Once you have that in place, you will be able to prove that your Architecture lives up to its expectation. Use the dashboards from Operationalize Your World to show that proof!

vSphere visibility for Network Team

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Similar to the problem faced between the Storage Team and the Platform Team, the VMware Admin needs to reach out to the Network Team. A set of purpose-built dashboards will enable both teams to look at an issue from the same point of view.

The dashboards must answer the following basic questions for Network Team:

  1. What have I got in this virtual environment?
    • What is the virtual network configuration? What are the networks, and how big are they?
    • We have Distributed vSwitches, Distributed Port Group, Datacenter, Cluster, ESXi, etc. How are they related?
    • Who are the consumers of my network? Where are they located?
  2. Are they healthy?
    • Do we have any errors in our networks? Which port groups see dropped packets? If there is a problem, which VMs or ESXi hosts are affected?
    • Do we have too many special packets? Broadcast, multicast and unknown packets. Who generates them and when?
  3. Are they optimized?
    • Just because something is healthy does not mean it is optimized. Look for opportunities to right size.

Once Network Admins know what they are facing, they are in a better position to analyse:

  1. Utilization
    • Is any VM or ESXi near its peak?
    • Who are the top consumers for each physical datacenter?
    • How is the workload distributed in this shared environment?
  2. Performance
    • When a VM Owner thinks the Network is the culprit, can both the Network Team and the Platform Team verify that quickly?
  3. Configuration
    • Are the configs consistent?
    • Do they follow best practices?

Once we know the questions and requirements, we can plan a set of dashboards. I’ll show some sample dashboards to get you going. They follow the dashboard best practices.

What have I got? 

This first dashboard provides overall visibility to the Network team. It gives insight into the SDDC.

  • It shows the total environment at a glance. A Network Admin can see how the virtual network maps to the virtual environment (ESXi, vCenter Datacenter, vCenter).
  • It quickly shows the structure of the virtual network. For each Distributed vSwitch, you can see what its port groups are and which ESXi hosts are connected to it. You see the config of both objects (port group and ESXi). You can see if the configuration is not consistent.
  • The heatmap quickly shows all port groups by size, so you can find your largest ones easily. The color code also lets you see which ones are used the most.

[Screenshot: overview]

Once you know your environment, you are in a position to do monitoring. I don’t feel comfortable doing monitoring unless I have the big picture. It gives me context.

Are they healthy?

The next dashboard quickly shows if there are dropped packets or error packets in your network.

  • The first line chart shows the maximum packets dropped among all Distributed Switches. So if any switch has dropped packets, it will show up.
  • The second line chart sums all the error packets.

Both line charts are color coded, and you should expect to see green. This means there are no dropped packets or error packets.
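
The two roll-ups behind these line charts can be sketched in a few lines of Python. This is only an illustration of the logic (maximum for dropped packets, sum for error packets), not the super metric formula. The switch names, sample values and the "anything above zero is not green" rule are assumptions.

```python
def worst_dropped(samples_by_switch):
    """Per collection cycle, the maximum dropped packets among all switches."""
    return [max(values) for values in zip(*samples_by_switch.values())]


def total_errors(samples_by_switch):
    """Per collection cycle, the sum of error packets across all switches."""
    return [sum(values) for values in zip(*samples_by_switch.values())]


def color(value):
    """Green only when there is nothing to report (assumed threshold of zero)."""
    return "green" if value == 0 else "red"


# Hypothetical samples: one value per collection cycle for each Distributed Switch.
dropped = {"DVS-A": [0, 0, 3, 0], "DVS-B": [0, 0, 0, 0], "DVS-C": [0, 12, 1, 0]}
errors = {"DVS-A": [0, 1, 0, 0], "DVS-B": [0, 2, 0, 0], "DVS-C": [0, 0, 0, 0]}

print(worst_dropped(dropped), [color(v) for v in worst_dropped(dropped)])
print(total_errors(errors), [color(v) for v in total_errors(errors)])
```

Taking the maximum means a single misbehaving switch cannot be averaged away by the healthy ones.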

[Screenshot: performance]

The dashboard above has interaction, allowing you to drill down when the line charts are not showing green.

  • If you do not see green, you can drill down into each Distributed Switch.
    • As a virtual switch can span thousands of ports, it helps if you can drill down by Port Group and ESXi host. The dashboard automatically shows the relevant Port Group and ESXi of the distributed switch.
  • If there is a need to, you can even drill down to an individual VM.
    • The table at the bottom is collapsed by default. Expand it and you’ll see all VMs with dropped packet information.

Other than dropped packets and error packets, Network Admin can also check for multicast, broadcast, and unknown packets. You don’t want to have too many of them zipping around your DC.

[Screenshot: special packets]

The same concept is applied, hence the Network team only needs to learn the dashboard once.

The line charts show the total broadcast and total multicast packets. We are not using an hourly average because, at the global level, the law of large numbers ensures the line is smooth. A significant deviation is required to move the number. Hence if there is a big jump, you know something is amiss.

Just like the previous dashboard, this dashboard lets you drill down too. You can check which VMs are generating broadcast packets.

Is Utilisation running high? 

Another factor that can impact performance is high utilization. Network team can see the total utilization of the network. The following dashboard answers questions such as:

  • What’s the total workload hitting our physical switches?
  • Is the total workload increasing?
  • Any crazy pattern in utilization? Any sudden spike that should not have happened?

[Screenshot: capacity]

Line charts are used instead of a single number, as you can see the pattern over time. In fact, 2 line charts are provided: detailed and big picture.

The chart gives you the total throughput hitting the physical switches, so you know how much bandwidth is generated. The chart defaults to 1 month as this is more of a long term view, not really for troubleshooting.

Based on the line charts, you can drill down into a specific time period where the peak was high. The Top-N lists the ESXi hosts with the highest usage. Click on one, and its detailed utilization will be shown. You can see if any of its vmnics is near the physical limit. The super metric takes into account both RX and TX, as you can hit a limit on either.
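
Why look at RX and TX separately? On a full-duplex link each direction has its own limit, so adding or averaging the two can hide a saturated direction. The Python sketch below shows one plausible reading of that logic. It is not the actual super metric, and the vmnic names, throughput figures and link speeds are made up.

```python
def vmnic_utilisation(rx_mbps, tx_mbps, link_mbps):
    """Utilisation of a full-duplex vmnic: the busier direction divided by link speed."""
    return max(rx_mbps, tx_mbps) / link_mbps


def host_peak(vmnics):
    """The highest utilisation among all vmnics of one ESXi host."""
    return max(vmnic_utilisation(**nic) for nic in vmnics.values())


# Hypothetical ESXi host with two 10 Gbps uplinks.
vmnics = {
    "vmnic0": {"rx_mbps": 9200, "tx_mbps": 1500, "link_mbps": 10000},
    "vmnic1": {"rx_mbps": 800, "tx_mbps": 700, "link_mbps": 10000},
}
peak = host_peak(vmnics)
print(f"Busiest vmnic direction is at {peak:.0%} of link speed")
```

Averaging RX and TX on vmnic0 would report roughly 54%, which looks comfortable even though the receive side is almost full.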

If you see that one ESXi is saturated but others are barely used, your workload is not well distributed. Note that vSphere 6.0 DRS does not consider network, so imbalance is possible. vSphere 6.5 DRS takes network into account.

Customisation:

  • You can add the total line chart as above, but for VM.
    • VM traffic does not include hypervisor traffic (vMotion, management network, IP storage). So it’s pure business workload.
    • We should be expecting this number to slowly rise, reflecting growth.
    • A sudden spike is bad, and so is a sudden drop. We can turn on analytics on it so you get an alert.
  • For details on how to do it, see http://virtual-red-dot.info/is-any-of-your-esxi-vmnics-saturated/

How is the workload distributed?

A Distributed Switch does not span beyond a vSphere Data Center. So the data center is a logical place to start analysing the traffic. The following dashboard compares the workload of each data center. Using a color code makes it easy to see which DC has reached a high workload.

You can drill down inside the datacenter object. Click on it, and all the ESXi hosts and port groups connected to it will be automatically shown. Click on an ESXi, and you can drill down into a VM.

[Screenshot: workload distribution]

You can change the threshold (limitation: same limit for every DC) to suit your need.

Who are the Top Talkers? 

Another cause of performance problems is VMs that consume excessive network resources. Or you have a peak period, where the total is simply much higher than in a quiet period. The next dashboard provides 2 line charts. Again, a line chart is used as you can see the pattern.

The table provides a list of VMs, sorted by their peak utilization. You can find out who the bursty users are (5 minute highest), and who the sustained users are (1 hour highest and 1 day highest).
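
To see why the three peaks rank VMs differently, here is a minimal Python sketch. It assumes each VM has a list of 5-minute throughput samples, and it interprets "1 hour highest" as the highest 1-hour rolling average (12 samples), which is an assumption on my part. With real data a window of 288 samples would give the 1-day figure. The VM names and numbers are invented.

```python
def peak_of_rolling_average(samples, window):
    """Highest average over any run of `window` consecutive samples."""
    if len(samples) < window:
        return None
    running = sum(samples[:window])
    best = running
    for i in range(window, len(samples)):
        # Slide the window by one sample.
        running += samples[i] - samples[i - window]
        best = max(best, running)
    return best / window


def top_talkers(samples_by_vm, window):
    """VM names sorted by their rolling-average peak, highest first."""
    peaks = {vm: peak_of_rolling_average(s, window) for vm, s in samples_by_vm.items()}
    return sorted(peaks, key=peaks.get, reverse=True)


# One hour of hypothetical 5-minute samples (Mbps) for two VMs.
samples_by_vm = {
    "bursty-vm": [0] * 11 + [1200],  # one short spike
    "steady-vm": [300] * 12,         # constant moderate load
}
print(top_talkers(samples_by_vm, window=1))   # ranked by 5-minute peak
print(top_talkers(samples_by_vm, window=12))  # ranked by 1-hour peak
```

The bursty VM tops the 5-minute ranking, while the steady VM tops the 1-hour ranking, which is why the table shows both.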

[Screenshot: top consumers]

This example is only for the VM. We can build one for the ESXi if that’s needed. The concept is the same.

Is Network the culprit?

Lastly, it’s all about Service, not System or Architecture. When a VM Owner complains that IaaS is causing the problem, the Network Admin and the VMware Admin can quickly look at the same dashboard to agree whether the network is the culprit or not.

[Screenshot: VM troubleshooting]

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.