Operationalize SDDC program

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Callum Eade and Kenon Owens created a program called Operationalize Your World. Sunny and I provide the technical content. Many folks, both internal and external, have reviewed the materials over the past several years. While cleaning up my files, I was surprised to find decks from early 2011 containing older versions of the slides you’re seeing today.

If you only have 10 minutes, below is a 7-minute introduction to what you get in the 1-day workshop. Sunny and I delivered it at VMworld 2016. We have benefited a lot from the community, so we immediately said yes when Alastair and vBrownbag invited us to share.

summary

In 2017, they invited us again. This time we were given 30 minutes, so you get some of the solution in this talk.

The 1-day workshop actually has 2 days’ worth of material. Hence there is flexibility in what is delivered on the day, and the choice is driven by the audience.

We use a restaurant analogy to raise awareness that your IaaS business should be operated differently. There are 4 main ppt files.

powerpoint

You can find the material here. The files are in editable format (ppt), not PDF.

We provide the material in PowerPoint because operations vary widely. Take what’s relevant to you, throw away what’s not, add your custom deck, and make it yours. When you share your deck with your peers or customers, let me know how it goes. I’m keen to hear about your journey. It’s a journey because it will take you multiple rounds to enlighten your peers.

The workshop covers 4 areas in management (Availability, Performance, Capacity and Configuration). We map each area to both Consumer and Provider layers of your IaaS business.

blog 1

Hope you find the material useful. If you do, go back to the Main Page.

Operationalize Your World: Import Steps

This post continues from the Operationalize Your World post. Do read it first to get the context.

Read all the instructions first, before executing the steps.

Planning

  • Decide which clusters and datastores are on what Service Tier.
    • This is the most crucial step. 
    • Choose cluster, not resource pool. The Policy is applied at cluster level. I do not use Resource Pool. It complicates matters operationally.
  • If you run an IaaS business, you should have at least 2 policies. I cannot think of a business scenario where you only have 1 policy.
  • If you have no Performance SLA, you have a bigger problem than IaaS monitoring.
  • You can place a VM on Tier 1 Compute and Tier 2 Storage, although that just makes your operations complex. It’s like giving an economy class passenger a business class TV & meal.
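To make the last point concrete, here is a minimal Python sketch that flags VMs whose compute tier differs from their storage tier. The cluster, datastore and VM names are all made up for illustration, and the tier mapping is assumed to be something you maintain yourself during planning; nothing here talks to vR Ops.

```python
# Hypothetical planning data: which cluster and datastore sits on which Service Tier.
CLUSTER_TIER = {"Cluster-A": 1, "Cluster-B": 2}
DATASTORE_TIER = {"DS-Gold-01": 1, "DS-Silver-01": 2}


def mismatched_vms(vms):
    """Return the names of VMs whose cluster tier differs from their datastore tier."""
    result = []
    for name, cluster, datastore in vms:
        if CLUSTER_TIER[cluster] != DATASTORE_TIER[datastore]:
            result.append(name)
    return result


# Each entry is (VM name, cluster, datastore); all values are illustrative.
vms = [
    ("web-01", "Cluster-A", "DS-Gold-01"),
    ("db-01", "Cluster-A", "DS-Silver-01"),
]

print(mismatched_vms(vms))
```

A non-empty result is not an error as such. It simply lists the "economy passenger with a business class meal" cases you may want to tidy up before assigning policies.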

Prerequisite

  1. vR Ops Advanced edition or higher. Standard edition does not allow custom dashboard.
  2. vR Ops 6.7 or 7.0
    1. 7.0 is fully backward compatible with 6.7
    2. It does use 6.7 specific properties, so it will not work with 6.6
  3. Hands-on experience with vR Ops 6.7. I assume you know what you’re doing.
  4. Have an ID with admin privilege.
    • Do not use the built-in Admin account. It creates confusion between OOTB content and what you create.
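If you script your pre-checks, the version requirement can be expressed as a small Python helper. This is only a sketch: the function name is hypothetical, and it deliberately accepts only the two releases named above rather than guessing about other versions.

```python
# Releases this content is stated to work with: 6.7 and 7.0 (not 6.6).
SUPPORTED_RELEASES = {(6, 7), (7, 0)}


def is_supported(version: str) -> bool:
    """Check a version string such as '6.7' or '6.7.1' against the supported releases."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) in SUPPORTED_RELEASES


print(is_supported("6.7"), is_supported("6.6"))
```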

Download

  • Download the files here.
  • The zipped files contain zipped files. No need to unzip the zipped files.

Steps (Summary)

This is a summary so you can see the big picture. The detailed steps follow this section.

The steps can be grouped into 4 parts:

  1. Part 1: Metrics
    1. Enable these metrics (they were disabled in 6.7)
  2. Part 1: Super Metrics & Policies
    1. Import the super metrics
    2. In your active base policy, enable the super metrics
    3. Create 1 policy for each service tier.
      1. Ensure it’s a child of your active/base policy. It’s the one with the D.
  3. Part 2: Group
    • Create the group types.
    • Import the groups.
    • Modify the group selection criteria. I use a Custom Datacenter as it’s easier.
  4. Part 3: View & Dashboards
    1. Import the Views
      • Do this before importing the dashboards, as there is a dependency
    2. Import the Dashboards.
      • Importing dashboard automatically creates the menu structure
    3. Recreate the XML files
      • They cannot be imported.
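The ordering constraints in the summary above can be written down as a small dependency graph. The Python sketch below uses the standard library's graphlib (Python 3.9 or later) to produce one valid order. The step names are my own shorthand, and the dependencies of the two policy steps are my reading of the steps rather than something vR Ops enforces.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be finished before it.
DEPENDS_ON = {
    "import super metrics": set(),
    "enable super metrics in base policy": {"import super metrics"},
    "create tier policies": {"import super metrics"},  # assumption: tier SLAs are super metrics
    "create group types": set(),
    "import groups": {"create group types"},  # group import fails without the group types
    "assign tier policies to groups": {"create tier policies", "import groups"},
    "import views": set(),
    "import dashboards": {"import views"},  # dashboards depend on the views
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())

for step_number, step in enumerate(order, start=1):
    print(step_number, step)
```

Several orders satisfy these constraints, so the printed sequence is only one of them. What matters is that views come before dashboards and group types before groups.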

After the import, you can customise the Large VMs, Idle VMs and Powered Off VMs groups. For example, I defined a large VM as one with >8 vCPU or >24 GB RAM.
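As a quick illustration of that rule, here is a Python sketch of the large VM test with the two thresholds as parameters, so you can plug in your own definition. The function name is hypothetical; in vR Ops itself you would express this as group membership criteria, not code.

```python
def is_large_vm(vcpus: int, ram_gb: float, max_vcpus: int = 8, max_ram_gb: float = 24) -> bool:
    """A VM is 'large' if it exceeds either the vCPU or the RAM threshold."""
    return vcpus > max_vcpus or ram_gb > max_ram_gb


print(is_large_vm(8, 24))   # exactly at both thresholds
print(is_large_vm(12, 16))  # too many vCPUs
print(is_large_vm(4, 32))   # too much RAM
```

Note that both comparisons are strict, so a VM with exactly 8 vCPU and 24 GB RAM is not counted as large.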

Steps (Details)

Watch the 12-minute video, then read the steps below. Apologies that the video is based on 6.6 and has not yet been updated to 6.7. Some screenshots are from an earlier version, as they are still functionally the same.

Follow the names exactly. They are hardcoded in the dashboards. 
Names are Case Sensitive!
If you do not follow them, the import will still work, but you will get an hourglass icon.

Part 1: Super Metrics and Policy

Import the super metrics. If you are curious about which super metrics you are getting, the list looks something like this. Yes, heaps of them!

Once imported, enable the super metrics in your default active policy. Yes, you can bulk enable by selecting multiple lines (as shown below). Sort the result by Object Type, and do not enable them on All Object Types. Use the Actions menu to enable them all.

enable super metrics

Optional step: review the Performance SLA super metrics settings. Adjust the SLA accordingly if you know the performance of your IaaS. If you are running the Balanced power management policy, change the CPU SLA to 10, 20, 30 accordingly.

Create 1 policy for each Tier. This has to be based on your active policy, so the inheritance works properly. Make sure you choose the right one.

You must use the following names for the Policy:

  • Tier 1
  • Tier 2
  • Tier 3
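Because the names are case sensitive, a stray lowercase letter or trailing space is enough to break the dashboards. The Python sketch below shows the kind of exact-match check I have in mind. The function name is hypothetical, and the list of existing names is assumed to be something you typed in or exported yourself.

```python
REQUIRED_POLICY_NAMES = ("Tier 1", "Tier 2", "Tier 3")


def missing_policies(existing_names):
    """Return required policy names with no exact, case-sensitive match."""
    existing = set(existing_names)
    return [name for name in REQUIRED_POLICY_NAMES if name not in existing]


# 'tier 2' has the wrong case and 'Tier 3 ' has a trailing space, so both count as missing.
print(missing_policies(["Tier 1", "tier 2", "Tier 3 "]))
```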

Enable the correct SLA for each tier. In the example below, I’m enabling Tier 2. From the big red number 1, you can see I’m editing a policy named Tier 2. You can see it’s being selected in the background, behind the dialog box.

See the big red number 2: it shows the Performance SLA super metrics that belong to Tier 2, so I enabled only those (see the big red number 3). The easiest way is to type “Tier 2” in the filter, so only Tier 2 super metrics are shown. I do not enable the super metrics for Tier 1 (see the big red number 4).

Correct super metric for each policy

Here is the Enable example:

Click Save to end the editing.

Part 2: Group Type and Group

Create these group types carefully:

  • Class of Service
  • VM Types
  • Tenants
  • Multi-tier Applications
  • Application Tier

Your group import will fail if you do not have the group type.

If you mistyped the name and saved it, do not edit it to correct it. Delete it, and create a new one. The reason is that editing only updates the label, not the key.
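Here is a toy Python model of why editing does not help. It is not how vR Ops is implemented; it only assumes what the paragraph above states (the key is fixed at creation, editing changes the label) plus my assumption that the group import looks up the group type by key.

```python
from dataclasses import dataclass


@dataclass
class GroupType:
    key: str    # set once at creation, never changed afterwards
    label: str  # the name you see and can edit


def create(name: str) -> GroupType:
    """Creating a group type uses the typed name for both key and label."""
    return GroupType(key=name, label=name)


def edit(group_type: GroupType, new_name: str) -> None:
    """Editing only touches the label; the key keeps the original spelling."""
    group_type.label = new_name


def import_finds_type(group_types, wanted_key: str) -> bool:
    """Assumption: the group import matches on the key, not the label."""
    return any(gt.key == wanted_key for gt in group_types)


typo = create("Class of Servce")  # mistyped on purpose
edit(typo, "Class of Service")    # looks right now, but the key is still wrong
fresh = create("Class of Service")

print(import_finds_type([typo], "Class of Service"))
print(import_finds_type([fresh], "Class of Service"))
```

In this model the edited entry displays the correct name yet still fails the lookup, which is why deleting and recreating is the safe route.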

Once created, import the groups.

For the Service Tier groups, you need to associate them with the correct policy. To do that, edit the group, and choose the respective policy. The following example shows this for Tier 2.

Do the same steps for Tier 1 (Gold) and Tier 3 (Bronze).

BTW, you can also assign the policy to its associated group via the policy library. Your choice. Below is an example. Use the green plus sign, which I have circled below:

Assign policy to Tier 1

You know the policy is associated when it appears in the Active Policies. The screenshot below shows I’ve activated all 3 Tiers.

policy

Part 3: View and Dashboard

Import the view, then the dashboard. Choose Overwrite if you’re importing for the 2nd time, or have the old Operationalize Your World views/dashboards.

view import

The list shown below is partial. There are >100 in total.

view-list

Import the Dashboards. When you are done, it looks something like this.

dashboard

XML Files

Recreate the XML files. They cannot be imported. I use copy paste, even on the file names.

xml

Once imported, take your well-deserved coffee break! If you have a large environment, it can take an hour for all the dashboards, super metrics, policies and groups to be applied. During the process, you may see the known error while trying to open a dashboard. Just wait an hour or so.

When things go wrong

If your dashboard shows an hourglass icon, it is likely because a metric or object is missing. The root cause is likely a missing group.

You should not need to do any of these things. But if things go wrong, there are a couple of things you can check. First, ensure each Policy actually applies to the correct object. For example, you can see below that I’ve applied the policy named Tier 2 to a group called Tier 2. The Assigned Groups column shows it’s being applied to 1 group and it impacts 302 objects.

Policy objects group

The same goes with super metrics. In the following example, a super metric is being applied to Tier 2 policy. It’s not applied to other policies, as it does not make sense.

Super Metric n policy

If the import fails, you will see an error message. Simply rename the duplicate object, then reimport.

import duplicate

You cannot re-import. The reason is the ID remains the same. Delete the existing object, then reimport. It is safe to delete.

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together.

How healthy is your vSphere environment?

One common question I get from customers is how to prove that there are no hidden warnings lurking in the log files. As you know, vSphere produces a lot of logs. I shared how to check for performance issues here, so this post complements it.

Your first stop should be the General Problems dashboard in Log Insight. This dashboard checks the health of your vSphere using 8 queries. You want it to pass with flying colors, meaning it should be blank like this. That means vSphere has not logged any issue.

1 good result

Let’s look at some of the queries that Log Insight runs. The SCSI latency threshold is 1 second, which is 1,000,000 microseconds. Here is what the query looks like:

good result 1

1 second is on the high side; you can change it to a lower number. Do note that this is from the VMkernel viewpoint and it measures a single SCSI operation (1 read or 1 write), so the number will be much higher than the vCenter average. I’ve seen a 12 ms value in vCenter (from the real-time chart, so it is a 20-second average) become 600 ms in the VMkernel log. For details, see this.
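The gap between an averaged value and a single operation is easy to reproduce with arithmetic. The Python sketch below uses made-up numbers (99 operations at 6 ms and one at 600 ms within one sampling window) to show how a roughly 12 ms average can hide a 600 ms operation, and why the 1,000,000 microsecond threshold would still not catch it.

```python
from statistics import mean

THRESHOLD_US = 1_000_000  # 1 second expressed in microseconds, as in the query above


def ms_to_us(milliseconds: float) -> float:
    """Convert milliseconds to microseconds."""
    return milliseconds * 1000


# Hypothetical sample for one window: 99 fast operations and a single slow one.
latencies_ms = [6.0] * 99 + [600.0]

average_ms = mean(latencies_ms)  # what an averaged chart would show
worst_ms = max(latencies_ms)     # what a per-operation log line would show

print(f"average: {average_ms:.2f} ms, worst single operation: {worst_ms:.0f} ms")
print("worst operation exceeds 1 s threshold:", ms_to_us(worst_ms) > THRESHOLD_US)
print("worst operation exceeds 0.5 s threshold:", ms_to_us(worst_ms) > THRESHOLD_US / 2)
```

With these numbers the slow operation passes unnoticed at 1 second but is caught at half a second, which is the practical argument for lowering the threshold to suit your environment.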

The above query is pretty simple, as it’s looking for a specific item. Here is a much broader health check.

good result 2

The example below checks for any error in vCenter that is not already reported as an alarm.

good result 4

The query below checks for cluster imbalance.

good result 5

This query tracks VMs rebooted by HA.

good result 6

OK, all the above are what you want to see. In reality, your environment may not be 100% healthy. Let’s look at another example, this time with some errors.

bad 1

You can drill down into each widget. Log Insight opens the Interactive Analysis screen, where you can analyze the data interactively.

bad 2

The above data gives you the relative distribution. You can drill down by adding a time dimension. This lets you see whether the problem happens consistently or not. In the example below, the problems keep on happening.

bad 3
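Adding a time dimension simply means counting events per time bucket instead of in total. Log Insight does this for you in the UI; the Python sketch below only illustrates the idea, using made-up timestamps and a hypothetical helper that buckets events by hour.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps, as if taken from matching log lines.
events = [
    "2017-03-01T09:15:00",
    "2017-03-01T09:40:00",
    "2017-03-01T10:05:00",
    "2017-03-01T11:30:00",
]


def count_per_hour(timestamps):
    """Count events per hour bucket, keyed by the start of the hour."""
    counts = Counter()
    for ts in timestamps:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        counts[hour.isoformat()] += 1
    return dict(counts)


for bucket, count in sorted(count_per_hour(events).items()):
    print(bucket, count)
```

Events in every bucket suggest a persistent problem, whereas a single busy bucket points to a one-off incident.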

I can drill down to a specific problem. Let’s choose SCSI device connection loss. Once I narrow it, I can group the information by device.

bad 4

The vSphere logs appear to distinguish further between kinds of permanent loss. From the above, we can see there are multiple types. I did not know about this, but Log Insight shows it clearly. As a result, I can probe further.

bad 5

We could go on with more examples. I hope this has given you the idea that Log Insight is a good companion for your VMware logs. Since it is free for 25 sources per vCenter, give it a spin!