Tag Archives: Log Insight

Operationalize SDDC program

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Callum Eade and Kenon Owens created a program called Operationalize Your World. Sunny and I provide the technical content. Many folks, both internal and external, have reviewed the materials over the past several years. While cleaning up my files, I was surprised to find decks from early 2011 containing old versions of the slides you’re seeing today.

If you only have 10 minutes, below is a 7-minute introduction to what you get in the 1-day workshop. Sunny and I delivered it at VMworld 2016. We have benefited a lot from the community, so we immediately said yes when Alastair and vBrownbag invited us to share.

summary

In 2017, they invited us again. This time we were given 30 minutes, so you get some of the solution in this talk.

The 1-day workshop actually has 2 days’ worth of material. Hence there is flexibility in what is delivered on the day, and it is driven by the audience.

We use a restaurant analogy to raise awareness that your IaaS business should be operated differently. There are 4 main ppt files.

powerpoint

You can find the material here. The files are in editable format (ppt), not PDF.

We provide the material in PowerPoint because Operations vary widely. Take what’s relevant to you, throw away what’s not, add your custom deck, and make it yours. When you share your deck with your peers or customers, let me know how it goes. I’m keen to hear about your journey. It’s a journey because it will take you multiple rounds to enlighten your peers.

The workshop covers 4 areas in management (Availability, Performance, Capacity and Configuration). We map each area to both Consumer and Provider layers of your IaaS business.

blog 1

Hope you find the material useful. If you do, go back to the Main Page.

Operationalize Your World: Import Steps

This post continues from the Operationalize Your World post. Do read it first to get the context.

Read all the instructions first, before executing the steps.

Planning

  1. Decide which clusters and datastores are on what Service Tier.
    1. This is the most crucial step.
    2. If you are running an IaaS business, you should have at least 2 policies. I have not been able to think of a business scenario where you only have 1 policy.
    3. You choose per cluster, not per resource pool. The Policy is applied at cluster level. I do not use Resource Pool. It complicates matters operationally.
    4. If you have no SLA, you have a bigger problem than IaaS monitoring.
    5. You can place a VM on Tier 1 Compute and Tier 2 Storage, although that just makes your operations complex. It’s like giving an economy class passenger a business class TV & meal.
  2. Define Large VM.
    1. I define a large VM as one with more than 8 vCPUs or more than 24 GB of RAM.
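
The planning decisions above can be written down before touching vR Ops. Below is a minimal Python sketch, not anything vR Ops itself provides: the cluster names, the `VM` class, and the helper functions are hypothetical and only illustrate the "one tier per cluster" rule and the Large VM thresholds given in the text.

```python
from dataclasses import dataclass

# Thresholds from the text: "large" means more than 8 vCPUs or more than 24 GB RAM.
LARGE_VCPU_THRESHOLD = 8
LARGE_RAM_GB_THRESHOLD = 24


@dataclass
class VM:
    name: str      # hypothetical VM name
    vcpus: int
    ram_gb: float


def is_large_vm(vm: VM) -> bool:
    """Return True if the VM exceeds either threshold."""
    return vm.vcpus > LARGE_VCPU_THRESHOLD or vm.ram_gb > LARGE_RAM_GB_THRESHOLD


# Hypothetical plan: exactly one Service Tier per cluster (not per resource pool).
cluster_tier = {
    "cluster-prod-01": "Tier 1",
    "cluster-prod-02": "Tier 2",
    "cluster-dev-01": "Tier 3",
}


def tiers_in_use(plan):
    """Return the set of distinct tiers that the plan assigns."""
    return set(plan.values())


if __name__ == "__main__":
    print(is_large_vm(VM("db-01", vcpus=16, ram_gb=64)))
    print(sorted(tiers_in_use(cluster_tier)))
```

Note that a VM with exactly 8 vCPUs and 24 GB RAM is not large under this definition, because both comparisons are strict.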

Prerequisite

  1. vR Ops Advanced edition or higher. Standard edition does not allow custom dashboards.
  2. vR Ops 6.6
    1. The content uses 6.6-specific properties. I’ve also removed the dashboards, views, and XML files that we’ve ported to 6.6.
    2. The import process is easier from 6.5 onwards, as you can import groups.
  3. Hands-on experience with vR Ops 6.x. I assume you know what you’re doing.
  4. An ID with admin privilege. Do not use the built-in Admin account. It creates confusion between OOTB content and what you create.

Download the files here.

Just unzip the downloaded file. No need to unzip the zipped files inside it, as you can import a zipped file.

Steps (Summary)

The steps can be grouped into 3 parts:

  1. Part 1: Super Metrics and Policies
    1. Import the Policies.
      • If you open the XML files, you will notice I’ve stripped all the contents. They only contain super metrics.
      • This is because they are merely a vehicle to bulk import super metrics. I do not use the policy, hence no need to enable the policy.
    2. In your active base policy, enable the super metrics and ESXi Temperature metrics
    3. Create 1 policy for each tier.
  2. Part 2: Group
    • Create the group types. Do not mistype.
    • Import the groups.
  3. Part 3: View & Dashboards
    1. Import the Views
      • Do this prior to importing the dashboards, as the dashboards depend on the views
    2. Import the Dashboards.
      • Importing dashboard automatically creates the menu structure
    3. Recreate the XML files
      • They cannot be imported.

Steps (Details)

Read the steps, as they have more detail than the video.

Follow the names exactly. They are hardcoded in the dashboards.
Names are case sensitive!
If you do not follow them, the import will still work, but you will get an hourglass icon.

Part 1: Policy and Metrics

Import the policy. Choose Skip import to ensure nothing is overwritten. You will actually not overwrite anything as the file you import is a dummy policy. All it has is super metrics.

Policy import

It should take around 1 minute. You will get this when done.

Policy import success

The purpose of the policy import is merely to import the super metrics. We have to enable them manually. If you are curious about which super metrics you are getting, the list looks something like this:

Super Metrics

Once imported, enable the super metrics in your base policy. Yes, you can bulk enable by selecting multiple lines (as shown below). Use the Actions menu to enable them all.

enable super metrics

After you import the Performance SLA super metrics, review their settings. Adjust the SLA if you know the performance of your IaaS. If you are running the Balanced power management policy, change the CPU SLA to 10, 20, 30 accordingly.

Create 1 policy for each Tier. This has to be based on your active policy, so the inheritance works properly. In the example below, my base policy is called OneCloud Default Policy. Make sure you choose the right one.

You must use the following names for the Policy:

  • Tier 1
  • Tier 2
  • Tier 3

Enable the correct SLA for each tier. In the example below, I’m enabling Tier 2. From the big red number 1, you can see I’m editing a policy named Tier 2. You can see it’s being selected in the background, behind the dialog box.

See the big red number 2: It shows the Performance SLA that should belong to Tier 2. As a result, I only enabled them (see the big red number 3). The easiest is to specify “Tier 2” in the filter, so only Tier 2 super metrics are shown.

I do not enable the super metrics for Tier 1 (see the big red number 4).

Correct super metric for each policy

Here is another example, this time using version 6.6:

Click Save to end the editing.

Part 2: Group Type and Group

Create these group types carefully:

  • Class of Service
  • VM Types
  • Tenants
  • Multi-tier Applications
  • Single-tier Applications
  • Application Tier

Your group import will fail if you do not have the group type.

If you mistyped a name and saved it, do not edit it to correct it. Delete it and create a new one. The reason is that editing only updates the label, not the key.
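
Because the group type names are case sensitive and a mistyped one has to be deleted rather than edited, it can help to compare what you typed against the required list before importing the groups. The sketch below is a plain Python check and assumes you have copied the existing group type names out of the UI by hand; `check_group_types` is a hypothetical helper, not a vR Ops API.

```python
# Required group type names, exactly as listed in this post.
REQUIRED_GROUP_TYPES = [
    "Class of Service",
    "VM Types",
    "Tenants",
    "Multi-tier Applications",
    "Single-tier Applications",
    "Application Tier",
]


def check_group_types(existing):
    """Compare existing group type names with the required ones.

    Returns (missing, mistyped):
      missing  - required names with no match at all
      mistyped - required name -> existing name that differs only by
                 letter case or surrounding whitespace
    """
    existing_set = set(existing)
    by_normalized = {name.strip().lower(): name for name in existing}
    missing = []
    mistyped = {}
    for required in REQUIRED_GROUP_TYPES:
        if required in existing_set:
            continue  # exact, case-sensitive match
        near = by_normalized.get(required.lower())
        if near is not None:
            mistyped[required] = near  # delete this one and recreate it
        else:
            missing.append(required)
    return missing, mistyped


if __name__ == "__main__":
    typed = ["Class of Service", "VM types", "Tenants ", "Multi-tier Applications"]
    missing, mistyped = check_group_types(typed)
    print("Missing:", missing)
    print("Mistyped (delete and recreate):", mistyped)
```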

Once created, import the groups.

For the Service Tiers groups, you need to associate them to the correct policy. To do that, edit the group, and choose the respective policy. The following example shows for Tier 2.

Do the same steps for Tier 1 (Gold) and Tier 3 (Bronze).

BTW, you can also assign the policy to its associated group via the policy library. Your choice. Below is an example. Use the green plus sign, which I have circled below:

Assign policy to Tier 1

You know the policy is associated when it appears in the Active Policies. The screenshot below shows I’ve activated all 3 Tiers.

policy

Part 3: View and Dashboard

Import the view, then the dashboard.

view import

The list shown below is partial. There are >100 views in total. I use the View widget because it is flexible.

view-list

Import the Dashboards. You can import them in any order. When you are done, it looks something like this.

dashboard

XML Files

Recreate the XML files. They cannot be imported. I use copy and paste, even for the file names.

xml

Once imported, take your well-deserved coffee break! If you have a large environment, it can take an hour for all the dashboards, super metrics, policies, and groups to be applied. During the process, you may see the known error while trying to open a dashboard. Just wait an hour or so.

When things go wrong

If your dashboard shows an hourglass icon, it is likely because a metric or object is missing. The root cause is likely a missing group.

You should not need to do any of these things. But if things go wrong, there are a couple of things you can check. First, ensure each Policy actually applies to the correct object. For example, you can see below that I’ve applied the policy named Tier 2 to a group called Tier 2. The Assigned Groups column shows the policy is applied to 1 group and impacts 302 objects.

Policy objects group

The same goes with super metrics. In the following example, a super metric is being applied to Tier 2 policy. It’s not applied to other policies, as it does not make sense.

Super Metric n policy

If an import fails, you will see an error message. Simply rename the duplicate object, then reimport.

import duplicate

You cannot re-import on top of an existing object, because the ID remains the same. Delete the existing object, then reimport. It is safe to delete.

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together.

How healthy is your vSphere environment?

One common question I get from customers is how to prove that there are no hidden warnings lurking in the log files. As you know, vSphere produces a lot of logs. I shared how you can check for performance issues here, so this post complements it.

Your first stop should be the General Problems dashboard in Log Insight. This dashboard checks the health of your vSphere environment using 8 queries. You want it to pass with flying colors, meaning it should be blank like this. That means vSphere has not logged any issue.

1 good result

Let’s look at some of the queries that Log Insight runs. The SCSI latency threshold is 1 second, which is 1,000,000 microseconds. Here is what the query looks like:

good result 1

1 second is on the high side; you can change it to a lower number. Do note that this is from the VMkernel viewpoint and it measures a single SCSI operation (1 read or 1 write), so the number will be much higher than the vCenter average. I’ve seen a 12 ms value in vCenter (from the real-time chart, so it is a 20-second average) correspond to 600 ms for a single operation. For details, see this.
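
To see why a per-operation value can be so much higher than an average, here is a small arithmetic sketch in Python. The sample values are made up for illustration (98 operations at 6 ms plus one at 600 ms); they are not taken from a real log, and the functions are not part of Log Insight.

```python
# Threshold used by the query described above: 1 second, in microseconds.
THRESHOLD_US = 1_000_000

# Hypothetical samples for one averaging interval, in microseconds:
# 98 fast operations at 6 ms and a single slow one at 600 ms.
latencies_us = [6_000] * 98 + [600_000]


def average_ms(samples_us):
    """Average latency in milliseconds, as an averaged chart would show it."""
    return sum(samples_us) / len(samples_us) / 1_000


def over_threshold(samples_us, threshold_us):
    """Individual operations slower than the threshold (per-operation view)."""
    return [s for s in samples_us if s > threshold_us]


if __name__ == "__main__":
    print("Average (ms):", average_ms(latencies_us))
    print("Worst single operation (ms):", max(latencies_us) / 1_000)
    print("Over 1 s:", over_threshold(latencies_us, THRESHOLD_US))
    print("Over 0.5 s:", over_threshold(latencies_us, 500_000))
```

With these made-up numbers the average is 12 ms while the worst single operation is 600 ms. The 1-second threshold catches nothing here, but a 0.5-second threshold would catch the slow operation, which is why lowering the number can be worthwhile.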

The above query is pretty simple, as it’s looking for a specific item. Here is a much broader health check.

good result 2

The example below checks for any error in vCenter that is not already reported as an alarm.

good result 4

The query below checks for cluster imbalance.

good result 5

This query tracks VMs rebooted by HA.

good result 6

OK, all the above are what you want to see. In reality, your environment may not be 100% healthy. Let’s look at another example, this time with some errors.

bad 1

You can drill down to each widget. Log Insight presents the Interactive Analysis screen, as you can perform analysis interactively on this screen.

bad 2

The above data gives you the relative distribution. You can drill down by adding a time dimension. This lets you see whether the problem happens consistently or not. In the example below, the problems keep on happening.
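
The two views described above (relative distribution, then the same data over time) can be sketched in a few lines of Python. This is only an illustration of the idea, not how Log Insight is implemented or queried; the event records and type names below are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already-parsed events: (ISO timestamp, problem type).
events = [
    ("2017-06-01T09:05:00", "scsi_connection_loss"),
    ("2017-06-01T09:40:00", "scsi_connection_loss"),
    ("2017-06-01T10:15:00", "scsi_connection_loss"),
    ("2017-06-01T10:20:00", "ha_vm_restart"),
    ("2017-06-01T11:30:00", "scsi_connection_loss"),
]


def distribution(records):
    """Relative distribution: how many events of each problem type."""
    return Counter(kind for _, kind in records)


def per_hour(records, kind):
    """Time dimension: count of one problem type per hourly bucket."""
    buckets = Counter()
    for timestamp, event_kind in records:
        if event_kind == kind:
            hour = datetime.fromisoformat(timestamp).replace(
                minute=0, second=0, microsecond=0
            )
            buckets[hour.isoformat()] += 1
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    print(distribution(events))
    print(per_hour(events, "scsi_connection_loss"))
```

If every hourly bucket has a non-zero count, the problem is happening consistently rather than as a one-off spike.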

bad 3

I can drill down to a specific problem. Let’s choose SCSI device connection loss. Once I narrow it, I can group the information by device.

bad 4

vSphere logs appear to distinguish further between kinds of permanent loss. From the above, we can see there are multiple types. I did not know about this, but Log Insight shows it clearly. As a result, I can probe further.

bad 5

We could go on with more examples. I hope this has given you an idea of why Log Insight is a good companion for your VMware logs. Since it is free for 25 sources per vCenter, give it a spin!