Monthly Archives: May 2015

Are those Large VMs using the resources given to them?

This post is part of the Operationalize Your World program. Do read that first to get the context.

In the previous post, I covered the reason why over-provisioned VMs are bad. We also talked about the technique. Let’s discuss the implementation for CPU in this blog.

Create a dynamic group that captures all the large VMs. Depending on your environment, you can grab VMs with 8 or more vCPUs, or with 6 or more. In the screenshot below, I’m using 8 vCPU.

What do you notice about the CPU Utilization in the following screenshot?

The Large VMs as a group are only using 7.61% max!

That means not a single one of them used >8% CPU over a period of 24 hours. This is an example of severe over provisioning.
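The arithmetic behind that group-level view can be sketched in a few lines of Python. This is only an illustration of the idea, not how vRealize Operations computes super metrics; the VM names and sample values are made up.

```python
# Hypothetical CPU utilisation samples (%) for each large VM over the period.
samples = {
    "vm-a": [1.2, 3.4, 7.61, 2.0],
    "vm-b": [0.5, 0.9, 1.1, 0.7],
    "vm-c": [2.2, 4.8, 3.9, 5.0],
}


def group_peak(samples):
    """Highest utilisation reached by any VM in the group."""
    return max(max(values) for values in samples.values())


def group_average(samples):
    """Average utilisation across every sample of every VM."""
    all_values = [v for values in samples.values() for v in values]
    return sum(all_values) / len(all_values)


peak = group_peak(samples)
average = group_average(samples)
print(f"group peak: {peak:.2f}%, group average: {average:.2f}%")
```

If the group peak is low, no VM in the group was busy at any point, which is the signal for over-provisioning.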

Once you have the super metrics, you can display them in the dashboard. You can use a line chart or a View. I use a View as I do not need to show them as 2 separate charts. What do you see from the following example?

  • The area marked 1 is not what you want to see. None of the Large VMs are doing any work. This means they are all oversized.
  • The area marked 2 is healthier. At any given moment, at least one of the VMs is doing work. The Demand counter can go above 100% as the hypervisor performs IO (storage or network) using another core.
  • The average remains low all the time. This is over a 1-month period, with 5-minute granularity. It shows that the majority of the VMs are oversized.

Now, the above is good as an overall view. But it’s missing something. Can you guess what?

Yes, it’s missing the VMs themselves. What if upper management wants to see the utilisation of all the VMs at a glance?

We can create a table that has data transformation. The table complements the line chart by listing all the VMs. From the list, you can see which VMs are the most over-provisioned, because the list is sorted. You can sort by the 5-minute or the 1-hour peak.

What’s the limitation of the table?

  • It does not show the VM distribution. Where are the large VMs? Do they exist in a cluster where they are not supposed to exist?

This is where the Heat Map comes in.

  • We group them by Cluster, then by ESXi, so we can see where they are. You want to see them well spread in your clusters, and not concentrated in just 1 host.
  • The heat map is sized by vCPU configuration. This way, a bigger VM gets a bigger box. A 32 vCPU VM will have a box that is 4x the size of an 8 vCPU VM, so it will stand out. You can see in the following example that some large VMs are much larger than the rest. I have monster VMs in this environment.

A great feature of heat map is color. It’s very visual. We know that both under provisioning and over provisioning are bad. So I set the color spectrum. I choose

  • black for 0
  • red for 100
  • green for 50

If I do the right sizing, I’d see mostly green. If I under provision, I’d see mostly red. If I over provision, which you can expect in most environments, guess what? They are black!
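To make the colour spectrum concrete, here is a minimal Python sketch that maps a utilisation value to a colour using the three anchor points above. The linear blending between the anchors is my assumption for illustration; the heat map widget may interpolate differently.

```python
def heat_colour(value):
    """Map a utilisation value (0-100) to an (R, G, B) tuple.

    0 -> black, 50 -> green, 100 -> red, with linear blending in between.
    """
    value = max(0.0, min(100.0, float(value)))  # clamp to the spectrum
    if value <= 50:
        # Blend from black (0, 0, 0) to green (0, 255, 0).
        fraction = value / 50
        return (0, round(255 * fraction), 0)
    # Blend from green (0, 255, 0) to red (255, 0, 0).
    fraction = (value - 50) / 50
    return (round(255 * fraction), round(255 * (1 - fraction)), 0)


for utilisation in (0, 50, 100):
    print(utilisation, heat_colour(utilisation))
```

An over-provisioned VM sitting at a few percent ends up almost black, which is why such environments look dark.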

That’s all you need to see the overall picture.

A VM Owner does not care about your overall picture. She just cares about her VM. That means we need to drill down into individual VM.

To facilitate that, we need a list of VMs. I use a Top-N as it enables me to sort the VMs. The good thing about Top-N is you can go back to any period in time. Heat map only allows you to see current data.

The timeline in Top-N is set to just 1 hour. There is no point setting it longer, as that would average the values out. What you want is already provided by the View List. Use that to pick the VM to downsize. The Top-N is merely to drive the widgets.

We also add a table. It shows the individual vCPU peak utilisation. It’s shown in seconds, following the vCenter real-time chart, where 20 seconds = 100%.
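The conversion from seconds to a percentage is simple arithmetic. A minimal sketch, taking the 20-second interval from the text above (the function name is my own):

```python
REALTIME_INTERVAL_SECONDS = 20  # length of one real-time sample, per the text


def used_seconds_to_percent(used_seconds, interval=REALTIME_INTERVAL_SECONDS):
    """Convert CPU time used within one sample interval into a percentage."""
    return used_seconds / interval * 100


print(used_seconds_to_percent(5))   # a vCPU busy for 5 of the 20 seconds
print(used_seconds_to_percent(20))  # a vCPU busy for the whole interval
```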

The table does not quickly answer what the CPU utilisation is 95% of the time. This is where the Forensic comes in. It shows the 95th percentile. You expect that green vertical line to be at around the 80% mark, indicating the VM is correctly sized.
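If you want to see why the 95th percentile tells a different story from the peak, here is a small Python sketch using the nearest-rank method. The sample data is made up, and the method is my choice for illustration; the Forensic widget may calculate its percentile differently.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical utilisation samples (%): mostly idle, with a few short spikes.
utilisation = [10] * 95 + [90] * 5

print("peak:", max(utilisation))
print("95th percentile:", percentile_nearest_rank(utilisation, 95))
```

The peak alone would suggest the VM is busy, while the 95th percentile shows that it sits idle nearly all of the time.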

The table and Forensic are useful. What’s their limitation?

  • They are not as user friendly as a line chart.
  • Plus, the VM Owner wants to see the utilization of each vCPU. This lets her see clearly whether a specific peak was genuine demand or not.

The chart is busy as the 5 minute granularity is maintained. No roll up. You can zoom into any specific time of your interest.

I’m only showing the first 16 vCPUs, as my screen is not big enough to show more. You can configure it to show the rest. If your screen is not big enough, or you need to show more than 16, create multiple View widgets.

How do they fit together on the dashboard? Here is what they look like.

I hope you found it useful. Happy rightsizing!

Are any of the ESXi Hosts in your data centers overheating?

This week must be a hot week in the data center. I saw 3 different requests from 3 countries asking how to monitor the ESXi host temperature and fan speed. The metrics are captured by vRealize Operations, but not enabled by default.

So step 1 is to enable the metrics. vRealize Operations drives this using its powerful Policy feature. So you need to modify your policy.

  1. Open your active policy. It will bring up the Edit Monitoring Policy dialog box, as shown below.
  2. Click on Override Attributes. That’s step 4 in the dialog box. It will bring up the full list of attributes.
  3. In the search field, type “temperature”. I was lazy so I just typed “tempe” on the screenshot below. You should also get the fan speed. Search for “fan” or something like that.
  4. Select the State, and choose Local. This will enable the metric.
  5. Save it.

There you go. It’s done. You will see something like this below. Indeed, a lot of sensors are tracked! You know exactly which component is overheating.

Now that you can see the metrics at the individual host level, they will help when troubleshooting a specific host. But that is not good enough for overall data center monitoring. Troubleshooting is when you know there is a problem, and have likely zoomed into a particular host. Monitoring is when you expect no problem. So let’s create a super metric. I’d advise a super metric at the entire physical data center level. You can certainly track it per cluster or other object. I’m going to track at cluster level in this example.

Below is the super metric I created. I track the maximum as I want to know if any host is affected. You need to create 2 super metrics if you want to track Temperature and Fan Speed.
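The reasoning for using the maximum can be sketched in Python. This is not the super metric syntax, only an illustration of the roll-up; the host names and readings are made up.

```python
# Hypothetical temperature readings (Celsius), one per host in a cluster.
cluster_hosts = {
    "esxi-01": 41.0,
    "esxi-02": 38.5,
    "esxi-03": 63.0,
}


def cluster_max(readings):
    """Return the host with the highest reading, and that reading."""
    host = max(readings, key=readings.get)
    return host, readings[host]


def cluster_average(readings):
    """Average reading across the hosts, shown only for comparison."""
    return sum(readings.values()) / len(readings)


print("max:", cluster_max(cluster_hosts))
print("average:", cluster_average(cluster_hosts))
```

An average would hide a single hot host among many cool ones, while the maximum surfaces it.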

As shared in my other blog post, I always verify the super metric. The preview feature comes in handy. Below is the preview for fan speed. Notice the value was flat 0 and then it went up. That’s because I just enabled the metric. It was not collected before.

Max Fan speed in percentage in a cluster

Below is the super metric for Temperature.

Max host temperature in Celsius

Once saved, don’t forget to enable it in the policy.

With that, you’re done! The following screenshot shows a stable value for both. You certainly do not want to see a sudden spike to a high number.

You’ve got monitoring. I guess the next thing is alerting. The good thing is it is already enabled by default. You do not have to do anything. The following screenshot shows the Symptom. vRealize Operations follows vCenter, so it has both the Yellow and Red symptoms.

What about the alert itself? A vRealize Operations alert is based on Symptoms. The Symptoms drive the alert, which makes sense. In our case here, both the Red and Yellow symptoms will trigger the alert.

There you go. Hopefully it’s not so hot anymore in the DC 🙂

Why you need a Log Management platform

Those of us who have experienced troubleshooting a VMware environment (or any enterprise infrastructure) know the importance of logs. When your IaaS platform hits a problem, and life somehow likes to make it happen just before you go on your holiday with loved ones, the ability to analyse the logs is essential. In fact, it is the first thing that VMware Support asks you for. Many other vendors’ support teams will also ask you for logs.

An enterprise-wide Log Management platform is essential if you are operating a DC infrastructure. The larger and more complex your environment, the more critical it becomes. Let’s list down the benefits and requirements, starting with benefits.

No missing logs

  • Logs can be rotated away, or your ESXi host can hit a PSOD, leaving you with no log. That’s the end of troubleshooting as there is nothing you can do.
  • Here is a real life case where the log was gone.

No need to upload logs

  • This is a big time saving, as uploading gigabytes of data is not easy. Your vendor’s Support Engineer can use WebEx or another remote session instead. If you are busy attending to other matters, the remote session can be delegated to a colleague, as all that is needed is to facilitate the screen sharing session. You can even record it so your broader team can learn from it.
  • If you are a VMware Mission Critical Support (MCS) customer, I’d encourage you to have regular sessions with your MCS Engineer and go through the key dashboards of Log Insight proactively. You should do the same with your key vendors.

Faster analysis

  • Log Insight comes with hundreds of built-in queries for vSphere. It also has many built-in queries, fields and dashboards for other VMware products, such as SRM, NSX, and VSAN. This speeds up analysis. If you set up alerts, you can be informed before the situation degenerates.
  • If you run non-VMware products, you are not forgotten. The list of Content Packs keeps growing and improving, and they are released independently of Log Insight.

A unique insight into your environment

  • How well do you know your environment? Sure, it is healthy, and the vSphere admin client or vRealize Operations tells you that it is. But what does the log say? Can you tell your customers with confidence that there is no error or warning lurking around in the millions of log entries?
  • One reason why VMware Support asks for the log file is that a lot of information is not available, or not readily available, in vCenter.
  • There is information that is in the UI, but not easy to query. Take, for example, the vCenter Tasks and Events. It is not easy to analyse them across time. If you have multiple vCenter Servers across the globe, that’s even harder. Can you tell your auditor “who did what to what object and when”? That’s not possible without powerful analysis. Given a VM, can you prove which ESXi Hosts it has ever run on since it was provisioned?
  • There is information that is simply not there in the UI. It is only available in the log. Take, for example, the information on vMotion. Do you know how long the pre-copy takes just before the stun time?
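To illustrate the kind of question a log platform lets you answer, here is a small Python sketch that works out which hosts a VM has run on. The event records are made up and assumed to be already parsed from the logs; this is not a Log Insight query.

```python
# Hypothetical, already-parsed event records: (timestamp, vm, host).
events = [
    ("2015-02-11T22:15:00", "vm-a", "esxi-03"),
    ("2015-01-03T10:00:00", "vm-a", "esxi-01"),
    ("2015-02-12T01:00:00", "vm-b", "esxi-02"),
    ("2015-04-20T09:30:00", "vm-a", "esxi-01"),
]


def hosts_for_vm(events, vm):
    """Hosts a VM has run on, in order of first appearance."""
    seen = []
    # ISO-8601 timestamps sort chronologically as plain strings.
    for _, event_vm, host in sorted(events):
        if event_vm == vm and host not in seen:
            seen.append(host)
    return seen


print(hosts_for_vm(events, "vm-a"))
```

The hard part in real life is having every event from every vCenter in one searchable place, which is the point of the central platform.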

Helps in mastering the products you are in charge of

  • You know VMware products well. You can design and you can troubleshoot. How well do you know the log? If there is a problem, do you know what log entries to search for? Do you know what those codes in the log mean? VMware products generate log entries, and they provide valuable information. Learning the log helps you deepen your skills.
  • During the WebEx session with the vendor support engineer, you can see how they troubleshoot, and learn from the joint troubleshooting session.

Protect against future incidents

  • The first time a problem happens, your management will forgive you for taking time to figure out the root cause. They will, however, expect you to put a countermeasure in place as a result of the incident. You implement an alert, so when the same problem happens again, you will know within minutes, and before the customers do.
  • Once you know the root cause, Log Insight enables you to set up an alert that will be triggered when it receives the same log entry.
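The idea of alerting on a known log entry can be sketched in Python. The pattern and the log lines below are invented for illustration, and this is plain pattern matching, not how Log Insight alerts are defined.

```python
import re

# Hypothetical signature of the log entry identified as the root cause.
ROOT_CAUSE_PATTERN = re.compile(r"storage path down", re.IGNORECASE)


def matching_entries(log_lines, pattern=ROOT_CAUSE_PATTERN):
    """Return the log lines that contain the known root-cause signature."""
    return [line for line in log_lines if pattern.search(line)]


# Made-up log lines, not real product output.
sample_log = [
    "2015-05-12T08:00:01Z host-a app: heartbeat ok",
    "2015-05-12T08:00:07Z host-b app: ERROR Storage path down on adapter 2",
    "2015-05-12T08:00:09Z host-a app: heartbeat ok",
]

for line in matching_entries(sample_log):
    print("ALERT:", line)
```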

We have covered some of the benefits. I hope you are convinced that you need an enterprise-wide log platform.

Let’s cover the capabilities that make up a great log analysis platform for SDDC:

Deep understanding of VMware

  • Not just ESXi and vCenter. It has to know all the VMware products that you have. Some of you run many VMware products, and you want a tool that understands all of them. Log Insight comes with a wide variety of Content Packs.

Bandwidth control

  • This is important if you are running a global operation. You want a single pane of glass. ESXi and Windows can generate a lot of logs. I have seen how a single ESXi saturated a WAN link. Due to a bug in LDAP configuration, it generated excessive logs.
  • Syslog as a protocol is not compressed. Log Insight’s proprietary protocol provides compression. Steven Flanders has shown a 30x compression in his blog.
  • There are situations where you only want to forward selected entries. For example, you have a limited WAN link at your remote sites. You have set your queries to certain errors, warnings and events, and you only need to track those. Log Insight allows you to create filters, and forward the filtered entries to multiple destinations.
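To get a feel for why compression matters on a WAN link, here is a small Python experiment using the standard zlib module. It is not Log Insight’s protocol, and the log lines are synthetic and highly repetitive, so the ratio it prints says nothing about what you will get with real logs.

```python
import zlib

# Made-up, highly repetitive log lines; real logs compress less well.
lines = [
    f"2015-05-12T08:00:{i % 60:02d}Z host-a app: login failed for user{i % 5}"
    for i in range(1000)
]
raw = "\n".join(lines).encode("utf-8")
compressed = zlib.compress(raw)

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.1f}x")
```

Log text tends to repeat host names, timestamps and message templates, which is why it compresses well in general.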

Scalability

  • This is important as you eventually want to capture all logs, not just VMware. You want a cross analysis of the event logs from Windows, RedHat, physical network and storage devices. The end result is a lot of log entries, especially when you have thousands of VMs and hundreds of ESXi Hosts. Scalability matters.
  • Scalability refers to the speed of log ingestion, the amount of data it can store, and the speed of query. Querying the last 1 hour may be fast. Try querying the last 3 months 🙂
  • Log Insight scales horizontally, and it comes with built-in load balancing.

Reliability

  • What if the remote syslog is not available? This could be due to an unreliable WAN link, or the remote syslog server undergoing maintenance. In the case of syslog, your source (e.g. ESXi, SRM, vCenter) will simply drop the log. Yes, you lose the logs as they are not sent to your central syslog server. Log Insight prevents this by caching the entries. It keeps them, and resends them when the destination is reachable again.
  • That’s the source. What about the destination? If you have a central log platform, you want it to be available all the time. But what if you are doing maintenance, and that requires a reboot? Log Insight can ensure that ingestion is not affected by using a cluster. You can cluster your central Log Insight instance for higher availability.
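The cache-and-resend behaviour on the source side can be sketched in Python as follows. This is a simplified illustration of the idea only, not how Log Insight implements it; the class name, the `send` callback and the buffer size are all my own assumptions.

```python
from collections import deque


class BufferingForwarder:
    """Keeps log entries locally until the destination accepts them."""

    def __init__(self, send, max_buffered=10000):
        self._send = send  # callable: returns True if the entry was delivered
        # When the buffer is full, deque(maxlen=...) drops the oldest entry.
        self._buffer = deque(maxlen=max_buffered)

    @property
    def pending(self):
        """Number of entries still waiting to be delivered."""
        return len(self._buffer)

    def forward(self, entry):
        """Queue an entry, then try to deliver everything queued so far."""
        self._buffer.append(entry)
        self.flush()

    def flush(self):
        """Send buffered entries in order; stop at the first failure."""
        while self._buffer:
            if not self._send(self._buffer[0]):
                return False  # destination still down, keep the entries
            self._buffer.popleft()
        return True


# Simulated destination behind a link that can go down.
link = {"up": False}
received = []


def send(entry):
    if link["up"]:
        received.append(entry)
    return link["up"]


forwarder = BufferingForwarder(send)
forwarder.forward("entry 1")  # link down: buffered
forwarder.forward("entry 2")  # link down: buffered
link["up"] = True
forwarder.forward("entry 3")  # link up: all three delivered, in order
print(received, forwarder.pending)
```

Plain syslog has no equivalent of the buffer, so the first two entries in this example would simply be lost.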

Security

  • By default, syslog entries are not sent via a secure channel. Log Insight enables you to send them via a secure channel.
  • The enterprise-wide log management platform will have a diverse set of users. You want to be able to control who can see what data.

Long Term archival

  • Most log entries lose their value after a few months. Audit logs, however, should be kept for years. For some customers, this is 7 years. You need the ability to filter these security logs, and send them to a separate system so they remain available. Log Insight achieves this via its Event Forwarding feature. You create a dedicated Log Insight VM for this purpose, and have it receive all the audit logs, and only the audit logs.

Availability

  • If you have a log management platform that covers everything, it becomes a critical component of your infrastructure. You need to have DR for it. Log Insight, via its Event Forwarding, can be architected as Active/Active instances.

I hope the above is useful. For additional info on why Log Insight is a great fit, review these 12 reasons from Steven Flanders.

I recommend that you join the Log Insight community and provide feedback on the next version.

Have fun in the weird and wonderful world of logs! 🙂