Are any of the ESXi hosts in your data centers overheating?

This must be a hot week in data centers. I saw 3 different requests from 3 countries asking how to monitor ESXi host temperature and fan speed. vRealize Operations captures these metrics, but they are not enabled by default.

So step 1 is to enable the metrics. vRealize Operations drives this through its powerful Policy feature, so you need to modify your policy.

  1. Open your active policy. This brings up the Edit Monitoring Policy dialog box, as shown below.
  2. Click Override Attributes, which is step 4 in the dialog box. This brings up the full list of attributes.
  3. In the search field, type "temperature". I was lazy, so I just typed "tempe" in the screenshot below. You should also enable fan speed, so search for "fan" as well.
  4. Select the State and choose Local. This enables the metric.
  5. Save the policy.

enable the metric

There you go. It's done. You will see something like the screenshot below. Indeed, a lot of sensors are tracked, so you know exactly which component is overheating.

metrics enabled

Now that you can see the metrics at the individual host level, troubleshooting a specific host becomes easier. But that is not enough for overall data center monitoring. Troubleshooting is when you already know there is a problem and have likely zoomed into a particular host. Monitoring is when you expect no problem and want to be told if one appears. So let's create a super metric. I'd advise a super metric at the entire physical data center level, although you can certainly track it per cluster or other object. In this example, I'm going to track it at the cluster level.

Below is the super metric I created. I track the maximum because I want to know if any host is affected. You need to create 2 super metrics if you want to track both Temperature and Fan Speed.

super metric
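To show why the maximum is the right aggregation for this kind of monitoring, here is a minimal Python sketch of what the super metric computes conceptually for each collection cycle. It is not vRealize Operations code or its super metric syntax; the host names, readings, and function names are hypothetical and only illustrate the max-across-hosts idea.

```python
from typing import Dict, List

# Hypothetical per-host readings for one collection cycle in one cluster.
# In vRealize Operations these would come from the host sensor metrics
# enabled in the policy; the names and values here are made up.
HostReadings = Dict[str, float]


def cluster_max(readings: HostReadings) -> float:
    """Return the highest value across all hosts in the cluster.

    Using max rather than average means a single hot host is not
    hidden by many cool ones.
    """
    if not readings:
        raise ValueError("no host readings for this cycle")
    return max(readings.values())


def cluster_max_series(cycles: List[HostReadings]) -> List[float]:
    """Apply cluster_max to each collection cycle, one value per cycle."""
    return [cluster_max(cycle) for cycle in cycles]


if __name__ == "__main__":
    temperature_c = [
        {"esxi-01": 38.0, "esxi-02": 41.0, "esxi-03": 39.0},
        {"esxi-01": 38.0, "esxi-02": 67.0, "esxi-03": 39.0},
    ]
    print(cluster_max_series(temperature_c))  # [41.0, 67.0]
```

In the second cycle the average would be 48.0, which hides the one host running at 67.0; the maximum surfaces it immediately.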

As shared in my other blog post, I always verify a super metric, and the preview feature comes in handy for this. Below is the preview for fan speed. Notice the value was a flat 0 before it went up. That's because I had just enabled the metric, so it was not collected before.

Max fan speed (%) in a cluster

Below is the super metric for Temperature.

Max host temperature in Celsius

Once saved, don't forget to enable the super metrics in the policy.

With that, you're done! The following screenshot shows stable values for both. You certainly do not want to see a sudden spike to a high number.

Result

You've got monitoring. I guess the next thing is alerting. The good news is that it is already enabled by default, so you do not have to do anything. The following screenshot shows the symptoms. vRealize Operations follows vCenter, so it has both Yellow and Red symptoms.

alert

What about the alert itself? vRealize Operations alerts are based on symptoms: the symptoms drive the alert, which makes sense. In our case, either the Red or the Yellow symptom will trigger the alert.

alert 2
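The symptom-to-alert relationship described above can be modelled in a few lines. This is a minimal Python sketch under stated assumptions, not the vRealize Operations API: the `Symptom` and `Alert` classes, the symptom names, and the use of a vCenter-style health colour ("green", "yellow", "red") as input are all hypothetical. It only shows the "any listed symptom fires the alert" behaviour.

```python
from dataclasses import dataclass
from typing import Set, Tuple

# Hypothetical model, not the vRealize Operations API: a symptom is a named
# condition that is active or not, and an alert lists the symptoms that
# can trigger it.


@dataclass(frozen=True)
class Symptom:
    name: str
    trigger_state: str  # assumed vCenter-style health colour, e.g. "yellow"

    def is_active(self, sensor_state: str) -> bool:
        return sensor_state == self.trigger_state


@dataclass(frozen=True)
class Alert:
    name: str
    symptoms: Tuple[Symptom, ...]

    def triggered_by(self, sensor_state: str) -> Set[str]:
        """Names of the symptoms that are currently active."""
        return {s.name for s in self.symptoms if s.is_active(sensor_state)}

    def fires(self, sensor_state: str) -> bool:
        # "Any" semantics: either the Yellow or the Red symptom is enough.
        return bool(self.triggered_by(sensor_state))


YELLOW = Symptom("Host temperature sensor is Yellow", "yellow")
RED = Symptom("Host temperature sensor is Red", "red")
OVERHEAT = Alert("Host is overheating", (YELLOW, RED))

if __name__ == "__main__":
    for state in ("green", "yellow", "red"):
        print(state, OVERHEAT.fires(state))
```

Running it prints `green False`, `yellow True`, and `red True`: a Green sensor raises nothing, while either symptom is enough to raise the alert.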

There you go. Hopefully it’s not so hot anymore in the DC 🙂
