In general, you know that you’ve done a good job with your vSphere IaaS because the VM Owners are happy with the performance of their VMs. Business is powered by the VMware infrastructure that you design and operate.
What does the logs of vSphere say? Is there anything lurking in the log files?
As VMware professionals, we know vSphere well and probably have years of experience on it. We can architect, design, implement, upgrade, and even troubleshoot it.
The same thing cannot be said with the logs. Generally speaking, the deep knowledge of vSphere logs belongs to VMware GSS engineers, as they read logs on daily basis, performing all kinds of troubleshooting. That knowledge has been slowly codified into Log Insight. I’m not sure which engineers are doing this great job, but Steven Flanders will be a good starting point. If you have any feedback on the Content Pack, drop him a note.
BTW, if you are new to Log Insight, there are many bloggers who have written good articles on Log Insight. Examples are VMware Arena (vSphere), VMguru (NSX integration), Cody (variety of articles). BlueShift has also written a good overview here.
Back to the question: what does the log of vSphere say about its health?
This is where the vSphere content pack comes in. As you can see from the following screenshot, it comes with many out of the box dashboards, queries, alerts and field. I’d encourage to review the definition, and not just simply look at dashboard content.
Going through the above can be daunting, as it is indeed a lot. Yes, vSphere has a lot of logs, and Log Insight merely reflects that breadth of information. One approach that works for me is Search. I searched for the particular thing I am after. For example, in the following screen, I searched for vMotion. The result told me what information I can get about vMotion.
You might be wondering how to use the variables, or Field as Log Insight calls it. Well, you don’t have to do anything, as they are automatically appears in the drop down field. All you need to do is to type the name. I type vmw_ in the following screenshot as all the VMware specific content packs follow this naming convention. Notice the variable name as the Content Pack name in bracket. This lets you know which content pack is contributing to that field. Nice!
You will eventually create your own field. Please have a naming convention. I typically use prefix of the company name.
Now that you understand the basic, let’s review some of the dashboards.
The first one answers these questions: What vSphere alarms are we getting? How often do they happen? When do they happen? Is there a pattern? Which hosts, cluster, etc get it?
I plot the above for just 1 day. We can tell my small environment hit by what alarms and when. There was a spike at 10 pm.
Do you notice how easy it was to produce the dashboard? All it takes was 2 built-in variables. Log Insight has created vc_event_type and vmw_vc_alarm_type fields. The dashboard was built by simply choosing they exists. So long there is a value, it’s counted as exist.
The above chart was grouped by type of alarm. What if you are looking at your customers, and want to be grouped by VM?
This is where another built-in field comes in. The field vm_vm_name identifies the VM name.
The preceding 2 screenshots are individual charts. It’s in the Interactive Analytics mode of Log Insight. Notice at the top of the screen, there are Dashboards and Interactive Analytics. The dashboards is where you will come first in your day to day operations.
You will eventually build your own dashboard, picking what you like from the out of the box dashboards. The good thing about your own is they won’t be superseded during Log Insight upgrade.
The dashboard below shows the vSphere alarms. From here, you can drill into any of the widgets. I normally use existing dashboard as starting point, drilled down into one of the widget, customise it, and convert the chart into a widget in my own custom dashboard.
Let’s move into Storage, as that’s one area VMware Admin have to deal with. Regardless of storage platform, ESXi is the one executing the IO on the VM behalf. You can track errors by device, hostname, and path. You can also track the latency as seen by the VM kernel. Knowing the latency at hypervisor level complements the info you have at VM level and Guest OS level.
I put the word “Regardless” in red, because it is not actually regardless. There is a situation where ESXi cannot provide you with the above info. Can you figure out what situation is that?
Yup, it is the distributed storage architecture. Specifically, when the physical disk is directly pass through to the Storage VM as PCI device (not RDM). In this case, the hypervisor does not see it. Only a monitoring tool that knows how that specific product works can monitor it. vCenter and vRealize Operations cannot help you. My point here is you need to know how your things works first, before you can monitor it properly.
You might notice on the preceding dashboard the latency widget show no result. That’s because the setting was set at 1 second. That’s 1000 ms. You can easily change it. In the dashboard below, I’ve changed the field vmw_esxi_scsi_latency from >1000000 to just exists. I also plot the Minimum and Maximum, so I can see the variation.
If I’m interested in just specific device, I can filter it. I can also change the chart type. I can also display only Maximum, so it’s easier to see.
Ok, that was Storage.
Let’s move to DRS and HA. The dashboard below gives an overall picture of DRS and HA events. I can see that during the time period I specified, all my clusters were balanced. However, I have quite a number VM heartbeat issue. [Yes, the product team is aware that the chart legend says red but it’s showing blue]
What about the speed of vMotion? How much bandwidth does it take? Did it utilize my 10Gb Ethernet well?
From the above, my precopy stun time is on the high side. In a healthy environment, this number would be <300 ms. The bandwidth would also be much higher than that. This is a lab, and I know the physical connectivity.
I think you’ve got the idea of how Log Insight helps you. Let’s do just 1 example at VM level. This dashboard answer a specific question: Did any VM hit high CPU Usage? If yes, which VM and when?
We covered quite a fair of things. They are all something familiar to you. Storage, Cluster, VM, etc.
Now…. your CIO may ask: what about something we don’t know? Is there any errors, warning, timeout, abort, etc. that we need to know?
Below is the built-in query that gives you that. It’s a pretty complex query 🙂
An example of the result is below. I have grouped them by ESXi host
Hope you find it useful. Deploy it (free for 25 OSI), do the health check, and let me know what you found! You might be surprised, and have to cancel that vacation 🙂 🙂