The storage latency data that vCenter provides is a 20-second average. With vRealize Operations, and other management tools that keeps the data much longer, it is a 5-minute average. 20 seconds may seem short, but if the storage is doing 10K IOPS, that is an average of 10,000 x 20 = 200,000 read or writes. An average of 200,000 numbers will certainly hide the peak. If you have 100 reads or writes that experienced bad latency, but the remaining 199,900 operations returned fast, that poor 100 operations will be hidden.
With esxtop, it can get down to 2 seconds. That’s great, but we need to do it per ESXi. If you have a large farm, it means logging into every ESXi host. This will also have impact on performance, so you do not want to do it all the time, 24 x 7 x 365 days. There is also an issue of presenting that data in 1 screen.
ESXi vmkernel actually logs if the storage latency gets worse or improve. It does this out of the box, so nothing we need to enable. This log entry is not an average. No, it is not logging every single IO the vmkernel issues. That would generates way too many logs. It only exists when the situation improves or deteriorates. This is what you want anyway. As a bonus, the data is in microseconds, so it’s also more granular if you need something more accurate than ms.
This is where vRealize Log Insight comes in. With just a few clicks, I got the screenshot below. Notice that Log Insight already has a variable (field) for VMware ESXi SCSI Latency. So all I needed to do is to get all the log entries where this field exist. From there, it’s a matter of presentation. In the following screenshot, I grouped them by the Device ID (LUN). I only have 1 device that has this issue in the past, so apology for the poor example.
The line chart is an average. You can take the Max if you want to see the worst. I’ve zoomed the chart to a narrow window of just 2:15 minutes (from 18:06:00 to 18:08:15 hours), and each data point represents 1 second. So that is 1 second granularity. 20x better than the 20 seconds average you get from vCenter.
What do they look like when you have multiple LUNs, and you have storage issue? Here is an example. I have grouped them by Device as it’s easier to discuss with storage admin if you can tell the Device ID.
Here is another example. This time the latency went up beyond 20,000 ms!
Beside the line chart, you can also create table. Log Insight automatically shows the field above. All I did was hiding some (14 to be exact, as shown next to the Columns). I find the table useful to complement the chart.
You may be curious about what field is. Log Insight comes with many out of the box fields. They are provided by the content pack. To see the default fields, just duplicate it like what I did below. Log Insight will automatically highlight in green, how it determines the field. In the example below, it parses the string “microseconds to” and “microseconds” and the value in between the strings is extracted as field.
You can also set the type of field. In this case, it specifies the field as Integer.
You may think…. why not search for the string “latency”, so we get everything? Well… this is what I got. There are many log entries with the word latency in it that is not relevant. I have 169 entries below.
The following screenshot shows more examples of the log entries.