
Mastering Storage Capacity: the Disk Space metrics

As I shared in my book, it is critical to know what the counters in vCenter and vRealize Operations actually mean. This enables you to pick the right counters for the right purpose, and it leads to correct interpretation.

In this article, let’s take storage disk space capacity. I added the word space because storage has 2 capacities: IOPS capacity and disk space capacity.

Let’s start with vCenter, as that’s the source and foundation. I’m using a datastore cluster, which has 3 datastores. Each has 1 TB, mapped to a 1 TB LUN. Let’s verify what contributes to the Free space column.

[Screenshot: the datastores' usage]

To do that, let’s add up all the VMs. Hmm… they do not add up to what I saw at the datastore level. Something does not tally.

Can you guess the 4 reasons contributing to this discrepancy?

[Screenshot: the VM usage, with thin and thick provisioning shown]

Let’s browse the datastore. Here we find the first reason: I have non-VM objects. In this case, they are ISO files.

[Screenshot: Datastore 2 has non-VM files]

I mentioned that there are 4 reasons. Can you guess the other 3? The following screenshot explains the next 2.

[Screenshot: orphaned files; clean up your datastores]

The following screenshot shows the 4th reason: that particular VM has its CD-ROM coming from another datastore. Once I addressed all 4 reasons, the Total column makes more sense.

[Screenshot: the total adds up now]

The totals now tally. This confirms what I suspected: the Free column is based on thin provisioning, as the next screenshot shows.

[Screenshot: Used is based on thin provisioning, not thick]

Now that we know exactly what values we have in vCenter, we can go to vRealize Operations and pick the metrics that match what we see in vCenter. This normally involves some trial and error. Here are the counters you should use:

Let’s review the counters further. I added a 200 GB thin-provisioned vmdk and a 100 GB thick-provisioned vmdk, so the total added is 300 GB. vRealize Operations showed this in the above. The Used Space (GB) metric went up by only 100 GB, proving that it is based on thin provisioning. The Total Provisioned Consumer Space (GB) metric went up by the full 300 GB.
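To make the arithmetic concrete, here is a minimal sketch in Python. This is not vRealize Operations code; it simply models the rule that a thick vmdk consumes its full size immediately, while a thin vmdk consumes almost nothing until the guest writes data:

  disks = [
      {"provisioned_gb": 200, "type": "thin"},   # the new 200 GB thin vmdk
      {"provisioned_gb": 100, "type": "thick"},  # the new 100 GB thick vmdk
  ]

  # Only thick vmdks add to actual consumption on day 1.
  used_delta = sum(d["provisioned_gb"] for d in disks if d["type"] == "thick")
  provisioned_delta = sum(d["provisioned_gb"] for d in disks)

  print(used_delta)         # 100 -> Used Space (GB) goes up by 100 GB
  print(provisioned_delta)  # 300 -> Total Provisioned Consumer Space (GB) goes up by 300 GB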

Do not use the following counters, as their collection is less frequent:

  • Disk Space | Freespace (GB)
  • Disk Space | Total Used (GB)
  • Disk Space | Provisioned Space (GB)

As you can see below, their values are correct, but they do not get updated as frequently.

[Screenshot: less frequent update]

Summary:

  • To see the total capacity in your datastore, use Capacity | Total Capacity (GB)
  • To see the space consumed in your datastore, use Capacity | Used Space (GB)
    • If you prefer to see the consumption number in %, use Capacity | Used Space (%) 
  • To see the free space in your datastore, use Capacity | Available Space (GB)

Now, the above is based on thin provisioning numbers. If you are doing your planning based on the thick provisioning number, use Capacity | Total Provisioned Consumer Space (GB). But take note that this number does not include non-VM files (e.g. ISO) and VMs that are not registered in vCenter. The following screenshot proves this.

[Screenshot: thick provisioning excludes non-VM files and unregistered VMs]

The above works well for a datastore. What about at the Datastore Cluster level, since this is where you should be doing your capacity management?

There are fewer counters at this level, so we need to use super metrics; the sketch after the following list shows the arithmetic they perform.

  • To see the total capacity in your datastore cluster, create a super metric:
    Sum (Datastore: Capacity | Total Capacity (GB))
  • To see the space consumed, based on thin provisioning, use Disk Space | Total Used (GB)
    • To see the free space, based on thin provisioning, create a super metric:
      Sum (Datastore: Capacity | Available Space (GB))
  • To see the total space consumed, based on thick provisioning, create a super metric:
    Sum (Datastore: Capacity | Total Provisioned Consumer Space (GB))
    • To see the free space, based on thick provisioning, create a super metric:
      Sum (Datastore: Capacity | Total Capacity (GB)) - Sum (Datastore: Capacity | Total Provisioned Consumer Space (GB))
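If it helps to see the arithmetic these super metrics perform, here is a minimal sketch in Python. The per-datastore values are made-up numbers; in vRealize Operations you would build the equivalent formulas in the super metric editor:

  # Made-up per-datastore values (GB) for a 3-datastore cluster.
  total_capacity  = [1024, 1024, 1024]  # Capacity | Total Capacity (GB)
  available_space = [400, 550, 300]     # Capacity | Available Space (GB)
  provisioned     = [900, 700, 1100]    # Capacity | Total Provisioned Consumer Space (GB)

  cluster_capacity = sum(total_capacity)                 # 3072 GB
  thin_free        = sum(available_space)                # 1250 GB free, based on thin
  thick_consumed   = sum(provisioned)                    # 2700 GB consumed, based on thick
  thick_free       = cluster_capacity - thick_consumed   # 372 GB left for thick planning

  print(cluster_capacity, thin_free, thick_consumed, thick_free)

Note that thick_free can go negative when the cluster is over-provisioned, which is itself a useful signal for capacity planning.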

A storage array built purposely for VMs only

VMware VSAN demonstrates that you can make storage easier to set up for VMware Administrators. It’s directly integrated with vSphere, with no additional installation. It is purpose-built for vSphere.

VSAN is a form of distributed storage, an architectural departure from the central array. I certainly like the idea that my data is not put in 1 basket. However, there are thousands of arrays out there, so there are bound to be customers who want or need a central array. What if your physical storage array were purpose-built for VMs? This means it does not support physical servers. Nope, it does not support RDM either. Only VMs and their vmdk files. What complexity can be taken away, making it easier for the VMware Admin to master?

What I like about such a storage array is not just what is there, but what is not there.

I had the pleasure of playing with a Tintri T820. They kindly loaned a unit to the VMware ASEAN Lab. It’s a 4RU box. Racking and cabling took only a few minutes. Both Tintri and PTC (our joint reseller) were on site when we set it up, but they let me do the bulk of the setup. No, I did not want to read the manual, as I wanted to see how easy it is.

There is no complex cabling connecting the different components of the array. It just has 2 power cables, 2 10GE cables for data, and a few 1GE cables for replication and management. That’s it! It reminds me of an ESXi server.

We booted it up. To my surprise, GRUB loaded, and then Linux, as you can see below. It feels like a server.

[Photo: GRUB and Linux loading on the console]

We waited for it to load. It then prompted the following screen, where I keyed in the IP address and host name. And that’s all it asked for.

[Photo: console prompt for IP address and host name]

Once we had the IP address, everything was web based, as shown below. There is a one-time setup, which is longer, but the questions are pretty straightforward.

[Screenshot: the web-based one-time setup]

And apparently we were done with the configuration! I was then asked to create an NFS datastore. Tintri’s recommendation is to present 1 datastore, so I created one as per the screen below. From that screen, you can see I have 3 vCenter Servers. That means the same Tintri datastore is mounted 3 times.

Once it was mounted, I started migrating VMs to it, so I could see some data in Tintri.

[Screenshot: datastore creation, mounted by 3 vCenter Servers]

Using a web browser, we log in to Tintri. No annoying Java required. Also, no installation of management software. Just a browser. Below is the main screen. The Dashboard tab is the main tab. I had already loaded many VMs, as you can see on the right side of the screen.

[Screenshot: Tintri main screen, Dashboard tab]

Look at the above screen carefully. Can you see what’s not there? There are none of the usual storage concepts or objects such as LUNs, volumes, slices, pools, aggregates, etc. There is also no choice of interface or protocol. There is no FC, FCoE, or iSCSI. Only NFS. The entire array is 1 giant pool presenting 1 datastore. This is like VSAN, as VSAN also presents 1 datastore.

The next tab is “Virtual Machine”. It has a few sub-tabs, which allow you to drill down. The columns on the screen are sortable, and you can also filter each column. In the following screen, I have sorted the table by Latency.

[Screenshot: Virtual Machine tab, sorted by Latency]

You can drill down into Virtual Disk! In the screenshot below, I again sorted by latency.

If you look at the Provisioned column, you will see an example of filtering.

[Screenshot: Virtual Disk drill-down, with filtering on the Provisioned column]

The next tab is VMstore. Here you can see performance and other information. My throughput was capped at 1 Gb, as the ISL between the 2 switches in the lab is 1 Gb.

[Screenshot: VMstore tab, performance]

If you have multiple Tintri boxes, you can see them all on 1 dashboard. You need to deploy a vSphere appliance for this; again, it took only a few minutes to set up. Below is a screenshot from Tintri Global Center.

[Screenshot: Tintri Global Center dashboard]

You might have noticed the Settings link near the top right of the screen. If you guessed this is where you edit all the settings, you are right. It is just a dialog box, and most of the settings are pretty self-explanatory.

[Screenshot: the Settings dialog box]

I almost forgot we are dealing with a physical array here! You can bring up a hardware tab. From here, you can see that the box has 24 physical disks, with 14 (yes, 14) of them being SSDs. You can also see that it has 3 networks (Management, Data, Replication).

[Screenshot: hardware tab showing disks and networks]

No setup is complete without configuring syslog, as we need to analyse the logs. From the Settings dialog box, you configure the syslog target. I specified my Log Insight, and you can see Tintri sending its logs. I’d also like to see the performance logs, and I’ve asked Tintri if this can be done.

[Screenshot: Tintri logs arriving in Log Insight]

And that’s basically it. I’m deeply impressed with its simplicity.

vSphere Storage Latency – View from the vmkernel

The storage latency data that vCenter provides is a 20-second average. With vRealize Operations, and other management tools that keep the data much longer, it is a 5-minute average. 20 seconds may seem short, but if the storage is doing 10K IOPS, that is an average of 10,000 x 20 = 200,000 reads or writes. An average of 200,000 numbers will certainly hide the peak. If you have 100 reads or writes that experienced bad latency, but the remaining 199,900 operations returned fast, those poor 100 operations will be hidden.
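A quick back-of-envelope calculation in Python shows how completely the average buries those 100 slow operations (the latency numbers are illustrative):

  # 10,000 IOPS x 20 seconds = 200,000 operations in one vCenter sample.
  slow_ops, slow_ms = 100, 100.0      # 100 operations with terrible latency
  fast_ops, fast_ms = 199_900, 1.0    # the rest return fast

  average_ms = (slow_ops * slow_ms + fast_ops * fast_ms) / (slow_ops + fast_ops)
  print(round(average_ms, 2))  # ~1.05 ms; the 100 bad operations vanish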

With esxtop, you can get down to 2 seconds. That’s great, but we need to do it per ESXi host. If you have a large farm, that means logging into every ESXi host. It also has an impact on performance, so you do not want to run it all the time, 24 x 7 x 365. There is also the issue of presenting all that data on 1 screen.

The ESXi vmkernel actually logs when storage latency gets worse or improves. It does this out of the box, so there is nothing we need to enable. This log entry is not an average. No, it does not log every single IO the vmkernel issues; that would generate way too many logs. An entry only appears when the situation improves or deteriorates, which is what you want anyway. As a bonus, the data is in microseconds, so it is also more granular if you need something more accurate than milliseconds.
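For reference, such an entry looks roughly like this (the device ID is shortened and the values are illustrative; the exact wording can vary between ESXi releases):

  Device naa.6000… performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 36815 microseconds.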

This is where vRealize Log Insight comes in. With just a few clicks, I got the screenshot below. Notice that Log Insight already has a variable (field) for VMware ESXi SCSI Latency. So all I needed to do was get all the log entries where this field exists. From there, it’s a matter of presentation. In the following screenshot, I grouped them by Device ID (LUN). Only 1 of my devices has had this issue in the past, so apologies for the poor example.

The line chart is an average. You can take the Max if you want to see the worst. I’ve zoomed the chart to a narrow window of just 2 minutes 15 seconds (from 18:06:00 to 18:08:15), and each data point represents 1 second. That is 1-second granularity, 20x better than the 20-second average you get from vCenter.

[Screenshot: storage latency chart]

What does it look like when you have multiple LUNs and a storage issue? Here is an example. I have grouped them by device, as it is easier to discuss with the storage admin if you can tell them the Device ID.

[Screenshot: max storage latency by LUN]

Here is another example. This time the latency went up beyond 20,000 ms!

[Screenshot: max storage latency by LUN, second example]

Besides the line chart, you can also create a table. Log Insight automatically shows the fields above. All I did was hide some of them (14 to be exact, as shown next to Columns). I find the table a useful complement to the chart.

You may be curious about what a field is. Log Insight comes with many out-of-the-box fields, provided by the content packs. To see how a default field is defined, just duplicate it like I did below. Log Insight will automatically highlight in green how it determines the field. In the example below, it looks for the strings “microseconds to” and “microseconds”, and the value between the strings is extracted as the field.

You can also set the type of the field. In this case, the field is specified as an Integer.

[Screenshot: the storage latency field definition]
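If you ever need to replicate this extraction outside Log Insight, say in a quick script, a regular expression does the same job. Here is a minimal sketch in Python; the log line is a representative example based on the entries above:

  import re

  # The same extraction Log Insight performs: grab the integers around
  # "microseconds to", i.e. the old and new latency values.
  LATENCY_RE = re.compile(
      r"from average value of (\d+) microseconds to (\d+) microseconds"
  )

  line = ("Device naa.6000 performance has deteriorated. I/O latency "
          "increased from average value of 1832 microseconds to 36815 microseconds.")

  m = LATENCY_RE.search(line)
  if m:
      old_us, new_us = (int(v) for v in m.groups())
      print(old_us, new_us)  # 1832 36815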

You may think… why not just search for the string “latency”, so we get everything? Well… this is what I got. There are many log entries with the word latency in them that are not relevant. I have 169 entries below.

[Screenshot: searching for the text “latency”]

The following screenshot shows more examples of the log entries.

[Screenshot: more log entries containing the word “latency”]

That’s all. Hope you find it useful. If you do, you may find this and this useful.