Tag Archives: capacity

Released – VMware Performance and Capacity Management

Glad to share that the 2nd edition is finally out. It is now available for order. If you use Amazon, it’s here.

Considering both the amount of effort and the resulting book, this is to me more of a 2.0 than a 1.1. Page-wise, it is 500+ pages, double the 1st edition. The existing 8 chapters have been expanded and reorganized, resulting in 15 chapters.

It now has 3 distinct parts, whereas the 1st edition had none. The 3 parts are structured in that specific order to make it easier for you to see the big picture. You will find the key changes from the 1st edition below.

It’s surprising how much has changed in just 14 months. I certainly did not expect some of these changes back in January 2015!

  • Major improvements in monitoring beyond vSphere.
    • vRealize and its ecosystem have improved hugely in storage, network, and application monitoring. This includes newer technologies such as VSAN and NSX.
    • Many adapters (management packs) and content packs were released for both vRealize Operations and vRealize Log Insight. I’m glad to see a thriving ecosystem. Blue Medora in particular has moved ahead very fast.
  • Rapid adoption of NSX and VSAN meant I had to add them. They were not part of the original plan for the 2nd edition.
  • Rapid adoption of VDI monitoring using vRealize meant I had to include VDI use cases.
  • Adoption by customers, partners, and internal teams has increased.
    • In the original plan, I was not going to ask any partners to contribute, so I was surprised that 2 partners agreed right away.
    • It is much easier to ask for reviews, as people are interested and want to help.
  • vSphere 6.0, 6.0 U1, and 6.0 U2 were released.
    • Since the book focuses on operations (and not architecture), the impact of these releases is very minimal.
    • Very few counters have changed since vSphere 5.5.
  • vRealize Operations had 6.1 and 6.2 releases. Log Insight has had many releases too.
    • Again, this has minimal impact, since the book is not a product book.

You can find more details of the book here.

vCenter and vRealize counters – part 3

Storage

If you look at the ESXi and VM metric groups for storage in the vCenter performance chart, it is not clear at first glance how they relate to one another. You have the storage network, storage adapter, storage path, datastore, and disk metric groups to check. How do they impact one another?

I have created the following diagram to explain the relationship. The beige boxes are what you are likely to be familiar with: your ESXi host, which can have NFS Datastore, VMFS Datastore, or RDM objects. The blue-colored boxes represent the metric groups.

Book - chapter 4 - 01
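If you prefer to explore these metric groups programmatically, here is a minimal sketch using pyVmomi (the Python SDK for the vSphere API) that lists every storage-related counter exposed by vCenter's PerformanceManager. The hostname and credentials are placeholders; adjust them for your environment.

```python
# Minimal pyVmomi sketch: list the storage-related counters exposed by
# vCenter's PerformanceManager. Hostname and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()            # lab use only
si = SmartConnect(host="vcenter.example.com",     # hypothetical vCenter
                  user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
perf = si.RetrieveContent().perfManager

storage_groups = {"datastore", "disk", "storageAdapter",
                  "storagePath", "virtualDisk"}
for c in perf.perfCounter:
    if c.groupInfo.key in storage_groups:
        print("%s.%s.%s (%s)" % (c.groupInfo.key, c.nameInfo.key,
                                 c.rollupType, c.unitInfo.label))
Disconnect(si)
```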

NFS and VMFS datastores differ drastically in terms of counters, because NFS is file-based while VMFS is block-based. An NFS datastore uses vmnics, so the storage adapter types (FC, FCoE, or iSCSI) are not applicable. Multipathing is handled by the network, so you do not see it in the storage layer.

For VMFS or RDM, you have more detailed visibility into the storage. To start off, each ESXi storage adapter is visible, and you can check the counters for each of them. In terms of relationships, one adapter can have many devices (disk or CD-ROM). One device is typically accessed via two storage adapters (for availability and load balancing), and via two paths per adapter, with the paths diverging at the storage switch. A single path, which originates from a specific adapter, naturally connects that adapter to one device. The following diagram shows the four paths:

Book - chapter 4 - 02
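To see this adapter-device-path relationship on a live host, here is a hedged pyVmomi sketch that walks an ESXi host's storage topology. It assumes `host` is a `vim.HostSystem` you have already retrieved (for example, via a container view on the connection from the previous sketch).

```python
# Walk adapter -> device -> path for one ESXi host ('host' is an assumed
# vim.HostSystem retrieved from the connection in the previous sketch).
def print_storage_topology(host):
    sd = host.config.storageDevice
    # Map each LUN's internal key to its canonical name (e.g. naa.6000...).
    lun_names = {lun.key: lun.canonicalName for lun in sd.scsiLun}
    for lu in sd.multipathInfo.lun:
        print("Device:", lun_names.get(lu.lun, lu.lun))
        for p in lu.path:
            # Each path ties exactly one adapter (vmhbaN) to this device.
            print("  %-34s adapter=%s state=%s"
                  % (p.name, p.adapter, p.pathState))
```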

A storage path takes data from the ESXi host to the LUN (the term used by vSphere is Disk), not to the datastore. So if the datastore has multiple extents, there are four paths per extent. This is one reason why I did not use more than one extent, as each extent adds four paths. If you are not familiar with extents, Cormac Hogan explains them well in this blog post.

For VMFS, you can see the same counters at both the Datastore level and the Disk level. Their values will be identical if you follow the recommended configuration of a 1:1 relationship between a datastore and a LUN, meaning you present the entire LUN to a datastore (using all of its capacity).
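If you want to verify the 1:1 recommendation across your environment, a quick pyVmomi sketch like the following (same connection assumptions as before) flags any VMFS datastore backed by more than one extent.

```python
from pyVmomi import vim

# Flag VMFS datastores that span more than one extent (i.e. more than one
# backing LUN). 'datastores' is an assumed list of vim.Datastore objects.
def multi_extent_datastores(datastores):
    for ds in datastores:
        info = ds.info
        if isinstance(info, vim.host.VmfsDatastoreInfo):
            extents = info.vmfs.extent
            if len(extents) > 1:
                print("%s spans %d extents: %s" % (
                    ds.name, len(extents), [e.diskName for e in extents]))
```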

The following screenshot shows how we manage ESXi storage. Click on the ESXi host you need to manage, select the Manage tab, and then the Storage subtab. In this subtab, we can see the adapters, devices, and the host cache. The screen shows an ESXi host with the list of its adapters. I have selected vmhba2, which is an FC HBA. Notice that it is connected to 5 devices. Each device has 4 paths, so I have 20 paths in total.

ESX - Adapter

Let’s move on to the Storage Devices tab. The following screenshot shows the list of devices. Because NFS is not a disk, it does not appear here. I have selected one of the devices to show its properties.

ESXi - device

If you click on the Paths tab, you will be presented with the information shown in the next screenshot, including whether a path is active. Note that not all paths carry I/O; it depends on your configuration and multipathing software. Because each LUN typically has four paths, path management can be complicated if you have many LUNs.

ESXi - device path
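When you have many LUNs, a summary is friendlier than clicking through each device's Paths tab. As a rough sketch under the same pyVmomi assumptions, path states can be tallied per host:

```python
# Tally total vs. active paths for one host. pathState is typically one of
# "active", "standby", "disabled", or "dead". Same assumptions as before.
def path_summary(host):
    paths = [p for lu in host.config.storageDevice.multipathInfo.lun
             for p in lu.path]
    active = sum(1 for p in paths if p.pathState == "active")
    print("%s: %d paths, %d active" % (host.name, len(paths), active))
```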

The story is quite different at the VM layer. A VM does not see the underlying shared storage; it sees local disks only. So regardless of whether the underlying storage is NFS, VMFS, or RDM, it sees all of them as virtual disks. You lose visibility into the physical adapter (for example, you cannot tell how many IOPS on vmhba2 are coming from a particular VM) and the physical paths (for example, how many disk commands traveling on a given path are coming from a particular VM). You can, however, see the impact at the Datastore level and the physical Disk level. The Datastore counter is especially useful. For example, if you notice that your IOPS is higher at the Datastore level than at the virtual Disk level, it means you have a snapshot. The snapshot I/O is not visible at the virtual Disk level, as the snapshot is stored on a different virtual disk.

Book - chapter 4 - 03
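One practical way to apply this is to compare a VM's Datastore-level IOPS with its virtual Disk-level IOPS. The sketch below does this with vCenter's real-time (20-second) statistics, using the standard counter keys `datastore.numberReadAveraged.average` and `virtualDisk.numberReadAveraged.average`; `perf` and `vm` are assumed to be the PerformanceManager and a `vim.VirtualMachine` from the earlier sketches. A sustained gap (datastore higher than virtual disk) suggests snapshot I/O.

```python
from pyVmomi import vim

# Build a "group.name.rollup" -> counter key lookup once.
ids = {"%s.%s.%s" % (c.groupInfo.key, c.nameInfo.key, c.rollupType): c.key
       for c in perf.perfCounter}

def avg_iops(vm, counter_name, samples=15):
    """Rough average over all instances and samples of one counter."""
    mid = vim.PerformanceManager.MetricId(counterId=ids[counter_name],
                                          instance="*")
    spec = vim.PerformanceManager.QuerySpec(entity=vm, metricId=[mid],
                                            intervalId=20, maxSample=samples)
    values = [v for r in perf.QueryPerf(querySpec=[spec])
              for series in r.value for v in series.value]
    return sum(values) / max(len(values), 1)

ds = avg_iops(vm, "datastore.numberReadAveraged.average")
vd = avg_iops(vm, "virtualDisk.numberReadAveraged.average")
print("read IOPS: datastore %.1f vs virtual disk %.1f" % (ds, vd))
```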

Network

My apologies that I cannot publish the information on Network, as it is not among the free pages provided by the publisher. The information is covered in my book.

vCenter and vRealize counters – part 2

Compute

The following diagram shows how a VM gets its resources from the ESXi host. Unlike a physical server, a VM is given resources dynamically; they are not static. Contention, Demand, and Entitlement are concepts that do not exist in the physical world. It is a pretty complex diagram, so let me walk you through it.

Book - chapter 3 - 02

The tall rectangular area represents a VM. Say this VM is configured with 8 GB of virtual RAM. The bottom line represents 0 GB and the top line represents 8 GB. We call this configured capacity Provisioned. It is what the Guest OS sees, so if the VM is running Windows, you will see 8 GB of RAM when you log into Windows.

Unlike on a physical server, you can configure a Limit and a Reservation. This is done outside the Guest OS, so Windows or Linux is not aware of it. You should minimize the use of Limit and Reservation, as they make operations more complex.
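As a quick illustration (same pyVmomi assumptions as in the storage section), Limit and Reservation live in the VM's resource allocation settings, not in the Guest OS:

```python
# Read Provisioned, Limit and Reservation for a VM ('vm' assumed as before).
# These settings sit outside the Guest OS; a limit of -1 means unlimited.
cpu = vm.config.cpuAllocation          # vim.ResourceAllocationInfo
mem = vm.config.memoryAllocation
print("Provisioned RAM : %d MB" % vm.config.hardware.memoryMB)
print("CPU reservation=%d MHz, limit=%d MHz" % (cpu.reservation, cpu.limit))
print("RAM reservation=%d MB,  limit=%d MB" % (mem.reservation, mem.limit))
```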

Entitlement means what the VM is entitled to. In this example, the hypervisor entitles the VM to a certain amount of memory. I did not show a solid line, and used an italic font style, to mark that Entitlement is not a fixed value but a dynamic value determined by the hypervisor. It varies every minute, determined by the Limit, Shares, and Reservation of the VM itself and the shared allocation with other VMs running on the same host.

Obviously, a VM can only use what it is entitled to at any given point in time, so the Usage counter does not go higher than the Entitlement counter. The green line shows that Usage ranges from 0 to the Entitlement value.

In a healthy environment, the ESXi host has enough resources to meet the demands of all the VMs on it with sufficient overhead. In this case, you will see that the Entitlement, Usage, and Demand counters are similar to one another when the VM is highly utilized. This is shown by the green line, where Demand stops at Usage, and Usage stops at Entitlement. The numerical values may not be identical, because vCenter reports Usage in percentage as an average over the sample period, reports Entitlement in MHz taking the latest value in the sample period, and reports Demand in MHz as an average over the sample period. This also explains why you may see Usage slightly higher than Entitlement on a highly utilized vCPU. If the VM has low utilization, you will see that the Entitlement counter is much higher than Usage.
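To see those unit and rollup differences yourself, the following sketch (reusing the `ids` lookup, `perf`, and `vm` from the storage example) pulls the three counters for one VM. Note that vCenter returns percentage counters such as `cpu.usage.average` in hundredths of a percent.

```python
# Pull Usage, Demand and Entitlement for one VM. Instance "" is the
# VM-level aggregate. cpu.usage.average arrives in 1/100 of a percent;
# the other two are in MHz.
for name in ("cpu.usage.average", "cpu.demand.average",
             "cpu.entitlement.latest"):
    mid = vim.PerformanceManager.MetricId(counterId=ids[name], instance="")
    spec = vim.PerformanceManager.QuerySpec(entity=vm, metricId=[mid],
                                            intervalId=20, maxSample=1)
    for result in perf.QueryPerf(querySpec=[spec]):
        for series in result.value:
            print("%-25s %s" % (name, series.value))
```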

An environment in which the ESXi host is resource-constrained is unhealthy, as the host cannot give every VM the resources it asks for. The VMs demand more than they are entitled to use, so the Usage and Entitlement counters will be lower than the Demand counter. The Demand counter can naturally go higher than Limit. For example, if a VM is limited to 2 GB of RAM and it wants to use 14 GB, then Demand will exceed Limit. Obviously, Demand cannot exceed Provisioned, which is why the red line stops at Provisioned; that is as high as it can go.

The difference between what the VM demands and what it gets to use is the Contention counter. Contention is a special counter that tracks all this competition for resources. It is a counter that only exists in the virtual world.

So Contention, simplistically speaking, is Demand – Usage. I say simplistically, as that is not the actual formula. The actual formula does not really matter for all practical purposes, as it is all relative to the expectation you have set with your customers (the VM owners).
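To make the simplification concrete, here is the back-of-envelope version with hypothetical numbers (again, this is not vR Ops' actual formula):

```python
# Simplistic contention, per the text: Demand - Usage (both in MHz).
# The numbers are hypothetical samples, not from a real environment.
demand_mhz = 4000     # e.g. a cpu.demand.average sample
usage_mhz = 3000      # e.g. a cpu.usagemhz.average sample
contention_mhz = max(demand_mhz - usage_mhz, 0)
print("approximate contention: %d MHz" % contention_mhz)   # 1000 MHz
```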

Contention happens when what the VM demands is more than it gets to use. So if the Contention is 0, the VM can use everything it demands. This is the ultimate goal, as performance will match the physical world. This Contention value is useful to demonstrate that the infrastructure provides a good service to the application team. If a VM owner comes to see you and says that your shared infrastructure is unable to serve his or her VM well, both of you can check the Contention counter.

The Contention counter should become a part of your Performance SLA or Key Performance Indicator (KPI). It is not sufficient to track utilization alone. When there is contention, it is possible that both your VM and ESXi host have low utilization, and yet your customers (VMs running on that host) perform poorly. This typically happens when the VMs are relatively large compared to the ESXi host. Let me give you a simple example to illustrate this. The ESXi host has two sockets and 20 cores. Hyper-threading is not enabled to keep this example simple. You run just 2 VMs, but each VM has 11 vCPUs. As a result, they will not be able to run concurrently. ESXi VMkernel will schedule them sequentially as there are only 20 physical cores to serve 22 vCPUs. Here, both VMs will experience high contention.
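One raw vSphere counter where this kind of contention shows up is CPU ready time. `cpu.ready.summation` is reported in milliseconds per sample, so for a 20-second real-time sample the usual conversion is ready milliseconds divided by (20,000 ms × number of vCPUs). A worked example with a hypothetical reading:

```python
# Convert cpu.ready.summation (ms per 20 s sample) to a percentage.
# 8,800 ms of ready time across an 11-vCPU VM in one 20 s interval:
ready_ms = 8800                          # hypothetical sample
vcpus = 11                               # as in the example above
ready_pct = ready_ms / (20000.0 * vcpus) * 100
print("CPU ready: %.1f%%" % ready_pct)   # 4.0%
```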

Hold on! You might say, “There is no Contention counter in vSphere and no memory Demand counter either.”

This is where vR Ops comes in. It does not just regurgitate the values in vCenter. It has implicit knowledge of vSphere and a set of derived counters with formulae that leverage that knowledge.

You need to have an understanding of how the vSphere CPU scheduler works. The following diagram shows the various states that a VM can be in:

Book - chapter 3 - 03

The preceding diagram is taken from The CPU Scheduler in VMware vSphere® 5.1: Performance Study. This is a whitepaper that documents the CPU scheduler with a good amount of depth for VMware administrators. I highly recommend you read this paper, as it will help you explain to your customers (the application team) how your shared infrastructure juggles all those VMs at the same time. It will also help you pick the right counters when you create your custom dashboards in vRealize Operations.