Monthly Archives: January 2015

vCenter and vRealize counters – part 1

This blog post is adapted from my book, titled VMware vRealize Operations Performance and Capacity Management. It is published by Packt Publishing.

vSphere 6 comes with many counters, many more than a physical server provides. There are new counters that do not have a physical equivalent, such as memory ballooning, CPU latency, and vSphere Replication. In addition, some counters have the same name as their physical world counterpart but behave differently in vSphere. Memory usage is a common one, resulting in confusion among system administrators. For those counters that are similar to their physical world counterparts, vSphere may use different units, such as milliseconds.

As a result, experienced IT administrators find it hard to master vSphere counters by building on their existing knowledge. Instead of trying to relate each counter to its physical equivalent, I find it useful to group them according to their purpose. Virtualization formalizes the relationship between the infrastructure team and application team. The infrastructure team changes from system builder to service provider. The application team no longer owns the physical infrastructure.

The application team becomes a consumer of a shared service: the virtual platform. Depending on the Service Level Agreement (SLA), the application team can be served as if they have dedicated access to the infrastructure, or they can take a performance hit in exchange for a lower price. For SLAs where performance matters, the VM running in the cluster should not be impacted by any other VMs. The performance must be as good as if it were the only VM running on the ESXi host.

Because there are two different counter users, there are two different purposes.

  • The application team (developers and the VM owner) only cares about their own VM.
  • The infrastructure team has to care about both the VM and infrastructure, especially when they need to show that the shared infrastructure is not a bottleneck.

One set of counters is to monitor the VM; the other set is to monitor the infrastructure. The following diagram shows the two different purposes and what we should check for each. By knowing what matters on each layer, we can better manage the virtual environment.

Book - chapter 3 - 01

At the VM layer, we care whether the VM is being served well by the platform. Other VMs are irrelevant from the VM owner’s point of view. A VM owner only wants to make sure his or her VM is not contending for a resource. So the key counter here is contention. Only when we are satisfied that there is no contention can we proceed to check whether the VM is sized correctly or not. Most people check for utilization first because that is what they are used to monitoring in the physical infrastructure. In a virtual environment, we should check for contention first.

At the infrastructure layer, we care whether it serves everyone well. Make sure that there is no contention for resources among all the VMs in the platform. Only when the infrastructure is clear of contention can we troubleshoot a particular VM. If the infrastructure is having a hard time serving the majority of the VMs, there is no point troubleshooting a particular VM.
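The order of checks described above can be sketched in a few lines of Python. This is only an illustration: the `VmSample` type, the `triage` function, and every threshold are hypothetical names and values chosen for the example, not vCenter or vRealize Operations counters or defaults.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only; tune them to your own SLA.
CONTENTION_LIMIT_PCT = 5.0   # assumed acceptable contention for a VM
MAJORITY_FRACTION = 0.5      # "the majority of the VMs"
UNDERSIZED_UTIL_PCT = 90.0   # assumed sign of an undersized VM
OVERSIZED_UTIL_PCT = 20.0    # assumed sign of an oversized VM


@dataclass
class VmSample:
    name: str
    contention_pct: float    # how much the VM waits for a resource
    utilization_pct: float   # how much of its configured resource it uses


def triage(target: VmSample, all_vms: list[VmSample]) -> str:
    """Check the infrastructure first, then VM contention, then VM sizing."""
    contended = [vm for vm in all_vms if vm.contention_pct > CONTENTION_LIMIT_PCT]
    # Infrastructure layer: no point troubleshooting one VM if most VMs suffer.
    if len(contended) > MAJORITY_FRACTION * len(all_vms):
        return "infrastructure: most VMs are contended, fix the platform first"
    # VM layer: contention comes before utilization.
    if target.contention_pct > CONTENTION_LIMIT_PCT:
        return "vm: contention, the platform is not serving this VM well"
    # Only now does right-sizing make sense.
    if target.utilization_pct > UNDERSIZED_UTIL_PCT:
        return "vm: no contention, but possibly undersized"
    if target.utilization_pct < OVERSIZED_UTIL_PCT:
        return "vm: no contention, but possibly oversized"
    return "vm: no contention and sizing looks reasonable"
```

The point of the sketch is the ordering of the `if` statements, not the numbers: utilization is only consulted once both layers are free of contention.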

This two-layer concept is also implemented by vSphere in compute and storage architectures. For example, there are two distinct layers of memory in vSphere. There is the individual VM memory provided by the hypervisor and there is the physical memory at the host level. For an individual VM, we care whether the VM is getting enough memory. At the host level, we care whether the host has enough memory for everyone. Because of the difference in goals, we look for a different set of counters.

In the previous diagram, there are two numbers shown in a large font, indicating that there are two main steps in monitoring. Each step applies to each layer (the VM layer and infrastructure layer), so there are two numbers for each step:

  1. Step 1 is used for performance management. It is useful during troubleshooting or when checking whether we are meeting performance SLAs or not.
  2. Step 2 is used for capacity management. It is useful as part of long-term capacity planning. The time period for step 2 is typically 3 months, as we are checking for overall utilization and not a one-off spike.
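To make the difference between overall utilization and a one-off spike concrete, here is a minimal Python sketch. It assumes one utilization value per day over roughly three months; the function name `summarize_utilization` and the choice of a 95th percentile are assumptions made for this example, not something prescribed by vRealize Operations.

```python
from statistics import mean


def summarize_utilization(daily_pct: list[float]) -> dict[str, float]:
    """Summarize utilization over a long window, for example ~90 daily values."""
    if not daily_pct:
        raise ValueError("need at least one sample")
    ordered = sorted(daily_pct)
    # Nearest-rank 95th percentile, computed with integer arithmetic:
    # rank = ceil(0.95 * n). It ignores the highest 5% of days.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "average": mean(ordered),
        "p95": ordered[rank - 1],
        "peak": ordered[-1],
    }


# 89 quiet days at 40% and a single day that spiked to 100%.
window = [40.0] * 89 + [100.0]
summary = summarize_utilization(window)
```

In this example the peak is dominated by the single spike, while the average and the 95th percentile stay close to 40%, which is the figure that matters for long-term capacity planning.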

With the preceding concept in mind, we are ready to dive into more detail. Let’s cover compute, network, and storage in the next post.

Discount code for the book

Please note that the codes expire on 26 January 2015. They were originally meant for VMUG Singapore members only, but the publisher has kindly let me publish them openly.

You can buy it here.

Physical Book: aU1dLXxb0a (25% discount)
eBook : E1VMUG2015 (50% discount)

You can find my presentation for VMUG Singapore here. It’s based on the book.

Capacity Management at Infrastructure level – part 2

For the storage node, capacity management depends on the chosen architecture. Storage is undergoing a revolution with the arrival of converged storage, which introduces an alternative to the traditional, external array.

In the traditional, external storage model, there is a physical array (for example, EMC VNX, HDS HUS, or NetApp). As most environments are not yet 100 percent virtualized, the physical array is shared by non-ESXi servers (for example, UNIX). There is often a physical backup server (for example, Symantec NetBackup) that utilizes VMware VADP.

The array might have LUNs replicated to a DR site. This replication certainly takes up bandwidth, FC ports, the array CPU, and bandwidth on your inter-data center line.

If the array does not support VAAI (or that specific feature is not yet implemented), then the traffic will traverse the path up and down. This can mean a lot of traffic going from the spindle to ESXi and back.

Book - Capacity Management - storage 01

In the second example, there is no longer a separate physical array. It has been virtualized and absorbed into the server. It has truly become a subsystem. Some example products in this category are Nutanix and Virtual SAN. So the object labeled Storage 1 in the next diagram is just a bunch of local disks (magnetic or solid state) in the physical server. Each ESXi host runs a similar group of local drives, typically with flash, SSD, and SATA. The local drives are virtualized. There is no FC protocol; it’s all IP-based storage.

To avoid a single point of failure, the virtual storage appliance is able to mirror or copy data in real time, and you need to cater for the bandwidth this consumes. I would recommend you use 10 Gb infrastructure for your ESXi if you are adopting this distributed storage architecture, especially in environments with five or more hosts in a cluster. The physical switches connecting your ESXi servers should be seen as an integral part of the solution, not a “service” provided by the network team. Architectural choices such as ensuring redundancy for NICs and switches are important.

The following diagram also uses vSphere Replication. Unlike array-based replication, this consumes the resources of ESXi and the network.

Book - Capacity Management - storage 02

Once you have confirmed your storage architecture, you will be in a position to calculate your usable capacity and IOPS. Let’s now dive deeper into the first architecture, as this is still the most common architecture.

The next diagram shows a typical mid-range array. I’m only showing Controller 1 as our focus here is capacity, not availability. The top box shows the controller. It has CPU, RAM, and cache. In tiered storage, there will be multiple tiers, and the datastore (or NFS / RDM) can write into any of the tiers seamlessly and transparently. You do not need to have per-VM control over it. The control is likely at the LUN level. I’ve covered what that means in terms of performance (IOPS and latency).

What I’d like to show in this diagram is the trade-off in design between the ability to share resources and the ability to guarantee performance. In this diagram, our array has three volumes. Each volume consists of 16 spindles. In this specific example, the volumes are independent of one another. If Volume 1 is overloaded with IOPS but Volume 2 is idle, Volume 1 cannot offload the IOPS to Volume 2. Hence, in practice, the storage array is not one array. From a capacity and performance point of view, it has hard partitions that cannot be crossed.

Does this mean that you should create one giant volume so you can share everything? Probably not, because there is no concept of shares or priority within a single volume. In the diagram, Datastore 1 and Datastore 2 live on Volume 1. If a non-production VM on Datastore 1 is performing a high-IO task (say someone runs IOmeter), it can impact a production VM on Datastore 2.

Book - Capacity Management - classic array
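A small Python sketch shows why the array-wide total hides the hard partitions between volumes. The three volume names follow the diagram; the IOPS figures and the helper `volume_headroom` are hypothetical and exist only for this illustration.

```python
def volume_headroom(capacity_iops: dict[str, int],
                    demand_iops: dict[str, int]) -> dict[str, int]:
    """Spare IOPS per volume; a negative value means the volume is overloaded."""
    return {vol: capacity_iops[vol] - demand_iops.get(vol, 0)
            for vol in capacity_iops}


# Hypothetical figures: three independent volumes of equal capability.
capacity = {"Volume 1": 2000, "Volume 2": 2000, "Volume 3": 2000}
demand = {"Volume 1": 2600, "Volume 2": 300, "Volume 3": 1200}

headroom = volume_headroom(capacity, demand)
overloaded = [vol for vol, spare in headroom.items() if spare < 0]

# The array-wide sum looks healthy even though Volume 1 is overloaded,
# because spare IOPS on Volume 2 cannot be lent to Volume 1.
array_total_spare = sum(headroom.values())
```

With these made-up numbers the array as a whole has spare IOPS, yet Volume 1 is over its limit, so capacity has to be tracked per volume rather than per array.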

Storage I/O Control (SIOC) will not help you in this case. The scope of SIOC is within a datastore. It does not ensure fairness across datastores. I recommend you review Cormac Hogan’s blog. Duncan Epping has also written some good articles on the topic and a good starting point is this. If you have many datastores, SIOC has the highest chance of achieving fairness across datastores when the number of VMs per datastore is consistent.

As a VMware admin performing capacity management, you need to know the physical hardware on which your VMware environment runs, at all layers. Often as VMware professionals, we stop at the compute layer and treat storage as just a LUN provider. There is a whole world underneath the LUNs presented to you.

Now that you know your physical capacity, the next thing to do is estimate your IaaS workload. If you buy an array with 100,000 IOPS, it does not mean you have 100,000 IOPS for your VMs. In the next example, you have a much smaller number of usable IOPS. The most important factors you need to be aware of are:

  • Frontend IOPS
  • Backend IOPS

There are many calculations on IOPS as there are many variables impacting it. The numbers in this table are just examples. Your numbers will differ. The point I hope to get across is that it is important to sit down with the storage architect and estimate the number for your specific environment.

Book - Capacity Management - classic array workload
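One commonly used rule of thumb for relating frontend and backend IOPS is that every frontend write costs several backend operations, depending on the RAID level. The Python sketch below applies that rule. It is not the calculation from the table above: the function name, the `reserved_iops` parameter for non-VM (IaaS) workloads, and the write penalties (RAID 10 = 2, RAID 5 = 4, RAID 6 = 6) are assumptions for illustration, and real arrays with cache and tiering will behave differently.

```python
# Commonly quoted backend operations per frontend write (assumed values).
WRITE_PENALTY = {"RAID 10": 2, "RAID 5": 4, "RAID 6": 6}


def usable_frontend_iops(backend_iops: float,
                         write_fraction: float,
                         write_penalty: int,
                         reserved_iops: float = 0.0) -> float:
    """Estimate the frontend IOPS left for VMs.

    backend_iops:   raw IOPS the disks can deliver (assumed to be known)
    write_fraction: share of frontend I/O that is writes, between 0 and 1
    write_penalty:  backend operations needed per frontend write
    reserved_iops:  frontend IOPS set aside for non-VM workloads
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    read_fraction = 1.0 - write_fraction
    # Each frontend read costs 1 backend I/O, each write costs write_penalty.
    backend_cost_per_frontend_io = read_fraction + write_fraction * write_penalty
    return backend_iops / backend_cost_per_frontend_io - reserved_iops


# A write-only workload on RAID 10 halves the usable IOPS.
all_writes_raid10 = usable_frontend_iops(100_000, 1.0, WRITE_PENALTY["RAID 10"])
```

Even this simplified model shows how quickly a headline figure shrinks once the write ratio, the RAID level, and the IaaS workloads are taken into account.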

Capacity planning at the network layer

Similar to calculating capacity for storage, understanding capacity requirements for your network requires knowledge of the IaaS workloads that will compete with your VM workload. The following table provides an example using IP storage. Your actual design may differ from it. If you are using Fibre Channel storage, then you can use the available bandwidth for other purposes.

Book - Capacity Management - network
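The same budgeting idea can be written as a few lines of Python. The workload names and bandwidth figures below are hypothetical and are not taken from the table; they only show how IaaS traffic eats into what is left for VM traffic.

```python
def vm_bandwidth_gbps(total_gbps: float,
                      iaas_reservations: dict[str, float]) -> float:
    """Bandwidth left for VM traffic after the IaaS workloads are reserved."""
    reserved = sum(iaas_reservations.values())
    if reserved > total_gbps:
        raise ValueError("IaaS workloads alone exceed the available bandwidth")
    return total_gbps - reserved


# Hypothetical host with two 10 Gb uplinks and assumed reservations in Gbps.
total = 2 * 10.0
reservations = {
    "IP storage": 6.0,
    "vMotion": 4.0,
    "vSphere Replication": 1.0,
    "Management": 1.0,
}

left_for_vms = vm_bandwidth_gbps(total, reservations)

# With Fibre Channel storage, the IP storage share is freed for other uses.
without_ip_storage = {k: v for k, v in reservations.items() if k != "IP storage"}
left_with_fc = vm_bandwidth_gbps(total, without_ip_storage)
```

Replace the reservations with the numbers agreed with your network and storage teams; the structure of the calculation stays the same.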