For the storage node, capacity management depends on the chosen architecture. Storage is undergoing a revolution with the arrival of converged storage, which introduces an alternative to the traditional external array.
In the traditional, external storage model, there is a physical array (for example, EMC VNX, HDS HUS, or NetApp). As most environments are not yet 100 percent virtualized, the physical array is shared with non-ESXi servers (for example, UNIX). There is often a physical backup server (for example, Symantec NetBackup) that uses VMware VADP.
The array might have LUNs replicated to a DR site. This replication consumes FC ports, array CPU, and bandwidth on your inter-data center link.
If the array does not support VAAI (or that specific feature is not yet implemented), then traffic for operations that would otherwise be offloaded to the array will traverse the full path up and down. This can mean a lot of traffic going from the spindle to ESXi and back.
In the second example, there is no longer a separate physical array. It has been virtualized and absorbed into the server. It has truly become a subsystem. Some example products in this category are Nutanix and Virtual SAN. So the object labeled Storage 1 in the next diagram is just a bunch of local disks (magnetic or solid state) in the physical server. Each ESXi host runs a similar group of local drives, typically with flash, SSD, and SATA. The local drives are virtualized. There is no FC protocol; it’s all IP-based storage.
To avoid a single point of failure, the virtual storage appliance mirrors or copies data in real time, and you need to provision bandwidth for this. I would recommend you use 10 Gb infrastructure for your ESXi hosts if you are adopting this distributed storage architecture, especially in environments with five or more hosts in a cluster. The physical switches connecting your ESXi servers should be seen as an integral part of the solution, not a “service” provided by the network team. Architectural choices such as ensuring redundancy for NICs and switches are important.
The following diagram also uses vSphere Replication. Unlike array-based replication, this consumes ESXi and network resources.
Once you have confirmed your storage architecture, you will be in a position to calculate your usable capacity and IOPS. Let’s now dive deeper into the first architecture, as it is still the most common.
The next diagram shows a typical mid-range array. I’m only showing Controller 1 as our focus here is capacity, not availability. The top box shows the controller. It has CPU, RAM, and cache. In tiered storage, there will be multiple tiers, and the datastore (or NFS / RDM) can write into any of the tiers seamlessly and transparently. You do not need to have per-VM control over it. The control is likely at the LUN level. I’ve covered what that means in terms of performance (IOPS and latency).
What I’d like to show in this diagram is the trade-off in design between the ability to share resources and the ability to guarantee performance. In this diagram, our array has three volumes. Each volume consists of 16 spindles. In this specific example, each volume is independent of the others. If Volume 1 is overloaded with IOPS but Volume 2 is idle, it cannot offload the IOPS to Volume 2. Hence, the storage array is not, in practice, one array. From a capacity and performance point of view, it has hard partitions that cannot be crossed.
Does this mean that you should create one giant volume so you can share everything? Probably not; the reason is that there is no concept of shares or priority within a single volume. From the diagram, Datastore 1 and Datastore 2 live on Volume 1. If a non-production VM on Datastore 1 is performing a high IO task (say someone runs IOmeter), it can impact a production VM on Datastore 2.
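The hard partition between volumes can be illustrated with a minimal sketch. The three volumes of 16 spindles each come from the example above; the per-spindle IOPS figure and the demand numbers are assumptions chosen purely for illustration, and the sketch ignores cache, tiering, and RAID overhead.

```python
from dataclasses import dataclass

# Assumption: an illustrative per-spindle figure, not a vendor number.
IOPS_PER_SPINDLE = 150


@dataclass
class Volume:
    name: str
    spindles: int
    demand_iops: int  # hypothetical IOPS requested by the datastores on this volume


def capacity_iops(volume: Volume) -> int:
    """Raw IOPS the spindles of a single volume can deliver."""
    return volume.spindles * IOPS_PER_SPINDLE


def unserved_iops(volume: Volume) -> int:
    """Demand that the volume cannot serve; it cannot borrow from other volumes."""
    return max(0, volume.demand_iops - capacity_iops(volume))


volumes = [
    Volume("Volume 1", spindles=16, demand_iops=3000),  # overloaded
    Volume("Volume 2", spindles=16, demand_iops=200),   # nearly idle
    Volume("Volume 3", spindles=16, demand_iops=1500),
]

array_capacity = sum(capacity_iops(v) for v in volumes)
array_demand = sum(v.demand_iops for v in volumes)

for v in volumes:
    print(f"{v.name}: capacity={capacity_iops(v)} demand={v.demand_iops} "
          f"unserved={unserved_iops(v)}")
print(f"Array total: capacity={array_capacity} demand={array_demand}")
```

With these made-up numbers, the array as a whole has spare IOPS, yet Volume 1 still cannot serve all of its demand, because the idle spindles in Volume 2 are out of its reach.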
Storage I/O Control (SIOC) will not help you in this case. The scope of SIOC is within a datastore. It does not ensure fairness across datastores. I recommend you review Cormac Hogan’s blog. Duncan Epping has also written some good articles on the topic, which make a good starting point. If you have many datastores, SIOC has the best chance of achieving fairness across datastores when the number of VMs per datastore is consistent.
As a VMware admin performing capacity management, you need to know the physical hardware that your VMware environment runs on, at all layers. Often, as VMware professionals, we stop at the compute layer and treat storage as just a LUN provider. There is a whole world underneath the LUNs presented to you.
Now that you know your physical capacity, the next thing to do is estimate your IaaS workload. If you buy an array with 100,000 IOPS, it does not mean you have 100,000 IOPS for your VMs. In the next example, you have a much smaller number of usable IOPS. The most important factors you need to be aware of are:
- Frontend IOPS
- Backend IOPS
There are many calculations on IOPS as there are many variables impacting it. The numbers in this table are just examples. Your numbers will differ. The point I hope to get across is that it is important to sit down with the storage architect and estimate the number for your specific environment.
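One common rule-of-thumb relation between backend and frontend IOPS can be sketched as follows. It is only a starting point for the conversation with your storage architect: the RAID write penalties are the commonly quoted approximations, the 100,000 figure is treated here as raw backend IOPS (an assumption), and the write ratio and IaaS overhead are hypothetical values. Cache and tiering effects are ignored.

```python
# Commonly quoted write penalties (backend I/Os per frontend write).
# Treat these as approximations; your array may behave differently.
RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def frontend_iops(backend_iops: float, write_ratio: float, raid_level: str) -> float:
    """Estimate frontend IOPS from backend IOPS for a given read/write mix."""
    if not 0.0 <= write_ratio <= 1.0:
        raise ValueError("write_ratio must be between 0 and 1")
    penalty = RAID_WRITE_PENALTY[raid_level]
    read_ratio = 1.0 - write_ratio
    # Each frontend read costs 1 backend I/O; each write costs `penalty`.
    return backend_iops / (read_ratio + write_ratio * penalty)


def usable_vm_iops(backend_iops: float, write_ratio: float,
                   raid_level: str, iaas_iops: float) -> float:
    """Frontend IOPS left for VMs after subtracting the IaaS workload."""
    return max(0.0, frontend_iops(backend_iops, write_ratio, raid_level) - iaas_iops)


# Hypothetical inputs: 30 percent writes, RAID 5, and 10,000 IOPS
# consumed by IaaS workload such as replication and backup.
estimate = usable_vm_iops(100_000, write_ratio=0.3, raid_level="RAID5",
                          iaas_iops=10_000)
print(f"Estimated IOPS available to VMs: {estimate:,.0f}")
```

Changing the write ratio or the RAID level moves the result considerably, which is why the numbers need to be worked out for your specific environment.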
Capacity planning at the network layer
Similar to calculating capacity for storage, understanding capacity requirements for your network requires knowledge of the IaaS workloads that will compete with your VM workload. The following table provides an example using IP storage. Your actual design may differ from it. If you are using Fibre Channel storage, then you can use the bandwidth that IP storage would have consumed for other purposes.
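The same budgeting idea can be sketched for the network. Every number below is a hypothetical placeholder (link speed, NIC count, and the bandwidth set aside for each IaaS workload), so treat it as a template for your own figures rather than as a recommended design.

```python
# Hypothetical per-host reservations for IaaS workloads, in Gbps.
iaas_reservations_gbps = {
    "IP storage": 4.0,
    "vMotion": 2.0,
    "vSphere Replication": 1.0,
    "Management": 0.5,
}


def vm_bandwidth_gbps(link_gbps: float, nic_count: int,
                      reservations: dict, survive_nic_failure: bool = True) -> float:
    """Bandwidth left for VM traffic after the IaaS reservations.

    When survive_nic_failure is True, one NIC is excluded from the budget
    so the plan still holds after a single NIC or switch failure.
    """
    usable_nics = nic_count - 1 if survive_nic_failure else nic_count
    remaining = link_gbps * usable_nics - sum(reservations.values())
    if remaining < 0:
        raise ValueError("IaaS reservations exceed the available bandwidth")
    return remaining


# Two 10 Gb NICs per host, planned to survive the loss of one.
with_ip_storage = vm_bandwidth_gbps(10.0, 2, iaas_reservations_gbps)

# With Fibre Channel storage there is no IP storage traffic on these NICs.
fc_reservations = {name: gbps for name, gbps in iaas_reservations_gbps.items()
                   if name != "IP storage"}
with_fc_storage = vm_bandwidth_gbps(10.0, 2, fc_reservations)

print(f"VM bandwidth with IP storage: {with_ip_storage} Gbps")
print(f"VM bandwidth with FC storage: {with_fc_storage} Gbps")
```

Budgeting against one fewer NIC is a conservative design choice; if you plan against all NICs instead, pass `survive_nic_failure=False`.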