Sunny Dua and I are sharing what we have learned next week at VMworld 2015. As you can see from our blogs and book, we focus on Performance and Capacity Management. Essentially, we are sharing what we have learned from our engagements and projects in the past few years.
We have presented the material a few times, and know it will not fit within the 1 hour time slot. So this blog serves as a deeper dive to the slide.
Sunny has provided a good overview on the topic in his blog here, so please read that first. You can find the session here. Title is MGT4973 – Mastering Performance Monitoring and Capacity Planning. I will provide additional details in this blog.
The material follows the following structure:
- A Technical Introduction, to set the focus and scope of discussion, and level set the knowledge.
- The Dining Area. We use a restaurant analogy to drive the message that you need to focus on the customer first, and your IaaS second. If you take care of them well, and they are happy with your service, the problem you have in your IaaS is secondary and internal matter.
- The Kitchen. This is your infrastructure layer, where VMware and the hardware resides.
The key component of this is the 2 distinct layer in your IaaS business. Please review that article before proceeding, as the rest of the material completely depends on this model.
If you are presenting this material back to your colleagues or management, who may not have deep technical knowledge on VMware, be prepared to whiteboard it. From experience, it took around 2 – 4 hours for those without vSphere vmkernel scheduling knowledge. In one of my customer, it took me all day as the audience kept on asking question.
The Dining Area
Here, I share the actual dashboards you need to help you ensure a good IaaS business, where customers are happy. It focuses on the customers, not the Infrastructure.
Detail monitoring of a single VM
- We start with a single VM, as we need to ensure we can handle 1 VM before we consider handling all VMs. A common use case here is a VM owner (your customer) complains that his VM is slow. You need to come up with a dashboard that enable help desk to quickly and easily identify where the problem is. Is it with Infrastructure or with the VM? Is it CPU, RAM, Disk or Network? How severe is the problem?
Large VMs Monitoring
- We created this dashboard as over-provisioning is a common illness in virtual environment. If you want a healthy environment, you need to eradicate, or at least minimize, this bad practice. Reducing someone’s VM is a delicate and lengthy process, so you want to focus on the largest VM. Reducing one 16-vCPU VM to 4 vCPU gives you better return than reducing three 8-vCPU VM to 4 vCPU. The actual total vCPU reduction is the same (12 vCPU in this example), but ESXi vmkernel scheduler will have easier task in juggling the VMs as the 16 vCPU VM needs 16 physical cores (even though it’s running idle loop).
- This dashboard visually tells you how deep and wide-spread the over-provisioning problem is. You get to see all the large VMs, and from here you can drill down into individual VM and see if it’s really using all those resources allocated.
VM Right Sizing
- There are 2 ends of the spectrum:
- Upsize is generally not your concern 🙂 . The VM owner will be the first to tell you his VM needs more resource. From your view point, as someone looking after all the VMs, you can use Log Insight to quickly tell which VM hit high CPU or RAM usage and when.
- Downsize is definitely your concern. It is tough to get anyone to give back their resources, especially since it incurs downtime. From my experience, I learn that some application team want to see the actual utilization of each vCPU in the past 1 month. You can create a dashboard that automatically plots all the vCPU utilization. To see more details coverage, review Chapter 8 of my book.
- One characteristic of virtual environment is sharing. The VMs share the physical resources. Excessive usage by 1-2 VM can impact the overall IaaS performance. This is especially true in component that you do not cap by default, which is Network throughput, Disk IOPS and Disk throughput.
- This dashboard lets you see if there is excessive usage at any point in time. And if there is, you can drill down to find out which VM causes that.
Indeed, there are only 4 use cases you need. Do let Sunny or me if you think you need additional dashboards. Keep it simple, so you are not lost in the forest of screens and reports. From experience, customers who want more dashboard mistaken the Consumer Layer with the Provider Layer (the kitchen). So let’s cover the kitchen now.
The IaaS layer is where you have, or should have, complete control and visibility. If you do not have, you need to fix it, as your customer assumes and expects you do.
There are 4 large areas to manage:
As you know well, the above 4 disciplines are inter-related. Among these 4, Performance is the most common issue, but Capacity is what you normally tell me you need. You will see in the session that Capacity depends heavily on Performance and Availability. Take Storage for example. Say your SAN array has 100 TB capacity left. That’s plenty of space, probably enough for 1000 VM. But existing VMs are already experiencing high latency. Should you add more VM? The answer is clearly no. Adding VM will make performance worse. For all practical purpose, the capacity is full.
The way you do capacity changes drastically, once you take into account Performance and Availability. See this for an in-depth explanation on how you can implement a more holistic capacity planning.
For Performance, the main requirement from your CIO or management is typically around your IaaS ability to deliver. They want your IaaS to be performing, as business runs on it. The question is how do you prove that… not a single VM… in the past 1 month or whatever the period is… suffers unacceptable performance hit because of non-performing IaaS?
That’s an innocent, but loaded, question. Very loaded, and you need to consider carefully.
If you have 1000 VMs, you need to answer for 1000 VM. For each VM, you need to answer CPU, RAM, Disk and Network. That’s 4000 metrics. If your management or customer agrees on a 5-minute sampling period, you have 12 samples in 1 hour. In 1 day you have 288 samples. In 1 month you have ~8750 samples (30.4 days on average). For 1000 VM, that means 4000 x 8750 = 35,000,000 chances where your IaaS can fail in serving the customer!
In the session, and in the book, you will see that if you implement Service Tiering, it drastically increases your chance in meeting the requirement. We introduce a concept called Performance SLA. Once you have it, you will know for sure if you fail or succeed in meeting the agreed performance.
I distinguish between monitoring and troubleshooting. To me, troubleshooting is a big topic by itself, and the steps vary depending on what you’re trying to troubleshoot. Monitoring, on the other hand, consists of repeatable steps that you perform regularly, preferably daily. You can create SOP (Standard Operating Procedure) out of it.
As you can see from the book and blog, my focus so far has been on Performance and Capacity. The reason is they are big topic and I need to reach the level that you can actually implement and operationalize. Once I’m done, I’d move to Configuration and Availability.
With that, see you at VMworld!
[15 Sep 2015: you can find the actual presentation in this link]