
75% discount code for the SDDC Performance & Capacity Management book

Code is TechSummit15

With 75% discount, it’s only US$ 6.75.

The code only works on the Publisher's web site. It does not work on Amazon or other sites. The direct link to the book is this.

It also applies to the soft copy only. That is good for the environment, saves the shipping fee, and gives you an immediate download 🙂

For those who are not familiar with the book, the title is rather misleading. As shared here, I wanted to name it SDDC Performance and Capacity Management. That was what I proposed to the Publisher, as it is not actually a product book. It is more of an architecture or best-practice book, focused on performance and capacity. Of the 260 pages, the bulk is not about vRealize Operations. Product-wise, the book covers vSphere more than it covers vRealize Operations. So it is relevant even if you just want to master those vSphere counters.

Because it is not a product book, you can in fact apply a lot of the concepts even if you do not have vRealize Operations. If you are looking for a product book, the best one is here.

VMworld 2015 session MGT4973

Sunny Dua and I delivered the session MGT4973 at VMworld 2015. We delivered it twice, as the first session had nearly 500 people registered. We were really humbled and honored by the great feedback. The topic required a change in paradigm. You need to unlearn what you have known for years as best practices, and learn a new concept. That is easier to accept when you are relaxed.

Sunny and I decided to use humor, as otherwise we ran the risk that the session would not go down well. We were not sure if our humor would resonate, so it was a big relief that it was well received! The 2 sessions received ratings of 4.38 and 4.77 respectively. The 2nd one had a better rating as we acted on the feedback from the first session. Below are the comments, and we are thankful for the kind words. They encourage us to continue sharing on SDDC Operations Management.


The session video has been published to VMworld attendees. No login required.

  • Sunny provided a good overview on the topic in his blog here, so please read that first.
  • I added a bit more detail to his overview here.
  • The slides were based on a deck we have presented before, which normally takes ~3 hours. You can find the presentation here. It is a superset of the VMworld session.
  • Because of the positive feedback, we decided to share it with the vBrownBag. We were given a 15-minute slot, which you can see here. Certainly, we would love to share more with the community.
  • [15 Sep 2015 update: You can find the actual presentation we delivered here]

There have been requests for more details, so here they are:

  • Details of my book can be found here.
  • The actual dashboard for Capacity Management can be found here. Warning, it’s a long read.

Will we make it to VMworld Barcelona? It would be a privilege indeed. We are local resources in the Singapore team, so naturally there is no scope for us to fly to Barcelona. But if we are needed, we would love to participate and share the content!

VMworld 2015 session MGT4973 preview


Sunny Dua and I are sharing what we have learned next week at VMworld 2015. As you can see from our blogs and book, we focus on Performance and Capacity Management. Essentially, we are sharing what we have learned from our engagements and projects in the past few years.

We have presented the material a few times, and we know it will not fit within the 1-hour time slot. So this blog serves as a deeper dive into the slides.

Sunny has provided a good overview on the topic in his blog here, so please read that first. You can find the session here. Title is MGT4973 – Mastering Performance Monitoring and Capacity Planning. I will provide additional details in this blog.

The material is structured as follows:

  1. A Technical Introduction, to set the focus and scope of discussion, and level set the knowledge.
  2. The Dining Area. We use a restaurant analogy to drive the message that you need to focus on the customers first, and your IaaS second. If you take care of them well, and they are happy with your service, the problems you have in your IaaS are a secondary, internal matter.
  3. The Kitchen. This is your infrastructure layer, where VMware and the hardware reside.

Technical Introduction

The key component of this is the 2 distinct layers in your IaaS business. Please review that article before proceeding, as the rest of the material completely depends on this model.

If you are presenting this material back to your colleagues or management, who may not have deep technical knowledge of VMware, be prepared to whiteboard it. From experience, it takes around 2 to 4 hours for those without knowledge of vSphere vmkernel scheduling. At one of my customers, it took me all day as the audience kept asking questions.

The Dining Area 

Here, I share the actual dashboards you need to help you ensure a good IaaS business, where customers are happy. It focuses on the customers, not the Infrastructure.

Detail monitoring of a single VM

  • We start with a single VM, as we need to ensure we can handle 1 VM before we consider handling all VMs. A common use case here is that a VM owner (your customer) complains that his VM is slow. You need to come up with a dashboard that enables the help desk to quickly and easily identify where the problem is. Is it with the Infrastructure or with the VM? Is it CPU, RAM, Disk or Network? How severe is the problem?

Large VMs Monitoring

  • We created this dashboard because over-provisioning is a common illness in virtual environments. If you want a healthy environment, you need to eradicate, or at least minimize, this bad practice. Reducing someone's VM is a delicate and lengthy process, so you want to focus on the largest VMs. Reducing one 16-vCPU VM to 4 vCPU gives you a better return than reducing three 8-vCPU VMs to 4 vCPU each. The total vCPU reduction is the same (12 vCPU in this example), but the ESXi vmkernel scheduler will have an easier task juggling the VMs, as the 16-vCPU VM needs 16 physical cores (even though it is only running an idle loop).
  • This dashboard visually tells you how deep and widespread the over-provisioning problem is. You get to see all the large VMs, and from here you can drill down into an individual VM and see if it is really using all the resources allocated to it.
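The arithmetic in the Large VMs example can be checked with a short Python sketch. The VM sizes come from the example above; the function name is made up purely for illustration. It only confirms that the two options reclaim the same number of vCPUs, and says nothing about the scheduling benefit itself.

```python
def vcpus_reclaimed(current_sizes, target_size):
    """Total vCPUs freed when every VM in current_sizes is reduced to target_size."""
    return sum(size - target_size for size in current_sizes)


# Option A: one 16-vCPU VM reduced to 4 vCPU.
one_large = vcpus_reclaimed([16], 4)

# Option B: three 8-vCPU VMs reduced to 4 vCPU each.
three_medium = vcpus_reclaimed([8, 8, 8], 4)

print(one_large, three_medium)  # 12 12
```

Both options free 12 vCPUs, but Option A needs only one downsizing conversation with one VM owner.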

VM Right Sizing

  • There are 2 ends of the spectrum:
    • downsize
    • upsize
  • Upsizing is generally not your concern 🙂 . The VM owner will be the first to tell you that his VM needs more resources. From your viewpoint, as someone looking after all the VMs, you can use Log Insight to quickly tell which VM hit high CPU or RAM usage, and when.
  • Downsizing is definitely your concern. It is tough to get anyone to give back their resources, especially since it incurs downtime. From my experience, some application teams want to see the actual utilization of each vCPU in the past 1 month. You can create a dashboard that automatically plots the utilization of every vCPU. For more detailed coverage, review Chapter 8 of my book.

Excessive Usage

  • One characteristic of a virtual environment is sharing. The VMs share the physical resources. Excessive usage by 1-2 VMs can impact the overall IaaS performance. This is especially true for the components that you do not cap by default, which are Network throughput, Disk IOPS and Disk throughput.
  • This dashboard lets you see if there is excessive usage at any point in time. If there is, you can drill down to find out which VM is causing it.
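As a rough illustration of the drill-down idea, the sketch below flags VMs that take a disproportionate share of a shared, uncapped resource at one point in time. Everything in it is an assumption made for the example: the VM names, the IOPS figures and the 50% share threshold are invented, and it is not how the dashboard itself is built.

```python
def excessive_consumers(usage_by_vm, share_threshold=0.5):
    """Return the VMs whose share of total usage exceeds share_threshold.

    usage_by_vm maps a VM name to its usage (e.g. disk IOPS) for one sample.
    share_threshold is a hypothetical cut-off, not a recommended value.
    """
    total = sum(usage_by_vm.values())
    if total == 0:
        return []
    return sorted(
        vm for vm, used in usage_by_vm.items() if used / total > share_threshold
    )


# Hypothetical disk IOPS for four VMs sharing one datastore.
sample = {"vm-a": 200, "vm-b": 150, "vm-c": 9000, "vm-d": 300}
hogs = excessive_consumers(sample)
print(hogs)  # ['vm-c']
```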

Indeed, there are only 4 use cases you need. Do let Sunny or me know if you think you need additional dashboards. Keep it simple, so you are not lost in a forest of screens and reports. From experience, customers who want more dashboards mistake the Consumer Layer for the Provider Layer (the kitchen). So let's cover the kitchen now.

The Kitchen

The IaaS layer is where you have, or should have, complete control and visibility. If you do not, you need to fix that, as your customers assume and expect that you do.

There are 4 large areas to manage:

  • Performance
  • Capacity
  • Configuration
  • Availability

As you know well, the above 4 disciplines are inter-related. Among these 4, Performance is the most common issue, but Capacity is what you normally tell me you need. You will see in the session that Capacity depends heavily on Performance and Availability. Take Storage for example. Say your SAN array has 100 TB of capacity left. That's plenty of space, probably enough for 1000 VMs. But the existing VMs are already experiencing high latency. Should you add more VMs? The answer is clearly no. Adding VMs will make performance worse. For all practical purposes, the capacity is full.
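The storage example can be expressed as a small Python sketch: capacity only counts as available when there is free space and performance is still acceptable. The 100 TB figure comes from the example above; the function name, the latency values and the 20 ms limit are hypothetical and only there to make the example runnable.

```python
def can_place_more_vms(free_tb, latency_ms, latency_limit_ms):
    """Capacity is usable only if space is free AND latency is within the limit."""
    has_space = free_tb > 0
    performing = latency_ms <= latency_limit_ms
    return has_space and performing


# 100 TB free, but existing VMs already see high latency (hypothetical numbers).
print(can_place_more_vms(free_tb=100, latency_ms=40, latency_limit_ms=20))  # False

# Same free space with healthy latency.
print(can_place_more_vms(free_tb=100, latency_ms=5, latency_limit_ms=20))  # True
```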

The way you do capacity planning changes drastically once you take Performance and Availability into account. See this for an in-depth explanation of how you can implement more holistic capacity planning.

For Performance, the main requirement from your CIO or management is typically around your IaaS's ability to deliver. They want your IaaS to perform well, as the business runs on it. The question is how do you prove that… not a single VM… in the past 1 month or whatever the period is… suffered an unacceptable performance hit because of a non-performing IaaS?

That’s an innocent, but loaded, question. Very loaded, and you need to consider carefully.

If you have 1000 VMs, you need to answer for all 1000 VMs. For each VM, you need to answer for CPU, RAM, Disk and Network. That's 4000 metrics. If your management or customer agrees on a 5-minute sampling period, you have 12 samples in 1 hour. In 1 day you have 288 samples. In 1 month you have ~8750 samples (30.4 days on average). For 1000 VMs, that means 4000 x 8750 = 35,000,000 chances where your IaaS can fail in serving the customer!
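The sample count above can be reproduced with a few lines of Python. All inputs come from the paragraph itself (1000 VMs, 4 metrics per VM, 5-minute sampling, 30.4 days per month); only the variable names are mine.

```python
vms = 1000
metrics_per_vm = 4            # CPU, RAM, Disk, Network
sample_minutes = 5

samples_per_hour = 60 // sample_minutes       # 12
samples_per_day = samples_per_hour * 24       # 288
samples_per_month = samples_per_day * 30.4    # about 8755, rounded to ~8750 in the text

# Total data points in a month, using the rounded figure from the text.
rounded_total = vms * metrics_per_vm * 8750   # 35,000,000

# The same total without rounding the monthly sample count.
exact_total = vms * metrics_per_vm * samples_per_month

print(samples_per_hour, samples_per_day, rounded_total)  # 12 288 35000000
```

Without rounding, the total is slightly above 35 million, so the figure in the text is a conservative round number.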

In the session, and in the book, you will see that implementing Service Tiering drastically increases your chance of meeting the requirement. We introduce a concept called Performance SLA. Once you have it, you will know for sure whether you have failed or succeeded in meeting the agreed performance.

I distinguish between monitoring and troubleshooting. To me, troubleshooting is a big topic by itself, and the steps vary depending on what you are trying to troubleshoot. Monitoring, on the other hand, consists of repeatable steps that you perform regularly, preferably daily. You can create an SOP (Standard Operating Procedure) out of it.

As you can see from the book and blog, my focus so far has been on Performance and Capacity. The reason is that they are big topics, and I need to reach a level of detail that you can actually implement and operationalize. Once I'm done, I will move on to Configuration and Availability.

With that, see you at VMworld!

[15 Sep 2015: you can find the actual presentation in this link]