Monthly Archives: May 2015

Any VM abusing your IaaS with excessive workload?

This post continues from the Operationalize Your World post. Do read it first so you get the context.

You provide IaaS to your customers. Typically, you are not given login access to their VMs. As a result, you do not have practical control over the VMs. They can generate excessive workload at any time, without you knowing it.

I’ve seen just 1% of the VM population do damage to the entire cluster. Yes, that means in a cluster of 300 VMs, it only takes 3 VMs generating excessive workload for many other VMs to suffer.

A VM consumes 5 resources:

  • vCPU
  • vRAM (GB)
  • Disk Space
  • Disk IOPS
  • Network (Mbps)

The first 3 you can cap and control. When you give a VM 4 vCPU, 16 GB vRAM and 100 GB of vDisk, that’s all it can take. The Guest OS can run at full speed, doing as much work as it can, and it won’t exceed 4 vCPU, 16 GB vRAM and 100 GB of space.

The last 2 you can also control, but normally you don’t. It takes extra effort, as the default in VMware vSphere is no control (read: unlimited).

You should control them. In fact, you can turn this into additional revenue. Here are some ideas:

  • Every VM comes with 500 IOPS, averaged over a 5-minute period, free of charge. But the customer can pay a flat monthly fee for unlimited IOPS.
  • If you operate in a global environment, where WAN links and Internet bandwidth are not unlimited, you should charge for bandwidth. You define your service offering: every VM comes with 100 Mbps free of charge. The customer can pay a flat monthly fee for unlimited bandwidth.

The above is to encourage proper usage.
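To make the entitlement idea concrete, here is a minimal Python sketch of the overage check. It assumes you already export each VM’s 5-minute average IOPS from your monitoring tool; the 500 IOPS allotment, the VM names and the numbers are illustrative only, not an actual API.

```python
# Hypothetical illustration: flag VMs whose 5-minute average IOPS
# exceeds the allotment included in their service tier.
INCLUDED_IOPS = 500  # free allotment per VM, averaged over 5 minutes

# Sample data: VM name -> 5-minute average IOPS (made-up values)
five_min_avg_iops = {
    "vm-web-01": 120,
    "vm-db-03": 2300,
    "vm-test-77": 640,
}

def find_overage(samples, included=INCLUDED_IOPS):
    """Return VMs exceeding the included IOPS, with the overage amount."""
    return {vm: iops - included for vm, iops in samples.items() if iops > included}

for vm, over in sorted(find_overage(five_min_avg_iops).items(), key=lambda kv: -kv[1]):
    print(f"{vm}: {over:.0f} IOPS above the included {INCLUDED_IOPS}")
```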

The application team does not normally know how many IOPS or how much network bandwidth they need. They also do not know how much network bandwidth and disk IOPS your IaaS can provide. So they may run a test regularly, to ensure your IaaS is good. They may download IOmeter and run 10K IOPS for 5 minutes every day, to ensure your IaaS can handle the load when they need it. That can hit your IaaS badly. You need to track this excessive usage.

Real-life example

It’s easier to understand how the dashboard helps by using a real case.

This is a story from a customer who was hit by a VM running IOmeter. There were 500 VMs in the cluster, and the customer did not know when the hit took place. They only knew it was recent.

To find this, we plot the maximum IOPS from any VM in the cluster over a 1-week period. The line chart shows the maximum IOPS of a VM. It does not matter which VM. If a VM, any VM, generated excessive IOPS, we would know when and for how long. Even if they take turns generating the IOPS, it will be captured, because the super metric evaluates the formula afresh every 5 minutes. At 8 am it can be VM 007; at 8:05 am it can be VM 8888.
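The super metric itself lives in vRealize Operations, but the logic is easy to picture. Below is a rough Python sketch of what “maximum IOPS from any VM, re-evaluated every 5 minutes” means; the sample data structure is hypothetical and simply stands in for whatever your collection tool gives you.

```python
# Hypothetical sketch of the "Max VM IOPS in the cluster" super metric logic.
# samples[timestamp][vm_name] = that VM's average IOPS over the 5-minute interval.
samples = {
    "2015-05-24 08:00": {"VM 007": 900, "VM 8888": 150, "VM 042": 30},
    "2015-05-24 08:05": {"VM 007": 40,  "VM 8888": 1100, "VM 042": 25},
}

# The formula is evaluated afresh for each interval, so a different VM can
# "win" each time -- which is exactly why taking turns does not hide the abuse.
for ts, per_vm in sorted(samples.items()):
    worst_vm, worst_iops = max(per_vm.items(), key=lambda kv: kv[1])
    print(f"{ts}: max = {worst_iops} IOPS (from {worst_vm})")
```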

We are tracking the counter at the Virtual Disk level, not lower. That means the IO is coming from the VM. It is not vSphere taking a snapshot, vCenter doing a Storage vMotion, nor vSAN doing a rebalance.

What number do you expect?

My take is 1,000 IOPS, unless your cluster has a lot of VMs doing heavy IOPS. 1,000 IOPS can do damage, as it’s a 5-minute average. That is actually 1,000 x 5 x 60 = 300K IOs performed in 5 minutes! It’s not normal to issue 300K IOs in 5 minutes flat.

As you can see from the screen, we quickly found the problem.

[Screenshot 1: Maximum and Average IOPS line charts over the past 7 days]

In the preceding 2 line charts, we see a very high peak at 13,212 IOPS. That’s a lot of IO. It is nearly 4 million IOs issued in 5 minutes flat.

I plot 7 days so I can see the extent of the peak relative to normal workload. As you can see, this is not normal workload. It stands out.

The Maximum chart takes care of spotting excessive usage by a single VM. But what about your environment as a whole? And how many VMs are doing this excessive IOPS?

The second line chart shows the Average. Notice it only went up to 15 IOPS. That means this is not a population issue. The peak is likely the work of 1 VM, as the average remains low.

In general, you should expect the average to be <50 IOPS. Remember, it’s 5 minutes sustained. A cluster of 300 VMs averaging 100 IOPS each means your storage is hit by 30,000 IOPS sustained for 5 minutes. That’s a lot of IOPS. You would need SSDs to handle that load.

If the average is near the maximum, and the maximum is high, that means there are a lot of VMs doing high IOPS. Your infra is being hammered.

Let’s zoom into the peak. We can see that it peaked at around 3:17 am on 24 May. We can find out which VM did this. This is one reason I find vRealize Operations powerful: I can zoom into any period of time and get any info about any object.

[Screenshot 2: zooming into the peak at around 3:17 am on 24 May]

To list the VMs doing the IOPS at around 3:17 am on 24 May, I use the Top-N widget. I wanted to know not just the top VM, but all the VMs. I wanted to verify my earlier thought that only 1 VM is doing the excessive IOPS. The Top-N sorts the VMs by IOPS.

Bingo!

We got the culprit. Notice the number (13,212 IOPS) matches the line chart. Notice also that the next VM, at 715 IOPS, is doing far lower IOPS.

[Screenshot 3: Top-N widget listing the VMs sorted by IOPS]

You need to set the period to 5 minutes, as Top-N takes an average over the selected period. So do not select 1 hour, for example, as it will give the average of the entire hour and dilute the spike.
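To see why the 5-minute setting matters, here is a small back-of-the-envelope calculation in Python. It reuses the 13,212 IOPS peak from the story and assumes, purely for illustration, that the other eleven 5-minute intervals in that hour sat at a quiet 15 IOPS.

```python
# Illustration only: how a 1-hour average dilutes a 5-minute burst.
burst = 13_212           # IOPS during the one bad 5-minute interval
quiet = 15               # assumed IOPS for the other intervals (illustrative)
intervals_per_hour = 12  # twelve 5-minute intervals in an hour

hourly_avg = (burst + quiet * (intervals_per_hour - 1)) / intervals_per_hour
print(f"5-minute view: {burst} IOPS; 1-hour average: {hourly_avg:.0f} IOPS")
# -> roughly 1,115 IOPS: still visible, but an order of magnitude smaller.
#    A shorter or milder burst could disappear from the Top-N entirely.
```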

Dashboard

This is what the dashboard looks like. We have Storage on the left and Network on the right.

For Network, the number is in KBps, as it follows what you see in vCenter. We convert it into Mbps using a super metric.

Implementation

You can download the dashboard here. Do note that you get all 50 dashboards, as it’s part of a set.

Paul Armenakis gave constructive feedback that I should include the super metric formulas. Thanks Paul; here they are:

  • Max(${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=virtualDisk|commandsAveraged_average, depth=2})
  • Avg(${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=virtualDisk|commandsAveraged_average, depth=2})
  • Max(${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=net|usage_average, depth=2}) * 8 / 1024
  • Avg(${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=net|usage_average, depth=2}) * 8 / 1024

One thing I like about super metrics is that you can also use them to convert values. The default network unit in vSphere is KBps, and I convert it into Mbps.
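The * 8 / 1024 at the end of the network formulas is exactly that conversion. As a quick sanity check, here is the same arithmetic in Python; the sample value is made up.

```python
def kbps_to_mbps(kbps):
    """Convert vSphere's KBps (kilobytes per second) into megabits per second."""
    return kbps * 8 / 1024   # x8: bytes -> bits, /1024: kilo -> mega

# Example: a VM reporting 12,800 KBps in vCenter is pushing 100 Mbps.
print(kbps_to_mbps(12_800))  # -> 100.0
```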

As an example, here is how you create the super metrics. For step-by-step instructions on how to create a super metric, please see this.

[Screenshot: creating the super metric]

For Network, if this number is >1 Gbps most of the time, it is high. Unless you have a VM doing very high traffic, the line chart will not be consistently high. Please note that the maximum is 2 Gbps, as it’s full duplex.

For the Average utilisation, I’d expect this to be <100 Mbps. Remember, it’s 5 minutes sustained. Just as most VMs are not storage intensive, most VMs are not network intensive either.

Enhancement

The line chart tells you the pattern over time. But it does not tell you, at a glance, the distribution among the VMs. The Top-N lists the VMs, but it’s not scalable. You can list the top 20 or 40, but you won’t list the top 2000; that would not make sense either. Also, you cannot see whether latency is affected.

This is where the Heat Map widget comes in handy. We can plot all the VMs. You can have >1000 VMs and still tell visually at a glance. Each box or square represents 1 VM. I’ve shown an example below. This is not from the same time period, and the peak has subsided.

[Screenshot 4: heat map of VM IOPS, colored by latency]

I color the above heat map by latency. You can set the threshold to any value you want. I set 15 ms as red, so I can tell quickly if any VM experiences latency of 15 ms or more. From the above, I do have a few VMs with 15 ms or more of latency, but in general they are good.

If you see a big box, that means you have a VM doing excessive IOPS. I do not have such a scenario in the above. There are some VMs doing IOPS, but the box sizes are quite well spread. The tiny boxes are normal, as most VMs should be relatively idle from an IOPS viewpoint.

For Network, what do you use for the Color?

We do not have a Network Latency counter in vSphere. You can use Dropped Packets as the “latency” indicator. So expect to see green, as VMs should not be dropping packets. If you know that your network is healthy, you can be creative and use Network Usage for the color as well. Why use Usage for both color and size? It does not seem logical, right? I’ll let you ponder for a while; review the sample below, where I use Network Usage for both color and size.

[Screenshot 5: heat map using Network Usage for both size and color]

Managed to figure out why?

Good. The issue with size is that it is relative; it does not tell me the exact or absolute amount. I want to see at a glance who is exceeding my threshold. If I set my threshold at 100 Mbps, then anyone exceeding that number will show up in bright red!
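In other words, size answers “who is biggest relative to everyone else”, while color answers “who crossed my absolute line”. Here is a tiny Python sketch of that coloring logic, with a hypothetical 100 Mbps threshold and made-up usage numbers.

```python
# Illustrative heat map coloring: size is relative, color is an absolute threshold.
THRESHOLD_MBPS = 100  # anything above this shows up red

vm_network_mbps = {"vm-proxy-01": 640.0, "vm-app-12": 85.0, "vm-idle-30": 2.5}

def color_for(usage_mbps):
    """Map absolute usage to a color bucket, independent of the other VMs."""
    if usage_mbps > THRESHOLD_MBPS:
        return "red"
    if usage_mbps > THRESHOLD_MBPS * 0.7:   # getting close to the line
        return "yellow"
    return "green"

for vm, usage in vm_network_mbps.items():
    print(f"{vm}: size ~ {usage} Mbps, color = {color_for(usage)}")
```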

Limitation

Say you run a vSphere farm of 1000 VMs. You know that you do have large databases doing heavy IOPS. You also have high-traffic web servers consuming a lot of network. Even if you have just 1 of each of these VMs, it will render your Max super metric useless, as the result is dominated by these known heavy hitters.

So what can you do? You exclude them. Create 2 groups:

  1. First group is just these high-usage VMs.
  2. Second group is for the rest.

Once you separate them, it’s easier to manage.
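Here is a rough Python sketch of that split. The group membership and the numbers are hypothetical; in vRealize Operations you would do this with custom groups and scope the super metrics to each group.

```python
# Illustrative split: known heavy hitters vs. the general population, so the
# "Max VM IOPS" line is not permanently dominated by a few big databases.
known_heavy_hitters = {"vm-oracle-01", "vm-oracle-02", "vm-web-frontend"}

cluster_iops = {
    "vm-oracle-01": 9_500, "vm-oracle-02": 7_200, "vm-web-frontend": 3_100,
    "vm-app-07": 220, "vm-file-02": 85, "vm-test-11": 12,
}

heavy   = {vm: v for vm, v in cluster_iops.items() if vm in known_heavy_hitters}
general = {vm: v for vm, v in cluster_iops.items() if vm not in known_heavy_hitters}

# Track the maximum separately; a spike in the general group now stands out.
print("Heavy-hitter max:", max(heavy.values()))
print("General max     :", max(general.values()))
```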

By the way, this dashboard is at the cluster level. Its purpose is to complement the Performance Monitoring dashboard, which is also at the cluster level. It is not for overall monitoring. We have separate dashboards for your Storage Heavy Hitters and Network Top Consumers.

Hope you find the idea useful. Apply it in your environment, and let me know your findings! 🙂

Mastering vRealize Operations: a great book by Scott Norris and Christopher Slater

I had the privilege of writing the foreword for Scott and Chris’s book. It’s great to see it published. I’m including the actual foreword I wrote, as it says what I need to say about this book. Go get one and master it!

[Image: book cover]

————— the actual foreword in the book —————–

When Scott and Chris approached me to write a foreword for their book, I jumped on it right away. As stated in my book, VMware vRealize Operations Performance and Capacity Management, Packt Publishing, I excluded a lot of areas in order to stay within the self-imposed page limit. They have read my book, and it gives me sheer joy as an author when another author takes your work and then complements it well. Having read the final product, I am in fact going to change the strategy for my second edition. I am going to refer to this book a lot as they have explained the concepts better than I did. These are indeed “the missing chapters” in my book!

I highly recommend this book to you. Whether you are new to vRealize Operations, or you have a large scale deployment, there is something for you. The book covers the product top-down. I have read the vRealize Operations manuals and white papers, and I think this book stands out. It stands out not because the manuals are not good, but because the book was written by practitioners and field personnel. Just like my book, it was born at the customer’s site with real-world input. Scott and Chris have done numerous vRealize Operations implementations, and the content of this book reflects their valuable experience.

Speaking about the product of this book, vRealize Operations has gained the acceptance of many customers globally. It has also gained industry acceptance, as you can see from the many management packs provided by partners. It has evolved from a vSphere-centric management tool to an overall SDDC management tool. vRealize Operations is also undergoing improvement every year. By the time you read this book, there is a good chance that an updated release will be out. The good thing is that the foundation set in this book is required for you to master the new version.

I am certainly glad to see that both the books are published by the same publisher, as this makes future collaboration easier. Together with many bloggers and practitioners out there, we can make a significant contribution toward a great vRealize Operations deployment.

Iwan ‘e1’ Rahabok

Completed 7 years at VMware. Why the next 3 will be more exciting!

Today, I completed 7 years at VMware. All of us who have been doing VMware for a long time (I consider 7 years a long time in x86 virtualisation) know that we did not predict it would have such a massive impact on the industry. I’m a technologist, and yet 7 years ago I had no idea I’d be doing what I’m doing today. There are many stories that you virtualisation old timers can share; please share them in the comments section. I think we can all agree it’s been a great journey worth taking.

While the past 7 years have been much more than I expected, the next 3 years will be even more exciting. I think we are in a period where the IT industry is being fundamentally redefined. There are many large waves overlapping, each with its own ramifications. It is truly the survival of the smartest in this technology-driven industry!

There are a few major trends that I’m seeing. I think they are clear enough that each is no longer a Prediction, but rather just a Projection. We may be wrong about the rate of adoption, but the adoption trend is there. It is a matter of When, not If, anymore.

At the end of 2014, I wrote my 2015 projection. I’d expand on that, as I’m looking at the next 3 years now. Our industry is undergoing an interesting change. No large vendor is safe now. You can be a multi-billion dollar business, dominating your industry for decades, and yet the very core of your offering is being attacked. If it is not being attacked, it is being made less relevant. How each 800-pound gorilla plays its game is an interesting movie to watch!

The underlying current causing this massive ripple consists of several trends. Each takes the shape of an industry trend, a technology trend, or both. The largest trend, Cloud Computing, is so obvious that I won’t cover it here. Let’s look at some of the trends and see if you agree. Do post in the comments, as I’m keen to hear your opinion.

UNIX to X86 migration continues.

  • UNIX is still a >$5 billion industry. It is mostly used in mission-critical environments, running older (more proven) versions of business applications (e.g. core banking). This will provide the demand for x86 virtualisation vendors. I see a lot of migration consulting and mission-critical support opportunities over the next 3 years. The migration requires consulting as it’s an application-level migration, which also explains why it is moving slowly. I also see customers buying more x64 hardware to handle the UNIX migration, instead of using their existing hardware.
  • In the next 3 years, I see customers beefing up their virtual x86-64 platforms to handle this mission-critical workload. Specific to VMware, more customers are adopting mission-critical support. Deeper VMware knowledge from IT professionals will be required and appreciated by your customers.

The x86-64 architecture is becoming good enough for more and more use cases.

  • IMHO, this provides the crucial support for SDDC. You cannot have SDDC if the hardware is not a uniform pool of resources. The software needs to be able to use any hardware, so the hardware has to be standardised. No more proprietary ASICs in most cases. Data centers are standardising from lots of different hardware to basically x86-64. You run all your core software (e.g. firewall, switch, router, LB) on the same common hardware. Because the hardware is common, it becomes a commodity, and the stickiness is reduced.
  • The above standardisation has other benefits. One is simplicity at the physical level. The overall DC footprint is shrinking as a result of being able to share the hardware. 1000 VMs per rack, complete with storage and network, is becoming common.
  • In the next 3 years, I see customers standardising the hardware in their DC on commodity servers. Instead of physical firewalls, load balancers, storage, etc., they are just buying white-box servers. A good example is Super Micro.
  • VMware IT Pros have the chance to expand their skills to the entire data center, as virtualisation expands to cover the rest of the data center.

Classic Storage will be disrupted.

  • The combined effect of SSD and 10G Ethernet has hit the classic dual-controller array. You can see that revenue from the classic products has flattened in the past several years. What’s saving the mid-range is that the high-end is migrating down to it. What’s saving the entry level is that demand for storage continues to grow. In all cases, the budget per box has shrunk.
  • In some deployments, mid-range and low-end storage has started to disappear. It follows the top-of-rack switch, which was virtualised when customers virtualised their servers. While the replacement process will take years, the snowball effect has started.
  • The above trend has impacted Fibre Channel too. When the actual array is virtualised, there is no more need for the fabric. As you can see here, adoption has been lagging in the past several years. I think that’s a clear sign of decline. The winner, apparently, is not iSCSI or NFS. It’s a different protocol, which I’d just call distributed storage. For example, VSAN does not use iSCSI or NFS. In fact, it does not use VMFS either.
  • Because a server now needs to have lots of local storage, I see the rack-mount form factor making a comeback. The converged form factor is also gaining momentum, as it allows for density that matches blades.
  • There is a business driver behind all the above. Distributed storage is cheaper. It has lower CAPEX requirements, and regular expansion is also cheaper.
  • As storage moves into virtualisation, VMware IT Pros have the chance to expand their skills to cover storage too. I have been working with a customer on a 12-node VSAN cluster, and it’s certainly more fun than working with a classic dual-controller array.

Network will be defined in software

  • The trend impacting Server and Storage will also hit Network. Instead of having dedicated and proprietary network equipment, customers are simplifying and standardising at the physical layer.
  • The VM network will be decoupled from the physical network. This makes operations easier. No need to wait weeks for IP addresses and firewall configuration from a central team.
  • Every customer that has implemented stretched L2 across data centers has told me it’s complex. The main reason for stretching L2 is Disaster Recovery. With NSX, you can achieve it without the risk of spanning tree. It is also much cheaper, which is what customers tell me.
  • As the network moves into virtualisation, VMware IT Pros have the chance to expand their skills to cover networking too.

Management will be built-in, not bolted-on

  • Once the 3 pillars of the data center (Server, Storage, Network) are virtualised and defined in software, management has to change. Managing an SDDC is very different from managing a physical DC. I’ve seen how the fundamentals have changed.
  • In a traditional data center, management is typically outside the scope of the VMware IT Pro. There is another team, which typically uses the Big 4 (IBM Tivoli, HP Open View, CA Unicenter, BMC) to manage the data center. In an SDDC, I see the VMware IT team expanding their presence and owning this domain too.

All in all, I think the work and career of the VMware IT Pro will be more exciting in the next 3 years. I do enjoy discussions with customers where the scope is the entire data center, instead of just the “server” portion. Have you started architecting the entire DC? I’m sure it’s complex, but do you like the broader scope better? Let me know in the comments below if you think you will be playing the role of SDDC Architect in the next 3 years!