Category Archives: Architecture

This category covers both architecture and engineering. It does not cover operations and strategy.

Announcing VMware Performance & Capacity Management, 2nd edition


I have been waiting a long time to be able to post this. The book started around Dec 2014, when the writing of the 1st edition was complete and the publisher set a cut-off date for changes. I knew many items were not covered. It was 1.0, after all.

Fast forward to February 2016, and we have revamped the content. Judging by both the amount of effort and the resulting book, this is more of a 2.0 than a 1.1 to me. Page-wise, it is 500+ pages, double the 1st edition. You can see the structure of the 2nd edition in this link. I have tried to codify the knowledge I have into a structured process.

It’s surprising how much has changed in just 14 months. I certainly did not expect some of the changes back in Jan 2015!

  • Major improvement in monitoring beyond vSphere.
    • vRealize Operations and its ecosystem have improved greatly in Storage, Network and Application monitoring. This includes newer technology such as VSAN and NSX.
    • Many adapters (management packs) and content packs were released for both vRealize Operations and Log Insight. I’m glad to see a thriving ecosystem. Blue Medora especially has moved ahead very fast.
  • Rapid adoption of NSX and VSAN, so I had to add them. They were not part of the original 2nd edition plan.
  • Rapid adoption of VDI monitoring using vRealize, so I had to include VDI use cases.
  • Adoption by customers, partners and internal teams has increased.
    • In the original plan, I wasn’t planning on asking any partners to contribute, so I was surprised that 2 partners agreed right away.
    • It is much easier to ask for reviews, as people are interested and want to help.
  • vSphere 6.0 and 6.0 Update 1 were released.
    • Since the book focuses on Operations (not Architecture) and monitoring, the impact of both releases is minimal.
    • Not many counters changed compared with vSphere 5.5.
  • vRealize Operations had 6.1 and 6.2 releases. Log Insight had many releases too.
    • Again, this has minimal impact, since the book is not a product book.

Release Notes of 2nd edition

The existing 8 chapters have been expanded and reorganized, resulting in 15 chapters. The book now has 3 distinct parts, structured in that specific order to make it easier for you to see the big picture. You will find the key changes versus the 1st edition below.

More complete

  • More explanation of Performance and Capacity Management.
  • Elaborates on the Performance SLA concept, as it has resonated with customers (from engagements, events and blogging).
  • Adds Network monitoring, with a focus on NSX.
  • Adds VSAN monitoring.
  • Adds Horizon View monitoring, with practical tips like this.
  • Incorporates monitoring that is better done via VMware Log Insight.
  • Adds application-level monitoring. I asked Blue Medora to contribute as they know this better than I do.

More practical, less theory.

  • Moves the more theoretical content to the back.
  • Adds more examples, structured so readers can see the relationships.

Easier to read

  • Fewer long sentences, long paragraphs and complex tables.
  • More bullet points.
  • Breaks long chapters into smaller chunks.
  • Adds more white space in places that were full of text.
  • More diagrams, to complement the explanations.
  • Lighter language. A friendly chat among friends, not a formal research paper.
  • Adds humour.
  • Adds adult pictures. OK, that is not a good idea.
  • Clearer pictures. Some pictures were too small.
  • Clearer headings and layout. The Heading 2 style was too big relative to the body text.

Fix title

  • The book is actually not just for vRealize Operations users. It’s for the broader VMware team. It is more of a vSphere book than a vRealize book.
  • This also makes the darn title shorter 🙂 Yes, in future it will evolve to just SDDC Operations Management.

What does not change?

  • It remains focused on Performance and Capacity. I’m not adding Configuration, Availability, Security, etc.
  • The book also remains a solution book, not a product book. There is already a great product book here by a fellow CTO Ambassador.

I will provide as much free information as the publisher allows. We are looking at an early April publication, so in the meantime, here is what they have made available. When they have officially released it, I’ll add more information, such as proper acknowledgements to those who have made the book possible, and certainly a discount code.

If you want to publish a review on your blog or LinkedIn, I'll connect you with Packt.

I hope you find it useful. For any corrections or suggestions, let me know on Twitter or LinkedIn.


PS: No, please don’t ask me about the 3rd edition. Right now I need a break 🙂 and to spend time with family! Below are my wife, my 2 girls and my 1st niece.

[Family photo]

VMware Performance SLA

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Google “performance SLA” VMware, and you will find only a few relevant articles. The string performance SLA has to be inside quotes, as I’m not searching for performance and SLA separately, but for the phrase Performance SLA. Yes, I’m after web pages with the words Performance SLA together. You will get many irrelevant results if you simply google VMware Performance SLA without the quotes.

I just tried it again (2 April 2017). It’s only 6,000 results, up from 2,330 results in Nov 2016 and 1,640 results in Oct 2015. The first 10 are shown below. Notice that 8 of them are actually from my blog, book or events. Ask your peers, and you will not find many customers with a Performance SLA.

I didn't change the screenshot below, as it's similar to what I got in April 2017.

[Screenshot: Google search results for “performance SLA” VMware, Nov 2016]

I checked beyond the first 10 results. Other than my own articles, Google returned only 5 relevant articles; the rest are not relevant. An example of a relevant article is one by a former colleague and good friend, Michael Webster. All the relevant articles are good and informative, and they do mention Performance SLA. They just do not define and quantify what a Performance SLA is. If something is not quantified, it is subjective. It’s hard to reach a formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂

A former colleague, Scott Drummonds, covered Performance SLA in his old blog back in 2010. It is unsurprising to me, knowing Scott, that he had thought about it years ago! However, what he covered was the Application layer, not the IaaS layer. He also did not provide a counter to measure. Certainly, it was virtually impossible to provide that years ago, given the maturity of IaaS at the time.

Availability SLA protects you when there is downtime. Performance SLA protects you when there is performance issue. How?

If you are within your SLA, you are safe.

Below is an example of a Performance SLA. Describe (or define) the service for each of the 4 infrastructure components (CPU, RAM, Disk and Network).

Take note:

  • Do not set the SLA at the individual vCPU level. Set it at the whole-VM level. It’s much harder to comply with, and monitor, per vCPU.
  • Do not set separate SLAs for Read and Write latency. Set one at the aggregate level.
  • All numbers are measured as 5-minute averages. If a spike lasts only 1 minute and then calms down for the remaining 4 minutes, it won’t show up.
  • To change the power management settings, see this KB (VM application runs slower than expected in ESXi).
  • The SLA impacts your architecture. It poses a constraint, as it sets a formal threshold.
    • Example: how do we ensure Tier 3 storage does not impact Tier 1 when they are on the same array (spindles, CPU, etc.)? Some storage arrays have internal shares. Even on vSAN, this is challenging.

The above is just an example; your policy as an IaaS provider may vary. For each component, list all the properties that impact the quality of the service.

Notice what’s missing in the table?

Something that you normally have if you are doing Capacity Management based on a spreadsheet 🙂

Yup, it’s the Consolidation Ratio.

It’s not there because it’s not relevant. In fact, the ratio can be misleading, as it does not take into account VM utilization, VM size, ESXi power management, DRS, backup periods, etc. It is a good guide for initial planning, but once you are in production, you need to monitor based on what’s actually happening in production. Performance experts like Mark Achtemichuk have explained it well here. I recommend you read it first.

Done?

Great! Let’s dive deeper. I will take one component, Storage, as it’s the easiest to understand.

  • VM Disk Latency for Tier 1: 5 ms
  • VM Disk Latency for Tier 2: 15 ms
  • VM Disk Latency for Tier 3: 25 ms
  • All values measured as 5-minute average.
  • The SLA is breached when the value exceeds the threshold in any given 5-minute interval, 365 days a year.

When a VM owner complains that her VM is slow because of storage, and that VM resides on Tier 2 storage, both of you can see the VM disk latency. If it’s below 15 ms, it’s not your fault. Perhaps her application needs faster storage, and she can pay more and upgrade to Tier 1. If it’s higher than 15 ms, you as the IaaS provider do not even have to wait until she complains 🙂 Better still, do something before she notices.
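To make that check concrete, here is a minimal, purely illustrative sketch in Python. The thresholds come from the Storage example above; the function name and sample data are mine, not from any VMware tool.

```python
# Purely illustrative: tier thresholds (ms) from the example above,
# checked against a VM's 5-minute average disk latency samples.
DISK_LATENCY_SLA_MS = {"Tier 1": 5, "Tier 2": 15, "Tier 3": 25}

def sla_breaches(tier, samples_ms):
    """Return the 5-minute samples that exceed the tier's latency SLA.

    Any single 5-minute average above the threshold counts as a breach,
    365 days a year.
    """
    threshold = DISK_LATENCY_SLA_MS[tier]
    return [s for s in samples_ms if s > threshold]

# Example: a Tier 2 VM with one bad 5-minute interval
print(sla_breaches("Tier 2", [3.1, 7.8, 16.2, 9.0]))  # -> [16.2]
```

In practice vRealize Operations does this check for you (see the Alerts section below); the sketch just shows how little is needed to make the agreement objective.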

What number should you set?

  • If you have no data, the above is a good starting point.
  • If you have vR Ops running, you can set a number based on your actual data.
    • There is no point in setting something much higher or lower than what you actually have.
    • Use the super metric preview as shown below.
    • I plotted 1 month of data (that’s ~8,650 data points). That’s more than enough.
    • Take the maximum. That’s your baseline (a small sketch follows below).
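If you export the raw super metric data (for example, 1 month of 5-minute averages) out of vR Ops, the baseline logic itself is trivial. Below is a minimal sketch with hypothetical numbers; the super metric itself is defined in the vR Ops UI, this is just the arithmetic.

```python
# Purely illustrative: derive a Performance SLA baseline from ~1 month of
# 5-minute average latency samples (ms) exported from vR Ops.
def sla_baseline(samples_ms):
    """Take the worst (maximum) 5-minute average observed as the baseline."""
    return max(samples_ms)

month_of_samples = [4.2, 6.8, 5.1, 9.7, 7.3]   # in reality ~8,650 data points
print(sla_baseline(month_of_samples))           # -> 9.7
```

From that baseline you would typically round up to a threshold that is easy to communicate (say 10 ms or 15 ms), rather than promising the raw maximum.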

Alerts

Now… you have thousands of VMs under your management. I guess what you want is to be alerted when any of them breaches the SLA you promised.

Yes, vRealize Operations can alert you, so you can proactively do something before the VM owner complains. Brandon Gordon, Integration Architect at VMware, showed me how we can achieve the above in vRealize Operations.

See the screenshot below, courtesy of Brandon.

[Screenshot: SLA alert in vRealize Operations]

To get such alerts for each VM, you need to create the alert definitions. Brandon has defined them for CPU, RAM and Disk, and for each tier.

[Screenshot: SLA alert definitions per tier]

Hope you find it useful, just as many of my customers have. If not, drop me a note.

1000 VM per rack is the new minimum

The purpose of this article is to drive home the point that you need to look at the entire SDDC, not just one component (e.g. Compute, Storage, Network, Security, Management). Once you look at the whole SDDC infrastructure in its entirety, you may be surprised that everything fits into just 1-2 racks!

The purpose is not to say that you must achieve 1000 VM per rack. It is also possible that you can’t even achieve 100 VM per rack (for example, if you are running all Monster VMs). I’m just using a visual so it’s easier for you to see that there is a lot of inefficiency in a typical data center.

If your entire data center shrinks into just 1 rack, what happens to the IT Organisation? You are right, it will have to shrink also.

  • You may no longer need 3 separate teams (Architect, Implement, Operate).
  • You may no longer need silos (Network, Server, Storage, Security).
  • You may no longer need the layers (Admin, Manager, Director, Head).

With fewer people, there is less politics and the whole team becomes more agile.

The above is not just my personal opinion. Ivan Pepelnjak, a networking authority, shared back in October 2014 that “2000 VMs can easily fit onto 40 servers”. I recommend you review his calculation in this blog article. I agree with Ivan that “All you need are two top-of-rack switches” for your entire data center. Being a networking authority, he elaborates from the networking angle. I’d like to complement it from the server angle.

Let’s do a quick calculation to see how many VMs we can place in a standard 42 RU rack. I’ll use Server VMs, not Desktop VMs, as they demand a higher load.

I’ll use a 2 RU, 4-ESXi-host form factor, as this is a popular form factor. You can find an example at the Supermicro site. Each ESXi host has 2 Intel Xeon sockets and all-flash local SSDs running distributed virtual storage. With the Intel Xeon E5-2699 v3, each ESXi host has 36 physical cores. Add a ~25% Hyper-Threading benefit, and you can support ~30 VMs with 2-3 vCPUs each, as there are enough physical cores to schedule the VMs.

The above takes into account that a few cores are needed for:

  • VMkernel
  • NSX
  • VSAN
  • vSphere Replication
  • NSX services from partners, which take the form of VMs instead of kernel modules.

That’s 30 VMs for each ESXi host, a 30:1 consolidation ratio, which is a reality today. You have 4 ESXi hosts in a 2 RU form factor, so 30 x 4 = 120 VMs fit into 2 RU of space. Let’s assume you standardise on an 8-node cluster and do N+1 for HA. That means a cluster with HA will house 7 ESXi hosts x 30 VMs = 210 VMs. Each cluster occupies only 4 RU, and it comes with shared storage.

To hit ~1500 VMs, you just need 7 clusters. In terms of rack space, that’s just 7 x 4 RU = 28 RU.


A standard rack has 42 RU. You still have 42 – 28 = 14 RU. That’s plenty of space for Networking, Internet connection, KVM, UPS, and Backup!
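Here is the same arithmetic as a short, purely illustrative sketch. The numbers mirror the ones in this post; plug in your own consolidation ratio and hardware.

```python
# Purely illustrative: rack capacity estimate using the numbers in this post.
vms_per_host = 30          # ~30 VMs of 2-3 vCPU per dual-socket ESXi host
hosts_per_2ru = 4          # 2 RU, 4-node form factor
cluster_size = 8           # standard 8-node cluster
ha_spare = 1               # N+1 for HA

vms_per_cluster = (cluster_size - ha_spare) * vms_per_host   # 7 x 30 = 210
ru_per_cluster = cluster_size // hosts_per_2ru * 2           # 8 hosts = 4 RU

clusters = 7
total_vms = clusters * vms_per_cluster    # 1,470 VMs (~1,500)
compute_ru = clusters * ru_per_cluster    # 28 RU of compute, incl. storage
spare_ru = 42 - compute_ru                # 14 RU left for network, KVM, UPS

print(total_vms, compute_ru, spare_ru)    # 1470 28 14
```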

Networking will take only 2 x 2 RU. You can get 96 ports per 2 RU; Arista has models you can choose from here. Yes, there is no need for a spine-leaf architecture, which simplifies networking a lot.

The KVM will take only 1 RU. With iLO, some customers do not use a KVM at all, as a KVM encourages physical presence in the data center.

If you still need a physical firewall, there is space for it.

If you prefer external storage, you can easily put 1,400 VMs into a 2 RU all-flash array. Tintri has an example here.

I’ve provided a sample rack design in this blog.

What do you think? How many racks do you still use to handle 1,000 VMs?

Updates

  • [7 Nov 2015: Tom Carter spotted an area I overlooked. I forgot to take into account the power requirements! He was rightly disappointed, and this is certainly disappointing for me too, as I used to sell big boxes like the Sun Fire 15K and HDS 9990! On big boxes like those, I had to ensure that the customer’s data center had the correct CEE-form power connectors. Beyond just the amperage, you need to know whether the supply is single-phase or three-phase. So Tom, thank you for the correction! Tom provided his calculation in Ivan’s blog, so please review it.]
  • [15 Nov 2015: Greg Ferro shared in his article that 1,000 VMs per rack is certainly achievable. I agree with him that it’s a consideration, not a goal or a limit. It all depends on your application and situation.]
  • [27 Mar 2016: The Intel Xeon E5-2699 v4 delivers 22 cores per socket, up from 18 cores in the v3.]