
Multi-hypervisor consideration

My customer was considering adding a second hypervisor because the Analysts say it is a common practice. My first thought as an IT Architect is: just because others are doing it does not mean it is a good idea. Even if it is a good idea, and even a best practice, that does not mean it is good for you. There are many factors that make your situation and conditions different from others'.

Before I proceed further, we need to be clear on the scope of the discussion. This is about multi-hypervisor vs single-hypervisor. It is not about hypervisor A vs B. To me, you are better off running Hyper-V or Acropolis or vSphere completely than running more than one. At least you are not doubling complexity and needing to master both. If you cannot even troubleshoot vSphere + NSX + VSAN properly, why add another platform into the mix?

To me, one needs to be grounded before making a decision. This allows us to be precise. Specific to the hypervisor, we need to know which cluster should run the 2nd hypervisor. Because of HA/DRS, a vSphere cluster is the smallest logical building block. I treat a cluster as 1 unit of computing. I make every member run the same ESXi version and patch level; hence running a totally different hypervisor in the same vSphere cluster is out of the question for me.

In order to pinpoint which cluster should run the 2nd hypervisor, you need to look at your overall SDDC architecture. This helps you ensure that the 2nd hypervisor fits well into the overall architecture. So start with your SDDC Architecture. You have that drawing, right? 😉

I have created a sample for 500 server VM and 1000 VDI VM. Review it and see where you can fit the 2nd hypervisor. For those with larger deployments, the sample I provided scales to 2000 server VM and 5000 VDI VM. That's large enough for most customers. If yours is larger, you can use it as a Pod.

It's a series of posts, and I go quite deep. So grab your coffee and review it carefully.

I am happy to wait 🙂

Done reviewing? Great!

What you need to do now is come up with your own SDDC Architecture. Likely it won't be as optimized and ideal as mine, as yours has to take into account brownfield reality.

You walk from where you stand. If you can't stand properly, don't walk.

Can you see where you can optimize and improve your SDDC? A lot of customers can improve their private cloud, gaining better capability while lowering cost and complexity, by adding storage virtualization and network virtualization. If what you have is just server consolidation, then it is not even an SDDC. If you already have an SDDC but you're far from AWS or Google level of efficiency and effectiveness, then adding a 2nd hypervisor is not going to get you closer. Focus first on getting to SDDC or Private Cloud.

Even if you have the complete architecture of SDDC, you can still lower cost by improving Operations. Review this material.

Have you designed your improved SDDC? If you have, there is a good chance that you have difficulty placing a 2nd hypervisor. The reason is that a 2nd hypervisor de-optimizes the environment. It actually makes the overall architecture more complex.


The hypervisor, as you quickly realized, is far from a commodity. Here is a detailed analysis on why it is not a commodity.

This additional complexity brings us to the very point of adding a 2nd hypervisor: the objective. There are only 2 reasons why a customer adds a second vendor to their environment:

  • The first one does not work
  • The first one is too expensive

Optimizing SDDC Cost

In the case of VMware vSphere and SDDC, I think it is clear which one is the reason 🙂

So let's talk about cost. With every passing year, IT has to deliver more with less. That's the nature of the industry, hence your users expect it from you. You're providing an IT service. Since your vendors and suppliers are giving you more with less, you have to pass this on to the business.

If you look at the total IT cost, the VMware cost is a small component. If it were a big component, VMware's revenue would rival that of the IT giants. VMware's revenue is much smaller than many IT giants', and I'm referring to just the Infrastructure revenue of these large vendors. For every dollar a CIO spends, perhaps <$0.1 goes to VMware. While you can focus on reducing this $0.1 by adding a second hypervisor, there is an alternative. You can take the same virtualization technology that you've applied to Server and apply it to the other 2 pillars of the Data Center. Every infrastructure consists of just 3 large pillars: Server, Storage, and Network. Use the same principles and experience, and extend virtualization to the rest of your infrastructure. In other words, evolve from Server Consolidation to SDDC.

What if Storage and Network are not something you can improve? In a lot of cases, you can still optimize your Compute. If you are running 3-5 year old servers, moving to the latest Xeon will help you consolidate more. If your environment is small, you can consider single-socket hosts. I wrote about it here. Reducing your socket count means fewer vSphere licenses. You can use the savings to improve your management capability with vRealize Operations Insight.
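
To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch. The host counts, socket counts and per-socket price are illustrative assumptions, not actual VMware pricing.

```python
# Back-of-the-envelope only: all figures below are illustrative assumptions.
def socket_savings(current_hosts, current_sockets_per_host,
                   new_hosts, new_sockets_per_host,
                   license_cost_per_socket):
    """Estimate licence budget freed by refreshing onto fewer, denser sockets."""
    current_sockets = current_hosts * current_sockets_per_host
    new_sockets = new_hosts * new_sockets_per_host
    saved_sockets = current_sockets - new_sockets
    return saved_sockets, saved_sockets * license_cost_per_socket

# Example: 20 ageing 2-socket hosts consolidated onto 12 single-socket hosts.
sockets, budget = socket_savings(20, 2, 12, 1, license_cost_per_socket=4000)
print(f"Sockets saved: {sockets}, licence budget freed: ${budget:,}")
# The freed budget is what you could redirect into vRealize Operations Insight.
```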

Even without this article, a lot of you realize that adding a 2nd hypervisor is not the right thing to do. I heard it directly from almost every VMware Architect/Engineer/Administrator on the customers' side. You're trading cost from one bucket to another. This is because the hypervisor is not merely a kernel that can run VMs. That small piece of software is at the core of your SDDC. Everything else on top depends on it and leverages its API heavily. Everything else below is optimized for it. It is far from a commodity. If you have peers who still think it's a commodity, I hope this short blog article helps.

Have an enjoyable journey toward your SDDC, whichever hypervisor it may be.

VMware SDDC Architecture: sample for 500 – 2000 VM

If you are to architect a virtual infrastructure for 500 VM, what will it look like? The minor details will certainly differ from one implementation to another; however, the major building blocks will be similar. The same goes for, say, a 2000 VM class environment. Given the rapid improvement in hardware price/performance, I categorise 500 – 2500 VM as a medium SDDC. 2500 VM used to be a large farm, occupying rows of racks in a decent-size data center. I think in 2016 it will be merely 1-3 racks, inclusive of network and storage! What used to be a whole data center has become the size of a small server room! Yes, a few experts are all it takes to manage 2500 VM.

Before you proceed, I recommend that you read the official VMware Validated Design. I've taken ideas and solutions from there and added my 2 cents. You should also review EVO SDDC, especially this interview with Raj Yavatkar, VMware Fellow and Lead Development Architect of EVO SDDC.

I'm privileged to serve customers with >50K server VM, and even more desktop VM. For customers with >10K VM per physical data center, I see the 2015-Q4 Pod consisting of 2 racks and housing 2000 – 3000 VM. While data center power supply is reliable, most customers take rack failure or rack maintenance into account. So the minimum size is 2 physical racks. The pod is complete and can stand on its own. It has server, storage, network and management. It is 1 logical unit, and managed as 1. The pods are patched, updated, secured and upgraded as 1. A pod may not have its own vCenter, as you do not want too many vCenters; that increases complexity. Because of this operations and management challenge, a pod has only 1 hypervisor. You either go with a VMware SDDC Pod, or a pod from another vendor. If you are going with multiple hypervisors, then you will create 2 independent pods. Each pod will host a similar number of VMs. If your decision to go with multiple hypervisors is because you think they are a commodity, read this blog.
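
To illustrate the pod idea with a rough sketch, the function below carves a large estate into self-contained 2-rack pods. The 2,500 VM per pod figure is simply an assumption taken from the 2000 – 3000 range above, not a fixed rule.

```python
import math

def pods_needed(total_vms, vms_per_pod=2500, racks_per_pod=2):
    """Split a large estate into self-contained pods of ~2 racks each."""
    pods = math.ceil(total_vms / vms_per_pod)
    return pods, pods * racks_per_pod

# Example: a 50,000-VM estate.
pods, racks = pods_needed(50_000)
print(f"{pods} pods across {racks} racks, each patched, secured and upgraded as 1 unit")
```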

I understand that most customers have <2500 VM. In fact, in my region, if you have >1000 VM, you are considered large. So what does a pod look like when you don't even have enough workload for half a pod? You don't have economies of scale, so how do you optimize a smaller infrastructure?

Architecture is far from trivial, so this will be a series of blogs.

  1. Part 1 (this blog) will set the stage, cover Requirements, provide Overall Architecture and summary.
  2. Part 2 covers Network Architecture
  3. Part 3 will cover Storage Architecture (coming after VMworld, as I need some details from Tintri!)
  4. Part 4 covers the Rack Design.
  5. Part 5 explains the design considerations I had when thinking through the solution. This is actually critical. If you have a different consideration, you will end up with a different design.
  6. Part 6 covers the methodology I used.

Requirements

To frame the discussion, we need an example. The following table scopes the size and requirements:

[Table: scope — VM counts and requirements]

The SDDC will have to cater for both server VM and desktop VM. I call them VSI and VDI, respectively. To save cost, they share the same management cluster. The VDI farm has its own vCenter, as this allows Horizon to be upgraded independently of the server farm.

It has to have DR capability; I'm using VMware SRM and vSphere Replication. It has to support active-active applications too. I consider VDI as an application, and in this architecture it is active/active, hence DR is irrelevant for it. Because I already have Active-Active at the application layer, I do not see a need to cater for Disaster Avoidance.

The server farm is further broken into Service Tiers. It is common to see multiple tiers, so I'm using 3. Gold is the highest and best tier. Service Tier is critical; see this for more details. The 3 tiers of service are defined below:

[Table: definition of the 3 service tiers]

VMs are also split into different environments. For simplicity, I group them into Production and Non-Production. To make it easier to comply with audit and regulation, I am not mixing Production and Non-Production in the same vSphere cluster. Separate clusters also allow you to test SDDC changes in a less critical environment. The problem is that nasty issues typically happen in Production. Just because it has been running smoothly in Non-Production for years does not mean it will in Production.

For the VDI, I simply go with 500 VM per cluster, as 500 is an easy number to remember. In a specific customer environment, I normally refine the approach and use 10 ESXi hosts per cluster, as the number of VDI VMs varies depending on the user profile.
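
As a rough sketch of that host-count approach, the desktops-per-host densities below are illustrative assumptions; real figures should come from a VDI assessment.

```python
import math

# Illustrative desktop densities per ESXi host (assumptions, not benchmarks).
DESKTOPS_PER_HOST = {"task_worker": 120, "knowledge_worker": 80, "power_user": 50}

def vdi_clusters(total_desktops, profile, hosts_per_cluster=10, ha_spare_hosts=1):
    """Size VDI clusters at a fixed 10 ESXi hosts per cluster, reserving HA capacity."""
    usable_hosts = hosts_per_cluster - ha_spare_hosts
    desktops_per_cluster = usable_hosts * DESKTOPS_PER_HOST[profile]
    return math.ceil(total_desktops / desktops_per_cluster)

# Example: sizing 1000 desktops of knowledge workers.
print(vdi_clusters(1000, "knowledge_worker"))   # -> 2 clusters
```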

SDN is a key component of SDDC. NSX best practice is to have a dedicated cluster for the Edge. I'm including a small Edge cluster in each physical data center.

VMware best practice also recommends a separate cluster for management. I am extending this concept and calling it the IT + Management Cluster. It is not only for management, which is out of band; it is also for core or shared services that IT provides to the business. These services are typically in-band.

Overall Architecture

Based on all of the above requirements, scope, and considerations, here is what I'd propose:

It has 2 physical data centers, as I have to cater for DR and active-active applications. It has 2 vCenter servers for the same reason. Horizon has its own vCenter for flexibility and simplicity.

The physical Data Center 1 serves the bulk of the production workload. I'm not keen on splitting Production equally between the 2 data centers, as you need a lot of WAN bandwidth. As you can see in the diagram, I'm also providing active/active capability. The majority of traffic is East-West. I've seen customers with big pipes who still encounter latency issues even though the link is not saturated.

The other reason is to force the separation between Production and Non-Production. Migration to Production should be controlled. If they are in the same physical DC, it can be tempting to shortcut the process.

  1. Tier 1. I further split the Gold tier into 2. This enables mission-critical applications to have long-distance vMotion. Out of 100 VM, I allocate 25% for active/active applications and Disaster Avoidance (DA).
  2. Tier 2. I split it across the 2 physical data centers, as I need to meet the DR requirements. Unlike Tier 1, I no longer provide DA and active/active applications.
  3. Tier 3. I only make this available for Non-Production. An environment with just 300 Production VM is too small to have >2 tiers. In this example, I am actually providing 2+ tiers, as the Gold tier has the option for active/active applications and DA.

My sizing is based on the simple model of consolidation ratio. To me, these ratios are just guidance. For proper sizing, review this link. You may wanna get yourself a good cup of coffee as that's a 5-part series.
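
For illustration only, here is that naive consolidation-ratio model in a few lines. The ratios per tier, the VM counts and the HA spare are assumptions; proper sizing should follow the 5-part series linked above.

```python
import math

# Assumed VM-to-host consolidation ratios per tier (guidance only).
CONSOLIDATION_RATIO = {"gold": 10, "silver": 20, "bronze": 30}

def hosts_for_tier(vm_count, tier, ha_spare=1, min_hosts=4, max_hosts=12):
    """Naive sizing: hosts = VMs / ratio, plus an HA spare, kept within 4-12 nodes."""
    hosts = math.ceil(vm_count / CONSOLIDATION_RATIO[tier]) + ha_spare
    return max(min_hosts, min(hosts, max_hosts))

for tier, vms in [("gold", 100), ("silver", 200), ("bronze", 200)]:
    print(tier, hosts_for_tier(vms, tier))
```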

Let's now add the remaining components to make the diagram a bit more complete. Here is what it looks like after I add Management, DR and ESXi.

[Diagram: 500 server VM + 1000 VDI architecture with Management, DR and ESXi]

DR with SRM. We use SRM to fail over Tier 1 into Tier 3. During DR, we will freeze the Tier 3 VMs, so the Tier 1 VMs can run with no performance impact. I've made the cluster sizes identical to ensure no performance loss. Tier 2 fails over to Tier 2, so there is a 50% performance loss. I'm drawing the arrows one-way for both; in reality you can fail over in either direction.
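
A minimal sanity check of that failover pairing could look like the sketch below. The 8-host cluster sizes and the even capacity split are simplifying assumptions; SRM does not do this calculation for you.

```python
def failover_headroom(protected_hosts, recovery_hosts, freeze_recovery_vms):
    """Rough fraction of normal capacity the protected VMs get after failover.

    Tier 1 fails over to Tier 3, whose VMs are frozen, so the whole recovery
    cluster is available (1.0 = no performance loss). Tier 2 fails over to
    another Tier 2 cluster that keeps running, so capacity is shared.
    """
    usable_hosts = recovery_hosts if freeze_recovery_vms else recovery_hosts / 2
    return usable_hosts / protected_hosts

print(failover_headroom(8, 8, freeze_recovery_vms=True))    # 1.0 -> no loss
print(failover_headroom(8, 8, freeze_recovery_vms=False))   # 0.5 -> ~50% loss
```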

ESXi Sizing. I have 2 sizes: 2-socket and single-socket. The bigger square is 2 sockets, and the smaller one is 1 socket. Please review this for why I use single socket. I'm trying to keep the cluster size at 4 – 12 nodes, and I try not to have too many sizes. As you can see, I do have some small clusters, as there is simply not enough workload to justify more nodes.

We've completed the overall architecture for 500 server VM and 1000 VDI. Can we scale this to 2000 server VM and 5000 VDI with almost no re-architecting? The answer is yes. Here is the architecture. Notice how similar it is. This is why I wrote in the beginning that "the major building blocks will be similar". In this case, I've shown that they are in fact identical. My little girl asked me as I went back and forth between the 2 diagrams… "Daddy, why are you drawing the same thing two times?" 🙂

[Diagram: the same architecture scaled to 2000 server VM + 5000 VDI]

The only changes above are the ESXi sizes and cluster sizes. For example:

  • For the VDI, I have 5 clusters per site.
  • For the Tier 1 server VM, I have 3 clusters. Each has 8 ESXi hosts. I keep them all at 8 to make it simpler.
  • For the Tier 3 server VM, I have 2 clusters. Each has 12 ESXi hosts. Total is 24 hosts, so it’s enough to run all Tier 1 during DC-wide failover.

By now, you have likely noticed that I have omitted 2 large components of the SDDC Architecture. Can you guess?

Yup, it’s Storage and Network.

I will touch on Storage here, and cover Network in a separate blog. I'm simplifying the diagram so we can focus on the storage subsystem:

[Diagram: simplified view of the storage subsystem]

I'm using 2 types of storage, although we could very well use VSAN all the way. I use VSAN for the VDI and IaaS clusters (Management and Network Edge), and a classic array for the Server clusters (Tier 1, 2, and 3).

I've added the vSphere integrations that storage arrays typically have. All these integrations need specific firmware levels, and they also impact the way you architect, size and configure the array. vSphere is not simply a workload that needs a bunch of LUNs.

I've never seen an IT environment where the ground team is not stretched. The reality of IT support is being under-staffed, under-trained, lacking proper tools and bogged down by process and politics. There are often more managers than individual contributors.

As you can see from this article, the whole thing becomes very complex. Making the architecture simple pays back in Operations. It is indeed not a simple matter. This is why I believe the hypervisor is not a commodity at all. It is your very data center. If you think adding Hyper-V is a simple thing, I suggest you review this. That's written by someone with actual production experience, not a consultant who leaves after the project is over.

As Architects, we all know that it is one thing to build, and another to operate. The above architecture requires a very different operation from a classic, physical DC architecture. SDDC is not a physical DC, virtualised. It needs a special team, led by the SDDC Architect.

In the above architecture, I see adding a second hypervisor as "penny wise, pound foolish". If you think that results in vendor lock-in, kindly review this and share your analysis.

Limitations

  • Not able to do Disaster Avoidance. The main reason is that I think it increases cost and complexity with minimal additional benefit. Critical applications are already protected with Active/Active at the application layer, making DR and DA redundant. For the rest, there is already SRM.

BTW, if you want the editable diagram, you can get it here. Happy architecting! In the next post, I'll cover the Network architecture.

“Hypervisor is a commodity”: A deeper analysis

"The hypervisor is a commodity" is a common saying I read on the Internet. Just google it, and you will find many articles on it. It's interesting to see that articles like this have appeared regularly over the past several years, as if there is a force behind them. The thinking still persists into the SDDC era, as I found an article dated as recently as May 2015. I have read quite a number of them. In general, their thinking is:

  • The majority of hypervisors are good enough. While VMware has the lead, the common and key capabilities of a hypervisor are available on all of them. In these core capabilities they are more or less the same, so it does not matter which one you use.
  • Distributed applications, the kind of applications that can scale horizontally, are best served inside a container and managed via OpenStack. Hence, it does not matter which underlying hypervisor you use.

Just like there are those who think the hypervisor is a commodity, there are also those who think it is not. I'll quote some of them here, as they say it better than I can. Plus, they said it before I did, so it's only appropriate to quote and refer to them.

My colleague Massimo Re Ferre actually wrote an article on this topic 5 years ago. Please read it before you read this. It gives you a 5-year perspective 🙂

Done reading? Good, let's dive in. The reason IT Pros disagree is that there are multiple dimensions to "commodity". I will cover 4 here.


Financial perspective

From this viewpoint, I agree that it is a commodity. The price has reached the level of commodity hardware. The hypervisor itself is now free; you can get ESXi for free. This actually contributes to the misconception that the hypervisor is a commodity.

Money is certainly a powerful factor. When something is expensive, it cannot be a commodity. Take the family car for example. Whether it's Toyota, Nissan, Hyundai or Ford, they are all about the same. But a typical 1.5L family sedan in Singapore costs US$100K for 10 years. Yes, your eyesight is still good. No, petrol and parking are not included. Does anyone in Singapore think a car is a commodity? You gotta be kidding me! 🙂

Technology and Architecture perspective

By technology, I mean the ESXi. By architecture, I mean the entire VMware SDDC stack. I’m grouping Technology and Architecture into 1 here, because they are one for almost all customers.

J. Peter Bruzzese wrote an article at Infoworld on 7 August 2013:

While the hypervisors may be “equal” for the most part, I agree with Davis that the choice you make dictates all the other aspects of your virtualized environment. It certainly has a domino effect.

Rather than saying “See, it’s all the same now” and dismissing the hypervisor as a commodity, it’s better to step back and view the whole picture, including financial and ecosystem, to make the best choice for your environment

The key reason why the hypervisor is far from a commodity is that the majority of customers do not deploy just ESXi. Far from it. They deploy vCenter. A lot of them add vRealize to help them manage. Some add SRM as they need to do DR. A lot use Horizon View. And just in the past 2 years, many leading customers began adding NSX and VSAN. So what they end up deploying is a proprietary, unique set of products. A lot of these products do not run on Hyper-V or KVM. Even if they do, they run best on ESXi. A customer told me that VMware is like the Apple of the Enterprise. Apple pitches an integrated stack for your personal IT, while VMware pitches an integrated stack for your enterprise IT.

A comparison to this ESXi and SDDC stack is the OS. You can say that the kernels of Windows and RHEL have more or less similar capability. They all do the base kernel job well. But you don't just deploy the NT kernel or the Linux kernel. You deploy the whole OS, because both Microsoft and Red Hat have created an integrated stack. Once you choose an OS for your application, you do not migrate it to another OS. You live with that decision until the apps are rewritten, because it's hard to get out.

At a personal level, I have not been selling vSphere nor competing with Hyper-V for a good number of years now.

I hope you have read Massimo's blog above. I'm just quoting a portion here:

I’d define a commodity technology as something that had reached a “plateau of innovation” where there is very little to differentiate from comparable competitor technologies. This pattern typically drives prices down and adoption up (in a virtuous cycle) because users focus more on costs rather than on technology differentiation. The PC industry is a good example of this pattern.

Now, you know he said that many blue moons ago. It's amazing that in 5 long years, the differences between VMware, Microsoft and Red Hat have actually gotten wider! If they were becoming a commodity, they should be becoming more alike. We did not have NSX and VSAN five years ago! Because they are becoming more different, you need to choose carefully, because you may end up where you do not want to end up.

Here is another good thought. Eric Siebert said it well here:

To me, the hypervisor is not a commodity at all. For starters, the implementation and features of each hypervisor are very different. The hypervisor is much more than an enabler for virtualization. It has deep integration with many other components in the virtual environment, and each hypervisor is unique. If the hypervisor was a commodity, you would be able to run virtual machines (VMs) across any hypervisor without any effort, which you cannot do now (without converting a VM to a specific format). At some point the hypervisor may evolve into more of a commodity, but with the lack of standards and architectural differences, it’s not today.

What’s interesting is he said that in Feb 2011. Far from evolving into a commodity, VMware has managed to integrate a differentiating stack and led the industry on SDDC. Who would have thought of Software-Defined Storage and Software-Defined Network many blue moons ago?

Operational perspective

Bob Plankers wrote an article on TechTarget in March 2015:

So is the hypervisor a commodity? I don’t think so. A hypervisor isn’t easily interchangeable, and it more closely represents a collection of services rather than a single monolithic product. You pay service and support on it throughout its lifetime, either directly to a vendor or in the form of payroll for support staff. It may be a primary product but it doesn’t count as a raw material.

That's right. It is not easily interchangeable. Downtime is required. Make that massive downtime if you are highly virtualized. How easy is it to get downtime for production VMs nowadays? Expectations on uptime are rising, so getting downtime will only get harder as the years go by.

Hans De Leenheer wrote on his blog on 23 April 2014:

A commodity is a service/product where the choice of manufacturer/vendor is irrelevant and interchangeable without any impact on the consumarization. In technology we can plug a server in 220V power off the grid the same way as we can use 220V battery backed power. In virtualization the hypervisor can run on any x86 server, whether it comes from HP, DELL, Lenovo, SuperMicro, Quanta, Cisco, … and the running VM is completely interchangeable without any impact.

If choosing something will result in a lock-in, meaning it is costly and complex to get out, would you choose it carefully? I know I would.

I recently got to know a customer who wants to run VMware in Production and Microsoft in Non-Production. The underlying thinking is that the hypervisor is a commodity, so you can migrate between Microsoft and VMware easily. Hans explains why it is not that simple:

Moving a VM from one hypervisor to another will still need a migration. There are at least 3 reasons for this;

  • the VM config file is proprietary to the hypervisor
  • the VM disk format is proprietary to the hypervisor
  • the VM guest drivers are proprietary to the hypervisor

How many of you actually need to run multi-hypervisor environments in which VMs need to be interchangeable? I know there are (every day better) migration tools to get you from one platform to another but how much do you want this? And how many times?

Having VMware in Production and Microsoft in Non-Production is like having FC storage in Production and NFS in Non-Production. Sure, you save money. But how would you test your production storage upgrade?

In my personal opinion, having multiple hypervisors is penny wise, pound foolish. Having multiple hypervisors is like having multiple email systems, multiple Message Buses, multiple Directories, multiple CMDBs, multiple intranets, etc. The list goes on: phone systems, help desk systems, VDI systems. There are certain things you need to standardise on. If you are worried about vendor lock-in or the need to control a vendor, there are other levers you can use. Sacrificing your own operations and blood pressure is the dumb way of achieving it 🙂

Skills perspective

In my guesstimate, there are probably >10 million IT professionals who know VMware vSphere. By "know", I do not mean knowing at the PowerPoint level. Beyond talking, these IT professionals can install vSphere. Around 200 thousand are VCPs. For every VCP, there are probably 10x more people who are actually at VCP level but did not take up the certification.

The number drops drastically when it comes to VCAP, VCIX and VCDX. Out of that many people, less than 0.002% have achieved the level of VCDX. Let's make an ultra-conservative assumption that for every VCDX, there are 10 others who are actually at VCDX level. We are still talking 0.02%. A staggering 99.98% do not have that level of expertise, including yours truly. I've been a VMware Engineer for 7+ years, was one of the first in the world to pass VCAP-DCD (no. 089), and yet I won't even pass the VCAP-DCA exam, let alone the VCDX defence. While my customers and management see me as an SME and expert, the reality is there are more things in vSphere that I don't know than I know. In other words, my knowledge is not even 50%. And that is just vSphere. I have not included other products that come with vSphere Enterprise Plus or vSphere with Operations Management (vSOM). These products are "free", as in they are bundled with vSphere. Products such as vRealize Orchestrator, vSphere Replication and vCenter Data Protection should be considered a part of the hypervisor.
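
For the curious, here is where those percentages come from; the population figures are just my guesstimates from the paragraph above.

```python
professionals = 10_000_000          # guesstimate: people who can actually install vSphere
vcdx_holders = 200                  # implied by the "<0.002%" figure above
at_vcdx_level = vcdx_holders * 10   # the ultra-conservative 10x assumption

print(f"{vcdx_holders / professionals:.3%} hold the VCDX")        # ~0.002%
print(f"{at_vcdx_level / professionals:.2%} are at VCDX level")   # ~0.02%
print(f"{1 - at_vcdx_level / professionals:.2%} are not")         # ~99.98%
```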

Beyond vSphere Enterprise Plus, I know that many VMware professionals have branched out to pick up Storage, Networking, and Management. They are becoming SDDC Architects, and I wrote an article on the rise of the SDDC Architect.

As an Architect and Engineer, I think it is unrealistic to have good knowledge of multiple hypervisors. The person will end up a jack of all trades, master of none. If I were a CIO, I would certainly expect my Lead Architect to have sufficiently deep knowledge, else he may deliver the wrong architecture (and that can be very costly to undo later on). I also expect my principal engineers to have deep knowledge, as $h!t happens in production (my customer told me: it always happens in production), and I want it solved fast.

Conclusion

Instead of evolving into a commodity, the hypervisor is actually evolving into a proprietary SDDC. It has broadened beyond server virtualisation and turned the Storage, Network and Infra Management industries upside down. It's redefining the very architecture of these industries in software.

Choose your hypervisor carefully, as it determines your entire SDDC stack.