Recently, I published a sample architecture for an SDDC that is based on VMware technology. The post resonated well with readers (thank you!). The sample architecture is probably not what you expected, so I will share the considerations I had when thinking through the architecture.
A vSphere-based SDDC is very different from a physical Data Center. I covered that in depth in my book, so here I will just highlight the points relevant to this post:
- It breaks best practices, as virtualisation is a disruptive technology that changes the paradigm. Do not apply physical-world paradigms to the virtual world. Many “best practices” in the physical world exist because of physical-world limitations. Once the limitation is removed, the best practice becomes a dated practice.
- Take advantage of emerging technology to break away from constraints. Virtualization is innovating rapidly. Best practice means proven practice, and that might mean outdated practice in this rapidly changing landscape.
- Consider unrequested requirements, as the business expects the cloud to be agile. You have experienced VM sprawl, right? 🙂 Expect to have to adapt to changing requirements.
In the 2+ decades I have worked in IT, I have been fortunate to learn from and observe many great Architects and Engineers. I notice there is an element of style; each architect does things a little differently. There are also principles that they adopt. One of my favourites is the KISS principle. Besides this, here is another one I hold dearly:
Do not architect something you are not prepared to troubleshoot.
If you will not be staying post-implementation, think of the support team. A good IT Architect does not set up potential risks for the support person down the line. I also tend to keep things simple and modular. Cost will certainly not be optimal, but it is worth the benefits. You are right, I’m applying the 80/20 rule.
Having said all the above, what do I consider in a vSphere-based architecture?
Why do I consider so many things? Because Enterprise IT is complex. I’m not here to paint it as simple. Chuck Hollis explains it very well here, so I won’t repeat it.
Upgradability
- This is unique to the virtual world; it is not something you consider in the physical world. A key aspect of the SDDC that we don’t normally discuss is how to upgrade the SDDC itself. Upgrading the entire SDDC can be like renovating your home while living in it.
- When considering an upgrade, think beyond the next version. Generally speaking, it is safe to assume there is an upgrade path from the current version to the next one. But do you always follow the latest? Normally you would skip versions, as an upgrade is an expensive operation to perform. If you upgrade every 3 years, you might be 3 versions behind. The upgrade path may be more complex than you assume.
- Upgradability is no 1 on my list because an SDDC consists of multiple products from multiple vendors. You need an approach for how you will upgrade your SDDC, as it is unique to yours.
- Your architecture will likely span 3 years, and it will be operational for possibly many years beyond that. So check with your vendors for an NDA roadmap presentation. While you know a roadmap is not a guarantee, at least you know you’re not implementing something your vendor has no intention of improving.
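As a toy illustration of why multi-hop upgrade paths deserve planning, here is a sketch that walks a hypothetical upgrade-compatibility map with a breadth-first search. The version numbers and allowed hops below are made up for the example; the real paths must come from the vendor’s interoperability matrix.

```python
from collections import deque

# Hypothetical upgrade matrix: each version maps to the versions it can
# upgrade to directly. Not real vendor data -- check the interop guide.
UPGRADE_PATHS = {
    "5.0": ["5.1", "5.5"],
    "5.1": ["5.5"],
    "5.5": ["6.0"],
    "6.0": ["6.5"],
}

def upgrade_path(current, target):
    """Return the shortest chain of upgrades from current to target,
    or None if no supported path exists."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in UPGRADE_PATHS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(upgrade_path("5.0", "6.5"))  # ['5.0', '5.5', '6.0', '6.5']
```

If you are 3 versions behind, the chain above is the kind of multi-step plan (and multi-weekend change window) you are signing up for.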
Debugability
- How easy is it to troubleshoot the SDDC you architected?
- Troubleshooting in a virtual environment is harder than in a physical one, as boundaries are blurred and physical resources are shared. So troubleshooting tools are a critical component of the SDDC.
- For troubleshooting, log files are a rich source of information. I rarely see customers with a proper Log Management Platform (LMP). I shared the criteria in this blog so you can benchmark yours.
- There are 3 types of troubleshooting:
- Configuration. Generally this manifests as something becoming broken. It used to work normally, and then it stops working as expected. The symptom and the root cause can be unrelated.
- Stability. Stability means something hangs, crashes (BSOD, PSOD, etc.) or gets corrupted. This is typically due to a bug or an incompatibility.
- Performance. This can be really hard to solve if the slow performance is short-lived and the system performs well most of the time. You may need to create additional vCenter alarms to catch these infrequent performance issues.
- For Tier 1 workloads, I’d add extra server and storage capacity. This means we can isolate the problematic VM while performing joint troubleshooting with the App team. The extra hardware is not wasted, as for Tier 1 I normally specify N+2 as the Availability Policy.
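To make the point about short-lived performance problems concrete, here is a minimal sketch that flags brief latency spikes which a coarse average would hide. This is not a vCenter API call; the samples and the 20 ms threshold are hypothetical, standing in for whatever counter your monitoring tool exports.

```python
def find_spikes(samples, threshold, min_consecutive=1):
    """Return (start_index, run_length) for each run of consecutive
    samples above threshold. Short runs are exactly the intermittent
    issues that an hourly average hides."""
    spikes, run_start = [], None
    for i, value in enumerate(samples):
        if value > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_consecutive:
                spikes.append((run_start, i - run_start))
            run_start = None
    # Close out a spike that runs to the end of the sample window.
    if run_start is not None and len(samples) - run_start >= min_consecutive:
        spikes.append((run_start, len(samples) - run_start))
    return spikes

# Hypothetical disk-latency samples (ms); average is fine, spikes are not.
readings = [4, 5, 6, 45, 50, 5, 4, 6, 30, 5]
print(find_spikes(readings, threshold=20))  # [(3, 2), (8, 1)]
```

An alarm built on this idea (trigger on any sample above threshold, not on the average) is what catches the 2-minute slowdown the business complains about.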
Manageability and Supportability
- This is related to, but not the same as, Debugability.
- This relates to things that make day-to-day management easier: monitoring counters, reading logs, setting up alerts, big-screen projectors, etc. Make it easy for the front-line help desk team to support your SDDC.
- Move toward “1 VM, 1 OS, 1 App”. In a physical data center, some physical servers serve multiple purposes. In a virtual one, we can afford to run 1 App per VM, and should do so.
- A good design makes it harder for the Support team to make human errors. Virtualisation makes tasks easy, sometimes far too easy relative to the physical world. Consider these operational and psychological changes in your design. For example, I separate Production and Non-Production into separate clusters. I also put Non-Production into a separate physical Data Center, so promotion to Production is a deliberate effort.
- Supportability also means using components that are supported by the vendors. This should be obvious, as we should not deploy unsupported configurations.
Cost
- I hope you are not surprised that this appears as no 4 on the list, not no 1. I’m mindful of keeping cost low, as you can see in the choice of hardware and the removal of certain features. Cost was a key factor for not having Disaster Avoidance and Active/Active at the Infrastructure layer.
- The Secondary Site serves 3 purposes to reduce cost:
- Running Non Production and other workloads
- A test environment for the SDDC itself.
- VMs from different Business Units are mixed in 1 cluster to avoid provisioning extra clusters. If they can share the same LAN and SAN, there is no technical reason why they cannot share the same hypervisor. So the physical Network, Storage and Servers are all shared.
- Windows and Linux are mixed in 1 physical cluster. If you have a large number of RHEL OS instances, separate them into a dedicated cluster to maximise your OS licenses.
- If you have a large number of Oracle or MS SQL instances, or other software that is licensed per physical CPU, dedicating a cluster can result in savings. This is especially true when the software costs a lot more than the IaaS.
- Business Units cannot buy an entire cluster, as this changes the business model from IaaS to Hosting. That increases complexity and prevents cost optimisation.
- DMZ and non-DMZ are mixed in 1 cluster to avoid provisioning a dedicated cluster for the DMZ. I am using VMware NSX to achieve the isolation.
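The per-physical-CPU licensing point can be shown with back-of-envelope arithmetic. The prices and host counts below are hypothetical, not vendor list prices; the principle is that every CPU in every host that may run the software must be licensed, so pinning the software to a small dedicated cluster shrinks the licensed footprint.

```python
def license_cost(hosts_running_sw, cpus_per_host, price_per_cpu):
    """Per-physical-CPU licensing: every CPU in every host that may
    run the software must be licensed."""
    return hosts_running_sw * cpus_per_host * price_per_cpu

PRICE_PER_CPU = 47_500   # hypothetical per-CPU price
CPUS_PER_HOST = 2

# Spread across a 10-host mixed cluster: all 10 hosts need licenses.
mixed = license_cost(10, CPUS_PER_HOST, PRICE_PER_CPU)
# Pinned to a dedicated 3-host cluster: only 3 hosts need licenses.
dedicated = license_cost(3, CPUS_PER_HOST, PRICE_PER_CPU)
print(mixed, dedicated, mixed - dedicated)  # 950000 285000 665000
```

With numbers like these, the dedicated cluster pays for its own hardware out of the license savings, which is why the rule “mix everything” has this one exception.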
Testability
- Software has bugs. Hardware has faults. We cater for hardware faults by having redundant hardware. What about software bugs? How do you cater for those when the entire data center is now defined in software? 🙂
- Because the key component is software, you can do a fair amount of testing on virtual ESXi. So I’d build an identical stack in the Management Cluster. The reason for choosing the Management Cluster instead of the Non-Production cluster is to keep the separation between IT and Business clean and clear.
- One key reason for not having active/active infrastructure is to enable testing of the SDDC at the passive Data Center. When both vSphere sites are active/active, each serving 50% of the production workload, it becomes difficult to test/patch/update vSphere. You don’t have a “test environment”, as both vSphere instances are live.
Reliability
- It is related to availability, but it is not the same. Availability is normally achieved through redundancy. Reliability is normally achieved by keeping things simple, using proven components, separating things, and standardising. Standardisation extends beyond technical components; you can and should standardise your processes, chargeback model, etc.
- One area that customers tend to standardise that I no longer believe in is VM size. I used to advocate standard sizes (small, medium, large) where each size is fixed. I learned from customers that having different sizes does not make operations more complex. So I’d allow “odd size” VMs, such as 3 vCPU, to optimise performance and minimise cost.
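A quick sketch of why fixed T-shirt sizes waste capacity compared to “odd size” VMs. The size tiers and the requested vCPU counts below are hypothetical; the point is that rounding every request up to the nearest standard size adds up.

```python
# Hypothetical T-shirt sizes: small=2, medium=4, large=8 vCPU.
STANDARD_SIZES = [2, 4, 8]

def tshirt_size(needed):
    """Round a requested vCPU count up to the nearest standard size
    (capped at the largest size for simplicity)."""
    for size in STANDARD_SIZES:
        if needed <= size:
            return size
    return STANDARD_SIZES[-1]

# Requested vCPU counts from hypothetical app sizing exercises.
requests = [1, 3, 3, 5, 6, 7]
standard_total = sum(tshirt_size(r) for r in requests)
exact_total = sum(requests)
print(standard_total, exact_total, standard_total - exact_total)  # 34 25 9
```

Nine vCPUs of padding across just six VMs: multiply that by VM sprawl and the “operational simplicity” of fixed sizes starts looking expensive.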
Performance
- This should be obvious, so I just want to highlight that there are 2 dimensions to performance.
- How fast can we do 1 transaction? Latency, clock speed, CPU cache size and SSD quality matter here. One reason I prefer CPUs with the largest cache size is the performance of a single transaction.
- How many transactions can we do within the SLA? Throughput and scalability matter here.
- Include headroom. Almost all companies end up needing more servers than planned, especially in non-production. So when virtualisation happens, we get VM sprawl. As such, the design should have headroom. As one CIO told me, “You need to be ahead of the business.”
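The two dimensions are connected by Little’s law: at a fixed level of concurrency, throughput equals concurrency divided by per-transaction latency. A minimal illustration (the concurrency and latency figures are hypothetical):

```python
def max_throughput(concurrency, latency_s):
    """Little's law rearranged: throughput = concurrency / latency.
    concurrency = transactions in flight, latency_s = seconds each takes."""
    return concurrency / latency_s

# 32 transactions in flight at 50 ms each:
print(max_throughput(32, 0.050))  # 640.0 transactions/second
# Halving latency doubles throughput at the same concurrency:
print(max_throughput(32, 0.025))  # 1280.0 transactions/second
```

This is why single-transaction speed (dimension 1) is not a separate concern from SLA throughput (dimension 2): improve latency and you buy throughput for free, without adding hardware.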
Existing Environment and People
- Brownfield is certainly a major consideration. We walk from where we stand. There are 2 elements to the brownfield:
- A good CIO considers the skills of his IT team, as when the vendors leave, his team has to support it. Nowadays, most IT departments complement their staff with resident engineers from vendors. So the skills include both internal and external (a preferred vendor who complements the IT team).
- In an SDDC, it is impossible to be an expert in all areas. I’m sure you have heard the saying “jack of all trades, master of none”. Consider complementing the internal team by establishing a long-term partnership with an IT vendor. A pure vendor/vendee relationship saves cost initially, but in the long run there is a cost. You can negotiate hard with the vendor, but do not antagonise the humans representing the vendor. There is a level of support above the highest level the vendor provides. That level is called friendship.
- How does the new component fit into the existing environment? E.g. adding a new Brand A server into a data center full of Brand B servers needs to take into account management and compatibility with common components.
Can you spot a missing consideration? Something we should always consider, which I have not listed.
You are right, it is Security. In some regulated industries, you need to include Compliance as well. This is a big topic, worthy of a blog post by itself 🙂 One thing I want to quickly share here: physical isolation is not good enough if it is the only solution. You need to complement it with logical isolation, just in case the physical isolation is bridged (intentionally or unintentionally).