What you architect is an SDDC. What you hand over to the CIO as a business result is IaaS. We can judge whether the architecture is good by its actual result in production. Does it result in fire-fighting and blame-storming, or do you have peaceful operations?
The litmus test below helps you assess the maturity of your IaaS.
Do your customers blame your infrastructure?
- If the answer is yes, take a step back and ask yourself why. There is a high chance you rely on complaints in your operations, which means you actually encourage them. No complaint, no problem: Complaint-based Operations.
- The reason you rely on complaints is that you have no other means. You have not defined the performance of your IaaS.
- A sign of mature operations is that you have a Performance SLA. It is per-VM and measured every 5 minutes.
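To make the per-VM, 5-minute idea concrete, here is a minimal sketch of how such an SLA could be scored. The metric names and limits are assumptions for illustration only, not values from any particular product; substitute whatever your monitoring tool actually collects.

```python
from dataclasses import dataclass

# Hypothetical per-VM limits; real values depend on your environment and service tier.
THRESHOLDS = {
    "cpu_ready_pct": 2.5,
    "ram_contention_pct": 1.0,
    "disk_latency_ms": 10.0,
}


@dataclass
class Sample:
    """One 5-minute reading for a single VM (metric names are assumptions)."""
    cpu_ready_pct: float
    ram_contention_pct: float
    disk_latency_ms: float


def sample_meets_sla(sample: Sample) -> bool:
    """True if every metric in this 5-minute sample is within its limit."""
    return all(getattr(sample, name) <= limit for name, limit in THRESHOLDS.items())


def sla_compliance_pct(samples: list[Sample]) -> float:
    """Percentage of 5-minute samples in which the VM was served within limits."""
    if not samples:
        raise ValueError("at least one sample is required")
    served_well = sum(1 for s in samples if sample_meets_sla(s))
    return 100.0 * served_well / len(samples)


if __name__ == "__main__":
    # A day has 288 five-minute samples; here 3 of them breach the CPU limit.
    day = [Sample(1.0, 0.0, 4.0)] * 285 + [Sample(6.0, 0.0, 4.0)] * 3
    print(f"{sla_compliance_pct(day):.2f}% of samples met the SLA")
```

With a number like this per VM, you can see which VMs are not being served well before their owners complain.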
Is your IaaS cheaper than both VMware on Amazon and Amazon?
- If not, your CIO may question your business value. The reason for having an in-house architect is to deliver a lower cost, even after taking your salary into account.
Does Help Desk provide a good first line of defence?
- If Help Desk simply passes every issue through to the next level, you need to look at why.
- Help Desk is your first line of defence. They are not as technical as you are. Equip them with a simple dashboard so they can handle a VM Owner's complaint:
  - Is the problem caused by IaaS not serving the VM well?
  - If yes, which part of the infrastructure: CPU, RAM, Disk, or Network?
  - If not, how do you prove it convincingly?
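The three Help Desk questions above can be sketched as a small triage function. This is only an illustration: the `LIMITS` table, its metric names, and the limits are hypothetical, and a real dashboard would read them from your monitoring platform.

```python
# Hypothetical mapping: infrastructure component -> (metric name, limit).
LIMITS = {
    "CPU": ("cpu_ready_pct", 2.5),
    "RAM": ("ram_contention_pct", 1.0),
    "Disk": ("disk_latency_ms", 10.0),
    "Network": ("net_dropped_packets_pct", 0.1),
}


def triage(vm_metrics: dict[str, float]) -> dict:
    """Answer: is IaaS failing this VM, and if so, in which component?"""
    breached = [
        component
        for component, (metric, limit) in LIMITS.items()
        if vm_metrics[metric] > limit
    ]
    # An empty list is the evidence that the infrastructure served the VM well.
    return {"caused_by_iaas": bool(breached), "components": breached}


if __name__ == "__main__":
    complaint = {
        "cpu_ready_pct": 1.0,
        "ram_contention_pct": 0.0,
        "disk_latency_ms": 25.0,
        "net_dropped_packets_pct": 0.0,
    }
    print(triage(complaint))
```

The point of the design is that Help Desk gets a yes/no answer plus the component, without needing to interpret raw counters.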
Can you justify new infrastructure when utilization is not yet high?
- This does not refer to the additional money that comes with a new project. It refers to existing clusters and storage.
- Capacity is measured on both utilization and performance. A cluster’s capacity is full if it can’t serve its VMs well. Since it takes time to buy hardware, you need an early warning to detect this performance degradation.
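As a rough sketch of that early warning, the function below treats a cluster as full when either utilization or performance says so. The function name, its parameters, and all default limits are assumptions chosen for illustration.

```python
def capacity_status(
    utilization_pct: float,
    vms_not_served_well_pct: float,
    util_limit: float = 80.0,       # hypothetical utilization ceiling
    warn_limit: float = 5.0,        # hypothetical: % of VMs breaching before we warn
    full_limit: float = 10.0,       # hypothetical: % of VMs breaching before we call it full
) -> str:
    """Classify cluster capacity using both utilization and performance."""
    if utilization_pct >= util_limit or vms_not_served_well_pct >= full_limit:
        return "full"
    if vms_not_served_well_pct >= warn_limit:
        # Performance is degrading while utilization still looks fine:
        # this is the window in which to start the hardware purchase.
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(capacity_status(utilization_pct=50.0, vms_not_served_well_pct=6.0))
```

A cluster at 50% utilization can still return "warning" or "full" here, which is the evidence you would bring when asking for hardware before utilization is high.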
Do you struggle with many over-provisioned VMs?
- This is an indicator that you’re operating as a System Builder as opposed to a Service Provider.
- As a System Builder, you’re meddling with each System (read: Application). You size them, and argue with the application team.
- As a Service Provider, you’re not “in the way”. IT simply uses an effective pricing model to drive the right behaviour. Does AWS block you from buying a 40-CPU EC2 VM when you only need 2 CPUs?
Does Troubleshooting mean all hands on deck?
- Do you have a process that is followed by all teams (network, storage, server, OS, application)? Does that process end with Root Cause Analysis?
- As part of the RCA, do you set up an alert so the issue can be detected faster if it happens again?
There are more questions, but I thought we would start with these. If you want to see the details, download this.