
VCDX, meet VCOX

Thanks for the positive feedback on the articles The Rise and Fall of Infrastructure Architect and Purpose-driven Architecture. Do read them first as this post builds from there.

I see Architecture and Operations as two equally large realms. While we certainly consider Operations when designing, it is not a part of Architecture. They complete each other, like Yin and Yang. They impact each other, like night and day. While I subscribe to the school of thought that the same person can be good in both Architecture and Operations, I have yet to meet such a person. I’m not a VCDX, so I will focus on VCOX. Inspired by VCDX, I created this term to acknowledge the size of this world. To be 100% clear, VCOX is just a term I created. It has nothing to do with VCDX.

Architecture is Day 1, Operations is Day 2. Day 2 impacts Day 0, which is Planning. Why?

Because we begin with the end in mind. The End State drives your Plan. Your Plan drives your Architecture. So it’s 2 –> 0 –> 1, not 0, 1, 2.

I’ll use an example to illustrate how Day 2 impacts Day 1. Say you are an internal cloud provider, and you plan to charge per VM (e.g. $1 per vCPU per month). You plan to have 2 classes of offerings:

  • Gold: suitable for production workload. Performance optimized.
  • Silver: suitable for non-production workload. Cost optimized.

For Gold, you don’t overcommit CPU and RAM. Now, if 1 vCPU typically uses 4 GB of RAM, then a 40-core ESXi host will only need 160 GB. If you buy 1 TB of RAM, then you won’t be able to sell 864 GB of it, as you have no vCPU left to sell. This means your hardware spec is impacted.
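The arithmetic above can be sketched in a few lines. This is only an illustration: the function name is made up, and it assumes one vCPU is sold per physical core (no overcommit, hyper-threading ignored) and that 1 TB means 1024 GB.

```python
def stranded_ram_gb(cores, ram_per_vcpu_gb, host_ram_gb):
    """RAM that cannot be sold once every core has been sold as one vCPU.

    Assumes no CPU or RAM overcommit, as in the Gold offering.
    """
    sellable_ram_gb = cores * ram_per_vcpu_gb
    return max(host_ram_gb - sellable_ram_gb, 0)


# Figures from the Gold example: 40 cores, 4 GB per vCPU, 1 TB (1024 GB) host.
print(stranded_ram_gb(40, 4, 1024))  # 864
```

Running the same calculation against a few candidate host configurations is a quick way to see which hardware spec leaves the least unsellable RAM.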

You also promise the concept of Availability Zone. In the event of cluster failure, you cap the number of VMs affected. If you cap it at, say, 200 production VMs, then your cluster size cannot be too big.
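As a rough sketch of how that cap limits cluster size: the 200-VM cap comes from the example, while the figure of 10 VMs per host is purely an assumption for illustration (for instance, a 40-core host running 4-vCPU Gold VMs).

```python
def max_hosts_per_cluster(vm_cap, vms_per_host):
    """Largest number of hosts a cluster can have without exceeding the VM cap."""
    return vm_cap // vms_per_host


# 200 is the cap from the example; 10 VMs per host is an assumed density.
print(max_hosts_per_cluster(200, 10))  # 20
```

The denser you pack each host, the fewer hosts the cluster can hold before a cluster failure would breach the cap.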

In your service offering, you include the ability for a customer to check her own VM health, and how her VM is served by the underlying platform. This means your architecture needs to know how to associate tenants with their VMs.

Your CIO wants live information projected for his peers, showing how IT is serving the business. This requires you to think about the KPIs. How do you know NSX is performing fast enough for its consumers?

I hope the above examples show that Day 2 is where you want to start.

Operations covers the following pillars of management (planning, monitoring, troubleshooting):

  • Budget: Costing, Pricing and the business of IT
  • Capacity: it’s highly related to cost. Insufficient budget –> Overcommit –> Capacity Management.
  • Performance: focus on proactive and early warning. Performance SLA
  • Availability
  • Configuration: drift management
  • Compliance: security compliance, internal audit compliance
  • Inventory: license, hardware
  • Management Reporting

Notice a big pillar is missing above?

Yes, I did not cover Automation. IMHO, that’s part of Architecture. You should not automate what you cannot operate. So I see automation as not part of operations. Automation is a feature of your Architecture. It’s like an automatic car. That’s a feature of the car. How you operate the car so passengers arrive at the destination on time, that’s operations.

VCOX answers questions such as:

  • Prove that the IaaS is cheaper than a comparable IaaS. If it’s VMware SDDC, then prove that it’s cheaper than VMware on AWS. If it’s not, then the business case is weakened from the CFO’s viewpoint.
  • Prove that Actual meets Plan. The architecture is built for a purpose. Quantify that purpose, and prove that it’s met.
  • The architecture carries a set of KPIs. This enables its performance to be monitored. What are these KPIs? For each metric, what are the thresholds?
  • Is Operations performing? Signs of poor operations are lots of alerts, fire-fighting, blamestorming, and hectic, intense days. The team is under stress as they struggle to operate the architecture.

From my interactions with customers, I notice that the Architect is not leading Day 0. Architects provide input to the Planning stage, but they are not the ones driving it. The Architect tends to focus on the technical side, something that the CFO and CIO value less (hence they spend less time on it).

That’s my observation from travelling 200 days a year meeting customers, partners and internal teams. I hope it’s useful to you. Let me know what you think, as we are in the early days of VCOX.

A test of your IaaS Operations maturity

What you architect is SDDC. What you hand over as a business result to the CIO is IaaS. We can assess whether the architecture is good or not, based on the actual result in production. Does it result in fire-fighting and blamestorming? Or do you have peaceful operations?

The litmus test below helps you assess the maturity of your IaaS.

Do your customers blame your infrastructure?

  • If the answer is yes, take a step back and ask yourself why. There is a high chance you’re relying on complaints in your operations. So you actually encourage them. No complaint, no problem. That is Complaint-based Operations.
  • The reason why you rely on complaints is that you don’t have other means. You have not defined the performance of your IaaS.
  • A sign of mature operations is that you have a Performance SLA. It is per-VM, measured every 5 minutes.
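A per-VM Performance SLA measured every 5 minutes could be evaluated along the lines of the sketch below. Everything in it is an assumption for illustration: the VM names, the sample values, the idea that the metric is a contention-style counter where lower is better, and the threshold of 5.0. It is not tied to any particular monitoring product.

```python
def sla_compliance(samples, threshold):
    """Return, per VM, the percentage of 5-minute samples at or below the threshold.

    `samples` maps a VM name to its list of 5-minute metric values
    (a hypothetical contention metric where lower is better).
    """
    result = {}
    for vm, values in samples.items():
        if not values:
            continue  # no data collected, so no verdict for this VM
        within = sum(1 for v in values if v <= threshold)
        result[vm] = 100.0 * within / len(values)
    return result


# Hypothetical data: four 5-minute samples per VM.
samples = {
    "vm-app-01": [1.0, 2.0, 6.5, 1.5],
    "vm-db-01": [0.5, 0.7, 0.9, 1.1],
}
print(sla_compliance(samples, threshold=5.0))
```

With numbers like these in hand, you can see which VMs the platform is not serving well before their owners complain.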

Is your IaaS cheaper than both VMware on Amazon and Amazon?

  • If not, your CIO may question your business value. The reason for having an in-house architect is that you can deliver a lower cost, even after taking your salary into account.

Does Help Desk provide a good first line of defence?

  • If Help Desk simply passes through to the next level, you need to look at why.
  • Help Desk is your first line of defence. They are not as technical as you are. Equip them with a simple dashboard so they can handle VM Owner complaints:
    • Is the problem caused by IaaS not serving the VM well?
    • If yes, which part of the Infra: CPU, RAM, Disk, Network?
    • If not, how to prove it convincingly?
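The logic behind such a Help Desk dashboard might look like the sketch below. The metric names and threshold values are hypothetical placeholders, not real product counters; the point is only that each VM complaint gets a simple answer per infrastructure component.

```python
# Hypothetical per-component contention thresholds (units depend on your metrics).
THRESHOLDS = {"cpu": 5.0, "ram": 1.0, "disk": 20.0, "network": 0.5}


def triage(vm_metrics):
    """Return the infrastructure components whose contention exceeds its threshold.

    An empty list means the IaaS served the VM within all thresholds,
    which is the evidence Help Desk can show to the VM Owner.
    """
    return [
        component
        for component in ("cpu", "ram", "disk", "network")
        if vm_metrics.get(component, 0.0) > THRESHOLDS[component]
    ]


print(triage({"cpu": 1.2, "ram": 0.0, "disk": 35.0, "network": 0.1}))  # ['disk']
print(triage({"cpu": 1.2, "ram": 0.0, "disk": 4.0, "network": 0.1}))   # []
```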

Can you justify new infrastructure when utilization is not yet high?

  • This is not referring to the additional money that comes with a new project. This is referring to existing clusters/storage.
  • Capacity is measured on utilization and performance. A cluster’s capacity is full if it can’t serve its VMs well. Since it takes time to buy hardware, you need to have early warning to detect this performance degradation.

Do you struggle with many over-provisioned VMs?

  • This is an indicator that you’re operating as a System Builder as opposed to a Service Provider.
  • As a System Builder, you’re meddling with each System (read: Application). You size them, and argue with the application team.
  • As a Service Provider, you’re not “in the way”. IT simply uses an effective pricing model to drive the right behaviour. Does AWS block you when you buy a 40-CPU EC2 VM when you only need 2 CPUs?

Does Troubleshooting mean all hands on deck?

  • Do you have a process that is followed by all teams (network, storage, server, OS, application)? Does that process end with Root Cause Analysis?
  • As part of RCA, do you set up alerts so the issue can be detected faster if it happens again?

There are more questions, but I thought we would start with those first. If you want to see details, download this.