Tag Archives: VMware

Purpose-driven Architecture

When you architect IaaS or DaaS, what end goals do you have in mind? I don’t mean the design considerations, such as best practices. I mean the business result that your architecture has to deliver. A sign that your architecture has failed to deliver is you get into this situation:

The goal of IaaS is to ensure the VMs are running well. The goal of DaaS is to ensure End Users are getting good desktop experience. Have you defined well or good?

Let’s discuss IaaS first. Say you’re architecting for 10K VM in 2 datacenters. You envisage 2K VM in the first month, then ramp up to 10K within the first year. Do you know the basic info about each of these 10K VMs, so that you can architect an infra to serve them well?

  • How big are they? vCPU, RAM, Disk
  • How intense are they? CPU Utilization, RAM utilisation, Disk IOPS, Network throughput?
  • Their workload pattern? Daily, weekly, monthly, etc.

You don’t. Even the applications team don’t know. Their vendors don’t know either, as you’re talking about the future.

So why do you promise that your IaaS will serve them well?

That’s a strategic mistake you make as Systems Architect. It’s akin to promising the highway you architect will all the cars, buses and motorcycle well, when you have no idea how many they are and how often they will use it.

Can you do something about it? Yes. You simply provide a good set of choice. The principle you share to your customers are the common sense used in all service industry:

You want it cheap, it won't be fast.
You want it fast, it won't be cheap.

You then offer a few class of service. Give 3 good choice, at difference price point. The highest price has the best performance.

  • Your price has to be cheaper than VMware on AWS, else what’s the point. VMware on AWS  has identical architecture to yours, as it’s using the same software and providing same capabilities. This assures your customers that they are getting good price.
  • Your performance is well defined. It is not subject to interpretation. You put a Performance SLA on the table, assuring your customers that you’re confidence of delivering as promised.

You then architect your IaaS to deliver the above classes of service. The class of service is your business offering. It’s the purpose of your architecture. With class of service clearly defined, the question below becomes easy to answer.

When you know exactly the quality of service you need to deliver, the operations team will not suffer. You handover your architecture to them with ease, as it can be operated easily. It has clear definition of performance and capacity.

Keep the summary below when you are architecting IaaS or DaaS.

For more details, review Operationalize Your World.

A test of your IaaS Operations maturity

What you architect is SDDC. What you handover as business result to CIO is IaaS. We can assess if the architecture is good or not, based on the actual result in production. Does it result in fire-fighting and blame-storming? Or you have a peaceful operations?

The litmus test below helps you assess the maturity of your IaaS.

Do your customers blame your infrastructure?

  • If the answer is yes, take a step to ask yourself why. There is a high chance you’re relying on complaint in your operations. So you actually encourage it. No complaint, no problem. A Complaint-based Operations.
  • The reason why you rely on complaint is you don’t have other means. You have not defined the performance of your IaaS.
  • A sign of matured operations is you have Performance SLA. It is per-VM, measured every 5 minutes.

Is your IaaS cheaper than both VMware on Amazon and Amazon?

  • If not, your CIO may question your business value. The reason for having an in-house architect is so you can bring lower cost, after taking into account your salary.

 Does Help Desk provide a good first level defense?

  • If Help Desk simply passes through to the next level, you need to look at why.
  • Help Desk is your first line of defence. They are not as technical as you are. Equip them with simple dashboard so they can handle VM Owner complaint:
    • Is the problem caused by IaaS not serving the VM well?
    • If yes, which part of the Infra: CPU, RAM, Disk, Network?
    • If not, how to prove it convincingly?

Can you justify new infrastructure when utilization is not yet high?

  • This is not referring to additional money that comes with new project. This is referring to existing clusters/storage.
  • Capacity is measured on utilization and performance. A cluster capacity is full if it can’t serve its VMs well. Since it takes time to buy hardware, you need to have have early warning to detect this performance degradation.

Do you struggle with many over-provisioned VMs?

  • This is an indicator that you’re operating as a System Builder as opposed to a Service Provider.
  • As a System Builder, you’re meddling with each System (read: Application). You size them, and argue with application team.
  • As a Service Provider, you’re not “on the way”. IT simply uses an effective pricing model to drive the right behaviour. Does AWS block you when you buy 40 CPU EC2 VM when you only need 2 CPU?

Does Troubleshooting mean all hands on deck?

  • Do you have a process that is followed by all teams (network, storage, server, OS, application)? Does that process end with Root Cause Analysis?
  • As part of RCA, do you set up alert so issue can be detected faster if it happens again?

There are more questions, but I thought we start with those first. If you want to see details, download this.

Keeping VMware Tools current

Keeping VMware Tools current is one of the best practices of vSphere operations. VMware Tools interfaces with both ESXi and VM (the virtual motherboard or virtual machine). Hence, there are 2 comparisons to consider:

  1. VM Hardware version
  2. ESXi version

From the vSphere API, here is what you get when you query it:

  • Guest Tools Current
  • Guest Tools Not Installed
  • Guest Tools Supported New
  • Guest Tools Supported Old
  • Guest Tools Too Old
  • Guest Tools Unmanaged

What do they mean?

  • Guest Tools Not Installed:
    • Tools are not installed on the VM. You should install it as you get both drivers and visibility.
  • Current
    • Tools version matches with the Tools available with ESXi. Each ESXi has a version of Tools that comes with it. See this for the list. This is the ideal scenario.
  • Supported New
    • Newer than the ESXi VMware tools version, but it is supported.
  • Supported Old
    • The opposite of New. It is also supported. Even it is older by 0.0.1 is considered old. It does not have to far behind.
  • Too Old
    • Tools version is older than the minimum supported version of Tools across all ESXi versions. Minimum supported version is the oldest version of Tools we support. Basically, guest is running unsupported Tools. You should upgrade. As of now for Linux and Windows guests. minimum supported version is the Tools version bundled with ESXi 4.0 which is 8.0.1. Supporting such old versions is challenging. We are planning to change this in future to something newer. In the meantime, you should upgrade as might not work as expected
  • Unmanaged
    • Tools installed in the guest did not come from ESXi, so Tools is not being managed by ESXi host. It may be supported or maybe not, depends on what type of Tools is running in the guest. We support open-vm-tools packaged by Linux vendors and OSPs, which both show up as unmanaged.
    • If a customer builds their own open-vm-tools from source code, we may not support that because we will not know if they have done it correctly or not.

Operationalize Your World has a dashboard that highlights the VMs not running the current or supported new. You should expect the number to be minimal, or ideally none.