Author Archives: Iwan Rahabok

About Iwan Rahabok

A father of 2 little girls, my pride and joy. The youngest one quickly said "I'm the joy!"

The Rise and Fall of Infrastructure Architect

I’ve been in IT for almost 2.5 decades. We are fortunate to experience a once-in-a-lifetime journey of technology change. Technology has changed both work and life. Business now runs on IT, and businesses such as banks, airlines and telcos practically depend on IT. Within IT, applications run on infrastructure. This infrastructure has improved so drastically that it has become a commodity. With the arrival of cloud computing, it has become a utility too. When something becomes both a commodity and a utility, the value of the people who know it declines as a consequence. The value of the Infrastructure Architect has diminished, as the technology has become good enough, simple enough, and cheap enough for most cases. Granted, mega infrastructure such as AWS and VMware on AWS is complex. But how many of us are working there?

Most of us aren’t doing this mega infrastructure. Most businesses have <10K VMs. At a 25:1 consolidation ratio, that’s <400 ESXi hosts. At around 12 ESXi hosts per cluster, that’s just 36 clusters, including HA. Space-wise, it will occupy just ~10 racks. 1000 VMs per rack for all compute + storage + network is doable.
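The back-of-envelope sizing above can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are hypothetical, and it assumes that "including HA" means one host in each 12-host cluster is set aside as HA capacity.

```python
import math

def size_environment(vms, vms_per_host=25, hosts_per_cluster=12,
                     ha_hosts_per_cluster=1, vms_per_rack=1000):
    """Rough sizing of a vSphere estate from a VM count (illustrative only)."""
    # Hosts needed to carry the workload at the given consolidation ratio.
    hosts = math.ceil(vms / vms_per_host)
    # Assumption: each cluster reserves some hosts for HA, so only the
    # remainder carries workload.
    usable_hosts = hosts_per_cluster - ha_hosts_per_cluster
    clusters = round(hosts / usable_hosts, 1)
    racks = math.ceil(vms / vms_per_rack)
    return {"hosts": hosts, "clusters": clusters, "racks": racks}

print(size_environment(10_000))
# {'hosts': 400, 'clusters': 36.4, 'racks': 10}
```

The cluster count is left unrounded on purpose, so you can see how close it is to the rough figure of 36 before deciding whether to round up.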

Compared with, say, 10 years ago, it’s much easier to architect and operate a VMware environment with just 10K VMs. It’s easier because there are many reference architectures, such as VMware Validated Design and VMware Cloud Foundation. For those using VMware on AWS, the design, implementation, upgrade and support are done by VMware.

So what can you do as an Infrastructure Architect?

If you are not moving into a managerial or sales position, you need to add skills that are valued by the CIO or the business. That means non-technical skills, as these folks care less about technical matters. The following diagram shows the career progression:

Since Infrastructure is becoming a service, you need to know how to architect a service (e.g. IaaS, DBaaS, Desktop as a Service).

  • What services does the IaaS provide? How do you define a service?
  • What metrics do you use to quantify its quality?
  • How many services are there? How do you distinguish between a higher-class service and a normal one?

You also need to know what types of services are in demand. Yes, this requires you to go out and meet your customers. Understand their requirements. What price/performance is in demand? From there, you can architect the corresponding services.

The next step after Service Architect is Business Architect. This is especially valuable to the CIO, who runs the business of IT. It’s also important to a Cloud SP, whose business is actually selling the service.

For a start, know the business you are in. Below are the 2 main models. Be clear on the nuance, as Internal IT is morphing towards internal Cloud Provider.

As a Business Architect, you not only know the cost of running the service, but you also know how and when to break even. You do not have to be responsible for P&L, as you’re not the CIO or Cloud SP CEO, but you play a strategic role for them. You’re not merely a techie. You know what to price and how to price, and your price is competitive.

The world of Cost and Price is a complex one. vRealize comes with a tool to help you manage this part.

Summary

  • Systems Architect needs to evolve, as infrastructure is becoming commodity and utility.
  • Service Architect and Business Architect are the next steps for Infrastructure Architect.

Operationalize Your World with vRealize Operations 6.7

Hard to believe that Aria, the code name for vRealize Operations 6.7, is finally here with us. I worked closely with the R&D team and was privileged to see how those in the arena fought their best. Many folks (Monica, Sunny, Karen, Esther and countless others) worked long hours, including during holidays and in flight. I remember many times we were having lunch at 3 pm and dinner at 10 pm as we developed and tested the product.

vRealize Operations 6.7 enables changes that make Operationalize Your World better. I will not repeat the generic product benefits (e.g. scalability, usability, performance). I’d rather focus on specific benefits that I’ve taken advantage of (meaning they require changes on my end, and do not come to Operationalize Your World automatically). My goal is to show you what changes you can make to make your customisation even better. You will see that some of the dashboards change substantially, because the widgets enable me to do things differently. The View List widget’s ability to drive other widgets is one of the features I apply to give a different dashboard experience.

I also use Percentile instead of Maximum. Maximum is good as an early warning, as it captures the outlier. In capacity use cases, you want to base your decision on something more realistic. A 99th percentile provides additional data that helps you make a better judgement. In other cases, I replace Average with the 95th percentile, as Average is simply not practical in operations.
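To illustrate why a percentile can sit between Maximum and Average, here is a small Python sketch. The sample data is made up, and the nearest-rank percentile method is an assumption chosen for simplicity; it is not necessarily how vRealize Operations computes its percentiles.

```python
import math
import statistics

def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical CPU utilization samples (%): mostly quiet, a busy period, one spike.
cpu = [30] * 90 + [70] * 9 + [100]

print("Average:", statistics.mean(cpu))
print("95th percentile:", nearest_rank_percentile(cpu, 95))
print("99th percentile:", nearest_rank_percentile(cpu, 99))
print("Maximum:", max(cpu))
```

In this made-up series the single spike drives Maximum to 100, while Average is pulled down by the quiet periods. The 95th and 99th percentiles both land on the sustained busy level of 70, which is the more realistic number to size against.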

With that, let’s go through the changes dashboard by dashboard. I won’t explain every single one in detail, else you’d get bored 🙂 My goal is to give you a real-world example of the type of customization you can do for your bespoke operations.

At the end of this blog, I list the steps required to upgrade Ops Your World to be compatible with vR Ops 6.7. Without it, some dashboards will not work.

Performance: Overall

  • The table that lists the clusters now sports color coding. Now that it’s using View instead of Object List, the columns can have proper units. For example, 2.5 TB RAM instead of a long number in KB.
  • Added more insight into Resource Pools, since you need to ensure each pool has an equal number of members.
  • I highlight clusters that have too few hosts as that means your HA overhead is high. A cluster with 4 nodes means your HA overhead is 25%.
  • I highlight clusters that have too few physical cores. Your VMware license gives you unlimited cores per socket. Microsoft has stopped giving unlimited cores per socket.
  • I merged the Affected VMs dashboard into this one to make it easier. You no longer have to switch to another dashboard. I rounded the numbers to 1 decimal place to drive home the point that you should not look beyond 1 decimal place.
  • Because Affected VM is gone, it’s 1 less dashboard-sprawl 🙂
  • It cannot pass the VM object when you move to another dashboard. This is because of a limitation of the View List widget: it cannot drive a widget on another dashboard. Yes, we are aware of this limitation 🙂
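The 25% HA overhead quoted above for a 4-node cluster follows from reserving one host's worth of capacity for failover. A minimal sketch, assuming exactly one host per cluster is held back for HA (the function name is hypothetical):

```python
def ha_overhead_pct(hosts, ha_hosts=1):
    """Share of cluster capacity reserved for HA, in percent."""
    if hosts <= 0 or not 0 <= ha_hosts <= hosts:
        raise ValueError("hosts must be positive and ha_hosts within 0..hosts")
    return 100 * ha_hosts / hosts

for n in (4, 8, 12, 16):
    print(f"{n} hosts: {ha_overhead_pct(n):.1f}% HA overhead")
```

The overhead shrinks as the cluster grows, which is why very small clusters are worth highlighting.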

Performance: Villain VMs

  • It’s now an independent dashboard. You now need to select the cluster again.
  • Added CPU and RAM, to complement Disk and Network.
  • I’m not using the Cluster average, as imbalance is possible, especially if you have large VMs, resource pools, shares and limits.
    • For CPU: I use CPU GHz, so it takes into account the VM CPU Size
    • For RAM: I use RAM Consumed. This is more accurate than Guest OS metrics as this is what ESXi maps to the physical DIMMs.
  • I’m using Health Chart so it can display in color. Because of HA, your ESXi utilization should be lower, as your HA host is also participating.

Performance: Single VM Monitoring

  • I disable the Received Dropped Packet. I think vCenter drops the packet when it’s not intended for the VM, but the counter counts it. So it’s a false positive.

Performance: Single VM Troubleshooting

  • You no longer have to select the VM again. It’s automatically passed from the Single VM Monitoring dashboard. This is handy as you may have 1000s of VMs.
  • You no longer have to figure out which property to choose to find out the parent ESXi. It’s automatically selected, plus the name of the current ESXi is automatically shown. This is good as some customers could not find the property.
  • You also do not need to clear the metric chart, which is not intuitive.
  • The dashboard has been rearranged. It is neater and easier.

Capacity: Clusters Capacity

  • The table now provides an easier way to see capacity across clusters. The limitation of the health chart was that it does not scale if you have lots of clusters. You also need to see 3 months of data, which can be hard to read since there are thousands of data points.
  • Clicking on the table lets you drill down. I now add current allocation, as some customers do not overcommit RAM.
  • I’ve removed 2 dashboards (Overcommit Clusters and Non-Overcommit Clusters) as I found almost all customers mix allocation and utilization. You overcommit CPU but do not overcommit RAM. To me, this is actually the right thing to do. The reason is that most VMs use 2 – 8 GB of RAM for each vCPU. A 32 vCPU VM needs somewhere between 64 – 256 GB RAM. While Intel Xeon Platinum has 24 cores, you may find the premium too high. I see most ESXi hosts use the 16-core Xeon, giving a total of 32 cores across 2 sockets. So if you do not overcommit CPU, all you need is 64 – 256 GB of RAM. Buying 512 GB is simply a waste.
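The RAM sizing reasoning above can be checked with a short calculation. This is a sketch only: the function and parameter names are hypothetical, and it assumes a 2-socket host and the 2 – 8 GB per vCPU range quoted above.

```python
def ram_needed_gb(sockets, cores_per_socket, vcpu_per_core=1, gb_per_vcpu=(2, 8)):
    """Return the (low, high) RAM in GB a host needs for its vCPUs."""
    # vcpu_per_core=1 models "no CPU overcommit".
    vcpus = sockets * cores_per_socket * vcpu_per_core
    low, high = gb_per_vcpu
    return vcpus * low, vcpus * high

print(ram_needed_gb(2, 16))                   # no CPU overcommit: (64, 256)
print(ram_needed_gb(2, 16, vcpu_per_core=2))  # 2:1 CPU overcommit: (128, 512)
```

Under these assumptions, 512 GB only becomes relevant once you overcommit CPU at 2:1 and every VM sits at the top of the RAM range.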

Capacity: Idle and Powered Off VMs

  • When you upgrade, the number you see may differ. This is due to the new capacity engine. I’ve modified the group selection criteria to comply with the new engine.
  • The heatmap has been improved. It is easier to see where these VMs are, as the group names are clearer. The color now draws your attention to those VMs, giving you more bang for the buck.

Capacity: Oversized VM (CPU)

  • For Guest OS CPU utilisation, I’m replacing CPU Demand with a new supermetric. As you can see here, the VM CPU Demand counter includes workload that does not come from the VM, which should not be counted. Demand is also affected by frequency scaling and HT, which are not relevant in the context of VM CPU utilization. A VM is consuming a CPU at 100%, regardless of whether the 2nd HT runs or not. The fact that the 2nd HT runs at 100% does not mean the VM utilization is 62.5%. We need to distinguish utilisation from capacity and performance use cases.
  • In the table that lists all the VM utilization, I replaced Average utilization with 95th percentile. It gives you more confidence to right size.

Capacity: Oversized VM (RAM)

  • We know that RAM is used as cache by the OS. As a result, memory consumption tends to be high. I changed the dashboard to focus on what you can safely claim, which is the Free RAM.
  • The heatmap now focuses on Free RAM. The larger the box, the more you can claim. The color now indicates how safe it is to claim it.

NOC: Capacity ESXi Utilization

  • I changed ESXi Active to Balloon. It’s a better indicator of ESXi Memory Utilization.

Storage: Capacity 

  • I merged the overall and detail dashboards into 1.
  • Limitation: to see the powered off VMs, you need to select the Datastore heatmap instead of the Datastore cluster. This can be fixed by using a supermetric, but since I’ve created almost 100, I thought I’d give you a chance to practice 😉

Easier Import & Export

  • No need to import a dummy policy just to import super metrics, and then delete the policy once imported. Now you know exactly which super metrics are imported.
  • You can also export the super metrics in bulk. Useful when replicating changes to your non-Prod vR Ops instance.
  • No need to manually create XML files for resource interaction. This is now configurable in the UI.

Upgrading Operationalize Your World

If you do not customize Ops Your World dashboards, super metrics, views, etc., then your upgrade is easier. It consists of 2 steps:

  1. Download and Import Ops Your World. Yes, overwrite your existing ones.
  2. Enable the super metrics in your Default policy (it is marked with D)

Enable these metrics & properties. They were disabled as the majority of vR Ops users are small companies. Operationalize Your World targets large deployments.

  • VM: Guest File System | Total Guest File System Free (GB)
    • Needed to show low disk space as an absolute amount, as % alone does not tell the full picture. 1% of 2 TB and 1% of 50 GB aren’t the same.
  • VM: CPU Used.
    • Needed to show the individual vCPU. Ideally, you only want this on large VMs. If your environment is >10K VM, you can create a separate policy and enable the supermetric for this group only.
  • VM: Summary | Number of Datastores
    • Once you clean up, you can disable it again. Your IaaS architecture & operations policy should either put all VMDKs of a VM in 1 datastore or intentionally separate them. For ease of operations, put them in 1 datastore.
  • VM: Memory | VM Reservation
  • VM: Heartbeat
    • For better detection of Guest OS uptime. This is kind of redundant, as we already have OS Uptime counter.
  • VM: CPU Run
    • Used in VM Right Sizing. I’ve developed a new model to measure VM CPU Utilization. It uses this counter. I’m keen to get real world feedback from you.
  • VM: CPU Overlap
    • Used in VM Right Sizing.
  • ESXi: Error Packets Received
    • They were disabled as the remediation is likely not at vSphere layer.

You can safely delete the XML configuration files as they are no longer required.

Allocation Model in vSphere

The Allocation model, using the vCPU:pCore and vRAM:pRAM ratios, is one of the 2 capacity models used in VMware vSphere. Together with the Utilization model, it helps the Infra team manage capacity. The problem with both models is that neither of them measures performance. While they correlate with performance, they are not counters for it.

As part of Operationalize Your World, we proposed a measurement for performance. We modeled performance and developed a counter for it. For the very first time, Performance can be defined and quantified. We also added an availability concept, in the form of a concentration risk ratio. Most businesses cannot tolerate too many critical VMs going down at the same time, especially if they are revenue generating.

Since the debut of Operationalize Your World at VMworld 2015, hundreds of customers have validated this new metric. With performance added, we are in a position to revise VMware vSphere capacity management.

We can now refine Capacity Management and split it into Planning, Monitoring and Troubleshooting.

Planning Stage

At this stage, we do not know what the future workload will be. We can plan to deliver a certain level of performance at some level of utilization. We use the allocation ratio at this stage. The Allocation Ratio directly relates to your cost, hence your price. If a physical core costs $10 per month, and you do 5:1 over-commit, then each vCPU should be priced at no less than $2 per month. Lower than this, and you will make a loss. It has to be higher than $2 unless you can sell all resources on Day 1 for 3 years.
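The pricing arithmetic above can be written as a small function. This is a minimal sketch: the function name is hypothetical, and the `sold_fraction` parameter is an added assumption that models not selling all capacity from Day 1.

```python
def break_even_vcpu_price(core_cost, overcommit, sold_fraction=1.0):
    """Monthly price per vCPU needed to recover the cost of a physical core.

    core_cost     -- monthly cost of one physical core
    overcommit    -- vCPU:pCore allocation ratio (5 means 5:1)
    sold_fraction -- assumed average share of vCPUs actually sold (0-1]
    """
    if overcommit <= 0 or not 0 < sold_fraction <= 1:
        raise ValueError("overcommit must be > 0 and sold_fraction in (0, 1]")
    return core_cost / (overcommit * sold_fraction)

# Fully sold from Day 1: $10 per core at 5:1 over-commit.
print(break_even_vcpu_price(10, 5))  # 2.0

# Only 80% sold on average: the break-even price rises above $2.
print(break_even_vcpu_price(10, 5, sold_fraction=0.8))
```

The second call shows why the price has to be higher than the simple ratio suggests: any unsold capacity must be paid for by the vCPUs that are sold.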

We also consider availability at this stage. For example, if the business can only tolerate 100 mission-critical VMs going down when a cluster goes down, then we plan our cluster size accordingly. There is no point planning a large cluster when you can only put 100 VMs in it. 100 VMs, at an average size of 8 vCPUs, require 400 cores at 2:1 over-commit. Using 40-core ESXi hosts, that’s only 10 ESXi hosts. There is no point building a cluster of 16.
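The cluster-size calculation above can be sketched as follows. The function name and parameters are hypothetical, and the sketch ignores HA spare capacity and RAM, just as the worked example does.

```python
import math

def hosts_for_vm_limit(max_vms, avg_vcpus_per_vm, overcommit, cores_per_host):
    """Hosts needed to run the maximum number of VMs allowed in one cluster."""
    # Physical cores required once the vCPU:pCore ratio is applied.
    cores = max_vms * avg_vcpus_per_vm / overcommit
    return math.ceil(cores / cores_per_host)

# 100 VMs x 8 vCPUs at 2:1 over-commit on 40-core hosts.
print(hosts_for_vm_limit(100, 8, 2, 40))  # 10
```

Any host beyond this count cannot be filled without exceeding the number of VMs the business tolerates losing at once.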

Monitoring Stage

This is where you check if Plan meets Actual. You have live VMs running, so you have real data, not a spreadsheet 🙂 . There are 2 possible situations:

  1. Over-commit
  2. No over-commit.

With no over-commit, the utilization of the cluster will never exceed 100%. Hence there is no point measuring utilization. There will be no performance issue either, since none of the VMs will compete for resources. No contention means ideal performance. So there is no point measuring performance. The only relevant metrics are availability and allocation.

With over-commit, the opposite happens. The Ratio is no longer valid, as we can have performance issues. It’s also not relevant since we have real data. If you plan on 8:1 over-commit, but at 4:1 you have performance issues, do you keep going? You don’t, even if you make a loss because your financial plan was based on 8:1. You need to figure out why and solve it. If you cannot solve it, then you remain at 4:1. What you learn is that your plan did not pan out 😉

There are 3 reasons why ratio (read: allocation model) can be wrong:

Mark Achtemichuk, VMware performance guru, summarizes it well here. Quoting him:

There is no common ratio and in fact, this line of thinking will cause you operational pain.

Troubleshooting Stage

If you have plenty of capacity, but you have a performance problem, you enter capacity troubleshooting. A typical cause of poor performance when utilization is not high is contention. The VMs are competing for resources. This is where the Cluster Performance (%) counter comes into play. It gives an early warning, hence acting as a Leading Indicator.

Summary

You no longer have to build a buffer to ensure performance. You can go higher on the consolidation ratio as you can now measure performance.

If you are a Service Provider, you can now offer premium pricing, as you can back it up with a Performance SLA.

If you are a customer of an SP, then you can demand a performance SLA. You do not need to rely on the ratio as a proxy.