Tag Archives: VMware vSphere

Architecting VMware vSphere for Operations

As an Architect, you take into account many requirements when designing your VMware vSphere environment. As an Operations person, I shall not question your Architecture. I’m sure it is fast, highly available, right on budget, etc. My role is to help you prove to the CIO that what you architected actually lives up to expectations. “Plan meets Actual” is what an Architect wants, because that means your architecture delivers on its promise.

The plan of the Architecture exists in diagrams and documents. It's static.
The reality of the Architecture exists in the datacenter. It's live.

When proving, here are the questions we should be able to answer:

  • Availability: Does the IaaS deliver the promised Availability SLA? If not, what was the actual number, when was it breached, for how long, and which VMs were affected? For each VM, when exactly did the outage start and end?
  • Performance: Does the IaaS serve all its VMs well? An IaaS platform provides CPU, RAM, Disk and Network as services. It has to deliver these 4 resources when asked by each VM, 24 x 7, 365 days a year. If not, which VMs were affected, by what, and when?
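
As a rough illustration of the Availability question, here is a minimal Python sketch that turns a list of outage windows into an actual availability number per VM and flags the VMs that fell below the promised SLA. The `Outage` record, the VM names, the dates and the 99.9% target are all hypothetical, and the sketch assumes that outages for the same VM do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Outage:
    vm: str          # hypothetical VM name
    start: datetime  # when the VM became unavailable
    end: datetime    # when it was available again


def availability_pct(outages, period_start, period_end):
    """Return {vm: availability %} for every VM with downtime in the period."""
    period = (period_end - period_start).total_seconds()
    down = {}
    for o in outages:
        # Clip each outage to the reporting period before counting it.
        start = max(o.start, period_start)
        end = min(o.end, period_end)
        if end > start:
            down[o.vm] = down.get(o.vm, 0.0) + (end - start).total_seconds()
    return {vm: 100.0 * (1 - secs / period) for vm, secs in down.items()}


def sla_breaches(outages, period_start, period_end, sla_pct):
    """Return only the VMs whose availability fell below the promised SLA."""
    actual = availability_pct(outages, period_start, period_end)
    return {vm: pct for vm, pct in actual.items() if pct < sla_pct}


# Illustrative monthly report: one 2-hour outage and one 10-minute outage.
june_start = datetime(2024, 6, 1)
june_end = datetime(2024, 7, 1)
outages = [
    Outage("payroll-web-01", datetime(2024, 6, 10, 2, 0), datetime(2024, 6, 10, 4, 0)),
    Outage("payroll-db-01", datetime(2024, 6, 20, 13, 0), datetime(2024, 6, 20, 13, 10)),
]
breaches = sla_breaches(outages, june_start, june_end, sla_pct=99.9)
for vm, pct in breaches.items():
    print(f"{vm}: {pct:.3f}% (promised 99.9%)")
```

In this example only the VM with the 2-hour outage breaches the target. The same per-VM numbers can then be rolled up per application or per Business Unit.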

Those are simple questions, but they are very difficult to answer. Let’s say you have 10,000 VMs. How do you answer them? How do you provide the answer over time (e.g. monthly), proving you handle the peaks well?

To complicate matters, you need to be able to answer per Business Unit. Business Unit A will not care about other business units. Since a Business Unit has more than one app, you also need to answer per application. An application owner only cares about her applications.

There are a few things you need to do so that you are in a position to prove it.

Step 1: Reflect the business in vSphere

Does your vCenter show all the business units? Can you show how the business is mapped into your vSphere environment? You design vSphere so Business can run on it, so where is the Business? A company is made up of business units, which may have multiple departments. The structure below is a typical example.

An application typically has multiple tiers. Does your vSphere understand that?

Map the above as folders in your vCenter.

I see many naming conventions that are not operations-friendly. It’s impossible to guess what a name refers to. The names are very similar and hard to read, so it’s easy for operators to make mistakes! In some companies, these operators are outsourced staff or contractors, who are not as familiar with the environment or don’t care as much as employees. The naming convention typically originates from the mainframe or MS-DOS era, when names could not contain spaces and were limited in length. An example is SG1-D01-INS-0001W-PRD. Can you guess what on earth that is? You’re right, you can’t. Imagine there are 1000s of them like that, and you have new operators joining the help desk team.

If you have a shared application, you can create a folder for that. Multiple vR Ops applications can point to the same vCenter folder.

Folder, Tags and Annotation

Have you seen a vSphere environment where there are tags and annotations everywhere?

It’s rare to meet customers with a 100% well-thought-out and documented approach to the 3 features above. They may have general guidelines, but not explicit Do’s and Don’ts. As a result, these 3 features are used wrongly.

Use Tags when the values are discrete, ideally Yes/No. I’d use tags to mark the following:

  • VMs with RDM.
  • VMs with MSCS or Linux Clustering.

Do not use Tags when a value is unlikely to be shared by many objects. Use annotations for this. Examples are VM Owner Name, Email Address, Mobile Number. In an environment where there are >10K VMs, there can be 1000 VM Owners.

Do not use Tags to tag Service Tier. For Infra objects such as Cluster and Datastore Cluster, that should be clearly reflected in the name itself. I’d prefix all Tier 1 clusters with Tier 1, so the chance of deploying into the wrong tier is minimized.

Step 2: Design Service Tiers into vSphere

Does your vSphere understand that there are different classes of service? Are Tier 1 clusters and datastores clearly labelled?

You should avoid mixing multiple classes of service in a single cluster or datastore. While it is technically possible to segregate them, it’s operationally challenging. Resource pool shares are set per pool, not per VM, so they only behave as intended when sibling pools hold a similar number of VMs.
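
The sibling pool problem is easier to see with numbers. The following Python sketch uses a deliberately simplified model: each pool's shares are split evenly among its VMs, and every VM is contending at the same time. The pool names, share values and VM counts are illustrative assumptions, not values taken from a real cluster.

```python
def per_vm_share(pool_shares, vm_count):
    """Shares each VM effectively gets if the pool's shares are split evenly."""
    return pool_shares / vm_count


# Hypothetical sibling pools in one cluster. Tier 1 was given 4x the shares
# of Tier 3, but it also holds 8x as many VMs.
pools = {
    "Tier 1": {"shares": 8000, "vms": 40},
    "Tier 3": {"shares": 2000, "vms": 5},
}

effective = {
    name: per_vm_share(p["shares"], p["vms"]) for name, p in pools.items()
}

for name, share in effective.items():
    print(f"{name}: {share:.0f} shares per VM")
```

Under these assumptions a Tier 3 VM ends up with 400 shares and a Tier 1 VM with only 200, the opposite of what the tiering intended. Keeping the numbers right means re-tuning shares every time a VM is added or removed, which is the operational challenge.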

For each tier, you need to have both an Availability SLA and a Performance SLA. For Performance SLA, review this doc.

Step 3: Define and Map Tiers in vR Ops

Now that you’ve designed service tiers into your vSphere architecture, it’s time to show them. You cannot show them in vSphere, as vCenter does not understand Performance SLA and Availability SLA. You can use vR Ops to do this. Follow this step.

Step 4: Map Applications in vR Ops

Use custom groups to create applications. If you have a proper naming convention, it should not be difficult to select the members of an application. All you need is a query that says the name contains XYZ. There should not be a need for regular expressions.
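
To show why a readable naming convention makes membership trivial, here is a small Python sketch of a "name contains" rule. It is not how vR Ops evaluates custom groups. The VM names are hypothetical, and treating the match as case-insensitive is an assumption of this sketch.

```python
def members(vm_names, contains):
    """Return the VM names that contain the given text (case-insensitive here)."""
    needle = contains.lower()
    return [name for name in vm_names if needle in name.lower()]


# Hypothetical inventory that follows a readable naming convention.
inventory = [
    "Payroll Web 01",
    "Payroll Web 02",
    "Payroll DB 01",
    "Intranet Web 01",
]

payroll_app = members(inventory, "Payroll")
payroll_web_tier = members(inventory, "Payroll Web")
print(payroll_app)
print(payroll_web_tier)
```

With a name like SG1-D01-INS-0001W-PRD, the same rule would need a lookup table or a regular expression to work out which application a VM belongs to.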

Once apps are mapped, you can do something like this.

Step 5: Consider Debug-ability

Things go wrong. Especially in production. Your architecture should lend itself to troubleshooting.

A major area is ensuring the counters are reliable, otherwise it’s hard to troubleshoot performance. The CPU Contention counter, which is the main counter for IaaS Performance SLA, is greatly affected by Power Management. Ensure your ESXi power management follows this guide by Sunny Dua.

Once you have that in place, you will be able to prove that your Architecture lives up to expectations. Use the dashboards from Operationalize Your World to show that proof!

Monitoring changes to VMware vSphere Template

Templates are a common feature used by many VMware Administrators. There are articles such as this on how to manage the version. So I will cover something that I could not find on Google, which is how you prove to an auditor that your templates have not been modified by an unauthorised person. If a template has been modified, you want to know who did it.

The good thing is there are only a few things you can change on a template. The bulk of the changes require the template to be converted into a VM. The changes you can make to a template are shown below:

[Screenshot: what vCenter captures]

You can see that you can rename the template, change the permission, and convert it into a VM. All these are tracked in vCenter. This means a log analysis tool can visualise it better for you.

Let’s see who renamed it. Perform a text search on template and renamed. You can see an example below.


As most changes to a template require converting it into a VM, let’s see who converted a template into a VM, and vice versa. Log Insight already has a field for it, so it’s a matter of specifying it. Choose the field, and specify that it should contain mark*.
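
The two searches can be sketched in plain Python to make the logic concrete. This is not Log Insight's query language. The event records, field names (`task`, `text`, `user`, `target`) and values below are hypothetical stand-ins for what a log tool would extract. The only names taken from vSphere are the `MarkAsTemplate` and `MarkAsVirtualMachine` operations, which is why a `mark*` filter catches conversions in both directions.

```python
from fnmatch import fnmatch

# Hypothetical, already-parsed events. Real vCenter log lines look different.
events = [
    {"time": "09:14", "user": "LAB\\alice", "task": "MarkAsVirtualMachine",
     "target": "Win2012-Template", "text": "Task: Mark as virtual machine"},
    {"time": "09:40", "user": "LAB\\alice", "task": "MarkAsTemplate",
     "target": "Win2012-Template", "text": "Task: Mark virtual machine as template"},
    {"time": "10:02", "user": "LAB\\bob", "task": "Rename",
     "target": "Win2012-Template", "text": "Renamed template Win2012-Template"},
    {"time": "10:30", "user": "LAB\\bob", "task": "PowerOn",
     "target": "App-VM-01", "text": "Task: Power on virtual machine"},
]


def renames(evts):
    """Text search: events whose message mentions both 'template' and 'renamed'."""
    return [e for e in evts
            if "template" in e["text"].lower() and "renamed" in e["text"].lower()]


def conversions(evts, pattern="mark*"):
    """Field filter: events whose task name matches the wildcard pattern."""
    return [e for e in evts if fnmatch(e["task"].lower(), pattern.lower())]


for e in renames(events) + conversions(events):
    print(e["time"], e["user"], e["task"], e["target"])

# Narrowing the pattern keeps only conversions of a VM into a template.
to_template = conversions(events, "markastemplate")
```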

[Screenshot: summary of changes]

In the above, I only have 1 template that I changed. You can see that it captures information such as who did it, what time, to what template and in which cluster.

If you want to see only the conversions of a VM into a template, you can filter the field further, as shown below.

[Screenshot: who made the VM a template]

Hope you find it useful in entertaining, I mean assuring, your auditor team.

Is vSphere performing well?

In general, you know that you’ve done a good job with your vSphere IaaS because the VM Owners are happy with the performance of their VMs. Business is powered by the VMware infrastructure that you design and operate.


What do the vSphere logs say? Is there anything lurking in the log files?

As VMware professionals, we know vSphere well and probably have years of experience on it. We can architect, design, implement, upgrade, and even troubleshoot it.

The same cannot be said of the logs. Generally speaking, the deep knowledge of vSphere logs belongs to VMware GSS engineers, as they read logs on a daily basis, performing all kinds of troubleshooting. That knowledge has been slowly codified into Log Insight. I’m not sure which engineers are doing this great job, but Steven Flanders will be a good starting point. If you have any feedback on the Content Pack, drop him a note.

BTW, if you are new to Log Insight, there are many bloggers who have written good articles on Log Insight. Examples are VMware Arena (vSphere), VMguru (NSX integration), Cody (variety of articles). BlueShift has also written a good overview here.

Back to the question: what does the log of vSphere say about its health?

This is where the vSphere content pack comes in. As you can see from the following screenshot, it comes with many out-of-the-box dashboards, queries, alerts and fields. I’d encourage you to review their definitions, and not simply look at the dashboard content.


Going through the above can be daunting, as it is indeed a lot. Yes, vSphere has a lot of logs, and Log Insight merely reflects that breadth of information. One approach that works for me is Search. I search for the particular thing I am after. For example, in the following screen, I searched for vMotion. The result told me what information I can get about vMotion.


You might be wondering how to use the variables, or Fields as Log Insight calls them. Well, you don’t have to do anything, as they automatically appear in the drop-down. All you need to do is type the name. I type vmw_ in the following screenshot, as all the VMware-specific content packs follow this naming convention. Notice that the variable name is followed by the Content Pack name in brackets. This lets you know which content pack is contributing that field. Nice!


You will eventually create your own fields. Please have a naming convention. I typically use the company name as a prefix.

Now that you understand the basics, let’s review some of the dashboards.

The first one answers these questions: What vSphere alarms are we getting? How often do they happen? When do they happen? Is there a pattern? Which hosts, clusters, etc. get them?


I plotted the above for just 1 day. We can tell which alarms hit my small environment, and when. There was a spike at 10 pm.

Do you notice how easy it was to produce the dashboard? All it took was 2 built-in variables. Log Insight has created the vc_event_type and vmw_vc_alarm_type fields. The dashboard was built by simply choosing that they exist. So long as there is a value, it counts as existing.

The above chart was grouped by type of alarm. What if you are looking at your customers, and want it grouped by VM?

This is where another built-in field comes in. The field vm_vm_name identifies the VM name.
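
The "exists, then group by" idea behind both charts can be sketched in a few lines of Python. The field names are the ones mentioned in the text. The events, alarm types and VM names are made up for illustration, and the dictionaries stand in for whatever Log Insight extracts internally.

```python
from collections import Counter


def count_by_field(events, field):
    """Count events per value of `field`, skipping events where it has no value."""
    return Counter(e[field] for e in events if e.get(field) not in (None, ""))


# Hypothetical extracted events. Not every event carries every field.
events = [
    {"vmw_vc_alarm_type": "Host memory usage", "vm_vm_name": None},
    {"vmw_vc_alarm_type": "Virtual machine CPU usage", "vm_vm_name": "Payroll Web 01"},
    {"vmw_vc_alarm_type": "Virtual machine CPU usage", "vm_vm_name": "Payroll DB 01"},
    {"vmw_vc_alarm_type": "Virtual machine CPU usage", "vm_vm_name": "Payroll Web 01"},
    {"vm_vm_name": "Intranet Web 01"},  # an event with no alarm type at all
]

by_alarm = count_by_field(events, "vmw_vc_alarm_type")
by_vm = count_by_field(events, "vm_vm_name")
print(by_alarm.most_common())
print(by_vm.most_common())
```

Switching the chart from alarm type to VM is only a change of the grouping field. The filter stays the same.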


The preceding 2 screenshots are individual charts, shown in the Interactive Analytics mode of Log Insight. Notice that at the top of the screen there are Dashboards and Interactive Analytics. Dashboards is where you will go first in your day-to-day operations.

You will eventually build your own dashboards, picking what you like from the out-of-the-box dashboards. The good thing about your own is that they won’t be superseded during a Log Insight upgrade.

The dashboard below shows the vSphere alarms. From here, you can drill into any of the widgets. I normally use an existing dashboard as a starting point, drill down into one of the widgets, customise it, and convert the chart into a widget in my own custom dashboard.


Let’s move into Storage, as that’s one area VMware Admins have to deal with. Regardless of storage platform, ESXi is the one executing the IO on the VM’s behalf. You can track errors by device, hostname, and path. You can also track the latency as seen by the VMkernel. Knowing the latency at the hypervisor level complements the info you have at the VM level and Guest OS level.

[Screenshot: SCSI (not all devices)]

I put the word “Regardless” in red, because it is not actually regardless. There is a situation where ESXi cannot provide you with the above info. Can you figure out what that situation is?

Yup, it is the distributed storage architecture. Specifically, when the physical disk is passed through directly to the Storage VM as a PCI device (not RDM). In this case, the hypervisor does not see it. Only a monitoring tool that knows how that specific product works can monitor it. vCenter and vRealize Operations cannot help you. My point here is that you need to know how your things work first, before you can monitor them properly.

You might notice that the latency widget on the preceding dashboard shows no results. That’s because the threshold was set at 1 second. That’s 1000 ms, which the filter expresses as 1000000, so the field appears to be in microseconds. You can easily change it. In the dashboard below, I’ve changed the filter on the field vmw_esxi_scsi_latency from >1000000 to just exists. I also plot the Minimum and Maximum, so I can see the variation.
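
A short Python sketch shows why the original widget was empty and what relaxing the filter does. It assumes the latency values are in microseconds, which is inferred from the >1000000 filter being described as 1 second. The sample values are invented.

```python
ONE_SECOND_US = 1_000_000  # 1 second = 1000 ms = 1,000,000 microseconds


def latency_min_max_ms(samples_us, threshold_us=None):
    """Return (min, max) latency in ms for samples above the threshold.

    With threshold_us=None every sample counts, which mirrors an 'exists'
    filter. Returns None when nothing passes the filter.
    """
    kept = [s for s in samples_us if threshold_us is None or s > threshold_us]
    if not kept:
        return None
    return min(kept) / 1000, max(kept) / 1000


# Hypothetical latency samples in microseconds, all well under 1 second.
samples = [1200, 45000, 310000]

strict = latency_min_max_ms(samples, threshold_us=ONE_SECOND_US)
relaxed = latency_min_max_ms(samples)
print("over 1 second:", strict)
print("exists:", relaxed)
```

In a healthy environment nothing exceeds the 1 second threshold, so an empty widget is good news. Plotting the minimum and maximum with the relaxed filter shows the spread that the strict filter hides.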


If I’m interested in just a specific device, I can filter for it. I can also change the chart type, or display only the Maximum so it’s easier to see.


Ok, that was Storage.

Let’s move to DRS and HA. The dashboard below gives an overall picture of DRS and HA events. I can see that during the time period I specified, all my clusters were balanced. However, I have quite a number of VM heartbeat issues. [Yes, the product team is aware that the chart legend says red but it’s showing blue]


What about the speed of vMotion? How much bandwidth does it take? Did it utilize my 10Gb Ethernet well?

[Screenshot: vMotion]

From the above, my precopy stun time is on the high side. In a healthy environment, this number would be <300 ms. The bandwidth would also be much higher than that. This is a lab, and I know the physical connectivity.

I think you’ve got the idea of how Log Insight helps you. Let’s do just 1 example at the VM level. This dashboard answers a specific question: Did any VM hit high CPU Usage? If yes, which VM and when?


We covered quite a few things. They are all familiar to you: Storage, Cluster, VM, etc.

Now… your CIO may ask: what about something we don’t know? Are there any errors, warnings, timeouts, aborts, etc. that we need to know about?

Below is the built-in query that gives you that. It’s a pretty complex query 🙂


An example of the result is below. I have grouped the results by ESXi host.
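
The built-in query itself is not reproduced here. As a much simpler stand-in for the idea, the Python sketch below scans messages for a handful of problem keywords and counts the hits per host. The keyword list comes from the question above. The host names and log messages are hypothetical, and the real query is considerably more thorough.

```python
import re
from collections import Counter

# Matches error(s), warning(s), timeout(s), abort/aborted, in any letter case.
PROBLEM = re.compile(r"\b(error|warning|timeout|abort)", re.IGNORECASE)


def problems_by_host(lines):
    """Count problem messages per host. `lines` holds (hostname, message) pairs."""
    return Counter(host for host, message in lines if PROBLEM.search(message))


# Hypothetical messages, not real ESXi log output.
lines = [
    ("esxi-01", "SCSI command aborted on device"),
    ("esxi-01", "Heartbeat timeout on datastore"),
    ("esxi-02", "vMotion completed"),
    ("esxi-02", "Warning: low memory"),
]

counts = problems_by_host(lines)
for host, count in counts.most_common():
    print(host, count)
```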


Hope you find it useful. Deploy it (free for 25 OSI), do the health check, and let me know what you found! You might be surprised, and have to cancel that vacation 🙂 🙂