Category Archives: Architecture

Covers both architecture and engineering. It does not cover operations and strategy.

Multi-tier Application Monitoring

This post is part of Operationalize Your World. Do read that post first to get the context.

I covered a single-tier application in this post. Read that first, as this post builds upon it.


Great! A lot of you have shared that you want to monitor multi-tier applications. One customer has a mission-critical application that spans 5 tiers and 68 VMs. The dashboard I shared earlier does not scale to that level, as you certainly don’t want to check 68 VMs one by one!


To make sure we are on the same page, here is an example of a multi-tier application:

A multi-tier application can suffer from either a horizontal or a vertical problem:

  • By horizontal, I mean a tier has a problem. When the web tier is slow, it can slow down the entire application. The speed of a convoy is determined by the slowest car. We will use the following formula to determine the application performance:
    Application Performance = Minimum (Tier Performance)
  • By vertical, I mean something that cuts across tiers. Storage, for example. If the slowness is caused by something common, there is no need to troubleshoot individual VMs, as they are simply victims.
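The convoy formula can be sketched in a few lines of Python. This is only an illustration, assuming each tier's performance is already expressed as a score from 0 to 100 (higher is better). The function names, tier names and scores are made up for the example.

```python
def application_performance(tier_performance: dict[str, float]) -> float:
    """Return the application score: the score of its slowest tier."""
    if not tier_performance:
        raise ValueError("an application needs at least one tier")
    return min(tier_performance.values())


def slowest_tier(tier_performance: dict[str, float]) -> str:
    """Return the name of the tier that sets the application score."""
    return min(tier_performance, key=tier_performance.get)


# Hypothetical scores (0-100, higher is better) for a 3-tier application.
tiers = {"web": 62.0, "app": 95.0, "db": 88.0}
print(application_performance(tiers))  # 62.0
print(slowest_tier(tiers))             # web
```

Knowing which tier sets the minimum answers the "which tier had the problem?" question before you look at any individual VM.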

That means we need to check both angles when an application has a performance problem:

  • Which tier had the problem? Since when? How bad? What was the problem?
  • What infra problem did the app have? Storage, Network, CPU, RAM?

The above check makes a good starting point in your analysis. Don’t zoom into a particular VM until you know the overall picture. There is no point firefighting in the kitchen if the whole house is on fire.

Application Tier

The health of a tier is the average health of its members. This is because a tier scales out. We are not taking the minimum value. This is not a convoy.

“Hold on!” you might say. “Since it is scale-out, the App Team has catered for this. If they only need 3 web servers, they will deploy 4 or even 5. So both performance and availability are not affected. The tier performance has to take into account this extra node, not simply do an average.”

This logic sounds reasonable. But is it correct?

It is not, actually, because this is not about Availability. This is about Performance. All web servers are still up, but if node number 4 is slower, user experience will be affected.

My fellow blogger Luciano Gomes advises that some load balancers can detect the performance of a node. This is good, as simply counting the number of sessions is not a complete measurement. The node measurement is based on this formula. It takes into account how the IaaS platform is serving the node, so it looks beyond the Guest OS. This is because performance cannot be measured from within the Guest OS only. Review this discussion between David Davis and Sunny Dua.

This is why we are doing an average. You want to be informed if there is degradation, since this is about performance, not availability.
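To make the difference concrete, here is a minimal sketch, assuming each VM already has a health score from 0 to 100. The node values are hypothetical. One degraded node pulls the tier average down even though every node is still up.

```python
from statistics import mean


def tier_health(vm_health: list[float]) -> float:
    """Tier health is the plain average of its member VMs."""
    return mean(vm_health)


# Hypothetical web tier: 5 nodes deployed, node 4 is degraded but still up.
web_nodes = [100.0, 100.0, 100.0, 40.0, 100.0]

print(tier_health(web_nodes))         # 88.0
print(all(h > 0 for h in web_nodes))  # True: an availability check sees no issue
```

A minimum would report 40 and hide the fact that four nodes are healthy, while an availability check would report nothing at all. The average sits in between and moves as soon as any member degrades.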

The health of a VM is simple. We leverage the work we’ve done for a single-tier application here.


I found that doing a logical design of the dashboard actually saves time. Follow best practices to help you.

Here is the logical design of the dashboard. Notice it has 3 levels: App, Tier, VM.

Here is what it looks like:

Click on the image below to see an explanation of each section:

Hope you find it useful. As usual, you do not have to build this from scratch. This is part of Operationalize Your World, which gives you 50+ dashboards.

Architecting VMware vSphere for Operations

As an Architect, you take into account many requirements when designing your VMware vSphere environment. As an Operations person, I shall not question your Architecture. I’m sure it is fast, highly available, right on budget, etc. My role is to help you prove to the CIO that what you architected actually lives up to its expectations. “Plan meets Actual” is what an Architect wants, because that means your architecture delivers on its promise.

The plan of the Architecture exists in some diagram and documents. It's static.
The reality of the Architecture exists in Datacenter. It's live.

To prove it, here are the questions we should be able to answer:

  • Availability: Does the IaaS deliver the promised Availability SLA? If not, what was its actual number, when was it breached, for how long, and which VMs were affected? For each VM, when exactly did the outage start and end?
  • Performance: Does the IaaS serve all its VMs well? An IaaS platform provides CPU, RAM, Disk and Network as services. It has to deliver these 4 resources when asked by each VM, 24 x 7, 365 days a year. If not, which VMs were affected, by what, and when?
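As an illustration of the Availability question, the sketch below computes the actual availability of one VM over a reporting period from a list of outage windows. It assumes you already have the outage start and end times from your monitoring tool. The function name, the dates and the 99.95% target are hypothetical.

```python
from datetime import datetime, timedelta


def availability_pct(period_start, period_end, outages):
    """Percentage of the period during which the VM was up.

    `outages` is a list of (start, end) datetime pairs; outages are
    assumed not to overlap each other.
    """
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Count only the part of the outage that falls inside the period.
        overlap = min(end, period_end) - max(start, period_start)
        down += max(overlap.total_seconds(), 0.0)
    return 100.0 * (period - down) / period


# Hypothetical 30-day month with one outage of 43 minutes 12 seconds.
start = datetime(2024, 4, 1)
end = start + timedelta(days=30)
outages = [(datetime(2024, 4, 10, 2, 0), datetime(2024, 4, 10, 2, 43, 12))]

actual = availability_pct(start, end, outages)
promised = 99.95  # assumed SLA target, in percent
print(round(actual, 2), actual >= promised)  # 99.9 False
```

Running this per VM, then grouping the results, gives the "actual number" and "which VMs were affected" parts of the answer. The outage list itself answers when each breach started and ended.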

Those are simple questions, but they are very difficult to answer. Let’s say you have 10,000 VMs. How do you answer that? How do you provide the answer over time (e.g. monthly), proving you handle the peaks well?

To complicate matters, you need to be able to answer per Business Unit. Business Unit A will not care about other business units. Since a Business Unit has >1 apps, you also need to answer per application. An application owner only cares about her applications.

There are a few things you need to do so that you are in a position to prove it.

Step 1: Reflect the business in vSphere

Does your vCenter show all the business units? Can you show how the business is mapped into your vSphere environment? You design vSphere so Business can run on it, so where is the Business? A company is made up of business units, which may have multiple departments. The structure below is a typical example.

An application typically has multiple tiers. Does your vSphere understand that?

Map the above as folders in your vCenter.

I see many naming conventions that are not operations-friendly. It’s impossible to guess what a name refers to. The names are very similar and hard to read, hence it’s easy for operators to make mistakes! In some companies, these operators are outsourced or contractors, who are not as familiar with the environment or don’t care as much as employees. The naming convention typically originates from the mainframe or MS-DOS era, when names could not contain spaces and were limited in length. An example is SG1-D01-INS-0001W-PRD. Can you guess what on earth that is? You’re right, you can’t. Imagine there are 1000s of them like that, and you have new operators joining the help desk team.

If you have a shared application, you can create a folder for that. Multiple vR Ops applications can point to the same vCenter folder.

Folder, Tags and Annotation

Have you seen a vSphere environment where there are tags and annotation everywhere?

It’s rare to meet customers with a 100% well-thought-out and documented approach to the 3 features above. They may have general guidelines, but not explicit Do’s and Don’ts. As a result, these 3 features are used wrongly.

Use Tags when the values are discrete, ideally Yes/No. I’d use tags to mark the following:

  • VMs with RDM.
  • VMs with MSCS or Linux Clustering.

Do not use Tags when the values are unlikely to be common. Use annotation for this. Examples are VM Owner Name, Email Address, Mobile Number. In an environment where there are >10K VMs, there can be 1000 VM Owners.

Do not use Tags to tag Service Tier. For Infra objects such as Cluster and Datastore Cluster, that should be clearly reflected in the name itself. I’d prefix all Tier 1 clusters with Tier 1, so the chance of deploying into the wrong tier is minimized.
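The rule of thumb for Tags versus Annotation can be written down as a small helper, which may be handy when reviewing an existing environment. This is only a sketch of the guideline above: the function name is made up, and the cut-off of 5 distinct values is an assumption you should tune.

```python
def suggest_mechanism(values, max_distinct=5):
    """Suggest 'tag' for attributes with few distinct values, else 'annotation'.

    `max_distinct` is an assumed cut-off, not a vSphere limit.
    """
    distinct = set(values)
    return "tag" if len(distinct) <= max_distinct else "annotation"


# Hypothetical attribute values collected from an inventory export.
has_rdm = [True, False, False, True, False]
owner_email = [f"owner{i}@example.com" for i in range(1000)]

print(suggest_mechanism(has_rdm))      # tag
print(suggest_mechanism(owner_email))  # annotation
```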

Step 2: Design Service Tiers into vSphere

Does your vSphere understand that there are different classes of service? Are Tier 1 clusters and datastores clearly labelled?

You should avoid mixing multiple classes of service into a single cluster or datastore. While it is technically possible to segregate, it’s operationally challenging. A Resource Pool divides its shares among the VMs inside it, so it behaves as expected only when sibling pools hold the same number of VMs.

For each tier, you need to have both an Availability SLA and a Performance SLA. For Performance SLA, review this doc.
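A quick worked example shows why sibling Resource Pools with different VM counts are a problem. It is a simplification: it assumes all VMs in a pool are identical and all of them are contending at the same time. The pool names, share values and VM counts are hypothetical.

```python
def shares_per_vm(pool_shares: int, vm_count: int) -> float:
    """Rough per-VM slice of a pool's shares when every VM contends equally."""
    if vm_count <= 0:
        raise ValueError("pool must contain at least one VM")
    return pool_shares / vm_count


# Hypothetical sibling pools in one cluster.
pools = {
    "Tier 1": {"shares": 8000, "vms": 40},
    "Tier 3": {"shares": 2000, "vms": 5},
}

for name, pool in pools.items():
    print(name, shares_per_vm(pool["shares"], pool["vms"]))
# Tier 1 200.0
# Tier 3 400.0
```

Even though the Tier 1 pool has four times the shares, each Tier 1 VM ends up with half the slice of a Tier 3 VM under contention. Keeping that balanced as VMs come and go is the operational challenge.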

Step 3: Define and Map Tiers in vR Ops

Now that you’ve designed service tiers into your vSphere architecture, it’s time to show it. You cannot show it in vSphere, as vCenter does not understand Performance SLA and Availability SLA. You can use vR Ops to do this. Follow this step.

Step 4: Map Applications in vR Ops

Use custom groups to create applications. If you have a proper naming convention, it should not be difficult to select members of the applications. All you need is a query that says the name contains XYZ. There should not be a need for a regular expression.
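The membership logic is nothing more than a "contains" check. The sketch below shows that logic in Python. It is not the vR Ops API, and the VM names are hypothetical. It simply demonstrates that a readable naming convention makes application and tier membership a one-line filter.

```python
def select_members(vm_names, token):
    """Return the VM names that contain `token` (case-insensitive)."""
    token = token.lower()
    return [name for name in vm_names if token in name.lower()]


# Hypothetical inventory that follows a readable naming convention.
inventory = [
    "Payroll Web 01",
    "Payroll Web 02",
    "Payroll DB 01",
    "Intranet Web 01",
]

print(select_members(inventory, "Payroll"))
# ['Payroll Web 01', 'Payroll Web 02', 'Payroll DB 01']
print(select_members(inventory, "Payroll Web"))
# ['Payroll Web 01', 'Payroll Web 02']
```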

Once apps are mapped, you can do something like this.

Step 5: Consider Debug-ability

Things go wrong. Especially in production. Your architecture should lend itself to troubleshooting.

A major area is to ensure the counters are reliable, else it’s hard to troubleshoot performance. The CPU Contention counter, which is the main counter for IaaS Performance SLA, is greatly affected by Power Management. Ensure your ESXi power management follows this guide by Sunny Dua.

Once you have that in place, you will be able to prove that your Architecture lives up to its expectation. Use the dashboards from Operationalize Your World to show that proof!

SDDC Network Monitoring

Thank you Sasha Velednitsky and Hsien-chung Woo from NetFlow Logic for contributing this post!

Monitoring Network Metadata in Real Time

Network devices are a rich source of information about the network’s traffic, exported in the NetFlow, sFlow, or IPFIX formats. This metadata is voluminous and most valuable for operational and security purposes.

You get the best insights when the data are captured and analyzed in real time. This is where the data processing engine in NetFlow Integrator comes in. It can process hundreds of thousands of these records per second. Users can apply a myriad of solutions to understand the health and robustness of their networks, as well as the imminence of security threats. The results of NetFlow Integrator processing and analytics are then visually displayed via vRealize Log Insight.

Most network management tools use LLDP or CDP protocols (designed for topology discovery) to reveal network device connectivity, and do not identify the actual network traffic. On the other hand, NetFlow Integrator’s analytics are based on real network traffic. A useful analogy: if you are driving within a city, a city map will be helpful. However, it is much better to have both a map and a depiction of the traffic congestion, so you can navigate more efficiently.

SDDC Monitoring

One of the biggest operational concerns for IT Operations and SDDC Administrators is the lack of visibility between the virtual and physical networking layers — how to trace and troubleshoot connectivity issues. Typically, SDDC management tools monitor virtual network devices, such as vSphere Distributed Switch (VDS), Distributed Logical Routing, Distributed Firewall, Edge Services Gateway, and others. What if a performance degradation or outage is caused by physical device failures or overloading?

How do we know where virtual network traffic is encapsulated, and how it traverses the physical network?

Legacy tools break down at the virtual to physical boundary. Lacking correlation between logical and physical networks leads to longer time to resolution, and unacceptable outage time frames for many customers.

For complete visibility you need to collect and analyze flows from both virtual and physical devices. Luckily, most vendors support some sort of flow generation technology (Cisco: NetFlow; Juniper: jFlow; Dell, HP, Arista, Brocade: sFlow; VDS: IPFIX).

Configure all of your flow-capable exporters, such as Top of Rack switches, core and aggregation switches, routers, and virtual switches (e.g. VDS or Open vSwitch) to send NetFlow/sFlow/IPFIX to NetFlow Integrator for visibility of virtual and physical networks.

Network Counters

NetFlow Integrator accepts network flow data, applies algorithms to the data to extract the information needed to address desired use cases, converts the processed data to syslog, then sends that useful information to other systems for visualization. The granularity of these counters is configurable.
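To illustrate the "convert to syslog" step, here is a minimal sketch that turns one processed flow record into a syslog-style key=value line. The field names, host name and message layout are invented for the example. They are not NetFlow Integrator's actual output format, and the line is a simplified form without a timestamp.

```python
def to_syslog(record, host="flow-collector", facility=16, severity=6):
    """Format one processed flow record as a simplified syslog-style line."""
    pri = facility * 8 + severity  # syslog priority: local0 (16), info (6)
    fields = " ".join(f"{key}={value}" for key, value in record.items())
    return f"<{pri}>{host} flowsummary: {fields}"


# Hypothetical summarized flow record.
record = {
    "src": "10.0.0.5",
    "dst": "10.0.1.9",
    "dport": 443,
    "proto": "tcp",
    "bytes": 9000,
    "packets": 12,
}

print(to_syslog(record))
# <134>flow-collector flowsummary: src=10.0.0.5 dst=10.0.1.9 dport=443 proto=tcp bytes=9000 packets=12
```

Key=value text is easy for a log tool to index, which is why the processed data can be charted by source, destination, port or protocol.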

Network bandwidth is typically consumed by a relatively small number of users or applications. With NetFlow Integrator and Log Insight, SDDC administrators can identify which applications are using the most network bandwidth. Log Insight dashboards, shown below, provide this information by source IP, destination IP, ports and protocols.
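The "top talkers" idea behind these dashboards can be sketched as a simple aggregation. The flow records and field names below are hypothetical, and the real dashboards are built in Log Insight rather than in code. The logic is the same: sum bytes per key, then take the largest.

```python
from collections import Counter


def top_talkers(flows, key="src", n=3):
    """Sum bytes per `key` (e.g. 'src' or 'dst') and return the top `n`."""
    usage = Counter()
    for flow in flows:
        usage[flow[key]] += flow["bytes"]
    return usage.most_common(n)


# Hypothetical flow records for one reporting interval.
flows = [
    {"src": "10.0.0.5", "dst": "10.0.1.9", "bytes": 7000},
    {"src": "10.0.0.5", "dst": "10.0.1.10", "bytes": 2000},
    {"src": "10.0.0.7", "dst": "10.0.1.9", "bytes": 500},
    {"src": "10.0.0.8", "dst": "10.0.1.9", "bytes": 1500},
]

print(top_talkers(flows, key="src", n=2))
# [('10.0.0.5', 9000), ('10.0.0.8', 1500)]
print(top_talkers(flows, key="dst", n=1))
# [('10.0.1.9', 9000)]
```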


Micro-segmentation enables organizations to divide SDDC logically into segments, and to implement security groups and firewall rules down to workload levels.

East-West network traffic patterns by application ports and protocols enable administrators to plan and implement micro-segmentation using VMware NSX.

As NetFlow Integrator receives flow information from physical network devices, it reports network bandwidth consumption by each physical network device interface. The following counters are provided:

  • Traffic In Rate (Bytes/sec)
  • Traffic Out Rate (Bytes/sec)
  • Relative load %
  • Packets In Rate (Packets/sec)
  • Packets Out Rate (Packets/sec)
  • Relative Packets Rate %
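As a rough illustration of how such counters can be derived, the sketch below computes a byte rate from the volume seen in an interval and expresses it relative to the interface speed. The source does not define "Relative load %", so treating it as utilization against link speed is an assumption, as are the function names and sample numbers.

```python
def rate_per_sec(delta: int, interval_sec: float) -> float:
    """Average rate over the interval (bytes/sec or packets/sec)."""
    return delta / interval_sec


def relative_load_pct(bytes_per_sec: float, link_speed_bps: float) -> float:
    """Byte rate as a percentage of link speed (assumed meaning of 'relative load')."""
    return 100.0 * bytes_per_sec * 8 / link_speed_bps


# Hypothetical interface: 750,000,000 bytes received in 60 s on a 1 Gbit/s link.
traffic_in_rate = rate_per_sec(750_000_000, 60)
print(traffic_in_rate)                                    # 12500000.0
print(relative_load_pct(traffic_in_rate, 1_000_000_000))  # 10.0
```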

Virtual traffic is encapsulated at the Virtual Tunnel End Point (VTEP). For each VTEP, the following counters are provided:

  • Traffic In Rate (Bytes/sec)
  • Traffic Out Rate (Bytes/sec)
  • Packets In Rate (Packets/sec)
  • Packets Out Rate (Packets/sec)
  • Flow count

Advanced Analytics

Application performance and availability could also be impacted by a variety of factors, such as DDoS attacks. Sophisticated DDoS attacks are notoriously difficult to detect on a timely basis and to defend against. Traditional perimeter-based technologies such as firewalls and intrusion detection systems (IDSs) do not provide comprehensive DDoS protection. Solutions positioned inline must be deployed at each endpoint, and are vulnerable in case of a volumetric attack. Typically, solutions require systems to run in a “learning” mode, passively monitoring traffic patterns to understand normal behavior and establishing a baseline profile. The baseline is later used to detect anomalous network activity, which could be a DDoS attack. The building of these baselines takes days or weeks, and any change in the infrastructure makes a baseline obsolete, resulting in many false positives.
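To see why baseline-driven detection is fragile, consider this minimal sketch of the generic approach described above: learn a mean and standard deviation from past traffic, then flag anything too far from it. It is a textbook z-score check with made-up numbers, not any vendor's implementation.

```python
from statistics import mean, stdev


def is_anomalous(baseline, observed, threshold=3.0):
    """Flag `observed` if it is more than `threshold` std devs from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold


# Hypothetical flows-per-second samples collected during the learning period.
baseline = [100, 110, 95, 105, 90, 100]

print(is_anomalous(baseline, 104))  # False: within normal variation
print(is_anomalous(baseline, 400))  # True: a spike worth investigating
# After an infrastructure change, 250 may be the new normal,
# but the stale baseline still flags it: a false positive.
print(is_anomalous(baseline, 250))  # True
```

The last call shows the weakness: until the baseline is rebuilt, which takes days or weeks, legitimate traffic keeps raising alerts.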

In contrast to systems relying on baselines, NetFlow Logic’s Anomaly Detection – Traffic solution is based on flow information analysis. Thus it is not susceptible to volumetric flood attacks. Additionally, since it does not rely on baseline data collection, NetFlow Logic’s anomalous traffic detection solution can be operational 15-20 minutes after deployment.


NetFlow Logic’s solution is based on statistical and machine learning methods and consists of several components, each analyzing network metadata from a different perspective. Results of these analyses are combined and a final event reporting decision is made. The result of this “collective mind” approach is the reduction of false positives.