Category Archives: Architecture

Covers both architecture and engineering. It does not cover operations or strategy.

SDDC Network Monitoring

Thank you Sasha Velednitsky and Hsien-chung Woo from NetFlow Logic for contributing this post!

Monitoring Network Metadata in Real Time

Network devices are a rich source of information about the network’s traffic, exported as NetFlow, sFlow, or IPFIX records. This metadata is voluminous and most valuable for operational and security purposes.

You get the best insights when the data are captured and analyzed in real time. This is where the data processing engine in NetFlow Integrator comes in. It can process hundreds of thousands of these records per second. Users can apply a myriad of solutions to understand the health and robustness of their networks, as well as the imminence of security threats. The results of NetFlow Integrator processing and analytics are then visually displayed via vRealize Log Insight.

Most network management tools use LLDP or CDP protocols (designed for topology discovery) to reveal network device connectivity, and do not identify the actual network traffic. On the other hand, NetFlow Integrator’s analytics are based on real network traffic. A useful analogy: if you are driving within a city, a city map will be helpful. However, it is much better to have both a map and a depiction of the traffic congestion, so you can navigate more efficiently.

SDDC Monitoring

One of the biggest operational concerns for IT Operations and SDDC Administrators is the lack of visibility between the virtual and physical networking layers — how to trace and troubleshoot connectivity issues. Typically, SDDC management tools monitor virtual network devices, such as vSphere Distributed Switch (VDS), Distributed Logical Routing, Distributed Firewall, Edge Services Gateway, and others. What if a performance degradation or outage is caused by physical device failures or overloading?

How do we know where virtual network traffic is encapsulated, and how it traverses the physical network?

Legacy tools break down at the virtual-to-physical boundary. The lack of correlation between logical and physical networks leads to longer time to resolution, and to outage durations that are unacceptable for many customers.

For complete visibility you need to collect and analyze flows from both virtual and physical devices. Luckily, most vendors support some sort of flow generation technology (Cisco – NetFlow, Juniper – J-Flow, Dell, HP, Arista, Brocade – sFlow, VDS – IPFIX).

Configure all of your flow-capable exporters, such as Top of Rack switches, core and aggregation switches, routers, and virtual switches (e.g. VDS or Open vSwitch), to send NetFlow/sFlow/IPFIX to NetFlow Integrator for visibility of both virtual and physical networks.

Network Counters

NetFlow Integrator accepts network flow data, applies algorithms to the data to extract the information needed to address desired use cases, converts the processed data to syslog, then sends that useful information to other systems for visualization. The granularity of these counters is configurable.

Network bandwidth is typically consumed by a relatively small number of users or applications. With NetFlow Integrator and Log Insight, SDDC administrators can identify which applications are using the most network bandwidth. Log Insight dashboards, shown below, provide this information by source IP, destination IP, ports and protocols.

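[Screenshots 1 to 3: Log Insight dashboards showing bandwidth by source IP, destination IP, ports and protocols]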
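The top-consumer breakdown described above comes down to grouping flow records by a field and summing bytes. Here is a minimal sketch of that idea. The record layout (`src_ip`, `dst_ip`, `dst_port`, `protocol`, `bytes`) and the function name `top_talkers` are assumptions made for illustration; they are not NetFlow Integrator's or Log Insight's actual format or API.

```python
from collections import Counter

def top_talkers(flows, key="src_ip", n=3):
    """Sum bytes per distinct value of `key`; return the n largest as (value, bytes)."""
    totals = Counter()
    for flow in flows:
        totals[flow[key]] += flow["bytes"]
    return totals.most_common(n)

# Hypothetical flow records, already decoded from NetFlow/sFlow/IPFIX.
flows = [
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.1.9", "dst_port": 443, "protocol": "tcp", "bytes": 1_200_000},
    {"src_ip": "10.0.0.7", "dst_ip": "10.0.1.9", "dst_port": 443, "protocol": "tcp", "bytes": 300_000},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.2.4", "dst_port": 53, "protocol": "udp", "bytes": 40_000},
]

print(top_talkers(flows, key="src_ip", n=2))    # largest senders
print(top_talkers(flows, key="dst_port", n=1))  # busiest destination port
```

Changing `key` gives the same ranking by destination IP, port, or protocol, which is the kind of pivot the dashboards provide.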

Micro-segmentation enables organizations to divide the SDDC logically into segments, and to implement security groups and firewall rules down to the workload level.

East-West network traffic patterns by application ports and protocols enable administrators to plan and implement micro-segmentation using VMware NSX.

As NetFlow Integrator receives flow information from physical network devices, it reports network bandwidth consumption by each physical network device interface. The following counters are provided:

  • Traffic In Rate (Bytes/sec)
  • Traffic Out Rate (Bytes/sec)
  • Relative load %
  • Packets In Rate (Packets/sec)
  • Packets Out Rate (Packets/sec)
  • Relative Packets Rate %
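As a rough illustration of how per-interface counters like these can be derived, the sketch below turns byte totals observed over a reporting interval into rates. The post does not define "Relative load %", so treating it as the busier direction's bit rate divided by link speed is an assumption, as are the function name `interface_rates` and its parameters.

```python
def interface_rates(bytes_in, bytes_out, interval_s, link_bps):
    """Convert byte totals seen during one interval into per-second rates.

    load_pct is an ASSUMED definition of relative load: the busier
    direction's bit rate as a percentage of the link speed.
    """
    in_rate = bytes_in / interval_s    # Traffic In Rate, bytes/sec
    out_rate = bytes_out / interval_s  # Traffic Out Rate, bytes/sec
    busiest_bps = max(in_rate, out_rate) * 8
    return {
        "in_Bps": in_rate,
        "out_Bps": out_rate,
        "load_pct": 100 * busiest_bps / link_bps,
    }

# Hypothetical 1 Gbps interface observed for 60 seconds.
rates = interface_rates(3_000_000_000, 750_000_000, interval_s=60, link_bps=1_000_000_000)
print(rates)
```

The packet counters follow the same pattern, with packet totals in place of byte totals.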

Virtual traffic is encapsulated at the VXLAN Tunnel End Point (VTEP). For each VTEP the following counters are provided:

  • Traffic In Rate (Bytes/sec)
  • Traffic Out Rate (Bytes/sec)
  • Packets In Rate (Packets/sec)
  • Packets Out Rate (Packets/sec)
  • Flow count

Advanced Analytics

Application performance and availability could also be impacted by a variety of factors, such as DDoS attacks. Sophisticated DDoS attacks are notoriously difficult to detect on a timely basis and to defend against. Traditional perimeter-based technologies such as firewalls and intrusion detection systems (IDSs) do not provide comprehensive DDoS protection. Solutions positioned inline must be deployed at each endpoint, and are vulnerable in case of a volumetric attack. Typically, solutions require systems to run in a “learning” mode, passively monitoring traffic patterns to understand normal behavior and establishing a baseline profile. The baseline is later used to detect anomalous network activity, which could be a DDoS attack. The building of these baselines takes days or weeks, and any change in the infrastructure makes a baseline obsolete, resulting in many false positives.

In contrast to systems relying on the baselines, NetFlow Logic’s Anomaly Detection – Traffic solution is based on flow information analysis. Thus it is not susceptible to volumetric flood attacks. Additionally, since it does not rely on baseline data collection, NetFlow Logic’s anomalous traffic detection solution can be operational 15-20 minutes after deployment.


NetFlow Logic’s solution is based on statistical and machine learning methods and consists of several components, each analyzing network metadata from a different perspective. Results of these analyses are combined and a final event reporting decision is made. The result of this “collective mind” approach is the reduction of false positives.
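To show why combining several analyses can cut false positives, here is a minimal sketch of a simple voting scheme. It is not NetFlow Logic's actual algorithm, which the post does not describe in detail. The detector names, the 0 to 1 score scale, and the `threshold` and `min_votes` parameters are all hypothetical.

```python
def report_event(scores, threshold=0.5, min_votes=2):
    """Report an anomaly only when enough independent detectors agree.

    scores: mapping of detector name -> anomaly score in [0, 1] (assumed scale).
    """
    votes = sum(1 for score in scores.values() if score >= threshold)
    return votes >= min_votes

# One detector firing alone is not enough to raise an event.
single = report_event({"volume": 0.9, "entropy": 0.2, "fan_out": 0.1})
# Two detectors agreeing is enough.
agreed = report_event({"volume": 0.9, "entropy": 0.7, "fan_out": 0.1})
print(single, agreed)
```

Requiring agreement means a single noisy signal does not trigger a report on its own.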

Announcing VMware Performance & Capacity Management, 2nd edition


I have been waiting for a long time to be able to post this. The book started around Dec 2014, when the writing of the 1st edition was complete and the publisher set a cut-off date for changes. I knew many items were not covered. It was 1.0 after all.

Fast forward to February 2016, and we have revamped the content. Judging by both the amount of effort and the resulting book, this to me is more like 2.0 than 1.1. Page-wise, it is 500+ pages, double the 1st edition. You can see the structure of the second edition in this link. I have tried to codify the knowledge I have into a structured process.

It’s a surprise how much things changed in just 14 months. I certainly did not expect some of the changes back in Jan 2015!

  • Major improvement in monitoring beyond vSphere.
    • vRealize and its ecosystem have improved hugely in Storage, Network, and Application monitoring. This includes newer technology such as VSAN and NSX.
    • Many adapters (management packs) and content packs were released for both vRealize and Log Insight. I’m glad to see a thriving ecosystem. Blue Medora especially has moved ahead very fast.
  • Adoption of NSX and VSAN was so rapid that I had to add them. They were not part of the original plan for the 2nd edition.
  • Rapid adoption of VDI monitoring using vRealize. I had to include VDI use cases.
  • Adoption by customers, partners and internal teams has increased.
    • In the original plan, I wasn’t planning to ask any partners to contribute. So I’m surprised that 2 partners agreed right away.
    • It is much easier to ask for review, as people are interested and want to help.
  • vSphere 6.0 and 6.0 Update 1 were released.
    • Since the book focuses on operations (not architecture) and monitoring, the impact of both releases is minimal.
    • Not many counters changed compared with vSphere 5.5
  • vRealize Operations had 6.1 and 6.2 releases. Log Insight had many releases too.
    • Again, this has minimal impact, since the book is not a product book.

Release Notes of 2nd edition

The existing 8 chapters have been expanded and reorganized, resulting in 15 chapters. The book now has 3 distinct parts, structured in that order to make it easier for you to see the big picture. You will find the key changes versus the 1st edition below.

More complete

  • More explanation on Performance and Capacity Management.
  • Elaborates on the Performance SLA concept, as it has resonated with customers (from engagements, events and blogging)
  • Add Network monitoring, with focus on NSX
  • Add VSAN monitoring.
  • Add Horizon View monitoring. Practical tips like this.
  • Incorporate monitoring that is better done via VMware Log Insight
  • Add application-level monitoring. I asked Blue Medora to contribute as they know this better than I do.

More practical, less theory.

  • Move contents that are more theoretical to the back.
  • Add more examples, and structure them so readers can see the relationship.

Easier to read

  • Fewer long sentences, long paragraphs or complex tables.
  • More bullet points.
  • Break long chapters into smaller chunks.
  • Add more white space in places where it’s full of text.
  • More diagrams, to complement explanation.
  • Lighter words. Friendly chat among friends, not formal research paper.
  • Add humour.
  • Add adult picture. Ok, this is not a good idea.
  • Clearer pictures. Some pictures were too small.
  • Clearer headings and layout. The Heading 2 style is too big relative to the text.

Fix title

  • The book is actually not just for vRealize Operations users. It’s for the broader VMware team. This is more of a vSphere book than a vRealize book.
  • This would also make the darn title shorter 🙂 Yes, in future this will evolve to just SDDC Operations Management.

What does not change?

  • It remains focused on Performance and Capacity. I’m not adding Configuration, Availability, Security, etc.
  • The book will also remain a solution book, and not a product book. There is already a great product book here by a fellow CTO Ambassador.

I will provide as much free information as the publisher allows. We are looking at an early April publication, so in the meantime, here is what they have made available. When they have officially released it, I’ll add more information, such as proper acknowledgement of those who have made the book possible, and certainly a discount code.

If you want to publish a review in your blog or LinkedIn, I'll link you with Packt.

I hope you find it useful. If you have any corrections or suggestions, let me know on Twitter or LinkedIn.


PS: No, please don’t ask me about the 3rd edition. Right now I need a break 🙂 and to spend time with family! Below are my wife, my 2 girls and my 1st niece.

[Family photo]

VMware Performance SLA

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Google “performance SLA” VMware, and you will find only a few relevant articles. The string performance SLA has to be in quotes, as it is not performance and SLA, but Performance SLA. Yes, I’m after web pages with the words Performance SLA together. You will get many irrelevant results if you simply google VMware Performance SLA without the quotes.

I just tried it again (2 April 2017). It’s only 6000 results, up from 2330 results in Nov 2016, and 1640 results in Oct 2015. The first 10 are shown below. Notice 8 of them are actually from my blog, book or event. If you ask your peers, you will find that not many customers have a Performance SLA.

I didn't change the screenshot below, as it's similar to what I got in April 2017.

[Screenshot: first 10 Google results for “performance SLA” VMware, Nov 2016]

I checked beyond the first 10 results. Other than my own articles, Google returned only 5 relevant articles. The rest are not relevant. An example of a relevant article is one by Michael Webster, a former colleague and good friend. All the relevant articles are good and informative. They also mention Performance SLA. They just do not define and quantify what Performance SLA is. If something is not quantified, it is subjective. It’s hard to reach formal agreement with customers quickly and consistently when the line is not clearly drawn. If you have a disagreement with your customers, especially paying customers, guess who wins 🙂

A former colleague, Scott Drummonds, covered Performance SLA in his old blog back in 2010. It is unsurprising to me, knowing Scott, that he had thought about it years ago! However, what he covered was the application layer, not the IaaS layer. He also did not provide a counter to measure. Certainly, it was virtually impossible to provide that years ago, considering the maturity of IaaS at that time.

An Availability SLA protects you when there is downtime. A Performance SLA protects you when there is a performance issue. How?

If you are within your SLA, you are safe.

Below is an example of a Performance SLA. Describe (or define) the service for each of the 4 infrastructure components (CPU, RAM, Disk, and Network).

Take note:

  • Do not set the SLA at the individual vCPU level. Set it at the whole-VM level. It’s much harder to comply with and monitor an SLA per vCPU.
  • Do not set separate SLAs for Read and Write latency. Set one on the aggregate.
  • All numbers are measured as 5-minute averages. If a spike lasts only 1 minute and then calms down for the remaining 4 minutes, it may not show up.
  • To change the power management settings, see this KB (VM application runs slower than expected in ESXi)
  • The SLA impacts your architecture. It poses a constraint as it sets a formal threshold.
    • Example: how do we ensure Tier 3 storage does not impact Tier 1, since they are on the same array (spindles, CPU, etc.)? Some storage arrays have shares internally. Even on vSAN, this is challenging.
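The 5-minute averaging note above is worth a quick worked example. This is a minimal sketch with made-up numbers: one latency sample per minute, and a 15 ms limit chosen purely for illustration. The function name `five_minute_average` is hypothetical.

```python
def five_minute_average(samples_ms):
    """Average of the per-minute samples that fall in one 5-minute window."""
    return sum(samples_ms) / len(samples_ms)

# Hypothetical window: a 1-minute spike to 50 ms, then 4 calm minutes at 2 ms.
one_minute_samples = [50, 2, 2, 2, 2]
average = five_minute_average(one_minute_samples)

# The spike is well above the illustrative 15 ms limit, yet the average is not.
hidden = max(one_minute_samples) > 15 and average <= 15
print(average, hidden)
```

This is the trade-off of a 5-minute average: it filters out short blips, and it also means a short spike is not counted as a breach.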

The SLA above is an example; your policy as an IaaS provider may vary. For each component, list all the properties that impact the quality of the service.

Notice what’s missing in the table?

Something that you normally have if you are doing Capacity Management based on spreadsheet 🙂

Yup, it’s the Consolidation Ratio.

It’s not there because it’s not relevant. In fact, the ratio can be misleading as it does not take into account VM utilization, VM size, ESXi power management, DRS, backup period, etc. It is definitely a good guide for initial planning. Once you are in production, you need to monitor based on what’s happening in production. Performance experts like Mark Achtemichuk have explained it well here. I recommend you read it first.

Done?

Great! Let’s dive deeper. I will take one component, Storage, as it’s the easiest to understand.

  • VM Disk Latency for Tier 1: 5 ms
  • VM Disk Latency for Tier 2: 15 ms
  • VM Disk Latency for Tier 3: 25 ms
  • All values measured as 5-minute average.
  • The SLA is breached when the value exceeds the threshold in any 5-minute period, 365 days a year.
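The storage SLA above can be sketched as a simple check. The tier thresholds come from the list; the function name `breached_intervals` and the sample latencies are hypothetical, and each input value is assumed to already be a 5-minute average.

```python
# Disk latency SLA per tier, in milliseconds (values from the list above).
SLA_MS = {1: 5, 2: 15, 3: 25}

def breached_intervals(latencies_ms, tier):
    """Return the indexes of the 5-minute intervals whose latency exceeds the tier's SLA."""
    limit = SLA_MS[tier]
    return [i for i, value in enumerate(latencies_ms) if value > limit]

# Hypothetical 5-minute averages for a VM on Tier 2 storage.
tier2_vm = [4, 16, 15, 30]
print(breached_intervals(tier2_vm, tier=2))
```

A value equal to the threshold is treated as compliant here, since the SLA is breached only when the value exceeds it.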

When a VM owner complains that her VM is slow because of storage, and that VM resides on Tier 2 storage, both of you can see the VM disk latency. If it’s below 15 ms, it’s not your fault. Perhaps her application needs faster storage, and she can pay more and upgrade to Tier 1. If it is higher than 15 ms, you as the IaaS provider do not even have to wait until she complains 🙂 Better still, do something before she notices.

What number should you set?

  • If you have no data, the above is a good starting point.
  • If you have vR Ops running, you can set a number based on your actual data.
    • There is no point in setting something much higher or lower than what you actually have.
    • Use the super metric preview as shown below.
    • I plotted 1 month of data (that’s roughly 8650 data points). That’s more than enough.
    • Take the maximum. That’s your baseline.
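The "take the maximum" step can be sketched in a few lines, assuming you have exported the 5-minute latency averages as a plain list. The helper name `suggest_baseline` and the sample values are hypothetical; in vRealize Operations you would read the maximum from the super metric preview instead.

```python
POINTS_PER_DAY = 24 * 60 // 5  # one data point every 5 minutes

def suggest_baseline(history_ms):
    """Use the worst 5-minute average observed so far as the starting baseline."""
    return max(history_ms)

# Hypothetical excerpt of a month of 5-minute disk latency averages, in ms.
history_ms = [3.2, 7.9, 12.4, 6.0]
baseline = suggest_baseline(history_ms)
points_in_30_days = 30 * POINTS_PER_DAY
print(baseline, points_in_30_days)
```

Thirty days at 5-minute granularity works out to 8,640 points, in line with the figure quoted above.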

Alerts

Now… you have thousands of VMs under your management. I guess what you want is to be alerted if any of them breaches the SLA you promised.

Yes, vRealize Operations can alert you, so you can proactively do something before VM owner complains. Brandon Gordon, Integration Architect at VMware, showed me how we can achieve the above in vRealize Operations.

See the screenshot below, courtesy of Brandon.

[Screenshot: Performance SLA alert in vRealize Operations, courtesy of Brandon]

To get such alerts for each VM, you need to create and define the alerts. Brandon has defined them for CPU, RAM and Disk. He has also defined them for each tier.

[Screenshot: alert definitions for each tier]

I hope you find this useful, just as many of my customers have. If not, drop me a note.