Monthly Archives: June 2015

Architecting Log Insight as Log Management Platform

I shared in this blog the reasons why your VMware environment needs a proper Log Management Platform (LMP). There are criteria that an enterprise LMP needs to meet before it can manage your VMware platform.

This blog shows how you can architect Log Insight to help you manage a large-scale, distributed VMware environment.

Let’s look at the requirements, and this time see how Log Insight meets them.

Global visibility

  • It is common for businesses to have branches. These can be a small remote site with just 2 ESXi hosts, or a large data center with multiple vCenter Servers.
  • You want to be able to see all the logs in one place. This also makes them easier to query and analyze.
  • Log Insight can be deployed as a Forwarder in the remote site. All it does is forward the log entries.
  • Log Insight can forward to different servers based on different rules. This is useful because the Master sites are active/active across 2 data centers.

Remote site traffic should be compressed and encrypted.

  • The WAN link is typically constrained and shared, so compressing the traffic gives you a higher chance that your syslog arrives at HQ safely. The Log Insight Forwarder compresses the syslog entries and uses its own proprietary protocol to send them to HQ.
  • Log Insight also provides protection when the link is temporarily down. It holds the logs locally.

Disaster should not result in loss of visibility.

  • The LMP becomes a critical piece of your infrastructure. As a result, you want to protect it with DR. Using replication technology results in an active/passive architecture. It also means you are using another technology, one which has no awareness of the LMP.
  • Log Insight allows you to have an application-level active/active setup. We achieve this by getting the Forwarder to forward to both Global DC Log Insight instances. The 2 main Log Insight instances are completely independent of each other.
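To make the Forwarder behaviour concrete, here is a conceptual Python sketch of a store-and-forward relay: it compresses each batch of log entries, fans it out to both Global DC instances, and buffers locally when a link is down. This is only an illustration of the idea; Log Insight's actual forwarding uses its own proprietary protocol, and the endpoint URLs below are hypothetical.

```python
# Conceptual sketch only: compress, fan out to both data centers,
# and buffer locally while a WAN link is down. The URLs are hypothetical;
# Log Insight's real Forwarder uses its own proprietary protocol.
import gzip
import json
import queue
import urllib.request

UPSTREAMS = [
    "https://loginsight-dc1.example.com/ingest",  # hypothetical endpoint, DC 1
    "https://loginsight-dc2.example.com/ingest",  # hypothetical endpoint, DC 2
]

pending = queue.Queue(maxsize=100_000)  # local buffer for outages


def send(url, payload):
    req = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Content-Encoding": "gzip"},
    )
    urllib.request.urlopen(req, timeout=10)


def forward(events):
    """Compress a batch and send it to both main Log Insight instances."""
    payload = gzip.compress(json.dumps(events).encode("utf-8"))
    for url in UPSTREAMS:
        try:
            send(url, payload)
        except OSError:
            pending.put((url, payload))  # link down: hold it locally


def retry_pending():
    """Drain the local buffer once the link comes back."""
    while not pending.empty():
        url, payload = pending.get()
        try:
            send(url, payload)
        except OSError:
            pending.put((url, payload))  # still down, keep it and stop
            break
```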

The LMP should scale to thousands of ESXi Hosts.

  • You achieve this via clustering. Log Insight has built-in active/active clustering. In the architecture below, the Main Log Insight instances are all clustered.
  • The local Forwarders are not clustered, as their role is simply to forward data. They do not hold weeks of data (1 week is more than sufficient, as you should not have a 1-week outage). Also, you do not log in to the Forwarders, so they do not have to handle queries. Some queries, especially against a large data set, are resource intensive.

The LMP should handle special users or use cases.

  • One use case for logs is Audit and Compliance. vSphere provides a wealth of info that auditors or the security team want to see. Unlike the rest of the data in your vSphere environment, this data needs to be kept for years.
  • Most data in vSphere only needs to be kept for weeks. Take performance or availability data: after 4 weeks, it is unlikely to be relevant. If it only needs to be kept for weeks, then retention is not an issue 🙂
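To see why the two retention requirements lead to very different storage designs, here is a minimal back-of-the-envelope calculation. The ingest rates and event sizes are purely illustrative assumptions; plug in your own numbers.

```python
# Rough retention sizing with made-up ingest rates; replace with your own.
def retention_gb(events_per_sec, avg_event_bytes, days):
    return events_per_sec * avg_event_bytes * 86_400 * days / 1e9

# Operational logs: keep ~4 weeks, after which they are rarely relevant.
ops_gb = retention_gb(events_per_sec=2_000, avg_event_bytes=300, days=28)

# Audit/compliance subset: much lower volume, but kept for years.
audit_gb = retention_gb(events_per_sec=50, avg_event_bytes=300, days=5 * 365)

print(f"Operational, 4 weeks: {ops_gb:,.0f} GB")
print(f"Audit, 5 years:       {audit_gb:,.0f} GB")
```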

So… what does the architecture look like? Below is an example.

Log Insight - Overall Architecture

The resulting architecture has a lot of Log Insight instances. This is where vRealize Operations comes in. You can create a custom group for all your Log Insight VMs, and then manage and monitor them as a group.

I hope you like the above architecture. The next question is: how do you test it? Below is a sample setup you can use in the lab to validate the architecture.

Lab setup

Hope you found it useful. Here is a great write-up by Manny Sidhu explaining his experience. The customer says it best!

VMware vCenter Support Assistant 6.0

There are many articles explaining this useful little product, so I will just focus on items that I was not able to find elsewhere. I’d recommend you read this great post by Vladan.

Installation is pretty straightforward as it is a virtual appliance. You know you’ve got the deployment right when the console screen looks like the one below. I was expecting it to show the FQDN instead of the IP.

11

The above should have stated that you log in with root. This useful KB Article has a small mistake: the password is not specified during install. Rather, it is hard-coded to vmware. I learned this from a great post by Chris Wahl here.

Once you login, you will see the following screen:

12

Enter your vSphere 6 Platform Services Controller address. It will automatically append https, the port number and the rest of the URL. You just need to type the IP or FQDN.

13

Once it finds it, you need to register. I use the usual administrator@vsphere.local.

14

Click Finish in the above screen. You will see the screen below. I created a separate service account so I can tell when it is Support Assistant that logs in. There is a minor limitation here: it shows only 1 vCenter. There are actually 2 vCenter Servers in the lab, sharing the same PSC. I am not able to add the second one manually, as it says it cannot find it.

15

Click the Next button. The rest is pretty straightforward, and Chris Wahl has explained it here. As Chris said, the rest of the configuration is done in the vSphere Web Client. This is what mine looks like after I configured it. Remember I wrote that it was a minor issue that it did not show my second vCenter? That’s because it actually does see it. As the screenshot below shows, it automatically recognises both.

16

Besides sending the email notification, it actually integrates into your vSphere Web Client and puts all the issues it finds there. To see them, go to Monitor –> Issues. There is a new category under Triggered Alarms.

17

You can see the result of the scheduled collection in the Monitor tab. I scheduled mine and the first run was successfully completed, as you can see below.

21

VMware SDDC Architecture: sample for 500 – 2000 VM

If you are to architect a virtual infrastructure for 500 VM, what will it look like? The minor details will certainly differ from one implementation to another; however, the major building blocks will be similar. The same goes for, say, a 2000 VM-class environment. Given the rapid improvement in hardware price/performance, I categorise 500 – 2500 VM as a medium SDDC. 2500 VM used to be a large farm, occupying rows of racks in a decent-size data center. I think in 2016, it will be merely 1-3 racks, inclusive of network and storage! What used to be the whole data center has become the size of a small server room! Yes, a few experts are all it takes to manage 2500 VM.

Before you proceed, I recommend that you read the official VMware Validated Design. I've taken ideas and solutions from there and added my 2 cents. You should also review EVO SDDC, especially this interview with Raj Yavatkar, VMware Fellow and Lead Development Architect of EVO SDDC.

I’m privileged to serve customers with >50K server VMs, and even more desktop VMs. For customers with >10K VM per physical data center, I see a 2015-Q4 pod consisting of 2 racks and housing 2000 – 3000 VM. While data center power supply is reliable, most customers take into account rack failure or rack maintenance, so the minimum size is 2 physical racks. The pod is complete and can stand on its own. It has server, storage, network and management. It is 1 logical unit, and managed as 1. Pods are patched, updated, secured and upgraded as 1. A pod may not have its own vCenter, as you do not want too many vCenters; that increases complexity. Because of this operations and management challenge, a pod has only 1 hypervisor. You either go with a VMware SDDC Pod, or a pod from another vendor. If you are going with multiple hypervisors, then you will create 2 independent pods. Each pod will host a similar number of VMs. If your decision to go with multiple hypervisors is because you think they are a commodity, read this blog.

I understand that most customers have <2500 VM. In fact, in my region, if you have >1000 VM, you are considered large. So what does a pod look like when you don’t even have enough workload for half a pod? You don’t have economies of scale, so how do you optimize a smaller infrastructure?

Architecture is far from trivial, so this will be a series of blogs.

  1. Part 1 (this blog) will set the stage, cover the Requirements, provide the Overall Architecture, and give a summary.
  2. Part 2 covers Network Architecture
  3. Part 3 will cover Storage Architecture (coming after VMworld, as I need some details from Tintri!)
  4. Part 4 covers the Rack Design.
  5. Part 5 explains the design considerations I had when thinking through the solution. This is actually critical. If you have a different consideration, you will end up with a different design.
  6. Part 6 covers the methodology I used.

Requirements

To enable discussion, we need an example. The following table scopes the size and requirements:

1

The SDDC will have to cater for both server VMs and desktop VMs. I call them VSI and VDI, respectively. To save cost, they share the same management cluster. They have their own vCenters, as this allows Horizon to be upgraded independently of the server farm.

It has to have DR capability. I’m using VMware SRM and vSphere Replication. It has to support active-active applications too. I consider VDI an application, and in this architecture it is active/active, hence DR is irrelevant for it. Because I have Active-Active at the application layer already, I do not see a need to cater for Disaster Avoidance.

The server farm is further broken into Service Tiers. It is common to see multiple tiers, so I’m using 3. Gold is the highest and best tier. Service Tier is critical, and you can see this for more details. The 3 tiers of service are defined below:

2

VMs are also split into different environments. For simplicity, I’d group them into Production and Non-Production. To make it easier to comply with audit and regulation, I am not mixing Production and Non-Production in the same vSphere cluster. A separate cluster also allows you to test SDDC changes in a less critical environment. The problem is that the nasty issues typically happen in Production. Just because it has been running smoothly in Non-Production for years does not mean it will in Production.

For the VDI, I simply go with 500 VM per cluster, as 500 is an easy number to remember. In a specific customer environment, I normally refine the approach and use 10 ESXi hosts per cluster, as the number of VDI VMs varies depending on the user profile.
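Here is a quick sanity check on that rule of thumb; the only assumption I'm adding is that one host may be down for maintenance or HA.

```python
# 500 VDI VMs spread over a 10-host cluster (figures from the text above).
vdi_vms_per_cluster = 500
hosts_per_cluster = 10

print(vdi_vms_per_cluster / hosts_per_cluster)        # 50 VDI VMs per host
print(vdi_vms_per_cluster / (hosts_per_cluster - 1))  # ~56 per host with 1 host down
```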

SDN is a key component of the SDDC. NSX best practice is to have a dedicated cluster for Edge. I’m including a small Edge cluster in each physical data center.

VMware best practice also recommends a separate cluster for management. I am extending this concept and calling it the IT + Management Cluster. It is not only for management, which is out of band. It is also for core or shared services that IT provides to the business. These services are typically in-band.

Overall Architecture

Based on all of the above requirements, scope, and considerations, here is what I’d propose:

It has 2 physical data centers, as I have to cater for DR and active-active applications. It has 2 vCenter Servers for the same reason. Horizon has its own vCenter for flexibility and simplicity.

The physical Data Center 1 serves the bulk of the production workload. I’m not keen on splitting Production equally between the 2 data centers, as you would need a lot of WAN bandwidth. As you can see in the diagram, I’m also providing active/active capability. The majority of traffic is East – West. I’ve seen customers who have big pipes and yet encounter latency issues even though the link is not saturated.

The other reason is to force the separation between Production and Non-Production. Migration to Production should be controlled. If they are in the same physical DC, it can be tempting to shortcut the process.

  1. Tier 1. I further split the Gold tier into 2. This is to enable mission-critical applications to have long-distance vMotion. Out of 100 VMs, I allocate 25% for active/active applications and Disaster Avoidance (DA).
  2. Tier 2. I split it across the 2 physical data centers, as I need to meet the DR requirements. Unlike Tier 1, I no longer provide DA and active/active applications.
  3. Tier 3. I only make this available for Non-Production. An environment with just 300 Production VMs is too small to have >2 tiers. In this example, I am actually providing 2+ tiers, as the Gold Tier has the option for active/active applications and DA.

My sizing is based on the simple model of consolidation ratios. To me, these are just guidance. For proper sizing, review this link. You may wanna get yourself a good cup of coffee, as that’s a 5-part series.
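As a worked example of the consolidation-ratio model, here is a minimal sizing sketch. The tier names beyond Gold, the VM counts and the VMs-per-host ratios are illustrative assumptions, not figures from the requirements table; the 4 – 12 node clamp matches the cluster-size range I use later.

```python
# Minimal consolidation-ratio sizing sketch; all ratios below are assumptions.
import math


def hosts_needed(vm_count, vms_per_host, ha_spare=1, min_hosts=4, max_hosts=12):
    """Hosts for one cluster: consolidation ratio plus 1 host of HA spare,
    clamped to a 4 - 12 node cluster size range."""
    hosts = math.ceil(vm_count / vms_per_host) + ha_spare
    return max(min_hosts, min(hosts, max_hosts))


tiers = {  # tier: (VM count, VMs per host) - hypothetical figures
    "Gold": (100, 15),
    "Silver": (200, 25),
    "Bronze": (300, 40),
}

for tier, (vms, ratio) in tiers.items():
    print(f"{tier}: {hosts_needed(vms, ratio)} hosts")
```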

Let’s now add the remaining components to make the diagram a bit more complete. Here is what it looks like after I add Management, DR and ESXi.

500 - 2

DR with SRM. We use SRM to fail over Tier 1 into Tier 3. During DR, we will freeze the Tier 3 VMs, so Tier 1 VMs can run with no performance impact. I’ve made the cluster sizes identical to ensure no performance loss. Tier 2 will fail over to Tier 2, so there is a 50% performance loss. I’m drawing the arrow one-way for both; in reality you can fail over in either direction.

ESXi Sizing. I have 2 sizes: 2-socket and single-socket. The bigger square is 2 sockets, and the smaller one is 1 socket. Please review this for why I use single socket. I’m trying to keep the cluster size at 4 – 12 nodes, and I try not to have too many sizes. As you can see, I do have some small clusters, as there is simply not enough workload to justify more nodes.

We’ve completed the overall architecture for 500 server VMs and 1000 VDI. Can we scale this to 2000 server VMs and 5000 VDI, with almost no re-architecting? The answer is yes. Here is the architecture. Notice how similar it is. This is why I wrote in the beginning that “the major building blocks will be similar”. In this case, I’ve shown that they are in fact identical. My little girl told me as I went back and forth between the 2 diagrams…. “Daddy, why are you drawing the same thing two times?” 🙂

5000 - 1

The only changes above are the ESXi sizes and the cluster sizes. For example:

  • For the VDI, I have 5 clusters per site.
  • For the Tier 1 server VMs, I have 3 clusters. Each has 8 ESXi hosts. I keep all of them at 8 to make it simpler.
  • For the Tier 3 server VMs, I have 2 clusters. Each has 12 ESXi hosts. The total is 24 hosts, so it’s enough to run all of Tier 1 during a DC-wide failover, as the quick check below shows.
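A quick check of that failover maths, using only the cluster counts from the bullets above and assuming identical host sizes:

```python
# Can the frozen Tier 3 clusters absorb all of Tier 1 during a DC-wide failover?
tier1_hosts = 3 * 8   # 3 clusters x 8 hosts
tier3_hosts = 2 * 12  # 2 clusters x 12 hosts

assert tier3_hosts >= tier1_hosts, "Tier 3 cannot absorb Tier 1 during DR"
print(f"Tier 1 needs {tier1_hosts} hosts; Tier 3 offers {tier3_hosts} during DR.")
```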

By now, you have likely noticed that I have omitted 2 large components of the SDDC Architecture. Can you guess?

Yup, it’s Storage and Network.

I will touch on Storage here, and cover Network in a separate blog. I’m simplifying the diagram so we can focus on the storage subsystem:

5000 - 2

I’m using 2 types of storage, although we could very well use VSAN all the way. I use VSAN for the VDI and IaaS clusters (Management and Network Edge), and a classic array for the Server clusters (Tier 1, 2, and 3).

I’ve added the vSphere integrations that storage arrays typically have. All these integrations need specific firmware levels, and they also impact the way you architect, size and configure the array. vSphere is not simply a workload that needs a bunch of LUNs.

I’ve never seen an IT environment where the ground team is not stretched. The reality of IT support is that you are under-staffed, under-trained, lacking proper tools and bogged down by process and politics. There are often more managers than individual contributors.

As you can see from this article, the whole thing becomes very complex. Making the architecture simple pays back in operations. It is indeed not a simple matter. This is why I believe the hypervisor is not a commodity at all. It is your very data center. If you think adding Hyper-V is a simple thing, I suggest you review this. That’s written by someone with actual production experience, not a consultant who leaves after the project is over.

As Architects, we all know that it is one thing to build and another to operate. The above architecture requires very different operations than a classic, physical DC architecture. An SDDC is not a physical DC, virtualised. It needs a special team, led by the SDDC Architect.

In the above architecture, I see adding a second hypervisor as “penny wise, pound foolish”. If you think that results in vendor lock-in, kindly review this and share your analysis.

Limitations

  • Not able to do Disaster Avoidance. The main reason is that I think it increases cost and complexity with minimal additional benefit. Critical applications are already protected with Active/Active at the application layer, making DR and DA redundant. For the rest, there is already SRM.

BTW, if you want the editable diagram, you can get it here. Happy architecting! In the next post, I’ll cover the Network architecture.