Category Archives: People

Covers topics such as career, soft skills, and community.

SDDC Operations Dashboards for SMB environment

This post continues from the Operationalize Your World post. Do read it first so you get the context.

The SMB segment is a world of its own. There are things that are mandatory in the Enterprise segment but not relevant in the SMB segment. As a result, products should be tailored for that market segment.

IMHO, there are actually 4 different market segments when it comes to SDDC Operations. I use the number of VMs as the marker for each segment. Each of the following segments requires different dashboards and reports:

  1. 100 VM
  2. 1000 VM
  3. 10000 VM
  4. 100000 VM

Now, it would be difficult to create a product with 4 sets of vROps dashboards and reports. As a compromise, I use the following 3 segments instead:

  1. 400 VM: SMB market
  2. 4000 VM: Enterprise market
  3. 40000 VM: <give me a name here folks> market

I hope the above is acceptable. As each segment covers a very wide range, I'd take the following reference points:

  1. SMB market: 250 VM
  2. Enterprise market: 2500 VM
  3. Huge Private Cloud: 25000 VM

Let's dive into the 250 VM segment. What are the unique characteristics?

  • 1-2 guys doing everything. No silos in the team. You and your best friends take care of the whole darn IT.
  • You only have a few clusters. Each cluster only has a few ESXi hosts.
  • You know your environment very well because it's small. It all fits into 1 rack. The architecture is simple. You have a mental picture of it in your head.
  • You don't buy hardware or VMware every quarter. Likely it's every 2 years. Capacity planning and monitoring are simple.
  • The workload is quite stable. You are not adding, removing, or changing VMs every day.
  • Service Tier is overkill, as you only have 1-2 clusters for all workloads.

Which of the above points apply to a large environment?

You are right. None.

As a result, SMB needs purpose-built dashboards. They cover the following:

  1. Availability
  2. Performance
  3. Capacity
  4. Reclaimable Capacity
  5. Compliance
  6. When a VM Owner complains

Home

Your main dashboard. It’s the first dashboard you check, likely on a daily basis as part of your cadence. It answers the no 1 question: is everything healthy?

This is what it looks like in vROps 6.3. I've added explanations so you can easily see that it's layered into 4 areas.

[Screenshot: Home dashboard]

Availability

The first element of Health is Availability. If a VM or ESXi is down, there is no need to talk about performance or capacity as the damn thing is dead 🙂

The Availability dashboard gives you detailed information. You can answer questions such as “When did it go down? For how long?”

[Screenshot: Availability dashboard]

The dashboard is also useful when you need to report uptime. You do need to create a report and customize it though. If you need it, email me your requirement.

Performance

Just because something is up does not mean it's fast. The Performance dashboard provides this information. It sports the new concept of Performance, which you can review here. It does not apply a formal SLA, as that's not applicable in SMB. Even without an SLA, you can use it to prove your innocence or to justify a new hardware purchase.

Line charts are used because the performance problem might have started earlier, or it may no longer be happening and you're doing a root cause analysis.

If the performance issue is caused by a villain VM, the dashboard lets you find that VM. Change the time range in the Top-N widget to the period when the performance problem occurred.

BTW, if you like the ability to find out which VM was causing the problem, send your thank you to Matthew Hurley.

Capacity

Generally speaking, a performance problem happens because demand is not being met by supply. The Capacity dashboard gives detailed information on the supply side. As there are only a few clusters, capacity management is much simpler.

[Screenshot: Capacity dashboard]

Notice it takes into account performance.

If you mix Prod and Non Prod, capacity management becomes harder. Since the hardware is shared, we need to monitor at the overall cluster level. Since the Production VMs have a more stringent SLA, naturally their numbers reflect that. As a result, we need to show Prod and Non Prod differently. Let me know if you need it, as to me that complicates operations. This is another reason why I advocate separate clusters for Prod and Non Prod.

One common issue in a virtual environment is VM sprawl. Some of these VMs end up not being used. You can reclaim CPU, RAM and Disk from these VMs.

  • The easiest to reclaim from is orphaned VMs, as they are not even registered in vCenter.
  • The second easiest is snapshots. You should only keep a snapshot for 1 day or less.

Once the above is reclaimed, you need to look at Powered Off VMs and Idle VMs.

  • CPU and RAM are reclaimed from running VMs, as powered off VMs are no longer consuming these resources.
  • CPU: reclaim from large VMs (e.g. 8 vCPU or more). Avoid reclaiming from 2 vCPU VMs until you've completed the large VMs.
  • RAM: reclaim from large VMs (e.g. 16 GB RAM or more) that have Guest OS metrics. These are more accurate than hypervisor metrics.
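The CPU and RAM rules above can be sketched as a simple filter over a VM inventory. This is only an illustration: the `VM` record, its field names, and the two helper functions are hypothetical and are not part of vROps or any VMware API. The thresholds (8 vCPU, 16 GB) are the examples given above.

```python
from dataclasses import dataclass

@dataclass
class VM:
    # Hypothetical inventory record, for illustration only.
    name: str
    vcpu: int
    ram_gb: int
    powered_on: bool
    has_guest_metrics: bool

def cpu_reclaim_candidates(vms, min_vcpu=8):
    """Running VMs with min_vcpu or more, largest first."""
    picks = [v for v in vms if v.powered_on and v.vcpu >= min_vcpu]
    return sorted(picks, key=lambda v: v.vcpu, reverse=True)

def ram_reclaim_candidates(vms, min_ram_gb=16):
    """Running VMs with min_ram_gb or more that expose Guest OS metrics."""
    picks = [v for v in vms
             if v.powered_on and v.ram_gb >= min_ram_gb and v.has_guest_metrics]
    return sorted(picks, key=lambda v: v.ram_gb, reverse=True)

inventory = [
    VM("db01", vcpu=16, ram_gb=64, powered_on=True, has_guest_metrics=True),
    VM("app01", vcpu=8, ram_gb=32, powered_on=True, has_guest_metrics=False),
    VM("web01", vcpu=2, ram_gb=4, powered_on=True, has_guest_metrics=True),
    VM("old01", vcpu=8, ram_gb=16, powered_on=False, has_guest_metrics=False),
]

cpu_names = [v.name for v in cpu_reclaim_candidates(inventory)]
ram_names = [v.name for v in ram_reclaim_candidates(inventory)]
```

Powered off VMs are excluded on purpose, since they hold no CPU or RAM to give back. Their disk is a separate matter.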

The Reclaimable dashboard lists all the VMs that have been idle or powered off. It also lists the orphaned VMs and large snapshots.

[Screenshot: Reclaimable dashboard]

Configuration

If you apply the vSphere hardening guide, and your infra and VMs comply with it, you will see all green in the dashboard below. If not, you can see exactly which VM or infra object is not complying. You can customize the default threshold, although it's better that you customize the symptoms and alerts instead.

You can see compliance for Network and vCenter too, under the vSphere Compliance widget. There is a drop-down there that is not shown.

IaaS

Last but not least, your job is actually about making sure the VM is being served well. It's a service. Your customers don't care about your infrastructure. So when they complain that their VM has a problem, you need a dashboard that quickly proves whether the problem is at your end or their end. TTI is not Time to Investigate, but Time to Innocence 😉

The Troubleshoot a VM dashboard is built exactly for that!

[Screenshot: Troubleshoot a VM dashboard]

This dashboard is quite long, as it lets you check the underlying ESXi host and datastore. You can collapse the widgets, as shown below, to see more.

[Screenshot: Troubleshoot a VM dashboard with widgets collapsed]

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

VMware CTO Ambassadors in Asia Pacific

Quoting from the official page at VMware.com, “The CTO Ambassador program is run by the VMware Office of the CTO. The CTO Ambassadors are members of a small group of our most experienced and talented customer facing, individual contributor technologists. They are pre-sales systems engineers (SEs), technical account managers (TAMs), professional services consultants, architects and global support services engineers. The ambassadors help to ensure a tight collaboration between R&D and our customers so that we can address current customer issues and future needs as effectively as possible.”

For more info, see here

With that, here are your Ambassadors for the Asia Pacific region for 2016. We form a small community of only 28 people spread across Asia Pacific, and we work closely together and with R&D. Reach out to them via LinkedIn, Twitter, or their personal blogs. If you need their email, you can ask your local VMware representative.

| Full Name | Location | Blog & Twitter | Roles |
|---|---|---|---|
| Travis Wood | Australia (Brisbane) | | EUC |
| Michael Francis | Australia (Brisbane) | | PS Engineering |
| Daniel King | Australia (Brisbane) | | PS Engineering |
| Paul James | Australia (Brisbane) | @prj32 | TAM |
| Greg Mulholland | Australia (Melbourne) | @g_mulholland | SE (Storage) |
| Roman Tarnavski | Australia (Sydney) | blog.romant.net, @romant | SE (Cloud Native) |
| Chengkai “CK” Kong | China (Shanghai) | @ckkong_sh | SE |
| Sanjaya Kanungo | India (Bangalore) | vmpower.blogspot.in | SE (Global Alliance) |
| Sriram Rajendran | India (Bangalore) | | GSS |
| Sunny Dua | Singapore | vXpresss.blogspot.com, @Sunny_Dua | PSO |
| Toru Kaneko | Japan (Tokyo) | vmware.10rukaneko.net, @10rukaneko | SE (vCloud) |
| Yasunari Saito | Japan (Tokyo) | | SE |
| Iwan 'e1' Rahabok | Singapore | virtual-red-dot.info, @e1_ang | SE |
| Tessa Davis | Singapore | | SE |
| Alex Zhao | China | | PSO |
| Cedric Rajendran | India (Bangalore) | virtualknightz.com | GSS |
| Chris Slater | Australia | | PSO |
| David Head | Australia | | EUC |
| David Wakeman | Australia | | EUC |
| Grant Orchard | Australia | grantorchard.com | SE |
| James McAfee | China (Hong Kong) | | SE (Advisory) |
| Joshua Lambert | Australia (Sydney) | | PSO |
| Junnosuke Nakajima | Japan | | SE |
| Kapil Kasetwar | India | | SE (Global Account) |
| Raminder Singh | India | | TAM |
| Rick Chen | Taiwan (Taipei) | | SE (Network) |

Yes, this post supersedes the 2015 CTO Ambassadors list.

1000 VM per rack is the new minimum

The purpose of the above article is to drive home the point that you need to look at the entire SDDC, and not just a component (e.g. Compute, Storage, Network, Security, Management). Once you look at the whole SDDC infrastructure in its entirety, you may be surprised that everything fits into just 1-2 racks!

The purpose is not to say that you must achieve 1000 VM per rack. It is also possible that you can't even achieve 100 VM per rack (for example, you are running all Monster VMs). I'm just using a “visual” so it's easier for you to see that there is a lot of inefficiency in a typical data center.

If your entire data center shrinks into just 1 rack, what happens to the IT Organisation? You are right, it will have to shrink also.

  • You may no longer need 3 separate teams (Architect, Implement, Operate).
  • You may no longer need silos (Network, Server, Storage, Security).
  • You may no longer need the layers (Admin, Manager, Director, Head).

With fewer people, there is less politics and the whole team becomes more agile.

The above is not just my personal opinion. Ivan Pepelnjak, a networking authority, shared back in October 2014 that “2000 VMs can easily fit onto 40 servers”. I recommend you review his calculation in his blog article. I agree with Ivan that “All you need are two top-of-rack switches” for your entire data center. Being a networking authority, he elaborates from the networking angle. I'd like to complement it from the server angle.

Let’s take a quick calculation to see how many VMs we can place in a standard 42 RU rack. I’d use Server VM, not Desktop VM, as they demand higher load.

I'd use a 2RU, 4 ESXi Host form factor, as this is a popular form factor. You can find an example at the SuperMicro site. Each ESXi host has 2 Intel Xeon sockets and all-flash local SSD running distributed virtual storage. With Intel Xeon E5-2699 v3, each ESXi host has 36 physical cores. Add a 25% benefit from Intel Hyper-Threading, and you can support ~30 VMs with 2-3 vCPU each, as there are enough physical cores to schedule the VMs.

The above takes into account that a few cores are needed for

  • VMkernel
  • NSX
  • VSAN
  • vSphere Replication
  • NSX services from partners, which take the form of VM instead of kernel module.
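As a rough check on the CPU sizing above, here is a back-of-envelope sketch. The 2.5 vCPU average is my assumption (the midpoint of the 2-3 vCPU range), and the function name is made up for this example. It does not subtract the cores reserved for the items listed above, since no figures are given for them.

```python
def host_cpu_budget(sockets=2, cores_per_socket=18, ht_uplift=0.25,
                    vms_per_host=30, avg_vcpu_per_vm=2.5):
    """Back-of-envelope CPU budget for one ESXi host.

    avg_vcpu_per_vm=2.5 is an assumed midpoint of the 2-3 vCPU range.
    Cores reserved for VMkernel, NSX, VSAN, etc. are NOT subtracted here.
    """
    physical_cores = sockets * cores_per_socket          # E5-2699 v3: 18 per socket
    effective_cores = physical_cores * (1 + ht_uplift)   # add Hyper-Threading benefit
    total_vcpu = vms_per_host * avg_vcpu_per_vm
    return {
        "physical_cores": physical_cores,
        "effective_cores": effective_cores,
        "total_vcpu": total_vcpu,
        "vcpu_per_effective_core": total_vcpu / effective_cores,
    }

budget = host_cpu_budget()
```

Under these assumptions the host runs at well under 2 vCPU per effective core. Plug in your own VM sizes to see where your environment lands.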

30 VMs for each ESXi host. That's a 30:1 consolidation ratio, which is a reality today. You have 4 ESXi hosts in a 2RU form factor. That means 30 x 4 = 120 VMs fit into 2 RU of space. Let's assume you standardise on an 8-node cluster, and you do N+1 for HA. That means a cluster with HA will house 7 ESXi x 30 VM = 210 VMs. Each cluster only occupies 4 RU, and it comes with shared storage.

To hit ~1500 VMs, you just need 7 clusters. In terms of rack space, that’s just 7 x 4 RU = 28 RU.


A standard rack has 42 RU. You still have 42 - 28 = 14 RU left. That's plenty of space for Networking, Internet connection, KVM, UPS, and Backup!
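The rack arithmetic above can be written as a small sketch, so you can plug in your own numbers. The function and parameter names are made up for this example. The defaults are the figures used in this post.

```python
def rack_plan(clusters=7, nodes_per_cluster=8, ha_spare=1, vms_per_host=30,
              hosts_per_chassis=4, chassis_ru=2, rack_ru=42):
    """Estimate VM count and rack space for a set of identical clusters."""
    # Ceiling division: chassis needed to hold all nodes of one cluster.
    chassis_per_cluster = -(-nodes_per_cluster // hosts_per_chassis)
    cluster_ru = chassis_per_cluster * chassis_ru
    # N+1 HA: one host per cluster is kept free for failover.
    vms_per_cluster = (nodes_per_cluster - ha_spare) * vms_per_host
    compute_ru = clusters * cluster_ru
    return {
        "vms_per_cluster": vms_per_cluster,
        "total_vms": clusters * vms_per_cluster,
        "compute_ru": compute_ru,
        "free_ru": rack_ru - compute_ru,
    }

plan = rack_plan()
```

With the defaults, 7 clusters give 1470 VMs, which is the "~1500" figure, in 28 RU. That leaves 14 RU for everything else.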

Networking will only take 2 x 2 RU. You can get 96 ports per 2 RU. Arista has models you can choose here. Yes, there is no need for spine-leaf architecture. That simplifies networking a lot.

KVM will only take 1 RU. With iLO, some customers do not use KVM at all, as KVM encourages physical presence in the data center.

If you still need a physical firewall, there is space for it.

If you prefer external storage, you can easily put 1400 VMs onto a 2RU all-flash storage array. Tintri has an example here.

I’ve provided a sample rack design in this blog.

What do you think? How many racks do you still use to handle 1000 VM?

Updates

  • [7 Nov 2015: Tom Carter spotted an area I overlooked. I forgot to take into account the power requirements! He was rightly disappointed, and this is certainly disappointing for me too, as I used to sell big boxes like Sun Fire 15K and HDS 9990! On big boxes like these, I had to ensure that the customer's data center had the correct CEE-form connectors. Beyond just the amperage, you need to know whether the supply is single-phase or three-phase. So Tom, thank you for the correction! Tom provided his calculation in Ivan's blog, so please review it]
  • [15 Nov 2015: Greg Ferro shared in his article that 1000 VM is certainly achievable. I agree with him that it's a consideration. It's not a goal nor a limit. It all depends on your application and situation]
  • [27 Mar 2016: Intel Xeon E5-2699 v4 is delivering 22 cores per socket, up from 18 cores in v3]