Monthly Archives: July 2018

A new Adventure!

I’ve joined the Product Management team as a PM. As members of a small team (Sunny and a few others), we are given the privilege to plan and drive vR Ops to the next level.

Why did I change jobs?

As human beings, there are 3 levels of what we do: Job, Career and Calling. Out of 7 billion people, most of us have a job. The luckier ones have a career. A few have a calling. IMHO, a calling is when you have a balance among the 3M (Money, Meaning, Merry) of what you do. A calling is not perfect, as it’s a trade-off among the 3 corners of a triangle. Ideally, the triangle is as small as possible, so you’re close to all 3. To read more, follow this.

Some folks asked how I could get a Product Manager job out of Singapore, since there is no R&D, QA, UX, Tech Marketing, Product Marketing, and Management here.

If you want to know, here is a short story.

This job took me >5 years to vrealize. I’ve been doing vR Ops since 1.0, when it was released back in early 2011. I was one of the first to get trained in Asia Pacific. I remember when David Lavigna trained us in Sydney. I saw how I could apply super metrics and custom dashboards to help customers monitor and troubleshoot. Instead of spending a lot of time with vCenter performance tabs, I could simply slice and dice the whole environment.

By 2014, I’d already spent a few years on the product. Customers taught me the things they need to monitor or troubleshoot. It’s amazing how much you can learn in a production environment vs a lab. Real problems, real people. I compiled these lessons learned, gave them a structure, and published my first book in Dec 2014.

In 2014, VMware elected me as a member of the CTO Ambassador program. In my 1+ decade at VMware, this has been the best “training” program. It opens doors. It gave me trips to Palo Alto and RADIO, where I could develop the relationship with R&D.

Sunny and I brought the material to the world at VMworld 2015. We did 2 sessions, with ~600 attendees. The feedback told me we were on the right path. That was the turning point to start packaging the dashboards into an integrated suite.

I continued enhancing the material, and published a second edition of my book in March 2016. The Product Management team, who had been super supportive of my work, invited me to Palo Alto, VMware HQ in Silicon Valley. They paid for my first Take 2, and I spent 2 weeks with the R&D team in March 2016.

Kenon took the material and turned it into a program in June 2016. He called it Operationalize Your World. He worked with all the regions in Asia Pacific. Both of us traveled heavily and met many customers and partners. He secured the travel funding and worked with the local teams to get the events going. I have been averaging 150 – 200 days a year since then.

VMworld 2016 was another success. I met even more customers, who convinced me that there is a big market for me to focus on. Post VMworld, the Product Management team decided to bring part of Operationalize Your World on board. vRealize Operations 6.4 was the first release where we replaced the bulk of the existing dashboards. It was released in Nov 2016, and the feedback was very positive. Since then I have been privileged to get involved with each release, giving feedback as basically Customer[0].

By this time, Sunny had moved to Palo Alto. That changed a lot of things for me, and I benefited from that close partnership. In life, Sunny gave me an experience that 1+1=3. Each of us will have a solution, and after some fight, we end up with a 3rd and better solution.

In June 2017, I was given the chance to spend 2 weeks with R&D. It gave me the chance to meet more developers. Their eagerness to make the products better, and most importantly how they treated me like a member of the family, convinced me that this is where I wanted to focus. Since then, I’ve been back 2 more times, for a total of 4 weeks. All were kindly paid for by CMBU. Yes, they really treated me like a member of the team.

At VMworld 2017, Product Marketing got me to speak at both the US and Europe events. That was my first time meeting EMEA customers. I was glad to know Operationalize Your World was resonating. In fact, it resonated better than in the US.

I got a chance to participate in the 6.5, 6.6 and 6.7 releases. My main focus was on the ability to customise. If you compare 6.7 vs 6.4, you will notice it’s easier to work with the widgets. They have better controls, and look more pleasing too. You also have a lot fewer metrics, hence it’s easier to know what to pick. We also added a lot of properties.

In March 2018, R&D invited me to do a Take 3. It is a 3-month secondment where I was part of the Product Management team. Upon completion of the Take 3, they worked with my CS management to transfer me. I’m grateful to my management, who gave their blessings and did the transfer with my interest at heart.

Throughout all these years, customer and partner feedback has been clear: vR Ops and Log Insight are useful to them, and they want to use vRealize even more. At the end of the day, it is this assurance from them that made me jump into the PM role. I’m blessed to have met probably a thousand customers since 1.0 in 2011. Collectively, they educate me, using their production environments as real examples. Their feedback shapes my thinking, and gives me clear guidance on where we should take the products.

vR Ops Super Metrics

Super Metric is a feature that I use heavily. When you download Operationalize Your World, you get >70 of them. If you compare vR Ops 6.3 and 6.7, you will notice we’ve added new metrics which were originally super metrics.

They complement regular metrics, as shown by the following table:

I’ve never met 2 customers with identical requirements. 2 customers can adopt an identical architecture, but will always operate it differently. Operations are as unique as a fingerprint. Super Metrics enable that bespoke customisation.

I’ve used them for a few years by now. Here is how they have been useful to me:

As a constant

  • You simply enter the constant value. Yes, that’s all!

To convert units

  • You use the $This. Also, use metric= and not attribute= as you’re dealing with 1 object.

To get a summary from a group

  • How many VMs in a cluster are facing high CPU Ready? You can use the Count function and the Where clause to answer it. Apply this super metric to the cluster object.
  • Are any of the VMs facing high CPU Ready? If yes, how high? You can use Max (VM CPU Ready) and apply it at the World object.
  • I use the functions Count, Min, Max and Average. If needed, I use the Where clause to filter the selection.
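
To make the Count, Where and Max ideas above concrete, here is a minimal Python sketch. It is not super metric syntax; it only models the aggregation. The VM names, the CPU Ready values and the 2.5% threshold are all made up for illustration.

```python
# Hypothetical CPU Ready values (%) for the VMs in one cluster.
cpu_ready_by_vm = {"vm-01": 0.4, "vm-02": 3.1, "vm-03": 0.9, "vm-04": 5.6}

# Assumed threshold for "high" CPU Ready. It is a constant, as the Where clause requires.
HIGH_CPU_READY = 2.5


def count_where(values, threshold):
    """Count how many child objects exceed a constant threshold (Count + Where)."""
    return sum(1 for v in values if v > threshold)


# "How many VMs are facing high CPU Ready?" (applied at the cluster level)
high_count = count_where(cpu_ready_by_vm.values(), HIGH_CPU_READY)

# "Is any VM facing high CPU Ready, and how high?" (Max over all VMs)
worst = max(cpu_ready_by_vm.values())

print(high_count, worst)
```

The count tells you how widespread the problem is, while the max tells you how bad the worst case is. You typically want both.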

To answer higher level question

  • A higher-level question resonates more with senior management, as they care less about low-level technical details. I build super metric on top of super metric, until the highest level meets my need. Layering super metrics also enables me to drill down when I see the higher-level number giving a warning.
  • What’s the uptime of Tier 1 VMs? The answer you expect is certainly 100%. The mathematics to answer that seemingly simple question is actually very complex. Super metric enables me to implement the maths, as you can see here.

To provide a common metric that depends on the policy.

  • In Operationalize Your World, you get Performance SLA. Notice the line automatically adjusts depending on the class of service. You know that the SLA line is a metric. If you have 3 classes of service (Gold, Silver, Bronze), you have 3 different metrics, 1 for each policy. So how does it know which SLA to display in the widget? I use the Max function, knowing that a VM can only have 1 SLA.
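
The Max trick can be modelled with a few lines of Python. This is only a sketch of the reasoning, not super metric syntax. The SLA values are invented, and it assumes that the SLA metrics of the other policies simply have no value on a given VM.

```python
# Hypothetical SLA thresholds per class of service (illustrative values only).
SLA_BY_POLICY = {"Gold": 1.0, "Silver": 2.0, "Bronze": 4.0}


def sla_metrics_for(vm_policy):
    """Return the 3 per-policy SLA metrics as seen on one VM.

    Assumption: only the metric of the VM's own policy has a value;
    the other two are absent (modelled here as None).
    """
    return {p: (v if p == vm_policy else None) for p, v in SLA_BY_POLICY.items()}


def effective_sla(metrics):
    """Max over whichever SLA metrics exist. With exactly 1 present, Max returns it."""
    present = [v for v in metrics.values() if v is not None]
    return max(present) if present else None


print(effective_sla(sla_metrics_for("Silver")))
```

Because a VM can only have 1 SLA, the Max is taken over a single value, so the widget can point at 1 common metric regardless of the policy.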

Summary of Features

Here are the 5 features I used to implement all the super metrics in Operationalize Your World:

  • $This
  • Metric vs Attribute
  • Where clause
    • Useful in comparing against a value. A limitation here is the value has to be a constant. It can’t be another metric.
  • IF Statement
    • The VM KPI has 21 IF statements. I need to categorise each value into Green, Yellow, Orange, Red. That’s 3 IF statements, nested as 1. I need to consider 7 factors. Since each factor has 3 IF statements, the total is 21.
  • Count, Max, Min, Sum
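
The arithmetic behind the 21 IF statements can be sketched in Python. The thresholds below are hypothetical, and the function only mirrors the shape of the nested IF (3 conditions for 4 bands). It is not the actual VM KPI super metric.

```python
def classify(value, yellow, orange, red):
    """Split a value into 4 bands using 3 nested conditionals (higher is worse)."""
    return (
        "Red" if value > red
        else "Orange" if value > orange
        else "Yellow" if value > yellow
        else "Green"
    )


IFS_PER_FACTOR = 3  # 4 bands need 3 conditions
FACTORS = 7         # number of factors in the VM KPI, as stated above
total_ifs = IFS_PER_FACTOR * FACTORS

# Hypothetical thresholds: Yellow above 1, Orange above 2, Red above 4.
print(classify(2.5, yellow=1, orange=2, red=4), total_ifs)
```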

I’m aware of the usability and functionality enhancements that super metrics need, and am working on them. As always, I welcome your feedback, as it’s valuable to have data points.

Hope you find super metrics as useful as I have.

Large Scale vSAN Monitoring

Large-scale VMware vSAN operations raise the need for easier and faster monitoring. With many large vSAN clusters, monitoring and troubleshooting become more challenging. To illustrate, let’s take a single vSAN cluster with the following setup:

Here are some of the questions you want to ask in day to day operations:

  • Is any ESXi host running high CPU utilization?
  • Is any ESXi host running high Memory utilization?
  • Is any NIC running high utilization?
    • With 4 NIC per ESXi, you have 40 TX + 40 RX metrics.
  • Is vSAN vmkernel network congested?
  • Is the Read Cache used?
  • Is the Write Buffer sufficient?
  • Is the Cache Tier performing fast?
    • Each disk has 4 metrics: Read Cache Read Latency, Read Cache Write Latency, Write Buffer Write Latency, Write Buffer Read Latency
    • Since there are 20 disks, you need to check 80 counters
  • Are the Capacity Disks performing fast?
    • Check both Read and Write latency.
    • Total 120 x 2 = 240 counters.
  • Is any Disk Group running low on space?
  • Is any Disk Group facing congestion?
    • You want to check both the max and the number of occurrences > 60.
  • Is there outstanding IO on any of the Disk Group?

If you add up the above, you are looking at 530 metrics for this vSAN cluster. And that’s just 1 point in time. In 1 month you’re looking at 530 x 8766 = 4.6+ million data points!
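
The arithmetic can be checked with a short Python sketch. The 8766 multiplier matches the number of samples in an average month if you assume 1 sample every 5 minutes; that interval is an assumption here, not something stated above.

```python
# Metric counts stated explicitly in the list above.
nic_metrics = 40 + 40        # TX + RX across the cluster
cache_tier_metrics = 20 * 4  # 20 cache disks x 4 latency metrics
capacity_metrics = 120 * 2   # 120 capacity disks x read and write latency

TOTAL_METRICS = 530          # total for the cluster, as stated in the post

# Assumption: 1 sample every 5 minutes, averaged over a 365.25-day year.
samples_per_hour = 60 // 5
samples_per_month = int(365.25 * 24 * samples_per_hour / 12)

data_points_per_month = TOTAL_METRICS * samples_per_month
print(samples_per_month, data_points_per_month)
```

The NIC, cache tier and capacity disk counters alone account for 400 of the 530 metrics, which is why collapsing them matters.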

How do you monitor millions of data so you can be proactive?

vRealize Operations 6.7 sports vSAN KPIs. We collapsed the metrics behind each of those questions, so you only have 12 metrics to check instead of 530, without losing any insight. In fact, you get a better early warning, as we hide the average. Early warning is critical, as buying hardware is more than a trip to the local DIY hardware store.

The KPIs achieve this simplification by using super metrics:

Using Min, Max, Count, it picks the early warning.
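
Here is a rough Python illustration of why Min, Max and Count give a better early warning than an average. The latency and free-space values and the 10 ms threshold are invented. The real KPIs are super metrics, not Python.

```python
from statistics import mean

# Hypothetical write buffer latencies (ms) for 20 cache disks; 1 disk is struggling.
latencies_ms = [1.0] * 19 + [40.0]
THRESHOLD_MS = 10.0  # assumed threshold for "slow"

average = mean(latencies_ms)   # looks healthy, the outlier is diluted
worst = max(latencies_ms)      # Max surfaces the struggling disk
slow_disks = sum(1 for v in latencies_ms if v > THRESHOLD_MS)  # Count shows the spread

# For "lower is worse" metrics such as free space, Min plays the same role.
free_space_pct = [35.0, 42.0, 8.0, 51.0]  # hypothetical disk groups
lowest_free = min(free_space_pct)

print(average, worst, slow_disks, lowest_free)
```

Twenty latency counters collapse into 2 numbers, and the disk that needs attention is no longer hidden by the other 19.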

The KPI has been a hit with customers. But it falls short when you have many vSAN clusters. If you have say 25 hybrid clusters and 25 All Flash clusters, you need to check 50 clusters. While you can click 50x, what you want is to see all 50 at the same time.

This means we need to aggregate the metrics further. There should only be 1 and only 1 metric per cluster.

The challenge is that the KPIs have different units and scales. How do we normalize them into Green, Yellow, Orange and Red?

We do it by defining a normalization table. We need 1 table for All Flash and 1 for Hybrid, as they have different KPIs and thresholds. Here is the table for All Flash:

Read Cache Hit Rate (%) is missing from the above as it’s not applicable to All Flash, which does not have a dedicated Read Cache.

I’m setting CPU Ready and CPU Co-Stop at 1%, so we can catch early warnings. For RAM, as most ESXi hosts sport 512 GB RAM, I set the RAM Contention at 0%.

The metric that I’m not sure about is the Disk Group Congestion. It’s based on 60, which I think is a good starting point in general.

Here is the table for Hybrid:

Do you know why I do not have a Utilization counter (e.g. CPU Utilization) there?

Utilization does not impact performance. An ESXi host running at 99% is not slower than one running at 1%, so long as there is no contention or latency. This is vSAN KPI, not vSAN KUI (Key Utilization Indicators). Yes, vSAN KUI needs its own table.

Once you have the table, you can map it into thresholds. I use Green = 100, Yellow = 67, Orange = 33, Red = 0. I use a 0 – 100 scale so it’s easier to see the relative movement. If you don’t want it to be confused with %, you can use 0 – 10 or 0 – 50.

vSAN Performance is the average of all these. We are not taking the worst to prevent 1 value from keeping it red all the time. If you take the worst, the value will likely remain constant. That’s not good, as pattern is important in monitoring. The relative movement can be more important than the absolute value.
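
The normalization and averaging can be sketched in Python as follows. Only the 100 / 67 / 33 / 0 scores come from the text above. The KPI thresholds and the sample values are hypothetical and do not reproduce the actual All Flash or Hybrid tables.

```python
from statistics import mean

SCORE = {"Green": 100, "Yellow": 67, "Orange": 33, "Red": 0}

# Hypothetical normalization table: KPI -> (yellow, orange, red) limits, higher is worse.
TABLE = {
    "CPU Ready (%)": (1.0, 2.0, 4.0),
    "Disk Group Congestion": (20, 40, 60),
    "Capacity Disk Latency (ms)": (10, 20, 40),
}


def band(value, limits):
    """Map a raw KPI value to a colour band using its row in the table."""
    yellow, orange, red = limits
    if value > red:
        return "Red"
    if value > orange:
        return "Orange"
    if value > yellow:
        return "Yellow"
    return "Green"


# Hypothetical readings for 1 cluster at 1 point in time.
readings = {"CPU Ready (%)": 0.5, "Disk Group Congestion": 70, "Capacity Disk Latency (ms)": 12}

scores = [SCORE[band(v, TABLE[k])] for k, v in readings.items()]
performance = mean(scores)  # the average still moves as individual KPIs change
worst = min(scores)         # the worst would stay pinned at 0 while any KPI is red

print(round(performance, 1), worst)
```

With these made-up numbers, the average lands in the mid 50s and will move as the other KPIs improve or degrade, while the worst-case value sits at 0 as long as the congestion stays red.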

You implement the above using super metrics. You need 2 super metrics, 1 for Hybrid and 1 for All Flash. For simplicity, I’d not use Policy but rather apply both super metrics to all my vSAN clusters. I then use the correct metric when building the dashboard.

Hope you find it useful.