Author Archives: Iwan Rahabok

About Iwan Rahabok

A father of 2 little girls, my pride and joy. The youngest one quickly said "I'm the joy!"

From Local to Global

I made a career move from local role to a global role mid last year. It’s been almost a year, so I’d like to share the pro and con of both roles. Hopefully it helps you in your career, and keeping the fun in work!

It’s common for a global corporation to have 4 levels of geographic coverage:

  • Local: cover a city or a small country. In my case, it’s just Singapore.
  • Regional: cover a region or large country. In my case, it’s ASEAN.
  • Continent: Asia Pacific, Europe, Africa, America
  • Global.

In vendor environment (as opposed to end-user), there are 2 large primary teams

  • Product team: develop the product. The sub-team are:
    • Product Management, R&D, QA, Sustaining, UX (focus on the UI), IX (focus on documentation).
    • They focus on releasing the next big thing.
  • Field team: sell, implement, support the product.
    • Sales, SE, Consulting, Technical Account Managers, Support, Education.
    • They focus on the quarterly target, closing large deals.

Of course, they are supported by many smaller & supporting team, such as marketing, pricing, and CTO Office.

I’ve never worked in Product Team, so when an opportunity arose, I took the leap of faith.

  • From Local to Global, bypassing ASEAN and Asia Pacific.
  • From Field to R&D. My boss no longer in Singapore, ASEAN and Pacific, but directly at our HQ in Palo Alto. A few months ago, there was a re-org and my boss now in Europe. I have not met him yet!
  • From generalist to specialist. I now do vR Ops full time.

Now that it’s almost a year, I can say with enough confidence that it was a good decision. We all work for 3 reasons. I call them the 3M of work:

  • Money
  • Meaning. It has to fill your spirit, not just your pocket.
  • Merriment. It’s gotta be fun, and you love your work.

The job at global level is harder, much harder. Instead of thinking for just 1 customer (my job was Account SE), or a few customers, I have to think of the world. While working on future version, I have to think of current version and previous version. Brownfield is much harder than greenfield. I learned from R&D team that there was many things to be considered before adding or removing a feature. The complexity makes the job meaningful. Life is short, and the journey is as important as the destination. I’ve never done product development before. Luckily, folks are kind and we got along really well. I work with R&D team in Armenia and Palo Alto. They have never, never asked me to accommodate their time zone. I’m truly grateful for that. Folks like Monica Sharma (Director, Product Management) and of course Sunny Dua provide a lot of coaching and guidance.

My perspective was widened. Before, I was just working with a few customers in Singapore, and a bit of ASEAN. Now I work with customers from Europe to US. What I accepted as the best before, has been reset. I’ve seen other region and customers achieved something better.

I didn’t know there was so much work! The demand for the role I took was apparently untapped. I had no idea since I was not busy when doing local role! There was so much request for help outside Singapore. I do webex regularly with customers, helping them remotely. They would login to their production environment, and we troubleshoot issue together. I get to see live environment, and insight into operation.

The downside is travel. Controlling the schedule is important, else I could travel non-stop and just spend weekend at home. My travel schedule is practically full 3 months in advance. Again, folks are generally accommodating. I learn when we explain to folks openly why we can’t be there (they are sponsoring my trip), they are willing to accommodate.

Travel can be sudden. I gotta a call to help large customer on Thursday morning, and on Sunday I was already in the plane to see them. If you have young kids, this can be deal breaker. My 2 kids are 12 and 15 already, but my Mum at 80 needs care.

Speaking of travel, gluten free is a challenge. I am allergic to both dairy and gluten, so keeping these 2 away is difficult when abroad. There is no easy solution today. Singapore Airlines changes the menu every 3 months, so I know in advance exactly what I’m getting 🙂

I hope the sharing is useful for those who are thinking of taking global role.

Which VMs need more resources?

You can reduce the following resources from a VM:

  • CPU
  • RAM
  • Storage

Network isn’t something you can reduce, but you know that already 🙂

You can check which VMs need more resources by building a dashboard like the one below. It’s a simple dashboard, which you can customize and enhance. It lets you reduce the resources independently.

I’ve marked the above dashboard with numbers, so we can refer to them:

  1. This is a table that lists all VMs. It’s sorted by the highest 1-hour average of CPU Demand and RAM Demand. The table also lists the VM CPU and RAM configuration, so you can see if the VMs are small or large. It also shows the cluster the VMs are located. The table is sorted by the highest CPU Demand. I’m showing both CPU and RAM in a single table. You can clone the view and split them if that suits your operations better.
  2. This is a table that lists all VMs, but focusing on storage only. With storage, we do not have the complexity of checking peak utilisation. We simply need to check the present situation.
  3. This lists the Top-15 VMs with highest CPU Demand and RAM Demand in a given period. The list is now split, as they can be different VMs. Do not that Top-N widget will average the number over the selected period. A VM with cyclical workload may not show up. The Top-N is complemented with a distribution chart. Select a VM from the Top-N, and you can see where the VM utilisation is.
  4. The distribution chart helps you see if the VM is really under resources or not. The 95th percentile is marked with a vertical green line. You expect that line to be at 100%, indicating that the VMs hit 100% utilisation frequently. If the 95th percentile is at a low number, and you do not see the number 100 in the x-axis, that means the VM is not under resourced.
  5. Storage is easier, as we can simply use the last data. As a result, we can show a distribution of all the VMs. We use a heat map as it can show 2 dimensions. Every VM is represented as a box. The bigger the box, the more storage the VM is configured with. The color indicates if the VM use it.
    • 0% = Black. Wastage
    • 10% = Green. Balanced usage
    • 100% = Red. Need more space!

The CPU and RAM have limitations. For example, they may show high utilisation during AV backup. You want to ignore those period. At this moment, the only way is to plot the high usage over a line chart. We use Log Insight for this. The chart below shows VMs that hit high CPU usage in a given period. Every time a VM hits high CPU usage, it will show up here. As you can see, there are only 4 VMs that hit high CPU usage. All other VMs do not need more CPU.

The above is an example from a healthy environment. What about an environment where a lot of VMs are under-sized? You expect to see lots of alarm! That’s what you have below

Hope the above is useful. If not, drop me an email.

Dashboards for IT Senior Management

This post is part of Operationalize Your World post. Do read it first to get the context.

CIO, Head of Global Infrastructure, and other IT Senior Management have a different requirements for dashboard than technical folks.

Generally they want:

  • big picture, not details.
  • exception. Things that they need their attention.
  • less technical info. Ideally, present in business terms, not IT.
  • a portal that is easy to access. They may not want to login to vR Ops. If they do, they may forget their password. [e1: vR Ops 6.5 cannot do login-less yet]
  • UI that is easy to understand. So keep each dashboard to a specific question.
  • system that is easy to use. So keep the interaction, clicking, zooming, sorting, etc. minimal.

That’s what they want from you.

What do you want from them?

You show them something so you can get help (e.g. budget, resource). Here are some goals:

  • Show transparency. Giving visibility into live environment to senior management.
  • Prove that you do need additional hardware.
  • Prove the wastage you have been talking for months.

What do you not want to show? There are things you do not want to show. Urgent issues are something that you should not display. It is not about hiding information to CIO. This is about giving you the time or space to do your job. If there is an active fire that requires your full time concentration, you do not want to be interrupted by CIO asking why it’s showing red on the dashboard!

I covered dashboard best practice in this post. Read that first, as this blog builds upon that.

Done? Great!

We take the same approach we did when planning dashboards for specific roles (e.g. Storage team, Network team). We ask a set of questions.

If we implement the above, we will end up with at least 5 dashboards. I’ve combined some of them. I see a wide variety of requirements, so you will customise them anyway 🙂

Basic Visibility

  • How many VMs in our cloud? What’s their CPU, RAM, Disk allocation? This gives you a size of the environment the IaaS is supporting.
  • How much CPU, RAM, Disk do we have? Is it enough to support the above requirement?
  • You should also give the history of VM growth. What is enough today may not be enough in 3 months.

In the dashboard above, I’ve added Availability information. As VMs can be powered off intentionally by application team, you should only report for Tier 1 VMs. Tier 3 VMs, especially those in Test and Dev, can be rebooted frequently and hence will give misleading information.

Performance

The dashboard below shows all VMs. In a large environment, the heat map will automatically combine VMs with the same value (read: color & size).

Every VM is represented by a box. The box can take on value between 0 and 3.

  • Green = 0. The VM is served well.
  • Yellow = 1. One of the IaaS is not delivered as per Performance SLA. We track CPU, RAM and Disk. If your SLA states 10 ms disk latency, then the VM has to get 10 ms.
  • Orange = 2. Two of the IaaS is not delivered.
  • Red = 3. All 3 services not delivered.

The VMs are grouped by Datacenter, then cluster. This lets you see which Datacenter or Cluster aren’t coping well.

The above shows the VMs. What about applications? An Application spans multiple tiers and multiple VMs. Just because a VM does not perform does not mean the whole application is affected. As this is for Senior Management, we’re only showing the Tier 1 applications.

This blog explains the implementation.

Capacity

  • CIO is not in charge of capacity management. He just need to know the decision you want him to make (which is to approve hardware purchase, or get VM Owners to rightsize). For that, he needs to know if you are running out of capacity, and existing capacity is not wasted.
  • How is it growing? This can be taken care of by having a projection. This projection should take into account committed projects too.
  • Capacity is more complex than performance. Just because vSphere cluster is running low on utilization does not mean it can serve the VMs well. See this for detail explanation.
  • Capacity is best presented with a line chart. This enables you to see the trend. For environment with <10 clusters, you can fit all the clusters in the screen. For large environment, you need to make a trade off:
    • show live data. You can be detail as you’re only showing 1 data.
    • show historical data. You can’t be detail as you’re showing >1 data.

Here is an example with historical data. Notice we cannot show details, and the screen only accommodates <10 clusters.

Here is an example where we only show live data. We can show a lot more clusters, and for each we can show CPU, RAM, Disk and Network.

You may run out of capacity. But if you have a lot of wastage, you may have sufficient capacity after you reclaim them. See this for details.

Configuration

  • Do we have “bad” configuration? Examples are old & unsupported versions of Windows, Linux, ESXi, VMware Tools, etc.
  • How uniform is our environment? Complexity is required to optimize cost (hardware, software) and performance. However, there is cost in complexity.
  • Do we have outdated and unsupported products?

If your CIO does not appreciate the complexity, showing CIO the complexity is good for you. It will result in appreciation of your expertise & effort, as it’s certainly easier if the complexity is low. Complexity increases when you have a wide variety of things.

Factor impacting complexity:

  • No of ESXi versions. The more variants, the more complex.
  • No of ESXi CPU version
  • No of brand. The more vendor, the more complex as you need to learn them, and spend time with the their team.
  • No of cluster node size
  • No of shared Datastore size

See the examples provided here for Infrastructure and here for VM.