Tag Archives: SDDC Operations

vSphere visibility for Storage Team

This post is part of the Operationalize Your World post. Do read it first to get the context.

Ask any Storage Team and Platform Team whether the collaboration between them can be improved by a mile, and you are likely to get a nod. One reason is the lack of common visibility. You need to see the same thing if you want to collaborate. The Storage Team does not always get access to vSphere. Even if they do, the vCenter UI is not designed for the Storage Team. It is designed for the VMware Admin.

vRealize Operations and Log Insight can bridge that gap by providing a set of read-only, purpose-built dashboards that answer common questions such as:

  • When a VM Owner complains, can we rule out a storage issue within 1 minute?
    • No ping pong between VM Owner, vSphere Admin, Storage Admin
  • Is the Storage serving all the VMs well?
    • If not, who is affected, when and how badly? Read or Write?
    • The answer has to be tier based, as a Tier 1 VM expects lower latency than a Tier 3 VM
  • What’s the total demand hitting the array? Is it growing fast?
    • Who are the heavy hitters among the VMs?
  • When & where are we running out of capacity?
    • How much disk space can be reclaimed? From which VMs?
  • What have we got?
    • Are they consistently configured?

The questions above cover the main areas of SDDC Operations, such as performance, capacity, configuration and availability. They enable joint troubleshooting, capacity planning and performance monitoring. For better collaboration, add Blue Medora TVS, so you can analyze physical arrays and fabrics, and then correlate back with vSphere.

Overview

The first dashboard provides overall visibility to the Storage Team. It gives insight into the SDDC by showing relevant objects.

  • It quickly shows the summary of key information.
  • It shows VMs, datastores, datastore clusters, compute clusters, and datacenters. It shows their relationships, which you can interact with and drill down into.
  • It shows all the VMs, where they are located, how much space they are allocated, and how they are using it.

overview

Limitations & Customisation:

  • No RDM. Customers are moving away from it, and not many are using it to begin with.
  • Physical array Availability. This requires Storage MP or Blue Medora MP.

Performance

The purpose of the vSphere platform is to run VMs. So long as it is providing good service to the VMs, we don’t have to explain what is underneath. Whether 10,000 IOPS at VM level translates into 8,000 at hypervisor level due to caching at the host is not as important as whether the VMs are being served well (as defined by the SLA).

So we need to find out the latency of every single VM in the vCenter. This is near impossible to establish in vCenter, as you have to go through a lot of VMs. vR Ops helps by using super metrics.
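To make the idea concrete, here is a minimal sketch in Python of what such a roll-up does: check every VM against the latency SLA of its tier and report who is affected. This is not vR Ops super metric syntax. The tier thresholds, the `VMSample` type and the function names are all made up for illustration, so plug in your own SLA numbers.

```python
from dataclasses import dataclass

# Hypothetical per-tier latency SLA in milliseconds; replace with your own SLA.
TIER_SLA_MS = {1: 10.0, 2: 20.0, 3: 30.0}


@dataclass
class VMSample:
    name: str
    tier: int
    latency_ms: float  # peak disk latency of the VM in the sampling period


def vms_breaching_sla(samples, sla=TIER_SLA_MS):
    """Return the VMs whose latency exceeds their tier's SLA, worst breach first."""
    breaching = [s for s in samples if s.latency_ms > sla[s.tier]]
    return sorted(breaching, key=lambda s: s.latency_ms - sla[s.tier], reverse=True)


def percent_served_well(samples, sla=TIER_SLA_MS):
    """Percentage of VMs that stay within their tier's latency SLA."""
    if not samples:
        return 100.0
    ok = sum(1 for s in samples if s.latency_ms <= sla[s.tier])
    return 100.0 * ok / len(samples)


samples = [
    VMSample("db-01", 1, 14.0),
    VMSample("web-01", 2, 12.0),
    VMSample("test-01", 3, 45.0),
    VMSample("app-01", 1, 6.0),
]

for vm in vms_breaching_sla(samples):
    print(f"{vm.name} (Tier {vm.tier}): {vm.latency_ms} ms")
print(f"{percent_served_well(samples):.0f}% of VMs served within SLA")
```

Judging each VM against its own tier is what keeps a Tier 3 VM at 25 ms from being flagged while a Tier 1 VM at 14 ms is.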

Overall Performance

This set of dashboards answers questions such as:

  • What’s the overall performance, for each cluster and datastore? No point troubleshooting a VM or ESXi if the overall array is heavily hit.
  • What’s the total demand hitting the storage system? Who are the heavy hitters?
  • When a cluster is not performing, do we know when and which VMs were affected? Looking at the cluster is useful as that’s where the demand comes from.

performance-overall

Total Latency is not Read + Write latency. In IP Storage, it is not Tx and Rx. It is both Tx as the ESXi host is sending the packets.

Datastore Performance

As this is for the Storage Team, we can drill down to a specific datastore. It provides detailed line charts of the datastore latency, throughput, outstanding IO and IOPS.

It also shows the VMs in the datastore, and if any of them is generating a lot of IOPS (villain VM).

performance-troubleshooting

Heavy Hitters

The performance problem could be caused by high overall loads. The dashboard shows you the total IOPS and total Throughput. If the number is high, you can drill down to see if there are Heavy Hitters.

What is a Heavy Hitter? It has to be defined.

  1. What: IOPS or Throughput?
    • A VM with large block size can generate high throughput without doing excessive IOPS.
  2. How long: 5 minutes, 1 day, 1 week?
    • When we say heavy, how long before it becomes heavy?
    • The heavy hitters dashboard distinguishes between Bursty Hitter and Sustained Hitter.
  3. When: Today or yesterday? Any pattern?
    • The heavy hitters of yesterday may not be the heavy hitters of today.

It uses line charts so you can see patterns.
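The three questions above can be turned into a simple rule. The sketch below is one possible way to separate a Bursty Hitter from a Sustained Hitter using a series of 5-minute IOPS averages. The threshold of 1,000 IOPS and the "heavy in at least half of the samples" rule are my assumptions for illustration, not what the dashboard uses. You could apply the same logic to throughput.

```python
# Hypothetical thresholds; tune them for your environment.
IOPS_THRESHOLD = 1000      # a 5-minute sample above this counts as "heavy"
SUSTAINED_FRACTION = 0.5   # heavy in at least half of the samples = sustained


def classify_hitter(iops_samples,
                    threshold=IOPS_THRESHOLD,
                    sustained_fraction=SUSTAINED_FRACTION):
    """Classify a VM from its series of 5-minute IOPS averages.

    Returns "sustained", "bursty" or "normal".
    """
    if not iops_samples:
        return "normal"
    heavy = sum(1 for value in iops_samples if value > threshold)
    if heavy == 0:
        return "normal"
    if heavy / len(iops_samples) >= sustained_fraction:
        return "sustained"
    return "bursty"


print(classify_hitter([200, 5000, 300, 250]))     # one spike only
print(classify_hitter([1500, 1800, 1200, 900]))   # heavy most of the time
```

Run it over yesterday’s samples and today’s samples separately if you want to see whether the heavy hitters have changed.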

heavy-hitters

Limitation and Customisation:

  • Storage Teams normally want finer granularity than 5 minutes. You can complement the above with a view from the VMkernel. I’ve posted it here, so follow that link.
  • VM-level IOPS are different to Infra-level IOPS
    • Front-End IOPS are different to Back-End IOPS.
    • Distributed Storage has its own set of back-end IOPS, such as synchronisation and replication.
  • We should know the cause of the poor storage performance.
    • The array?
      • Is it near its peak? Low cache, not enough spindle, etc.
      • Is it uneven (one SP doing the bulk of the work)? SP trespass?
      • Does it have hot spots?
    • The network or fabric?
      • Higher chance if you’re on IP Storage.
      • Happens if you don’t turn on Network QoS. There could be storage vMotion happening.
    • The ESXi host?
      • This could be bad configuration, like wrong multipath or insufficient HBA queue depth.
    • The VM?
      • Why does it generate high IOPS? Which process inside the Guest OS is doing that?
      • A snapshot not removed?
      • A developer running IOmeter?

Capacity

This set of dashboards answers questions such as:

  • What’s the overall capacity? How is it used? Where are we on over-subscription?
  • Is any datastore running low on capacity? Are we using the datastores equally?
  • Are the VMs equally distributed among the datastores?
  • For each shared datastore, what is the capacity?

capacity-overall

The heat maps are useful in comparing across datastores:

  • Size by capacity
    • This shows if datastores are consistent in size. You should not have too much variance in size.
  • Color by capacity remaining.
    • Red = running out of capacity. Do not deploy more VMs.
    • Black = wastage. You’re not using it.
    • You should expect green.
  • Group by vCenter, then Datastore Cluster.
    • You know quickly where they are.

Combining the above three pieces of information into a single heat map tells you if you are using your storage properly.
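If you want to reason about the colouring outside the heat map, the logic is roughly as follows. This is only a sketch: the post does not state the exact cut-off values, so the 10% and 70% thresholds below are placeholders I picked for illustration, and the real heat map uses a colour gradient rather than three fixed buckets.

```python
# Hypothetical thresholds for capacity remaining (%); adjust to your own policy.
LOW_REMAINING_PCT = 10.0    # below this: red, do not deploy more VMs
HIGH_REMAINING_PCT = 70.0   # above this: black, the capacity is not being used


def heatmap_color(capacity_gb, used_gb):
    """Map a datastore's remaining capacity to red, green or black."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    remaining_pct = 100.0 * (capacity_gb - used_gb) / capacity_gb
    if remaining_pct < LOW_REMAINING_PCT:
        return "red"
    if remaining_pct > HIGH_REMAINING_PCT:
        return "black"
    return "green"


for name, capacity, used in [("ds-01", 1000, 950), ("ds-02", 1000, 500), ("ds-03", 1000, 100)]:
    print(name, heatmap_color(capacity, used))
```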

You should use Datastore Cluster as part of your design. If you do, you also get visibility into their capacity, as shown below:

capacity-overall-datastore-cluster

You can drill down into each datastore and determine its capacity.

capacity-details-datastore

Limitation and Customisation:

  • Use datastore cluster. This should be part of your architecture anyway.
    • A limitation here is distributed storage, as it has no datastore clusters. You should exclude them and deal with them separately as the monitoring metrics are different.
  • If you have too many datastore clusters, then use Service Tier to group them

Capacity: Space Reclamation

There are 4 places where you can reclaim storage, from the easiest to the hardest:

  1. Orphaned VMs and orphaned VMDKs. They are the easiest as they are not even owned.
  2. Snapshot.
  3. Powered Off VM
  4. Idle VM

capacity-reclamation

The dashboard does not list active VMs because politically it’s hard to reclaim from them. Don’t bother trying 🙂

Storage as a Service

When a VM owner complains, can we rule out within 1 minute whether Storage is the issue?

Using the following dashboard, you select or browse for the VM in question. Its key storage properties and KPIs will be automatically shown. We are using line charts as the problem might have happened in the past and no longer be present. You can also verify if it’s a one-off issue or a regular issue.

The VM’s datastore will be shown automatically. The VM in the screenshot has its VMDK files in 3 different datastores. You can click on each, and the performance will be automatically shown. This lets you verify if the underlying datastore was able to cope or not.

performance-single-vm-troubleshooting

Limitation and Customisation:

  • The dashboard does not show Throughput. Throughput matters more with large block sizes. A 4 – 32K block size should not be a problem when IOPS is low.
  • The dashboard does not show Outstanding IO. This is useful to tell if the underlying infrastructure is unable to keep up.
  • Add snapshot. Latency for a snapshot will be higher as the IO has to go through multiple operations.
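On the Throughput point above, the arithmetic is simply IOPS multiplied by block size. A quick sketch, where the 500 IOPS figure and the 1 MB block are just example inputs I picked, shows why small blocks at low IOPS are harmless while large blocks are not:

```python
def throughput_mb_per_s(iops, block_size_kb):
    """Throughput in MB/s (1 MB = 1024 KB) for a given IOPS and block size."""
    return iops * block_size_kb / 1024


# Same IOPS, very different throughput depending on block size.
for block_kb in (4, 32, 1024):
    print(f"500 IOPS x {block_kb} KB = {throughput_mb_per_s(500, block_kb):.1f} MB/s")
```

At 500 IOPS, a 4 KB block is about 2 MB/s and a 32 KB block is under 16 MB/s, while a 1 MB block is 500 MB/s.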

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

Right sizing VM Memory without using agent

This post continues from the Operationalize Your World post. Do read it first so you get the context.

The much needed ability to get visibility into Guest OS memory is finally possible in vSphere. As part of the new features in vR Ops 6.3, you can now get Guest OS RAM metrics without using an agent. So long as you have vSphere 6.0 U1 or later, and the VM is running Tools 10.0.0 or later, you are set. Thanks to Gavin Craig for pointing this out. The specific feature needed in Tools is called Common Agent Framework. It removes the need for multiple agents in a VM.

As a result, we can now update the guidance for RAM Right Sizing:

For apps that manage their own RAM, use metrics from the app.
For others, use metrics from the Guest OS.
Use vR Ops Demand if you have no Guest OS visibility. Do not use vCenter Active.
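The guidance is a simple order of preference. As a sketch (the function and its two flags are hypothetical, just a way to write the rule down):

```python
def sizing_metric(app_manages_own_ram, has_guest_os_metrics):
    """Pick the data source for RAM right sizing, in order of preference."""
    if app_manages_own_ram:
        # e.g. JVM or database: only the app knows what it really uses
        return "application metrics"
    if has_guest_os_metrics:
        return "Guest OS metrics"
    # Fallback when there is no in-guest visibility. Never vCenter Active.
    return "vR Ops Demand"


print(sizing_metric(app_manages_own_ram=True, has_guest_os_metrics=True))
print(sizing_metric(app_manages_own_ram=False, has_guest_os_metrics=True))
print(sizing_metric(app_manages_own_ram=False, has_guest_os_metrics=False))
```

Note the app check comes first: even when Guest OS metrics are available, they are the wrong source for an app that manages its own RAM.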

Examples of applications that manage their own RAM are JVM and databases. If you use the Guest OS counter for them, you can end up with the wrong sizing and make the situation worse. Manny Sidhu provides a real example here. The application vendor asked for 64 GB RAM when they were only actively using 16 GB, as he shared in the vCenter screenshot below.

For apps that do not manage their own RAM, you should use Guest OS data. The table below compares 63 VMs, running a variety of Microsoft Windows versions. A good proportion of them are just idle, as this is a lab, not real life production.

  1. What conclusion do you arrive at? I’ve added a summary at the bottom of the list.
  2. How do you think VM Consumed, VM Active and Guest OS Used compare?

comparison-windows

And the table below shows comparison for Linux.

What do you spot? What’s your conclusion? How does this change your capacity planning? 😉

comparison-linux

Here is the summary for both OSes. The total is 101 VMs, not a bad sample size. I’ve also added a comparison. Notice something does not add up?

total

To help you compare further, here is a vR Ops heatmap showing all the VMs.

compare

I created a super metric that compares Guest OS metric with VM Active. As expected, Guest OS is higher as it takes into account cache. It’s not just Used, and Windows does use RAM as cache (I think Linux does too, but not 100% sure).

The super metric is a ratio. I divide Guest OS by VM Active. I set 0 as black, 5 as yellow, and 10 as red. Nothing is black, as VM Active is lower than Guest OS in all samples.
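Here is the same calculation as a small Python sketch. It is not the super metric formula itself, just the arithmetic behind it. I’m feeding it the 69% and 15% averages from the conclusion further down purely as example inputs, and `nearest_color` is a simplification I made up: the real heat map blends colours between the three anchors.

```python
# Colour anchors from the heat map: 0 = black, 5 = yellow, 10 = red.
COLOR_ANCHORS = [(0.0, "black"), (5.0, "yellow"), (10.0, "red")]


def guest_to_active_ratio(guest_os_pct, vm_active_pct):
    """Guest OS memory divided by VM Active. None when VM Active is zero."""
    if vm_active_pct <= 0:
        return None
    return guest_os_pct / vm_active_pct


def nearest_color(ratio):
    """Name of the colour anchor closest to the ratio (clamped to 0..10)."""
    clamped = max(0.0, min(10.0, ratio))
    return min(COLOR_ANCHORS, key=lambda anchor: abs(anchor[0] - clamped))[1]


ratio = guest_to_active_ratio(69.0, 15.0)
print(f"ratio = {ratio:.1f}, colour closest to {nearest_color(ratio)}")
```

A ratio of 1 would mean Guest OS and VM Active agree. The further it climbs, the more VM Active is under-reporting compared with what the Guest OS sees.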

Conclusion

  • VM Consumed is always near 100%, even on VMs that have been idle for days. This is expected, given its nature as a cache. Do not use it for right sizing.
  • Windows memory management differs from Linux. Notice its VM Consumed is higher (94%) than Linux (82%). My guess is that Windows writing zeros to memory during boot causes this.
  • VM Active can be too aggressive as it does not take into account cache. vR Ops adds a Demand counter, which makes the number less aggressive.
  • Guest OS Used + Cache is much greater than VM Active or VM Demand. It’s 69% vs 15% vs 31%.
  • Guest OS Used + Cache + Free does not add up to 100%. In the sample, it only adds up to 83%.

Based on the above data, I’d prefer to use Guest OS, as it takes into account cache.

  • Side reading, if you need more info:
    Refer to this for Windows 7 metrics, and this for Windows 2008 metrics. 
    This is a simple test to understand Windows 7 memory behaviour.

You can develop a vR Ops dashboard like the one below to help you right size based on Guest OS. Notice it takes a similar approach to the dashboard for right sizing CPU.

vm-right-sizing-memory

The dashboard answers the following questions:

  • How many large VMs do I have? What’s the total RAM among them?
    • Answered by the scoreboard widget. It only shows large VMs (default is >24 GB RAM) that are powered on and have Guest OS metrics.
  • Are the large VMs utilizing the RAM given to them?
    • Answered by the 2 line charts:
      • Maximum Guest OS Used (%) in the group
      • Average Guest OS Used (%) in the group
    • In general, Guest OS Used can hit 100% as Windows/Linux takes advantage of the RAM as cache. Hence you see the peak of Used is high.
  • Where are these large VMs located?
    • Answered by the heat map.

The dashboard excludes all VMs that do not have Guest OS RAM data. Since not all VMs have Guest OS RAM data, the first step is to create a group that only contains VMs with the data. Use the example below.

group

You should also manually exclude any app that manages its own memory.

Notice the Group Type is VM Types. Follow that exactly, including the case!
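If it helps to see the membership rules in one place, here is a sketch of the filter that the group and the scoreboard apply between them. The `VM` type and its fields are hypothetical, not vR Ops object properties. The 24 GB default comes from the dashboard description above.

```python
from dataclasses import dataclass
from typing import Optional

LARGE_VM_GB = 24  # dashboard default: "large" means more than 24 GB RAM


@dataclass
class VM:
    name: str
    ram_gb: int
    powered_on: bool
    guest_os_used_pct: Optional[float]  # None when there is no Guest OS RAM data
    app_manages_own_ram: bool = False   # e.g. JVM or database, excluded manually


def right_sizing_candidates(vms, min_ram_gb=LARGE_VM_GB):
    """Large, powered-on VMs with Guest OS RAM data, minus self-managing apps."""
    return [
        vm for vm in vms
        if vm.ram_gb > min_ram_gb
        and vm.powered_on
        and vm.guest_os_used_pct is not None
        and not vm.app_manages_own_ram
    ]


vms = [
    VM("big-idle", 64, True, 20.0),
    VM("small", 8, True, 50.0),
    VM("big-off", 32, False, 10.0),
    VM("big-nodata", 48, True, None),
    VM("big-jvm", 64, True, 30.0, app_manages_own_ram=True),
]
candidates = right_sizing_candidates(vms)
print([vm.name for vm in candidates], sum(vm.ram_gb for vm in candidates), "GB")
```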

Once you have created the group type and group, the next step is to download the following:

  • Super metrics. Don’t forget to enable them!
  • Views
  • Dashboard

You should download the dashboard, view, super metric and the rest of the Operationalize Your World package.

You can customize the dashboard. Do not be afraid to experiment with it. It does not modify any actual metric or object, as a dashboard is just a presentation layer.

Take for example, the scoreboard. We can add color coding to quickly tell you the amount of RAM wasted. If you have > 1 TB RAM wasted, you want it to show red.

customize

To do that, it’s a matter of editing the scoreboard widget. I’ve added thresholds, so it changes from green to yellow when I cross 500 GB, to orange when I cross 750 GB, and to red when I cross 1 TB.

scoreboard
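The thresholds I set in the widget boil down to the following. This is a sketch of the colour logic only, not how the widget is configured, and I’m treating 1 TB as 1024 GB here, which is an assumption.

```python
def wasted_ram_color(wasted_gb):
    """Scoreboard colour for the total amount of wasted RAM, in GB."""
    if wasted_gb > 1024:   # more than 1 TB (taken as 1024 GB)
        return "red"
    if wasted_gb > 750:
        return "orange"
    if wasted_gb > 500:
        return "yellow"
    return "green"


for wasted in (400, 600, 800, 1500):
    print(f"{wasted} GB wasted -> {wasted_ram_color(wasted)}")
```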

Hope that helps. I’m keen to know how it helps you right size with confidence, now that you have in-guest visibility.

Sunny Dua and Simon Eady events

Sunny Dua and Simon Eady have been doing a monthly webex where they share their knowledge on VMware vRealize Operations. The latest one is coming this Thursday, 25th August. It’s 1:30 PM – 2:45 PM Singapore time. I know it’s not a good time for certain cities. If you cannot make it, it’s recorded.

I’ll join them in the next session. We are hoping to answer questions like the following. We put some of the answers in light-hearted words, as you know it’s a serious question.

Capture

We live in an era where society is hypersensitive to people who are not sensitive. In the example above, I use her but I meant her/his/him.

The session aims to help you monitor performance and capacity. Hopefully, you gain a new perspective, and questions like the following will make sense:

2

3

You will also be able to answer questions like this:

4

See you next week!