Tag Archives: dashboard

Right sizing VM Memory without using agent

This post continues from the Operationalize Your World post. Do read it first so you get the context.

The much-needed visibility into Guest OS memory is finally possible in vSphere. As part of the new features in vR Ops 6.3, you can now get Guest OS RAM metrics without using an agent. As long as you have vSphere 6.0 U1 or later, and the VM is running Tools 10.0.0 or later, you are set. Thanks to Gavin Craig for pointing this out. The specific feature needed in Tools is called Common Agent Framework. It removes the need for multiple agents in a VM.

As a result, we can now update the guidance for RAM Right Sizing:

For apps that manage their own RAM, use metrics from the app.
For others, use metrics from the Guest OS.
Use vR Ops Demand if you have no Guest OS visibility. Do not use vCenter Active.
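
To make the guidance concrete, here is a minimal Python sketch of the decision. The function and enum names are my own, purely for illustration; they are not part of vR Ops.

```python
from enum import Enum


class MetricSource(Enum):
    APPLICATION = "application metrics"
    GUEST_OS = "Guest OS metrics"
    VROPS_DEMAND = "vR Ops Demand"


def pick_metric_source(app_manages_own_ram: bool,
                       has_guest_os_visibility: bool) -> MetricSource:
    """Return the metric source to use for RAM right sizing."""
    # Apps such as JVMs or databases manage their own RAM,
    # so their own metrics are the most reliable.
    if app_manages_own_ram:
        return MetricSource.APPLICATION
    # Otherwise prefer in-guest data when it is available.
    if has_guest_os_visibility:
        return MetricSource.GUEST_OS
    # Last resort: vR Ops Demand (never vCenter Active).
    return MetricSource.VROPS_DEMAND
```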

Examples of applications that manage their own RAM are JVMs and databases. If you use the Guest OS counters for these, you can end up with the wrong sizing and make the situation worse. Manny Sidhu provides a real example here. The application vendor asked for 64 GB RAM when the application was only actively using 16 GB, as he shared in the vCenter screenshot below.

For apps that do not manage their own RAM, you should use Guest OS data. The table below compares 63 VMs running a variety of Microsoft Windows versions. A good proportion of them are just idle, as this is a lab, not real-life production.

  1. What conclusion do you arrive at? I’ve added a summary at the bottom of the list.
  2. How do you think VM Consumed, VM Active and Guest OS Used compare?

comparison-windows

And the table below shows the comparison for Linux.

What do you spot? What’s your conclusion? How does this change your capacity planning? 😉

comparison-linux

Here is the summary for both OSes. The total is 101 VMs, not a bad sample size. I’ve also added a comparison. Notice something does not add up?

total

To help you compare further, here is a vR Ops heatmap showing all the VMs.

compare

I created a super metric that compares the Guest OS metric with VM Active. As expected, Guest OS is higher as it takes cache into account. It’s not just Used, and Windows does use RAM as cache (I think Linux does too, but I’m not 100% sure).

The super metric is a ratio: I divide Guest OS by VM Active. I set 0 as black, 5 as yellow, and 10 as red. Nothing is black, as VM Active is lower than Guest OS in all samples.
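
If you want to reason about the ratio and the colours outside vR Ops, the following Python sketch may help. It is not the super metric formula itself. The linear colour blend between the three anchor points is my assumption about how a heatmap gradient behaves, and the function names are made up for illustration.

```python
from typing import Optional, Tuple


def guest_to_active_ratio(guest_os: float, vm_active: float) -> Optional[float]:
    """Guest OS memory divided by VM Active (same unit for both)."""
    if vm_active <= 0:
        return None  # ratio is undefined when VM Active is zero
    return guest_os / vm_active


def heatmap_color(ratio: float) -> Tuple[int, int, int]:
    """Map a ratio to RGB: 0 = black, 5 = yellow, 10 = red.

    Assumption: colours are blended linearly between the anchors,
    and anything above 10 is shown as red.
    """
    r = min(max(ratio, 0.0), 10.0)
    if r <= 5.0:
        t = r / 5.0  # black -> yellow
        return (round(255 * t), round(255 * t), 0)
    t = (r - 5.0) / 5.0  # yellow -> red
    return (255, round(255 * (1 - t)), 0)
```

For example, `guest_to_active_ratio(69, 15)` uses the two averages from my sample (Guest OS Used + Cache at 69%, VM Active at 15%) and lands just below the yellow mark.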

Conclusion

  • VM Consumed is always near 100%, even on VMs that have been idle for days. This is expected, given its nature as a cache. Do not use it for right sizing.
  • Windows memory management differs from Linux. Notice its VM Consumed is higher (94%) than Linux (82%). My guess is that Windows writing zeros during boot causes this.
  • VM Active can be too aggressive as it does not take cache into account. vR Ops adds a Demand counter, which makes the number less aggressive.
  • Guest OS Used + Cache is much greater than VM Active or VM Demand. It’s 69% vs 15% vs 31%.
  • Guest OS Used + Cache + Free does not add up to 100%. In the sample, it only adds up to 83%.

Based on the above data, I’d prefer to use Guest OS, as it takes into account cache.

  • Side reading, if you need more info:
    Refer to this for Windows 7 metrics, and this for Windows 2008 metrics. 
    This is a simple test to understand Windows 7 memory behaviour.

You can develop a vR Ops dashboard like the one below to help you right size based on Guest OS. Notice it takes a similar approach to the dashboard for right sizing CPU.

vm-right-sizing-memory

The dashboard answers the following questions:

  • How many large VMs do I have? What’s the total RAM among them?
    • Answered by the scoreboard widget. It only shows large VMs (default is >24 GB RAM) that are powered on and have Guest OS metrics.
  • Are the large VMs utilizing the RAM given to them?
    • Answered by the 2 line charts:
      • Maximum Guest OS Used (%) in the group
      • Average Guest OS Used (%) in the group
    • In general, Guest OS Used can hit 100% as Windows/Linux take advantage of the RAM as cache. Hence you see that the peak of Used is high.
  • Where are these large VMs located?
    • Answered by the heat map.

The dashboard excludes all VMs that do not have Guest OS RAM data. Since not all VMs have Guest OS RAM data, the first step is to create a group that only contains VMs with the data. Use the example below.

group

You should also manually exclude apps that manage their own memory.
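
The selection logic behind the group and the scoreboard can be sketched in Python as below. The real group is defined in the vR Ops UI, so treat this only as an illustration of the criteria. The `VM` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VM:
    name: str
    powered_on: bool
    ram_gb: float
    guest_os_used_pct: Optional[float]  # None when there is no Guest OS RAM data
    manages_own_ram: bool = False       # e.g. JVM or database, excluded manually


def right_sizing_candidates(vms: List[VM], min_ram_gb: float = 24) -> List[VM]:
    """Large, powered-on VMs with Guest OS data whose app does not manage its own RAM."""
    return [
        vm for vm in vms
        if vm.powered_on
        and vm.ram_gb > min_ram_gb
        and vm.guest_os_used_pct is not None
        and not vm.manages_own_ram
    ]


def scoreboard(vms: List[VM]) -> Tuple[int, float]:
    """Answer 'how many large VMs, and how much RAM in total?'."""
    selected = right_sizing_candidates(vms)
    return len(selected), sum(vm.ram_gb for vm in selected)
```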

Notice the Group Type is VM Types. Follow that exactly, including the case!

Once you have created the group type and group, the next step is to download the following:

  • Super metrics. Don’t forget to enable them!
  • Views
  • Dashboard

You should download the dashboard, view, super metric and the rest of the Operationalize Your World package.

You can customize the dashboard. Do not be afraid to experiment with it. It does not modify any actual metric or object, as a dashboard is just a presentation layer.

Take the scoreboard, for example. We can add color coding to quickly tell you the amount of RAM wasted. If you have > 1 TB RAM wasted, you want it to show red.

customize

To do that, it’s a matter of editing the scoreboard widget. I’ve added thresholds, so it changes from green to yellow when I cross 500 GB, to orange when I cross 750 GB, and to red when I cross 1 TB.

scoreboard
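
The threshold logic is simple enough to express in a few lines of Python. This is only a sketch of the colour bands described above, not how the widget is implemented. I assume 1 TB means 1024 GB and that the colour changes only once the value exceeds the threshold.

```python
def wasted_ram_color(wasted_gb: float) -> str:
    """Colour band for the total RAM wasted, in GB."""
    if wasted_gb > 1024:  # assumption: 1 TB = 1024 GB
        return "red"
    if wasted_gb > 750:
        return "orange"
    if wasted_gb > 500:
        return "yellow"
    return "green"
```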

Hope that helps. I’m keen to know how it helps you right size with confidence, now that you have in-guest visibility.

SDDC Dashboards: The Kitchen

This post is part of Operationalize Your World. Do read that post first to get the context.

There are only 4 parts in the IaaS Monitoring:

  1. Capacity
  2. Configuration (with Inventory)
  3. Audit and Compliance
  4. Availability

Can you figure out why we do not have Performance in “the kitchen” area of your restaurant business?

The Performance SLA concept explains why. I’ve also applied it to the VDI use case and given an example.

Capacity

The Capacity dashboards below take into account Performance SLA and Availability SLA. Only when these 2 are satisfied do they consider Utilization. Review this series of blogs for extensive coverage of this new model.

The set of dashboards answer questions such as:

  • What’s the capacity of my clusters?
  • What’s the consumption on the clusters?
  • Which clusters are running low?
  • Is the cluster still coping well with demands?
  • Does a cluster consist of mostly large VMs?

Here is the dashboard for Tier 1, where we do not overcommit. As a result, both performance & utilization are irrelevant. It is driven by Availability SLA.

The vCPU and vRAM remaining are based on the allocation model. It takes the HA setting into account.

BTW, the lines in the 2 line charts above do not gradually come down (or up) because this is a lab, not a real-life environment. Your production environment will have line charts that make sense 🙂

Here is the dashboard for Tier 2 or 3. Since we overcommit, we now have to take into account performance, and then utilization.

capacity-tier-2-compute

As you can see from the above dashboard, it has 3 sections:

  • Availability SLA. Have we reached the concentration risk limit?
  • Performance SLA. Do we serve the existing workload well?
  • Utilization. It uses the net usable capacity as the ceiling. This ceiling takes into account your HA settings and buffer. The default value for buffer is 10%, which you can change via policy.
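
To illustrate what a net usable ceiling means, here is a rough Python sketch. The exact calculation inside vR Ops is not described in this post, so the order of the deductions (HA first, then buffer on what remains) and the parameter names are my assumptions.

```python
def net_usable_capacity(total: float,
                        ha_reserved_fraction: float,
                        buffer_fraction: float = 0.10) -> float:
    """Capacity left after setting aside HA and buffer.

    Assumption: HA is deducted first, then the buffer is taken
    from what remains. The real vR Ops calculation may differ.
    """
    after_ha = total * (1 - ha_reserved_fraction)
    return after_ha * (1 - buffer_fraction)


def utilization_pct(demand: float, total: float,
                    ha_reserved_fraction: float,
                    buffer_fraction: float = 0.10) -> float:
    """Demand as a percentage of the net usable ceiling."""
    ceiling = net_usable_capacity(total, ha_reserved_fraction, buffer_fraction)
    return 100.0 * demand / ceiling
```

For example, a cluster with 1000 GB of RAM and 25% set aside for HA has roughly 675 GB of net usable capacity with the default 10% buffer.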

Can you spot a limitation on the capacity dashboards I’ve shown so far?

Yes, it’s hard to compare across clusters. If you have many clusters, you want to know which clusters to check first. This dashboard lets you compare. It’s color coded so it’s easier for you to see.

For implementation details, refer to this post.

The twin-sister of Capacity is Reclamation. What can you reclaim and from which VMs?

reclamation

For implementation details, refer to this post.

Configuration 

In the world of Software-Defined, configurations are easy to change. So consistency and drift become 2 areas you need to watch.

The set of dashboards answer questions such as:

  • Are my ESXi configurations consistent, especially if the hosts are members of the same cluster?
  • Are my ESXi hosts & clusters configured to follow best practice?
  • Do I have too many combinations, which increases complexity?
  • What have I got?
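
The consistency question boils down to spotting settings that differ across hosts in the same cluster. A minimal Python sketch of that check follows. The input layout (host name mapped to a dictionary of settings) and the setting names in the usage note are hypothetical, not an export format from vR Ops.

```python
from typing import Dict, Optional


def find_drift(hosts: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, Optional[str]]]:
    """Return the settings that do not have the same value on every host.

    A setting missing on a host is reported with the value None.
    """
    all_keys = set()
    for settings in hosts.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = {host: settings.get(key) for host, settings in hosts.items()}
        if len(set(values.values())) > 1:  # more than one distinct value
            drift[key] = values
    return drift
```

Calling it with two hosts that share the same `ntp` value but differ on `syslog` returns only the `syslog` entry, with the value seen on each host.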

The dashboard below is for ESXi:

configuration-esxi

The dashboard below is for Cluster:

configuration-cluster

That’s all you can do in vR Ops 6.4. If you need more details, you need to deploy VCM. The latest release is 5.8.3. For the list of configurations that it can track per object, review this.

Inventory differs from Configuration.

  • Configuration has Standard, and hence Drift. Inventory does not.
  • Configuration has Compliance. Inventory does not. Well, not generally 😉
  • Configuration has values that can be bad (e.g. ESXi has no syslog). Inventory does not.
  • Inventory has a stock take (typically annual). This can trigger work, which impacts Configuration.
  • Inventory is typically reported on a regular basis.

Because of the above, we’ve provided a purpose-built dashboard to track inventory.

inventory

Audit and Compliance

You can check your environment’s compliance with the vSphere Hardening Guide. The dashboard below shows the summary of compliance, with the ability to drill down to each object.

capture

vCenter tasks, events and alarms are 3 areas that you can mine to help answer compliance and audit questions. Log Insight complements vR Ops nicely here. For example, the following screenshot answers this audit question:

  • Who shut down which VM, and when?

compliance

There are many things it can answer, and it’s covered in the workshop.

Availability

Because of HA and DRS, tracking the cluster makes more sense than tracking each ESXi host. A cluster’s uptime remains 100% when 1 host is not available because you have HA. You have catered for that, and as a result, you should not be penalized.
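
The idea can be sketched in Python as follows. This is not the super metric used in the dashboard. The function names, the sample format and the default of 1 tolerated host failure are my assumptions, used only to show why a single host outage does not reduce cluster availability.

```python
from typing import Iterable, Tuple


def cluster_is_available(hosts_total: int, hosts_up: int,
                         host_failures_tolerated: int = 1) -> bool:
    """The cluster counts as up while failed hosts stay within what HA caters for."""
    return (hosts_total - hosts_up) <= host_failures_tolerated


def availability_pct(samples: Iterable[Tuple[int, int]],
                     host_failures_tolerated: int = 1) -> float:
    """Percentage of samples, each (hosts_total, hosts_up), in which the cluster was up."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    up = sum(
        cluster_is_available(total, running, host_failures_tolerated)
        for total, running in samples
    )
    return 100.0 * up / len(samples)
```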

The set of dashboards answer questions such as:

  1. What’s the availability (%) of each cluster in the last 24 hours? Each cluster has its own line chart, and it’s color coded. You expect a green bar, as shown below.
  2. What’s the availability now? The heatmap provides that answer quickly. You can drill down into the cluster if you spot a problem.
  3. Am I containing risk when there is a major outage? How many VMs am I willing to lose when a cluster or datastore goes down?

availability-cluster

The heat map also provides the ESXi uptime. You can toggle between Cluster and ESXi.

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

A set of dashboards for SDDC Operations

This post continues from the Operationalize Your World post. Do read it first to get the context.

A common requirement among customers is to have a set of vRealize Operations dashboards to help them manage their VMware IaaS platform. They want a suite of inter-connected dashboards, not individual dashboards.

In the past several years, we have developed around 50 dashboards to help you operate your VMware SDDC. The dashboards form 1 story. We group each dashboard into the 4 pillars of SDDC Operations.

The set of dashboards also goes beyond the vSphere Admin, and provides dashboards for the Storage Team, Network Team and NOC Team. However, it does not yet provide complete coverage for every role and every purpose. The table below shows the coverage. No means there is no dashboard yet.

blog 2

Different roles in the team are interested in what’s relevant to them first, which is why the dashboards are tailored for each. Here are the dashboards provided, grouped by role and purpose.

There are naturally more dashboards for the Platform Team. The team was known as the Server Team in the old days of the physical world. It has evolved into the Platform Team, and is typically where VMware Admins and Architects belong.

They have 2 interfaces in the company:

  • upstream: to VM Owners, application team.
  • downstream: to Storage Team, Network Team

In addition, they also deal with IT Management (CIO, etc), Help Desk and Security/Compliance team.

Capture

You may notice in the above picture that some dashboards are in grey. That means they are not available. Need MP means it needs a Management Pack. We have not included MP as part of this solution. You should get vSphere under control first before extending coverage. Need feedback means I’m yet to see a use case for it. Every dashboard answers a question, and has to be complementary to other dashboards.

The tools we use to manage VMware SDDC are vRealize Operations and Log Insight. We do live demos during the events and customers ask for a copy that they can import into their environment. This blog provides the steps to import.

Here is what they look like in vRealize Operations 6.4. vR Ops 6.3 is the minimum requirement, as the dashboards use features that are new in 6.3.

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.