Tag Archives: Operationalize Your World

Datastore Capacity Management

This post is part of Operationalize Your World post. Do read it first to get the context.

This is the 2nd installment of Storage Capacity Management. The previous post covers the overall storage capacity management, where you can see the big picture and know which datastores are low in capacity. This post drills further and lets you analyze a specific datastore.

Datastore capacity is driven by 2 factors:

  • Performance: If the datastore is unable to serve its existing VMs, are you going to add more VM? You are right, the datastore is full, regardless of how space it has left.
  • Utilization: How much capacity is left? Thin provisioning makes this challenging.

This is what the dashboard looks like.

You start by selecting a datastore you want to check. This step is actually optional, as you would have come from the overall dashboard.

When you select a datastore, its Performance and Utilization are automatically shown.

  • Performance
    • Both actual and SLA are shown.
    • You just need to ensure that actual does not breach SLA.
  • Utilization
    • This shows the total capacity, the provisioned capacity (configured to the VM), and what’s actually used (thin provisioned).
    • You want to be careful with thin provisioning, as the VM can consumed the space as it’s already allocated to it. The line chart has 30-day projection to help you plan.

The 2 line charts is all you need. It is simple enough, yet detailed enough. It gives you room to make the judgement call. You can decide to ignore the spike because you knew it was a special event.

If you want to analyse, you can see the individual VMs. The heatmap shows the VMs distribution. You can see if there are large VMs, because they are bigger. You can see if any VM is running out of capacity, or any VM is wasting the allocated capacity.

The heatmap configuration below shows how it’s done.

You can also check if there are VMs that you can delete. Reclamation will give you extra space. The heatmap has a filter for powered off VMs, so only powered off VMs are shown.

From there, you can drill further to check that the VM has indeed met your Powered Off definition. It’s showing the VM powered off time (%) in the past 30 days. I’ve set the threshold to be 99%. Green means the VM is at least 99% powered off in the past 30 days.

Logic

I hope you agree by now that datastore performance is measured on how well it serves its VMs. We can track this by plotting a line chart showing the maximum storage latency experienced by any VM in the datastore. This maximum number has to be lower than the SLA you promise at all times.

For Utilization, we will plot a line chart showing the disk capacity left in the datastore cluster.

You should be using Datastore Cluster. Other than the benefits that you get from using it, it also makes capacity management easier.

  • You need not manually exclude local datastore.
  • You need not manually group the shared datastores, which can be complex if you have multiple clusters.

With vSAN, you only have 1 datastore per cluster and need not exclude local datastores manually. This means it’s even simpler in vSAN.

Include buffer for snapshot. This can be 20%, depending on your environment. This is why I’m not a fan of many small datastores, as you have pockets of unusable capacity. This does not have to be hardcoded in your super metric, but you have to be mentally aware of it.

Super Metrics

The screenshot below shows the super metric formula to get the Maximum latency of all the VMs in the cluster. I’ve chosen at Virtual Disk level, so it does not matter whether it is VMFS, VMFS, NFS or VSAN.

super metric - vDisk

You can copy paste the formula below:

Max ( ${adapterkind=VMWARE, resourcekind=VirtualMachine, attribute=virtualDisk|totalLatency, depth=2 } )

The screenshot below shows the super metric formula to get the total number of disk capacity left in the cluster. This is based on Thin Provisioning consumption.

You can copy paste the formula below:

sum( ${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|available_space, depth=1} )

For Thick Provision, use the following super metric:

super metric - Disk - space left in datastore cluster - thick

You can copy paste the formula below:

Sum
(
${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|total_capacity, depth=1}
) –
Sum
(
${adapterkind=VMWARE, resourcekind=Datastore, attribute=capacity|consumer_provisioned, depth=1}
)

Hope you find it useful. Just in case you’re not aware, you don’t have to implement all these manually. You can import this dashboard, together with 50+ others, from this set.

vSphere visibility for Network Team

This post continues from the Operationalize Your World post. Do read it first so you get the context.

Similar to the problem face between Storage Team and Platform Team, VMware Admin needs to reach out to Network Team. A set of purpose-built dashboards will enable both team to look at issue from the same point of view.

The dashboards must answer the following basic questions for Network Team:

  1. What have I got in this virtual environment?
    • What is the virtual network configuration? What are the networks, and how big are they?
    • We have Distributed vSwitches, Distributed Port Group, Datacenter, Cluster, ESXi, etc. How are they related?
    • Who are the consumer of my network? Where are they located?
  2. Are they healthy?
    • Do we have any errors in our networks? Which port groups see packets dropped? If there is problem, which VMs or ESXi, are affected?
    • Do we have too many special packets? Broadcast, multicast and unknown packets. Who generates them and when?
  3. Are they optimized?
    • Just because something is healthy does not mean they are optimized. Look for opportunity to right size.

Once Network Admin know what they are facing, they are in better position to analyse:

  1. Utilization
    • Is any VM or ESXi near its peak?
    • Who are the top consumer for each physical datacenter?
    • How is the workload distributed in this shared environment?
  2. Performance
    • When VM Owner thinks Network is the culprit, can both Network Team and Platform verify that quickly?
  3. Configuration
    • Are the config consistent?
    • Do they follow best practice?

Once we know the questions and requirements, we can plan a set of dashboards. I’ll show some sample dashboards to get you going. They follow the dashboard best practices.

What have I got? 

This first dashboard provides overall visibility to Network team. It gives insight into the SDDC.

  • It shows the total environment at a glance. A Network Admin can see how the virtual network maps to the virtual environment (ESXi, vCenter Datacenter, vCenter).
  • It quickly shows the structure of virtual network. For each Distributed vSwitch, you can see what its port groups are and ESXi are connected to it. You see the config of both objects (port group and ESXi). You can see if the configuration is not consistent.
  • The heatmap quickly shows all port groups by size, so you can find out your largest ones easily. The color code also lets you see which ones are used the most.

overview

Once you know your environment, you are in a position to do monitoring. I don’t feel comfortable doing monitoring unless I have the big picture. It gives me context.

Are they healthy?

The next dashboard shows quickly if there is dropped packet and error packet in your network.

  • The first line chart shows the maximum packet dropped among all Distributed Switch. So if any switch has dropped packet, it will show up.
  • The second line chart sums all the error packets.

Both line charts are color coded, and you should expect to see green. This means no dropped packets nor error packets.

performance

The dashboard above has interaction, allowing you to drill down when the line charts are not showing green.

  • If you do not see green, you can drill down into each Distributed Switch.
    • As a virtual switch can span thousands of ports, it helps if you can drill down by Port Group and ESXi host. The dashboard automatically shows the relevant Port Group and ESXi of the distributed switch.
  • If there is a need to, you can even drill down to individual VM.
    • The table at the bottom is collapsed by default. Expand it and you’ll see all VMs with dropped packets information.

Other than dropped packets and error packets, Network Admin can also check for multicast, broadcast, and unknown packets. You don’t want to have too many of them zipping around your DC.

special-packets

The same concept is being applied, hence Network team only need to learn the dashboard once.

The line charts show the total broadcast and total multicast packets. We are not doing hourly average as at the global level, the law of large numbers will ensure its smooth. A significant deviation is required to move the number. Hence if there is big jump, you know something amiss.

Just like the previous dashboard, this dashboard lets you drill down too. You can check which VMs are generating broadcast packets.

Is Utilisation running high? 

Another factor that can impact performance is high utilization. Network team can see the total utilization of the network. The following dashboard answers questions such as:

  • What’s the total workload hitting our physical switches?
  • If the total workload increasing?
  • Any crazy pattern in utilization? Any sudden spike that should not have happened?

capacity

Line charts are used instead of a single number, as you can see pattern over time. In fact, 2 line charts are provided: detailed and big picture.

The chart gives you the Total throughput hitting the physical switches, so you know how much bandwidth is generated. The chart will be defaulted to 1 month as this is more of a long term view, not really for troubleshooting.

Based on the line charts, you can drill down into a specific time period where the peak was high. The Top-N lists the ESXi with the highest usage. Click on it, and its detailed utilization will be shown. You can see if any of its vmnic is near the physical limit. The super metric takes into account both RX and TX, and you can hit a limit on either.

If you see an ESXi is saturated, but others are barely used, that means your workload is not well distributed. Note that vSphere 6.0 DRS does not consider network, so unbalance is possible. vSphere 6.5 DRS takes network into account.

Customisation:

  • You can add the total line chart as above, but for VM.
    • VM traffic does not include hypervisor traffic (vMotion, management network, IP storage). So it’s pure business workload.
    • We should be expecting this number to slowly rise, reflecting growth.
    • A sudden spike is bad, and so is a sudden drop. We can turn on analytics on it so you get alert.
  • For details on how to do it, see http://virtual-red-dot.info/is-any-of-your-esxi-vmnics-saturated/

How is the workload distributed?

Distributed Switch does not span beyond vSphere Data Center. So data center is a logical choice to start analysing the traffic. The following dashboard compares the workload of each data center. Using color code makes it easy to see which DC reaches a workload that is high.

You can drill down inside the datacenter object. Click on it, and all the ESXi and port groups connected to it will be automaticaly shown. Click on an ESXi, and you can drill down into a VM.

workload-distribution

You can change the threshold (limitation: same limit for every DC) to suit your need.

Who are the Top Talkers? 

Another reason for performance is you have VMs that are consuming excessive network resources. Or you have a peak period, where the total is simply much higher than low period. The next dashboard provides 2 line charts. Again, line chart is used as you can see the pattern.

The table provides a list of VM, sorted by their peak utilization. You can find out who are the bursty-users (5 minute highest), and who are the sustained-users (1 hour highest and 1 day highest).

top-consumers

This example is only for the VM. We can build one for the ESXi if that’s needed. The concept is the same.

Is Network the culprit?

Lastly, it’s all about Service, not System or Architecture. When a VM Owner complains that IaaS is causing the problem, Network Admin and VMware Admin can quickly see the same dashboard to agree whether network is the culprit or not.

vm-troubleshooting

Hope you find the material useful. If you do, go back to the Main Page for the complete coverage of SDDC Operations. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

SDDC Operations Dashboards for SMB environment

This post continues from the Operationalize Your World post. Do read it first so you get the context.

The SMB segment is a world of its own. There are things that are mandatory in Enterprise segment, but not relevant in SMB segment. As a result, products should be tailored for that market segment.

IMHO, there are actually 4 different market segments when it comes to SDDC Operations. I use No of VM as the marker for each segment. Each of the following segment requires different dashboards and reports:

  1. 100 VM
  2. 1000 VM
  3. 10000 VM
  4. 100000 VM

Now, it will be difficult to create a product with 4 sets of vROps dashboards & reports. I make a compromise on the above, and use this one instead:

  1. 400 VM: SMB market
  2. 4000 VM: Enterprise market
  3. 40000 VM: <give me a name here folks> market

I hope the above is acceptable. As the above has very wide range, I’d take the following reference point

  1. SMB market: 250 VM
  2. Enterprise market: 2500 VM
  3. Huge Private Cloud: 25000 VM

Let’s dive to the 250 VM segment. What are the unique characteristics?

  • 1-2 guys doing everything. No siloes in the team. You and your best friends take care of the whole darn IT.
  • You only have a few clusters. Each cluster only has a few ESXi Host.
  • You know your environment very well because it’s small. They all fit into 1 rack. Architecture is simple. You have a mental picture of it in your head.
  • You don’t buy hardware or VMware every quarter. Likely it’s every 2 years. Capacity planning and monitoring are simple.
  • The workload is quite stable. You are not adding/removing/changing VM every day.
  • Service Tier is an overkill as you only have 1-2 clusters for all workload.

Which of the above points apply to a large environment?

You are right. None.

As a result, SMB needs a purpose-built dashboards. It covers the following:

  1. Availability
  2. Performance
  3. Capacity
  4. Reclaimable Capacity
  5. Compliance
  6. When a VM Owner complains

Home

Your main dashboard. It’s the first dashboard you check, likely on a daily basis as part of your cadence. It answers the no 1 question: is everything healthy?

This is what it looks like in vR Ops 6.3. I’ve added explanation so you can easily see that it’s layered into 4 areas.

home

Availability

The first element of Health is Availability. If a VM or ESXi is down, there is no need to talk about performance or capacity as the damn thing is dead 🙂

The Availability dashboard gives you details info. You can answer questions such as “When did it go down? For how long?”

availability

The dashboard is also useful when you need to report uptime. You do need to create a report and customize it though. If you need it, email me your requirement.

Performance

Just because something is up, does not mean it’s fast. Performance dashboard provides the info here. The dashboard sports the new concept of Performance, which you can review here. It does not apply the formal SLA, as that’s not applicable in SMB. Even without SLA, you can use it to prove your innocence, or justify new hardware purchase.

Line Charts are used as performance problem might have started earlier, or it’s no longer happening and you’re doing a root cause.

If the performance issue is caused by villain VM, the dashboard lets you find the VM. Change the time line in the Top-N widget to the time where there is performance problem.

BTW, if you like the ability to find out which VM was causing the problem, send your thank you to Matthew Hurley

Capacity

Generally speaking, Performance problem happens because supply is not being met by demand. The Capacity dashboard gives detail info on the supply side. As there are only a few clusters, capacity management is much simpler.

capacity

Notice it takes into account performance.

If you mix Prod and Non Prod, capacity management becomes harder. Since the hardware is shared, we need to monitor at the overall cluster level. Since the Production VMs have a more stringent SLA, naturally their number reflects that. As a result, we need to show Prod and Non Prod differently. Let me know if you need it, as to me that complicates operations. This is another reason why I advocate separate cluster for Prod and Non Prod.

One common issue in virtual environment is VM sprawl. Some of these VMs end up not being used. You can reclaim CPU, RAM and Disk from these VMs.

  • The easiest to reclaim is from orphaned VMs, as they are not even registered in vCenter.
  • The second easiest is snapshot. You should only keep snapshot for 1 day or less.

Once the above is reclaimed, you need to look at Powered Off VMs and Idle VMs

  • CPU and RAM are reclaimed from running VMs, as powered off VMs are no longer consuming the resource.
  • CPU: claim from large VM (e.g. 8 vCPU or more). Avoid reclaiming from 2 vCPU unless you’ve completed the large VMs.
  • RAM: claim from large VM (e.g. 16 GB RAM of more) that has Guest OS metrics. It’s more accurate than hypervisor metric.

The Reclaimable dashboard lists all the VMs that have been idle or powered off. It also lists the orphaned VMs and large snapshots.

reclaimable

Configuration

If you configure vSphere hardening guide, and your Infra and VMs comply to it, you will see all green in the dashboard below. If not, you can see exactly which VM or infra is not complying. You can customize the default threshold, although it’s better than you customize the symptoms & alert instead.

You can see compliance for Network and vCenter too, under the vSphere Compliance widget. There is a drop-down there that is not shown.

IaaS

Last but not least, your job is actually about making sure the VM is being served well. It’s a service. Your customers don’t care about your infrastructure. So when they complain that their VM has a problem, you need a dashboard that quickly prove if the problem is at your end or their end. TTI is not Time to Investigate, but Time to Innocence 😉

The Troubleshoot a VM dashboard is built exactly for that!

troubleshoot-a-vm

This dashboard is quite long, as it lets you check underlying ESXi and datastore. You can collapse the widget, as shown below, to see more.

troubleshoot-a-vm-2

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.