VDI workload differs from Server workload. As a result, we cannot use the same approach to right-size them. You probably know this well, so let me just highlight some of the differences:
- Usage is spiky, not predictable.
  - A server, generally speaking, has a nice predictable pattern over any given 5 minutes. Its CPU and RAM do not go from 5% to 95% and back within 1 minute.
- Humans do not work non-stop.
  - Typing time, thinking time, meetings, coffee breaks, travelling, public holidays, sick leave, etc. A server, being a machine, has none of this 🙂
  - In this age of mobile cloud, there are no fixed “working hours”. Each user has his or her own work schedule, so we cannot average across a long time period. 1 hour is probably the longest window you want when it comes to averaging.
- There is non-user workload.
  - Windows weekly AV scan, Windows update, Windows patch, Horizon events (e.g. recompose, rebalance), VSAN events (e.g. rebalance, staging).
  - Large application installs.
- Issues like a runaway process can chew up CPU for >10 minutes.
  - I’ve seen this on a web-based application.
- Users do not want to wait for hours when a performance issue happens.
  - You are a user too. How long are you willing to wait for an application to launch? Yup, 1 minute 🙂
For the time being, I’d ignore Disk (IOPS) and Network, and just focus on Compute (CPU and RAM).
As I have shared in this blog, RAM behaves differently from CPU. As a result, we need different counters for CPU and RAM.
For CPU, we should use the data from outside the Guest. For RAM, we should use the data from inside the Guest.
Picking the right counter is critical. As you can see here, choosing the wrong counter can result in the wrong decision.
Setting aside the technology and the tools, when should we give a user more CPU or more RAM?
- Well… when she needs more.
How do we define “more”?
We certainly must look at her workload.
- If we want to be less generous, we consider the workload in the past 1 week.
- If we want to provide a high performance, snappy VDI experience, then any given day is enough to warrant an upsize. We don’t wait for 1 week of unacceptable performance.
Ok, how do we get insight into the workload in the past 1 day? There is no point in taking the average of the last 24 hours, as she likely only generates workload for 8 hours. Maybe even less, as she may have meetings, phone calls, or not even be in the office. As we said, the average will be low.
What we need is the Max of any given 5 minutes. This tells us whether she demanded more resources. 5 minutes is a good and balanced window. Going to 1 minute will be too sensitive. Going to 10 minutes is too long for a user to wait.
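To make the difference between the two numbers concrete, here is a minimal sketch in Python. The data is made up (one sample per minute, a single 5-minute burst) and the function name is my own; it is not how vRealize Operations computes anything, just an illustration of why the 24-hour average hides what the 5-minute Max reveals.

```python
from statistics import mean

def max_of_5min_windows(samples_per_minute, window=5):
    """Return the highest average over consecutive, non-overlapping windows."""
    windows = [
        samples_per_minute[i:i + window]
        for i in range(0, len(samples_per_minute), window)
    ]
    return max(mean(w) for w in windows)

# Hypothetical day: 1440 one-minute CPU samples (%), idle except one burst.
day = [5.0] * 1440
day[600:605] = [95.0] * 5  # a 5-minute burst starting at 10:00

daily_average = mean(day)             # barely moves above the idle 5%
peak_5min = max_of_5min_windows(day)  # exposes the burst

print(f"24-hour average: {daily_average:.2f}%")
print(f"Max of any 5 minutes: {peak_5min:.2f}%")
```

The average stays close to idle even though the user was pegged for 5 full minutes, which is exactly the window where she would have noticed.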
vRealize Operations provides this via its View widget. The following screenshot shows that it can display the Maximum during the sample period.
Beside the Maximum, what else do you notice?
I’ll use a simple example. Say user Marie’s average CPU Workload is 50% over the past day, with a standard deviation of 10%. If the workload is roughly normally distributed, about 95% of the data falls within 2 standard deviations of the average, so for 95% of the past 24 hours her workload was between 30% and 70%. If the max is 95%, she can have been at that level for at most the remaining 5% of the day. That’s still up to 72 minutes, a long time from her viewpoint. 3 standard deviations takes us to 99.7%: for 99.7% of the time, her CPU workload falls between 20% and 80%. The remaining 0.3% translates into about 4 minutes in the last 24 hours. So, as Michael said, the devil is in the details, and now you have the details 🙂
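If you want to redo the arithmetic for your own numbers, the following sketch does it in Python. The helper names are mine, and the 95% and 99.7% coverage figures assume a roughly normal distribution, which real VDI workload may not follow.

```python
MINUTES_PER_DAY = 24 * 60  # 1440

def band(avg, std, k):
    """Range covered by k standard deviations around the average."""
    return (avg - k * std, avg + k * std)

def minutes_outside(coverage):
    """Minutes per day that fall outside a band with the given coverage."""
    return (1 - coverage) * MINUTES_PER_DAY

avg, std = 50.0, 10.0  # Marie's numbers from the example

two_sigma = band(avg, std, 2)       # 30% to 70%
three_sigma = band(avg, std, 3)     # 20% to 80%
outside_2 = minutes_outside(0.95)   # about 72 minutes
outside_3 = minutes_outside(0.997)  # a little over 4 minutes

print(two_sigma, round(outside_2))
print(three_sigma, round(outside_3, 1))
```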
Let’s now take a real example. Notice the first one has an average of 21.52%, and the standard deviation is only 2.24%. The maximum, however, is a whopping 96%, far outside that range, so we can quickly tell it is not normal. Since the sample period below is 24 hours (1440 minutes), that means this is a one-off data point in vRealize Operations.
Zooming into the VM to plot the entire 24 hours, we can see it’s indeed a one-off. Bingo! 🙂
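A quick way to put a number on "off the range" is to express the maximum in standard deviations from the average. This is a small sketch using the three figures quoted above; the function name is hypothetical, and a large value only tells you the peak is unusual, not how long it lasted, so you still need the chart.

```python
def sigmas_from_mean(value, avg, std):
    """How many standard deviations a value sits away from the average."""
    return (value - avg) / std

# Figures from the real example: average 21.52%, std dev 2.24%, max 96%.
z = sigmas_from_mean(96.0, 21.52, 2.24)

print(f"The max is {z:.2f} standard deviations above the average")
```

Anything beyond 3 standard deviations is already rare, so a peak this far out is worth plotting before you decide anything.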
Now that you have insight, you can confidently decide whether that’s a one-off instance or something that does need an upsize. BTW, since this is VDI, your starting point is probably 2 vCPU (I’d avoid going with 1 vCPU), and you should only add 1 vCPU at a time. In other words, I won’t jump from 2 vCPU to 4, 6, 8. I’d go 2, 3, 4, 5, since bigger jumps hit my consolidation ratio.
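The sizing rule can be sketched as a small function. Everything numeric here is my assumption for illustration, not a recommendation: the 90% threshold, the "at least 2 hot windows" rule for telling a one-off from a real need, and the parameter names are all hypothetical. Only the 2 vCPU floor and the one-vCPU-at-a-time step come from the text.

```python
def next_vcpu(current_vcpu, window_averages,
              threshold=90.0, min_hot_windows=2, floor=2):
    """Suggest a vCPU count from one day of 5-minute CPU averages (%).

    threshold and min_hot_windows are hypothetical knobs; tune them to
    how generous you want to be.
    """
    hot = sum(1 for w in window_averages if w >= threshold)
    base = max(current_vcpu, floor)  # never go below 2 vCPU
    # Step up by exactly one vCPU, never 2 -> 4 -> 6 -> 8.
    return base + 1 if hot >= min_hot_windows else base

# 288 five-minute windows in a day.
one_off = [20.0] * 287 + [96.0]     # a single spike
sustained = [20.0] * 280 + [95.0] * 8  # 40 minutes of high CPU

print(next_vcpu(2, one_off))    # stays at 2
print(next_vcpu(2, sustained))  # goes to 3
```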
Yes, that’s all you need to find out which user needs more CPU. Simple, yet accurate. Sometimes, as engineers, we over-engineer a solution 🙂
What about RAM?
Well… that’s a topic for another blog post. I want you to review this first. Let me know which counter to use!
Hint: it is not Memory Usage, Memory Consumed, Memory Workload, or Memory Active.
You can find the answer here.