Detail
mpstat reports CPU utilisation as a set of percentage columns, with one row per CPU and an `all` row for the average across CPUs. This short article describes each column and focuses specifically on the %steal column.

    16:10:08 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    16:10:10 all    7.87    0.00   11.61    6.07    0.00    4.87    1.73    0.00    0.00   67.85
    16:10:10   0    9.28    0.00   12.89    5.67    0.00    5.15    1.55    0.00    0.00   65.46
    16:10:10   1    6.91    0.00   13.30    5.85    0.00    2.66    1.60    0.00    0.00   69.68
    16:10:10   2    7.53    0.00   11.83    5.91    0.00    5.91    2.15    0.00    0.00   66.67
    16:10:10   3    8.60    0.00   10.75    5.38    0.00    4.84    2.15    0.00    0.00   68.28
    16:10:10   4    8.47    0.00   11.11    6.35    0.00    5.29    1.59    0.00    0.00   67.20
    16:10:10   5    5.43    0.00   13.59    4.89    0.00    5.98    2.17    0.00    0.00   67.93
    16:10:10   6    8.38    0.00    9.42    7.33    0.00    5.24    2.09    0.00    0.00   67.54
    16:10:10   7    7.69    0.00    9.34    7.14    0.00    4.40    1.65    0.00    0.00   69.78

As seen in the output above, mpstat provides the following:
| name | detail |
|---|---|
| %usr | the amount of time spent in userland, running applications |
| %nice | the amount of time spent in userland running processes with a ‘nice’ priority, i.e. those whose nice value has been raised to lower their scheduling priority |
| %sys | the amount of time spent in the kernel, including drivers (time spent servicing interrupts is reported separately under %irq and %soft) |
| %iowait | the amount of time the CPU was idle while the system had outstanding I/O requests. High values indicate an I/O bottleneck, typically in disk or network-attached storage |
| %irq | the amount of time spent servicing hardware interrupts |
| %soft | the amount of time spent servicing software interrupts (softirqs). This often relates to network drivers and tends to rise together with %sys |
| %steal | the amount of time stolen from the CPU by the hypervisor (this is discussed in detail below) |
| %guest | the amount of time spent running virtual CPUs for guest virtual machines. This is only non-zero if the machine itself is a hypervisor |
| %gnice | the same as %guest, but for guests running with a ‘nice’ priority applied (niced guests) |
| %idle | the amount of time the CPU was idle with no outstanding I/O requests |
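
To watch %steal over time it can help to pull the column out of the mpstat text programmatically. The following is a minimal sketch in Python, assuming the plain-text layout shown above (a header row containing `CPU` followed by the column names). The function name `parse_mpstat` and the `SAMPLE` constant are illustrative and not part of mpstat.

```python
# First rows of the sample output from this article.
SAMPLE = """\
16:10:08 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
16:10:10 all 7.87 0.00 11.61 6.07 0.00 4.87 1.73 0.00 0.00 67.85
16:10:10 0 9.28 0.00 12.89 5.67 0.00 5.15 1.55 0.00 0.00 65.46
16:10:10 1 6.91 0.00 13.30 5.85 0.00 2.66 1.60 0.00 0.00 69.68
"""


def parse_mpstat(text):
    """Return {cpu_label: {column_name: value}} from mpstat text output.

    Hypothetical helper: it locates the header row by the literal field
    "CPU" and reads data rows using the same field positions.
    """
    columns = None
    cpu_idx = 0
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if "CPU" in fields:
            # Header row: the column names follow the "CPU" field.
            cpu_idx = fields.index("CPU")
            columns = fields[cpu_idx + 1:]
            continue
        if columns is None or len(fields) < cpu_idx + 1 + len(columns):
            continue  # banner, blank, or truncated line
        values = fields[cpu_idx + 1:cpu_idx + 1 + len(columns)]
        try:
            rows[fields[cpu_idx]] = dict(zip(columns, map(float, values)))
        except ValueError:
            continue  # not a numeric data row
    return rows


stats = parse_mpstat(SAMPLE)
for cpu, cols in stats.items():
    print(cpu, cols["%steal"])
```

Later samples for the same CPU label overwrite earlier ones, so this sketch keeps only the most recent interval when fed multi-interval output.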
Answer
When running Aerospike on a virtualized platform, it is particularly important to monitor %steal. This value shows how much time the hypervisor has “stolen” from the vCPU. In other words, it shows how long the vCPU had work ready to run but had to wait while the physical CPU core served another vCPU. With a 1:1 ratio of vCPUs to physical cores, this number should be at or very close to 0. On overprovisioned hosts (i.e. those where the total vCPU count exceeds the physical CPU count on the underlying hardware), this number will increase.
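
On Linux, the steal figure comes from per-CPU counters exposed in `/proc/stat`, where steal is the eighth value on each `cpu` line. As a rough illustration of how a percentage can be derived from two readings, here is a minimal Python sketch. The function names are illustrative, and mpstat's own calculation may differ in detail.

```python
import time


def read_cpu_counters(path="/proc/stat"):
    """Return the integer counters from the aggregate "cpu" line."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return [int(x) for x in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU time counted as steal between two samples.

    Counter order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    # Only the first eight counters are summed: guest and guest_nice are
    # already included in user and nice by the kernel.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if len(deltas) < 8 or total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval=2.0):
    """Take two readings `interval` seconds apart (Linux only)."""
    before = read_cpu_counters()
    time.sleep(interval)
    return steal_percent(before, read_cpu_counters())
```

Calling `sample_steal()` on a guest with dedicated cores would be expected to return a value at or near 0.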
If you see this number increase, check with your cloud team (if hosting internally) or cloud vendor (if using a public cloud host).
Also note that a %steal of just 2% is significant. In practice the stolen time is spread over many short intervals, but it can be visualised as the vCPU core doing nothing for 1 second out of every 50 seconds. During that time, the physical CPU core is serving a vCPU from another virtual machine.
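
The arithmetic behind that visualisation is simple enough to sketch in Python. The helper name `stolen_seconds` is purely illustrative.

```python
def stolen_seconds(steal_pct, window_seconds):
    """Seconds of CPU time lost to steal over a window, at a constant rate."""
    return window_seconds * steal_pct / 100.0


# 2% steal over a 50-second window is one full second of lost CPU time.
print(stolen_seconds(2.0, 50))

# The `all` row in the sample above shows 1.73% steal; sustained for an
# hour, that is roughly a minute of lost CPU time per core.
print(round(stolen_seconds(1.73, 3600), 2))
```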