Problem Description
When an Aerospike system is experiencing latency the following symptoms may be observed:
- high disk load observed via the
uptimecommand topoutput showing CPU utilization by either, or all of:kworker,kswapd0,khugepaged, while there is no swapping happening to disk (as observed fromfree)- mpstat showing one CPU core highly utilized - at 100% at all times, used by
%system. - iostat showing one or more disks with either 100% util or very high
w_await(in 100s)
Example uptime output on a 16-core machine, showing load over 40:
$ uptime
13:02:52 up 571 days, 1:01, 1 user, load average: 44.84, 43.19, 41.05
Example output from top showing high kworker, khugepaged and kswapd0 utilization.
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
116145 root 20 0 0.113t 0.100t 0.090t S 1209 85.2 34769,37 asd
9203 root 20 0 0 0 0 R 97.7 0.0 198:27.23 kworker/u+
120 root 20 0 0 0 0 R 48.8 0.0 51336:15 kswapd0
1298 root 20 0 64816 23652 1048 S 18.3 0.0 2664:49 collectl
95 root 39 19 0 0 0 S 6.1 0.0 414:40.23 khugepaged
31230 root 20 0 23680 2872 2400 R 6.1 0.0 0:00.02 top
The free command shows no swapping happening and a lot of memory in cached state:
$ free -m
total used free shared buffers cached
Mem: 122878 114948 7929 94433 71 101744
-/+ buffers/cache: 13132 109745
Swap: 0 0 0
The mpstat command shows one core at 100% utilization by system.
$ mpstat -P ALL 2 3 01:02:48 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle 01:02:50 PM all 29.54 0.00 43.70 6.48 0.00 11.54 1.19 0.00 0.00 7.55 01:02:50 PM 0 29.90 0.00 39.69 3.09 0.00 21.65 1.03 0.00 0.00 4.64 01:02:50 PM 1 30.10 0.00 44.90 2.04 0.00 19.39 0.51 0.00 0.00 3.06 01:02:50 PM 2 26.13 0.00 46.23 1.51 0.00 22.11 1.01 0.00 0.00 3.02 01:02:50 PM 3 26.29 0.00 44.85 2.06 0.00 22.16 1.03 0.00 0.00 3.61 01:02:50 PM 4 28.93 0.00 36.55 3.05 0.00 24.87 1.02 0.00 0.00 5.58 01:02:50 PM 5 26.80 0.00 47.42 2.06 0.00 19.59 1.03 0.00 0.00 3.09 01:02:50 PM 6 21.50 0.00 48.00 2.00 0.00 23.00 1.00 0.00 0.00 4.50 01:02:50 PM 7 27.78 0.00 36.36 2.02 0.00 29.29 1.01 0.00 0.00 3.54 01:02:50 PM 8 31.91 0.00 35.64 15.43 0.00 0.00 1.60 0.00 0.00 15.43 01:02:50 PM 9 35.60 0.00 37.17 13.09 0.00 0.00 1.57 0.00 0.00 12.57 01:02:50 PM 10 49.48 0.00 34.38 7.29 0.00 0.00 1.04 0.00 0.00 7.81 01:02:50 PM 11 27.18 0.00 40.00 14.87 0.00 0.00 2.05 0.00 0.00 15.90 01:02:50 PM 12 35.42 0.00 33.85 14.58 0.00 0.00 1.56 0.00 0.00 14.58 01:02:50 PM 13 47.12 0.00 31.41 9.95 0.00 0.00 1.57 0.00 0.00 9.95 01:02:50 PM 14 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 01:02:50 PM 15 30.98 0.00 39.67 12.50 0.00 0.00 1.63 0.00 0.00 15.22
The iostat command shows one ore more disks with very high w_await constantly:
$ iostat -y -x 5 4
Linux 4.4.0-92-generic (test-slow-disk) 06/10/2019 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
30.05 0.00 52.62 7.50 1.30 8.53
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 19.00 0.00 9.40 0.00 180.00 38.30 0.00 0.51 0.00 0.51 0.51 0.48
xvdf 0.00 1.60 198.40 13.80 820.80 1068.00 17.80 5.50 27.01 0.67 405.74 3.17 67.36
nvme1n1 0.00 0.00 8502.40 854.40 124980.90 109363.20 50.09 6.72 0.72 0.75 0.42 0.09 81.60
nvme0n1 0.00 0.00 8059.20 683.20 102825.00 87449.60 43.53 5.00 0.57 0.58 0.45 0.09 76.32
Explanation
Here, the issue is with a slow disk. In the example shown, xvdf. As the disk is slow, with over 400ms for write ops queued operations at all times, all writes end up going to the dirty pages cache. Since the dirty page cache is now using a lot of the RAM that was free, the kernel worker (kworker) constantly tries to flush that memory to disk and the kernel swap daemon (kswapd0) keeps working on swapping out the page caches. This results in a single core being overutilized, and any threads the Aerospike daemon (asd) is running on that CPU core will fall behind as a consequence. At an application level this will manifest as latency.
The purpose of the disk in question is not relevant, this issue is not isolated to data disks. It may be the root drive, a drive used for the XDR digest log (pre-5.0 version, or data disks or any other drives.
Solution
The solution is to troubleshoot the disk in question. The disk may have not enough max write IOPS / max throughput or it may be faulty.
The hardware manufacturer should provide instructions on how to troubleshoot the faulty drive. A logical starting point would be SMART status should the drive in question support this (from the Linux smartmontools package).
Notes
-
Aerospike advises that Transparent Hugepages are disabled. The effect of those on high-performance databases has been seen to be negative. This article provides full details.
-
When mpstat shows 100% utilization under
irqorsoft, the issue is most likely IRQ balancing. See here for more details on how to resolve IRQ balancing issues.