Nodes getting OOM Killed when using CentOS 7/RHEL 7

Problem Description

Customers have reported out-of-memory (OOM) kills on multiple nodes across different Aerospike clusters running on CentOS 7/RHEL 7, which resulted in a restart of the entire cluster.
Because these were production environments, the incidents caused considerable data loss and business impact.

Example stack trace from dmesg on one of the nodes:
[Sat Jun  3 23:13:15 2023] zabbix_agentd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
[Sat Jun  3 23:13:15 2023] zabbix_agentd cpuset=/ mems_allowed=0
[Sat Jun  3 23:13:15 2023] CPU: 6 PID: 32764 Comm: zabbix_agentd Kdump: loaded Not tainted 3.10.0-1062.4.3.el7.x86_64 #1
[Sat Jun  3 23:13:15 2023] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[Sat Jun  3 23:13:15 2023] Call Trace:
[Sat Jun  3 23:13:15 2023]  [<ffffffffbc579ba4>] dump_stack+0x19/0x1b
[Sat Jun  3 23:13:15 2023]  [<ffffffffbc574c6a>] dump_header+0x90/0x229
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbf058f2>] ? ktime_get_ts64+0x52/0xf0
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbfc0e84>] oom_kill_process+0x254/0x3e0
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbf325e1>] ? cpuset_mems_allowed_intersects+0x21/0x30
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbfc092d>] ? oom_unkillable_task+0xcd/0x120
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbfc09d6>] ? find_lock_task_mm+0x56/0xc0
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbfc16d6>] out_of_memory+0x4b6/0x4f0
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbfc81df>] __alloc_pages_nodemask+0xacf/0xbe0
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbe98f6d>] copy_process+0x1dd/0x1a50
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbe9a991>] do_fork+0x91/0x330
[Sat Jun  3 23:13:15 2023]  [<ffffffffbbe9acb6>] SyS_clone+0x16/0x20
[Sat Jun  3 23:13:15 2023]  [<ffffffffbc58d2b4>] stub_clone+0x44/0x70
[Sat Jun  3 23:13:15 2023]  [<ffffffffbc58cede>] ? system_call_fastpath+0x25/0x2a

Explanation

As the stack trace shows, the "zabbix_agentd" process attempted to allocate 16 KB of memory, as indicated by order=2 on the first line.
Linux generally allocates memory in 4 KB pages, and an order-N request asks for 2^N physically contiguous pages. Here the process is requesting 4 contiguous 4 KB pages, or 16 KB.
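
The relationship between the order value and the request size can be checked with a small sketch. It assumes the 4 KB base page size mentioned above; the function name `allocation_size_kb` is illustrative only and is not part of any kernel or Aerospike tooling.

```python
PAGE_SIZE_KB = 4  # assumed base page size (x86_64 default)


def allocation_size_kb(order: int, page_size_kb: int = PAGE_SIZE_KB) -> int:
    """Return the size in KB of a page-allocator request of the given order."""
    pages = 1 << order  # an order-N request is 2**N contiguous pages
    return pages * page_size_kb


# Print the first few orders; order=2 is the one seen in the oom-killer line.
for order in range(4):
    print(f"order={order}: {1 << order} page(s), {allocation_size_kb(order)} KB")
```

For order=2 this gives 4 pages and 16 KB, matching the request made by zabbix_agentd.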

As reported in dmesg, the system had no free contiguous 16 KB block to allocate. Some memory was unused, but it was fragmented, so the kernel could not satisfy the request for contiguous pages.
 
[Sat Jun  3 23:13:16 2023] Node 0 Normal: 392633*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1570532kB
[Sat Jun  3 23:13:16 2023] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Sat Jun  3 23:13:16 2023] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
In this example, 392633 free 4 KB pages are available (about 1.5 GB), but none of them are contiguous: there is no free block of 8 KB or larger, so a 16 KB request cannot be satisfied.
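
As a rough illustration, the free-list line from dmesg can be parsed to check whether a request of a given order could be served. This is a simplified sketch: the helper names are hypothetical, it assumes 4 KB base pages, and it only looks at free block counts, ignoring the watermarks and other conditions the kernel also applies.

```python
import re
from typing import Dict


def parse_free_blocks(line: str) -> Dict[int, int]:
    """Map block size in KB to the number of free blocks on a dmesg free-list line."""
    # Matches entries such as "392633*4kB"; the trailing "= 1570532kB" total has no "*".
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def can_satisfy(free_blocks: Dict[int, int], order: int, page_kb: int = 4) -> bool:
    """True if at least one free block is large enough for a request of this order."""
    needed_kb = (1 << order) * page_kb
    return any(count > 0 for size, count in free_blocks.items() if size >= needed_kb)


line = ("Node 0 Normal: 392633*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB "
        "0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1570532kB")
blocks = parse_free_blocks(line)

print("order 0 possible:", can_satisfy(blocks, order=0))
print("order 2 possible:", can_satisfy(blocks, order=2))
```

With the line from this incident, an order-0 (4 KB) request is possible while an order-2 (16 KB) request is not, despite roughly 1.5 GB being free.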

 

Solution

In this case, the guest OS runs a kernel based on Linux 3.10.0, which was released on June 30, 2013. Memory compaction during page allocation is weak in this kernel: compaction can easily hit contention on zone->lock and abort, so it fails to make high-order memory available.
Upgrading the guest OS will likely make memory compaction more successful at defragmenting memory, so that physically contiguous blocks of 16 KB and larger (needed for fork()) can be allocated.
The workload may be calling fork() very frequently, which induces a lot of memory fragmentation on the system. Memory compaction, reclaim, and page allocator logic have all changed substantially over the past 10 years.
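
To see whether a running node is approaching the same fragmented state, the per-order free counts that dmesg printed at OOM time are also exposed live in /proc/buddyinfo. The sketch below parses that format; the function names and the sample text are hypothetical (the sample simply mirrors the dmesg line from this incident), and it is meant as a monitoring aid under those assumptions, not as a fix.

```python
from typing import Dict, List, Tuple


def parse_buddyinfo(text: str) -> Dict[Tuple[str, str], List[int]]:
    """Parse /proc/buddyinfo text into {(node, zone): [free blocks per order]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal <count order 0> <count order 1> ..."
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        zones[(node, zone)] = [int(x) for x in parts[4:]]
    return zones


def free_blocks_at_or_above(counts: List[int], order: int) -> int:
    """Count free blocks that are large enough for a request of the given order."""
    return sum(counts[order:])


# Hypothetical sample mirroring the fragmented zone from the dmesg output.
SAMPLE = "Node 0, zone   Normal 392633 0 0 0 0 0 0 0 0 0 0\n"

for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
    usable = free_blocks_at_or_above(counts, order=2)
    print(f"node {node} zone {zone}: {usable} free block(s) of order 2 or higher")
```

On a live node, the contents of /proc/buddyinfo can be passed to `parse_buddyinfo` in place of `SAMPLE`. A zone that consistently shows zero blocks at order 2 and above is in the state that led to the OOM kill described here.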

We have seen the same issue in multiple customers' environments, and the hardware vendor has recommended an OS upgrade to resolve it. Upgrading the OS resolved the issue for those customers, so we suggest upgrading the guest OS, especially since CentOS 7 reaches end of life on June 30, 2024: https://access.redhat.com/support/policy/updates/errata/#Life_Cycle_Dates
 