Context
When a server runs out of memory (OOM), has trouble allocating Arena segments, or thrashes the OS disk due to paging (even without swap enabled), it can be hard to determine whether the cause is a plain lack of memory or a lack of contiguous memory.
Here are some pointers on looking at contiguous free memory segments.
Method
Look at:
- The contents of /proc/pagetypeinfo, if this file exists.
- The contents of /sys/kernel/debug/extfrag/extfrag_index, if this file exists.
Both allow us to get an understanding of the degree of kernel memory fragmentation. Let’s look at /proc/pagetypeinfo.
server:# cat /proc/pagetypeinfo
...
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone Normal, type Unmovable 1118 818 393 122 34 5 0 0 0 0 0
Node 0, zone Normal, type Movable 2866 944 334 66 9 1 0 0 0 0 0
Node 0, zone Normal, type Reclaimable 1 1 1 0 115 33 5 0 0 0 0
...
Each line represents one type of allocation the kernel can make. The columns numbered 0 through 10 are allocation orders: order n is a physically contiguous block of 2 ^ n pages, so 2 ^ 0 = 1 page, 2 ^ 1 = 2 pages, 2 ^ 2 = 4 pages, …, 2 ^ 10 = 1024 pages. The matrix says how many free blocks of a given size are available for a given kernel allocation type. So, for 2-page allocations of the “zone Normal, type Unmovable” type, 818 free blocks (1636 pages) are still available.
A *clean* (i.e. non-fragmented) machine will have free blocks at the high orders (8, 9, 10), whilst a fragmented machine will only have free blocks at the lower orders, as in the example above.
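As a rough illustration of reading this matrix programmatically, the sketch below parses the "Free pages count" rows and reports the highest order that still has a free block. The function names (`parse_pagetypeinfo`, `highest_free_order`, `free_pages`) are made up for this example, and the sample text is the output shown above. It assumes the row layout shown here; some kernels print capped counts with a leading ">", which the sketch treats as a lower bound.

```python
import re

# Sample taken from the output above; on a live system, read /proc/pagetypeinfo
# instead (this usually requires root).
SAMPLE = """\
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone Normal, type Unmovable 1118 818 393 122 34 5 0 0 0 0 0
Node 0, zone Normal, type Movable 2866 944 334 66 9 1 0 0 0 0 0
Node 0, zone Normal, type Reclaimable 1 1 1 0 115 33 5 0 0 0 0
"""

# Only rows that carry a ", type <name>" field are per-migrate-type free counts.
ROW_RE = re.compile(r"^Node\s+(\d+),\s+zone\s+(\w+),\s+type\s+(\w+)\s+(.*)$")


def parse_pagetypeinfo(text):
    """Return one dict per 'Node N, zone Z, type T' row with its per-order counts."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if not match:
            continue
        tokens = match.group(4).split()
        # A leading '>' marks a capped count; keep the number as a lower bound.
        if not tokens or not all(t.lstrip(">").isdigit() for t in tokens):
            continue
        rows.append({
            "node": int(match.group(1)),
            "zone": match.group(2),
            "type": match.group(3),
            "counts": [int(t.lstrip(">")) for t in tokens],
        })
    return rows


def highest_free_order(counts):
    """Largest order with at least one free block, or -1 if nothing is free."""
    for order in range(len(counts) - 1, -1, -1):
        if counts[order] > 0:
            return order
    return -1


def free_pages(counts):
    """Total free pages: each block at order n contributes 2 ** n pages."""
    return sum(count << order for order, count in enumerate(counts))


if __name__ == "__main__":
    for row in parse_pagetypeinfo(SAMPLE):
        print(row["zone"], row["type"],
              "highest free order:", highest_free_order(row["counts"]),
              "free pages:", free_pages(row["counts"]))
```

A low highest free order combined with a large total of free pages is the pattern that points at fragmentation rather than at a plain shortage.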
Now let’s look at /sys/kernel/debug/extfrag/extfrag_index.
server:# cat /sys/kernel/debug/extfrag/extfrag_index
...
0 1 2 3 4 5 6 7 8 9 10
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.995 0.998
This file offers one line per node and zone, so the grouping is coarser than in /proc/pagetypeinfo: e.g., “zone Normal” here vs. “zone Normal, type Unmovable” there.
The 11 numbers again correspond to allocation sizes of 2 ^ 0 = 1 page, 2 ^ 1 = 2 pages, …, 2 ^ 10 = 1024 pages. The numbers say whether, and why, an allocation of this size would fail at the current point in time. The numbers mean the following:
- A value of -1.000 means that an allocation of the size would currently succeed. In the above example, given the current state of physical memory, allocations of sizes 2 ^ 0 through 2 ^ 8, i.e., 1 through 256 pages, would succeed.
- A value between 0.000 and 1.000 means that an allocation of the size would currently fail. In the above example, given the current state of physical memory, this is the case for sizes 2 ^ 9 (512) and 2 ^ 10 (1024) pages, or 2 MiB and 4 MiB, respectively, with 4 KiB pages.
  - When the value is closer to 0.000, then the reason for the allocation failure is a general lack of memory.
  - When the value is closer to 1.000, then the reason for the allocation failure is kernel memory fragmentation.
In the above example, the values are 0.995 and 0.998, respectively. So, the reason is fragmentation. It looks like the kernel memory right now doesn’t have any physically contiguous chunks of 512 pages (2 MiB) or 1024 pages (4 MiB). So, if the kernel at this point tried to allocate a contiguous 2-MiB or 4-MiB chunk of physical memory, then this allocation would fail.
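The reading rules above can be sketched in a few lines of Python. This is only an illustration: `parse_extfrag_index` and `classify` are names invented here, and the 0.5 cut-off between "low memory" and "fragmentation" is an assumption chosen for the example (the text above only says "closer to 0" or "closer to 1").

```python
import re

# Sample taken from the output above; on a live system, read
# /sys/kernel/debug/extfrag/extfrag_index (needs debugfs mounted and root).
SAMPLE = (
    "Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 "
    "-1.000 -1.000 -1.000 0.995 0.998\n"
)

ROW_RE = re.compile(r"^Node\s+(\d+),\s+zone\s+(\w+)\s+(.*)$")


def parse_extfrag_index(text):
    """Return one dict per 'Node N, zone Z' row with its per-order index values."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if not match:
            continue
        try:
            values = [float(tok) for tok in match.group(3).split()]
        except ValueError:
            continue  # not a row of numbers; skip it
        rows.append({"node": int(match.group(1)),
                     "zone": match.group(2),
                     "index": values})
    return rows


def classify(value, threshold=0.5):
    """Map one index value to a verdict. The threshold is an assumed midpoint."""
    if value < 0:
        return "ok"  # -1.000: the allocation would succeed
    return "fragmentation" if value >= threshold else "low memory"


if __name__ == "__main__":
    for row in parse_extfrag_index(SAMPLE):
        for order, value in enumerate(row["index"]):
            print(f"{row['zone']} order {order}: {classify(value)}")
```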
Notes
Usually, the page size is 4 KiB (512 * 64 bits = 4096 bytes).
The above can be used to see the current state of any system, regardless of whether there has been a crash or not.
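To relate an order to a size in bytes, multiply 2 ^ order by the page size. The short sketch below does that arithmetic; `block_bytes` is a name made up for this example, and when no page size is passed it asks the running system via `os.sysconf("SC_PAGE_SIZE")`, which is available on Unix-like systems.

```python
import os


def block_bytes(order, page_size=None):
    """Size in bytes of one contiguous block at the given order."""
    if page_size is None:
        # Ask the running system; typically 4096 on x86_64.
        page_size = os.sysconf("SC_PAGE_SIZE")
    return (1 << order) * page_size


if __name__ == "__main__":
    for order in (0, 9, 10):
        print(f"order {order}: {block_bytes(order, 4096) // 1024} KiB")
```

With 4 KiB pages, order 9 works out to 2 MiB and order 10 to 4 MiB, matching the sizes quoted in the example above.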
Memory types:
TYPE     Size (x86)    Size (x86_64)  Description
DMA      <16MB         <16MB          For very old devices
DMA32    N/A           16-4096MB      For devices addressing up to 32 bits (4GB)
NORMAL   16-896MB      >4096MB        Memory directly mapped by kernel
HIGHMEM  896MB-4096MB  N/A            Memory not permanently mapped by the kernel
Allocation types:
TYPE         Description
Unmovable    Locked in memory
Reclaimable  Reusable after being cleaned
Movable      Can be migrated elsewhere, e.g. by compaction
Reserve      Last resort reserve
Isolate      Temporarily isolated; cannot be allocated from
CMA          Contiguous memory allocator, for DMA devices
Up until Kernel 3.10, various mechanisms were used to reclaim memory, but as of 3.10, they were removed and replaced with the new memory compaction algorithms.
In kernel v3.10, memory compaction is performed under any of the following situations:
- The kswapd kernel thread is called to balance zones after a failed high-order allocation.
- The khugepaged kernel thread is called to collapse a huge page.
- Memory compaction is manually triggered via the /proc interface.
- The system performs direct reclaim to meet higher-order memory requirements, including handling Transparent Huge Page (THP) page fault exceptions.
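For the manual trigger, the usual /proc entry is /proc/sys/vm/compact_memory: writing 1 to it asks the kernel to compact all zones. The sketch below wraps that write; `trigger_compaction` is a name made up for this example, and it assumes a kernel built with compaction support and a caller running as root. Compaction takes CPU time, so treat this as a diagnostic aid rather than something to run routinely.

```python
from pathlib import Path

# Standard sysctl entry for manual compaction; only present when the kernel
# was built with compaction support.
COMPACT_MEMORY = Path("/proc/sys/vm/compact_memory")


def trigger_compaction(path=COMPACT_MEMORY):
    """Write '1' to the compaction trigger. Returns False if the file is absent."""
    path = Path(path)
    if not path.exists():
        return False
    # Raises PermissionError when not run as root on the real /proc entry.
    path.write_text("1\n")
    return True


if __name__ == "__main__":
    try:
        if trigger_compaction():
            print("compaction requested")
        else:
            print("compact_memory not available on this kernel")
    except PermissionError:
        print("need root to trigger compaction")
```

Comparing /proc/pagetypeinfo before and after such a run gives a feel for how much contiguous memory compaction can recover.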
When the kernel allocates pages, if there are no available pages in the free lists of the buddy system, the following occurs:
- The kernel processes this request in the slow path and tries to allocate pages using the low watermark as the threshold.
- If the memory allocation fails, which indicates that memory may be slightly insufficient, the page allocator wakes up the kswapd thread to asynchronously reclaim pages and attempts to allocate pages again, also using the low watermark as the threshold.
- If the allocation fails again, it means that the memory shortage is severe. In this case, the kernel runs asynchronous memory compaction first.
- If the allocation still does not succeed after the async memory compaction, the kernel directly reclaims memory.
- After the direct memory reclaim, if the kernel has not reclaimed enough pages to meet the demand, it performs direct memory compaction. If it has not reclaimed a single page, the OOM Killer is called to free memory.
The above steps are only a simplified description of the actual workflow. In real practice, it is more complicated and will be different depending on the requested memory order and allocation flags.
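The simplified sequence above can be written down as a toy decision function. Nothing here touches the kernel: `slow_path_sketch`, `try_alloc` and `direct_reclaim` are hypothetical names for this illustration, and the callables stand in for "does the allocation succeed now?" and "how many pages did direct reclaim free?". It mirrors only the steps listed above, not the real allocator logic.

```python
from typing import Callable, List


def slow_path_sketch(try_alloc: Callable[[], bool],
                     direct_reclaim: Callable[[], int]) -> List[str]:
    """Return the ordered list of steps taken. Purely illustrative."""
    steps = ["slow path: allocate at low watermark"]
    if try_alloc():
        return steps + ["success"]

    # Memory is slightly short: reclaim in the background and retry.
    steps.append("wake kswapd, retry at low watermark")
    if try_alloc():
        return steps + ["success"]

    # Shortage is severe: try asynchronous compaction first.
    steps.append("async compaction")
    if try_alloc():
        return steps + ["success"]

    # Still failing: the allocating task reclaims memory itself.
    steps.append("direct reclaim")
    reclaimed = direct_reclaim()
    if try_alloc():
        return steps + ["success"]

    if reclaimed == 0:
        # Nothing could be reclaimed at all.
        steps.append("OOM killer")
    else:
        # Some pages were freed, but not enough contiguous memory.
        steps.append("direct compaction")
    return steps


if __name__ == "__main__":
    # Example: every attempt fails and direct reclaim frees nothing.
    for step in slow_path_sketch(lambda: False, lambda: 0):
        print(step)
```

Each extra step in the returned list corresponds to extra latency for the allocating task, which is the cost described in the paragraph that follows.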
This is why, even if you don't hit an OOM, fragmented memory on kernel 3.10 can slow down tasks whilst compaction and reclamation take place, and why this kernel should be considered a poor fit for database-type applications. In particular, the "slow" path of memory reclamation can iterate many times; a limit of 16 iterations was introduced in 4.12. Consider using kernel 4.6 or later.