
How to understand free contiguous memory segments

Context

When a server hits an out-of-memory (OOM) condition, has trouble allocating Arena segments, or thrashes the OS disk due to paging (even without swap enabled), it can be hard to tell whether the cause is a plain lack of memory or a lack of contiguous memory.

Here are some pointers on looking at contiguous free memory segments.


Method

Look at:

  • The contents of /proc/pagetypeinfo, if this file exists.
  • The contents of /sys/kernel/debug/extfrag/extfrag_index, if this file exists.

Both allow us to get an understanding of the degree of kernel memory fragmentation. Let’s look at /proc/pagetypeinfo.
 

server:# cat /proc/pagetypeinfo 
...
Free pages count per migrate type at order      0     1     2     3     4     5     6     7     8     9    10 
Node    0, zone   Normal, type    Unmovable  1118   818   393   122    34     5     0     0     0     0     0 
Node    0, zone   Normal, type      Movable  2866   944   334    66     9     1     0     0     0     0     0 
Node    0, zone   Normal, type  Reclaimable     1     1     1     0   115    33     5     0     0     0     0 
...

Each line represents one allocation (migrate) type within a given node and zone. The columns numbered 0 through 10 are allocation orders: a block of order n is 2 ^ n physically contiguous pages, so 2 ^ 0 = 1 page, 2 ^ 1 = 2 pages, 2 ^ 2 = 4 pages, …, 2 ^ 10 = 1024 pages. Each cell is the number of free blocks of that order for that allocation type, not a number of pages. So, for the “zone Normal, type Unmovable” type, 818 free blocks of 2 pages each (order 1) are still available.

A *clean* (i.e. non-fragmented) machine will have free blocks at the high orders (8, 9, 10), whilst a fragmented machine will only have free blocks at the lower orders. In the example above, nothing is free above order 6.
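
The table can also be read programmatically. The following is a minimal Python sketch, not an official tool: the function names `parse_pagetypeinfo`, `free_kib` and `highest_free_order` are hypothetical, the sample text is the output shown above, and a 4 KiB page size is assumed.

```python
import re

# Sample taken from the /proc/pagetypeinfo output shown above.
SAMPLE = """\
Free pages count per migrate type at order      0     1     2     3     4     5     6     7     8     9    10
Node    0, zone   Normal, type    Unmovable  1118   818   393   122    34     5     0     0     0     0     0
Node    0, zone   Normal, type      Movable  2866   944   334    66     9     1     0     0     0     0     0
Node    0, zone   Normal, type  Reclaimable     1     1     1     0   115    33     5     0     0     0     0
"""

# Matches only the "Node N, zone Z, type T  <counts>" rows.
# A leading '>' on a count is tolerated in case a kernel prints capped values.
LINE_RE = re.compile(r"Node\s+(\d+), zone\s+(\w+), type\s+(\w+)\s+((?:>?\d+\s*)+)$")


def parse_pagetypeinfo(text):
    """Return {(node, zone, migrate_type): [free block count per order]}."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            node, zone, mtype, counts = m.groups()
            result[(int(node), zone, mtype)] = [int(c.lstrip(">")) for c in counts.split()]
    return result


def free_kib(counts, page_kib=4):
    """Free memory in KiB held in blocks of each order (assumes 4 KiB pages)."""
    return [n * (2 ** order) * page_kib for order, n in enumerate(counts)]


def highest_free_order(counts):
    """Largest order with at least one free block, or None if nothing is free."""
    orders = [order for order, n in enumerate(counts) if n > 0]
    return max(orders) if orders else None


info = parse_pagetypeinfo(SAMPLE)
for key, counts in info.items():
    print(key, "highest free order:", highest_free_order(counts))
```

On a live system, the same functions could be fed the contents of /proc/pagetypeinfo (reading it usually requires root) instead of `SAMPLE`.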
 

Now let’s look at /sys/kernel/debug/extfrag/extfrag_index.

server:# cat /sys/kernel/debug/extfrag/extfrag_index
...
                           0      1      2      3      4      5      6      7      8      9     10
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000  0.995  0.998

This offers one line per node and zone, so the grouping is coarser than in /proc/pagetypeinfo: e.g., “zone Normal” here vs. “zone Normal, type Unmovable” there.

The 11 numbers again correspond to allocation sizes of 2 ^ 0 = 1 page, 2 ^ 1 = 2 pages, …, 2 ^ 10 = 1024 pages. The numbers say why an allocation of this size would fail at the current point in time. The numbers mean the following:

  • -1.000 means that an allocation of the size would currently succeed. In the above example, given the current state of physical memory, allocations of sizes 2 ^ 0 through 2 ^ 8, i.e., 1 through 256 pages, would succeed.

  • A value between 0.000 and 1.000 means that an allocation of the size would currently fail. In the above example, given the current state of physical memory, this is the case for sizes 2 ^ 9 (512) and 2 ^ 10 (1024) pages - or 2 MiB and 4 MiB, respectively.

    • When the value is closer to 0.000, then the reason for the allocation failure is a general lack of memory.

    • When the value is closer to 1.000, then the reason for the allocation failure is kernel memory fragmentation.

  • In the above example, the values are 0.995 and 0.998, respectively. So, the reason is fragmentation. It looks like the kernel memory right now doesn’t have any physically contiguous chunks of 512 pages (2 MiB) or 1024 pages (4 MiB). So, if the kernel at this point tried to allocate a contiguous 2-MiB or 4-MiB chunk of physical memory, then this allocation would fail.
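
The interpretation rules above can be sketched in Python. This is only an illustration: `parse_extfrag_line` and `classify` are hypothetical names, and the 0.5 cut-off between "low memory" and "fragmentation" is an assumption made for the example, since the text only says "closer to 0.000" and "closer to 1.000".

```python
# Sample line taken from the extfrag_index output shown above.
SAMPLE_LINE = (
    "Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 "
    "-1.000 -1.000 -1.000 -1.000  0.995  0.998"
)


def parse_extfrag_line(line):
    """Split one extfrag_index row into (node, zone, [index per order])."""
    head, rest = line.split("zone", 1)
    fields = rest.split()
    node = int(head.replace("Node", "").strip(" ,"))
    zone = fields[0]
    values = [float(v) for v in fields[1:]]
    return node, zone, values


def classify(value, threshold=0.5):
    """Map one index value to a verdict. The 0.5 threshold is an assumption."""
    if value < 0:
        return "ok"  # -1.000: an allocation of this order would succeed
    return "fragmentation" if value >= threshold else "low memory"


node, zone, values = parse_extfrag_line(SAMPLE_LINE)
for order, value in enumerate(values):
    print(f"node {node} zone {zone} order {order:2d}: {classify(value)}")
```

For the sample line, orders 0 through 8 are reported as "ok" and orders 9 and 10 as "fragmentation", matching the reading given above.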


Notes

Usually, the page size is 4 KiB (4096 bytes = 512 * 64 bits), so a block of order n spans 2 ^ n * 4 KiB; for example, order 9 is 2 MiB.
The above can be used to see the current state of any system regardless of whether there has been a crash or not.
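
The size arithmetic used throughout this article can be checked with a few lines of Python. This is a small sketch; `block_size_bytes` is a hypothetical helper name, and `os.sysconf("SC_PAGE_SIZE")` is used to read the real page size of the machine it runs on when none is given.

```python
import os


def block_size_bytes(order, page_size=None):
    """Size in bytes of one contiguous block of the given order."""
    if page_size is None:
        # Ask the OS for its page size instead of assuming 4 KiB.
        page_size = os.sysconf("SC_PAGE_SIZE")
    return (1 << order) * page_size


for order in range(11):
    kib = block_size_bytes(order, page_size=4096) // 1024
    print(f"order {order:2d}: {1 << order:4d} pages = {kib} KiB")
```

With 4 KiB pages, order 9 comes out at 2048 KiB (2 MiB) and order 10 at 4096 KiB (4 MiB), which are the sizes quoted for the extfrag_index example.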


Memory zones:
TYPE        Size (x86)    Size (x86_64)    Description
DMA         <16MB         <16MB            For very old devices
DMA32       N/A           16-4096MB        For devices addressing up to 32 bits (4GB)
NORMAL      16-896MB      >4096MB          Memory directly mapped by kernel
HIGHMEM     896MB-4096MB  N/A              Memory not permanently mapped by kernel

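Allocation (migrate) types:
TYPE           Description
Unmovable      Cannot be moved; locked in memory
Reclaimable    Cannot be moved, but can be freed and reused when needed
Movable        Can be migrated to another physical location
Reserve        Last resort reserve
Isolate        Isolated; the kernel does not allocate from these blocks
CMA            Contiguous memory allocator, for DMA devices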

Up until kernel 3.10, various mechanisms were used to reclaim memory, but as of 3.10, they were removed and replaced with the new memory compaction algorithms.
In kernel v3.10, memory compaction is performed in any of the following situations:

  • The kswapd kernel thread is called to balance zones after a failed high-order allocation.
  • The khugepaged kernel thread is called to collapse a huge page.
  • Memory compaction is manually triggered via the /proc interface.
  • The system performs direct reclaim to meet higher-order memory requirements, including handling Transparent Huge Page (THP) page fault exceptions.
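
As an illustration of the manual trigger, the sketch below writes 1 to /proc/sys/vm/compact_memory, which asks the kernel to compact all zones. It assumes a kernel built with compaction support and a process running as root; `trigger_compaction` is a hypothetical helper name, and the path is a parameter so the function can be tried against an ordinary file first.

```python
from pathlib import Path

# Kernel interface for requesting compaction of all zones (needs root).
COMPACT_PATH = Path("/proc/sys/vm/compact_memory")


def trigger_compaction(path=COMPACT_PATH):
    """Write 1 to the compaction trigger.

    Returns False if the file does not exist (for example, a kernel built
    without compaction support); raises PermissionError when not run as root.
    """
    path = Path(path)
    if not path.exists():
        return False
    path.write_text("1\n")
    return True


# Usage (as root):
#   trigger_compaction()
```

Comparing /proc/pagetypeinfo or extfrag_index before and after the call gives a rough idea of how much contiguous memory compaction was able to recover.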

When the kernel allocates pages, if there are no available pages in the free lists of the buddy system, the following occurs:

  • The kernel processes this request in the slow path and tries to allocate pages using the low watermark as the threshold.
  • If the memory allocation fails, which indicates that the memory may be slightly insufficient, the page allocator wakes up the kswapd thread to asynchronously reclaim pages and attempts to allocate pages again, also using the low watermark as the threshold.
  • If the allocation fails again, it means that the memory shortage is severe. In this case, the kernel runs asynchronous memory compaction first.
  • If the allocation still does not succeed after the async memory compaction, the kernel directly reclaims memory.
  • After the direct memory reclaim, if the kernel doesn’t reclaim enough pages to meet the demand, it performs direct memory compaction. If it doesn’t reclaim a single page, the OOM Killer is called to deallocate memory.

The above steps are only a simplified description of the actual workflow. In real practice, it is more complicated and will be different depending on the requested memory order and allocation flags.
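
The ordering of the simplified steps can be traced with a toy model. This is not kernel code: `slow_path_trace`, the stage names and the `succeeds_after` / `reclaimed_pages` parameters are all made up for illustration, and the model only reproduces the sequence described in this article.

```python
from typing import List, Optional

# Stages in the order given by the simplified description above.
STAGES = [
    "low watermark retry",
    "wake kswapd",
    "async compaction",
    "direct reclaim",
    "direct compaction",
]


def slow_path_trace(succeeds_after: Optional[str], reclaimed_pages: int = 1) -> List[str]:
    """Return the stages a request passes through in this toy model.

    succeeds_after: stage after which the allocation succeeds, or None if it never does.
    reclaimed_pages: pages freed by direct reclaim; 0 leads to the OOM killer.
    """
    trace = []
    for stage in STAGES:
        if stage == "direct compaction" and reclaimed_pages == 0:
            # Direct reclaim freed nothing, so the OOM killer is invoked.
            trace.append("OOM killer")
            return trace
        trace.append(stage)
        if stage == succeeds_after:
            trace.append("allocated")
            return trace
    trace.append("allocation failed")
    return trace


print(slow_path_trace("async compaction"))
print(slow_path_trace(None, reclaimed_pages=0))
```

Every extra stage in the trace is work done on behalf of the allocating task before it can continue, which is where the latency on a fragmented system comes from.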

This is why, even if you never hit an OOM, fragmented memory on kernel 3.10 can slow tasks down whilst compaction and reclamation take place, and why this kernel should be considered a poor fit for database-type applications. In particular, the "slow" path of memory reclamation can iterate many times; a limit of 16 iterations was placed on this in 4.12. Consider 4.6+.

 

Applies To Earliest Version

Pre 4.9

Applies To Latest Version

Current Version