Context
Background
Linux does not map virtual memory to physical memory byte by byte. The mapping is done in units called ‘pages’. A single page is 4KB in size, so even if your application performs a smaller allocation, a whole 4KB page of physical memory is assigned to back the virtual memory allocation of the requesting application.
The problem is that with 4KB chunks, having 16GB of RAM creates 4,194,304 pages. This means the kernel must keep a map of over 4 million page allocations, and lookups in that map can be expensive in terms of CPU cycles.
To alleviate the problem, huge pages were introduced. By default, they come in 2 flavours: 2MB and 1GB. Unfortunately, that meant application developers would need to handle huge page allocations and deallocations themselves, keeping track of memory. This wasn’t an easy task, and so Transparent Huge Pages (THP) were invented in Linux. With THP, the Linux kernel allocates 2MB chunks on behalf of the application, and also handles the deallocation and defragmentation of said pages. With THP of 2MB size, 16GB of RAM requires just 8192 pages. That is a much smaller map, which is much faster to traverse. Some limited testing reveals that, under heavy allocation and deallocation, a machine without THP can waste up to 10% of its CPU cycles on map lookups, while with THP this becomes just 2% on the same test.
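The page counts quoted above follow from simple division. As a minimal sketch (the function name page_count is made up for illustration), the arithmetic can be reproduced in bash:

```shell
#!/bin/bash
# page_count RAM_GIB PAGE_KIB
# Prints how many pages of PAGE_KIB kilobytes are needed to cover RAM_GIB gigabytes.
# (Hypothetical helper, used only to reproduce the numbers in the text.)
page_count() {
    local ram_gib=$1 page_kib=$2
    echo $(( ram_gib * 1024 * 1024 / page_kib ))
}

echo "16GB in 4KB pages: $(page_count 16 4)"
echo "16GB in 2MB pages: $(page_count 16 2048)"
```

The first line prints 4194304 and the second 8192, matching the figures in the text.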
Problem Description
Unfortunately, THP comes with some issues. The first is fragmentation. If you malloc a 1.1MB chunk, you get a 2MB THP. If you then malloc another 1.1MB chunk, you end up with a second 2MB hugepage. You are now holding 4MB in 2 hugepages while you have only malloc’ed 2.2MB, so more memory is used than was requested.
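The overhead in this example can be worked out by rounding each request up to a whole 2MB page. The sketch below follows the simplified model used in this article (one allocation per hugepage). Real behaviour depends on the allocator and on address alignment, and the helper name thp_backed_kb is invented for illustration:

```shell
#!/bin/bash
THP_KB=2048   # size of one transparent huge page, in KB

# thp_backed_kb REQUEST_KB
# Rounds a request up to the next multiple of the THP size.
# (Hypothetical helper modelling the simplified case described in the text.)
thp_backed_kb() {
    local request_kb=$1
    echo $(( (request_kb + THP_KB - 1) / THP_KB * THP_KB ))
}

request_kb=1126   # roughly 1.1MB
backed_kb=$(( $(thp_backed_kb "$request_kb") * 2 ))
echo "requested: $(( request_kb * 2 )) KB, backed by: ${backed_kb} KB"
```

Two requests of about 1.1MB each come out as 4096 KB of backing memory, which is the 4MB described above.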
The kernel has the ability to transform hugepages to normal pages and to defragment THP. Kernel defragmentation moves smaller allocations which fit into a single THP into that one THP, freeing other small pages and THPs. This comes at a cost in the form of latency spikes. While the kernel runs a defrag on, say, 3 THPs, those pages are locked (blocking application access to them) until the defrag is done. An application therefore sees a small latency spike whenever it requests memory pages which the kernel is defragmenting.
Another issue is that alternative memory allocators, such as jemalloc, don’t play nicely with THP. jemalloc uses madvise extensively to notify the operating system that it’s done with a range of memory which it had previously allocated. With transparent huge pages, the page size is 2MB, while a lot of the memory marked with madvise(…, MADV_DONTNEED) lies in ranges substantially smaller than 2MB. The operating system is then never able to evict pages containing ranges marked as MADV_DONTNEED, because the entire page would have to be unneeded to allow it to be reused. This may result in a huge amount of memory being lost that cannot be freed. As such, jemalloc statistics will show much less memory usage than the operating system reports as resident memory for said process.
Check
To check the process’s RSS (resident) memory usage:
$ ps -eo rss,comm |grep asd
287039291392 asd
To check the asd process jemalloc statistics:
$ asinfo -v 'jem-stats:'
This will dump the jemalloc statistics to either console (to view: journalctl -u aerospike.service) or to /var/log/(syslog|messages). There will be a printout which will look like this:
Allocated: 136475344544, active: 156852981760, metadata: 7333247744, resident: 174832066560, mapped: 287039291392, retained: 25113395200
The RSS number obtained from the ps command should be the same as the mapped number obtained from the jemalloc dump, or very close. If they differ substantially, you have most likely fallen victim to THP (a memory leak cannot be excluded, but the THP issue is much more plausible at this stage).
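The comparison can be scripted. The sketch below assumes both values have already been converted to bytes (procps ps reports RSS in KiB, while jemalloc reports bytes). The helper name gap_pct and the 10% threshold are arbitrary choices for illustration, not values recommended by Aerospike:

```shell
#!/bin/bash
# gap_pct RSS_BYTES MAPPED_BYTES
# Prints how far RSS exceeds mapped, as a whole percentage of mapped.
# (Hypothetical helper; both arguments must be in the same unit.)
gap_pct() {
    local rss=$1 mapped=$2
    echo $(( (rss - mapped) * 100 / mapped ))
}

rss_bytes=287039291392      # sample value from the ps output above
mapped_bytes=287039291392   # sample "mapped" value from the jem-stats dump above
threshold_pct=10            # assumed cut-off for "differ substantially"

gap=$(gap_pct "$rss_bytes" "$mapped_bytes")
if [ "$gap" -gt "$threshold_pct" ]; then
    echo "RSS exceeds mapped by ${gap}%: THP is a likely suspect"
else
    echo "RSS and mapped agree within ${threshold_pct}% (gap: ${gap}%)"
fi
```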
You can also check this without running jem-stats, by finding the following log line (introduced in version 3.10.1) in your Aerospike logs (details in the log reference manual):
Jun 14 2018 06:18:54 GMT: INFO (info): (ticker.c:241) system-memory: free-kbytes 259843576 free-pct 98 heap-kbytes (2518411,2584772,3786752) heap-efficiency-pct 66.5
The heap-efficiency-pct value is normally affected by fragmentation of secondary indexes. The number we are interested in is heap-kbytes, particularly the third number in the triplet. This is heap_mapped_kbytes, and should match the jemalloc stats “mapped” value (expressed here in kilobytes).
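To pull the third number of the heap-kbytes triplet out of a ticker line, something along these lines may help. It is a sketch that assumes the log line format shown above, and the function name heap_mapped_kbytes is only borrowed from the field name for readability:

```shell
#!/bin/bash
# heap_mapped_kbytes
# Reads log lines on stdin and prints the third value of each heap-kbytes triplet.
# (Assumes the "heap-kbytes (a,b,c)" layout of the sample log line.)
heap_mapped_kbytes() {
    sed -n 's/.*heap-kbytes (\([0-9]*\),\([0-9]*\),\([0-9]*\)).*/\3/p'
}

line='Jun 14 2018 06:18:54 GMT: INFO (info): (ticker.c:241) system-memory: free-kbytes 259843576 free-pct 98 heap-kbytes (2518411,2584772,3786752) heap-efficiency-pct 66.5'
printf '%s\n' "$line" | heap_mapped_kbytes
```

For the sample line this prints 3786752. In practice the function would be fed from the Aerospike log file or from journalctl output instead of a fixed string.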
See Understanding linux memory usage reporting for more details on how to read memory allocation.
Hugepage numbers can be reviewed further in the details of the jem-stats dump and through the kernel’s nr_hugepages and meminfo:
$ cat /proc/sys/vm/nr_hugepages
$ cat /proc/meminfo |grep -i huge
The /proc/<pid>/smaps output can also be used to capture the THP pages being used by the aerospike process:
cat /proc/`pgrep asd`/smaps|grep AnonH|awk '{sum+=$2;} END{print sum;}'
Method
Since jemalloc already performs its own memory management, ensuring low memory fragmentation, delayed deallocations and other improvements, there is little (if any) gain from THP itself. With jemalloc, THP is actually more harmful than helpful, since the process ends up holding on to memory it thinks it has released. Therefore, for applications which cannot tolerate latency spikes and which use an alternative memory allocator, it is best to just turn THP off.
To disable THP at runtime:
RHEL systems:
echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/defrag
Other kernels:
echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
echo "never" > /sys/kernel/mm/transparent_hugepage/defrag
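To confirm the setting took effect, the same files can be read back. The kernel lists every allowed value and puts the active one in square brackets, for example "always madvise [never]". The sketch below extracts the bracketed value. The helper name thp_active_value is made up for illustration, and the paths are those of non-RHEL kernels (substitute redhat_transparent_hugepage where applicable):

```shell
#!/bin/bash
# thp_active_value "CONTENTS"
# Prints the value enclosed in [brackets] in the contents of a THP sysfs file.
# (Hypothetical helper.)
thp_active_value() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag; do
    # Skip quietly on kernels that do not expose this path
    if [ -r "$f" ]; then
        echo "$f: $(thp_active_value "$(cat "$f")")"
    fi
done
```

Both files should report never once THP has been disabled.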
Unfortunately, THP must be disabled BEFORE the aerospike process starts. If asd was already running, you will have to restart it to ensure all previously malloc’d space is freed: at the very least with a cold start, but preferably by rebooting the whole machine. It is therefore best to disable THP at boot time, before aerospike starts, and then restart the OS:
In order to disable THP on sysVinit systems (non-systemd):
Create /etc/init.d/disable-transparent-hugepages with the following contents:
#!/bin/bash
### BEGIN INIT INFO
# Provides: disable-transparent-hugepages
# Required-Start: $local_fs
# Required-Stop:
# X-Start-Before: aerospike
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Disable Linux transparent huge pages
# Description: Disable Linux transparent huge pages, to improve
# database performance.
### END INIT INFO
case $1 in
  start)
    if [ -d /sys/kernel/mm/transparent_hugepage ]; then
      thp_path=/sys/kernel/mm/transparent_hugepage
    elif [ -d /sys/kernel/mm/redhat_transparent_hugepage ]; then
      thp_path=/sys/kernel/mm/redhat_transparent_hugepage
    else
      # Kernel has no THP support; 'exit' is used because 'return' is only valid inside a function
      exit 0
    fi
    echo 'never' > "${thp_path}/enabled"
    echo 'never' > "${thp_path}/defrag"
    # khugepaged/defrag takes 0/1 on some kernels and yes/no on others
    re='^[0-1]+$'
    if [[ $(cat "${thp_path}/khugepaged/defrag") =~ $re ]]
    then
      echo 0 > "${thp_path}/khugepaged/defrag"
    else
      echo 'no' > "${thp_path}/khugepaged/defrag"
    fi
    unset re
    unset thp_path
    ;;
esac
Make the file executable:
chmod +x /etc/init.d/disable-transparent-hugepages
Enable startup script:
# on debian/ubuntu
update-rc.d disable-transparent-hugepages defaults
# on RHEL/centos
chkconfig --add disable-transparent-hugepages
In order to disable THP on systemd systems:
First, create a file with the contents of the above startup script, but store it in /usr/local/bin/disable-transparent-huge-pages.sh
Make it executable:
chmod +x /usr/local/bin/disable-transparent-huge-pages.sh
Create /etc/systemd/system/disable-transparent-huge-pages.service with the following:
[Unit]
Description=Disable Transparent Huge Pages

[Service]
Type=oneshot
ExecStart=/bin/bash /usr/local/bin/disable-transparent-huge-pages.sh start

[Install]
WantedBy=multi-user.target
Enable the service:
systemctl daemon-reload
systemctl enable disable-transparent-huge-pages.service
After creating the startup scripts:
In all cases, it’s best to restart the OS afterwards, freeing any and all THPs which may still be held by applications. THP is disabled from the time the script runs, so some drivers and applications that started earlier may still hold a few THPs, but this won’t be a problem.
Notes
Note about Persistent Huge Pages (PHP)
THP is the kernel’s attempt at transparently being more efficient, which for anything other than a desktop is not advisable. For Aerospike, we suggest not using THP.
With regard to Persistent Huge Pages (PHP), this is something an application needs to request on a per-need basis. Much like Aerospike uses jemalloc for memory management, other applications may be using a different allocator (leveraging the PHP technology).
If an application doesn’t request PHP, it will not get them. If an application requests PHP, it will use them. Only pages used by that application will be huge (and only for that application).
Aerospike simply would not make any use of the Persistent Huge Pages, so allowing them would not help Aerospike performance at all.
To reiterate: from Aerospike’s perspective, having PHP allowed on the machine makes no difference. Aerospike will not use Persistent Huge Pages. Other applications on the same VM may use PHP.