Detail
How are buffers and caches used within Aerospike, and how can they be configured?
Answer
Aerospike Buffers and Caches
Current Write Blocks
When a record is written and needs to be stored on disk, it is put into an in-RAM buffer that holds a current write block — a write block that asd is currently filling up. Aerospike maintains multiple current write blocks simultaneously; this helps optimize the post-write cache by segregating writes into separate blocks (see cache-replica-writes below). Defragmentation also uses its own separate current write block, and defrag writes do not go into the post-write cache.
When a current write block is full, it is persisted to disk and asd starts a new one. All writes to a write block are coalesced into a single write-block-size device write, leading to low write IOPS.
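The coalescing effect can be illustrated with a toy model. This is a minimal sketch, not Aerospike's implementation: the class name `WriteBlockBuffer`, the 1 MiB block size, and the 512-byte record size are all assumptions chosen for illustration.

```python
class WriteBlockBuffer:
    """Toy model of a current write block: records accumulate in RAM and
    are flushed to the device as one large write when the block fills."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.buffer = bytearray()
        self.device_writes = 0  # number of block-sized device writes issued

    def write_record(self, record: bytes) -> None:
        if len(record) > self.block_size:
            raise ValueError("record larger than write block")
        # If the record does not fit, persist the current block first.
        if len(self.buffer) + len(record) > self.block_size:
            self.flush()
        self.buffer += record

    def flush(self) -> None:
        if self.buffer:
            self.device_writes += 1  # one device write per block
            self.buffer = bytearray()


buf = WriteBlockBuffer(block_size=1024 * 1024)  # hypothetical 1 MiB block
for _ in range(10_000):
    buf.write_record(b"x" * 512)  # 10,000 records of 512 bytes each
buf.flush()  # flush the final, partially filled block
print(buf.device_writes)
```

In this model, 10,000 record writes turn into 5 device writes, which is the reason write IOPS stay low.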
Post-Write Cache
The most recently persisted write blocks are kept in RAM after being written to disk, on the assumption that subsequent reads are more likely to hit recently written records than older ones. XDR in particular reads recently written records in order to ship them. If such a record is read, its data is served from the write block in the post-write cache rather than from the device.
The size of the post-write cache is configurable; increasing it retains more recently-written blocks in memory, which can significantly reduce read I/O for write-heavy workloads or workloads where data is frequently read shortly after being written.
Note: Prior to Aerospike 7.1, this was named post-write-queue. The configuration directive was renamed to post-write-cache in version 7.1.
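The behaviour described above can be modelled as a bounded cache of recently flushed blocks. This is a simplified sketch under assumptions: the class `PostWriteCache`, its method names, and the block-count sizing are hypothetical and do not reflect asd's internal data structures.

```python
from collections import OrderedDict


class PostWriteCache:
    """Toy model: keeps the N most recently flushed write blocks in RAM."""

    def __init__(self, max_blocks: int):
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()  # block_id -> bytes, oldest first

    def on_block_flushed(self, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = data
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)  # drop the oldest block

    def read(self, block_id: int, offset: int, size: int):
        """Return record bytes on a hit, or None (caller reads the device)."""
        block = self.blocks.get(block_id)
        if block is None:
            return None
        return block[offset:offset + size]


cache = PostWriteCache(max_blocks=2)
for block_id in range(3):
    cache.on_block_flushed(block_id, bytes([block_id]) * 8)

hit = cache.read(2, 0, 4)   # recent block: served from RAM
miss = cache.read(0, 0, 4)  # oldest block was evicted: go to the device
```

A larger `max_blocks` keeps older writes readable from RAM, at the cost of more memory.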
Page Cache
In general, the Linux kernel performs device I/O via the page cache. When a process writes data, the data goes to the page cache (i.e., to RAM) and the Linux kernel writes it to the device asynchronously. The page cache operates with a granularity of 4-KiB pages. Two small writes to the same 4-KiB page in rapid succession are coalesced into one 4-KiB write. Between data in a page getting modified (in RAM) and the page actually getting written to disk, the page is said to be dirty.
Reads also go through the page cache. When a process reads data from a device, the Linux kernel reads the 4-KiB pages into the page cache, then copies the data to the process's read buffer. The page cache uses an approximately least-recently-used eviction policy, so recently read or written pages are kept in RAM for subsequent reads.
The page cache's lifetime is bounded by the Linux kernel's lifetime. Data will be safe even if a process crashes after writing to the page cache but before the kernel writes to disk. The page cache only loses data during a kernel panic or sudden power loss — a clean OS shutdown will flush all dirty pages.
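The gap between "written to the page cache" and "written to the device" can be seen from any process. The sketch below is generic Linux/POSIX usage and not Aerospike code; the function name `write_then_sync` and the file name `demo.dat` are illustrative.

```python
import os
import tempfile


def write_then_sync(path: str, data: bytes) -> int:
    """Write data through the page cache, then force it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = os.write(fd, data)  # returns once data is in the page cache
        os.fsync(fd)                  # blocks until the dirty pages are written out
        return written
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.dat")
    n = write_then_sync(demo_path, b"a" * 4096)  # one 4-KiB page
    size_on_disk = os.path.getsize(demo_path)
```

Without the `os.fsync` call, the data would still survive a crash of the writing process, but not a kernel panic or sudden power loss before the kernel's asynchronous writeback.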
Hardware Caches
Disk devices and controllers can contain caches to coalesce writes and keep recently accessed data in RAM — the difference being that this RAM sits on the device or controller itself. Behaviour and guarantees vary per device; some caches are battery-backed (safe against sudden power loss), others are not.
Cache Hierarchy
The cache hierarchy consists of three layers:
- The current write block
- The page cache
- The hardware caches
All three can delay the persistence of written record data, temporarily keeping it in RAM where it can be affected by events such as sudden power loss. The post-write cache does not factor into this — it only retains already-written data and does not delay data on its way to persistent storage.
Written data moves through these layers in order:
- asd keeps data in the current write block (unless commit-to-device is true).
- When asd flushes a write block, the page cache may further delay persistence.
- Once the kernel writes from the page cache to the device, hardware caches may delay further.
- Finally, the device firmware moves data from the hardware cache to persistent media; only then is the data fully safe.
Configuration Options
Aerospike Server 4.3.1 and Above
Default Behaviour — Devices (O_DIRECT + O_DSYNC)
By default, reads from and writes to devices use O_DIRECT and O_DSYNC. This bypasses layers two and three of the hierarchy (the page cache and hardware caches). Data can still be lost in layer one (the current write block), but that loss window is bounded by how often asd flushes partial write blocks — controlled by flush-max-ms.
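For readers unfamiliar with these flags, the sketch below shows how a process would request them when opening a device or file. It illustrates the flags only and is not how asd is written; the helper name `storage_open_flags` is hypothetical. Note that O_DIRECT is Linux-specific and generally requires buffers, offsets, and sizes aligned to the device's block size.

```python
import os


def storage_open_flags(bypass_caches: bool) -> int:
    """Build open(2) flags; optionally request O_DIRECT and O_DSYNC."""
    flags = os.O_RDWR
    if bypass_caches:
        # O_DIRECT: transfer data directly, bypassing the page cache.
        # Fall back to 0 on platforms where the constant is not defined.
        flags |= getattr(os, "O_DIRECT", 0)
        # O_DSYNC: each write returns only after the data is durable.
        flags |= getattr(os, "O_DSYNC", 0)
    return flags


cached_flags = storage_open_flags(bypass_caches=False)
direct_flags = storage_open_flags(bypass_caches=True)
```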
Default Behaviour — Files (No Flags)
By default, reads from and writes to files do not use any flags. All reads are cached in the page cache, and writes can theoretically be lost in the page cache or hardware caches. While the layer-one loss window is bounded by flush-max-ms, there are no definitive bounds for layers two and three. Data loss in those layers requires a kernel panic or sudden power loss; an asd crash alone will not cause data loss there.
direct-files
Enables O_DIRECT and O_DSYNC for files, bringing files in line with the default behaviour for devices. Reads and writes bypass the page cache and hardware caches. Data can still be lost in layer one (the current write block).
flush-max-ms
Configures the interval (in milliseconds) at which asd writes a partially filled current write block to the device. This bounds the risk of data loss in layer one to the given interval. Note that this only addresses the first layer; if used with files without direct-files, caching still occurs in the page cache and hardware caches.
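As a rough way to reason about this bound, the number of records that exist only in the current write block can be estimated from the write rate. This is a back-of-the-envelope sketch assuming a steady write rate; the function name and the example figures are illustrative, and a block that fills up sooner is flushed sooner, so the real exposure may be smaller.

```python
def max_records_at_risk(writes_per_second: float, flush_max_ms: int) -> float:
    """Rough upper bound on records held only in the current write block."""
    return writes_per_second * flush_max_ms / 1000.0


# Hypothetical workload: 2,000 writes per second, flushing every 1,000 ms.
at_risk = max_records_at_risk(writes_per_second=2_000, flush_max_ms=1_000)

# Halving the interval halves the exposure, at the cost of more partial flushes.
at_risk_half = max_records_at_risk(writes_per_second=2_000, flush_max_ms=500)
```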
commit-to-device
Takes flush-max-ms to its logical conclusion: synchronously writes record data to the underlying device during each write transaction. Unlike flush-max-ms, this affects all three cache layers. For files, it also enables O_DIRECT and O_DSYNC, making direct-files redundant when using this directive. The result is that caching is disabled across all three layers.
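The interaction between storage type, direct-files, and commit-to-device described so far can be summarised as a small decision function. This is only a restatement of the text above in code form; the function name `delaying_layers` is hypothetical.

```python
def delaying_layers(is_file: bool, direct_files: bool = False,
                    commit_to_device: bool = False) -> list:
    """Return the cache layers that can still delay persistence of a write."""
    if commit_to_device:
        # Synchronous writes with O_DIRECT and O_DSYNC: no layer delays data.
        return []
    layers = ["current write block"]  # bounded by flush-max-ms
    if is_file and not direct_files:
        # Files without flags go through the page cache and hardware caches.
        layers += ["page cache", "hardware caches"]
    return layers


device_default = delaying_layers(is_file=False)
file_default = delaying_layers(is_file=True)
file_direct = delaying_layers(is_file=True, direct_files=True)
file_commit = delaying_layers(is_file=True, commit_to_device=True)
```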
read-page-cache
Removes O_DIRECT and O_DSYNC for record reads done by transactions, so read data also populates the page cache. A subsequent read of the same record can be served from the page cache without hitting the device. This directive does not affect write behaviour or any write guarantees.
Kubernetes / cgroup consideration: In containerised environments, page cache memory consumed by the Aerospike container counts against the container's cgroup memory limit in both cgroups v1 and v2. If RSS plus page cache exceeds the limit, the OOM killer fires — even if the application's own heap is well within bounds. A known kernel bug on kernels 4.19+ and 5.4+ can cause premature OOM kills when the kernel fails to reclaim reclaimable pages in time. cgroups v2 does not eliminate this risk — the fundamental accounting is unchanged — but it does add a memory.high throttling stage before the hard OOM threshold (which may manifest as latency spikes) and PSI metrics (memory.pressure) for earlier visibility into memory pressure.
When enabling read-page-cache in Kubernetes, ensure the container memory limit has sufficient headroom above the Aerospike process RSS to accommodate page cache growth. Tuning vm.dirty_bytes and vm.dirty_background_bytes to lower values can also reduce risk. Alternatively, avoid read-page-cache in tightly-sized pods and rely on Aerospike's own in-process caching such as the post-write-cache, or use storage-engine memory (which since 7.0 operates within a bounded SHMEM allocation).
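To see how much of a container's memory usage is page cache, the cgroup v2 `memory.stat` file can be inspected; its `anon` and `file` counters report anonymous memory and page cache in bytes. The sketch below parses a sample of that format. The sample values are made up, and the path `/sys/fs/cgroup/memory.stat` (mentioned in the comment) is an assumption that depends on how the container runtime mounts the cgroup filesystem.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse 'key value' lines as found in a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Made-up sample; inside a container this would typically be read from
# /sys/fs/cgroup/memory.stat (assumed mount point).
sample = "anon 3221225472\nfile 1073741824\nkernel_stack 65536\n"
stats = parse_memory_stat(sample)

# Share of anon + file memory that is page cache (reclaimable in principle).
page_cache_share = stats["file"] / (stats["anon"] + stats["file"])
print(f"page cache share: {page_cache_share:.0%}")
```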
cache-replica-writes
Defaults to false. When false, replica writes are directed to a separate current write block from client-originated writes, so replica data does not displace client-written data in the post-write cache. This improves cache efficiency for client read workloads and is the recommended default. Enabling it causes replica writes to share write blocks with client writes, diluting the post-write cache with data less likely to be read locally.
Storage Compression
Aerospike can compress records before writing to disk using LZ4, Snappy, or zstd. Smaller on-disk records mean fewer bytes read and written per operation, directly reducing device I/O bandwidth. Compression also improves cache behaviour: smaller records mean more fit into the post-write cache and, when read-page-cache is enabled, into the page cache. More records also fit per write block, which improves defrag efficiency. zstd is generally recommended for the best balance of compression ratio to CPU overhead on modern hardware.
defrag-queue-min
While not directly related to buffering or caching, this directive can significantly reduce device I/O. Aerospike reclaims disk space by defragmenting — reading write blocks (wblocks) that have fallen below the defrag-lwm-pct usage threshold, rewriting the remaining live records, and freeing the block. defrag-queue-min sets the minimum number of eligible wblocks that must be queued before defragmentation begins processing. Raising this value lets blocks accumulate longer before defrag starts, giving more records time to expire or be overwritten — meaning blocks will be emptier when finally defragmented, resulting in less write amplification.
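The effect of defragmenting emptier blocks can be quantified with simple arithmetic. If a block is still a fraction u full of live data when it is defragmented, freeing it reclaims (1 - u) of a block while rewriting u of a block, so the data rewritten per unit of space reclaimed is u / (1 - u). The sketch below is a simplified model that ignores record granularity and other overheads; the function name is illustrative.

```python
def defrag_rewrite_ratio(live_fraction: float) -> float:
    """Bytes rewritten per byte of space reclaimed when defragmenting a
    block that is still `live_fraction` full of live records."""
    if not 0.0 <= live_fraction < 1.0:
        raise ValueError("live_fraction must be in [0, 1)")
    return live_fraction / (1.0 - live_fraction)


at_half_full = defrag_rewrite_ratio(0.50)     # 1 byte rewritten per byte freed
at_quarter_full = defrag_rewrite_ratio(0.25)  # about 0.33 bytes per byte freed
```

Letting blocks drain from 50% to 25% live data before defragmenting them cuts the rewrite cost to roughly a third in this model.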
Page Cache Coherence
If reads are cached in the page cache but writes bypass it, could reads return stale data? No — the Linux kernel guarantees page cache coherence. Even when writes bypass the page cache, a write to some data invalidates any cached copy of that data. Writes do not overwrite data in the page cache; they invalidate it. A subsequent read of that data is therefore forced to hit the device and fetch the fresh version.
Linux Kernel Tuning for read-page-cache
When read-page-cache is enabled, the Linux kernel page cache is back in play for Aerospike reads. The following kernel and OS settings can help it work effectively:
| Setting | Recommendation |
|---|---|
| vm.vfs_cache_pressure | Lower to 50 (from default 100) to bias toward retaining page cache entries longer. |
| I/O scheduler | Use none or mq-deadline for NVMe devices. |
| Transparent Huge Pages | Disable: set to [never]. |
| vm.swappiness | Set low: 0 or 1. |
| min_free_kbytes | At least 1.1 GB (1.25 GB when using cloud vendor drivers). |
| zone_reclaim_mode | Set to 0 on NUMA systems. |
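The current values of most of these settings can be read from procfs and sysfs. The following is a minimal read-only sketch for a Linux host. The helper names are illustrative; the paths are the standard locations for these settings, and on systems where a file is missing the script prints None. The I/O scheduler is set per device and is not covered here.

```python
from pathlib import Path


def active_thp_mode(raw: str) -> str:
    """Return the bracketed (active) choice, e.g. 'never' from
    'always madvise [never]'."""
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return ""


def read_setting(path: str):
    """Return the file's stripped contents, or None if it is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


SETTINGS = {
    "vm.vfs_cache_pressure": "/proc/sys/vm/vfs_cache_pressure",
    "vm.swappiness": "/proc/sys/vm/swappiness",
    "vm.min_free_kbytes": "/proc/sys/vm/min_free_kbytes",
    "vm.zone_reclaim_mode": "/proc/sys/vm/zone_reclaim_mode",
}

for name, path in SETTINGS.items():
    print(name, "=", read_setting(path))

thp = read_setting("/sys/kernel/mm/transparent_hugepage/enabled")
print("transparent_hugepage =", active_thp_mode(thp) if thp else None)
```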