Articles in this section

Read latency with data-in-memory due to disk slowdown

Problem Description

In an Aerospike cluster where a namespace is configured for data-in-memory true the expectation is that reads would not be impacted by a slow down on the disk.  In practice, when extreme slowdowns are observed, significant read latency may be present during disk issues, even though reads are served from RAM not disk.

Explanation

This read latency can be explained by two phenomena that occur as a result of a slow storage device to which writes are being persisted.  These are service thread starvation and, to a far lesser extent, locking of the primary index. 

 

Service thread starvation occurs due to the periodic partial flush of write blocks to persistent storage.  Write blocks are flushed when they are full or, periodically at an interval controlled by the flush-max-ms parameter.  The partial flush serves to ensure that data does not sit in volatile memory for excessive amounts of time when write load is low and blocks are being filled comparatively slowly.  The default interval for flush-max-ms is 1 second meaning that when write load is sufficiently low that it takes longer than 1s to fill them, partial blocks will be flushed every second.  When a partial block is flushed it is locked for writing.

 

If the storage to which the block is being flushed is slow the partial flush can take time and the write block is not available to be written to.  Service threads with write transactions are therefore forced to wait until the lock on the block is released.

 

In a circumstance where a node processes 10,000 tps with a 90/10 read/write split 1000 writes will be processed each second.  If there is a three-second delay on the storage this is sufficient to consume 3000 service threads which exceeds the amount of threads configured on most Aerospike nodes.  This example assumes a single device, as there is a write thread for every device in a namespace the numbers given above will change accordingly.

 

Reviewing those calculations it is simple to see how a delay on the disk, with the associated locked partial block, can burn through all configured service threads in relatively short order.  This is the primary cause of read latency as service threads are responsible for demartialling transactions from the network interface.  Once all threads are stuck waiting for the lock on the partial block to be released, no transactions can be processed.

 

A secondary impact, also caused by the partial flush, is a lock taken on the relevant sprigs (sub-divisions of the Aerospike primary index) which prevents other transactions from using those sprigs.  Though thread starvation is a far more immediate issue, locks will also impact anything attempting to use that sprig.  Operations such as scans or the internal NSUP process, if configured, will be impacted by these locks which are held for the duration of the partial flush.


 

Solution

The partial flush functionality can be switched off by setting flush-max-ms to 0, this completely decouples the storage from the service threads and removes any locking issues.  The trade-off here is that partial blocks will remain uncommitted to disk for longer periods.

 

A longer-term solution would be to review use of persistent storage which introduces delays of a second or more.  Storage with this level of performance will likely exhibit issues elsewhere in the system even when partial flushes are managed.


 

Notes

The following symptoms can be used to spot partial flush issues.

·  High w_await values in the iostat command showing writes to the storage taking excessive time

·  Netstat showing that the network interface receive queue is growing in a disproportionate manner as transactions sit waiting for service threads

·  Connection counts on nodes spiking as clients open more connections to attempt to maintain traffic levels as existing connections are not released in a timely manner.

·  Jump in NSUP execution time (or any other process scanning the primary index such as tomb raider)

·  drv_ssd context log messages stuttering in logs – the thread that writes these is the thread doing the partial flush.  If that thread is stuck the drv_ssd messages will not be written at the normal configured ticker interval (usually 20s)

·  Read latency on namespaces configured with data-in-memory true with any or all of the above conditions satisfied

Applies To Earliest Version

4.9

Applies To Latest Version

Current Version
Was this article helpful?
0 out of 0 found this helpful