Detail
After a node with namespaces running in AP mode has cold started, records that were previously deleted can be observed to have returned to the cluster. Why is that?
Answer
Aerospike supports two methods of deleting records:
- Expunge - The original (and default) delete, sometimes referred to as expunge: the record is removed from the index, but its corresponding entry is left on the storage layer (if the namespace is persisted). Such deletes free up the memory immediately (64 bytes per record). The block on disk containing the record's value will eventually be defragmented (once its used capacity falls under the defrag-lwm-pct threshold) and made available for new write transactions. Only when new write transactions overwrite such defragmented blocks are the old record values removed from the storage layer.
- Durable delete - Starting with version 3.10, a client policy allows records to be durably deleted, which prevents older versions of such records from reappearing upon cold restarts (or upon the addition/removal of nodes within a specific period of time). Refer to the following page for details: Documentation | Aerospike.
This document refers to cold restart scenarios where records have been expunged (and not durably deleted).
Record values de-referencing from the primary index
When the server does a cold restart, the storage layer is scanned in order to rebuild the primary index. There are scenarios where records that were dereferenced from the index can get indexed again. Refer to the following page for details on cold restarts: Documentation | Aerospike.
Here are different ways for a record’s value to be obsoleted on the persisted storage layer in Aerospike (dereferenced from the primary index):
i. Application deletes (including removal of the last bin of a record).
ii. Expirations (For records with ttl set).
iii. Evictions (records deleted because either the disk or memory high water mark has been breached). A record needs to have a ttl set to be evicted; a record without a ttl will neither expire nor be evicted.
iv. Updates to the existing records, since Aerospike does not do in-place updates (it always writes new records as a whole in the current streaming write buffer (swb) which always starts in a completely empty block).
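The effect of these four cases can be pictured with a toy model. The sketch below is not Aerospike code: the `ToyNamespace` class and its methods are hypothetical names chosen for illustration, and the "storage" is just an append-only Python list. It only shows that updates and expunges leave older copies of a record behind on storage while the index no longer points to them.

```python
class ToyNamespace:
    """Append-only storage plus an in-memory index (illustrative only)."""

    def __init__(self):
        self.storage = []   # every version ever written, in write order
        self.index = {}     # key -> position of the live version in storage

    def write(self, key, value):
        # Updates are never done in place: a full new copy is appended.
        self.storage.append((key, value))
        self.index[key] = len(self.storage) - 1

    def expunge(self, key):
        # A non-durable delete only drops the index entry.
        self.index.pop(key, None)

    def stale_copies(self, key):
        # Copies still on storage that the index no longer references.
        live = self.index.get(key)
        return [value for pos, (k, value) in enumerate(self.storage)
                if k == key and pos != live]


ns = ToyNamespace()
ns.write("user1", "v1")
ns.write("user1", "v2")   # v1 is now dereferenced
ns.expunge("user1")       # v2 is now dereferenced as well
print(ns.index)                  # {}
print(ns.stale_copies("user1"))  # ['v1', 'v2']
```

In this model, both stale copies remain candidates for re-indexing by a cold restart until their blocks are defragmented and overwritten.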
Last Update Time
Starting with version 3.8.3, Aerospike added the last update time of a record to its metadata, to be used for conflict resolution during cold restart. Before the introduction of last-update-time, conflict resolution during cold restart was based on generation; as of version 3.8.3, last update time replaces generation for that purpose.
Let’s go over the same examples.
1- Record updated
- Record created (gen-1)
- Record updated (gen-2)
- Record updated (gen-3)
Since the version with gen-3 has the latest last-update-time, it is the one that prevails. In the case of generation wrap-around, the correct version of the record still prevails, because the last update time is absolute and guarantees that the most recent version of the record wins any conflict resolution.
2- Record deleted and re-created
- Record created (gen-1)
- Record deleted
- Record re-created (gen-1)
In this case, the last update time-based conflict resolution guarantees that the most recent version will be re-indexed, despite potentially having 2 versions of the record with the same generation (if the initial one had not been overwritten by new write transactions).
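As a rough illustration of why last update time resolves the tie that generation cannot, here is a minimal sketch. The `Version` tuple, the `rebuild_index` function, and the timestamp values are all hypothetical; the real server logic is more involved.

```python
from collections import namedtuple

# Hypothetical, simplified view of the metadata stored with each version.
Version = namedtuple("Version", "key generation last_update_time value")


def rebuild_index(scanned):
    """Keep, for each key, the scanned version with the latest last update time."""
    index = {}
    for rec in scanned:
        current = index.get(rec.key)
        if current is None or rec.last_update_time > current.last_update_time:
            index[rec.key] = rec
    return index


# Both versions carry generation 1, so generation alone cannot order them.
old = Version("k", 1, 100, "first incarnation")
new = Version("k", 1, 250, "re-created")

# The result does not depend on the order in which the versions are scanned.
print(rebuild_index([old, new])["k"].value)  # re-created
print(rebuild_index([new, old])["k"].value)  # re-created
```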
3- Record updated several times and then deleted
- Record created (gen-1)
- Record updated (gen-2)
- Record updated (gen-3)
- Record deleted
Very similar to example 3 prior to version 3.8.3: depending on which versions of the record are still present on the persisted layer, the version with the most recent last update time will end up being re-indexed upon cold restart.
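A small sketch of this outcome, assuming each surviving version is represented by a hypothetical `(generation, last_update_time)` pair with made-up timestamps. An expunge leaves nothing on storage to mark the delete, so the result depends only on which versions have not yet been overwritten.

```python
def reindexed_version(surviving):
    """Return the surviving (generation, last_update_time) pair with the
    latest last update time, or None if no version is left on storage."""
    return max(surviving, key=lambda v: v[1], default=None)


gen1, gen2, gen3 = (1, 100), (2, 200), (3, 300)

print(reindexed_version([gen1, gen2, gen3]))  # (3, 300)
print(reindexed_version([gen1, gen2]))        # (2, 200): gen-3 block was overwritten
print(reindexed_version([]))                  # None: nothing left to resurrect
```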
4- Record created without a ttl but then updated with a ttl
- Record created without a ttl (gen-1 / no ttl)
- Record updated with a ttl (gen-2 / ttl set)
- Record expires
Again, this is very similar to example 4 prior to version 3.8.3. The order in which the different versions are scanned determines the version that will be re-indexed, if any.
5- Record created with a ttl but then updated with a ttl that would make it expire sooner
- Record created with ttl1 (gen-1 / ttl1 - void time t1)
- Record updated with ttl2 (gen-2 / ttl2 - void time t2 < t1)
- Record expires
Suppose the gen-1 version of the record is still on disk and a cold restart happens after the gen-2 version has expired. If the gen-2 version is no longer on disk (overwritten by new records after defragmentation), or if it is scanned first (and is skipped since it has expired), the gen-1 version will be resurrected.
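The scan-order dependence can be reproduced with a simplified model. Everything below is an assumption made for illustration: the `Version` tuple, the `cold_restart` function, the timestamps, the use of `void_time == 0` to mean "no expiry", and in particular the choice to drop the index entry when the version that wins conflict resolution turns out to be expired. That last choice was made so that the model matches the behavior described above; it is not taken from the server source.

```python
from collections import namedtuple

# Hypothetical, simplified metadata; void_time 0 means "never expires" here.
Version = namedtuple("Version", "generation last_update_time void_time")


def cold_restart(scan_order, now):
    """Return the version left in the index for one key after scanning."""
    entry = None
    for rec in scan_order:
        # Conflict resolution: an already indexed, newer version wins.
        if entry is not None and rec.last_update_time <= entry.last_update_time:
            continue
        # Assumed behavior: an expired winner is skipped and any older
        # entry it superseded is dropped from the index.
        if rec.void_time != 0 and rec.void_time <= now:
            entry = None
            continue
        entry = rec
    return entry


gen1 = Version(1, 100, 1000)  # void time t1 = 1000
gen2 = Version(2, 200, 500)   # void time t2 = 500 < t1
now = 600                     # cold restart happens after t2, before t1

print(cold_restart([gen2, gen1], now))  # gen-1 version is resurrected
print(cold_restart([gen1], now))        # gen-2 overwritten: gen-1 is resurrected
print(cold_restart([gen1, gen2], now))  # None: nothing is re-indexed
```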