Detail
When a delete transaction is durable, a tombstone is written. The tombstone-write is similar to a record-update in that -
- It continues to occupy an entry in the index, alongside other record entries.
- It is persisted on disk, together with previous copies of the record on disk.
- It has the same meta-data as any other record.
- last-update-time – just like normal update.
- expiry-time – this is set to 0 (never expire).
- generation – increments just like normal update.
- It is replicated at the same replication factor specified on the namespace.
- It is migrated the same way current records are migrated.
- It is conflict resolved the same way as data-records.
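The properties above can be sketched as a tombstone write that behaves like an ordinary update which leaves no bins. This is only an illustrative model: the `Record` class, its field names, and the `durable_delete` helper are hypothetical and do not reflect the server's internal structures.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Record:
    # Hypothetical in-memory model of a record's metadata and bins.
    key: str
    last_update_time: int   # assumed to be a millisecond timestamp
    expiry_time: int        # 0 means "never expire"
    generation: int
    bins: Dict[str, object] = field(default_factory=dict)

    @property
    def is_tombstone(self) -> bool:
        # A tombstone is modelled as a record without any bins.
        return not self.bins


def durable_delete(prev: Record, now: int) -> Record:
    """Return the tombstone that replaces `prev` (illustrative only)."""
    return Record(
        key=prev.key,
        last_update_time=now,            # set just like a normal update
        expiry_time=0,                   # tombstones never expire
        generation=prev.generation + 1,  # increments just like a normal update
        bins={},                         # no bins: this is what makes it a tombstone
    )


live = Record("user:1", last_update_time=100, expiry_time=500,
              generation=3, bins={"name": "x"})
tombstone = durable_delete(live, now=200)
```

The tombstone keeps the key and carries its own last-update-time and generation, which is what lets it take part in replication, migration, and conflict resolution like any other record.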
Answer
Tombstone on Cold Start
Tombstones are simply records without any bins.
- A tombstone contains all metadata, including the key.
On a cold start, the disk is scanned to rebuild the in-memory index tree for the records. The versions of each record are compared, and the version with the most recent last-update-time (with record generation as the tiebreak) is restored.
For a record that is durably deleted, the tombstone is just another version that participates in the comparison, and it can prevent any older version of the record from returning. If a tombstone is the most recent version, it is reloaded into the index.
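The comparison described above can be sketched as follows. This is a simplified model under stated assumptions, not the server's cold-start code: the `Version` class and `cold_start` function are hypothetical, and the "disk" is just a list of record versions in scan order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Version:
    # Hypothetical on-disk version of a record.
    key: str
    last_update_time: int
    generation: int
    bins: Tuple[Tuple[str, object], ...] = ()  # empty means tombstone

    @property
    def is_tombstone(self) -> bool:
        return len(self.bins) == 0


def cold_start(versions: Iterable[Version]) -> Dict[str, Version]:
    """Rebuild an index, keeping the newest version of each key.

    "Newest" means the highest last-update-time, with generation
    as the tiebreak, as described in the text.
    """
    index: Dict[str, Version] = {}
    for v in versions:
        current = index.get(v.key)
        if current is None or (
            (v.last_update_time, v.generation)
            > (current.last_update_time, current.generation)
        ):
            index[v.key] = v
    return index


disk = [
    Version("k1", last_update_time=100, generation=1, bins=(("name", "a"),)),
    Version("k1", last_update_time=200, generation=2),  # tombstone for k1
    Version("k2", last_update_time=150, generation=1, bins=(("name", "b"),)),
]
index = cold_start(disk)
```

Here the tombstone for `k1` is the most recent version, so it is the one loaded into the index, and the older live copy of `k1` does not return.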
Tombstone Management
Like data records, tombstones are reclaimed when they are removed. A removed tombstone is dropped from the in-memory index, and its on-disk copy becomes eligible for defragmentation. Index memory is immediately reusable. The storage becomes reusable once the space is defragmented.
A special background mechanism ("Tomb-Raider") removes tombstones that are no longer needed. A tombstone qualifies for removal when -
- There are no previous copies of the record on disk.
- This condition assures that a cold start will not bring back any older copy.
- The tombstone's last-update-time is before a configured time T.
- This condition prevents a node that has been apart from the cluster for up to time T from rejoining and re-introducing an older copy.
- The node is not waiting for any incoming migration.
If all of the conditions above are satisfied, the tombstone will be reclaimed.
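The eligibility check can be expressed as a single predicate. This is a minimal sketch: the `TombstoneState` class, the `is_reclaimable` function, and its parameter names are hypothetical, and the "configured time T" is interpreted here as an age in seconds, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TombstoneState:
    # Hypothetical summary of what is known about one tombstone.
    last_update_time: float          # seconds since epoch (assumed unit)
    has_older_copies_on_disk: bool   # any previous copy of the record on disk?


def is_reclaimable(t: TombstoneState, now: float, eligible_age: float,
                   awaiting_migrations: bool) -> bool:
    """True only when every condition listed above holds."""
    no_older_copies = not t.has_older_copies_on_disk
    old_enough = t.last_update_time < now - eligible_age
    return no_older_copies and old_enough and not awaiting_migrations


one_day = 86400.0
t = TombstoneState(last_update_time=1_000.0, has_older_copies_on_disk=False)
can_remove = is_reclaimable(t, now=100_000.0, eligible_age=one_day,
                            awaiting_migrations=False)
```

Failing any one condition keeps the tombstone: an older copy still on disk, a tombstone that is too young, or a node still waiting for incoming migrations.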
The background thread's work is split into roughly the following steps -
- Iterate through the index and mark all tombstones as potential cenotaphs.
- Scan each disk block for records, un-marking the cenotaph for each record found.
- Iterate through the index again. All remaining cenotaphs are candidates for permanent removal.
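The three steps above resemble a mark-and-sweep pass, which can be sketched as below. The model is hypothetical: `IndexEntry`, `tomb_raider_pass`, and the use of a storage offset to tell a tombstone's own on-disk copy apart from previous copies of the same key are assumptions made for illustration, not the server's actual data layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class IndexEntry:
    # Hypothetical index entry: where the current copy lives on disk.
    offset: int
    is_tombstone: bool
    cenotaph: bool = False


def tomb_raider_pass(index: Dict[str, IndexEntry],
                     blocks: List[List[Tuple[str, int]]]) -> List[str]:
    """Return keys whose tombstones have no previous copy left on disk."""
    # Step 1: mark every tombstone as a potential cenotaph.
    for entry in index.values():
        entry.cenotaph = entry.is_tombstone

    # Step 2: scan the disk; any other copy of the key un-marks the cenotaph.
    for block in blocks:
        for key, offset in block:
            entry = index.get(key)
            if entry is not None and entry.cenotaph and offset != entry.offset:
                entry.cenotaph = False

    # Step 3: whatever is still marked is a candidate for removal.
    return sorted(k for k, e in index.items() if e.cenotaph)


index = {
    "a": IndexEntry(offset=10, is_tombstone=True),
    "b": IndexEntry(offset=20, is_tombstone=True),
    "c": IndexEntry(offset=30, is_tombstone=False),
}
blocks = [[("a", 1), ("c", 30)], [("a", 10), ("b", 20)]]
candidates = tomb_raider_pass(index, blocks)
```

In this example key `a` still has an older copy on disk (offset 1), so its tombstone must stay, while the tombstone for `b` is the only copy and becomes a removal candidate. The age and migration conditions would still need to be checked separately.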
For a non-persisted namespace, tombstone removal is handled separately and requires only one index iteration.
Cold start also removes unneeded tombstones as part of the disk reading -
- All tombstones are marked as cenotaphs on initial bring-up.
- If a live record that the tombstone covers is subsequently read, the cenotaph is unmarked and the tombstone stays.
- Otherwise, at the end of cold start, all cenotaphs are deleted.
The conflict resolution policy affects durable delete behaviour across cluster state changes. To guarantee correct propagation of durable deletes, the resolution should be "Resolve by last-update-time".
The reason is a scenario like the following -
- A record is durably deleted.
- A node that holds the tombstone leaves the cluster.
- On the existing cluster, the eligible age passes and the tombstone is removed.
- On the existing cluster, the record is then written again, and its generation starts at 1. If the node now comes back with the tombstone, the tombstone wins when conflict resolution is by generation, which is not the expected outcome.
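The scenario above can be reproduced with two toy resolvers. This is a deliberately simplified sketch: the `Version` class and both resolver functions are hypothetical, the generation resolver ignores any tiebreak rules, and the timestamps and generation values are made-up illustration values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    # Hypothetical view of one copy of a record during conflict resolution.
    last_update_time: int
    generation: int
    is_tombstone: bool


def resolve_by_generation(a: Version, b: Version) -> Version:
    # Simplified: the higher generation wins.
    return a if a.generation >= b.generation else b


def resolve_by_last_update_time(a: Version, b: Version) -> Version:
    # Simplified: the more recent last-update-time wins.
    return a if a.last_update_time >= b.last_update_time else b


# Tombstone held by the node that went away (the record had been updated
# several times before it was deleted, so its generation is high).
returning_tombstone = Version(last_update_time=1_000, generation=5,
                              is_tombstone=True)
# Record rewritten on the cluster after the tombstone was removed there.
rewritten_record = Version(last_update_time=9_000, generation=1,
                           is_tombstone=False)

by_generation = resolve_by_generation(returning_tombstone, rewritten_record)
by_lut = resolve_by_last_update_time(returning_tombstone, rewritten_record)
```

Resolving by generation picks the stale tombstone and hides the newer write, while resolving by last-update-time keeps the rewritten record.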
Effects of Clock
Both expiry-time comparison and last-update-time comparison are susceptible to misbehavior when the local clock changes or inter-node clocks are skewed.
An extreme example -
- A local clock moving back in time can result in a cold-started tombstone (written in original time) incorrectly covering a more recent update (written in moved-back time).
Replica records are persisted with a last-update-time and expiry-time that originate at the master. However, in any comparison, the timestamps are compared against the replica's local clock, so the comparison is only as accurate as the clock skew allows.
- An example of where clock skew can affect delete durability.
- nodeA's clock is ahead of nodeB's clock by dT.
- At time T1, recordX is written to nodeA (master), then nodeB (replica).
- At time T2, nodeA goes down. nodeB becomes master for recordX.
- At time T3, recordX' is written to nodeB.
- If dT > T3 - T1, then the timestamp assigned at T1 can overtake the timestamp assigned at T3.
- This is mitigated by ensuring that, on an update, the update time is always greater than or equal to the update time of the previous version of the record.
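The arithmetic behind this example, and the mitigation, can be worked through in a few lines. The function names and the numbers below are hypothetical, and times are measured on nodeB's clock as the reference, which is an assumption made to keep the example simple.

```python
def stamp_on_node_a(true_time: int, d_t: int) -> int:
    # nodeA's clock runs d_t ahead, so it stamps writes d_t too late.
    return true_time + d_t


def older_write_overtakes(d_t: int, t1: int, t3: int) -> bool:
    # The write at T1 carries T1 + dT; it overtakes T3 when dT > T3 - T1.
    return stamp_on_node_a(t1, d_t) > t3


def next_update_time(local_now: int, previous_update_time: int) -> int:
    # Mitigation from the text: never stamp earlier than the previous version.
    return max(local_now, previous_update_time)


d_t, t1, t3 = 50, 100, 130

lut_record_x = stamp_on_node_a(t1, d_t)    # stamped by nodeA at T1
raw_lut_record_x_prime = t3                # what nodeB's clock reads at T3
mitigated_lut = next_update_time(t3, lut_record_x)
```

With these values dT (50) exceeds T3 - T1 (30), so without the mitigation the older recordX would carry a later timestamp than recordX'. With the mitigation, the update on nodeB is stamped no earlier than the version it replaces.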
Another example is when a node is away for longer than tomb-raider-eligible-age.
- A node goes away.
- On the existing cluster, a record is durably deleted and a tombstone is created. Once tomb-raider-eligible-age has passed, the tombstone is removed.
- The node now comes back, and since the tombstone no longer exists, the record is back in the system.
This is why, if a node has been away for longer than tomb-raider-eligible-age, its disk should be wiped before the node is added back into the cluster.
Other dangers -
Under certain cluster state conditions, durably deleted records can be unintentionally resurrected.
This can occur when:
- A node cold restarts while temporarily owning partitions it did not own during normal steady-state.
- Those partitions had previously been owned by the same node in an earlier transient state, during which the node received writes for records that were later durably deleted after partition ownership reverted to its usual state.
In this scenario, records that were durably deleted will still exist on disk on the node from its earlier ownership period until the relevant blocks are defragmented and overwritten with newer data. If that node cold restarts while again temporarily owning those same partitions, such records will be scanned in and resurrected.
IMPORTANT NOTE – There is one more mechanism at play: the 6-bit tree_id. A node would therefore need to give up and later retake ownership of a partition 64 times before a tree_id is reused and records from a previously owned partition are resurrected.
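The figure of 64 follows from the width of the tree_id: 6 bits give 2^6 = 64 distinct values. The sketch below counts how many ownership changes it takes before a value repeats. It assumes, purely for illustration, that a new tree_id is assigned from a simple wrapping counter each time ownership is taken. The actual allocation scheme is not described here, and the function name is hypothetical.

```python
TREE_ID_BITS = 6  # width of the tree_id, as stated in the note above


def ownership_cycles_until_reuse(bits: int = TREE_ID_BITS) -> int:
    """Count ownership changes before a tree_id value is seen again.

    Assumes a wrapping counter hands out tree_ids (illustrative only).
    """
    seen = set()
    tree_id = 0
    cycles = 0
    while tree_id not in seen:
        seen.add(tree_id)
        tree_id = (tree_id + 1) % (1 << bits)  # wrap at 2**bits
        cycles += 1
    return cycles


cycles = ownership_cycles_until_reuse()
```

Under this assumption, stale on-disk records from an earlier ownership period could only match the partition's current tree_id again after 64 such cycles.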