Problem Description
In some older versions of Aerospike, records on an XDR destination cluster can have their values reverted to an older version after nodes in the source cluster are restarted.
Explanation
When a node with XDR enabled is restarted, it will always resume and re-process the last 5 minutes in its digest log. Log messages similar to the following will be observed:
Aug 14 2019 13:17:28 GMT: INFO (xdr): (xdr.c:837) Starting XDR with resume ... to ship 12 outstanding log records
...
...
...
Aug 14 2019 13:17:29 GMT: INFO (as): (as.c:445) service ready: soon there will be cake!
Aug 14 2019 13:17:30 GMT: INFO (xdr): (xdr_serverside.c:153) XDR last ship time of this node for DC 0 went back to 1565788311404 from 1565788649324
Aug 14 2019 13:17:30 GMT: INFO (xdr): (xdr_handlers.c:190) replication service ready: and now you have icing!
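The size of the rewind reported in the "XDR last ship time" message above can be checked with a little arithmetic. The sketch below assumes the two values are milliseconds since the Unix epoch; the article does not state this, but it is consistent with the timestamps on the log lines themselves.

```python
from datetime import datetime, timezone

# Values copied from the "XDR last ship time" log line above.
# Assumption: both are milliseconds since the Unix epoch.
previous_ms = 1565788649324
rewound_ms = 1565788311404

# How far back the last ship time was moved, in seconds.
rewind_seconds = (previous_ms - rewound_ms) / 1000
print(rewind_seconds)  # 337.92, a little over 5 minutes


def to_utc(ms):
    """Convert epoch milliseconds to a UTC datetime (sub-second part dropped)."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


print(to_utc(previous_ms).isoformat())  # 2019-08-14T13:17:29+00:00
print(to_utc(rewound_ms).isoformat())   # 2019-08-14T13:11:51+00:00
```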
Because of this resume behavior, a record that was updated while the restarted node was out of the cluster can have a previous version shipped. This happens if the restarted node re-processes the digest of such a record (from the digest log) before migrations complete.
The number of affected records could be much higher if there is lag when the node is restarted.
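The ordering problem described above can be illustrated with a toy model. This is only a sketch: plain dictionaries stand in for the restarted node, another source node, and the destination cluster, and every name in it is hypothetical. It is not how Aerospike is implemented; it only shows why replaying the digest log before migrations finish can ship a stale value.

```python
# Toy model only: dicts stand in for nodes and clusters. Not Aerospike code.

def reship_after_restart(restarted_node_copy, destination, digest_log):
    """Replay logged digests, shipping whatever the restarted node holds locally."""
    for digest in digest_log:
        if digest in restarted_node_copy:
            destination[digest] = restarted_node_copy[digest]


# Before the restart: record "k1" holds "v1" everywhere and its digest is logged.
restarted_node_copy = {"k1": "v1"}
destination = {"k1": "v1"}
digest_log = ["k1"]

# While the node is down, another source node accepts an update and ships it.
other_node_copy = {"k1": "v2"}
destination["k1"] = other_node_copy["k1"]

# The node restarts and replays its digest log before migrations bring it "v2".
reship_after_restart(restarted_node_copy, destination, digest_log)
reverted_value = destination["k1"]
print(reverted_value)  # v1: the destination has been reverted

# Had migrations completed first, the local copy would already be current.
restarted_node_copy.update(other_node_copy)
reship_after_restart(restarted_node_copy, destination, digest_log)
print(destination["k1"])  # v2
```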
Solution
There are two potential approaches to work around this behavior:
- Stop the Aerospike process on the node, wait for the failed node processing to finish on the other nodes in the cluster, delete the digest log, and finally restart the Aerospike process.
- Set xdr-shipping-enabled to false in the config file on the node which is being restarted, and then dynamically set it to true once migrations have completed.
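The second approach could be scripted roughly as follows. This is a sketch under several assumptions that are not taken from this article: that xdr-shipping-enabled is set in the xdr context, that the asinfo tool is on the PATH, and that the node's statistics output includes a migrate_partitions_remaining counter that reaches 0 when migrations are done. The function names are hypothetical, and the script only checks the node it connects to, so verify the exact parameter context and statistic name for your server version before relying on it.

```python
import subprocess
import time


def asinfo(command, host="127.0.0.1"):
    """Run one info command through the asinfo CLI and return its raw output."""
    result = subprocess.run(
        ["asinfo", "-h", host, "-v", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def parse_stats(raw):
    """Turn 'a=1;b=2' into {'a': '1', 'b': '2'}."""
    return dict(pair.split("=", 1) for pair in raw.split(";") if "=" in pair)


def migrations_complete(raw_stats, stat_name="migrate_partitions_remaining"):
    """True when the (assumed) migration counter reads 0.

    Raises KeyError if the statistic is absent, rather than waiting forever.
    """
    return parse_stats(raw_stats)[stat_name] == "0"


def enable_shipping_when_ready(run=asinfo, poll_seconds=10, sleep=time.sleep):
    """Poll until migrations finish, then turn XDR shipping back on dynamically."""
    while not migrations_complete(run("statistics")):
        sleep(poll_seconds)
    # Assumption: the parameter is dynamic in the xdr context.
    return run("set-config:context=xdr;xdr-shipping-enabled=true")
```

After the node has been restarted with xdr-shipping-enabled set to false in its config file, calling enable_shipping_when_ready() on that node would wait for migrations and then re-enable shipping. The run and sleep parameters are injectable so the logic can be exercised without a live cluster.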