Detail
When an XDR cluster ships records in a normal state, it uses in-memory queues to track which records should be shipped. Records are added to those queues as they are updated by the client application and are therefore naturally ordered by Last Update Time (LUT). In certain circumstances, these queues are dropped and all records after a given LUT are shipped by iterating through a partition. This is called recovery. Can migration within an Aerospike cluster trigger XDR recovery?
Answer
Whether migrations push a partition into recovery depends on whether the in-memory queue can be “trusted”, that is, whether it is certain that no updates are missing from it. For a given partition, the node acting as master during the migration keeps shipping records for that partition until the migration of that specific partition completes. When it completes, the node taking master ownership looks at the Last Ship Time (LST) for the partition. If that LST is newer than the LUT of the first element in the queue, the queue is trusted and normal shipping continues, because the new master node can be sure that nothing is missing from the queue it has received.
If the queue is not trusted, meaning that the LST is not newer than the LUT of the first element in the queue, a recovery is required for that partition. That recovery is not scheduled until the migrations have completed.
If there is no write load and the XDR transaction queue is empty, a pessimistic approach is taken and a recovery is scheduled for when migrations are complete.
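The trust decision described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this article, not Aerospike's implementation; the function names, the integer timestamps and the use of a deque to stand in for the in-memory queue are all assumptions made for the example.

```python
from collections import deque
from typing import Deque


def queue_is_trusted(last_ship_time: int, queue_luts: Deque[int]) -> bool:
    """Return True when the new master can keep shipping from the queue.

    queue_luts is a hypothetical stand-in for the in-memory queue: it holds
    the LUT of each queued record, oldest first.
    """
    if not queue_luts:
        # Empty queue (no write load): nothing to compare against,
        # so the pessimistic path is taken.
        return False
    # Trusted only if the LST is strictly newer than the LUT of the
    # first (oldest) element in the queue.
    return last_ship_time > queue_luts[0]


def needs_recovery(last_ship_time: int, queue_luts: Deque[int]) -> bool:
    """A recovery is needed whenever the queue cannot be trusted."""
    return not queue_is_trusted(last_ship_time, queue_luts)
```

For example, with an LST of 100, a queue whose first LUT is 90 is trusted, while a queue whose first LUT is 100 or later, or an empty queue, leads to a recovery once migrations complete.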
Because recoveries are not scheduled until migrations have completed, the recoveries_pending statistic will not show these recoveries for the concerned partitions until then.
Therefore, it is completely normal to see a number of recoveries start as soon as migrations complete.
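When watching for this behaviour, it can help to pull recoveries_pending out of the XDR statistics programmatically. The sketch below assumes the statistics have already been fetched and arrive as a semicolon-separated list of name=value pairs; both helper names and the sample string are hypothetical and only illustrate the parsing.

```python
from typing import Dict


def parse_stats(response: str) -> Dict[str, str]:
    """Split a 'name=value;name=value' string into a dictionary."""
    stats: Dict[str, str] = {}
    for pair in response.strip().split(";"):
        if "=" in pair:
            name, value = pair.split("=", 1)
            stats[name] = value
    return stats


def pending_recoveries(response: str) -> int:
    """Return recoveries_pending, or 0 if the statistic is absent."""
    return int(parse_stats(response).get("recoveries_pending", "0"))


# Illustrative input only, not output captured from a real cluster.
sample = "lag=0;recoveries=3;recoveries_pending=12"
print(pending_recoveries(sample))
```

A value that stays at zero while migrations run and rises once they finish is consistent with the behaviour described above.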
Notes on Strongly Consistent Namespaces
If a cluster has a strongly consistent namespace and a cluster event leads to partitions being unavailable, nodes holding those unavailable partitions will immediately drop their XDR transaction queues. When those partitions become available again, recoveries are scheduled for them.
While partitions are unavailable they will not count against any calculated lag within the cluster. Lag is only calculated based on the last ship time of available partitions. Once the recoveries for a newly available partition start, the lag will include this partition’s state (in terms of last ship time).
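The effect of availability on lag can be illustrated with a small model. It assumes, purely for the sake of the example, that lag is the current time minus the oldest last ship time among available partitions; the article only states that unavailable partitions are excluded, so the exact formula, the PartitionState type and the time units here are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PartitionState:
    partition_id: int
    available: bool
    last_ship_time: int  # hypothetical unit: seconds


def cluster_lag(now: int, partitions: Iterable[PartitionState]) -> Optional[int]:
    """Lag based only on available partitions (assumed: now - oldest LST)."""
    lsts = [p.last_ship_time for p in partitions if p.available]
    if not lsts:
        return None  # no available partitions, nothing to measure
    return now - min(lsts)


parts = [PartitionState(0, True, 95), PartitionState(1, False, 10)]
lag_while_unavailable = cluster_lag(100, parts)

# Once partition 1 is available again and its recovery starts,
# its older last ship time is included in the calculation.
parts[1].available = True
lag_after_available = cluster_lag(100, parts)
```

In this model the lag jumps when the partition returns, which mirrors the statement that the lag includes a newly available partition's state once its recovery starts.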
When a recovery for a partition starts, the last ship time it uses depends on what has happened within the cluster. If enough nodes have left that no remaining node holds any data for the partition, the global last ship time from the SMD is used as the basis for recovery. If at least one node with some data for the partition remained in the cluster, even if that was not sufficient for the partition to be available, the partition-level last ship time is retained, and that is the last ship time used once recovery starts.
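The choice of starting point can be summarised as a simple fallback. This is a sketch of the rule as described in this article; the function name and the use of None to represent "no remaining node held data for the partition" are assumptions for illustration.

```python
from typing import Optional


def recovery_start_lst(partition_lst: Optional[int], smd_global_lst: int) -> int:
    """Pick the last ship time a recovery starts from.

    partition_lst is None when no remaining node held any data for the
    partition, in which case the partition-level value has been lost.
    """
    if partition_lst is None:
        # Fall back to the global last ship time kept in the SMD.
        return smd_global_lst
    # At least one node kept data, so the partition-level LST survived.
    return partition_lst
```

The partition-level value is preferred whenever it survives, so the fallback to the global value only applies when every node holding data for the partition has left.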