Detail
What are some reasons why XDR could get stuck in recovery mode?
Answer
The following are some of the more common reasons why XDR can get stuck in recovery mode.
1) Shipping a record that is too big to be shipped within the transaction-max-ms window. This can cause the transaction to time out and be retried over and over. Symptoms that indicate this:
- retry_dest incrementing frequently and continuously on the source cluster
- tsvc timeouts on the destination cluster node(s) receiving the record(s)
- Increased write latency on the source side that does not exceed the value transaction-max-ms is set to
- Increasing XDR lag
- in_queue build up and/or breaching of transaction-queue-limit
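As a rough illustration of the symptom list above, the following sketch checks whether a few metrics keep rising across successive samples. It assumes you have already collected the samples yourself; the dictionary keys simply reuse the names retry_dest, lag and in_queue from this article, and the helper names rising and stall_symptoms are hypothetical, not part of any Aerospike tool.

```python
from typing import Mapping, Sequence


def rising(values: Sequence[float]) -> bool:
    """True when the series never drops and ends higher than it started."""
    return (
        len(values) >= 2
        and all(b >= a for a, b in zip(values, values[1:]))
        and values[-1] > values[0]
    )


def stall_symptoms(samples: Sequence[Mapping[str, float]]) -> dict:
    """Report, per metric, whether it kept rising across the samples."""
    return {
        name: rising([s[name] for s in samples])
        for name in ("retry_dest", "lag", "in_queue")
    }


# Hypothetical samples taken a few seconds apart on one source node.
samples = [
    {"retry_dest": 10, "lag": 5, "in_queue": 100},
    {"retry_dest": 25, "lag": 9, "in_queue": 400},
    {"retry_dest": 61, "lag": 14, "in_queue": 900},
]
print(stall_symptoms(samples))
```

All three metrics rising together is only a hint, not proof; the tsvc timeouts on the destination and the write latency on the source still need to be checked separately.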
Solution
To resolve this, either find and remove the record from the source cluster or try increasing transaction-max-ms. The maximum value it can be set to in this case is 10 seconds, as the XDR response time is 10 s.
For namespaces that are data-in-memory, record sizes can be as big as 128 MiB. On Aerospike server versions 5.7+, you can also put a threshold on the record size by setting max-record-size on both source and destination, if your business model permits.
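The two options above can be sketched as small helpers: one clamps a proposed transaction-max-ms to the 10 s ceiling mentioned in this article, the other lists records larger than a chosen threshold. Both function names and the sizes_by_key input are hypothetical; how you obtain record sizes is outside the scope of this sketch.

```python
XDR_RESPONSE_TIME_MS = 10_000  # the 10 s XDR response time stated in this article


def capped_transaction_max_ms(requested_ms: int) -> int:
    """Clamp a proposed transaction-max-ms value to the 10 s ceiling."""
    if requested_ms <= 0:
        raise ValueError("transaction-max-ms must be positive")
    return min(requested_ms, XDR_RESPONSE_TIME_MS)


def oversized_records(sizes_by_key: dict, max_record_size: int) -> list:
    """Return the keys whose record size (in bytes) exceeds the threshold."""
    return sorted(k for k, size in sizes_by_key.items() if size > max_record_size)


# Hypothetical record sizes in bytes, keyed by an illustrative record key.
sizes_by_key = {"user:1": 2_048, "user:2": 64 * 1024 * 1024}
print(capped_transaction_max_ms(15_000))
print(oversized_records(sizes_by_key, 8 * 1024 * 1024))
```

The 8 MiB threshold here is only an example value; pick one that matches what your application legitimately writes.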
2) Authentication issues on the destination. These can cause the destination to reject the records being shipped to it. Symptoms that indicate this:
- retry_dest incrementing frequently and continuously on the source cluster
- Little to no relevant XDR warnings on both the source and destination
- Increasing XDR lag
- Authentication failing in audit logs
Solution
If your source and destination clusters are set up 1:1, meaning that they have the same number of nodes and share the same node-ids, you could try to quiesce the problematic node on the destination side. To determine which node is problematic, check on the source cluster(s) which node-id is stuck in recoveries.
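A minimal sketch of that last check, assuming you have sampled a pending-recovery count per node-id several times on the source cluster. The input structure, the node-ids shown, and the function name are all hypothetical; the sketch only encodes the idea that a node whose count never returns to zero is the one to look at.

```python
def nodes_stuck_in_recovery(samples_by_node: dict) -> list:
    """Return node-ids whose pending-recovery count was above zero in every sample."""
    return sorted(
        node_id
        for node_id, counts in samples_by_node.items()
        if counts and all(c > 0 for c in counts)
    )


# Hypothetical node-ids mapped to successive pending-recovery samples.
samples_by_node = {
    "BB9010011AC4202": [0, 3, 0, 0],
    "BB9020011AC4202": [12, 12, 15, 14],
    "BB9030011AC4202": [0, 0, 0, 0],
}
print(nodes_stuck_in_recovery(samples_by_node))
```

A node that shows a brief spike and then drains back to zero is recovering normally and is not reported.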