Problem Description
After updating a bin on the source cluster, we sometimes see the bin value revert to its original value.
Setup: bi-directional XDR with bin convergence; both DCs use strong consistency.
Explanation
When running in Strong Consistency mode, records can end up in an unreplicated state. Reading or updating an unreplicated record triggers re-replication of that record. If XDR is in use, the re-replication also causes the record to be shipped.
Using the example setup from above, the following scenario could occur:
- Update bin1, value1 on DC2
- The record containing bin1, value1 on DC2 ends up in an unreplicated state
- Update bin1, value2 on DC1
- DC1 ships bin1, value2 to DC2
- Upon receiving bin1, value2, DC2 triggers re-replication on the record since it was in an unreplicated state
- DC2 ships bin1, value1 back to DC1, because the re-replication causes the record to be shipped
- DC2 ends up with the old value for bin1
The LUT (last-update time) of a record is updated when an unreplicated record gets re-replicated, so the original write of bin1=value1 would not have an LUT that was 5 days before the write of value2. As a result, value1 could be propagated in preference to value2.
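The sequence above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not Aerospike's implementation: the `Bin` and `Cluster` classes, the integer clock ticks, and the rule that the higher LUT wins are all hypothetical stand-ins, and the model assumes that DC2 re-replicates (and refreshes the LUT) before it evaluates the incoming write.

```python
from dataclasses import dataclass, field


@dataclass
class Bin:
    value: str
    lut: int  # last-update time, modelled as a logical clock tick


@dataclass
class Cluster:
    """Hypothetical stand-in for one DC holding a single record."""
    name: str
    bins: dict = field(default_factory=dict)
    unreplicated: bool = False

    def write(self, bin_name, value, now, replicated=True):
        # Local client write; may leave the record unreplicated.
        self.bins[bin_name] = Bin(value, now)
        self.unreplicated = not replicated

    def receive(self, bin_name, incoming):
        # Assumed convergence rule: the bin with the higher LUT wins.
        local = self.bins.get(bin_name)
        if local is None or incoming.lut > local.lut:
            self.bins[bin_name] = Bin(incoming.value, incoming.lut)

    def re_replicate(self, bin_name, now):
        # Re-replication refreshes the LUT and hands the bin to XDR for shipping.
        local = self.bins[bin_name]
        local.lut = now
        self.unreplicated = False
        return Bin(local.value, local.lut)


def simulate():
    dc1, dc2 = Cluster("DC1"), Cluster("DC2")
    dc2.write("bin1", "value1", now=1, replicated=False)  # ends up unreplicated
    dc1.write("bin1", "value2", now=2)
    shipped = Bin("value2", 2)                            # DC1 ships to DC2

    stale = None
    if dc2.unreplicated:
        # Touching the record on DC2 triggers re-replication first.
        stale = dc2.re_replicate("bin1", now=3)
    dc2.receive("bin1", shipped)      # LUT 2 loses against the refreshed LUT 3
    if stale is not None:
        dc1.receive("bin1", stale)    # LUT 3 beats LUT 2 on DC1
    return dc1.bins["bin1"].value, dc2.bins["bin1"].value


if __name__ == "__main__":
    print(simulate())
```

In this model both DCs finish with value1, because the refreshed LUT makes the older write look newer than value2.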
Solution
To prevent this corner case, we recommend clearing unreplicated records by triggering re-replication. For ways to trigger re-replication, please see our KB article "How to trigger re-replication for unreplicated records within a strongly consistent Aerospike namespace".
Notes
Please note that when re-replication is triggered, the affected records are shipped by XDR, so it is possible to end up shipping stale data.