
Why do queries fail when a node with data rejoins a cluster?

Detail

In AP mode, a node that rejoins a cluster with data can cause queries to fail until migrations complete (or while migrate-fill-delay is delaying fill migrations). The following error appears in the client logs:
Error 11  ... Partition xxxx Unavailable 

sample exception:
[2024-08-07 15:46:34] Aerospike error: Error 11,6,0,30000,0,5,BB91500800A0142 10.128.0.21 3000: Partition 4034 unavailable
sub-exceptions:
Error 11,1,BB96E00800A0142 10.128.0.110 3000: Partition 3692 unavailable
Server error name:  AS_ERR_UNAVAILABLE 
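
When many of these exceptions are logged, it can help to extract which partitions and nodes were involved. The following is a minimal sketch, not an official tool: the field layout (result code first, node ID as the last comma-separated field, then address and port) is inferred only from the sample above, and the names `UnavailablePartition` and `parse_unavailable` are hypothetical.

```python
import re
from typing import List, NamedTuple


class UnavailablePartition(NamedTuple):
    result_code: int
    node_id: str
    address: str
    port: int
    partition_id: int


# Assumed layout, based only on the sample exception in this article:
#   Error <code>,<other fields...>,<node id> <address> <port>: Partition <id> unavailable
_PATTERN = re.compile(
    r"Error (?P<code>\d+),(?:[^,\s]+,)*(?P<node>[0-9A-Fa-f]+) "
    r"(?P<addr>\S+) (?P<port>\d+): Partition (?P<pid>\d+) unavailable",
    re.IGNORECASE,
)


def parse_unavailable(log_text: str) -> List[UnavailablePartition]:
    """Return one entry per 'Partition N unavailable' error found in log_text."""
    return [
        UnavailablePartition(
            result_code=int(m["code"]),
            node_id=m["node"],
            address=m["addr"],
            port=int(m["port"]),
            partition_id=int(m["pid"]),
        )
        for m in _PATTERN.finditer(log_text)
    ]
```

Running `parse_unavailable` over the main exception and its sub-exceptions yields one entry per affected partition, which makes it easier to confirm that the errors point at the node that is still waiting for incoming migrations.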


 

Answer

When a node is restarted and rejoins the cluster with data, the current master claims ownership of both the master and replica partitions. (Otherwise, no node would claim ownership of the replica until a full replica is available through migrations.)

Clients that were still running continue to have the restarted node in their partition map (marked inactive when it left, then active again when it returns), but newly started clients do not. When the node comes back, the client's partition map still points to the node that took over master partition ownership and was previously the replica owner (and now owns both the master and replica partitions).


Consider the scenario where node A is the master and node B is the replica for a partition P. If node A leaves and then rejoins with data, it causes two-way migrations for P: first from B to A, and then from A back to B, because A rejoins with data and has a different family.


[Figure: Node A just returned]


These bidirectional migrations leave a window in which node B is in a pending-migrations state.
(When migrations from B to A complete, B joins A's family and becomes a subset in that family. A claims ownership of master and replica, but the client does not learn this until it tends to A, so queries still hit B until then and fail on older releases. See AER-6709 and AER-6708 for how this was addressed.)

[Figure: A returns]


During this migration period, a client that sends a query request to node B receives error code 11 (partition unavailable), because the partition on that node is in a pending-migration state, waiting for incoming migrations. (The issue still occurs with a replication factor greater than 2.)

This issue does not occur in SC (strong consistency) mode, which has only one-way migration when a node returns. It also does not affect single-record or batch transactions, because the cluster performs duplicate resolution or proxies the request to another node.

Workarounds:
  • Bring the node back empty, or use cold-start-empty.
  • Redirect traffic to a backup cluster during maintenance.
  • Use the relaxed short and long query policies on server 7.1 and above. These policies allow queries in AP mode to proceed even when the partition is not full.
  • Relaxed short queries are available starting with version 6.4.0.12.

Notes

In server 7.1 (and hotfixed in 6.4), the client can use the relax policy to reserve partitions that are not full for queries. If the customer is doing a slow rolling upgrade (waiting for migrations to complete), this prevents short queries from timing out completely, and if no split brain occurs, short queries should not miss records.
 

  • Long queries (both PI and SI) cannot reserve a partition that has incoming migrations unless the client marks them as 'relaxed' in the QueryDuration policy and the server is version 7.1.0.0 or later.
  • This restriction does not apply to short queries on the versions listed below or later. The setting must still be specified by the client in its policy.
    • Versions:
      • 6.1.0.30
      • 6.2.0.25
      • 6.3.0.18
      • 6.4.0.12
      • 7.0.0.5
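
The version rules above can be sketched as a small check. This is an illustrative model only, not server or client code: it assumes plain four-part version strings, treats each listed hotfix as the minimum for its own major.minor line, and assumes every line from 7.1 onward includes the change. The names `QueryKind`, `short_relax_supported`, and `can_reserve` are hypothetical and are not part of any Aerospike API.

```python
from enum import Enum
from typing import Tuple

Version = Tuple[int, ...]

# Minimum versions taken from the list above, keyed by (major, minor) line.
SHORT_RELAX_MIN = {
    (6, 1): (6, 1, 0, 30),
    (6, 2): (6, 2, 0, 25),
    (6, 3): (6, 3, 0, 18),
    (6, 4): (6, 4, 0, 12),
    (7, 0): (7, 0, 0, 5),
}
LONG_RELAX_MIN = (7, 1, 0, 0)


class QueryKind(Enum):
    """Hypothetical labels for how the client declared the query."""
    SHORT = "short"
    LONG = "long"
    LONG_RELAXED = "long-relaxed"


def parse_version(text: str) -> Version:
    """Parse a plain 'a.b.c.d' version string (no suffixes handled)."""
    parts = tuple(int(p) for p in text.strip().split("."))
    if len(parts) != 4:
        raise ValueError(f"expected a four-part version, got {text!r}")
    return parts


def short_relax_supported(version: Version) -> bool:
    """True if short queries may reserve a partition that is not full."""
    if version >= LONG_RELAX_MIN:  # assumption: 7.1+ includes the change
        return True
    minimum = SHORT_RELAX_MIN.get(version[:2])
    return minimum is not None and version >= minimum


def can_reserve(version_text: str, kind: QueryKind, incoming_migrations: bool) -> bool:
    """Model whether an AP-mode query can reserve the partition."""
    if not incoming_migrations:
        return True  # nothing pending, so no restriction applies
    version = parse_version(version_text)
    if kind is QueryKind.SHORT:
        return short_relax_supported(version)
    if kind is QueryKind.LONG_RELAXED:
        return version >= LONG_RELAX_MIN
    return False  # plain long queries need a partition with no incoming migrations
```

For example, `can_reserve("6.4.0.12", QueryKind.SHORT, True)` evaluates to `True` under this model, while a plain long query against the same partition does not, whatever the server version.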



The issue above occurs on a per-partition basis, so not all partitions are affected.

Reference:

  • https://aerospike.com/blog/database-7-1-enhancements/#developer_api_changes

Applies To Earliest Version

5.0

Applies To Latest Version

Current Version