
Why do Queries fail on a cluster change?

Detail

During a cluster change, a client running queries can exhaust all of its configured retries and fail.
This can happen for two reasons: a time gap in the client's tending of the cluster nodes, or partitions being unavailable because a node that returned to the cluster with data holds them only as subsets while migrations are in progress.

Client logs show the following sub-exceptions:

Error 11  ... Partition xxxx Unavailable 

and the following exception error:

MAX_RETRIES_EXCEEDED

Answer

Details:

As of Aerospike version 6.0, clients can issue queries against specific partitions, so any query can be retried on a per-partition basis if needed.

This allows partitions that are changing ownership ('in flight') during cluster changes to be retried so that they are not missed.

Tending time gap:

In most cases, a partition that is 'missed' on a node is successfully retried against the next replica (when running with replication factor 2 or more). However, there are some edge cases where a node can temporarily claim ownership of multiple replicas. In some of those situations, because the client tends each node sequentially to learn which partitions it owns, there can be a time gap in which the client exhausts all of its retries before it has tended one of the rightful owners of the partition that was in flight.

Sample client logs with a Query policy maxRetries=5 hitting node A1:

com.aerospike.client.AerospikeException: 
Error 11,6,0,30000,0,5,A1 192.168.1.2 4333: Partition 1234 unavailable 
sub-exceptions: Error 11,1,A1 192.168.1.2 4333: Partition 1234 unavailable 
Error 11,2,A1 192.168.1.2 4333: Partition 1234 unavailable 
Error 11,3,A1 192.168.1.2 4333: Partition 1234 unavailable 
Error 11,4,A1 192.168.1.2 4333: Partition 1234 unavailable 
Error 11,5,A1 192.168.1.2 4333: Partition 1234 unavailable 
Error 11,6,A1 192.168.1.2 4333: Partition 1234 unavailable
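
As a rough diagnostic aid, sub-exception lines like the ones above can be checked programmatically to confirm that every attempt hit the same node for the same partition, which is the pattern shown in this sample. The sketch below assumes the line layout seen in the sample (error code, attempt number, node name, address, port); it is not an official log format specification, and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartitionRetryLog {
    // ASSUMED layout, taken from the sample above:
    // "Error <code>,<attempt>,<node> <address> <port>: Partition <id> unavailable"
    private static final Pattern SUB_EXCEPTION = Pattern.compile(
        "Error (\\d+),(\\d+),([^,\\s]+) (\\S+) (\\d+): Partition (\\d+) unavailable");

    // Number of lines that look like a per-attempt sub-exception.
    static int countAttempts(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            if (SUB_EXCEPTION.matcher(line).find()) {
                count++;
            }
        }
        return count;
    }

    // True when at least one attempt is found and all attempts
    // name the same node and the same partition.
    static boolean allAttemptsOnSameTarget(List<String> lines) {
        Set<String> targets = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher m = SUB_EXCEPTION.matcher(line);
            if (m.find()) {
                targets.add(m.group(3) + "/" + m.group(6)); // node/partition
            }
        }
        return targets.size() == 1;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "sub-exceptions: Error 11,1,A1 192.168.1.2 4333: Partition 1234 unavailable",
            "Error 11,2,A1 192.168.1.2 4333: Partition 1234 unavailable");
        System.out.println(countAttempts(lines) + " attempts, same target: "
            + allAttemptsOnSameTarget(lines));
    }
}
```

With maxRetries=5 the sample contains six attempts (the initial attempt plus five retries), all against node A1 and partition 1234.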

To handle those situations, adjust the client query policy (maxRetries / sleepBetweenRetries) so that the retries span the tend interval gap.
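
One way to reason about this adjustment is to compare the total time the client spends sleeping between attempts with the gap that has to be covered. The sketch below is plain arithmetic and not part of the Aerospike client. The gap value is an assumption that you would measure or estimate for your own cluster, and the class and method names are hypothetical.

```java
final class QueryRetryBudget {
    // Total time spent sleeping across all retries, assuming one sleep
    // of sleepBetweenRetriesMillis before each retry.
    static long retryWindowMillis(int maxRetries, long sleepBetweenRetriesMillis) {
        return (long) maxRetries * sleepBetweenRetriesMillis;
    }

    // Smallest maxRetries whose sleep window is at least tendGapMillis.
    // tendGapMillis is an ASSUMED worst-case gap, supplied by the caller.
    static int minRetriesToCover(long tendGapMillis, long sleepBetweenRetriesMillis) {
        if (tendGapMillis < 0 || sleepBetweenRetriesMillis <= 0) {
            throw new IllegalArgumentException(
                "gap must be >= 0 and sleep must be > 0");
        }
        // Ceiling division without floating point.
        long retries = (tendGapMillis + sleepBetweenRetriesMillis - 1)
            / sleepBetweenRetriesMillis;
        return Math.toIntExact(retries);
    }

    public static void main(String[] args) {
        long assumedGapMillis = 2000;   // hypothetical gap to cover
        long sleepMillis = 500;         // candidate sleepBetweenRetries
        int retries = minRetriesToCover(assumedGapMillis, sleepMillis);
        System.out.println("maxRetries >= " + retries
            + " covers " + retryWindowMillis(retries, sleepMillis) + " ms");
    }
}
```

Note that with a sleep of 0 the retry window is 0 ms regardless of maxRetries, so raising maxRetries alone does not widen the window. The chosen values would then be set as maxRetries and sleepBetweenRetries on the client's query policy.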

Strict Query reading policy: 

  • A node that was added back to a cluster with data (and so no longer holds full partitions) returns its partitions as subsets. If a query hits this node and does not have the relax policy enabled, the query fails.
  • Long queries (both primary index (PI) and secondary index (SI)) cannot reserve a partition that has incoming migrations unless the client specifies them as 'relaxed' in the QueryDuration policy and the server is running Aerospike Server version 7.1.0.0 or later.
  • Support for relax policies on short queries starts with the versions below. The client must specify this setting in its policy.
    • Versions:
      • 6.1.0.30
      • 6.2.0.25
      • 6.3.0.18
      • 6.4.0.12
      • 7.0.0.5
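
The version requirements above can be turned into a small check, for example to decide at startup whether a relaxed policy is worth setting. This is a minimal sketch that assumes four-part numeric version strings (such as 6.4.0.12) and treats each listed version as the minimum build within its own release branch. It also assumes that branches newer than 7.0 include short-query support, which the list above does not state explicitly. The class and method names are hypothetical.

```java
import java.util.List;

final class RelaxedQuerySupport {
    // Minimum builds for relaxed short queries, copied from the list above.
    private static final List<int[]> SHORT_QUERY_MINIMUMS = List.of(
        new int[] {6, 1, 0, 30},
        new int[] {6, 2, 0, 25},
        new int[] {6, 3, 0, 18},
        new int[] {6, 4, 0, 12},
        new int[] {7, 0, 0, 5});

    // Minimum server version for relaxed long queries, from the text above.
    private static final int[] LONG_QUERY_MINIMUM = {7, 1, 0, 0};

    // Parses "a.b.c.d"; suffixes or shorter forms are rejected in this sketch.
    static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected a.b.c.d, got: " + version);
        }
        int[] v = new int[4];
        for (int i = 0; i < 4; i++) {
            v[i] = Integer.parseInt(parts[i]);
        }
        return v;
    }

    // Component-wise comparison, most significant part first.
    static int compare(int[] a, int[] b) {
        for (int i = 0; i < 4; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }

    static boolean supportsRelaxedShortQuery(String version) {
        int[] v = parse(version);
        for (int[] min : SHORT_QUERY_MINIMUMS) {
            if (v[0] == min[0] && v[1] == min[1]) {
                return compare(v, min) >= 0; // same branch: compare builds
            }
        }
        // ASSUMPTION: branches after 7.0 carry the feature; older ones do not.
        return compare(v, LONG_QUERY_MINIMUM) >= 0;
    }

    static boolean supportsRelaxedLongQuery(String version) {
        return compare(parse(version), LONG_QUERY_MINIMUM) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(supportsRelaxedShortQuery("6.4.0.12"));
        System.out.println(supportsRelaxedLongQuery("7.0.0.5"));
    }
}
```

The per-branch comparison matters because a higher branch number alone does not imply support: under this reading, 6.4.0.11 is newer than 6.1.0.30 but is below the minimum for its own branch.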

Notes

Reference:


    Applies To Earliest Version

    6.0

    Applies To Latest Version

    Current Version