What is the cause of AEROSPIKE_ERR_RECORD_NOT_FOUND during migrations?

Problem Description

When a cluster is migrating, simple get commands, issued via AQL or any other client, fail with AEROSPIKE_ERR_RECORD_NOT_FOUND. What is the reason for this?

Explanation

This is expected behavior in one specific scenario. It is not a general problem with migrations; it can occur when a rolling restart is performed without waiting for migrations to complete between node restarts.

By default, in AP mode (a namespace without strong consistency enabled), reads do not perform duplicate resolution during migration. Unless the client or the server is configured to resolve duplicates on reads, a get command may return AEROSPIKE_ERR_RECORD_NOT_FOUND for records that are actually present in the cluster.

To explain this issue further, consider a cluster where a partition exists on node A as master and node B as replica.

  1. Node A is shut down. Node B becomes master and has a full copy of the partition; another node, node C, is now the replica.
  2. A record, X, is written to the partition. The record exists on node B and node C.
  3. Before migrations complete, node A returns to the cluster and node B is shut down. This is a typical sequence during a rolling restart. At that point neither node A nor node C has a full copy of the partition, so both hold subset partitions, and only node C has a copy of record X.
  4. When both copies of the partition are subsets, the node that is first in the succession list (the leftmost node) becomes the master, in this case node A. In the default configuration the get transaction goes to the master only. Here the master does not have the record and returns AEROSPIKE_ERR_RECORD_NOT_FOUND.
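
The four steps above can be replayed with a small toy model. This is only an illustrative sketch, not Aerospike's implementation: the `Node` class, the two read functions, and the use of a generation number to pick the winning copy are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy node: holds whatever records of one partition it currently has."""
    name: str
    records: dict = field(default_factory=dict)  # key -> (generation, value)


def read_master_only(succession, key):
    """Default AP read: ask only the first node in the succession list."""
    return succession[0].records.get(key)


def read_duplicate_resolve(succession, key):
    """Read with duplicate resolution: consult every copy, keep the newest."""
    versions = [n.records[key] for n in succession if key in n.records]
    return max(versions, key=lambda v: v[0]) if versions else None


a, b, c = Node("A"), Node("B"), Node("C")

# Step 1: node A is shut down; B is master, C is the new replica.
# Step 2: record X is written and lands on B and C.
b.records["X"] = (1, "x-value")
c.records["X"] = (1, "x-value")

# Step 3: A returns without X and B is shut down before migrations finish.
# Step 4: both remaining copies are subsets, so A (leftmost) acts as master.
succession = [a, c]

master_only = read_master_only(succession, "X")
resolved = read_duplicate_resolve(succession, "X")

print(master_only)  # None, which the client would see as record not found
print(resolved)     # (1, 'x-value'), found on node C
```

In this model the master-only read misses the record because node A never received it, while the read that consults both subset copies finds it on node C.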

The failure therefore depends on node B being shut down before migrations had completed, not on migration as such.

Had migrations been allowed to complete before node B was shut down, node C would have held a full copy of the partition and would have become the master when node B was shut down. No duplicate resolution would then have been needed, and reads would not return AEROSPIKE_ERR_RECORD_NOT_FOUND for records that are in the cluster.
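
A rolling-restart script can enforce this wait by checking each node's statistics before taking the next node down. The sketch below is a minimal example under stated assumptions: it assumes the statistics text has already been collected from every node (for example with `asinfo -v statistics`), that the text is a semicolon-separated list of `key=value` pairs, and that the relevant counter is named `migrate_partitions_remaining`. None of these details come from this article, and the helper names are hypothetical.

```python
def parse_statistics(raw: str) -> dict:
    """Parse 'k1=v1;k2=v2' statistics text into a dict of strings."""
    stats = {}
    for pair in raw.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            stats[key] = value
    return stats


def migrations_complete(raw_stats_per_node) -> bool:
    """Return True only if every node reports zero partitions left to migrate.

    A node whose output lacks the counter is treated as not ready, so the
    restart waits rather than proceeding on incomplete information.
    """
    for raw in raw_stats_per_node:
        remaining = parse_statistics(raw).get("migrate_partitions_remaining")
        if remaining is None or int(remaining) != 0:
            return False
    return True


# Hypothetical sample outputs from two nodes (assumed format).
samples = [
    "cluster_size=3;migrate_partitions_remaining=0",
    "cluster_size=3;migrate_partitions_remaining=12",
]
print(migrations_complete(samples))      # False: the second node is still migrating
print(migrations_complete(samples[:1]))  # True
```

The restart procedure would poll until `migrations_complete` returns True for all nodes, and only then shut down the next node.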


Solution

  • Duplicate resolution on reads can be configured either on the server or in the client policy.
  • The server-level configuration parameter is read-consistency-level-override. It overrides the client policy setting.
  • The client policy control is called data-consistency-level.
  • For both the client-side and the server-side control, the setting that resolves duplicates is ALL.
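
The precedence between the two controls can be summarized in a few lines. This is a sketch only: the function name is hypothetical, and it assumes the server override accepts the values `off`, `one`, and `all`, with `off` meaning that the client's policy is honored.

```python
def effective_read_level(client_level: str, server_override: str = "off") -> str:
    """Return the read consistency level that actually applies to a read.

    Assumed values: client_level is 'one' or 'all'; server_override is
    'off', 'one', or 'all', where 'off' defers to the client policy.
    """
    client_level = client_level.lower()
    server_override = server_override.lower()
    if client_level not in {"one", "all"}:
        raise ValueError(f"unknown client level: {client_level}")
    if server_override not in {"off", "one", "all"}:
        raise ValueError(f"unknown server override: {server_override}")
    # Any server setting other than 'off' wins over the client policy.
    return client_level if server_override == "off" else server_override


print(effective_read_level("one"))         # one: master-only read
print(effective_read_level("all"))         # all: client asks for duplicate resolution
print(effective_read_level("one", "all"))  # all: server forces duplicate resolution
```

Setting the server override to ALL protects every client at once, while the client policy allows duplicate resolution to be enabled only for the reads that need it.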

Applies To Earliest Version

Pre 4.9

Applies To Latest Version

Current Version