How to ensure availability in Strong Consistency with equally sized racks?

Lucien

Updated May 04, 2026 21:39

Problem Description

A cluster configured for Strong Consistency mode and using rack awareness with two equally sized racks could become only 50% available if one of those racks fails or is isolated by a network split. In that scenario, the Strong Consistency rules leave only 50% of partitions active in each rack.
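The 50% figure follows from the partition availability rules. The following is a minimal Python sketch of a simplified version of those rules, assuming replication-factor 2, one copy of each partition in each rack, and masters spread evenly across the racks. The node names, the Partition type, and the helper functions are hypothetical and are not part of any Aerospike API, and a real cluster applies additional conditions that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    master: str           # roster node holding the master copy
    replicas: frozenset   # all roster nodes holding a copy, master included


def is_available(partition, sub_cluster, roster):
    """Simplified availability check for one partition in one sub-cluster."""
    present = partition.replicas & sub_cluster
    if present == partition.replicas:
        return True  # every roster replica is in this sub-cluster
    if 2 * len(sub_cluster) > len(roster) and present:
        return True  # strict majority of the roster plus at least one replica
    if 2 * len(sub_cluster) == len(roster) and partition.master in sub_cluster:
        return True  # exactly half of the roster, and it holds the master
    return False


def build_partitions(count):
    """Spread masters round-robin; place the second copy on the other rack."""
    masters = ["A1", "B1", "A2", "B2"]
    partner = {"A1": "B1", "B1": "A1", "A2": "B2", "B2": "A2"}
    partitions = []
    for i in range(count):
        master = masters[i % len(masters)]
        partitions.append(Partition(master, frozenset({master, partner[master]})))
    return partitions


def available_fraction(partitions, sub_cluster, roster):
    """Fraction of partitions that the given sub-cluster can serve."""
    hits = sum(is_available(p, sub_cluster, roster) for p in partitions)
    return hits / len(partitions)


# Hypothetical two-rack roster with two nodes per rack.
RACK_A = frozenset({"A1", "A2"})
RACK_B = frozenset({"B1", "B2"})
ROSTER = RACK_A | RACK_B

if __name__ == "__main__":
    parts = build_partitions(4096)
    print("rack A alone:", available_fraction(parts, RACK_A, ROSTER))
    print("rack B alone:", available_fraction(parts, RACK_B, ROSTER))
```

In this model each isolated rack holds exactly half of the roster, so only the third rule can apply, and it applies only to the partitions whose master lives in that rack.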

Explanation

During planned maintenance of a cluster with two equally sized racks, the active-rack feature can be used to keep the designated active rack available when the other (passive) rack is taken down or fails. The remaining cluster stays fully available when the passive rack is gracefully shut down.

This feature does not address an unexpected network zone or AZ failure in which the rack designated as the 'active-rack' in this two-rack cluster suddenly goes down. In that case, the cluster becomes unavailable.

We discuss recommended best practices, alternatives, maintenance and failure scenarios below.

Solution

Best Practice and Alternatives details:

One solution is a three-rack deployment with an equal number of nodes in each rack and replication-factor=3, where losing a rack leaves the other two with fully available partitions.
A less efficient option is a two-rack deployment with unequal rack sizes, where one rack has an extra node and can survive the smaller rack going down.
Starting with Aerospike Database 7.2, the active-rack configuration mentioned above allows two equally sized racks (*with no tie-breaker) to survive a rack-down event, with all partitions available in the designated active-rack when the passive rack is down.
If, on the other hand, the designated active-rack itself goes down, the surviving rack will be unavailable. To recover, the roster must be reset to use the remaining rack. This re-rostering also triggers migrations.
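The node-count reasoning behind the three-rack and unequal two-rack options can be sketched with a simple majority check. This is only an illustration: the function name, the layout names, and the rack sizes are hypothetical, active-rack is not modeled, and the check assumes that rack awareness has already placed a copy of every partition in each surviving rack.

```python
def survivors_have_majority(rack_sizes, failed_rack):
    """Return True if the nodes left after losing failed_rack form a strict majority."""
    total = sum(rack_sizes.values())
    remaining = total - rack_sizes[failed_rack]
    return 2 * remaining > total


# Hypothetical layouts, expressed as rack name -> number of nodes.
LAYOUTS = {
    "three equal racks": {"r1": 3, "r2": 3, "r3": 3},
    "two unequal racks": {"r1": 3, "r2": 2},
    "two equal racks": {"r1": 3, "r2": 3},
}

if __name__ == "__main__":
    for name, racks in LAYOUTS.items():
        for rack in racks:
            ok = survivors_have_majority(racks, rack)
            print(f"{name}: lose {rack} -> survivors have majority: {ok}")
```

With two equal racks neither side ever holds a strict majority, which is the gap that active-rack or a tie-breaker is meant to close.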

Maintenance Scenario:

Planned rack-by-rack maintenance (no tie-breaker node needed):

Configure a rack as the designated active-rack.
Quiesce the nodes of the passive rack.
Take down the nodes of the passive rack and perform the needed maintenance.
The cluster remains fully available, assuming the number of nodes remaining in the cluster is greater than or equal to the replication factor.

AZ/Zone Failure Scenarios:

The designated ‘active-rack’ has a failure event that takes its nodes offline abruptly:

The remaining rack becomes unavailable.
Change the active-rack configuration dynamically to point to the remaining rack.
Re-roster the Strong Consistency cluster so that its roster contains only the nodes of the remaining rack.
Remediate the issue on the affected rack and bring the downed nodes back into the cluster as soon as possible.
Re-roster again to include both racks.

The passive rack has a failure event that takes its nodes offline abruptly:

The only remaining rack is the designated ‘active-rack’.
The cluster remains fully available, assuming the number of nodes remaining in the cluster is greater than or equal to the replication factor.
Remediate the issue on the affected rack and bring the downed nodes back into the cluster as soon as possible.

Notes

*Aerospike 5.2 introduced the stay-quiesced feature. This makes it possible to add a permanently quiesced node, known as a tie-breaker, to the cluster. A quiesced node does not own data but may still participate in partition availability decisions. This approach is less recommended because of the caveats described below.

If one of the standard racks goes down, there would be no data unavailability as each rack contains a full copy of the data. If this event were a network partition, making one rack unreachable to the other, using the tie-breaker node on a third rack would again help ensure availability. The cluster will automatically reconfigure to serve all writes and reads from the available rack thanks to the extra vote from the tie-breaker node, serving to make a majority cluster.

Before deploying with a tie-breaker, be aware of the following considerations:

1. The tie-breaker node needs to live in its own physical rack

In terms of zone failure fault tolerance, there is no advantage to using a tie-breaker node in a rack that is physically deployed on the same availability zone (AZ) as one of the active racks in the cluster. If this AZ were to go down, the tie-breaker node would go down with it, and the remaining rack would have all its partitions become unavailable.

As a result of deploying across three separate zones, the number of bridges between the racks is tripled (from A⟺B to A⟺B, A⟺C, B⟺C), which increases the chance of cross-zone failures.

2. Diverging configuration across the nodes

The nodes in the two active racks have roughly the same configuration, with the difference of rack-id and node-id, but the tie-breaker will have a distinctly different configuration.

3. The tie-breaker node will need to handle a network connection load

The tie-breaker node will need to handle heartbeat connections from all the cluster nodes, and a cluster tend connection from every client. It is simplest to use a node of the same class as the nodes in the active racks, though the tie-breaker will not require the same data storage capacity. It does, however, need to hold the same shared metadata (SMD) as other cluster nodes. The tie-breaker node may also temporarily issue proxy calls to other nodes on behalf of clients.

See the KB article Max FD reached on Tie-breaker.

4. If the tie-breaker becomes the cluster principal node, it will be responsible for propagating SMD files

The principal cluster node (the node with the highest alphabetically ordered node-id) is responsible for propagating SMD files to new cluster nodes, and SMD changes to existing cluster nodes. You should select a node-id for the tie-breaker that will prevent it from becoming the principal.
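A candidate node-id for the tie-breaker can be sanity-checked before deployment. The sketch below assumes node-ids are hexadecimal strings and compares them by numeric value, which matches alphabetical ordering when all IDs have the same length and case. The function names and the example IDs are hypothetical.

```python
def principal(node_ids):
    """Return the node-id that would rank highest, comparing IDs as hex numbers."""
    return max(node_ids, key=lambda node_id: int(node_id, 16))


def is_safe_tie_breaker_id(data_node_ids, candidate):
    """True if the candidate ranks below every data node.

    Ranking below all data nodes (not just below the current highest) keeps
    the tie-breaker from becoming principal while at least one data node is up.
    """
    lowest_data_node = min(int(node_id, 16) for node_id in data_node_ids)
    return int(candidate, 16) < lowest_data_node


if __name__ == "__main__":
    data_nodes = ["B1", "B2", "C1", "C2"]   # hypothetical node-ids
    print(principal(data_nodes + ["A1"]))
    print(is_safe_tie_breaker_id(data_nodes, "A1"))
    print(is_safe_tie_breaker_id(data_nodes, "D1"))
```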

5. Delaying fill-migrations may not work for some partitions when a tie-breaker is used

If one or more nodes are lost from a cluster where the whole roster was present, migrate-fill-delay should work as expected. However, if some nodes were already missing from the roster, meaning partitions have been shuffled beyond the shuffle already caused by the tie-breaker node, then quiescing or removing other nodes from the cluster causes some partitions to ‘fill migrate’ immediately, despite migrate-fill-delay being configured.

6. An edge case where the tie-breaker may need to take writes

Tie-breaker nodes are not intended to take reads or writes from the clients. There is, however, an important edge case to consider. If the cluster is very small, such that one rack plus the tie-breaker node does not add up to more nodes than the replication-factor (e.g., 1-node racks with RF=2, or 2-node racks with RF=3), then the tie-breaker node will take writes when one rack goes out, whether or not it is quiesced. This is because a quiesced node is moved to the end of the succession list, not removed from it completely, so it is still eligible to take writes if the replication-factor is high enough compared to the number of nodes in the cluster. Furthermore, because a quiesced node will not drop its data, if you do end up in this situation you will have to manually remove the data after the missing rack has been restored, migrations have completed, and the cluster is stable again.

If all strong consistency namespaces are configured with in-memory data storage on the tie-breaker node, then performing a cold restart of the tie-breaker node will be enough to clear out the data.
If any of the strong consistency namespaces are using persistent storage, you must add the configuration parameter cold-start-empty to the namespace stanzas of the node's configuration file before performing a cold start.
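The small-cluster condition in this edge case can be checked with simple arithmetic, and the succession-list behavior can be illustrated with a toy model. Both functions below are hypothetical sketches: they assume a single tie-breaker node and treat the first replication-factor entries of the reordered succession list as the nodes holding copies, which is a simplification of the real partition assignment.

```python
def tie_breaker_may_take_writes(nodes_per_rack, replication_factor):
    """True if one surviving rack plus the tie-breaker does not exceed the RF."""
    surviving_nodes = nodes_per_rack + 1  # one rack plus the tie-breaker
    return surviving_nodes <= replication_factor


def replica_holders(succession, quiesced, replication_factor):
    """Move quiesced nodes to the end of the list, then take the first RF entries."""
    active = [node for node in succession if node not in quiesced]
    tail = [node for node in succession if node in quiesced]
    return (active + tail)[:replication_factor]


if __name__ == "__main__":
    # 1-node racks with RF=2: after losing a rack, only A1 and the tie-breaker remain.
    print(tie_breaker_may_take_writes(1, 2))
    print(replica_holders(["TB", "A1"], {"TB"}, 2))
    # 3-node racks with RF=2: the surviving rack alone can hold every copy.
    print(tie_breaker_may_take_writes(3, 2))
    print(replica_holders(["TB", "A1", "A2", "A3"], {"TB"}, 2))
```

In the toy model the quiesced tie-breaker (TB) only appears among the replica holders when the surviving rack has fewer nodes than the replication-factor, which is the situation the paragraph above warns about.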

More details

Our blog post on tie-breaker nodes for clusters using strong consistency explains how a permanently quiesced tie-breaker node works in a cluster with data across two racks.

Applies To Earliest Version

5.2

Applies To Latest Version

Current Version