Context
Older versions of Aerospike had an issue where the rack-aware rules for master and replica placement could be broken. In this scenario, the master and the prole (replica) for a specific partition could end up on the same rack. (Fixed in AER-6726.)
Method
We can get a dump of partition-info from each node to verify partition-id ownership.
Steps:
1) Get partition-info from each node and save it to an individual file per node. The file name should include the rack-id of the node (e.g. rack_id_1), because the commands below filter on it.
2) Run the following commands to count duplicate partition IDs, per namespace, per rack.
$ egrep 'S:2:1|S:2:0' T* |awk -F ':' '{print $3,$1,$2,$4":"$5":"$6}'|grep rack_id_2|awk '{print $1" "$4}'|sort|uniq -c|sort -k 2 -n > rack_id_2_partitions.txt
$ egrep 'S:2:1|S:2:0' T* |awk -F ':' '{print $3,$1,$2,$4":"$5":"$6}'|grep rack_id_1|awk '{print $1" "$4}'|sort|uniq -c|sort -k 2 -n > rack_id_1_partitions.txt
Or, to print only the rows whose count exceeds the number of namespaces (replace NAMESPACECOUNT with that number):
$ egrep 'S:2:1|S:2:0' TIGER* |awk -F ':' '{print $3,$1,$2,$4":"$5":"$6}'|grep rack_id_1|awk '{print $1" "$4}'|sort|uniq -c|sort -k 2 -n |awk '{if ($1 > NAMESPACECOUNT) print $0}'
$ egrep 'S:2:1|S:2:0' TIGER* |awk -F ':' '{print $3,$1,$2,$4":"$5":"$6}'|grep rack_id_2|awk '{print $1" "$4}'|sort|uniq -c|sort -k 2 -n |awk '{if ($1 > NAMESPACECOUNT) print $0}'
The count in the first column should equal the number of namespaces (e.g. 4 for four namespaces).
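The pipelines above can be wrapped in a small helper so that the rack number and namespace count are passed as arguments instead of being edited by hand. This is only a sketch: the function name rack_partition_counts is hypothetical, and it assumes, as the commands above do, that each dump file name contains rack_id_<N> and that each record is on its own line in the form namespace:partition-id:state:n_replicas:replica:...

```shell
#!/bin/sh
# Sketch only. Assumptions: file names contain "rack_id_<N>" and every
# partition-info record sits on its own line as ns:pid:state:n_replicas:replica:...

# usage: rack_partition_counts RACK_NUMBER NAMESPACE_COUNT FILE...
# Prints "count partition-id role" rows whose count exceeds NAMESPACE_COUNT.
rack_partition_counts() {
    rack="rack_id_$1"
    ns_count="$2"
    shift 2
    # -H keeps the file name prefix even when a single file is passed.
    grep -H -E 'S:2:1|S:2:0' "$@" |
        awk -F ':' '{print $3, $1, $2, $4":"$5":"$6}' |
        grep "$rack" |
        awk '{print $1" "$4}' |
        sort | uniq -c | sort -k 2 -n |
        awk -v n="$ns_count" '$1 > n'
}
```

For example, rack_partition_counts 1 4 TIGER* would correspond to the rack_id_1 command with NAMESPACECOUNT set to 4. Note that, like the original grep, the pattern rack_id_1 would also match a file named rack_id_10.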
Notes
Internal KB for AER-6726
Reproduction environment:
All that is needed is a 5-node cluster with 3 racks:
Rack 1 node-ids: A1, A2, A3
Rack 2 node-id: B1
Rack 3 node-id (stay-quiesced true): C1
- No data is needed in the cluster.
- A2 was quiesced and migrations were allowed to complete, while C1 stayed permanently quiesced.
- At the end, 269 partitions had both master and replica on Rack 1.
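A condition like the one reproduced here (master and replica of the same partition on one rack) can also be checked directly from the partition-info dumps. The following is a minimal sketch, not the method used for the count above: the function name same_rack_violations is hypothetical, and it assumes replication factor 2, file names containing rack_id_<N>, and one record per line in the form namespace:partition-id:state:n_replicas:replica:...

```shell
#!/bin/sh
# Sketch only. Assumptions: replication factor 2, file names contain
# "rack_id_<N>", records look like ns:pid:state:n_replicas:replica:...

# usage: same_rack_violations FILE...
# Prints "rack namespace partition-id" for every partition whose master
# (replica index 0) and prole (replica index 1) are both in the same rack.
same_rack_violations() {
    awk -F ':' '
        $3 == "S" && $4 == 2 && ($5 == 0 || $5 == 1) {
            # Derive the rack label from the name of the file being read.
            if (match(FILENAME, /rack_id_[0-9]+/) == 0) next
            rack = substr(FILENAME, RSTART, RLENGTH)
            key = rack " " $1 " " $2
            if ($5 == 0) master[key] = 1
            else prole[key] = 1
        }
        END {
            for (key in master)
                if (key in prole) print key
        }' "$@" | sort -k 1,1 -k 2,2 -k 3,3n
}
```

Piping the output through wc -l gives the number of affected partitions, which would be the figure to compare against the count reported above.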