Detail
The following message may appear in the logs, either occasionally or as a continuous stream:
found redundant connections to same node, fds 210 209 - choosing at random
Answer
By itself, this message indicates that the Aerospike heartbeat protocol (HB) has found two connections open to the same destination node. Normally there should be only one connection open to each node. To deal with this, Aerospike prints the message above and picks one of the available file descriptors (fds), i.e. connection handles, at random.
To see why this might happen, a wider range of log messages must be investigated, typically those associated with the HB protocol.
The example below shows messages that may be associated with having redundant connections:
Jun 04 2019 21:15:42 GMT: WARNING (socket): (socket.c:959) Error while connecting socket to 172.17.0.7:3002
Jun 04 2019 21:15:42 GMT: WARNING (hb): (hb.c:4882) could not create heartbeat connection to node {172.17.0.7:3002}
Jun 04 2019 21:15:42 GMT: WARNING (socket): (socket.c:891) Timeout while connecting
These messages indicate a communication issue between the nodes that is severe enough to affect the heartbeats between them, so connectivity must be checked and investigated. This does not mean the nodes are unable to communicate at all; it more likely indicates that they intermittently lose connections and/or TCP packets.
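When these warnings are frequent, it can help to tally them per destination node to see which peers are affected. The following is a minimal Python sketch, not an Aerospike tool. It assumes the message wording matches the examples in this article; the function name summarize_hb_warnings is made up for illustration.

```python
import re
from collections import Counter

# Matches the heartbeat warning shown in this article and captures the
# destination address inside the braces.
HB_FAIL = re.compile(
    r"could not create heartbeat connection to node \{(?P<addr>[^}]+)\}"
)
REDUNDANT = "found redundant connections to same node"


def summarize_hb_warnings(lines):
    """Return (failed connections per destination, redundant-connection count)."""
    failures = Counter()
    redundant = 0
    for line in lines:
        match = HB_FAIL.search(line)
        if match:
            failures[match.group("addr")] += 1
        elif REDUNDANT in line:
            redundant += 1
    return failures, redundant


# Sample input taken from the messages quoted in this article.
sample = [
    "found redundant connections to same node, fds 210 209 - choosing at random",
    "Jun 04 2019 21:15:42 GMT: WARNING (hb): (hb.c:4882) could not create "
    "heartbeat connection to node {172.17.0.7:3002}",
    "Jun 04 2019 21:15:42 GMT: WARNING (socket): (socket.c:891) Timeout while connecting",
]
failures, redundant = summarize_hb_warnings(sample)
for addr, count in failures.most_common():
    print(addr, count)
print("redundant connection messages:", redundant)
```

To run it against a real log, pass an open file object instead of the sample list; the log location depends on how the node is configured.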
The redundant connection message itself indicates that connectivity recovered while Aerospike was already creating replacement connections, which left Aerospike holding more than one working connection to a destination node. This is most often caused by intermittent connectivity issues, such as delayed packets.
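One rough way to look for intermittent delays is to open a TCP connection to the peer repeatedly and record how long each attempt takes. This is a sketch under assumptions, not an Aerospike utility: probe_tcp and summarize_probe are hypothetical names, and the host and port in the usage comment are simply the ones from the example log above. Each probe opens and immediately closes a connection, so choose a port you are comfortable probing.

```python
import socket
import time


def probe_tcp(host, port, attempts=10, timeout=1.0, pause=0.5):
    """Connect repeatedly; return connect times in seconds, None for failures."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # The connection is closed as soon as the with-block exits.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            # Covers timeouts, refused connections and unreachable hosts.
            results.append(None)
        time.sleep(pause)
    return results


def summarize_probe(results):
    """Return (number of failed attempts, slowest successful connect or None)."""
    ok = [r for r in results if r is not None]
    return len(results) - len(ok), (max(ok) if ok else None)


# Example usage (address taken from the log excerpt, adjust as needed):
# failed, slowest = summarize_probe(probe_tcp("172.17.0.7", 3002))
```

Occasional failures or a slowest connect time close to the timeout would be consistent with the delayed packets described above, though a clean run does not rule the problem out.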
Some things to be initially investigated include:
- Packet loss, dropped packets and overrun packet statistics on Linux, most commonly using ifconfig or ip link ls.
- Packet loss and sustained connectivity tests, most commonly using iperf.
- Any evidence of packet loss or other issues on routers and switches between the nodes.
- Kernel-level messages which may relate to the issue, in dmesg.
- Available bandwidth between the nodes.
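The interface counters in the first item are cumulative since boot, so a single reading does not show whether drops are still happening. The sketch below reads them from /proc/net/dev on Linux and compares two samples. The function names are hypothetical, and the "fifo" column is assumed to correspond to the overruns figure that ifconfig reports.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: {counter: value}}."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(v) for v in data.split()]
        if len(fields) < 16:
            continue
        # Receive columns are fields 0-7, transmit columns are fields 8-15.
        counters[name.strip()] = {
            "rx_errs": fields[2], "rx_drop": fields[3], "rx_fifo": fields[4],
            "tx_errs": fields[10], "tx_drop": fields[11], "tx_fifo": fields[12],
        }
    return counters


def counter_deltas(before, after):
    """Return only the counters that changed between two samples."""
    deltas = {}
    for iface, now in after.items():
        prev = before.get(iface)
        if prev is None:
            continue
        changed = {k: now[k] - prev[k] for k in now if now[k] != prev[k]}
        if changed:
            deltas[iface] = changed
    return deltas


# Example usage on a Linux node:
# import time
# first = parse_net_dev(open("/proc/net/dev").read())
# time.sleep(10)
# second = parse_net_dev(open("/proc/net/dev").read())
# print(counter_deltas(first, second))
```

An empty result means none of the tracked error, drop or fifo counters moved during the interval on that node; counters on switches and routers still need to be checked separately.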
Note that a connectivity issue may not necessarily indicate network problems. Whilst this is the most common root cause, the issue could be within just one node.
In a scenario where only a single node has connectivity problems, that node could have a range of problems resulting in missed heartbeats. These include, but are not limited to:
- hardware or driver faults, most commonly found by checking dmesg
- overloaded hard disks resulting in CPU interrupts hanging on requests and affecting the network, most commonly checked using iostat
- misconfigured kernel parameters or Aerospike node
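For the disk item, a rough equivalent of the utilization figure that iostat reports can be derived from two samples of /proc/diskstats on Linux. This is only a sketch: the function names are hypothetical, and it relies on the standard layout in which the tenth statistic after the device name is the cumulative time spent doing I/O, in milliseconds.

```python
def parse_diskstats(text):
    """Map device name to cumulative milliseconds spent doing I/O."""
    busy = {}
    for line in text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then at least 11 statistics.
        if len(fields) >= 14:
            busy[fields[2]] = int(fields[12])
    return busy


def utilization(before, after, elapsed_seconds):
    """Percent of the interval each device spent busy with I/O."""
    elapsed_ms = elapsed_seconds * 1000.0
    return {
        dev: 100.0 * (after[dev] - before[dev]) / elapsed_ms
        for dev in after
        if dev in before
    }


# Example usage on a Linux node:
# import time
# first = parse_diskstats(open("/proc/diskstats").read())
# time.sleep(5)
# second = parse_diskstats(open("/proc/diskstats").read())
# print(utilization(first, second, 5))
```

Values that stay close to 100 for a device suggest it is saturated, which is the overloaded-disk condition described in the list above.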