
Observing XDR lag on a single node with retry_dest count increasing due to one rogue large record

Problem Description

With XDR configured between two clusters, lag is observed on one of the source cluster nodes and the retry_dest count increases continuously. However, no errors are reported in the logs, even with detail logging enabled.


Explanation

XDR lag was observed on a single node, along with an increasing retry_dest metric. This statistic represents the number of records retried because of a temporary error returned by the destination node. Because the destination node responds with a specific error code, such errors are not network related; examples include key busy and device overload. In this case, however, no error from the destination appeared in the logs, even with detail logging enabled for XDR.

The in_queue on that node repeatedly climbed past the transaction-queue-limit of 16K and then dropped to zero. Because retries caused by destination errors were continuous, the likely cause was a single record, or a small number of records, on that node.

The following command runs tcpdump to capture the digests that appear most frequently on the network. The "greater 100000" filter limits the capture to packets of at least 100000 bytes on port 3000:

sudo tcpdump -n -K -vvv -XX -s0 port 3000 and greater 100000 -i any|egrep -v 'IP|length'|cut -d ' ' -f 3-10|xargs|sed -e 's/ //g'|grep -oE '\00001504[[:alnum:]]{40}'

One record, which was large in size, appeared frequently in the capture.
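To make a frequently repeated digest stand out, the extracted matches can be counted and ranked. The following is a minimal sketch, assuming the matches have already been saved one per line; the function name rank_digests and the file name digests.txt are hypothetical and not part of the original procedure.

```shell
# Rank extracted digest matches by how often they occur.
# Input: one match per line on stdin.
# Argument 1 (optional): how many of the most frequent entries to show (default 10).
rank_digests() {
    # sort groups identical lines, uniq -c counts each group,
    # sort -rn puts the highest counts first, head keeps the top entries.
    sort | uniq -c | sort -rn | head -n "${1:-10}"
}
```

For example, "rank_digests 5 < digests.txt" prints the five most frequent matches with their counts. A digest whose count is far higher than the rest is a candidate for the record being retried.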


Solution

To resolve the lag, identify the digest of the problem record from the tcpdump capture and delete that record.
XDR lag will then come down and the cluster will resume normal operation.

Applies To Earliest Version

Pre 4.9

Applies To Latest Version

Current Version