Context
XDR lag is a popular metric to track XDR shipping performance. This article reviews how to troubleshoot XDR lag in Aerospike 4.9 and earlier.
Method
Let’s first revisit the definition of the XDR time lag:
The xdr_timelag is the delta between the timestamp of the last record which was attempted to ship and the current time.
Another way of wording this is:
The xdr_timelag 17 is the time a record is waiting at the source before being attempted to be shipped.
There are a variety of reasons that would lead to the lag increasing. Picking an example, we can review the things to check.
Aug 13 2016 06:25:41 GMT: INFO (xdr): (xdr.c:2027) sh 31465282087 : ul 208 : lg 20736463319 : rlg 18682259 : lproc 20715167324 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15529572 : hkskip 1906225345 1107652918 : flat 0 ... Aug 13 2016 06:36:33 GMT: INFO (xdr): (xdr.c:2027) sh 31465291084 : ul 98 : lg 20736883719 : rlg 18682834 : lproc 20715173224 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15530147 : hkskip 1906225404 1107652977 : flat 0 ... Aug 13 2016 06:38:13 GMT: INFO (xdr): (xdr.c:2027) sh 31465292421 : ul 23 : lg 20736913519 : rlg 18682925 : lproc 20715173924 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15530238 : hkskip 1906225410 1107652982 : flat 0
In those log lines, it can be seen that rlg and errsrv are increasing. These metrics record the number of records being relogged into the digest log and the number of errors generated when the remote cluster rejects the write from XDR. An error logged under errsrv indicates the error has been reported after the C client used by XDR has executed the put or delete operation on the remote cluster. It is also apparent that the delta between rlg values (575 and 91) is the same as the delta between errsrv. The following conclusions can be drawn:
- XDR write operations are being rejected by the remote server
- There is a 1:1 correlation between these rejected operations and relogged records
The question then becomes, why are XDR writes being rejected? An examination of the logs on the remote cluster is necessary. The first stage of any log check should be a simple check for warnings.
$grep -i warn aerospike.log
aerospike.log:Aug 13 2016 06:20:12 GMT: WARNING (namespace): (namespace.c:440) {xdr_target} hwm_breached false, stop_writes true (memory), memory sz:108724019840 (108724019840 + 0) hwm:130459631616 sw:108716359680, disk sz:2073798377600 hwm:3006477107200
The answer to the question of why XDR lag is increasing is immediately evident. There are nodes in the remote cluster that have hit stop_writes and as a result, when XDR tries to write to that cluster, the write is rejected. XDR then increments errsrv (xdr_ship_destination_error stat) and relogs the record in the digest log, incrementing rlg (dlog_relogged stat). The XDR lag increases as the timestamp of the last successfully shipped record remains constant while the current time moves forward, and lag represents the difference between these two times.
In this case, the solution is to correct the issue with the remote cluster, by looking into why stop_writes is true. Although, in certain cases, monitoring lag can be a useful metric, in this instance the issue would have been more apparent if there was tracking of xdr_ship_destination_error and dlog_relogged metrics, these would have pointed to the real root issue much more efficiently possibly. XDR lag is - at best - a top-level metric and in some cases is actively disingenuous.