Articles in this section

What are the different delays affecting an XDR transaction?

Detail

XDR has the following parameters influencing how fast records get asynchronously shipped (some are not configurable in version 5.0 but would be in the upcoming version 5.1 (at the time of writing this article) .


Answer

1) period-ms – DC thread’s period

The period-ms delay is the run period of the XDR DC Manager thread. Its default is set to 100ms. The XDR DC thread is responsible for sequentially processing all the partitions at the source node for a specific destination cluster (referred to as DC or data center). Processing a partition typically consists of processing all pending entries in the XDR transaction queues (one per partition) but could also process entries from retry queues (if any pending in those) or instructing the DC recovery thread to proceed with the recovery process for relevant partitions (if recoveries are ongoing – when XDR is not able to keep up for a potential variety of reasons).

This would be the typical source of any delay in shipping a write transaction. As the DC thread goes sequentially through each partition, based on when in a cycle a write transaction occurs, the DC thread could process the transaction quasi instantly (based on the other parameters discussed in this article) but could also process it only on the next cycle which should be up to period-ms.

Ignoring other parameters at this point, the time for a write transaction to be processed by the DC thread would be: between 0 and period-ms milliseconds (100ms by default)

Notes:

  • This is only valid if the DC thread is able to process all partitions faster than the configured period-ms. If the DC thread lap_us (which measures, in microseconds, how long a full lap takes) is higher than the period-ms, it would cause extra delays. Being in recoveries_pending  mode or being throttled due to entries pending in the retry queue would also cause extra delays.

  • Having period-ms configured too low may cause too many passes without much actual processing being done and could adversely impact the overall performance of a node due to spending extra unnecessary CPU cycles. A minimum of 25ms would typically be recommended for use cases requiring less than 100ms.

  • When measuring the overall time it takes for a record to reach a destination after it has been written at the source (based on the last update time (LUT), there may be additional latency also noticed in cases of low throughput. XDR may not have yet established connections for all the service threads which would add to the delay in shipping. This latency may be more prevalent in a test environment with low throughput than production with ongoing high throughput traffic. A workaround, for testing, would be have a warm up workload to eliminate this connection initialization latency.

  • In version 5.0, the period-ms configuration parameter can only be set dynamically at runtime. It is statically configurable in version 5.1 and above . Example asinfo command to set period-ms to 25ms:

asinfo -v 'set-config:context=xdr;dc=<DC_NAME>;period-ms=25'
  • period-ms could be tuned to speed up XDR recoveries. When XDR is in recovery mode, the recovery thread goes and gets the digests and places them on a recovery queue per partition.  The DC manager controlled by the period-ms then picks up those digests and we generate ship requests for the service threads to ship them.  


2)
delay-ms – XDR ‘artificial’ delay

The delay-ms, also called ‘artificial delay’ can be set between 0 and 5000ms. It defaults to 0 and must also be lower than the hot-key-ms. The delay-ms, when configured, applies to all entries to be processed by a DC thread and forces those to not be processed until they have been waiting for the configured delay-ms. This could be useful to prevent the same record to be shipped too frequently by forcing entries to stay in the XDR transaction queue longer. This gives a chance to potential subsequent write transactions to not have to be separately processed if a predecessor is still in the XDR transaction queue.

Time for a write transaction to be processed by the DC thread: delay-ms (0ms by default) + anywhere between 0 and period-ms milliseconds (100ms by default)

For example, with a delay-ms of 50 and the default period-ms of 100, records would take anywhere between 50 and 150ms to be processed by the DC thread. For the time it would take a record to reach its destination, this time would be added to the time it takes to read the record locally (typically negligible if reading from the post-write-queue as well as the link latency to the remote cluster.



3)
sc-replication-wait-ms – XDR SC delay

The sc-replication-wait-ms only applies to strong-consistency enabled (SC) namespaces. This is hard coded to 100ms. This SC specific delay is to prevent records from being shipped before fully replicated. In order to prevent unnecessary attempts to process a record that has not replicated yet, this delay is always added. Its mechanism is identical to the delay-ms configuration parameter. The main difference being that delay-ms is set by default to 0 whereas for an SC namespace it would be necessary and always beneficial to have the sc-replication-wait-ms configured to a few milliseconds at least in order to maximize the chances of replication to have completed before processing a record.

Time for a write transaction in an SC namespace to be processed by the DC thread: max(delay-mssc-replication-wait-ms) + anywhere between 0 and period-ms milliseconds (100ms by default)

Therefore, by default, in an SC namespace, records would take anywhere between 100ms and 200ms to be processed (under normal circumstances, i.e, no recoveries or flooding retries, etc.).

Notes:

  • Recap on SC transactions: write transactions in SC must be acknowledged by all replicas before a node acks back to the client. In an SC namespace, records have 2 bits of the primary index reserved for the replication flag with the following possible states: OKReplicatingRe-replicating and Unreplicated. In the case of a normal SC write (RF=2) and no timeouts or re-replications:

    • starting state on master is OK.
    • state on master goes to Replicating when in process of writing to the replica(s).
    • replica(s) send an ack to master when the record is successfully written which would switch the state on the master copy to OK.
    • master sends an OK state to the replica(s).
    • in case of failure to replicate, the state would switch to Unreplicated.
  • If a record is still in Replicating state, it will be moved the retry queue and XDR would keep trying it indefinitely from there (every DC thread’s lap).

  • If a record is in Unreplicated state, XDR will trigger re-replication. The re-replication will be like a new write (with a new LUT) which would trigger this whole process again, as a new write.

  • It is important to avoid having XDR attempt to process records which are in the Replicating state unless there was an unexpected slow down or disruption causing records to stay in the Replicating state for longer than expected. Encountering a record in a ‘natural’ Replicating state, meaning it is still replicating under normal circumstances, there is a very good chance that the next record in the queue will also be in the Replicating state (assuming there is a continuous flow of writes). This would end up wasting CPU on retrying a lot of records. It is therefore necessary to judiciously configure the sc-replication-wait-ms if the default is not adequate.

  • This setting would not play much of a role during XDR recovery. If an unreplicated record is encountered during the Primary Index  scan from the XDR recovery thread,  it will be queued up for re-replication and go into the XDR retry queue and transaction queue after it replicates successfully. The XDR Transaction Queue would not get processed until the XDR Recoveries have been completed for the partition. The XDR Retry Queue would get processed during recoveries.

4) hot-key-ms – XDR Hotkey separation time

The hot-key-ms parameter is actually very different from the previous three as it does not involves a delay for shipping but controls the frequency at which hot-keys are processed by XDR. It is set to 100ms by default and dictates how deep to check for an existing entry in the XDR transaction queue for an incoming write transaction. This configuration parameter would help situations where the XDR transaction queue is potentially large and avoid having to check across the whole queue for a potential previous entry corresponding to the current incoming transaction. Such check could be slow and the hot-key-ms provides a limit for how far in the past to check in the current XDR transaction queue. This parameter can also be summarized as ‘the maximum frequency at which a hot-key would be shipped’.


Notes

Always test prior to tuning and changing default values that may impact a production use case.


Applies To Earliest Version

5.0

Applies To Latest Version

Current Version
Was this article helpful?
0 out of 0 found this helpful