Tailslayer: Library for reducing tail latency in RAM reads
TL;DR Highlight
This C++ library implements the hedged-read technique: it reduces the worst-case latency (tail latency) of RAM reads caused by DRAM refresh timing conflicts by replicating data across independent DRAM channels and serving the result from whichever channel responds first.
Who Should Read
Developers building high-performance systems (HFT, real-time processing, interrupt handlers, etc.) in C++ where nanosecond to microsecond latency is critical. Low-level systems engineers who tune DRAM-level memory access patterns or are interested in memory architecture.
Core Mechanics
- DRAM periodically performs a 'refresh' operation to maintain data, and if a read request overlaps with this timing, an additional delay (stall) of hundreds of nanoseconds occurs. This is one of the main causes of tail latency.
- Tailslayer replicates the same data across multiple independent DRAM channels and, when a read request arrives, sends reads to all channels simultaneously (hedged read). It reduces the probability of refresh stalls by using the result from the first responding channel.
- The key technique leverages the fact that the refresh schedules between DRAM channels are independent (uncorrelated). Even if one channel is refreshing, other channels are likely to respond normally.
- The key enabler is reverse-engineered, undocumented channel scrambling offsets for AMD, Intel, and AWS Graviton hardware, which are used to place replicas in truly independent channels. This is the most technically challenging part of the work.
- The publicly available library currently only supports 2-way replication, but the benchmark code implements N-way replication. Usage involves passing a signal function (returning the index to read) and a work function (processing the read value) as template parameters.
- It features a C++ template-based API, where you create `HedgedReader<T, signal_fn, work_fn>`, insert data, and then call `start_workers()` to have background workers handle the hedged reads. Core pinning is also supported.
- There are clear trade-offs. Replicating data to N channels increases memory usage by up to N times. One comment mentioned that the base load latency itself increases to around 800 cycles, sacrificing median latency in favor of reducing tail latency.
Evidence
- Commenters praised the detailed explanation of how memory addresses are mapped to DRAM channels, ranks, and banks, since this low-level information is rarely covered.
- One comment pointed out the cache hit rate issue. Replicating data increases the working set, leading to more cache misses, and questioned whether the performance degradation due to this would offset the benefits of hedged read.
- There was sharp criticism that the README and header files do not mention the trade-offs at all: the fact that the base load latency is around 800 cycles, meaning median latency increases significantly, is effectively hidden. Separately, one rebuttal called the video's claim that 'Graviton has no performance counters' completely false.
- The IBM zEnterprise platform was cited as a point of comparison: it steers loads to non-refreshing banks to hide refresh latency entirely, at a space overhead of only 50%, whereas the Tailslayer approach can waste up to 92% of the space.
- The possibility of applying it to ML model inference was mentioned. Partitioning multiple parallel ML models by channel could guarantee that certain models always read fast data and others always read slow data. However, it was also noted that ML model weights often reside in the L3 cache, which may limit its effectiveness.
How to Apply
- If you are developing a high-frequency trading (HFT) system or a real-time interrupt handler in C++ where nanosecond-level tail latency is critical, you can copy `include/tailslayer` to your project and wrap frequently read small lookup tables with `HedgedReader`. It is suitable for workloads where a 2x memory usage increase is acceptable and reducing tail latency is more important than increasing median latency.
- Before applying it, you should first verify whether DRAM refresh stalls are the cause of tail latency in your actual workload. Most applications have high L1/L2/L3 cache hit rates, so DRAM access is rare. To see the effect of this library, you need a pattern of random access to data that does not fit in the cache.
- For platforms other than AMD/Intel/Graviton, the private channel scrambling offsets may be different. It is safe to first explore the channel mapping of the hardware using the code in the `discovery/` directory and then decide whether to apply it.
Code Example
#include <tailslayer/hedged_reader.hpp>

[[gnu::always_inline]] inline std::size_t my_signal() {
    // Return the index to read after waiting for an event
    return index_to_read;
}

template <typename T>
[[gnu::always_inline]] inline void my_work(T val) {
    // Process the read value
}

int main() {
    using T = uint8_t;
    // Pin the current thread to the main core
    tailslayer::pin_to_core(tailslayer::CORE_MAIN);
    // Provide the signal function and work function as template parameters
    tailslayer::HedgedReader<T, my_signal, my_work<T>> reader{};
    // Replicate the same data to two DRAM channels
    reader.insert(0x43);
    reader.insert(0x44);
    // Start background workers (handle hedged reads)
    reader.start_workers();
}

Terminology
tail latency: The worst-case response time, not the average. For example, p99 latency is the time of the slowest 1 out of 100 requests, and a spike in it can degrade the user experience.
DRAM refresh: DRAM stores data as charge in capacitors and must periodically recharge (refresh) them before the charge leaks away. During a refresh, access to the affected rows is temporarily stalled.
hedged read: A technique that replicates the same data in multiple places and sends read requests to all of them simultaneously, using the result from whichever responds first; the remaining requests are canceled. The same idea is used in cloud storage.
channel scrambling: The way the memory controller maps physical addresses to DRAM channels. It is usually undocumented and differs between hardware, and you need to know it to place data in the intended channels.
working set: The set of memory pages a program actually accesses during a given time period. Performance degrades when its size exceeds the cache capacity.
core pinning: Fixing a specific thread to a specific CPU core. This eliminates core-migration overhead and keeps latency consistent.