2026-05-18

Same queue API. Different tail by orders of magnitude.

End-to-end enqueue→dequeue, market-data thread to strategy thread. Lock-free SPSC vs mutex + condvar. At moderate offered load, all three variants look similar at the median. The tail and the breakdown rate are where they diverge.

A market-data thread produces 16-byte ticks; a strategy thread consumes them. End-to-end enqueue-to-dequeue across three implementations of the same queue API — a hand-rolled lock-free ring, the same ring from Boost, and a std::queue protected by a mutex and condvar. Same hardware, same item, same threading topology. At moderate offered load all three look similar at the median; the tail and the breakdown rate are where they diverge by orders of magnitude.

Setup

Three variants on the same 16-byte MarketTick. We measure latency from the moment the producer stamps the item to the moment the consumer has it in hand — the full journey across the queue:

  • lockfree-handrolled — ~50-line ring buffer, 1024 entries, power-of-2 bitmask indexing. std::atomic head and tail, each on its own 64-byte cache line — a PaddedAtomic<T> wrapper with static_assert(alignof(PaddedAtomic<T>) == 64) makes the false-sharing cost measured in demo 2 a compile-time invariant rather than a discipline to remember. acquire/release ordering only — no seq_cst in the hot path. Non-blocking try_push/try_pop; the harness spins on false.
  • lockfree-boostboost::lockfree::spsc_queue<MarketTick, capacity<1024>>, compile-time fixed size. Header-only. Exists as a sanity check on the hand-rolled implementation, not a competing entry.
  • mutex-condvarstd::queue (bounded to 1024 entries via a paired cv_not_full condition variable) + std::mutex + std::condition_variable. Producer waits on cv_not_full while the queue is at capacity, then locks, pushes, and calls notify_one. Consumer waits on a predicate, pops, unlocks, then timestamps — symmetric with the lock-free path.

All three variants use a 1024-slot bounded queue. The mutex variant uses a paired condvar for not-full back-pressure, matching the lock-free back-pressure semantics.

Producer pinned to core 4, consumer to core 5 — both on the same CCX of the Ryzen 7 3800X, sharing an L3. Same-CCX only; cross-CCX is deferred.

The producer is paced at 1 M items/sec for the headline measurements, well below the sustainable throughput of any variant. The lock-free variants leave the queue effectively empty at this load; the mutex variant runs about 1.8 items deep on average. Reported latency reflects per-op queue cost plus (for the mutex variant) condvar wake-up time — depth contribution is under 2 µs even in the mutex case.

The code

Mutex + condvar (bounded)
Lock-free SPSC
// Producer
enq_ts[i] = rdtscp_ordered();
{
std::unique_lock<std::mutex> lk(mtx);
cv_not_full.wait(lk, [&]{ return q.size() < N; });
q.push(tick);
}
cv_not_empty.notify_one();

// Consumer
{
std::unique_lock<std::mutex> lk(mtx);
cv_not_empty.wait(lk, [&]{ return !q.empty(); });
tick = q.front();
q.pop();
}
cv_not_full.notify_one();
deq_ts[tick.seq] = rdtscp_ordered(); // after unlock
// Producer
enq_ts[i] = rdtscp_ordered();
while (!q.try_push(tick)) { /* spin */ }

// Consumer
MarketTick t;
while (!q.try_pop(t)) { /* spin */ }
deq_ts[t.seq] = rdtscp_ordered();

// spsc_queue.h hot path:
bool try_push(const T& item) noexcept {
const size_t tail = tail_.value.load(relaxed);
const size_t head = head_.value.load(acquire);
if (tail - head >= N) return false;
buffer_[tail & MASK] = item;
tail_.value.store(tail + 1, release);
return true;
}

Both sides timestamp before the push and after the pop, outside any synchronisation primitive. The mutex side unlocks before timestamping the dequeue — symmetric with the lock-free path where the timestamp follows try_pop.

The Boost variant is one line: boost::lockfree::spsc_queue<MarketTick, capacity<1024>> q. It's measured but not shown — the hot path is structurally identical to the hand-rolled version.

Headline — tail latency at 1 MHz offered load (CCDF)

Read right-to-left: a lower curve at any given latency means fewer samples are that slow or slower. The lock-free curves collapse toward the left; the mutex curve has a long tail extending right. At 1 MHz offered load the lock-free variants leave the queue effectively empty (≈0.1 items by Little's law). The mutex variant runs about 1.8 items deep on average — each item's wake-up cycle takes slightly longer than the inter-arrival period, so a small amount of residency accumulates. Even so, the latency gap is dominated by kernel-boundary cost rather than depth: queue residency contributes under 2 µs while the p99.9 difference between mutex and lock-free is tens of microseconds.

p99.9 is the number that matters for a latency-sensitive strategy. That's the sample that arrives during a fast market move, when being slow costs the most.

Distribution shape (PDF)

The lock-free distribution is sharply peaked — most items transit at nearly the same latency. The 16 KB ring buffer fits in each core's 32 KB L1d, and the inter-core handoff is serviced by the shared CCX L3 without crossing the Infinity Fabric. The mutex distribution is bimodal: a fast path when the consumer is already spinning on the condvar and wakes immediately, and a condvar-wake-up tail where the kernel becomes involved. The fast mode looks deceptively similar to lock-free; the tail is what separates them.

What happens as offered load rises

Each line shows p50, p99, and p99.9 across twelve log-spaced offered-load points from 100 kHz to 50 MHz. There are two different saturation mechanisms visible on this chart, and they're worth separating.

Two failure modes. Mutex and Boost fail by latency wall: queue fills toward capacity, residency time becomes queue_depth × consumer_period, p50 jumps by orders of magnitude. The hand-rolled variant fails by throughput ceiling: its producer caps at 14.7 M/s — below Boost's 28 MHz saturation point — and the queue never gets a chance to fill. Hand-rolled staying low at the right of the chart isn't because it's uniformly faster on every operation; its bottleneck moved sides. Same root cause in all three cases (consumer can't drain at the offered rate), three different surfaces it shows up on.

Mutex saturates around 9 MHz the way classical queueing theory predicts. The consumer can't drain fast enough, the queue fills toward its 1024-entry capacity, and latency climbs to queue_depth × consumer_period — about 190 µs once depth pins at the ceiling. The mutex p50 line above 9 MHz is a near-horizontal plateau because depth is also a near-horizontal plateau at capacity.

Boost saturates around 28 MHz by the same mechanism, just further to the right because its consumer is faster. (That 28 MHz is offered load — the rate the pacer requests; it's higher than the ~14.9 M/s Boost actually delivers at saturation, because past the saturation point the producer spends its time blocked on a full queue rather than pushing items through.) Queue depth climbs to ~200 items at the top of the swept range; p50 climbs to ~10 µs. The Boost consumer is roughly 3× slower than the hand-rolled one — the standard gap between a portable header-only library and ~50 lines tuned for one machine. boost::lockfree::spsc_queue templates on arbitrary T and is defensive about ordering. The hand-rolled variant pads head_ and tail_ to separate 64-byte cache lines (so the consumer's tail-load doesn't trigger coherence traffic from the producer's tail-write), picks acquire/release semantics with exactly one writer per atomic, uses a mask instead of modulo for power-of-two capacity, and assumes POD items. A 2-3× consumer-period gap between a portable lock-free ring and a hand-tuned one is the standard result in SPSC microbenchmarks, not a Boost defect.

The hand-rolled variant doesn't saturate this way at all. Throughput caps at ~14.7 M/s starting around 16 MHz offered load, but p50 stays near its 132 ns floor across the entire sweep — at 50 MHz target, p50 is still 264 ns and queue depth is still ~4. Its consumer drains items so quickly that the queue never deepens. The bottleneck is the producer side: each try_push plus the surrounding spin and timestamp costs more than the inter-arrival period the pacer is requesting, so the achieved rate caps at the producer's own throughput. Same outcome as the others — the variant can't keep up with the offered load — but expressed as a throughput ceiling rather than a latency wall.

The practical implication is the same in all three cases: offering more than the system can sustain produces dropped or delayed items. The mutex and Boost variants signal it through latency that climbs by orders of magnitude. The hand-rolled variant signals it through a hard throughput cap with latency that barely moves. Either way, sizing the consumer below the actual offered load matters more than implementation choice.

One artefact worth naming: the hand-rolled p99.9 line has a single spike at the 16 MHz sweep point — about 42 µs, against a baseline of ~200 ns either side. This is the transition load where the producer occasionally hits queue-full and spins on try_push long enough to be counted in the tail; at lower loads the queue never fills, at higher loads the producer is consistently pacer-limited by its own cost and the queue stays small. The spike is reproducible at this offered rate but transient across the sweep — present at one point, gone at the next.

One wrinkle visible on the left of the chart: the mutex p99.9 curve has a hump at very low offered load — its tail is worse at 100 kHz than at 1 MHz. This is the consumer sleeping deeper between items. Longer idle periods let the kernel deschedule the consumer thread further (and on Zen 2 with performance governor, even let it slide into shallower C-states), so wake-up cost on the next item is larger. It's a known property of condvar-based queues and an additional argument against them for sporadic workloads.

Throughput

Saturated throughput: the producer runs flat-out with no pacing. Hand-rolled and Boost cap at near-identical rates — 14.7 M/s and 14.9 M/s respectively, within 1.4% — but for opposite reasons, and the near-match is a coincidence. Hand-rolled is producer-bound: its consumer drains faster than the producer can push, the queue stays near empty, and the ceiling is the producer's own per-iteration cost (timestamp, try_push, spin). Boost is consumer-bound: its consumer is the ~3× slower one (see the load-sweep section above), so the queue fills, try_push starts failing and spinning, and the producer is throttled down to Boost's consumer drain rate. The 14.7-vs-14.9 agreement is two unrelated numbers — one producer rate, one consumer rate — happening to land close together, not evidence of a shared bottleneck. That difference is exactly what the queue-depth gap shows: Boost runs with its queue near capacity at saturation, hand-rolled with it near empty, so an item resident in Boost's queue waits ~200× longer end-to-end. Same throughput, very different residency — and very different reason for the ceiling.

Mutex caps lower, at ~5.7 M/s — about 2.6× slower than lock-free. That's the "small multiple" rate gap: small enough that a throughput-only view would undersell how different these queues actually are. The tail-latency view above is where the gap explodes by orders of magnitude.

A nuance for the Boost number: the steady-state ~14.9 M/s shown here is slightly below its transient peak — the load sweep shows Boost briefly reaching ~19.8 M/s when the queue is at intermediate depth, before settling lower as the queue fills. Hand-rolled doesn't show this gap because its queue never fills. Mutex doesn't show it because both steady-state and transient rates are bound by condvar wake-up cost, which is depth-independent.

The lock-free win is about determinism, not raw rate.

Why the tail collapses

At low offered load, the mutex cold path fires the kernel when the consumer must wait. When cv.wait blocks:

  1. A futex syscall from notify_one — kernel entry and return.
  2. A scheduler decision: when does the consumer thread actually run?
  3. Context restoration on the consumer core.
  4. Cache cold effects if the consumer has been executing other work.

None of that happens on the lock-free path. The consumer spins until it sees the producer's tail_ store, then loads the item. On Zen 2, with both cores on the same CCX sharing an L3, that cache-coherence round-trip plus the buffer load and store completes in roughly 130 ns — the floor visible in the p50 numbers above.

At high offered load, the tail grows further for a different reason: once the queue saturates, the effective latency is queue_depth × consumer_period. This is the mechanism behind the hockey-stick in the load-sweep chart above, not a change in the per-op cost. The two saturation regimes are genuinely different and worth labelling separately.

Boost comparison

boost::lockfree::spsc_queue is a production-quality implementation with the same structural properties as the hand-rolled version: power-of-2 capacity, acquire/release atomics, cache-line-aware layout. Under 1 MHz paced load, the hand-rolled implementation comes in at 132 ns p50 vs Boost's 122 ns — within 8%. The p99.9 gap is wider — 172 ns vs 148 ns, about 16% — but both are still orders of magnitude inside the mutex variant's tail. That's the credibility check: the hand-rolled version isn't leaving anything obvious on the table relative to a battle-tested production library, and conversely Boost isn't doing anything clever in its hot path that the 50-line version misses.

What this doesn't show

  • Cross-CCX: producer and consumer on separate CCX slices. Cache-coherence traffic crosses the Infinity Fabric; latency increases. Deferred to a future topology post.
  • MPSC / MPMC: single-producer single-consumer only.
  • Offered loads near saturation exhaustively: the load-sweep section covers a 500× range from well below to well past saturation. Behaviour between sweep points is interpolated, not measured.
  • Fixed item size: 16-byte MarketTick. Queue behaviour may differ with larger items (cache-line straddling, prefetch patterns).
  • The spin-check-without-condvar mutex variant: separating wake-up cost from lock cost is a different experiment and a different post.

Takeaway

The lock-free SPSC ring buffer wins on the shape of the tail. At 1 MHz offered load the lock-free variants leave the queue effectively empty; even the mutex variant only carries ~1.8 items on average, so the tens-of-microseconds latency gap is kernel-boundary cost, not queue depth. The mutex fast path is fast — its cold path involves a syscall and a scheduler decision, and those outliers appear precisely when the market is moving fastest. The load-sweep chart shows where that breakdown begins: the mutex p99 turns upward well before the lock-free variants, because the scheduler can't keep pace with the producer long before the lock-free consumer reaches its own ceiling.


Percentile values shown in charts above are computed from raw histograms in the corresponding JSON entries: log₂-subbucket-16 binning, bucket-midpoint percentile convention. Top-bucket counts and TSC drift across all 5 runs were zero. See Methodology for the rdtscp calibration path.

AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 0–7 isolated. Producer core 4 + consumer core 5 (same CCX1). Headless Ubuntu 24.04. GCC 13.3, -O3 -march=native. 5 outer runs × 1M timed samples per run, percentiles from merged histograms (tail-latency-distribution convention).

Methodology →