Same queue API. Different tail by orders of magnitude.

A market-data thread produces 16-byte ticks; a strategy thread consumes them. End-to-end enqueue-to-dequeue across three implementations of the same queue API — a hand-rolled lock-free ring, the same ring from Boost, and a std::queue protected by a mutex and condvar. Same hardware, same item, same threading topology. At 1 MHz offered load, the lock-free and mutex medians are already ~13× apart — ~130 ns vs ~1.7 µs — and diverge to ~20× at p99.9. Past saturation the gap becomes orders of magnitude and bistable.

Setup

Three variants on the same 16-byte MarketTick. We measure latency from the moment the producer stamps the item to the moment the consumer has it in hand — the full journey across the queue:

lockfree-handrolled — ~50-line ring buffer, 1024 entries, power-of-2 bitmask indexing. std::atomic head and tail, each on its own 64-byte cache line — a PaddedAtomic<T> wrapper with static_assert(alignof(PaddedAtomic<T>) == 64) makes the false-sharing cost measured in demo 2 a compile-time invariant rather than a discipline to remember. acquire/release ordering only — no seq_cst in the hot path. Non-blocking try_push/try_pop; the harness spins on false.
lockfree-boost — boost::lockfree::spsc_queue<MarketTick, capacity<1024>>, compile-time fixed size. Header-only. Exists as a sanity check on the hand-rolled implementation, not a competing entry.
mutex-condvar — std::queue (bounded to 1024 entries via a paired cv_not_full condition variable) + std::mutex + std::condition_variable. Producer waits on cv_not_full while the queue is at capacity, then locks, pushes, and calls notify_one. Consumer waits on a predicate, pops, unlocks, then timestamps — symmetric with the lock-free path.

All three variants use a 1024-slot bounded queue. The mutex variant uses a paired condvar for not-full back-pressure, matching the lock-free back-pressure semantics.

Producer pinned to core 4, consumer to core 5 — both on the same CCX of the Ryzen 7 3800X, sharing an L3. Same-CCX only; cross-CCX is deferred.

The producer is paced at 1 M items/sec for the headline measurements, well below the sustainable throughput of any variant. The lock-free variants leave the queue effectively empty at this load; the mutex variant runs about 1.7 items deep on average. Reported latency reflects per-op queue cost plus (for the mutex variant) condvar wake-up time — depth contribution is under 2 µs even in the mutex case.

The code

Mutex + condvar (bounded)

Lock-free SPSC

// Producer
enq_ts[i] = rdtscp_ordered();
{
std::unique_lock<std::mutex> lk(mtx);
cv_not_full.wait(lk, [&]{ return q.size() < N; });
q.push(tick);
}
cv_not_empty.notify_one();

// Consumer
{
std::unique_lock<std::mutex> lk(mtx);
cv_not_empty.wait(lk, [&]{ return !q.empty(); });
tick = q.front();
q.pop();
}
cv_not_full.notify_one();
deq_ts[tick.seq] = rdtscp_ordered(); // after unlock

// Producer
enq_ts[i] = rdtscp_ordered();
while (!q.try_push(tick)) { /* spin */ }

// Consumer
MarketTick t;
while (!q.try_pop(t)) { /* spin */ }
deq_ts[t.seq] = rdtscp_ordered();

// spsc_queue.h hot path:
bool try_push(const T& item) noexcept {
const size_t tail = tail_.value.load(relaxed);
const size_t head = head_.value.load(acquire);
if (tail - head >= N) return false;
buffer_[tail & MASK] = item;
tail_.value.store(tail + 1, release);
return true;
}

Both sides timestamp before the push and after the pop, outside any synchronisation primitive. The mutex side unlocks before timestamping the dequeue — symmetric with the lock-free path where the timestamp follows try_pop.

The Boost variant is one line: boost::lockfree::spsc_queue<MarketTick, capacity<1024>> q. It's measured but not shown — the hot path is structurally identical to the hand-rolled version.

Headline — tail latency at 1 MHz offered load (CCDF)

Lock-free SPSC vs mutex queue: end-to-end tail latency

Read right-to-left: a lower curve at any given latency means fewer samples are that slow or slower. The lock-free curves collapse toward the left; the mutex curve has a long tail extending right. At 1 MHz offered load the lock-free variants leave the queue effectively empty (≈0.1 items by Little's law). The mutex variant runs about 1.7 items deep on average — each item's wake-up cycle takes slightly longer than the inter-arrival period, so a small amount of residency accumulates. Even so, the latency gap is dominated by kernel-boundary cost rather than depth: queue residency contributes under 2 µs while the p99.9 difference between mutex and lock-free is ~3.6 µs.

p99.9 is the number that matters for a latency-sensitive strategy. That's the sample that arrives during a fast market move, when being slow costs the most.

Distribution shape (PDF)

Lock-free SPSC vs mutex queue: end-to-end tail latency

The lock-free distribution is sharply peaked — most items transit at nearly the same latency. The 16 KB ring buffer fits in each core's 32 KB L1d, and the inter-core handoff is serviced by the shared CCX L3 without crossing the Infinity Fabric. The mutex distribution is bimodal: a fast path when the consumer is already spinning on the condvar and wakes immediately, and a condvar-wake-up tail where the kernel becomes involved. The fast mode looks deceptively similar to lock-free; the tail is what separates them.

What happens as offered load rises

Lock-free SPSC vs mutex queue: end-to-end tail latency

Each line shows p50, p99, and p99.9 across twelve log-spaced offered-load points from 100 kHz to 50 MHz (x-axis: offered/requested load). The curve is drawn only through the sub-saturation regime, where achieved throughput tracks the offered rate and latency is a single well-defined function of load. Below saturation the story is simple and reproducible: the lock-free pair sit near their per-op floor — 120–150 ns, essentially flat across the sweep — while the mutex runs between ~0.9 and ~3 µs, highest at the lowest offered loads (the wake-up effect discussed below). Past the saturation knee the chart stops on purpose, because past saturation there is no single line to draw.

Past saturation: two equilibria, not one curve

When the offered rate exceeds what a variant can drain, the queue does not settle into one steady state. It settles into one of two, and which one a run lands in is fixed when the process starts and holds for that run. Repeating each over-saturation point 50 times (10 processes × 5 timed runs) makes the split explicit:

variant	offered	shallow p50	deep p50	processes deep / shallow
mutex-condvar	28 MHz	1.3 µs	225 µs	~5% deep
lockfree-boost	28 MHz	0.38 µs	42 µs	~40% deep
lockfree-boost	saturated	—	68 µs	~95% deep

In the shallow equilibrium the queue stays a handful of items deep: producer and consumer hand off in lock-step, the bottleneck is per-operation cost, and p50 sits near the sub-saturation floor. In the deep equilibrium the queue fills toward its 1024-entry capacity and residency becomes queue_depth × consumer_period — hundreds of microseconds for the mutex, tens for Boost. The two are separated by ~100× with nothing between them; a run is in one or the other.

The selection is per process. At moderate over-saturation load, 8–10 of every 10 processes are internally uniform — all five timed runs in the same mode — so a process picks an equilibrium in the first rounds of the producer/consumer race and stays there. Boost at 28 MHz is the clearest case: roughly 40% of processes run deep and 60% shallow at the same offered load.

This is the methodological point of the demo. The published percentile for an over-saturation point comes from one process — one draw from this distribution. Capture the benchmark once and mutex's overload p50 is 1 µs; capture it again and it is 200 µs. Neither is wrong; neither is the number. The split itself is not even fixed: changing the CPU governor from performance to schedutil flips mutex from mostly-shallow to mostly-deep. The honest statement is that past saturation these queues are bistable, and any benchmark reporting a single over-saturation latency is reporting a coin toss.

The hand-rolled variant has a third behaviour that is not one of these equilibria. Its over-saturation runs are never internally uniform — the latency flips item-to-item within a single run — because hand-rolled's deep excursions are not a queue equilibrium but an intermittent consumer stall: a tens-of-microseconds gap on the dequeue side with no matching gap on the producer side, the producer stamping items while the consumer briefly goes dark. On a pinned, spinning, isolated core that is an external event — a cross-core IPI or an SMI — and it does not care about offered load. It surfaces as a tail spike only near saturation, where the producer is flat-out and the stalled interval backs up hundreds of items before the consumer resumes; below ~9 MHz the same stall clears before it matters. Across repeated runs it appears in roughly half of them, landing at whatever sweep point happens to catch it — which is why one capture showed it at one offered load and another showed it elsewhere. It is real, environmental, and not a property of any rate.

One wrinkle visible on the left of the chart: the mutex p99.9 curve has a hump at very low offered load — its tail is worse at 100 kHz than at 1 MHz. This is the consumer sleeping deeper between items. Longer idle periods let the kernel deschedule the consumer thread further (and on Zen 2 with performance governor, even let it slide into deeper C-states), so wake-up cost on the next item is larger. It's a known property of condvar-based queues and an additional argument against them for sporadic workloads.

Throughput

Saturated throughput (ops/sec, producer flat-out — higher is better)

Saturated throughput, producer flat-out. The lock-free pair cap in the same neighbourhood — hand-rolled ~14–16 M/s, Boost ~15–18 M/s — but both figures are mode-dependent draws from the same bistability as the latency: a process that settles deep and one that settles shallow report different achieved rates, so a single saturated-throughput number carries the same coin-toss caveat as the over-saturation latency. Mutex caps lower, ~4–5 M/s, bound by condvar wake-up cost.

A throughput-only view misleads in both directions. It makes the lock-free pair look interchangeable — they are not; their tails differ by ~100× and their overload behaviour is bistable — and it undersells the mutex gap, which is a small multiple in rate but orders of magnitude in tail.

The lock-free win is about determinism, not raw rate.

Why the tail collapses

At low offered load, the mutex cold path fires the kernel when the consumer must wait. When cv.wait blocks:

A futex syscall from notify_one — kernel entry and return.
A scheduler decision: when does the consumer thread actually run?
Context restoration on the consumer core.
Cache cold effects if the consumer has been executing other work.

None of that happens on the lock-free path. The consumer spins until it sees the producer's tail_ store, then loads the item. On Zen 2, with both cores on the same CCX sharing an L3, that cache-coherence round-trip plus the buffer load and store completes in roughly 120 ns — the floor visible in the p50 numbers above.

At high offered load the tail is the bistable picture above: a run that settles into the deep equilibrium has latency queue_depth × consumer_period — the hundreds-of-microseconds residency — while a run that settles shallow stays near the per-op floor. A hockey-stick in any single sweep is whichever equilibrium that capture's process happened to occupy, not a smooth function of load.

Boost comparison

boost::lockfree::spsc_queue is a production-quality implementation with the same structural properties as the hand-rolled version: power-of-2 capacity, acquire/release atomics, cache-line-aware layout. Under 1 MHz paced load — the queue stays effectively empty, so latency is dominated by per-op cost rather than drain rate — the hand-rolled implementation comes in at 122 ns p50 against Boost's 140 ns, and 164 ns vs 172 ns at p99. At p99.9 the two are identical at 180 ns. These percentiles are quantised to the histogram's log-spaced buckets (4–8 ns wide in this range), so the p99 gap is a single bucket and the p50 gap two to three — a small, consistent edge to the hand-rolled version, not a decisive one, and below what the load-sweep chart can resolve on a log axis. That's the credibility check: at this load the 50-line hand-rolled queue matches a battle-tested production library across the distribution, edging it at the median and tying it in the tail. Neither leaves anything obvious on the table, and neither is doing anything clever in its hot path that the other misses.

What this doesn't show

Cross-CCX: producer and consumer on separate CCX slices. Cache-coherence traffic crosses the Infinity Fabric; latency increases. Deferred to a future topology post.
MPSC / MPMC: single-producer single-consumer only.
Over-saturation as a single curve: past the saturation knee the per-point behaviour is bistable (see above), so the over-saturation region is characterised by its mode distribution, not one interpolated curve. Sub-saturation behaviour between sweep points is interpolated, not measured.
Fixed item size: 16-byte MarketTick. Queue behaviour may differ with larger items (cache-line straddling, prefetch patterns).
The spin-check-without-condvar mutex variant: separating wake-up cost from lock cost is a different experiment and a different post.

Takeaway

The lock-free SPSC ring buffer wins from the median outward. At 1 MHz offered load the median is already ~13× apart — lock-free ~130 ns, mutex ~1.7 µs — driven entirely by kernel-boundary cost: the mutex carries only ~1.7 items on average, so queue depth contributes under 2 µs. At p99.9 the gap is ~20×: 180 ns vs ~3.8 µs. The mutex fast path is fast — its cold path involves a syscall and a scheduler decision, and those outliers appear precisely when the market is moving fastest. The load-sweep chart shows where that breakdown begins — and past saturation it shows something subtler: the queue is bistable, settling deep or shallow per run, so a single overload number is a coin toss. The reproducible result is the sub-saturation distribution: at moderate load the lock-free tail runs ~20× tighter than the mutex's, every run.

Percentile values shown in charts above are computed from raw histograms in the corresponding JSON entries: log₂-subbucket-16 binning, bucket-midpoint percentile convention. Zero top-bucket counts; TSC drift ≤0.0001% across all 5 runs. See Methodology for the rdtscp calibration path.

AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 1–7 isolated (cpu0 cannot be kernel-isolated and carries housekeeping). Producer core 4 + consumer core 5 (same CCX1). Headless Ubuntu 24.04. GCC 13.3, -O3 -march=native. 5 outer runs × 1M timed samples per run, percentiles from merged histograms (tail-latency-distribution convention).

Methodology →

Setup#

The code#

Headline — tail latency at 1 MHz offered load (CCDF)#

Distribution shape (PDF)#

What happens as offered load rises#

Past saturation: two equilibria, not one curve#

Throughput#

Why the tail collapses#

Boost comparison#

What this doesn't show#

Takeaway#

Setup

The code

Headline — tail latency at 1 MHz offered load (CCDF)

Distribution shape (PDF)

What happens as offered load rises

Past saturation: two equilibria, not one curve

Throughput

Why the tail collapses

Boost comparison

What this doesn't show

Takeaway