2026-05-21
Allocators: cross-thread order pipeline
Cross-thread Order lifecycle benchmarked across three allocator strategies — malloc, freelist with return queue, and arena with batch handoff. The result that survives contact with the data isn't the one the design discussion would lead you to expect.
A market-data thread hands off 64-byte Orders to a risk thread. The risk thread frees them after checking positions, limits, and velocity. This is the pattern every real trading system uses, and it is the one that breaks naïve thread-local pool allocators (those whose pool can only return memory on the allocating thread) — the free happens on a different thread from the alloc.
The thesis
In a cross-thread trading pipeline, the allocator design is a derivative of the threading model. Malloc's median is fine; its tail drifts under background heap pressure — other subsystems sharing the heap, fragmentation accumulating, arena locks contending. The size of that drift depends on topology: modest (1.1–1.3× p99.9) when producer and consumer share a CCX, substantial (2.4×) when they don't.
This post measures three strategies for the cross-thread free pattern: the baseline that most real systems actually ship, and two honest domain-specific designs. The result that survives contact with the data isn't the one the design discussion would lead you to expect: the bump-pointer arena's theoretical fast-path advantage doesn't materialise against the freelist's amortised return queue, and the gap between the pool variants and new/delete is much smaller on same-CCX than the design framing implies.
Setup
The Order struct
struct alignas(64) Order {
    uint64_t ts_create_tsc; // TSC at producer construction
    uint64_t seq;
    int64_t price;
    int32_t qty;
    uint32_t client_id;
    uint32_t symbol_id;
    uint32_t risk_seq;
    Side side;
    uint8_t arena_idx; // used by variant 3; zero otherwise
    uint8_t _pad[18];
};
static_assert(sizeof(Order) == 64);
static_assert(alignof(Order) == 64);
Fixed 64 B, exactly one cache line. The size is representative of an internal order representation; production orders carry variable-length FIX fields, but the allocator story is independent of payload contents.
Three variants
cross-thread-malloc — baseline. new Order on the producer, delete on the consumer. No domain-specific allocator. Every real system starts here.
freelist-return-queue — producer keeps a thread-local freelist seeded from a pre-allocated slab of 4096 slots. Consumer pushes freed Orders back to the producer via a second SPSC queue (C→P, depth 1024). Producer drains the return queue in batches of 32 when its local freelist runs empty. The "magazine" pattern, simplified for the 1P/1C case.
arena-batch-handoff — producer bump-allocates from one of four rotating arenas (4096 slots each, 16K total in-flight capacity). Consumer increments a per-arena drain counter. Producer reuses an arena only when consumer_pos >= producer_pos — the consumer has acknowledged every order from the arena's last fill. No freelist, no return queue: the producer's fast path is one bounds check and one pointer increment.
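As a rough illustration of the reuse rule for arena-batch-handoff, here is a minimal sketch. It is not the benchmark's code: `RotatingArenas`, `Slot` and `release` are hypothetical names, `Slot` stands in for `Order`, and returning `nullptr` when the next arena has not been fully acknowledged is an assumption, since the post does not say what the producer does in that case.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the 64-byte Order; only arena_idx matters here.
struct alignas(64) Slot { std::uint8_t arena_idx; };

template <std::size_t kArenas = 4, std::size_t kSlots = 4096>
class RotatingArenas {
    struct Arena {
        std::array<Slot, kSlots> slots;
        std::atomic<std::uint64_t> consumer_pos{0};  // cumulative frees, written by consumer
        std::uint64_t producer_pos = 0;              // cumulative allocs, producer only
    };
    std::array<Arena, kArenas> arenas_;
    std::size_t current_ = 0;
    std::size_t orders_in_current_ = 0;

public:
    // Producer thread only. Returns nullptr if the next arena is not drained yet
    // (assumed policy; the caller decides whether to spin or back off).
    Slot* allocate() {
        if (orders_in_current_ == kSlots) [[unlikely]] {
            const std::size_t next = (current_ + 1) % kArenas;
            Arena& n = arenas_[next];
            // Reuse only when consumer_pos >= producer_pos for that arena.
            if (n.consumer_pos.load(std::memory_order_acquire) < n.producer_pos)
                return nullptr;
            current_ = next;
            orders_in_current_ = 0;
        }
        Arena& a = arenas_[current_];
        Slot* s = &a.slots[orders_in_current_++];
        s->arena_idx = static_cast<std::uint8_t>(current_);
        ++a.producer_pos;
        return s;
    }

    // Consumer thread only: acknowledge one order by bumping the drain counter.
    void release(const Slot* s) {
        assert(s != nullptr && s->arena_idx < kArenas);
        arenas_[s->arena_idx].consumer_pos.fetch_add(1, std::memory_order_release);
    }
};
```

The counters are cumulative rather than reset on each fill, so the comparison stays valid across rotations without the producer ever writing to the consumer's counter.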
All three use the same SPSC queue primitive from Demo 4 for the main forward path (producer → consumer). freelist-return-queue also uses it for the return path (consumer → producer).
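To make the return path concrete, here is a minimal sketch of the freelist-return-queue shape. Everything in it is illustrative: `SpscRing` is a hypothetical stand-in for the Demo 4 queue (whose code is not shown in this post), `Slot` stands in for `Order`, and `FreelistPool`, `allocate` and `release` are made-up names. The template defaults mirror the slab size, return-queue depth and batch size quoted above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal bounded SPSC ring; a stand-in, not the Demo 4 primitive itself.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "capacity must be a power of two");
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // next index to pop, written by the popping side
    std::atomic<std::size_t> tail_{0};  // next index to push, written by the pushing side
public:
    bool push(T v) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};

struct alignas(64) Slot { unsigned char bytes[64]; };  // placeholder for Order

template <std::size_t kSlab = 4096, std::size_t kReturnDepth = 1024, std::size_t kBatch = 32>
class FreelistPool {
    std::vector<Slot> slab_;                 // pre-allocated storage
    std::vector<Slot*> free_;                // producer-local freelist
    SpscRing<Slot*, kReturnDepth> returns_;  // consumer -> producer return path
public:
    FreelistPool() : slab_(kSlab) {
        free_.reserve(kSlab);
        for (Slot& s : slab_) free_.push_back(&s);
    }
    // Producer thread only. Drains up to kBatch returned slots when the local
    // freelist is empty; returns nullptr if nothing has come back yet.
    Slot* allocate() {
        if (free_.empty()) {
            Slot* p = nullptr;
            for (std::size_t i = 0; i < kBatch && returns_.pop(p); ++i) free_.push_back(p);
            if (free_.empty()) return nullptr;
        }
        Slot* s = free_.back();
        free_.pop_back();
        return s;
    }
    // Consumer thread only. False means the return queue is full; caller retries.
    bool release(Slot* s) {
        assert(s != nullptr);
        return returns_.push(s);
    }
};
```

The point of the shape is that the consumer never touches the freelist itself: the only shared state is the ring's two indices, and the producer pays for the drain once per batch rather than once per order.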
Threading model
| Thread | Core | CCX |
|---|---|---|
| Producer (T_p) | 4 | CCX1 |
| Consumer (T_c) | 5 | CCX1 |
| Background heap (T_bg) | 6 | CCX1 |
T_p and T_c are on the same CCX, sharing an L3 — matching Demo 4's topology so queue-crossing costs are comparable. T_bg is also on CCX1 so it shares the L3 cache, which is the realistic "other subsystem on the same NUMA domain" model.
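For readers who want to reproduce the placement, a minimal pinning sketch on Linux might look like the following. The helper name `pin_current_thread` is hypothetical, and the mapping of cores 4 to 6 onto CCX1 is specific to the machine described in this post; check your own topology before reusing the numbers.

```cpp
#include <cassert>
#include <sched.h>

// Pin the calling thread to a single logical CPU (Linux/glibc only).
// Returns true on success, false if the kernel rejects the mask,
// for example because the CPU is outside the process's allowed set.
bool pin_current_thread(int core) {
    assert(core >= 0 && core < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 means the calling thread
}
```

Each benchmark thread would call this once at start-up, before any timed work, with its own core number.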
The simulated risk check
The consumer does real work: three table lookups (positions by symbol, limits by client, velocity by client), position delta arithmetic, limit checks, and a velocity-window update. The tables total about 34 KB, which is slightly more than the 32 KB L1d of a Zen 2 core, so treat them as L1d/L2-resident rather than strictly L1d-resident. The calibrated consumer work weight is approximately 200 ns wall-clock: enough to make allocator differences visible in the latency distribution without making the consumer the dominant bottleneck.
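The shape of that work can be sketched as below. This is an illustration under stated assumptions, not the benchmark's risk check: the table sizes, field names, the fixed-window velocity scheme and the `risk_check` signature are all invented here, and the sketch makes no attempt to hit the calibrated 200 ns.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// All sizes and field names are illustrative, not the benchmark's.
struct RiskTables {
    static constexpr std::size_t kSymbols = 1024;
    static constexpr std::size_t kClients = 256;
    std::array<std::int64_t, kSymbols> position{};        // net position by symbol
    std::array<std::int64_t, kClients> position_limit{};  // max absolute position by client
    std::array<std::uint32_t, kClients> window_count{};   // orders accepted in current window
    std::array<std::uint64_t, kClients> window_start{};   // timestamp the window opened
    std::uint32_t max_per_window = 100;
    std::uint64_t window_len = 1'000'000;                 // in timestamp units
};

// signed_qty is positive for buys and negative for sells.
// Returns true if the order passes and the tables were updated.
bool risk_check(RiskTables& t, std::uint32_t symbol_id, std::uint32_t client_id,
                std::int64_t signed_qty, std::uint64_t now) {
    assert(symbol_id < RiskTables::kSymbols && client_id < RiskTables::kClients);

    // Velocity window: start a fresh window once the old one has elapsed.
    if (now - t.window_start[client_id] >= t.window_len) {
        t.window_start[client_id] = now;
        t.window_count[client_id] = 0;
    }
    if (t.window_count[client_id] >= t.max_per_window) return false;

    // Position delta arithmetic followed by the limit check.
    const std::int64_t next = t.position[symbol_id] + signed_qty;
    const std::int64_t magnitude = next < 0 ? -next : next;
    if (magnitude > t.position_limit[client_id]) return false;

    t.position[symbol_id] = next;
    ++t.window_count[client_id];
    return true;
}
```

A rejected order leaves every table untouched, which keeps the check idempotent if the caller retries.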
Background heap pressure
T_bg runs a tight loop of mixed-size malloc/free calls against six size classes (32–1024 B), prefilling 8192 live allocations at thread start to create fragmentation from t=0. The default for headline measurements is 1 M ops/sec of churn with bg-threads=1. Both values come from the calibration ladder in bench/calibration-notes/README.md, which also documents the decision to lock them in.
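A minimal sketch of that churn loop follows. It is an approximation under assumptions: the six power-of-two size classes, the xorshift generator and the `churn` function are invented for illustration, and rate pacing (the 1 M ops/sec throttle) is left out so the function stays deterministic and testable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Six size classes spanning 32 to 1024 B; the exact classes are an assumption.
constexpr std::size_t kSizes[] = {32, 64, 128, 256, 512, 1024};
constexpr std::size_t kNumSizes = sizeof(kSizes) / sizeof(kSizes[0]);

// Small xorshift generator; seed must be non-zero.
struct XorShift64 {
    std::uint64_t s;
    std::uint64_t next() { s ^= s << 13; s ^= s >> 7; s ^= s << 17; return s; }
};

// Prefill live_count allocations, then run `ops` replace operations: free one
// random live block and allocate a new block of a random size class.
// Frees everything before returning; the result is how many blocks were live
// (non-null) at the end of the churn phase.
std::size_t churn(std::size_t live_count, std::size_t ops, std::uint64_t seed = 1) {
    assert(live_count > 0 && seed != 0);
    XorShift64 rng{seed};
    std::vector<void*> live(live_count);
    for (void*& p : live) p = std::malloc(kSizes[rng.next() % kNumSizes]);
    for (std::size_t i = 0; i < ops; ++i) {
        const std::size_t slot = rng.next() % live_count;
        std::free(live[slot]);
        live[slot] = std::malloc(kSizes[rng.next() % kNumSizes]);
    }
    std::size_t still_live = 0;
    for (void* p : live) {
        if (p != nullptr) ++still_live;
        std::free(p);
    }
    return still_live;
}
```

Replacing a random live block with a block of a different size class is what keeps the heap fragmented: freed holes rarely match the next request exactly.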
Headline latency (CCDF)
All three variants run at 1 MHz offered load with 1 M/s background heap pressure, 5 × 1M items per variant. Histograms merged across iterations.
On the CCDF, a lower curve at any given latency means fewer samples are that slow or slower. At the median, malloc and freelist tie at 172 ns; arena sits 32 ns slower at 204 ns. Moving right toward p99 and p99.9, the picture sorts cleanly: freelist holds its lead (p99 = 220 ns, p99.9 = 312 ns), arena trails by 24 ns at p99 and 32 ns at p99.9 (244 ns and 344 ns), and malloc's tail stretches out under background pressure (296 ns and 392 ns), about a 1.35× gap to the freelist at p99 and 1.26× at p99.9.
The numbers are modest in absolute terms because both producer and consumer share a CCX; the cross-CCX section near the end of the post shows what happens to malloc's tail when the queue traversal stops being free. The one statistic where malloc separates dramatically even on same-CCX is its max: 46,710 ns versus the pool variants' ~10,000–15,000 ns. That's a single sample out of five million, so don't read it as a robust property, but it is the only same-CCX statistic where the design-discussion framing materialises.
The bump-pointer that doesn't win
This is the result that contradicts the design discussion above. The arena's steady-state allocate(), in 99.97% of calls, is:
Order* o = &arenas_[current_].slots[orders_in_current_++];
o->arena_idx = static_cast<uint8_t>(current_);
return o;
Two instructions on the hot path: one indexed load to compute the slot address, one store to set arena_idx. The [[unlikely]] rotation branch fires roughly every 4096 orders. Against that, the freelist's hot path is a vector pop (load size, decrement, load tail element) plus an amortised return-queue drain (one branch on size, batch-of-32 every ~4 ms at this rate). On any theoretical instruction count, the arena should be at least as fast.
It isn't. The freelist wins at every percentile by 24–32 ns at the headline measurement, and its max (10,860 ns) beats the arena's max (14,780 ns) too. Three plausible mechanisms, none of them definitive without targeted microbenchmarks the post doesn't include:
- The per-Order arena_idx write is a store the freelist doesn't do. It goes into the store buffer, doesn't block the return, but it does occupy a store-buffer slot during a phase when the producer is also issuing the SPSC queue's producer_pos.store(release) for the same order. Two stores in flight versus one is exactly the kind of small overhead that shows up at every percentile.
- The arena's rotation branch isn't free even when it doesn't fire. It's a compare-and-branch on orders_in_current_ < 4096 every allocation; the freelist's analogous "is the local vector empty?" check is one register comparison the compiler can hoist out of the inner loop body. Branch prediction handles both, but predicted branches still consume issue slots.
- The arena's slot address has a longer dependency chain. &arenas_[current_].slots[orders_in_current_++] requires two dependent index loads against the freelist's single pointer-load from the back of a contiguous vector. At p50 this matters less; at p99 and p99.9 it stacks with whatever else is going on in the pipeline.
The arena's design has one property the freelist's doesn't: its p99.9 is flat across the entire pressure sweep at 344 ns — every single one of the nine sweep points lands on that number, including the zero-pressure baseline. That kind of variance suppression is sometimes the property a tail-latency SLA actually wants. If predictability matters more than absolute magnitude, the arena's stability is worth a separate look. But on the headline question — which variant has the lowest tail — the freelist wins.
Throughput
At 1 MHz paced load the throughput numbers reflect how faithfully each variant sustains the offered rate over 5M items. All three variants sustain close to 1 M/s; the freelist variant shows slightly tighter variance than malloc or arena because its amortised return-queue drain smooths over per-allocation cost spikes. Throughput differences are secondary to latency at this load level.
Background pressure sweep
The sweep runs each variant across 9 background-pressure levels: a no-T_bg baseline (faint dashed horizontal reference lines) plus 8 log-spaced levels from 100 k/s to 10 M/s. Producer paced at 1 MHz throughout; 1M items per sweep point.
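The eight pressure levels can be reproduced with a few lines. This is a sketch, with `log_spaced` as a hypothetical helper name; with endpoints of 100 k/s and 10 M/s and eight points, the third and sixth levels come out at roughly 372.8 k/s and 2.68 M/s, which matches the sweep points named in this section.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// n values from lo to hi inclusive, evenly spaced on a log scale.
std::vector<double> log_spaced(double lo, double hi, std::size_t n) {
    assert(n >= 2 && lo > 0.0 && hi > lo);
    std::vector<double> out(n);
    const double step = std::log(hi / lo) / static_cast<double>(n - 1);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = lo * std::exp(step * static_cast<double>(i));
    }
    return out;
}
```

Calling `log_spaced(1e5, 1e7, 8)` gives a constant ratio of about 1.93 between neighbouring levels.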
Three different shapes in this chart.
Arena is flat. Every sweep point, including the zero-pressure baseline, lands at p99.9 = 344 ns. The arena's tail is decoupled from background heap pressure entirely. The bound is set by the rotation infrastructure cost, not by anything the background thread is doing.
Freelist is nearly flat. It hovers around p99.9 = 312 ns across most of the sweep, dropping slightly to 296 at moderate pressure and rising to 328 at the highest pressure level. The return-queue's batched drain absorbs background-induced variance the same way the arena's rotation does, with slightly less perfect suppression.
Malloc is non-monotonic. Its p99.9 rises from 328 (no pressure) to a peak of 424 at 372 k/s, then declines as pressure increases further — back to 344 at 2.68 M/s and stable from there. This isn't a measurement artefact; it's a cache-locality effect: at high churn rates, malloc's recently-freed blocks get re-touched fast enough to stay warm in L1/L2, and the fragmentation tail that exposes itself at moderate rates gets masked. The worst case for malloc isn't peak background activity. It's the kind of background activity a real system would happily run at.
The faint horizontal reference lines show each variant's p99.9 with no background pressure at all. Arena's reference is on top of its sweep line — no separation. Freelist's reference (312) and its sweep line are within ±16 ns at every point. Only malloc's sweep diverges materially from its no-pressure reference, and only in a specific pressure regime.
Cross-CCX side note
The headline runs place producer and consumer on the same CCX (cores 4 and 5, sharing L3 on CCX1). In a real system, producer and consumer are sometimes on different CCX slices — the queue crossing traverses the Infinity Fabric rather than the shared L3. This case isn't the headline because most tightly-coupled trading pipelines run same-CCX by design; the side experiment shows what happens when you don't get that choice.
Cross-CCX configuration: producer on core 4 (CCX1), consumer on core 1 (CCX0), T_bg on core 6 (CCX1). The p50 floor rises to 408 ns across all three variants — the Infinity Fabric round-trip is the new baseline, identical for everyone.
The tail picture is where the variants part. The freelist and arena both top out at p99.9 = 720 ns — within sample-noise of each other, both about 2.1–2.3× their same-CCX baselines. Malloc reaches p99 = 1120 ns and p99.9 = 1760 ns. That's 1.63× the pool variants at p99 and 2.44× at p99.9 in absolute terms, and a 4.5× expansion of malloc's same-CCX baseline (392 → 1760) versus the pools' 2.1–2.3× expansion. The cross-CCX environment amplifies malloc's allocator-overhead tail disproportionately — the lock-contention paths and arena coordination that malloc has to do internally pay an extra Infinity-Fabric round-trip every time they cross between threads on different L3 domains.
Single-sample max values cross-CCX don't track the percentile ordering: freelist 7,200, malloc 10,460, arena 12,200. Same caveat as the same-CCX headline — one sample out of five million doesn't carry a robust ordering, and these numbers reflect where an interrupt or scheduling event happened to land rather than a structural property of the variant.
This is the gap the design discussion at the top was implicitly pointing at. It happens to live in the cross-CCX corner, not the headline.
What this doesn't show
- No jemalloc / mimalloc / tcmalloc comparison. Drop-in allocator replacement (via LD_PRELOAD or a tcmalloc CMake option) is a separate future post scoped to "standard library vs drop-in general allocator." This post is scoped to "standard library vs domain-specific."
- Strictly 1P/1C. Multi-producer or multi-consumer patterns break the SPSC queue contract and require different allocator designs. Not measured here.
- Fixed 64 B Order size. Production orders carry variable-length FIX fields, optional tags, and sometimes strategy-level metadata. Larger or variable-size orders change the bump-pointer fill-rate, pool sizing, and fragmentation dynamics. That's a separate post.
- No variable-length payload. Following from the above — the allocator story for variable-size objects (slab per size class, buddy allocator, region-based) is different from the 64 B fixed-size case.
- No NUMA crossing. The reference machine is single-NUMA-node. Cross-NUMA Order pipelines introduce remote-memory latency into every alloc/free on the home node of the other thread.
Takeaway
For cross-thread Order lifetimes on this CPU, the freelist-return-queue variant wins at every percentile measured: matching malloc's median and beating both alternatives at p99 and p99.9. The arena variant's theoretical advantage — a two-instruction bump-pointer fast path — doesn't materialise against a freelist whose pop is similarly cheap and whose return-queue drain amortises into 32-order batches.
What the arena gives up in absolute latency, it returns in predictability: its p99.9 is dead-flat at 344 ns across all nine pressure-sweep points, no-pressure baseline included. If the latency SLA is written in terms of jitter or worst-case-under-any-load rather than absolute tail magnitude, that flat line is a real property and worth a separate weighing against the freelist's 32-ns p99.9 advantage.
new/delete is the only variant whose tail responds to background heap pressure, and the response isn't where intuition puts it — moderate pressure (a few hundred k/s background allocations) hurts malloc more than peak pressure does, because at peak churn the recently-freed blocks stay warm in cache. The same-CCX gap is real but modest (~1.26× p99.9). Cross the CCX boundary and the gap opens to 2.44× at p99.9 and 4.5× as an expansion of malloc's own same-CCX baseline, with malloc's tail compounding the Infinity-Fabric round-trip with its own allocator coordination overhead. If your producer and consumer don't share an L3, the case for replacing new/delete is much sharper than the headline numbers suggest.
Percentile values shown in charts above are computed from raw histograms in the corresponding JSON entries: log₂-subbucket-16 binning, bucket-midpoint percentile convention. See Methodology for the rdtscp calibration path.
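For readers unfamiliar with the convention, here is a minimal sketch of a log₂ histogram with 16 sub-buckets per power of two and bucket-midpoint percentiles. It is a reconstruction from the description above, not the benchmark's histogram: the class name, the exact-bin handling of values below 16, and the use of the GCC/Clang builtin `__builtin_clzll` are assumptions. Under this layout, reported values such as 172, 296 and 1760 are bucket midpoints, which is why the gaps quoted in this post come in steps of 8, 16 or more.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

class Log2Histogram {
    static constexpr unsigned kSubBits = 4;  // 2^4 = 16 sub-buckets per power of two
    static constexpr std::uint64_t kSub = std::uint64_t{1} << kSubBits;
    std::vector<std::uint64_t> counts_;
    std::uint64_t total_ = 0;

    static std::size_t index_of(std::uint64_t v) {
        if (v < kSub) return static_cast<std::size_t>(v);  // exact bins below 16 (assumption)
        const unsigned msb = 63u - static_cast<unsigned>(__builtin_clzll(v));
        const unsigned shift = msb - kSubBits;              // log2 of the sub-bucket width
        return static_cast<std::size_t>(((shift + 1u) << kSubBits) +
                                        ((v >> shift) & (kSub - 1)));
    }
    static std::uint64_t midpoint_of(std::size_t idx) {
        if (idx < kSub) return idx;
        const unsigned shift = static_cast<unsigned>(idx >> kSubBits) - 1u;
        const std::uint64_t lo = (kSub + (idx & (kSub - 1))) << shift;
        return lo + ((std::uint64_t{1} << shift) >> 1);     // lower edge plus half the width
    }

public:
    Log2Histogram() : counts_(64 * kSub) {}

    void record(std::uint64_t v) { ++counts_[index_of(v)]; ++total_; }

    // Merging is a bucket-wise sum, which is how per-run histograms combine.
    void merge(const Log2Histogram& other) {
        for (std::size_t i = 0; i < counts_.size(); ++i) counts_[i] += other.counts_[i];
        total_ += other.total_;
    }

    // Midpoint of the first bucket whose cumulative count reaches ceil(q * total).
    std::uint64_t percentile(double q) const {
        assert(total_ > 0 && q > 0.0 && q <= 1.0);
        const auto target =
            static_cast<std::uint64_t>(std::ceil(q * static_cast<double>(total_)));
        std::uint64_t seen = 0;
        for (std::size_t i = 0; i < counts_.size(); ++i) {
            seen += counts_[i];
            if (seen >= target) return midpoint_of(i);
        }
        return midpoint_of(counts_.size() - 1);  // not reached when counts are consistent
    }
};
```

Because merging is a plain sum of counts, percentiles taken from the merged histogram weight every sample equally, unlike an average of per-run percentiles.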
AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 0–7 isolated (core 0 carries unavoidable kernel housekeeping; benchmarks pinned to 4–7). Producer on core 4, consumer on core 5, T_bg on core 6 (all CCX1). Headless Ubuntu 24.04. GCC 13.3, -O3 -march=native -std=c++20. 5 outer runs × 1M timed samples per run, percentiles from merged histograms (tail-latency-distribution convention).