Allocators: cross-thread order pipeline

A market-data thread hands off 64-byte Orders to a risk thread. The risk thread frees them after checking positions, limits, and velocity. This is the pattern every real trading system uses, and it is the one that breaks naïve thread-local pool allocators (those whose pool can only return memory on the allocating thread) — the free happens on a different thread from the alloc.

The thesis

In a cross-thread trading pipeline, the allocator design is a derivative of the threading model. Malloc's median is fine; its tail drifts under background heap pressure — other subsystems sharing the heap, fragmentation accumulating, arena locks contending. The size of that drift depends on topology: modest (~1.15–1.3× p99.9) when producer and consumer share a CCX, substantial (~2.3–2.5×) when they don't.

This post measures three strategies for the cross-thread free pattern: the baseline that most real systems actually ship, and two honest domain-specific designs. The result that survives contact with the data isn't the one the design discussion would lead you to expect: the bump-pointer arena's theoretical fast-path advantage doesn't materialise against the freelist's amortised return queue, and the gap between the pool variants and new/delete is much smaller on same-CCX than the design framing implies.

Setup

The Order struct

struct alignas(64) Order {
    uint64_t ts_create_tsc;  // TSC at producer construction
    uint64_t seq;
    int64_t  price;
    int32_t  qty;
    uint32_t client_id;
    uint32_t symbol_id;
    uint32_t risk_seq;
    Side     side;
    uint8_t  arena_idx;      // used by variant 3; zero otherwise
    uint8_t  _pad[18];
};
static_assert(sizeof(Order)  == 64);
static_assert(alignof(Order) == 64);

Fixed 64 B, exactly one cache line. The size is representative of an internal order representation — production orders carry variable-length FIX fields, but the allocator story is independent of payload contents.

Three variants

cross-thread-malloc — baseline. new Order on the producer, delete on the consumer. No domain-specific allocator. Every real system starts here.

freelist-return-queue — producer keeps a thread-local freelist seeded from a pre-allocated slab of 4096 slots. Consumer pushes freed Orders back to the producer via a second SPSC queue (C→P, depth 1024). Producer drains the return queue in batches of 32 when its local freelist runs empty. The "magazine" pattern, simplified for the 1P/1C case.

arena-batch-handoff — producer bump-allocates from one of four rotating arenas (4096 slots each, 16K total in-flight capacity). Consumer increments a per-arena drain counter. Producer reuses an arena only when consumer_pos >= producer_pos — the consumer has acknowledged every order from the arena's last fill. No freelist, no return queue: the producer's fast path is one bounds check and one pointer increment.

All three use the same SPSC queue primitive from Demo 4 for the main forward path (producer → consumer). freelist-return-queue also uses it for the return path (consumer → producer).

Threading model

Thread	Core	CCX
Producer (T_p)	4	CCX1
Consumer (T_c)	5	CCX1
Background heap (T_bg)	6	CCX1

T_p and T_c are on the same CCX, sharing an L3 — matching Demo 4's topology so queue-crossing costs are comparable. T_bg is also on CCX1 so it shares the L3 cache, which is the realistic "other subsystem on the same NUMA domain" model.

The simulated risk check

The consumer does real work: three table lookups (positions by symbol, limits by client, velocity by client), position delta arithmetic, limit checks, and a velocity-window update. All tables total ~34 KB — sized to sit within L1d plus a slice of L2 rather than strictly inside the 32 KB L1d. The calibrated consumer work weight is approximately 200 ns wall-clock — enough to make allocator differences visible in the latency distribution without making the consumer the dominant bottleneck.

Background heap pressure

T_bg runs a tight loop of mixed-size malloc/free calls against six size classes (32–1024 B), prefilling 8192 live allocations at thread start to create fragmentation from t=0. The default rate for headline measurements is 1 M ops/sec of churn, with bg-threads=1. Both values come from the calibration ladder in bench/calibration-notes/README.md; the lock decision is documented there.

Headline latency (CCDF)

Allocators: cross-thread order pipeline

All three variants run at 1 MHz offered load with 1 M/s background heap pressure, 5 × 1M items per variant. Histograms merged across iterations.

The CCDF reads right-to-left: a lower curve at any latency means fewer samples are that slow or slower. At the median the three variants sit within two histogram buckets of each other — 172, 188, and 204 ns in this capture — and that ordering should not be trusted: a re-capture of the same code after a clean rebuild reordered them (the May capture put malloc at 172 and arena at 204; this one puts arena at 172 and malloc at 204). Differences of one or two log-spaced buckets at the median are below what this harness resolves across rebuilds. The tail is where the signal is, and it sorts the same way in every capture: the two pool variants travel together (p99 = 220, p99.9 = 328 here, within one bucket of each other in every dataset; the pressure-sweep section's separately-captured baseline puts the same pools at 344 — one log₂-subbucket-16 bucket up, i.e. within resolution), while malloc's tail stretches under background pressure to p99 = 312 and p99.9 = 376 — about 1.4× the pools at p99 and 1.15× at p99.9 in this capture.

The numbers are modest in absolute terms because both producer and consumer share a CCX; the cross-CCX section near the end of the post shows what happens to malloc's tail when the queue traversal stops being free. One cautionary footnote on single-sample statistics: the May capture's headline featured a dramatic malloc max of 46,710 ns against the pools' 10–15 µs. In this capture all three maxes land between 14 and 16 µs and malloc's is unremarkable. A max is one sample out of five million; it carried no structural information then and carries none now.

The ordering the harness can't resolve

The May capture of this benchmark put the arena last at the median, 32 ns behind malloc and the freelist, and this section originally explained why a two-instruction bump pointer was losing. The arena's steady-state allocate(), in 99.97% of calls, is:

Order* o = &arenas_[current_].slots[orders_in_current_++];
o->arena_idx = static_cast<uint8_t>(current_);
return o;

Two instructions on the hot path; the [[unlikely]] rotation branch fires roughly every 4096 orders. The freelist's hot path is a vector pop plus an amortised return-queue drain. On instruction count the arena should be at least as fast — and in the May capture it wasn't, by a consistent 24–32 ns at every percentile.

Then the recapture inverted it. Same code, same kernel, same machine state, a clean rebuild and a reboot apart: arena 172 ns at p50, malloc 204 — the May ordering exactly swapped, with the freelist's tail converging onto the arena's (both 328 ns at p99.9). Three mechanisms were drafted to explain the arena's loss; all three were plausible; none of them was the cause, because there was no stable effect to cause. Orderings at the ±30 ns scale on a ~200 ns pipeline are hostage to binary layout, page placement, and alignment luck that a rebuild reshuffles.

That is the honest result of the pool-vs-pool comparison: within this harness's resolution, the freelist and the arena are the same speed. Any benchmark that hands you a winner at this scale from a single build is handing you a coin toss — the mechanism-shaped explanations write themselves either way, which is exactly why they should be distrusted without a stable effect to explain.

What does separate the designs is variance, not magnitude: the arena's p99.9 is flat at 344 ns across the entire pressure sweep — all nine points, zero-pressure baseline included, in both captures. That reproducibility is the strongest result in this post. If a tail-latency SLA is written in jitter terms, the arena's flat line is a real, capture-stable property.

Throughput

Paced throughput at 1 MHz offered load (ops/sec — higher is better)

At 1 MHz paced load the throughput numbers reflect how faithfully each variant sustains the offered rate over 5M items. All three variants sustain close to 1 M/s (1,000,246–1,000,247 ops/s in this capture). Throughput differences are secondary to latency at this load level.

Background pressure sweep

05-allocators — background pressure vs p99_9 latency

The sweep runs each variant across 9 background-pressure levels: a no-T_bg baseline (plotted as each line's leftmost point, left of the log scale) plus 8 log-spaced levels from 100 k/s to 10 M/s. Producer paced at 1 MHz throughout; 1M items per sweep point.

Three different shapes in this chart.

Both pools are flat. The arena lands at p99.9 = 344 ns at every sweep point, zero-pressure baseline included — and it did exactly that in the May capture too, making it the most reproducible number in this demo. The freelist sits on the same 344 ns line at every point in this capture (in May it ranged 296–328). Whatever background heap churn does to the system, the pool designs don't transmit it to the pipeline's tail.

Malloc is the only variant that responds, and not where intuition puts it. Its p99.9 rises from 344 with no background pressure to 376 across the 100 k/s–1.4 M/s band, then settles back to 360 from 2.7 M/s up. The May capture showed the same shape with a larger swing (328 → 424 near 372 k/s → 344). The mechanism is cache locality: at high churn rates malloc's recently-freed blocks get re-touched fast enough to stay warm in L1/L2, while moderate rates expose the fragmentation tail. Two captures agree on the shape — moderate pressure is malloc's worst case, not peak pressure — and disagree on the amplitude, so treat the shape as the result and the amplitude as capture-specific.

Each line's leftmost point, set apart from the log scale, is that variant's p99.9 with no background pressure — all three sit at 344 ns, so the lines start together. Only malloc departs from it, climbing to 376 ns under moderate pressure before settling at 360; the pools stay on the 344 ns line across the whole sweep.

Cross-CCX side note

The headline runs place producer and consumer on the same CCX (cores 4 and 5, sharing L3 on CCX1). In a real system, producer and consumer are sometimes on different CCX slices — the queue crossing traverses the Infinity Fabric rather than the shared L3. This case isn't the headline because most tightly-coupled trading pipelines run same-CCX by design; the side experiment shows what happens when you don't get that choice.

Allocators: cross-thread order pipeline

Cross-CCX configuration: producer on core 4 (CCX1), consumer on core 1 (CCX0), T_bg on core 6 (CCX1). The p50 floor more than doubles for all three variants — 408 ns for the arena and 488 ns for malloc and the freelist in this capture, with the same bucket-level reordering caveat as the same-CCX medians (the May capture put all three on 408). The Infinity Fabric round-trip is the new baseline.

The tail is where the variants part, and it parts the same way in both captures. The pools top out within two buckets of each other — p99.9 = 720 and 784 ns here, both 720 in May — about 2.2–2.4× their same-CCX baselines. Malloc reaches p99 = 1184 ns and p99.9 = 1824 ns: roughly 1.7× the pools at p99 and 2.3–2.5× at p99.9, and a ~4.9× expansion of its own same-CCX p99.9 (376 → 1824) against the pools' ~2.2–2.4×. The cross-CCX environment amplifies malloc's allocator-overhead tail disproportionately — the lock-contention and arena-coordination paths malloc runs internally pay an extra Infinity-Fabric round-trip every time they cross between threads on different L3 domains.

Single-sample max values don't track the percentile ordering and reshuffle between captures (this capture: malloc 8,370, arena 8,930, freelist 8,990; May: freelist 7,200, malloc 10,460, arena 12,200). One sample out of five million reflects where an interrupt landed, not a property of the variant.

This is the gap the design discussion at the top was implicitly pointing at. It happens to live in the cross-CCX corner, not the headline.

What this doesn't show

No jemalloc / mimalloc / tcmalloc comparison. Drop-in allocator replacement (via LD_PRELOAD or a tcmalloc CMake option) is a separate future post scoped to "standard library vs drop-in general allocator." This post is scoped to "standard library vs domain-specific."
Strictly 1P/1C. Multi-producer or multi-consumer patterns break the SPSC queue contract and require different allocator designs. Not measured here.
Fixed 64 B Order size. Production orders carry variable-length FIX fields, optional tags, and sometimes strategy-level metadata. Larger or variable-size orders change the bump-pointer fill-rate, pool sizing, and fragmentation dynamics. That's a separate post.
No variable-length payload. Following from the above — the allocator story for variable-size objects (slab per size class, buddy allocator, region-based) is different from the 64 B fixed-size case.
No NUMA crossing. The reference machine is single-NUMA-node. Cross-NUMA Order pipelines introduce remote-memory latency into every alloc/free on the home node of the other thread.

Takeaway

For cross-thread Order lifetimes on this CPU, the reproducible result is not a pool-vs-pool winner — it's that there isn't one. Across two captures of identical code, the freelist and arena traded places at the median and converged at the tail; their differences sit below what this harness resolves across rebuilds. Both reliably beat new/delete where it matters: the tail under pressure, and everything cross-CCX.

What the arena offers that nothing else here does is reproducible flatness: its p99.9 sat at exactly 344 ns across all nine pressure-sweep points in both captures, no-pressure baseline included. If the latency SLA is written in jitter or worst-case-under-load terms, that capture-stable flat line — not a ±30 ns median ordering — is the property to buy.

new/delete is the only variant whose tail responds to background heap pressure, and the response isn't where intuition puts it: moderate pressure hurts malloc more than peak pressure, because at peak churn recently-freed blocks stay warm in cache. Both captures agree on that shape. The same-CCX gap is real but modest (~1.15–1.3× p99.9 across captures). Cross the CCX boundary and the gap opens to ~2.3–2.5× at p99.9 — roughly a 5× expansion of malloc's own same-CCX tail against the pools' ~2.2–2.4× — with malloc compounding the Infinity-Fabric round-trip with its own coordination overhead. If your producer and consumer don't share an L3, the case for replacing new/delete is much sharper than the same-CCX numbers suggest.

Percentile values shown in charts above are computed from raw histograms in the corresponding JSON entries: log₂-subbucket-16 binning, bucket-midpoint percentile convention. See Methodology for the rdtscp calibration path.

AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 1–7 isolated (cpu0 cannot be kernel-isolated and carries housekeeping; benchmarks pinned to 4–7). Producer on core 4, consumer on core 5, T_bg on core 6 (all CCX1). Headless Ubuntu 24.04. GCC 13.3, -O3 -march=native -std=c++20. 5 outer runs × 1M timed samples per run, percentiles from merged histograms (tail-latency-distribution convention).

Methodology →

The thesis#

Setup#

The Order struct#

Three variants#

Threading model#

The simulated risk check#

Background heap pressure#

Headline latency (CCDF)#

The ordering the harness can't resolve#

Throughput#

Background pressure sweep#

Cross-CCX side note#

What this doesn't show#

Takeaway#

The thesis

Setup

The Order struct

Three variants

Threading model

The simulated risk check

Background heap pressure

Headline latency (CCDF)

The ordering the harness can't resolve

Throughput

Background pressure sweep

Cross-CCX side note

What this doesn't show

Takeaway