False sharing: a 13.6× throughput gap from 56 missing bytes

Same algorithm. Same fill stream. Same machine. Two versions of a single struct — the only difference is 56 bytes of padding. At 8 threads, the unpadded version runs at 3.36 ns/op; the padded version reaches 0.25 ns/op. A 13.6× gap from a layout choice the compiler never warned about.

The setup

Each thread owns a slot in a shared array of P&L accumulators. The benchmark repeatedly walks a pre-generated fill stream and adds each value to the thread's slot. All threads run concurrently, targeting different slots — there is no logical sharing of data, only incidental sharing of hardware.

Unpadded

// Adjacent slots share cache lines.
// 8 bytes per slot → 8 slots per 64-byte line.
struct Strategy {
  volatile double pnl;
};

Strategy strategies[8];  // threads 0–7

Padded

// Each slot owns its cache line.
// alignas(64) guarantees no neighbours.
struct alignas(64) PaddedStrategy {
  volatile double pnl;
  char pad[64 - sizeof(double)];
};

PaddedStrategy strategies[8];

The fill stream is 1,024 doubles (8 KB), sized to sit comfortably in the 32 KB L1d cache. It is generated once at startup with a fixed seed and shared read-only across all threads, so every thread reads identical data; the only writes are the per-slot pnl accumulations. The only cross-thread interference is therefore the incidental one: adjacent unpadded slots land on the same cache line.
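A minimal sketch of the accumulation loop described above, assuming one std::thread per slot. The names PaddedSlot, UnpaddedSlot and run_accumulators are hypothetical and this is not the benchmark's actual source; timing, core pinning and counter collection are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical slot types mirroring the two layouts in the text.
struct UnpaddedSlot {
  volatile double pnl;
};

struct alignas(64) PaddedSlot {
  volatile double pnl;
  char pad[64 - sizeof(double)];
};

// Starts one thread per slot. Each thread walks the shared, read-only fill
// stream `passes` times and adds every value to its own slot only.
// Returns once all threads have joined.
template <typename Slot>
void run_accumulators(std::vector<Slot>& slots,
                      const std::vector<double>& fills,
                      std::size_t passes) {
  assert(!slots.empty());
  std::vector<std::thread> workers;
  workers.reserve(slots.size());
  for (std::size_t t = 0; t < slots.size(); ++t) {
    workers.emplace_back([&slots, &fills, passes, t] {
      Slot& mine = slots[t];
      for (std::size_t p = 0; p < passes; ++p) {
        for (double fill : fills) {
          // volatile forces a real load and store on every iteration.
          mine.pnl = mine.pnl + fill;
        }
      }
    });
  }
  for (std::thread& w : workers) w.join();
}
```

Calling run_accumulators on a std::vector<UnpaddedSlot> and then on a std::vector<PaddedSlot> of the same length produces identical sums; only the memory layout, and therefore the coherency traffic, differs. The padded vector relies on C++17 aligned allocation to honour alignas(64).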

The mechanism

Every write to an unpadded slot invalidates the copies of its cache line held by other cores, and that same line holds the seven neighbouring slots belonging to other threads. Each of those threads must re-acquire the line before its next write. Accesses that previously resolved in L1d (~4 cycles) now require L3 round-trips (~35 cycles on Zen 2). With four threads simultaneously invalidating each other, the CPU spends most of its time waiting.
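The line-sharing arithmetic behind this can be sketched as follows. This is an illustration only: kCacheLine, line_of and lines_touched are hypothetical names, the 64-byte line size is assumed rather than queried, and the array is assumed to start on a line boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kCacheLine = 64;  // assumed line size (x86-64)

// Index of the cache line a byte offset falls on, relative to a
// line-aligned base address.
constexpr std::size_t line_of(std::size_t byte_offset) {
  return byte_offset / kCacheLine;
}

// Number of distinct cache lines on which the first byte of each slot lands,
// for `slots` slots laid out `stride` bytes apart from a line-aligned base.
inline std::size_t lines_touched(std::size_t slots, std::size_t stride) {
  assert(stride > 0);
  std::set<std::size_t> lines;
  for (std::size_t i = 0; i < slots; ++i) {
    lines.insert(line_of(i * stride));
  }
  return lines.size();
}
```

With an 8-byte stride, lines_touched(8, 8) is 1: all eight writers contend for one line. With a 64-byte stride, lines_touched(8, 64) is 8: every writer has a line to itself.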

The IPC counter and cache-miss-ratio counter make this visible:

[Figure: False sharing: padded vs unpadded P&L accumulators (IPC and cache-miss ratio)]

Two counters surface the contention. At 1 thread the layouts are indistinguishable: IPC sits near 0.58, miss ratio near 19–22% (the steady-state miss rate of the inner loop, dominated by the volatile reload of pnl). At 2 threads the unpadded miss ratio jumps to 94% (once two cores write to the same line, nearly every access misses) and IPC collapses to 0.20. At 4 threads the miss ratio barely moves (still ~95%, already saturated), but IPC drops further to 0.11. The signal there is subtle: the miss ratio is already close to its 100% ceiling and has little room to rise, but the cost of each miss grows as more cores compete for the same line. Padded holds at IPC ~0.55 and miss ratio under 30% throughout: no shared line, no coherency traffic, no penalty.

The padded IPC of 0.55 is not itself a problem: it reflects the architectural ceiling of this accumulator pattern. volatile on the pnl slot is necessary to defeat register allocation and surface the false-sharing effect, but it forces every loop iteration through a store-to-load forwarding round-trip on the L1 store buffer (~5 cycles), capping IPC near 0.8 regardless of how much parallelism the rest of the machine could offer. The relevant signal is the drop under contention (0.55 → 0.11 at 4t), not the absolute value. A read-modify-write accumulator in a real trading engine has the same ceiling; that is exactly why false sharing on this pattern is so destructive.

The wall-clock cost

The counter collapse translates directly into wall-clock time. At 4 threads on a single CCX, unpadded is 3.61 ns/op against padded's 0.71 ns/op — a 5× wall-clock penalty inside one core complex, lockstep with the 5× IPC collapse. The shared L3 doesn't rescue you — every false-sharing round-trip still costs you instructions you could have been executing.

A note on the chart units: ns/op below is reported as wall-clock time per operation aggregated across all participating threads — system throughput, not per-thread latency. For padded this falls roughly linearly with thread count because the work parallelises cleanly; for unpadded it stays approximately constant (or worsens) because the threads cancel each other out via cacheline ping-pong. Per-thread latency for padded sits near 2.85 ns/op across all intra-CCX thread counts (the inner loop's architectural floor); the system-level number falls because more threads share the wall-clock budget.
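The relationship between the aggregate figure and per-thread latency is simple arithmetic, sketched here with hypothetical helper names. It assumes every thread performs the same number of operations over the same wall-clock interval.

```cpp
#include <cassert>

// Per-thread latency implied by an aggregate (system-level) ns/op figure:
// each of `threads` threads completes one op per (aggregate * threads) ns.
inline double per_thread_ns(double aggregate_ns_per_op, int threads) {
  assert(threads > 0);
  return aggregate_ns_per_op * threads;
}

// Aggregate throughput in operations per second.
inline double ops_per_sec(double aggregate_ns_per_op) {
  assert(aggregate_ns_per_op > 0.0);
  return 1e9 / aggregate_ns_per_op;
}
```

For the padded intra-CCX run, per_thread_ns(0.71, 4) gives about 2.84 ns, matching the per-thread figure quoted above, and ops_per_sec(0.71) gives roughly 1.4 G ops/s.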

[Figure: False sharing: padded vs unpadded P&L accumulators (ns/op)]

Crossing the Fabric

Zen 2 is a chiplet architecture. Two 4-core CCXs sit on a compute die (CCD), connected via Infinity Fabric to a separate I/O die (IOD) that holds the memory controller. Coherency traffic between two cores on the same CCX travels through their shared 16 MB L3. Coherency traffic between CCX0 and CCX1 has to cross the Infinity Fabric — higher latency, no shared cache to absorb the round-trips.

[Figure: False sharing: padded vs unpadded P&L accumulators]

At 4 threads, the cross-CCX unpadded result is 4.38 ns/op against intra-CCX 4t unpadded at 3.61 ns/op — a 1.21× additional penalty from crossing the Fabric, on top of the 5× false-sharing penalty already paid. The cross-CCX 4t result has IQR/median under 0.4% across 20 repetitions — reproducible to two significant figures.
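One way to compute an IQR/median spread from repeated measurements is sketched below. The quartile convention (linear interpolation between order statistics) is an assumption, since the article does not state which one the benchmark uses, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linear-interpolated quantile of an already sorted sample, q in [0, 1].
inline double quantile_sorted(const std::vector<double>& sorted, double q) {
  assert(!sorted.empty() && q >= 0.0 && q <= 1.0);
  const double pos = q * static_cast<double>(sorted.size() - 1);
  const std::size_t lo = static_cast<std::size_t>(pos);
  const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
  const double frac = pos - static_cast<double>(lo);
  return sorted[lo] + frac * (sorted[hi] - sorted[lo]);
}

// Relative spread of repeated measurements: (Q3 - Q1) / median.
// Takes the samples by value so the caller's order is preserved.
inline double iqr_over_median(std::vector<double> samples) {
  std::sort(samples.begin(), samples.end());
  const double q1 = quantile_sorted(samples, 0.25);
  const double med = quantile_sorted(samples, 0.50);
  const double q3 = quantile_sorted(samples, 0.75);
  return (q3 - q1) / med;
}
```

Feeding the 20 per-repetition ns/op values of one configuration into iqr_over_median yields the kind of ratio reported here; a different quartile convention would shift the result slightly.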

13.6× at 8 threads cross-CCX

At 8 threads spanning both CCXs, the gap widens. The unpadded variant settles at 3.36 ns/op; the padded variant reaches 0.25 ns/op — a 13.6× throughput gap from a single struct layout decision.

Configuration	Median ns/op	Throughput (ops/sec)	IQR/median
Intra-CCX 4t padded	0.71	1.40 G/s	0.06%
Intra-CCX 4t unpadded	3.61	277 M/s	0.10%
Cross-CCX 4t padded	0.71	1.40 G/s	0.18%
Cross-CCX 4t unpadded	4.38	228 M/s	0.18%
Cross-CCX 8t padded	0.25	4.04 G/s	26%
Cross-CCX 8t unpadded	3.36	298 M/s	1.0%

The 8-thread configuration is the only one that includes cpu0, which the kernel refuses to isolate (it's the boot CPU and handles unmigratable housekeeping). Run-to-run variance is correspondingly higher there. The unpadded number remains robust — false-sharing cost dwarfs cpu0 background noise — but the padded baseline has IQR/median around 26% rather than the sub-1% seen everywhere else. The padded 8t median also scales better than perfectly against 4t (2.9× throughput from 2× threads), which is not physical for this workload — the denominator is optimistic. Under a perfect-scaling assumption the gap is still 9.4×; the measured-median figure is 13.6×. Treat the headline as "order 10×, plausibly 13×", not a third-significant-figure claim.
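The two gap figures can be reproduced from the table's throughput column. This is a sketch with hypothetical helper names; the inputs are the rounded values from the table, so results agree with the text only to the precision shown there.

```cpp
#include <cassert>

// Ratio of padded to unpadded throughput.
inline double throughput_gap(double padded_ops_per_sec,
                             double unpadded_ops_per_sec) {
  assert(unpadded_ops_per_sec > 0.0);
  return padded_ops_per_sec / unpadded_ops_per_sec;
}

// Throughput expected at `to_threads` if the `from_threads` result scaled
// perfectly linearly with thread count.
inline double perfect_scaling(double ops_per_sec, int from_threads,
                              int to_threads) {
  assert(from_threads > 0 && to_threads > 0);
  return ops_per_sec * to_threads / from_threads;
}
```

throughput_gap(4.04e9, 298e6) is about 13.6, the measured-median headline. Replacing the padded numerator with perfect_scaling(1.40e9, 4, 8), that is 2.80 G/s, gives about 9.4, the conservative figure.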

What this means in practice

False sharing is easy to introduce accidentally in systems with per-thread state:

Per-strategy accumulators in trading engines — exactly the pattern modelled here. A common layout mistake is double pnl[N_STRATEGIES] with one thread per strategy.
Per-worker statistics counters — hit counts, error rates, queue depths. If each counter is a single int64_t in a contiguous array, adjacent counters share lines.
Market-data fan-out structs — if writers update sequence numbers on the same cache line as fields read by consuming threads, readers stall on coherency traffic that has nothing to do with their data.

The fix is straightforward: align each per-thread slot to the cache-line size. alignas(64) is sufficient on x86-64 (and most ARM64 deployments). The trade-off is memory: 8 threads × 64 bytes = 512 bytes instead of 64 bytes. For hot-path accumulators the trade-off is almost always worthwhile. The lock-free SPSC queue in demo 4 ships this pattern productionised — a PaddedAtomic<T> template with a static_assert that catches any layout regression at compile time, rather than relying on a test to notice the throughput collapse.
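A compile-time layout check along these lines might look like the sketch below. It is a hypothetical illustration, not demo 4's actual PaddedAtomic<T>: the member name, the kCacheLine constant and the fixed 64-byte line size are all assumptions.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed; not queried from hardware

// Hypothetical wrapper: one atomic per cache line. alignas rounds sizeof up
// to a multiple of the alignment, so array elements never share a line.
template <typename T>
struct alignas(kCacheLine) PaddedAtomic {
  std::atomic<T> value{T{}};

  static_assert(sizeof(std::atomic<T>) <= kCacheLine,
                "payload must fit in a single cache line");
};

// Layout regressions fail the build instead of showing up as a slowdown.
static_assert(sizeof(PaddedAtomic<std::uint64_t>) == kCacheLine,
              "PaddedAtomic must occupy exactly one cache line");
static_assert(alignof(PaddedAtomic<std::uint64_t>) == kCacheLine,
              "PaddedAtomic must start on a cache-line boundary");
static_assert(sizeof(PaddedAtomic<std::uint64_t>[8]) == 8 * kCacheLine,
              "array stride must be one cache line per element");
```

If someone later removes the alignas or adds a second member that pushes the struct past 64 bytes, the static_asserts stop the build, which is the property the paragraph above describes.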

What this doesn't show

Read-write sharing is different. This benchmark measures write-write false sharing. Writer-reader patterns (e.g., one thread updating a sequence number while another reads the payload on the same line) have different magnitudes, usually less severe, since reads don't invalidate other cores' copies of the line.
The result is hardware-specific. Zen 2's 16 MB shared L3 within a CCX and the Infinity Fabric latency to the other CCX are the specific topology measured. Intel monolithic-die designs absorb intra-socket false sharing in the shared LLC; multi-socket NUMA systems behave more like cross-CCX. The mechanism is universal; the magnitudes depend on the chip.
cpu0 cannot be kernel-isolated. On the reference machine the 8t result includes cpu0 in the worker pool. The headline number is robust in direction and order because the false-sharing signal dominates; smaller effects measured at 8t would need different methodology.

Takeaway

False sharing is a 13.6× throughput collapse from 56 missing bytes of padding — a layout regression the compiler will never warn about. The mechanism is universal (write-write coherency traffic on a shared cache line) but the magnitudes are hardware-specific: 5× within a Zen 2 CCX, 13.6× once the Infinity Fabric is in the loop, milder on Intel monolithic dies. The fix is a single alignas(64) — and a static_assert that turns the discipline into a compile-time invariant, as demo 4's PaddedAtomic<T> does.

When per-thread state lives in an array, the default layout is almost always wrong. Pad to a cache line, assert it, move on.

AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 1–7 isolated (cpu0 cannot be kernel-isolated); thread placement per configuration — intra-CCX runs on cores 4–7, cross-CCX spans both CCXs, the 8t run includes cpu0. Headless Ubuntu 24.04. GCC 13.3, -O3 -march=native. 20 outer repetitions, median reported (throughput convention).

