AoS vs SoA: bandwidth amplification, not a crossover

This post is about one layout question: a trading-shop hot path scanning a wide struct, measured across a working-set sweep. The folklore answer is that SoA wins when you touch few fields, AoS wins when you touch many, with a crossover somewhere in between. There's no crossover. There's a cliff at the L3 boundary, and SIMD inverts the usual intuition about where vectorisation pays.

The hot path

A trading shop's hot path is rarely "do something to every field of every quote." It's "scan a million quotes for the one field that matters right now." Filter by symbol. Re-mark by mid. Recompute Greeks across the chain. Compress a snapshot for the wire. Each pass touches a small subset of fields, scans a lot of objects, and lives or dies on memory bandwidth.

Array-of-structs vs struct-of-arrays is the layout decision under all of this. The folklore answer is that SoA is the modern, SIMD-friendly choice and AoS is what you do when you don't know any better — except real codebases stay AoS because the struct is also written, copied, serialised, and logged, and AoS keeps "an Order" or "a Quote" as one contiguous thing the rest of the program can reason about.

So when does the layout actually matter, and by how much?

A reasonable prior on this is a crossover: AoS wins when you touch most fields per element, SoA wins when you touch few, with a turning point somewhere in the middle of K. The data has no crossover. The mechanism is different.

The thesis

Two things are happening on a Zen 2 core scanning a 128 B struct:

Bandwidth amplification. When the inner loop touches one 8 B field per element, AoS pays for a full 64 B cache line and uses 8 B of it. SoA packs eight consecutive values of that field into each 64 B line and uses all of them. That's an ~8× difference in bytes-from-memory per useful byte. It shows up as ~7× in wall time, and it doesn't move with anything else you do at the layout level.
A cache cliff, not a staircase. The "cache hierarchy" on this machine is binary in practice: things either fit in L3 or they don't. Crossing the L3 boundary at AoS K=1 multiplies cost by 4× (the L2-floor 1.31 → DRAM band 5.37 step). Crossing the L2 boundary on the way out costs ~14% (the L2-floor 1.31 → L3-plateau 1.50 step) and is barely visible against the cliff that follows.

The result is a layout decision that lives at the L3 boundary. Below L3, layout doesn't matter much — the cache absorbs the bandwidth waste. Above L3, layout is the difference between an L3-resident scan and a DRAM-bound one, and you pay the cliff in full.

SIMD is its own twist on top — discussed in §SIMD — and inverts the usual intuition about where vectorisation pays.

The setup

bench/demos/06-aos-vs-soa/ sweeps three variants across two axes:

struct alignas(64) Quote {
    int64_t timestamp_ns;   // field 0  — hot in tick replay
    int64_t symbol_id;      // field 1
    double  bid_px, ask_px; // fields 2, 3
    int64_t bid_sz, ask_sz; // fields 4, 5
    double  last_px;        // field 6
    int64_t last_sz;        // field 7
    int64_t seq;            // field 8
    int64_t venue_id;       // field 9
    double  mid_px;         // field 10
    int64_t flags;          // field 11
    char    pad[24];        // 96 B of fields + 24 B = 120 B; alignas(64) rounds sizeof up to 128 B
};
static_assert(sizeof(Quote) == 128, "Quote must span exactly two cache lines");

Twelve named fields, two 64 B cache lines per element, sequential access. The hot-field count K is a benchmark parameter from 1 to 16; when K > 12 the inner loop wraps around to field 0. What matters for the comparison is the byte budget the loop touches per element, not the field semantics.
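Which fields share a cache line matters later, when the gap closes at K=8. A minimal sketch that checks the split at compile time, assuming the struct exactly as declared above; `line_of` and `kCacheLine` are hypothetical names, not part of the benchmark source:

```cpp
#include <cstddef>
#include <cstdint>

// Same layout as the Quote struct above, repeated so the sketch compiles alone.
struct alignas(64) Quote {
    std::int64_t timestamp_ns;    // field 0
    std::int64_t symbol_id;       // field 1
    double       bid_px, ask_px;  // fields 2, 3
    std::int64_t bid_sz, ask_sz;  // fields 4, 5
    double       last_px;         // field 6
    std::int64_t last_sz;         // field 7
    std::int64_t seq;             // field 8
    std::int64_t venue_id;        // field 9
    double       mid_px;          // field 10
    std::int64_t flags;           // field 11
    char         pad[24];
};

constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: index of the 64 B line (within one element)
// that a given byte offset falls into.
constexpr std::size_t line_of(std::size_t byte_offset) {
    return byte_offset / kCacheLine;
}

static_assert(sizeof(Quote) == 128, "two cache lines per element");
static_assert(line_of(offsetof(Quote, timestamp_ns)) == 0, "field 0 on line 0");
static_assert(line_of(offsetof(Quote, last_sz)) == 0, "field 7 still on line 0");
static_assert(line_of(offsetof(Quote, seq)) == 1, "field 8 starts line 1");
static_assert(line_of(offsetof(Quote, flags)) == 1, "field 11 on line 1");
```

Fields 0 through 7 fill the first line exactly, so a K ≤ 8 scan touches one line per element and only K > 8 reaches the second.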

The variants:

aos-scalar — std::vector<Quote>, inner loop reads K fields per element. Compiled -O3 -march=znver2 -fno-tree-vectorize. This is the layout the rest of the program sees.
soa-scalar — std::array<std::vector<…>, 12>, one tightly packed array per field. Same outer loop, same reduction. Same -fno-tree-vectorize flag.
soa-autovec — identical source to soa-scalar minus the -fno-tree-vectorize. The GCC 13 autovectoriser turns the inner loops into AVX2 (Zen 2 ISA). The point of this variant isn't "look, SIMD is faster"; it's the question §SIMD answers.
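The two scalar kernels can be sketched roughly as follows. This is not the benchmark source: it assumes a simplified element whose twelve 8 B fields are all `double` (the real struct mixes `int64_t` and `double`), and `QuoteSlots`, `SoaQuotes`, `sum_aos` and `sum_soa` are hypothetical names. It needs C++17 so that `std::vector` honours the 64 B alignment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kFields = 12;

// Simplified stand-in for Quote: 96 B of fields, rounded up to 128 B by alignas.
struct alignas(64) QuoteSlots {
    std::array<double, kFields> field;
};
static_assert(sizeof(QuoteSlots) == 128, "two cache lines per element");

// SoA: one tightly packed array per field.
using SoaQuotes = std::array<std::vector<double>, kFields>;

// AoS kernel: for every element, read the first k fields (wrapping past 12).
double sum_aos(const std::vector<QuoteSlots>& quotes, std::size_t k) {
    assert(k >= 1);
    double acc = 0.0;
    for (const QuoteSlots& q : quotes)
        for (std::size_t f = 0; f < k; ++f)
            acc += q.field[f % kFields];
    return acc;
}

// SoA kernel: stream k whole columns, one after another (wrapping past 12).
double sum_soa(const SoaQuotes& cols, std::size_t k) {
    assert(k >= 1);
    double acc = 0.0;
    for (std::size_t f = 0; f < k; ++f)
        for (double v : cols[f % kFields])
            acc += v;
    return acc;
}
```

The two functions add the same values in a different order, so on arbitrary floating-point data their results can differ in the last bits; the sums agree exactly only when every partial sum is representable.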

Each call computes a real reduction (a floating-point sum of the scanned fields) so the optimiser can't drop the load. The reduction is cheap enough not to dominate the timing at any working-set size; it's there to keep the loop honest. Each cell is ≥5 outer repetitions, median of ns_per_op reported. iterations and items_measured are recorded in the JSON.
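For readers who want to reproduce the reporting convention, here is a minimal sketch of "median ns per element over a handful of repetitions". It assumes `std::chrono::steady_clock` as the timer and a callable that returns its reduction; the real harness, its JSON fields and its counter capture are not shown, and `median` and `median_ns_per_op` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a non-empty sample set (mean of the two middle values if even).
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 ? samples[mid]
                              : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Run `kernel` `reps` times and return the median nanoseconds per item.
template <typename Kernel>
double median_ns_per_op(Kernel&& kernel, std::size_t items, int reps = 5) {
    assert(items > 0 && reps > 0);
    std::vector<double> ns_per_op;
    volatile double sink = 0.0;  // stores the result so the scan is not elided
    for (int r = 0; r < reps; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        sink = kernel();
        const auto t1 = std::chrono::steady_clock::now();
        const double ns =
            std::chrono::duration<double, std::nano>(t1 - t0).count();
        ns_per_op.push_back(ns / static_cast<double>(items));
    }
    (void)sink;
    return median(ns_per_op);
}
```

A call would look like `median_ns_per_op([&] { return scan(data); }, data.size())`, where `scan` stands for whichever kernel is under test.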

Working-set sweep N ∈ {4k, 8k, 16k, 32k, 65k, 131k, 262k, 524k, 1 048 576} elements (so 128 B × N covers 512 KB through 128 MB — the L2 boundary through deep DRAM). K ∈ {1, 2, 4, 8, 16}. Three variants. 135 cells total.
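The sweep arithmetic is easy to check mechanically. A small sketch, assuming the rounded labels (4k, 16k, 65k and so on) stand for exact powers of two, which the tables below suggest; the constant and function names are illustrative only:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kStructBytes = 128;

// Element counts in the sweep, read as exact powers of two.
constexpr std::array<std::size_t, 9> kN = {
    4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576};
constexpr std::array<int, 5> kK = {1, 2, 4, 8, 16};
constexpr std::size_t kVariants = 3;  // aos-scalar, soa-scalar, soa-autovec

// Bytes of struct memory an AoS scan of n elements walks over.
constexpr std::size_t aos_bytes(std::size_t n) { return n * kStructBytes; }

// Number of (N, K, variant) cells in the full sweep.
constexpr std::size_t cell_count() {
    return kN.size() * kK.size() * kVariants;
}

static_assert(aos_bytes(kN.front()) == 512u * 1024u, "smallest set is 512 KB");
static_assert(aos_bytes(kN.back()) == 128u * 1024u * 1024u, "largest is 128 MB");
static_assert(cell_count() == 135, "9 x 5 x 3 cells");
```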

The reference machine, headless boot, core isolation, and statistical reporting (median ns_per_op across 5 outer repetitions per cell) follow the conventions documented at Methodology. One known gap: l1d_misses_per_op is null in this capture — the capture pipeline's original L1D event name (l1d.replacement) was never valid on this AMD part; this capture predates the corrected event, so the field is null. The L1D story below leans on the LLC and instructions-per-cycle counters, which are populated everywhere.

The headline picture: a cliff, not a staircase

The cleanest view of the headline is AoS at K=1 across N. One field per element, one struct layout, scan from L2-resident to deep DRAM:

AoS scalar, K=1 — one field per element, across the working set

The shape:

Working set	ns/element	Notes
N = 4 096 (≈ 512 KB, L2)	1.31	L2-resident floor
N = 8 192 → 65 536 (L3)	1.49–1.50	L3 plateau, near-flat across the whole tier
N = 131 072 (16 MB, L3 capacity)	3.07	L3 boundary, mid-cliff
N = 1 048 576 (128 MB)	5.37	DRAM-bound

L2-resident floor 1.31 → DRAM band 5.37 = 4.08× cliff. There is no visible step within L3 — the four-point plateau from N=8 K to N=65 K is near-flat. The chart looks like a flat low region, a near-vertical cliff at the L3 boundary, and a noisy DRAM band on the right. (The L1→L2 transition isn't measured; the smallest working set in the sweep, N=4 096, already fills L2.)

The DRAM band has capture-dependent structure. In this capture it is nearly flat — N = 262 144 at 5.36 ns, N = 524 288 at 5.32 ns, N = 1 048 576 at 5.37 ns. The May capture of the same code showed a pronounced dip instead: 4.17 and 3.99 ns at the two middle points, with within-capture IQR/median of at most 1.7% — tight enough that, read alone, the dip looked like a real microarchitectural signal. The recapture says otherwise: a clean rebuild and reboot later, the dip is gone. The likely mechanism is the one that makes it non-reproducible — page placement and transparent-huge-page promotion are decided per process at allocation time, and TLB reach and DRAM bank parallelism follow from that draw. Within-capture tightness is not across-capture stability. None of this touches the cliff: the band should be read as "above L3 capacity, AoS-K=1 costs roughly 4–5.5 ns/element, with structure inside that band that belongs to the capture, not the code," and the N = 1 048 576 anchor itself is capture-stable at 5.37 ns.
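The within-capture statistic quoted above, IQR/median, can be computed along these lines. This is a sketch under one assumption in particular: quartiles taken by linear interpolation between order statistics. The post does not say which quartile definition the pipeline uses, and `quantile_sorted` and `iqr_over_median` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Quantile of already-sorted data, linearly interpolated between neighbours.
double quantile_sorted(const std::vector<double>& sorted, double p) {
    assert(!sorted.empty() && p >= 0.0 && p <= 1.0);
    const double pos = p * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    return sorted[lo] +
           (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

// Relative spread of one cell's repetitions: interquartile range over median.
double iqr_over_median(std::vector<double> reps) {
    std::sort(reps.begin(), reps.end());
    const double med = quantile_sorted(reps, 0.5);
    assert(med > 0.0);
    return (quantile_sorted(reps, 0.75) - quantile_sorted(reps, 0.25)) / med;
}
```

The number describes the repetitions inside one process run. It cannot see page placement, which is fixed for the life of that process, and that is exactly why a tight value did not predict the recapture.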

Then the same trace for SoA, K=1:

SoA scalar, K=1 — same one field, contiguous

SoA at K=1 doesn't see the cliff. At N = 1 048 576 the working set for one field is 8 MB — half the L3 capacity per CCX. The scan never reaches DRAM. SoA K=1 at the largest N lands at 0.77 ns/element, against AoS's 5.37 ns. The ratio at the headline working set:

AoS K=1 / SoA K=1, N = 1 048 576: 6.97×

Predicted by the bytes-touched ratio (64/8 = 8×), observed at 7×. The one-point gap between model and measurement is real. It does not come from the SoA side: the K=1 column packs eight values into each 64 B line, so every loaded byte is useful, exactly as the model assumes. The likelier source is AoS-side prefetcher behaviour rescuing some of the wasted bandwidth. It does not change the conclusion.

Bandwidth amplification, properly stated

The temptation is to call this "AoS wastes bandwidth." That framing is right in spirit and wrong in mechanism. Both layouts move memory in whole 64 B cache lines, and AoS at K=1 does use only 8 B of each line it pulls. But the measured gap isn't a utilisation ratio at a shared memory tier; it's which working sets are DRAM-bound in the first place.

At K=1, N = 1 048 576:

AoS touches 128 MB of struct memory. L3 is 16 MB per CCX. The scan is DRAM-bound. LLC misses: ~0.23 per element.
SoA touches 8 MB of one field's array. L3 is 16 MB per CCX. The scan is L3-resident. LLC misses: ~0.0022 per element.

That's a ~100× ratio in LLC misses against a ~7× ratio in wall time. The two numbers tell a more complete story than either alone:

SoA's win at K=1 isn't "less wasted bandwidth at the same memory tier." It's moving the workload up a tier. AoS at K=1 is a DRAM-bound problem; SoA at K=1 is an L3-bound one. The 7× wall-clock gap is the bandwidth ratio between those two tiers, attenuated by the AoS prefetcher being more effective than nothing.

This is the framing that matters for the engineering decision. If your hot scan touches a small subset of fields and the working set is large enough to push AoS into DRAM, SoA doesn't just save bandwidth on a hot tier — it changes which tier you're on.
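The "which tier are you on" argument reduces to comparing footprints against cache capacities. A rough sketch, assuming the 512 KB L2 and 16 MB per-CCX L3 figures quoted in this post; the `Tier` enum and the function names are hypothetical. A working set of exactly L3 capacity is a boundary case that the measurements place mid-cliff, and this sketch does not model it:

```cpp
#include <cstddef>

enum class Tier { L2, L3, DRAM };

constexpr std::size_t kL2Bytes = 512u * 1024u;         // per core
constexpr std::size_t kL3Bytes = 16u * 1024u * 1024u;  // per CCX

// Smallest tier whose nominal capacity holds the whole working set.
constexpr Tier tier_of(std::size_t bytes) {
    return bytes <= kL2Bytes ? Tier::L2
         : bytes <= kL3Bytes ? Tier::L3
                             : Tier::DRAM;
}

// AoS scans walk the whole struct array, whatever K is.
constexpr std::size_t aos_footprint(std::size_t n,
                                    std::size_t struct_bytes = 128) {
    return n * struct_bytes;
}

// SoA scans walk only the k columns they read.
constexpr std::size_t soa_footprint(std::size_t n, std::size_t k,
                                    std::size_t field_bytes = 8) {
    return n * k * field_bytes;
}

static_assert(tier_of(aos_footprint(1048576)) == Tier::DRAM, "128 MB");
static_assert(tier_of(soa_footprint(1048576, 1)) == Tier::L3, "8 MB");
```

Nominal capacity is an optimistic bound: the scan shares the cache with everything else the core touches, so the real cliff can start somewhat earlier.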

What happens when you actually use the fields

Run the same comparison with K walked from 1 to 16, at the headline DRAM-bound N:

AoS vs SoA at N=1M across K — the gap closes as you touch more fields

The K sweep at N = 1 048 576:

K	AoS ns/elem	SoA ns/elem	AoS / SoA	Bytes-touched model
1	5.37	0.77	6.97×	8.00×
2	(cliff)	(sub-L3)	3.38×	4.00×
4	(cliff)	(cliff)	1.31×	2.00×
8	6.55	6.30	1.04×	1.00×
16	12.60	12.61	1.00×	1.00×

(The K=2 and K=4 rows: SoA crosses the L3 boundary somewhere between K=2 and K=4 at this N — that's why its column also enters the cliff. AoS is past the cliff from K=1; SoA falls off it as the number of arrays it streams grows.)

By K=8 the gap is a hairline; by K=16 the two layouts are indistinguishable. Both are streaming every byte of every cache line, both are DRAM-bound, neither benefits from being SoA. The bytes-touched model predicts the trend and slightly overstates the gap — the measured ratios sit a notch below the model column because the AoS prefetcher reduces the effective bandwidth penalty.
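One reading of the model column that reproduces every row of the table is "64 B per AoS element against 8 B per SoA field, clamped at 1×". That clamp is an assumption made here to match the K=16 row, not something the post states, and `model_ratio` and `fraction_of_model` are hypothetical names:

```cpp
constexpr double kLineBytes = 64.0;
constexpr double kFieldBytes = 8.0;

// AoS/SoA bytes-touched ratio for k hot fields, never below 1.
constexpr double model_ratio(int k) {
    return kLineBytes / (kFieldBytes * k) < 1.0
               ? 1.0
               : kLineBytes / (kFieldBytes * k);
}

// How much of the modelled gap a measured ratio actually realises.
constexpr double fraction_of_model(double measured_ratio, int k) {
    return measured_ratio / model_ratio(k);
}

static_assert(model_ratio(1) == 8.0, "K=1 row");
static_assert(model_ratio(2) == 4.0, "K=2 row");
static_assert(model_ratio(4) == 2.0, "K=4 row");
static_assert(model_ratio(8) == 1.0, "K=8 row");
static_assert(model_ratio(16) == 1.0, "K=16 row");
```

Feeding in the measured column gives 6.97 / 8 ≈ 0.87 at K=1 and 1.31 / 2 ≈ 0.66 at K=4, so the model is furthest off where SoA has itself just crossed the L3 boundary.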

The headline finding is not "use SoA." The headline finding is layout matters at K=1 and converges to not-mattering by K=8. The decision criterion for a real codebase is: how many fields per element does the hot pass actually use?

SIMD escapes compute, not memory

The third variant — SoA with autovectorisation enabled — answers a question the K sweep raises. If SoA at K=16 is bandwidth-saturated, what does vectorising the inner loop buy you?

SoA scalar vs autovec at N=1M — K=1 (left) and K=16 (right)

At N = 1 048 576:

K = 16, DRAM-saturated: SoA scalar 12.61 ns/element; SoA autovec 4.99 ns. Speedup 2.53×.
K = 1, L3-resident: SoA scalar 0.77 ns/element; SoA autovec 0.19 ns. Speedup 3.99×.

This inverts the intuition that SIMD wins more when the workload is wider. The autovectoriser wins more on the narrower scan.

The mechanism is clean. At K=16 the SoA scan's bandwidth floor is ~5 ns/element: that is where the autovectorised loop lands (4.99 ns), and it is the DRAM-bound floor the L3 cliff established for SoA at this byte budget. The scalar loop at 12.61 ns still carries compute on top of that floor. SIMD strips the compute, shaves the loop down to its memory-stall floor, and then runs into a wall. At K=1 the workload fits in L3, bandwidth is plentiful, the scalar loop is compute-bound, and SIMD wins essentially its full theoretical width (four 64-bit lanes per 256-bit AVX2 register, so ~4× here).
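What "full width" means for a column sum can be made concrete with four independent accumulators, one per 64-bit lane of an AVX2 register. This is an illustrative sketch, not the code GCC emitted for soa-autovec. Note also that GCC normally needs permission to reassociate floating-point adds (for example -ffast-math or -fassociative-math) before it vectorises a reduction like this on its own; the post does not say how the benchmark handles that. `sum_column_4lanes` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Sum a contiguous column with four independent partial sums. Each partial
// sum has no dependency on the other three, which is the shape a 4-lane
// vectorised reduction takes.
double sum_column_4lanes(const double* col, std::size_t n) {
    assert(col != nullptr || n == 0);
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    const std::size_t n4 = n - n % 4;  // largest multiple of 4 not above n
    for (std::size_t i = 0; i < n4; i += 4)
        for (std::size_t l = 0; l < 4; ++l)
            lane[l] += col[i + l];
    // Combine the lanes, then add the up-to-three leftover elements.
    double acc = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (std::size_t i = n4; i < n; ++i)
        acc += col[i];
    return acc;
}
```

Splitting the sum this way changes the order of additions, so the result can differ from a strictly sequential sum in the last bits. That reordering is what the compiler is not allowed to do by default.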

The framing line worth pulling out:

SIMD escapes compute, not memory. When the bottleneck is bandwidth, vectorising the inner loop buys you nothing — except, sometimes, the speedup of getting back to the bandwidth floor faster.

There's an obvious next question — what does hand-tuned SSE/AVX2 with FMA do on the same workload? That's the territory of demo 3, which isolates the compute-bound case on a different kernel and reports a 10× spread between scalar libm and tuned AVX2+FMA. The chain of reasoning between the two posts: demo 3 establishes what SIMD does when compute is the bottleneck; demo 6 establishes when compute is the bottleneck in the first place.

What this doesn't show

This is the section worth taking seriously. The post's claims are bounded.

Random access. Every measurement above is a sequential scan. Random access on this workload would change every number — the prefetcher would lose, the TLB miss rate would explode, AoS's "wasted" bandwidth would matter less because random AoS and random SoA both read full lines and use what they touch. That's a different demo.
Hand-tuned SIMD with intrinsics. The soa-autovec variant is what GCC 13 chose; it isn't what a careful intrinsic-and-FMA implementation would do. Demo 3 shows the gap between autovec and hand-tuned on a compute-bound kernel — it's substantial.
Mixed read/write workloads. The reduction here is read-only. Write-heavy passes (mark-to-market, snapshot serialisation) introduce store-buffer and write-combining behaviour the loads-only timing doesn't capture.
Multi-thread / cross-CCX. Single thread, pinned to core 4, CCX1, 16 MB L3 to itself. The cross-CCX picture is in demo 5's side note — the Infinity-Fabric round-trip changes which working sets are bandwidth-bound in the first place.
L1D miss counters. The l1d_misses_per_op field is null in this capture: l1d.replacement was never a valid event name on this AMD part, and this capture predates the corrected event (L1-dcache-load-misses). The cache-tier story above relies on LLC misses (populated everywhere). The sweep's smallest working set already fills L2, so the L1 → L2 transition never appears in ns_per_op, and the L1D counter is the only thing that would have made that smallest of the three transitions checkable. It isn't in this capture.
The 128 B struct. Wider struct ⇒ steeper bandwidth-amplification curve; narrower struct ⇒ shallower. The qualitative story (cliff at L3, gap closes by K = struct-width / 8) is robust; the specific 7× number is not portable to a 64 B struct.

Takeaway

For a trading-shop pipeline scanning a chain of quotes or orders:

AoS is fine when your hot pass touches most of the fields per element. By K = 8 on a 128 B struct, layout is invisible.
SoA matters when your hot pass touches one or two fields and the working set is large enough to push AoS past L3. "Large enough" on this machine is ~16 MB of struct memory — L3 capacity: the measured cliff sits between 8 MB (still on the L3 plateau) and 32 MB (fully DRAM-bound). Below the cliff the layout gap is the cache-resident ~1.5–2×, not the 7×.
The cache hierarchy on Zen 2 is binary in practice, not a staircase. The L2 floor and the L3 plateau differ by only ~14% in scan timing for this workload (the sweep does not reach down to L1); what matters is L3 capacity. Plan for "fits in L3" or "doesn't."
SIMD is a compute optimisation, not a memory one. If your hot loop is bandwidth-bound, vectorising won't save you. Find the bandwidth-bound boundary first; vectorise on the right side of it.
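The "large enough" threshold can be turned into an element count for a given layout. A back-of-envelope sketch, assuming the 16 MB per-CCX L3 figure from this post and treating nominal capacity as the limit; `max_l3_resident_elements` is a hypothetical helper:

```cpp
#include <cstddef>

constexpr std::size_t kL3Bytes = 16u * 1024u * 1024u;  // per CCX, from the post

// Largest element count whose scanned bytes still fit in L3 on paper.
// bytes_per_element is 128 for the AoS struct, 8 * K for K SoA columns.
constexpr std::size_t max_l3_resident_elements(std::size_t bytes_per_element,
                                               std::size_t l3 = kL3Bytes) {
    return l3 / bytes_per_element;
}

static_assert(max_l3_resident_elements(128) == 131072, "AoS, 128 B struct");
static_assert(max_l3_resident_elements(8) == 2097152, "SoA, K=1");
static_assert(max_l3_resident_elements(16) == 1048576, "SoA, K=2");
```

The AoS figure of 131 072 elements is the N where the measured curve is already mid-cliff, so a practical budget sits below the nominal number.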

The right layout in real code is rarely a layout decision in isolation — it's a derivative of which fields your hot path touches. The benchmark above is one way to make "hot path" empirical instead of intuitive.

Where this connects: demo 1 established what the cache hierarchy looks like on this CPU; this post fills in the layout response to it. Demo 3 takes over from §SIMD for the compute-bound side of the SIMD question. The cache-staircase pattern — and the question of whether it creates a "small-N crossover" — repeats with map data structures in demo 7. Demo 8 reuses the same cache-tier band markers on the radix sort line, where the L3→DRAM transition is the primary cost driver rather than an algorithmic ramp.

AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 1–7 isolated (cpu0 cannot be kernel-isolated and carries housekeeping), single thread pinned to core 4 (CCX1), headless Ubuntu 24.04. GCC 13.3, -O3 -march=znver2 -fno-tree-vectorize (the last flag is dropped for soa-autovec). 5 outer repetitions per cell, median ns_per_op reported (working-set-sweep convention).

Source: bench/demos/06-aos-vs-soa/ · JSON.

Methodology →
