2026-05-24
AoS vs SoA: bandwidth amplification, not a crossover
Scanning a 128 B option-quote struct across the Zen 2 cache hierarchy. The AoS-vs-SoA gap is 7× when one field per element is hot and vanishes when every field is, and the SIMD win lands largest where memory-bandwidth pressure is lowest.
This post is about the layout question on a trading-shop hot path: scanning a wide struct across a sweep of working-set sizes. The folklore answer is that SoA wins when you touch few fields and AoS wins when you touch many, with a crossover somewhere in between. There is no crossover. There is a cliff at the L3 boundary, and SIMD inverts the usual intuition about where vectorisation pays.
The hot path
A trading shop's hot path is rarely "do something to every field of every quote." It's "scan a million quotes for the one field that matters right now." Filter by symbol. Re-mark by mid. Recompute Greeks across the chain. Compress a snapshot for the wire. Each pass touches a small subset of fields, scans a lot of objects, and lives or dies on memory bandwidth.
Array-of-structs vs struct-of-arrays is the layout decision under all of this. The folklore answer is that SoA is the modern, SIMD-friendly choice and AoS is what you do when you don't know any better — except real codebases stay AoS because the struct is also written, copied, serialised, and logged, and AoS keeps "an Order" or "a Quote" as one contiguous thing the rest of the program can reason about.
So when does the layout actually matter, and by how much?
A reasonable prior on this is a crossover: AoS wins when you touch most fields per element, SoA wins when you touch few, with a turning point somewhere in the middle of K. The data has no crossover. The mechanism is different.
The thesis
Two things are happening on a Zen 2 core scanning a 128 B struct:
- Bandwidth amplification. When the inner loop touches one 8 B field per element, AoS pays for the full 64 B cache line and uses 8 B of it. SoA pays for the 8 B and skips the rest. That's an ~8× difference in bytes-from-memory per useful byte. It shows up as ~7× in wall time, and it doesn't move with anything else you do at the layout level.
- A cache cliff, not a staircase. The "cache hierarchy" on this machine is binary in practice: things either fit in L3 or they don't. Crossing the L3 boundary at AoS K=1 multiplies cost by ~4× (the L2-floor 1.31 → DRAM band 5.37 step). Crossing the L2 boundary on the way out costs 13–15% (the L2-floor 1.31 → L3-plateau 1.48–1.51 step) and is barely visible against the cliff that follows.
The result is a layout decision that lives at the L3 boundary. Below L3, layout doesn't matter much — the cache absorbs the bandwidth waste. Above L3, layout is the difference between an L3-resident scan and a DRAM-bound one, and you pay the cliff in full.
SIMD is its own twist on top — discussed in §SIMD — and inverts the usual intuition about where vectorisation pays.
The setup
bench/demos/06-aos-vs-soa/ sweeps three variants across two axes:
```cpp
#include <cstdint>

struct alignas(64) Quote {
    int64_t timestamp_ns;    // field 0: hot in tick replay
    int64_t symbol_id;       // field 1
    double  bid_px, ask_px;  // fields 2, 3
    int64_t bid_sz, ask_sz;  // fields 4, 5
    double  last_px;         // field 6
    int64_t last_sz;         // field 7
    int64_t seq;             // field 8
    int64_t venue_id;        // field 9
    double  mid_px;          // field 10
    int64_t flags;           // field 11
    char    pad[24];         // 96 B of fields + 24 B = 120 B; alignas(64) rounds sizeof up to 128 B
};
static_assert(sizeof(Quote) == 128, "Quote must span exactly two cache lines");
```

Twelve named fields, two 64 B cache lines per element, sequential access. The hot-field count K is a benchmark parameter from 1 to 16. When K > 12 the inner loop wraps back to field 0; what matters for the comparison is the byte budget the loop touches per element, not the field semantics.
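To see why K = 8 is a natural break point for this struct, it helps to check which cache line each field lands on. The sketch below re-declares the struct from the listing so it stands alone; `cache_line_of` is a helper name introduced here for illustration, not something taken from the benchmark source.

```cpp
#include <cstddef>
#include <cstdint>

// Same layout as the Quote listing, re-declared so this snippet is self-contained.
struct alignas(64) Quote {
    std::int64_t timestamp_ns;
    std::int64_t symbol_id;
    double       bid_px, ask_px;
    std::int64_t bid_sz, ask_sz;
    double       last_px;
    std::int64_t last_sz;
    std::int64_t seq;
    std::int64_t venue_id;
    double       mid_px;
    std::int64_t flags;
    char         pad[24];
};

// Hypothetical helper: which 64 B cache line (within one element)
// a given byte offset falls on.
constexpr std::size_t cache_line_of(std::size_t byte_offset) {
    return byte_offset / 64;
}

// Fields 0..7 share the first line; fields 8..11 sit on the second.
static_assert(cache_line_of(offsetof(Quote, timestamp_ns)) == 0, "field 0 on line 0");
static_assert(cache_line_of(offsetof(Quote, last_sz)) == 0, "field 7 on line 0");
static_assert(cache_line_of(offsetof(Quote, seq)) == 1, "field 8 on line 1");
static_assert(cache_line_of(offsetof(Quote, flags)) == 1, "field 11 on line 1");
```

Under this layout, a pass over fields 0 through 7 already consumes every byte of the first cache line, which is consistent with the AoS penalty fading by K = 8.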
The variants:
- aos-scalar: std::vector<Quote>, inner loop reads K fields per element. Compiled -O3 -march=znver2 -fno-tree-vectorize. This is the layout the rest of the program sees.
- soa-scalar: std::array<std::vector<…>, 12>, one tightly packed array per field. Same outer loop, same reduction. Same -fno-tree-vectorize flag.
- soa-autovec: identical source to soa-scalar minus the -fno-tree-vectorize. The GCC 13 autovectoriser turns the inner loops into AVX2 (Zen 2 ISA). The point of this variant isn't "look, SIMD is faster"; it's the question §SIMD answers.
Each call computes a real reduction (a floating-point sum of the scanned fields) so the optimiser can't drop the load. The reduction is cheap enough not to dominate the timing at any working-set size; it's there to keep the loop honest. Each cell is ≥5 outer repetitions, median of ns_per_op reported. iterations and items_measured are recorded in the JSON.
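A minimal sketch of what the two scan kernels could look like. This is not the benchmark's source: for brevity it assumes every field is a double (the real struct mixes int64_t and double), it streams SoA one column at a time, and the names `QuoteAoS`, `QuoteSoA`, `scan_aos` and `scan_soa` are invented for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kFields = 12;

// Simplified stand-in for Quote: 12 uniform 8 B fields (96 B),
// rounded up to 128 B by alignas(64).
struct alignas(64) QuoteAoS {
    double f[kFields];
};
static_assert(sizeof(QuoteAoS) == 128, "two cache lines per element");

// SoA: one tightly packed column per field.
using QuoteSoA = std::array<std::vector<double>, kFields>;

// AoS scan: for each element, read the first k fields (wrapping past field 11).
inline double scan_aos(const std::vector<QuoteAoS>& quotes, std::size_t k) {
    assert(k >= 1 && k <= 16);  // K range used in the sweep
    double sum = 0.0;
    for (const QuoteAoS& q : quotes)
        for (std::size_t j = 0; j < k; ++j)
            sum += q.f[j % kFields];
    return sum;
}

// SoA scan: stream one dense column at a time and fold it into the same sum.
inline double scan_soa(const QuoteSoA& cols, std::size_t k) {
    assert(k >= 1 && k <= 16);
    double sum = 0.0;
    for (std::size_t j = 0; j < k; ++j)
        for (double v : cols[j % kFields])
            sum += v;
    return sum;
}
```

Returning the sum to the caller is what keeps the loads observable. The two loops add in different orders, so on real data the results can differ in the last bits; that does not affect the timing comparison.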
Working-set sweep N ∈ {4k, 8k, 16k, 32k, 65k, 131k, 262k, 524k, 1 048 576} elements (so 128 B × N covers 512 KB through 128 MB — the L2 boundary through deep DRAM). K ∈ {1, 2, 4, 8, 16}. Three variants. 135 cells total.
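The sweep arithmetic can be checked at compile time. The sketch below encodes the N and K grids from the paragraph above; the constant and function names are illustrative, and the N values are assumed to be the exact powers of two that the rounded labels (4k, 8k, ...) stand for.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kStructBytes = 128;
constexpr std::size_t kVariants = 3;  // aos-scalar, soa-scalar, soa-autovec

// Assumed exact element counts behind the rounded labels in the text.
constexpr std::array<std::size_t, 9> kN = {
    4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576};
constexpr std::array<std::size_t, 5> kK = {1, 2, 4, 8, 16};

// AoS working set in bytes for n elements.
constexpr std::size_t aos_footprint_bytes(std::size_t n) {
    return n * kStructBytes;
}

// Total number of (N, K, variant) cells in the sweep.
constexpr std::size_t cell_count() {
    return kN.size() * kK.size() * kVariants;
}

static_assert(aos_footprint_bytes(kN[0]) == 512u * 1024, "smallest N is 512 KB");
static_assert(aos_footprint_bytes(kN[8]) == 128u * 1024 * 1024, "largest N is 128 MB");
static_assert(cell_count() == 135, "9 x 5 x 3 cells");
```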
The reference machine, headless boot, core isolation, and statistical reporting (median ns_per_op across 5 outer repetitions per cell) follow the conventions documented at Methodology. One known gap: l1d_misses_per_op is null in this capture — perf_capture.sh doesn't subscribe to l1d.replacement on this kernel. The L1D story below leans on the LLC and instructions-per-cycle counters, which are populated everywhere.
The headline picture: a cliff, not a staircase
The cleanest view of the headline is AoS at K=1 across N. One field per element, one struct layout, scan from L2-resident to deep-DRAM:
The shape:
| Working set | ns/element | Notes |
|---|---|---|
| N = 4 096 (≈ 512 KB, L2) | 1.31 | L2-resident floor |
| N = 8 192 → 65 536 (L3) | 1.48–1.51 | L3 plateau, near-flat across the whole tier |
| N = 131 072 (16 MB, L3 capacity) | 3.41 | L3 boundary, mid-cliff |
| N = 1 048 576 (128 MB) | 5.37 | DRAM-bound |
L2-resident floor 1.31 → DRAM band 5.37 ≈ 4.1× cliff. There is no visible step within L3: the four-point plateau from N=8 K to N=65 K is near-flat. The chart looks like a flat low region, a near-vertical cliff at the L3 boundary, and a noisy DRAM band on the right. (The L1→L2 transition isn't measured; the smallest working set in the sweep, N=4 096, already fills L2.)
The DRAM band is non-monotonic: N = 262 144 lands at 4.17 ns, N = 524 288 at 3.99 ns, N = 1 048 576 at 5.37 ns. IQR/median is at most 1.7% at any of the three points, well below the swings of up to 35% between adjacent points, so the shape is a real signal rather than measurement noise. The most likely cause is a TLB / transparent-huge-page / DRAM bank-parallelism interaction: at some working-set sizes the active page translations fit in the TLB and at others they spill, and transparent-huge-page promotion is probabilistic at this scale. None of that changes the cliff. The chart should be read as "above L3 capacity, AoS-K=1 costs 4–5.5 ns/element with some microarchitectural variation," not as a clean rising line.
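The noise-versus-signal argument rests on two small statistics: the spread within a cell (IQR/median) and the swing between adjacent cells. A sketch of both follows, assuming a linear-interpolation quantile, which is only one of several common conventions; the harness's actual definition isn't stated here, and all function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Quantile by linear interpolation between order statistics
// (an assumed convention, not necessarily the harness's).
inline double quantile(std::vector<double> samples, double q) {
    assert(!samples.empty() && q >= 0.0 && q <= 1.0);
    std::sort(samples.begin(), samples.end());
    const double pos = q * static_cast<double>(samples.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, samples.size() - 1);
    const double frac = pos - static_cast<double>(lo);
    return samples[lo] + frac * (samples[hi] - samples[lo]);
}

inline double median(const std::vector<double>& samples) {
    return quantile(samples, 0.5);
}

// Within-cell spread: interquartile range relative to the median.
inline double iqr_over_median(const std::vector<double>& samples) {
    return (quantile(samples, 0.75) - quantile(samples, 0.25)) / median(samples);
}

// Between-cell swing: difference of two medians relative to the smaller one.
inline double relative_swing(double a, double b) {
    return (std::max(a, b) - std::min(a, b)) / std::min(a, b);
}
```

With the medians quoted above, `relative_swing(3.99, 5.37)` comes out just under 0.35, which is the "up to 35%" swing being compared against the 1.7% within-cell spread.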
Then the same trace for SoA, K=1:
SoA at K=1 doesn't see the cliff. At N = 1 048 576 the working set for one field is 8 MB — half the L3 capacity per CCX. The scan never reaches DRAM. SoA K=1 at the largest N lands at 0.77 ns/element, against AoS's 5.37 ns. The ratio at the headline working set:
AoS K=1 / SoA K=1, N = 1 048 576: 6.98×
Predicted by the bytes-touched ratio (64/8 = 8×), observed at 7×. The gap between model and measurement is real. It is most plausibly a combination of SoA not being free (the column still moves whole 64 B lines through L3, even though at K=1 every byte of each line is useful) and AoS-side prefetcher behaviour rescuing some of the wasted bandwidth. It does not change the conclusion.
Bandwidth amplification, properly stated
The temptation is to call this "AoS wastes bandwidth." That framing is right in spirit and wrong in mechanism. Both layouts read whole 64 B cache lines from DRAM. The difference isn't that one wastes lines and the other doesn't; it's which working sets are DRAM-bound in the first place.
At K=1, N = 1 048 576:
- AoS touches 128 MB of struct memory. L3 is 16 MB per CCX. The scan is DRAM-bound. LLC misses: ~0.23 per element.
- SoA touches 8 MB of one field's array. L3 is 16 MB per CCX. The scan is L3-resident. LLC misses: ~0.0022 per element.
That's a ~100× ratio in LLC misses against a ~7× ratio in wall time. The two numbers tell a more complete story than either alone:
SoA's win at K=1 isn't "less wasted bandwidth at the same memory tier." It's moving the workload up a tier. AoS at K=1 is a DRAM-bound problem; SoA at K=1 is an L3-bound one. The 7× wall-clock gap is the bandwidth ratio between those two tiers, attenuated by the AoS prefetcher being more effective than nothing.
This is the framing that matters for the engineering decision. If your hot scan touches a small subset of fields and the working set is large enough to push AoS into DRAM, SoA doesn't just save bandwidth on a hot tier — it changes which tier you're on.
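The "which tier are you on" question reduces to comparing a footprint against two capacities. A rough sketch, assuming the reference machine's 512 KB L2 per core and 16 MB L3 per CCX; the names `Tier`, `tier_of`, `aos_bytes` and `soa_bytes` are invented here. A footprint exactly at L3 capacity was measured mid-cliff for AoS, so the `<=` threshold is optimistic and should be read as fuzzy.

```cpp
#include <cstddef>

enum class Tier { L2, L3, Dram };

// Assumed capacities for the reference machine (Zen 2, one CCX).
constexpr std::size_t kL2Bytes = 512u * 1024;
constexpr std::size_t kL3Bytes = 16u * 1024 * 1024;

// AoS footprint as the post counts it: the whole 128 B struct per element.
constexpr std::size_t aos_bytes(std::size_t n) {
    return n * 128;
}

// SoA footprint: one 8 B column per distinct hot field (at most 12 columns).
constexpr std::size_t soa_bytes(std::size_t n, std::size_t k) {
    return n * 8 * (k < 12 ? k : 12);
}

// Smallest tier the footprint fits in. Real boundaries are softer than this.
constexpr Tier tier_of(std::size_t footprint_bytes) {
    return footprint_bytes <= kL2Bytes ? Tier::L2
         : footprint_bytes <= kL3Bytes ? Tier::L3
                                       : Tier::Dram;
}

// K = 1 at N = 1 048 576: AoS is a DRAM problem, SoA an L3 one.
static_assert(tier_of(aos_bytes(1048576)) == Tier::Dram, "128 MB of structs");
static_assert(tier_of(soa_bytes(1048576, 1)) == Tier::L3, "8 MB column");
```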
What happens when you actually use the fields
Run the same comparison with K walked from 1 to 16, at the headline DRAM-bound N:
The K sweep at N = 1 048 576:
| K | AoS ns/elem | SoA ns/elem | AoS / SoA | Bytes-touched model |
|---|---|---|---|---|
| 1 | 5.37 | 0.77 | 6.98× | 8.00× |
| 2 | (cliff) | (sub-L3) | 3.37× | 4.00× |
| 4 | (cliff) | (cliff) | 1.31× | 2.00× |
| 8 | 6.49 | 6.30 | 1.03× | 1.00× |
| 16 | 12.63 | 12.62 | 1.00× | 1.00× |
(The K=2 and K=4 rows: SoA crosses the L3 boundary somewhere between K=2 and K=4 at this N — that's why its column also enters the cliff. AoS is past the cliff from K=1; SoA falls off it as the number of arrays it streams grows.)
By K=8 the gap is a hairline; by K=16 the two layouts are indistinguishable. Both are streaming every byte of every cache line, both are DRAM-bound, neither benefits from being SoA. The bytes-touched model predicts the trend and overstates the gap at small K: for K ≤ 4 the measured ratios sit below the model column, because the AoS prefetcher reduces the effective bandwidth penalty.
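The sketch below is one reconstruction of the bytes-touched model column, not code from the benchmark. It assumes AoS pays for whole 64 B lines covering the first min(K, 12) fields, and SoA pays 8 B per field read, counting wrapped re-reads as part of the byte budget. Under those assumptions it reproduces the column for every K in the table.

```cpp
#include <cstddef>

constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kFieldBytes = 8;
constexpr std::size_t kNamedFields = 12;

// AoS: whole cache lines spanning the distinct fields the loop reads.
constexpr std::size_t aos_bytes_per_elem(std::size_t k) {
    const std::size_t distinct = k < kNamedFields ? k : kNamedFields;
    const std::size_t span = distinct * kFieldBytes;
    return ((span + kLineBytes - 1) / kLineBytes) * kLineBytes;
}

// SoA: 8 B per field read (assumption: wrapped re-reads count toward the budget).
constexpr std::size_t soa_bytes_per_elem(std::size_t k) {
    return k * kFieldBytes;
}

constexpr double model_ratio(std::size_t k) {
    return static_cast<double>(aos_bytes_per_elem(k)) /
           static_cast<double>(soa_bytes_per_elem(k));
}

static_assert(model_ratio(1) == 8.0, "K=1");
static_assert(model_ratio(2) == 4.0, "K=2");
static_assert(model_ratio(4) == 2.0, "K=4");
static_assert(model_ratio(8) == 1.0, "K=8");
static_assert(model_ratio(16) == 1.0, "K=16");
```

The model says nothing about which tier serves those bytes, which is why it can only bound the measured ratio rather than predict it.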
The headline finding is not "use SoA." The headline finding is layout matters at K=1 and converges to not-mattering by K=8. The decision criterion for a real codebase is: how many fields per element does the hot pass actually use?
SIMD escapes compute, not memory
The third variant — SoA with autovectorisation enabled — answers a question the K sweep raises. If SoA at K=16 is bandwidth-saturated, what does vectorising the inner loop buy you?
At N = 1 048 576:
- K = 16, DRAM-saturated: SoA scalar 12.62 ns/element; SoA autovec 5.01 ns. Speedup 2.52×.
- K = 1, L3-resident: SoA scalar 0.77 ns/element; SoA autovec 0.19 ns. Speedup 3.99×.
This inverts the intuition that SIMD wins more when the workload is wider. The autovectoriser wins more on the narrower scan.
The mechanism is clean. At K=16 SoA is bandwidth-bound at ~5 ns/element — that's the DRAM-bound floor the L3 cliff established for SoA at this byte budget. SIMD reduces compute, but compute isn't the bottleneck anymore. The vectoriser shaves the loop down to its memory-stall floor and then runs into a wall. At K=1 the workload fits in L3, bandwidth is plentiful, the scalar loop is compute-bound, and SIMD wins essentially its full theoretical width-times-FMA factor (~4× here).
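The mechanism can be written as a toy bottleneck model: vectorising divides the compute time by the SIMD width, but the loop cannot drop below its memory floor. This is an illustrative assumption, not a fitted model; the function names are invented, the width of 4 assumes 64-bit lanes in a 256-bit AVX2 register, and the 5.01 ns floor is taken from the measured autovec number rather than derived independently.

```cpp
// Toy model: time per element after vectorisation is whichever is larger,
// the compute time divided by SIMD width or the memory-stall floor.
constexpr double vectorised_ns(double scalar_ns, double simd_width,
                               double memory_floor_ns) {
    return (scalar_ns / simd_width) > memory_floor_ns
               ? (scalar_ns / simd_width)
               : memory_floor_ns;
}

constexpr double speedup(double before_ns, double after_ns) {
    return before_ns / after_ns;
}

// K = 16, DRAM-bound: the floor binds, so a 4-wide unit yields about 2.5x.
static_assert(vectorised_ns(12.62, 4.0, 5.01) == 5.01, "memory floor binds");

// K = 1, L3-resident: with no binding floor, the full 4x is available.
static_assert(vectorised_ns(0.77, 4.0, 0.0) < 0.20, "compute bound");
```

Read this way, the two measured speedups reflect the same vectoriser doing the same work, with only the position of the memory floor changing.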
The framing line worth pulling out:
SIMD escapes compute, not memory. When the bottleneck is bandwidth, vectorising the inner loop buys you nothing — except, sometimes, the speedup of getting back to the bandwidth floor faster.
There's an obvious next question — what does hand-tuned SSE/AVX2 with FMA do on the same workload? That's the territory of demo 3, which isolates the compute-bound case on a different kernel and reports a 10× spread between scalar libm and tuned AVX2+FMA. The chain of reasoning between the two posts: demo 3 establishes what SIMD does when compute is the bottleneck; demo 6 establishes when compute is the bottleneck in the first place.
What this doesn't show
This is the section worth taking seriously. The post's claims are bounded.
- Random access. Every measurement above is a sequential scan. Random access on this workload would change every number — the prefetcher would lose, the TLB miss rate would explode, AoS's "wasted" bandwidth would matter less because random AoS and random SoA both read full lines and use what they touch. That's a different demo.
- Hand-tuned SIMD with intrinsics. The soa-autovec variant is what GCC 13 chose; it isn't what a careful intrinsic-and-FMA implementation would do. Demo 3 shows the gap between autovec and hand-tuned on a compute-bound kernel, and it's substantial.
- Mixed read/write workloads. The reduction here is read-only. Write-heavy passes (mark-to-market, snapshot serialisation) introduce store-buffer and write-combining behaviour the loads-only timing doesn't capture.
- Multi-thread / cross-CCX. Single thread, pinned to core 4, CCX1, 16 MB L3 to itself. The cross-CCX picture is in demo 5's side note — the Infinity-Fabric round-trip changes which working sets are bandwidth-bound in the first place.
- L1D miss counters. This capture's perf_capture.sh doesn't subscribe to l1d.replacement, so the L1D row in the JSON is null. The cache-tier story above relies on LLC misses (populated everywhere) and on the qualitative L1 → L2 transition being too small to be visible in ns_per_op. The L1D counter would have made the smallest of the three transitions checkable. It isn't in this capture.
- The 128 B struct. Wider struct ⇒ steeper bandwidth-amplification curve; narrower struct ⇒ shallower. The qualitative story (cliff at L3, gap closes by K = struct-width / 8) is robust; the specific 7× number is not portable to a 64 B struct.
Takeaway
For a trading-shop pipeline scanning a chain of quotes or orders:
- AoS is fine when your hot pass touches most of the fields per element. By K = 8 on a 128 B struct, layout is invisible.
- SoA matters when your hot pass touches one or two fields and the working set is large enough to push AoS past L3. "Large enough" on this machine is somewhere between 8 MB and 16 MB of struct memory, where the AoS K=1 sweep leaves the L3 plateau and hits the cliff. At 512 KB, the smallest working set measured, AoS at K=1 is L2-resident and the layout question largely evaporates.
- The cache hierarchy on Zen 2 is binary in practice, not a staircase. The L2 floor and the L3 plateau are within about 15% of each other in scan timing for this workload (L1 sits below the smallest working set in the sweep); what matters is L3 capacity. Plan for "fits in L3" or "doesn't."
- SIMD is a compute optimisation, not a memory one. If your hot loop is bandwidth-bound, vectorising won't save you. Find the bandwidth-bound boundary first; vectorise on the right side of it.
The right layout in real code is rarely a layout decision in isolation — it's a derivative of which fields your hot path touches. The benchmark above is one way to make "hot path" empirical instead of intuitive.
Where this connects: demo 1 established what the cache hierarchy looks like on this CPU; this post fills in the layout response to it. Demo 3 takes over from §SIMD for the compute-bound side of the SIMD question. The cache-staircase pattern — and the question of whether it creates a "small-N crossover" — repeats with map data structures in demo 7.
AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 0–7 isolated (core 0 carries unavoidable kernel housekeeping), single thread pinned to core 4 (CCX1), headless Ubuntu 24.04. GCC 13.3, -O3 -march=znver2 -fno-tree-vectorize (soa-autovec omits -fno-tree-vectorize). 5 outer repetitions per cell, median ns_per_op reported (working-set-sweep convention).
Source: bench/demos/06-aos-vs-soa/ · JSON.