crucible.garethcooke.com

C++ performance engineering,
measured.

Each post is a focused optimisation problem — naive vs tuned implementations, real hardware measurements on a documented reference machine, and visualisations of system behaviour rather than algorithm steps.

Posts

01
Sorted vs unsorted: a branch-prediction deep dive
Same code. Same data. Three variants. Why does one run 7× faster — and why does the branchless version close most of the gap without sorting anything?
2026-05-17
02
False sharing: a 13.6× throughput gap from 56 missing bytes
Same algorithm. Same fill stream. Same machine. The only variable is struct layout — and it dominates.
2026-05-16
03
Black-Scholes: same model, four implementations, ~10× spread
Polynomial approximations buy 11% on their own — but they're the gate to a 9× SIMD win that libm's erfc can't reach.
2026-05-14
04
Same queue API. Different tail by orders of magnitude.
End-to-end enqueue→dequeue, market-data thread to strategy thread. Lock-free SPSC vs mutex + condvar. At 1 MHz offered load, the medians are already ~13× apart: lock-free ~130 ns, mutex ~1.7 µs. The tail diverges to ~20× at p99.9; past saturation, the gap widens to orders of magnitude and behaviour turns bistable.
2026-06-05
05
Allocators: cross-thread order pipeline
Cross-thread Order lifecycle benchmarked across three allocator strategies — malloc, freelist with return queue, and arena with batch handoff. The result that survives contact with the data isn't the one the design discussion would lead you to expect.
2026-06-05
06
AoS vs SoA: bandwidth amplification, not a crossover
Scanning a 128 B option-quote struct across the Zen 2 cache hierarchy. The AoS-vs-SoA gap is 7× when one field per element is hot and vanishes when every field is — and the SIMD win lands largest where memory bandwidth is smallest.
2026-06-05
07
No crossover: absl::flat_hash_map wins at every N and workload mix
Five map implementations swept from N=8 to N≈4M on a Zen 2 CCX. The 'small map → flat structure' folklore expected a crossover; the data found none. absl::flat_hash_map is fastest at every N, in both pure-lookup and modify-mixed workloads, and the sorted-vector primitives collapse under any modify load at non-trivial N.
2026-06-05
08
The comparison lower bound is a wall. Integer keys let you walk around it.
std::sort is the default everyone reaches for, and on fixed-width integer keys it's rarely the fastest. A look at why a radix sort sidesteps the Ω(n log n) wall, why input shape decides the comparison-sort race, and why a data-independent sort is the one you want when a worst-case input on a hot path is what you have to bound.
2026-06-05
09
NEON on ARM: vector width is not a given
The Black-Scholes kernel from post 03, on a Cortex-A76. NEON delivers ~4.3× over scalar at 16k, rising to ~4.8× at 1M, and stops there. Where Zen 2 had AVX2 to reach for, the Pi 5 doesn't. Vector width is a property of the silicon.
2026-06-02

Special editions

outside standard methodology

★
Measuring the gap
Take a problem with a famous theoretical quantum speedup, run it both ways, and measure the gap between the promise and the silicon. Grover's algorithm on IBM quantum hardware vs classical linear search. Classical wins decisively; this post shows exactly why.
In Progress
quantum
2026-06-02