crucible.garethcooke.com
C++ performance engineering,
measured.
Each post is a focused optimisation problem — naive vs tuned implementations, real hardware measurements on a documented reference machine, and visualisations of system behaviour rather than algorithm steps.
Posts
- 01
Sorted vs unsorted: a branch-prediction deep dive
Same code. Same data. Three variants. Why does one run 7× faster — and why does the branchless version close most of the gap without sorting anything?
2026-05-17
- 02
False sharing: a 15× throughput gap from two missing bytes
Same algorithm. Same fill stream. Same machine. The only variable is struct layout — and it dominates.
2026-05-16
- 03
Black-Scholes: same model, four implementations, ~10× spread
Polynomial approximations buy 12% on their own — but they're the gate to a 9× SIMD win that libm's erfc can't reach.
2026-05-14
- 04
Same queue API. Different tail by orders of magnitude.
End-to-end enqueue→dequeue, market-data thread to strategy thread. Lock-free SPSC vs mutex + condvar. At moderate offered load, all three variants look similar at the median. The tail and the breakdown rate are where they diverge.
2026-05-18
- 05
Allocators: cross-thread order pipeline
Cross-thread Order lifecycle benchmarked across three allocator strategies — malloc, freelist with return queue, and arena with batch handoff. The result that survives contact with the data isn't the one the design discussion would lead you to expect.
2026-05-21
- 06
AoS vs SoA: bandwidth amplification, not a crossover
Scanning a 128 B option-quote struct across the Zen 2 cache hierarchy. The AoS-vs-SoA gap is 7× when one field per element is hot and vanishes when every field is — and the SIMD win lands largest where memory bandwidth is smallest.
2026-05-24
- 07
No crossover: absl::flat_hash_map wins at every N and workload mix
Five map implementations swept from N=8 to N≈4M on a Zen 2 CCX. The 'small map → flat structure' folklore expected a crossover; the data found none. absl::flat_hash_map is fastest at every N, in both pure-lookup and modify-mixed workloads, and the sorted-vector primitives collapse under any modify load at non-trivial N.
2026-05-26
- 08
The sorting shootout: going around the comparison wall
In Progress
2026-05-29