Methodology

How the numbers are produced

Published benchmark results are only credible with a documented, reproducible methodology. All measurements on this site satisfy the four non-negotiable commitments below.

Reference machines

Machine 1 — x86-64 (demos 1–8)

CPU: AMD Ryzen 7 3800X (Zen 2), 8C/16T silicon, SMT disabled, 8 logical CPUs exposed during benchmarks
RAM: 32 GB DDR4-3200
Board: ASUS ROG STRIX B550-F GAMING
OS: Ubuntu Server LTS (dual-boot)
Mode: Headless boot (no display server, no graphical desktop), which eliminates compositor and display-server noise from measurements
Boot: isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7; benchmarks pin to cores 4–7 via taskset
BIOS: Core Performance Boost disabled, SMT disabled
ISA: SSE4.2 · AVX · AVX2 · FMA · no AVX-512

Zen 2 executes 256-bit AVX2 as single µops — verified per capture with the retired-µop counter (ex_ret_cops ≈ 1.0 µops/instruction; demo 03), unlike Zen and Zen+, which cracked each 256-bit op into two 128-bit µops. Full lscpu --extended output, kernel version, and compiler version are committed to the repo alongside each benchmark result.

Machine 2 — AArch64 (demo 9+)

CPU: Cortex-A76 (BCM2712), 4 cores, AArch64, NEON baseline ISA, no SVE
Board: Raspberry Pi 5 Model B Rev 1.1
RAM: 4 GB LPDDR4X
OS: Raspberry Pi OS (64-bit)
Boot: isolcpus=2,3 in kernel cmdline; benchmarks pinned to core 3 via taskset -c 3; IRQ affinity redirected off cores 2–3
Clock: governor = performance; clock pinned at 2400 MHz (MAXMHZ); get_throttled = 0x0 verified (no CPU throttling). This clock-pin approach replaces the Zen 2 BIOS turbo-disable.
ISA: AArch64 · NEON (128-bit) · no SVE

Cross-machine absolute ns/op comparisons are never made between Machine 1 and Machine 2 — different clocks, compilers, and memory subsystems make them meaningless. The only portable quantity across machines is the within-machine speedup ratio. Full lscpu --extended output, kernel version, and compiler version are committed alongside each benchmark result.

Four non-negotiable commitments

CPU governor pinned to performance

Every benchmark run begins with cpupower frequency-set -g performance. Frequency scaling during a measurement run is a leading cause of variance that makes numbers look better or worse than they really are.

Turbo Boost disabled

Core Performance Boost is disabled in BIOS, pinning the cores to their 3900 MHz base clock. Boost state is measured from the kernel, never asserted: each capture reads the per-CPU cpb flag (/sys/devices/system/cpu/cpu*/cpufreq/cpb) — authoritative on this AMD acpi-cpufreq board — and cross-checks it against the top entry of scaling_available_frequencies. The result and the signal it came from are recorded in every JSON as machine.turbo and machine.turbo_source. A hard capture-time gate aborts the run before any benchmark executes if boost is detected as enabled, so a boosted capture cannot be taken silently.

cpuinfo_max_freq / lscpu MAXMHZ is not used to infer boost state: on this board it reports the silicon’s 4560 MHz ceiling whether or not boost is enabled, so it is recorded as machine.freq_max_advertised_mhz (advisory) while machine.freq_max_available_mhz — the highest real P-state, 3900 MHz with boost off — is the value that tracks state. Where no boost signal is exposed (e.g. the AArch64 rig), machine.turbo is recorded as null rather than guessed. Boost obscures the true steady-state throughput the predictor and cache hierarchy deliver at nominal frequency.
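The detection flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the BoostReading struct, the function names, and the exact form of the cross-check (boost off implies the top listed frequency does not exceed the expected base clock) are all assumptions made for the example. The functions take file contents as strings so the logic can be exercised without the sysfs files present.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical record; the fields loosely mirror machine.turbo,
// machine.turbo_source and machine.freq_max_available_mhz.
struct BoostReading {
    std::optional<bool> turbo;    // nullopt when no boost signal is exposed
    std::string source;           // which signal produced the answer
    long top_available_mhz = 0;   // highest entry in the frequency list
    bool cross_check_ok = true;   // assumption: boost off implies top <= base
};

// Largest frequency, in kHz, found in a scaling_available_frequencies string.
inline long top_available_khz(const std::string& text) {
    std::istringstream in(text);
    long best = 0;
    for (long khz = 0; in >> khz;) best = std::max(best, khz);
    return best;
}

// cpb_text:   contents of .../cpufreq/cpb, or nullopt if the file is absent.
// freqs_text: contents of .../cpufreq/scaling_available_frequencies.
// base_khz:   the base clock the capture expects.
inline BoostReading read_boost(const std::optional<std::string>& cpb_text,
                               const std::string& freqs_text,
                               long base_khz) {
    assert(base_khz > 0);
    BoostReading r;
    r.top_available_mhz = top_available_khz(freqs_text) / 1000;
    if (!cpb_text) {              // no boost signal: record null, do not guess
        r.source = "none";
        return r;
    }
    int flag = 0;
    std::istringstream cpb(*cpb_text);
    cpb >> flag;
    r.turbo = (flag != 0);
    r.source = "cpufreq/cpb";
    r.cross_check_ok = *r.turbo || r.top_available_mhz * 1000 <= base_khz;
    return r;
}

// Capture-time gate: refuse to run when boost is reported as enabled
// or when the two signals disagree.
inline bool capture_allowed(const BoostReading& r) {
    return !(r.turbo.has_value() && *r.turbo) && r.cross_check_ok;
}
```

A caller would read the two sysfs files, pass their contents to read_boost, and abort before any benchmark executes if capture_allowed returns false. Keeping the absent-signal case as an empty optional is what lets the JSON field be written as null.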

An earlier version of this pipeline derived turbo state from an environment variable rather than measuring it; see Corrections.

Core isolation

Cores 1–7 are isolated at the kernel level via isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7 boot parameters — scoped to a dedicated GRUB entry (“Ubuntu (benchmark — cores 0-7 isolated)”) distinct from the standard development entry. Within that isolated set, benchmarks are additionally pinned to cores 4–7 via taskset (invoked by the per-demo wrapper scripts), with cores 0–3 absorbing any residual kernel housekeeping the isolation directives cannot redirect. SMT is disabled at the BIOS level — verified via /sys/devices/system/cpu/smt/active returning 0 and lscpu reporting 8 CPUs — to remove SMT-sibling resource sharing (L1, L2, execution ports, frontend) from all measurements. Isolated CPU IDs are recorded in each demo’s JSON machine.isolated_cpus field. On Machine 2 isolation uses isolcpus=2,3 in the Pi kernel cmdline with benchmarks pinned to core 3 via taskset -c 3; the A76 exposes no SMT to disable. See Machine 2 above.
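As an illustration of how the isolated set could be checked before a run, the sketch below expands a kernel cpulist string (the format used by isolcpus= and by /sys/devices/system/cpu/isolated) and confirms that the pinned cores fall inside it. Both function names are hypothetical, and the repository's wrapper scripts may do this differently.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Expand a kernel cpulist string such as "1-7" or "2,3" into CPU ids.
inline std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::istringstream in(text);
    std::string item;
    while (std::getline(in, item, ',')) {
        // Skip empty items, e.g. the bare newline of an empty list.
        if (item.find_first_of("0123456789") == std::string::npos) continue;
        const auto dash = item.find('-');
        const int first = std::stoi(item.substr(0, dash));
        const int last = (dash == std::string::npos)
                             ? first
                             : std::stoi(item.substr(dash + 1));
        assert(first <= last);
        for (int cpu = first; cpu <= last; ++cpu) cpus.push_back(cpu);
    }
    return cpus;
}

// True when every pinned CPU is also a member of the isolated set.
inline bool pinned_within_isolated(const std::vector<int>& pinned,
                                   const std::vector<int>& isolated) {
    return std::all_of(pinned.begin(), pinned.end(), [&](int cpu) {
        return std::find(isolated.begin(), isolated.end(), cpu) != isolated.end();
    });
}
```

With the Machine 1 settings, parse_cpu_list("1-7") yields CPUs 1 through 7 and the pinned set 4-7 passes the check. The same parsed list is the kind of value that could populate machine.isolated_cpus.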

Cross-CCX results. CPU 0 still carries the system timer and other unmovable kernel work that isolcpus= cannot fully evict, so cross-CCX measurements (cores 0–3 and 4–7 both in the isolated set) carry slightly higher ambient noise than intra-CCX measurements and are labelled accordingly in every post.

Statistical reporting

Each benchmark uses one of three rep-count conventions, depending on what kind of statistic it reports. Every post’s footer states which convention it used.

Throughput / steady-state median (demos 1, 2, 3, 9): ≥20 outer repetitions (Google Benchmark --benchmark_repetitions); aggregates computed across those repetitions.
Tail-latency distribution (demos 4, 5): 5 outer runs × 1 M timed samples per run through a custom latency pipeline; percentiles computed from histograms merged across runs.
Working-set sweep (demos 6, 7, 8): 5 outer repetitions per cell; median ns_per_op reported. Sweep coverage substitutes for higher per-cell rep count.
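For the tail-latency convention, merging histograms across runs and then reading a percentile from the merged counts might look like the sketch below. The fixed-width linear buckets, the class name, and the nearest-rank percentile definition are assumptions chosen for brevity; the actual pipeline may bucket and interpolate differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-width latency histogram; the final bucket absorbs overflow.
class LatencyHistogram {
public:
    LatencyHistogram(std::size_t buckets, std::uint64_t width_ns)
        : width_ns_(width_ns), counts_(buckets, 0) {
        assert(buckets > 0 && width_ns > 0);
    }

    void record(std::uint64_t ns) {
        const auto i = static_cast<std::size_t>(ns / width_ns_);
        ++counts_[std::min(i, counts_.size() - 1)];
    }

    // Add another run's counts bucket by bucket.
    void merge(const LatencyHistogram& other) {
        assert(other.width_ns_ == width_ns_ &&
               other.counts_.size() == counts_.size());
        for (std::size_t i = 0; i < counts_.size(); ++i)
            counts_[i] += other.counts_[i];
    }

    std::uint64_t total() const {
        std::uint64_t n = 0;
        for (auto c : counts_) n += c;
        return n;
    }

    // Nearest-rank percentile, reported as the upper edge of the bucket
    // that holds the target sample.
    std::uint64_t percentile_ns(double p) const {
        const std::uint64_t n = total();
        assert(n > 0 && p > 0.0 && p <= 100.0);
        const auto rank = static_cast<std::uint64_t>(
            std::ceil(p / 100.0 * static_cast<double>(n)));
        std::uint64_t seen = 0;
        for (std::size_t i = 0; i < counts_.size(); ++i) {
            seen += counts_[i];
            if (seen >= rank) return (i + 1) * width_ns_;
        }
        return counts_.size() * width_ns_;
    }

private:
    std::uint64_t width_ns_;
    std::vector<std::uint64_t> counts_;
};
```

Merging counts before taking the percentile means the tail figure reflects every sample from every run, rather than an average of per-run percentiles, which would understate rare outliers.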

Every chart states which statistic it shows:

Median — typical-case latency
Min — best the hardware can do (cache warm, predictor trained)
p99 / p99.9 — tail-latency claims
IQR — spread around the median
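To make the listed statistics concrete, here is a small sketch that derives min, median, and IQR from a set of per-repetition results. It is illustrative only: Google Benchmark computes its own aggregates, and the linear-interpolation quantile used here is an assumption, one of several common definitions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Summary {
    double min;
    double median;
    double iqr;   // Q3 minus Q1
};

// Quantile by linear interpolation between the two nearest order statistics.
inline double quantile_sorted(const std::vector<double>& sorted, double q) {
    assert(!sorted.empty() && q >= 0.0 && q <= 1.0);
    const double pos = q * static_cast<double>(sorted.size() - 1);
    const auto lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    return sorted[lo] +
           (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

// Takes the repetitions by value so the caller's ordering is preserved.
inline Summary summarise(std::vector<double> reps) {
    std::sort(reps.begin(), reps.end());
    const double q1 = quantile_sorted(reps, 0.25);
    const double q3 = quantile_sorted(reps, 0.75);
    return {reps.front(), quantile_sorted(reps, 0.5), q3 - q1};
}
```

Reporting the IQR next to the median shows how tightly the repetitions cluster, which a mean and standard deviation would blur when a few repetitions are disturbed.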

Timestamping: rdtscp calibration

The custom latency pipelines (demos 4–6) timestamp with the TSC rather than Google Benchmark’s timers. Every timestamp is taken by rdtscp_ordered() (bench/common/tsc_utils.h): an rdtscp instruction followed by an lfence. rdtscp does not read the counter until all prior loads and stores have retired, so a timestamp cannot be taken before the work it closes has finished; the trailing lfence prevents subsequent instructions from executing before the rdtscp itself retires, so following work cannot start ahead of the timestamp. A latency sample is the difference of two such reads — in demo 4’s case taken on different cores (enqueue timestamped on the producer, dequeue on the consumer).
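The ordering described above corresponds to a very small function. The following is a sketch of what such a helper might look like using the GCC/Clang intrinsics on x86-64; it is not copied from bench/common/tsc_utils.h, and the names rdtscp_ordered_sketch and ticks_for are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <x86intrin.h>  // __rdtscp and _mm_lfence (GCC/Clang, x86-64 only)

// Read the TSC with rdtscp, then fence so later instructions cannot
// begin executing before the read has completed.
inline std::uint64_t rdtscp_ordered_sketch() {
    unsigned int aux = 0;                        // receives TSC_AUX; unused here
    const std::uint64_t ticks = __rdtscp(&aux);
    _mm_lfence();
    return ticks;
}

// Time a callable on the current thread, in raw TSC ticks.
template <typename Work>
std::uint64_t ticks_for(Work&& work) {
    const std::uint64_t start = rdtscp_ordered_sketch();
    work();
    const std::uint64_t end = rdtscp_ordered_sketch();
    assert(end >= start);   // expected on a constant, synchronised TSC
    return end - start;
}
```

The result is in ticks, not nanoseconds; converting it requires a calibration factor. For a cross-core sample such as demo 4's, the two reads would be taken by different threads and subtracted afterwards, which relies on the TSC being synchronised across cores.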

Before any measurement, calibrate_tsc() checks the TSC-stability flags in /proc/cpuinfo: constant_tsc (the TSC ticks at a fixed rate regardless of frequency scaling) and nonstop_tsc (it keeps counting through idle states). Either flag missing aborts the run before any benchmark executes.
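A flag check of this kind could be written as below. This is a sketch under assumptions: the function names are hypothetical, only the first flags line is inspected, and the real calibrate_tsc() may parse /proc/cpuinfo differently.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// True if the first "flags" line of a cpuinfo stream lists `flag`
// as a whole whitespace-separated token.
inline bool cpuinfo_has_flag(std::istream& cpuinfo, const std::string& flag) {
    assert(!flag.empty());
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("flags", 0) != 0) continue;   // not a flags line
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream tokens(line.substr(colon + 1));
        for (std::string tok; tokens >> tok;) {
            if (tok == flag) return true;
        }
        return false;   // only the first logical CPU is inspected
    }
    return false;
}

// Both flags must be present for the TSC to be trusted as a clock.
inline bool tsc_is_stable(const std::string& cpuinfo_text) {
    std::istringstream first(cpuinfo_text);
    std::istringstream second(cpuinfo_text);
    return cpuinfo_has_flag(first, "constant_tsc") &&
           cpuinfo_has_flag(second, "nonstop_tsc");
}
```

Matching whole tokens matters because some flag names are prefixes of others, so a plain substring search could report a flag that is not actually set.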

Ticks convert to nanoseconds by calibration, not by assuming a nominal frequency: calibrate_tsc() reads CLOCK_MONOTONIC_RAW and the TSC together, busy-waits for a 100 ms window on the monotonic clock, reads both again, and returns ns-per-cycle as elapsed nanoseconds over elapsed cycles. Every nanosecond figure these demos report is a TSC delta scaled by that factor.
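The calibration loop can be sketched as follows. To keep the example portable, the tick source is passed in as a callable rather than hard-wired to rdtscp; that parameterisation and the names monotonic_raw_ns and calibrate_ns_per_tick are assumptions for illustration, not the signature of the repository's calibrate_tsc().

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>  // clock_gettime, CLOCK_MONOTONIC_RAW (Linux)

// Current CLOCK_MONOTONIC_RAW reading in nanoseconds.
inline std::uint64_t monotonic_raw_ns() {
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return static_cast<std::uint64_t>(ts.tv_sec) * 1000000000ull +
           static_cast<std::uint64_t>(ts.tv_nsec);
}

// Read both clocks, busy-wait for the window on the monotonic clock,
// read both again, and return elapsed nanoseconds per elapsed tick.
template <typename TickSource>
double calibrate_ns_per_tick(TickSource read_ticks,
                             std::uint64_t window_ns = 100000000ull) {
    const std::uint64_t t0 = monotonic_raw_ns();
    const std::uint64_t c0 = read_ticks();
    std::uint64_t t1 = monotonic_raw_ns();
    while (t1 - t0 < window_ns) t1 = monotonic_raw_ns();   // busy-wait
    const std::uint64_t c1 = read_ticks();
    assert(c1 > c0);
    return static_cast<double>(t1 - t0) / static_cast<double>(c1 - c0);
}
```

On the x86-64 rig the callable would wrap the ordered TSC read, and a latency sample in nanoseconds would then be a tick delta multiplied by the returned factor. Busy-waiting rather than sleeping keeps the core out of idle states during the window.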

The calibration is cross-checked rather than gated: a second calibrate_tsc() is taken and the relative change in ns-per-cycle between the two readings is committed with the results as calibration_drift_pct — no threshold; the value itself is the evidence. Where the second reading lands varies by pipeline: demo 4 recalibrates after its paced runs complete and after each sweep step, demo 5 at each variant’s dispatch, demo 6 back-to-back at startup. On a constant_tsc machine this is a repeatability check of the calibration itself; the committed captures show ≤0.0001% throughout, and that field is the source of footer lines like demo 4’s “TSC drift ≤0.0001% across all 5 runs”.
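The committed drift figure is simple arithmetic on the two calibration readings. A minimal sketch, assuming the value is the absolute relative change expressed as a percentage of the first reading (the function name and the choice of absolute value are assumptions):

```cpp
#include <cassert>
#include <cmath>

// Relative change between two ns-per-cycle calibrations, in percent.
inline double calibration_drift_pct(double first_ns_per_cycle,
                                    double second_ns_per_cycle) {
    assert(first_ns_per_cycle > 0.0);
    return std::fabs(second_ns_per_cycle - first_ns_per_cycle) /
           first_ns_per_cycle * 100.0;
}
```

Because no threshold is applied, the function only produces the number; judging whether a given drift is acceptable is left to the reader of the committed results.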

Additional best-practice items

–Inputs are runtime-loaded (not compile-time-known) to defeat constant folding.
–Outputs are sunk via benchmark::DoNotOptimize to prevent dead-code elimination.
–Inputs are shuffled between iterations where branch-predictor memorisation would distort results.
–Allocations are kept out of hot paths; data structures are built in benchmark setup.
–Machine spec, kernel version, compiler version, and lscpu output are committed to the repo.
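The first three items above can be illustrated without the benchmark library. The sketch below is a stand-alone approximation: keep_alive is a hypothetical stand-in for benchmark::DoNotOptimize built on GCC/Clang inline assembly, and load_inputs, reshuffle, and sum_positive are invented for the example rather than taken from the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <istream>
#include <random>
#include <sstream>
#include <vector>

// Read whitespace-separated integers at run time, so the compiler cannot
// see the values and fold the measured work into a constant.
inline std::vector<std::int64_t> load_inputs(std::istream& in) {
    std::vector<std::int64_t> values;
    for (std::int64_t v = 0; in >> v;) values.push_back(v);
    assert(in.eof());   // stopping early would mean a non-numeric token
    return values;
}

// Reorder inputs between iterations using a caller-owned, seeded generator,
// so the branch predictor cannot memorise one fixed sequence.
inline void reshuffle(std::vector<std::int64_t>& values, std::mt19937_64& rng) {
    std::shuffle(values.begin(), values.end(), rng);
}

// Stand-in sink: tells the compiler the value is observed, so the
// computation that produced it cannot be eliminated as dead code.
template <typename T>
inline void keep_alive(const T& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Example workload with a data-dependent branch.
inline std::int64_t sum_positive(const std::vector<std::int64_t>& values) {
    std::int64_t total = 0;
    for (auto v : values) {
        if (v > 0) total += v;
    }
    return total;
}
```

In a timed loop the pattern would be: call reshuffle outside the timed region, run sum_positive inside it, and pass the result to keep_alive. Seeding the generator explicitly keeps the shuffle order repeatable from one capture to the next.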

Building and reproducing

Each demo lives under bench/demos/<NN-slug>/ with its own README.md documenting the harness, inputs, and any demo-specific build flags. Headline captures use the per-demo orchestration script under bench/scripts/.

git clone https://github.com/GarethCooke/Crucible
cd Crucible/bench
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target bench_<NN>_<slug>
./scripts/run_one.sh <NN-slug>

run_one.sh requires sudo and the cpuset package (sudo apt install cpuset on Ubuntu); it runs the benchmark binary inside a cset shield on cores 4–7 and tears the shield down automatically. On Machine 2 (the Pi 5) cset is not used — Raspberry Pi OS is cgroup v2, which cset does not target; AArch64 demos isolate with isolcpus=2,3 at boot and pin with taskset -c 3, as documented in each demo’s README.

Source on GitHub: GarethCooke/Crucible. Each demo’s directory has its own README with demo-specific notes; this page covers the conventions that apply across all of them.

Special editions

Occasional posts sit outside the standard methodology. They are not counted as numbered demos and do not affect the four commitments above. Each departure is documented here and in the post itself.

Measuring the Gap — Grover’s algorithm on real quantum hardware (special edition)

–Not a C++ benchmark — primary implementation is Python/Qiskit on IBM Quantum cloud hardware.
–Reference machine is an IBM QPU, not the Zen 2 or Cortex-A76 rigs.
–Metric is P(success), a dimensionless probability, not ns/op.
–Numbers are not reproducible by re-running, because the hardware’s calibration changes between runs. The committed JSON archives the job results.
–NOT counted in the numbered demo total. The demo count is unchanged.

Corrections

2026-06-06 — Boost-state verification in the original Machine 1 corpus.

The benchmarks for demos 01–08, as originally published, carried a machine.turbo field that was not a real measurement: it was echoed from an operator-set environment variable (CRUCIBLE_TURBO) rather than read from hardware, so the corpus’s boost state was in effect unverified. The lscpu output committed alongside each result showed the 4560 MHz boost ceiling — but on this AMD acpi-cpufreq board that ceiling is reported whether or not boost is active, so it neither confirmed nor refuted the “turbo off” claim.

We rebuilt boost detection to read real kernel sysfs signals (the per-CPU cpb flag, cross-checked against the available-frequency list), added a hard capture-time gate that aborts if boost is enabled, and recaptured the entire Machine 1 corpus at a confirmed 3900 MHz base clock.

Effect on results. The recaptured figures agree with the originals to within 1%, and the compute-bound demos (01, 03, 08) reproduce exactly. This confirms that the original captures were already running at base clock and that the published numbers were accurate. The same-session speedup ratios each post is built on were unchanged throughout. Two posts, demos 04 and 06, required prose corrections unrelated to clock, while their underlying ratios were intact. Demo 04’s over-saturation section described single-draw queue behaviour; re-derivation replaced it with a bistability finding showing that each process run settles into one of two stable queue-depth equilibria. Demo 06’s non-monotonic dip in the DRAM band had been attributed to a microarchitectural TLB/huge-page interaction; the recapture produced a near-flat band, reclassifying the dip as a capture-specific page-placement artifact. Demo 02’s L1 cache counter (l1d.replacement) returned null in all twelve original runs and was never cited in the post; the recapture substitutes L1-dcache-load-misses, with no numeric impact.

The raw lscpu capture committed alongside every result is what made this auditable: the underlying data was honest even where the derived field was not. The detection logic that replaced it is described under Turbo Boost disabled above.