Methodology
How the numbers are produced
Published benchmark results are only credible with a documented, reproducible methodology. All measurements on this site satisfy the four non-negotiable commitments below.
Reference machine
Zen 2 implements 256-bit AVX2 as two 128-bit µops — called out explicitly in any SIMD post. Full lscpu --extended output, kernel version, and compiler version are committed to the repo alongside each benchmark result.
Four non-negotiable commitments
CPU governor pinned to performance
Set with cpupower frequency-set -g performance. Frequency scaling during a measurement run is a leading cause of variance that makes numbers look better or worse than they really are.
Turbo Boost disabled
Verified by run_one.sh via cpupower frequency-info before any benchmark binary runs; the result is exported as CRUCIBLE_TURBO=off and recorded in the JSON machine.turbo field. If the turbo state cannot be determined, the script exits non-zero rather than silently recording a wrong value. Boost obscures the true steady-state throughput that the predictor and cache hierarchy deliver at nominal frequency.
Core isolation
Set via the isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7 boot parameters, scoped to a dedicated GRUB entry (“Ubuntu (benchmark — cores 0-7 isolated)”) distinct from the standard development entry. Within that isolated set, benchmarks are additionally pinned to cores 4–7 via taskset (invoked by the per-demo wrapper scripts), with cores 0–3 absorbing any residual kernel housekeeping the isolation directives cannot redirect. SMT is disabled at the BIOS level (verified via /sys/devices/system/cpu/smt/active returning 0 and lscpu reporting 8 CPUs) to remove SMT-sibling resource sharing (L1, L2, execution ports, frontend) from all measurements. Isolated CPU IDs are recorded in each demo’s JSON machine.isolated_cpus field.
Cross-CCX results. CPU 0 still carries the system timer and other unmovable kernel work that isolcpus= cannot fully evict, so cross-CCX measurements (cores 0–3 and 4–7 both in the isolated set) carry slightly higher ambient noise than intra-CCX measurements and are labelled accordingly in every post.
Statistical reporting
- Throughput / steady-state median (demos 1, 2, 3): ≥20 outer repetitions (Google Benchmark --benchmark_repetitions); aggregates computed across those repetitions.
- Tail-latency distribution (demos 4, 5): 5 outer runs × 1 M timed samples per run through a custom latency pipeline; percentiles computed from histograms merged across runs.
- Working-set sweep (demos 6, 7): 5 outer repetitions per cell; median ns_per_op reported. Sweep coverage substitutes for higher per-cell rep count.
- Median — typical-case latency
- Min — best the hardware can do (cache warm, predictor trained)
- p99 / p99.9 — tail-latency claims
- IQR — spread around the median
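As a rough illustration of how the four statistics above can be derived from one set of timed samples, here is a minimal C++ sketch. The names Summary, percentile_permille, and summarise are hypothetical and not part of the Crucible harness. It assumes nearest-rank percentiles over raw samples, whereas the tail-latency demos compute percentiles from merged histograms, so treat it as an approximation of the idea rather than the actual pipeline.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Summary of one benchmark cell. Field names are illustrative only and do
// not claim to match the project's JSON schema.
struct Summary {
    double min;
    double median;
    double p99;
    double p999;
    double iqr;
};

// Nearest-rank percentile on an ascending-sorted, non-empty vector.
// The percentile is given in per-mille (500 = median, 999 = p99.9) so the
// rank is computed in exact integer arithmetic.
inline double percentile_permille(const std::vector<double>& sorted,
                                  std::size_t permille) {
    assert(!sorted.empty() && permille >= 1 && permille <= 1000);
    const std::size_t n = sorted.size();
    const std::size_t rank = (permille * n + 999) / 1000;  // ceil, 1-based
    return sorted[rank - 1];
}

// Takes the samples by value so the caller's ordering is left untouched.
inline Summary summarise(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    Summary s{};
    s.min = samples.front();
    s.median = percentile_permille(samples, 500);
    s.p99 = percentile_permille(samples, 990);
    s.p999 = percentile_permille(samples, 999);
    s.iqr = percentile_permille(samples, 750) - percentile_permille(samples, 250);
    return s;
}
```

With nearest-rank, an even-sized sample reports the lower of the two middle values as the median; an interpolating definition would differ slightly, which matters only for very small repetition counts.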
Additional best-practice items
- Inputs are runtime-loaded (not compile-time-known) to defeat constant folding.
- Outputs are sunk via benchmark::DoNotOptimize to prevent dead-code elimination.
- Inputs are shuffled between iterations where branch-predictor memorisation would distort results.
- Allocations are kept out of hot paths; data structures are built in benchmark setup.
- Machine spec, kernel version, compiler version, and lscpu output are committed to the repo.
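The first four items can be illustrated together in a short C++ sketch. It is not taken from the Crucible sources: count_above, make_inputs, and run_iterations are hypothetical names, and sink is a hand-rolled stand-in for benchmark::DoNotOptimize (GCC/Clang inline asm) so the example builds without Google Benchmark. Timing calls are omitted; in a real harness the shuffle would sit outside the timed region.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Stand-in for benchmark::DoNotOptimize (assumption: GCC or Clang).
// The empty asm statement forces the compiler to materialise `value`,
// so the computation feeding it cannot be removed as dead code.
template <class T>
inline void sink(const T& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Hypothetical workload: count elements above a threshold (branchy on purpose).
inline std::size_t count_above(const std::vector<std::uint32_t>& data,
                               std::uint32_t threshold) {
    std::size_t hits = 0;
    for (std::uint32_t v : data) hits += (v > threshold) ? 1 : 0;
    return hits;
}

// Setup phase: all allocation happens here, not in the hot loop. In a real
// harness n and seed would come from argv or a file, so the compiler cannot
// fold the data into constants.
inline std::vector<std::uint32_t> make_inputs(std::size_t n, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::vector<std::uint32_t> data(n);
    for (auto& v : data) v = static_cast<std::uint32_t>(rng() % 1000);
    return data;
}

// Measurement-loop skeleton: reshuffle between iterations so the branch
// predictor cannot memorise the pattern, then sink every result.
inline std::size_t run_iterations(std::vector<std::uint32_t>& data,
                                  std::uint32_t threshold,
                                  int iterations,
                                  std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::size_t last = 0;
    for (int i = 0; i < iterations; ++i) {
        std::shuffle(data.begin(), data.end(), rng);  // untimed in practice
        last = count_above(data, threshold);          // the timed region
        sink(last);
    }
    return last;
}
```

Shuffling changes the order the predictor sees but not the answer, which gives a cheap sanity check: the count must be identical before and after any number of iterations.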
Building and reproducing
Each demo lives under bench/demos/<NN-slug>/ with its own README.md documenting the harness, inputs, and any demo-specific build flags. Headline captures use the per-demo orchestration script under bench/scripts/.
    git clone https://github.com/GarethCooke/Crucible
    cd Crucible/bench
    cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
    cmake --build build --target bench_<NN>_<slug>
    ./bench/scripts/run_one.sh <NN-slug>
run_one.sh requires sudo and the cpuset package (sudo apt install cpuset on Ubuntu); it runs the benchmark binary inside a cset shield on cores 4–7 and tears the shield down automatically.
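For readers who want to confirm the environment from inside their own harness, the governor, SMT, and isolation conditions described under the commitments can be checked by reading sysfs. The C++ sketch below is an illustration only and is not how run_one.sh works: Preflight, preflight, read_first_line, and parse_cpu_list are hypothetical names, and the choice of which files to read is an assumption. Paths are passed in as parameters so the logic can be exercised against ordinary files.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read the first line of a small sysfs-style file; returns an empty
// string when the file is missing or unreadable.
inline std::string read_first_line(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    std::getline(in, line);
    return line;
}

// Expand a kernel CPU list such as "1-7" or "0,4-7" into individual IDs.
// Throws std::invalid_argument on malformed or descending ranges.
inline std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string part;
    while (std::getline(ss, part, ',')) {
        if (part.empty()) continue;
        const std::size_t dash = part.find('-');
        const int first = std::stoi(part.substr(0, dash));
        const int last = (dash == std::string::npos)
                             ? first
                             : std::stoi(part.substr(dash + 1));
        if (last < first) throw std::invalid_argument("descending range: " + part);
        for (int cpu = first; cpu <= last; ++cpu) cpus.push_back(cpu);
    }
    return cpus;
}

// Result of the checks; a caller would refuse to benchmark unless
// governor_ok and smt_off are both true.
struct Preflight {
    bool governor_ok = false;
    bool smt_off = false;
    std::vector<int> isolated;
};

inline Preflight preflight(const std::string& governor_path,
                           const std::string& smt_path,
                           const std::string& isolated_path) {
    Preflight p;
    p.governor_ok = read_first_line(governor_path) == "performance";
    p.smt_off = read_first_line(smt_path) == "0";
    p.isolated = parse_cpu_list(read_first_line(isolated_path));
    return p;
}
```

On a typical Linux system the three arguments would be /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor (one file per core), /sys/devices/system/cpu/smt/active, and /sys/devices/system/cpu/isolated. A missing file yields a failed check rather than a silent pass, in the same spirit as the script exiting non-zero when turbo state is unknown.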
Source on GitHub: GarethCooke/Crucible. Each demo’s directory has its own README with demo-specific notes; this page covers the conventions that apply across all of them.