2026-05-14
Black-Scholes: same model, four implementations, ~10× spread
Polynomial approximations buy 12% on their own — but they're the gate to a 9× SIMD win that libm's erfc can't reach.
Same pricing model. Same input chain. Same machine. Four implementations — and the
spread from slowest to fastest is around 10×. The interesting part: most of the gap
is SIMD, but you can't get there from libm. std::erfc doesn't vectorise. Swapping
it for a polynomial approximation buys a modest 12% on its own — and is the gate
that makes the SIMD wins reachable at all.
The workload
European call option price under Black-Scholes:
Input: 1,048,576 option structs — arrays of S, K, T, r, σ (all float,
aligned to 32 bytes). Output: array of call prices C.
The audience knows Black-Scholes. The parameters cover a realistic options chain:
S, K ∈ [50, 150], T ∈ [0.05, 2.0], r ∈ [0, 0.08], σ ∈ [0.1, 0.6]. Fixed
seed 0xCAFEBABE for reproducibility.
Four variants
The variants are ordered to isolate where each speedup comes from:
scalar_libm— correctness oracle.std::exp,std::log,std::sqrt,std::erfcfrom<cmath>, compiled with-fno-tree-vectorize -mno-avx2.scalar_poly— same loop, same ISA flags, butexpandN(x)replaced with polynomial approximations.std::logandstd::sqrtstay as libm.sse2_intrinsics— 4-wide__m128, manual intrinsics. Same polynomial math as variant 2, widened to four lanes.avx2_fma_intrinsics— 8-wide__m256, FMA throughout._mm256_fmadd_psreplaces everymul+addin the Horner evaluations.
// One option at a time.
// Same math as the SIMD variant — just scalar.
float bs_call_poly(float S, float K, float T,
float r, float sigma) {
float sig2 = sigma * sigma;
float sqrtT = std::sqrt(T);
float sig_sqrtT = sigma * sqrtT;
float d1 = (std::log(S / K) +
(r + 0.5f * sig2) * T) / sig_sqrtT;
float d2 = d1 - sig_sqrtT;
return S * ncdf_poly(d1)
- K * fast_expf(-r * T) * ncdf_poly(d2);
}// Eight options at a time, FMA Horner throughout.
static void price8(
const float* S, const float* K,
const float* T, const float* r, const float* sigma,
float* C) {
__m256 vS = _mm256_load_ps(S);
__m256 vK = _mm256_load_ps(K);
__m256 vT = _mm256_load_ps(T);
__m256 vSig = _mm256_load_ps(sigma);
__m256 sig2 = _mm256_mul_ps(vSig, vSig);
__m256 sqrtT = _mm256_sqrt_ps(vT);
__m256 sig_sqrtT = _mm256_mul_ps(vSig, sqrtT);
__m256 d1 = _mm256_div_ps(
_mm256_fmadd_ps(
_mm256_fmadd_ps(half, sig2, vR), vT, log_SK),
sig_sqrtT);
__m256 d2 = _mm256_sub_ps(d1, sig_sqrtT);
__m256 Nd1 = vec_ncdf_avx2(d1);
__m256 Nd2 = vec_ncdf_avx2(d2);
__m256 disc = vec_expf_avx2(
_mm256_mul_ps(neg_one, _mm256_mul_ps(vR, vT)));
_mm256_store_ps(C,
_mm256_fnmadd_ps(_mm256_mul_ps(vK, disc), Nd2,
_mm256_mul_ps(vS, Nd1)));
}The CodeCompare shows the most informative diff — same algorithm, different width.
See the repo for all four variants and the
polynomial helpers.
Headline numbers
Two views of the same data: at N=16,384 the entire working set fits comfortably in L2 (per-core), giving the cleanest per-cycle compute comparison. At N=1,048,576 the input is 20 MB — larger than the 16 MB L3 per CCX — but the workload is compute-bound anyway, so throughput per option is essentially identical to the L2-resident case.
Decomposing the speedup
Step 1 → 2: the algorithm gate. Replacing std::exp / std::erfc with
polynomial approximations gives a modest 12% standalone speedup (47.7 ns/option →
42.7 ns/option at N=1M). On its own that's unremarkable. What it actually does is
open the door to the SIMD wins below — there is no _mm256_erfc_ps. libm's erfc
is a branch-and-lookup-table dispatch that doesn't translate to a SIMD register;
you have to switch to a polynomial form before you can widen to four or eight
lanes. The polynomial is the gate, not the prize.
libm's expf and erfcf are designed for full float precision (~1.2e-7 ULP error)
across their entire domain. Black-Scholes only needs option prices accurate to ~1e-4
in absolute terms. A 6th-order Taylor polynomial for exp over [-ln2/2, ln2/2]
(the range-reduced interval) achieves ~1.7e-7 absolute error — sufficient, and
computable without a branch-heavy libm dispatch path or lookup table.
The A&S §26.2.17 rational approximation for N(x) is similarly: 5 polynomial
coefficients, published in 1964, max error < 7.5e-8. It outperforms a
branchy libm erfc call in the hot path.
Polynomial Horner evaluation pipelines well; libm calls break out of the inner loop.
Step 2 → 3: SSE width win. Widening from 1 to 4 lanes with __m128 gives a
4.1× throughput gain at N=1M — slightly above clean 4× width because the scalar
variant carries some loop-overhead and pipeline-fill cost the SIMD variant
amortises. The main brake on a higher ratio is that div and sqrt aren't 4×
cheaper in SSE — their throughput improves with width but the latency-bound
operations don't disappear.
Step 3 → 4: AVX2 + FMA — and the Zen 2 µop-split fingerprint.
Going in: Zen 2 implements 256-bit AVX2 by splitting each instruction into two
128-bit µops internally. One vfmadd256ps consumes two execution-port slots
where Zen 3 would consume one. On µop count alone, this halves the theoretical
width advantage of 8-wide AVX2 over 4-wide SSE. The prediction was that AVX2
would land under 2× SSE on this CPU.
What actually landed: at N=16,384 — the cleanest comparison, with the working set in L2 and per-iteration overhead amortised — AVX2/SSE is 1.995×. Right at the threshold, exactly as the µop-split argument predicts. At N=1,048,576 the ratio rises to 2.252× as the fixed per-iteration harness cost dissolves further.
The IPC counters expose the µop split directly: SSE retires 1.21 instructions per cycle, AVX2 retires 0.91 — a 0.75 ratio, almost exactly what you'd expect if each AVX2 instruction occupies two issue slots. But each retired AVX2 instruction is also doing ~2.5× the work of a retired SSE instruction (8 lanes plus FMA's two flops per multiply-add, vs 4 lanes of separate mul + add). Multiply those out: 0.75 × 2.5 ≈ 1.9× — and the rest is small effects (register pressure, scheduling, amortisation). The headline ratio doesn't come from AVX2 magically being 2× SSE; it comes from FMA reclaiming what µop split costs you.
Precision
All non-libm variants are validated against scalar_libm over the full 1M input
set with the verify_03_simd_blackscholes binary. Threshold: max_abs_error < 1e-4.
| Variant | max_abs_error vs scalar_libm |
|---|---|
| scalar_poly | 6.49 × 10⁻⁵ |
| sse2_intrinsics | 6.49 × 10⁻⁵ |
| avx2_fma | 5.82 × 10⁻⁵ |
The precision bound is discussed honestly: for a production pricing engine, 1e-4 absolute error on vanilla European calls is generally acceptable (spreads are wider than that), but the disclaimer matters — implied volatility inversion, exotic products, and Greeks computed by finite-difference need tighter bounds. Reach for Sleef or a vendor SVML in those contexts.
Why hand-roll the polynomials?
This was a deliberate trade-off. The alternatives:
- Sleef gives a faster, more accurate result with less code. It's the right choice for production.
- Intel SVML is closed-source and not portable to GCC on Linux.
- Agner Fog's Vectorclass is a clean C++ wrapper, similar story to Sleef.
Reasons to hand-roll here: the post's claim is about understanding the
transformation, not shipping fast Black-Scholes. Pulling Sleef would reduce
variants 3–4 to a 20-line wrapper — readers learn nothing about how vectorised
transcendentals actually work. Hand-rolling Horner's method with _mm256_fmadd_ps
is the part a quant dev wants to see you can write. The polynomial-vs-libm
algorithm win (step 1→2) also requires hand-rolled scalar polynomials anyway;
once those exist, vectorising them is a small step. Using Sleef only for variants
3–4 would mean the speedup decomposition stops being clean.
What this doesn't show
- GFLOPS for
scalar_libmis computed against the polynomial-variant flop count (98 flops/option). libm'serfcis internally far more than that — branch-heavy dispatch, lookup tables, and conditional polynomial branches. The reported 2.05 GFLOPS forscalar_libmis therefore "throughput-equivalent flops if libm did the same math as the polynomial variant," not a measurement of libm's actual flop rate. Treat it as a throughput number, not a computational density number. - Per-iteration harness overhead. At N=1024 every variant runs ~5–6× slower per option than at N=1M because the perf-counter start/stop and function-call cost per iteration is fixed and amortises across N. Small-N numbers are measurement-floor-bound, not compute-bound. The charts above use N≥16k for this reason; the full sweep including N=1024 is in the JSON for reproducibility.
- % of theoretical peak. Zen 2 base-clock theoretical peak for 256-bit FMA is roughly 62 GFLOPS per core (2 FMA pipes × 1 256-bit FMA per 2 cycles × 8 lanes × 2 flops × 3.9 GHz). The AVX2 variant lands at 27 GFLOPS = ~44% of peak. Division, square root, and the polynomial cascades (which can't all pipeline as pure FMA) are the gap.
- AVX-512: deferred to a rented-bare-metal post. AVX-512 would allow 16-wide float computation in a single instruction and properly single-issue 512-bit ops on recent server CPUs. Numbers here won't extrapolate.
- Zen 3+: Zen 3 and Sapphire Rapids dispatch 256-bit ops natively; the AVX2/SSE ratio would be closer to 2× there.
- Put options: trivially derived from put-call parity once you have call prices.
- Implied volatility: the inverse problem — a separate post.
- GPU comparison: deferred.
Takeaway
A 10× spread across four implementations of the same model decomposes cleanly:
12% from swapping libm's erfc for a polynomial approximation, then 9× from
widening to 8 lanes of AVX2 + FMA. The 12% step is unremarkable in isolation —
but it's the gate. There is no _mm256_erfc_ps, so the SIMD wins are
unreachable until the algorithm step happens first.
On Zen 2 the AVX2/SSE ratio lands near 2× rather than a theoretical 4× because each 256-bit instruction splits into two 128-bit µops internally — FMA reclaims what µop split costs you. On Zen 3+ and Sapphire Rapids, where 256-bit ops dispatch natively, the same code path would be closer to a clean 4×.
The SoA layout that made this vectorisation possible is its own story — when the struct is wide and the inner loop touches only a few fields, the layout choice determines whether SIMD is reachable at all. Demo 6 quantifies the bandwidth-amplification model that underlies it.
AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 0–7 isolated, benchmarks pinned to 4–7. Headless Ubuntu 24.04. GCC 13.3, per-variant ISA flags as documented in CMakeLists.txt. 20 outer repetitions, median reported (throughput convention).