2026-05-14

Black-Scholes: same model, four implementations, ~10× spread

Polynomial approximations buy 12% on their own — but they're the gate to a 9× SIMD win that libm's erfc can't reach.

Same pricing model. Same input chain. Same machine. Four implementations — and the spread from slowest to fastest is around 10×. The interesting part: most of the gap is SIMD, but you can't get there from libm. std::erfc doesn't vectorise. Swapping it for a polynomial approximation buys a modest 12% on its own — and is the gate that makes the SIMD wins reachable at all.

The workload

European call option price under Black-Scholes:

C=SN(d1)KerTN(d2)d1=ln(S/K)+(r+σ22)TσTd2=d1σT\begin{aligned} C &= S \cdot N(d_1) - K \cdot e^{-rT} \cdot N(d_2) \\[10pt] d_1 &= \frac{\ln(S/K) + \left(r + \dfrac{\sigma^2}{2}\right)T}{\sigma\sqrt{T}} \\[10pt] d_2 &= d_1 - \sigma\sqrt{T} \end{aligned}

Input: 1,048,576 option structs — arrays of S, K, T, r, σ (all float, aligned to 32 bytes). Output: array of call prices C.

The audience knows Black-Scholes. The parameters cover a realistic options chain: S, K ∈ [50, 150], T ∈ [0.05, 2.0], r ∈ [0, 0.08], σ ∈ [0.1, 0.6]. Fixed seed 0xCAFEBABE for reproducibility.

Four variants

The variants are ordered to isolate where each speedup comes from:

  1. scalar_libm — correctness oracle. std::exp, std::log, std::sqrt, std::erfc from <cmath>, compiled with -fno-tree-vectorize -mno-avx2.
  2. scalar_poly — same loop, same ISA flags, but exp and N(x) replaced with polynomial approximations. std::log and std::sqrt stay as libm.
  3. sse2_intrinsics — 4-wide __m128, manual intrinsics. Same polynomial math as variant 2, widened to four lanes.
  4. avx2_fma_intrinsics — 8-wide __m256, FMA throughout. _mm256_fmadd_ps replaces every mul+add in the Horner evaluations.
scalar_poly (variant 2)
avx2_fma (variant 4)
// One option at a time.
// Same math as the SIMD variant — just scalar.
float bs_call_poly(float S, float K, float T,
                 float r, float sigma) {
float sig2 = sigma * sigma;
float sqrtT = std::sqrt(T);
float sig_sqrtT = sigma * sqrtT;
float d1 = (std::log(S / K) +
            (r + 0.5f * sig2) * T) / sig_sqrtT;
float d2 = d1 - sig_sqrtT;
return S * ncdf_poly(d1)
     - K * fast_expf(-r * T) * ncdf_poly(d2);
}
// Eight options at a time, FMA Horner throughout.
static void price8(
  const float* S, const float* K,
  const float* T, const float* r, const float* sigma,
  float* C) {
__m256 vS   = _mm256_load_ps(S);
__m256 vK   = _mm256_load_ps(K);
__m256 vT   = _mm256_load_ps(T);
__m256 vSig = _mm256_load_ps(sigma);
__m256 sig2      = _mm256_mul_ps(vSig, vSig);
__m256 sqrtT     = _mm256_sqrt_ps(vT);
__m256 sig_sqrtT = _mm256_mul_ps(vSig, sqrtT);
__m256 d1 = _mm256_div_ps(
  _mm256_fmadd_ps(
    _mm256_fmadd_ps(half, sig2, vR), vT, log_SK),
  sig_sqrtT);
__m256 d2  = _mm256_sub_ps(d1, sig_sqrtT);
__m256 Nd1 = vec_ncdf_avx2(d1);
__m256 Nd2 = vec_ncdf_avx2(d2);
__m256 disc = vec_expf_avx2(
  _mm256_mul_ps(neg_one, _mm256_mul_ps(vR, vT)));
_mm256_store_ps(C,
  _mm256_fnmadd_ps(_mm256_mul_ps(vK, disc), Nd2,
                   _mm256_mul_ps(vS, Nd1)));
}

The CodeCompare shows the most informative diff — same algorithm, different width. See the repo for all four variants and the polynomial helpers.

Headline numbers

Two views of the same data: at N=16,384 the entire working set fits comfortably in L2 (per-core), giving the cleanest per-cycle compute comparison. At N=1,048,576 the input is 20 MB — larger than the 16 MB L3 per CCX — but the workload is compute-bound anyway, so throughput per option is essentially identical to the L2-resident case.

Decomposing the speedup

Step 1 → 2: the algorithm gate. Replacing std::exp / std::erfc with polynomial approximations gives a modest 12% standalone speedup (47.7 ns/option → 42.7 ns/option at N=1M). On its own that's unremarkable. What it actually does is open the door to the SIMD wins below — there is no _mm256_erfc_ps. libm's erfc is a branch-and-lookup-table dispatch that doesn't translate to a SIMD register; you have to switch to a polynomial form before you can widen to four or eight lanes. The polynomial is the gate, not the prize.

libm's expf and erfcf are designed for full float precision (~1.2e-7 ULP error) across their entire domain. Black-Scholes only needs option prices accurate to ~1e-4 in absolute terms. A 6th-order Taylor polynomial for exp over [-ln2/2, ln2/2] (the range-reduced interval) achieves ~1.7e-7 absolute error — sufficient, and computable without a branch-heavy libm dispatch path or lookup table.

The A&S §26.2.17 rational approximation for N(x) is similarly: 5 polynomial coefficients, published in 1964, max error < 7.5e-8. It outperforms a branchy libm erfc call in the hot path.

Polynomial Horner evaluation pipelines well; libm calls break out of the inner loop.

Step 2 → 3: SSE width win. Widening from 1 to 4 lanes with __m128 gives a 4.1× throughput gain at N=1M — slightly above clean 4× width because the scalar variant carries some loop-overhead and pipeline-fill cost the SIMD variant amortises. The main brake on a higher ratio is that div and sqrt aren't 4× cheaper in SSE — their throughput improves with width but the latency-bound operations don't disappear.

Step 3 → 4: AVX2 + FMA — and the Zen 2 µop-split fingerprint.

Going in: Zen 2 implements 256-bit AVX2 by splitting each instruction into two 128-bit µops internally. One vfmadd256ps consumes two execution-port slots where Zen 3 would consume one. On µop count alone, this halves the theoretical width advantage of 8-wide AVX2 over 4-wide SSE. The prediction was that AVX2 would land under 2× SSE on this CPU.

What actually landed: at N=16,384 — the cleanest comparison, with the working set in L2 and per-iteration overhead amortised — AVX2/SSE is 1.995×. Right at the threshold, exactly as the µop-split argument predicts. At N=1,048,576 the ratio rises to 2.252× as the fixed per-iteration harness cost dissolves further.

The IPC counters expose the µop split directly: SSE retires 1.21 instructions per cycle, AVX2 retires 0.91 — a 0.75 ratio, almost exactly what you'd expect if each AVX2 instruction occupies two issue slots. But each retired AVX2 instruction is also doing ~2.5× the work of a retired SSE instruction (8 lanes plus FMA's two flops per multiply-add, vs 4 lanes of separate mul + add). Multiply those out: 0.75 × 2.5 ≈ 1.9× — and the rest is small effects (register pressure, scheduling, amortisation). The headline ratio doesn't come from AVX2 magically being 2× SSE; it comes from FMA reclaiming what µop split costs you.

Precision

All non-libm variants are validated against scalar_libm over the full 1M input set with the verify_03_simd_blackscholes binary. Threshold: max_abs_error < 1e-4.

Variantmax_abs_error vs scalar_libm
scalar_poly6.49 × 10⁻⁵
sse2_intrinsics6.49 × 10⁻⁵
avx2_fma5.82 × 10⁻⁵

The precision bound is discussed honestly: for a production pricing engine, 1e-4 absolute error on vanilla European calls is generally acceptable (spreads are wider than that), but the disclaimer matters — implied volatility inversion, exotic products, and Greeks computed by finite-difference need tighter bounds. Reach for Sleef or a vendor SVML in those contexts.

Why hand-roll the polynomials?

This was a deliberate trade-off. The alternatives:

  • Sleef gives a faster, more accurate result with less code. It's the right choice for production.
  • Intel SVML is closed-source and not portable to GCC on Linux.
  • Agner Fog's Vectorclass is a clean C++ wrapper, similar story to Sleef.

Reasons to hand-roll here: the post's claim is about understanding the transformation, not shipping fast Black-Scholes. Pulling Sleef would reduce variants 3–4 to a 20-line wrapper — readers learn nothing about how vectorised transcendentals actually work. Hand-rolling Horner's method with _mm256_fmadd_ps is the part a quant dev wants to see you can write. The polynomial-vs-libm algorithm win (step 1→2) also requires hand-rolled scalar polynomials anyway; once those exist, vectorising them is a small step. Using Sleef only for variants 3–4 would mean the speedup decomposition stops being clean.

What this doesn't show

  • GFLOPS for scalar_libm is computed against the polynomial-variant flop count (98 flops/option). libm's erfc is internally far more than that — branch-heavy dispatch, lookup tables, and conditional polynomial branches. The reported 2.05 GFLOPS for scalar_libm is therefore "throughput-equivalent flops if libm did the same math as the polynomial variant," not a measurement of libm's actual flop rate. Treat it as a throughput number, not a computational density number.
  • Per-iteration harness overhead. At N=1024 every variant runs ~5–6× slower per option than at N=1M because the perf-counter start/stop and function-call cost per iteration is fixed and amortises across N. Small-N numbers are measurement-floor-bound, not compute-bound. The charts above use N≥16k for this reason; the full sweep including N=1024 is in the JSON for reproducibility.
  • % of theoretical peak. Zen 2 base-clock theoretical peak for 256-bit FMA is roughly 62 GFLOPS per core (2 FMA pipes × 1 256-bit FMA per 2 cycles × 8 lanes × 2 flops × 3.9 GHz). The AVX2 variant lands at 27 GFLOPS = ~44% of peak. Division, square root, and the polynomial cascades (which can't all pipeline as pure FMA) are the gap.
  • AVX-512: deferred to a rented-bare-metal post. AVX-512 would allow 16-wide float computation in a single instruction and properly single-issue 512-bit ops on recent server CPUs. Numbers here won't extrapolate.
  • Zen 3+: Zen 3 and Sapphire Rapids dispatch 256-bit ops natively; the AVX2/SSE ratio would be closer to 2× there.
  • Put options: trivially derived from put-call parity once you have call prices.
  • Implied volatility: the inverse problem — a separate post.
  • GPU comparison: deferred.

Takeaway

A 10× spread across four implementations of the same model decomposes cleanly: 12% from swapping libm's erfc for a polynomial approximation, then 9× from widening to 8 lanes of AVX2 + FMA. The 12% step is unremarkable in isolation — but it's the gate. There is no _mm256_erfc_ps, so the SIMD wins are unreachable until the algorithm step happens first.

On Zen 2 the AVX2/SSE ratio lands near 2× rather than a theoretical 4× because each 256-bit instruction splits into two 128-bit µops internally — FMA reclaims what µop split costs you. On Zen 3+ and Sapphire Rapids, where 256-bit ops dispatch natively, the same code path would be closer to a clean 4×.

The SoA layout that made this vectorisation possible is its own story — when the struct is wide and the inner loop touches only a few fields, the layout choice determines whether SIMD is reachable at all. Demo 6 quantifies the bandwidth-amplification model that underlies it.


AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 0–7 isolated, benchmarks pinned to 4–7. Headless Ubuntu 24.04. GCC 13.3, per-variant ISA flags as documented in CMakeLists.txt. 20 outer repetitions, median reported (throughput convention).

Methodology →