Black-Scholes: same model, four implementations, ~10× spread

Same pricing model. Same input chain. Same machine. Four implementations — and the spread from slowest to fastest is around 10×. The interesting part: most of the gap is SIMD, but you can't get there from libm. std::erfc doesn't vectorise. Swapping it for a polynomial approximation buys a modest 11% on its own — and is the gate that makes the SIMD wins reachable.

The workload

European call option price under Black-Scholes:

\begin{aligned} C &= S \cdot N(d_1) - K \cdot e^{-rT} \cdot N(d_2) \\[10pt] d_1 &= \frac{\ln(S/K) + \left(r + \dfrac{\sigma^2}{2}\right)T}{\sigma\sqrt{T}} \\[10pt] d_2 &= d_1 - \sigma\sqrt{T} \end{aligned}

Input: 1,048,576 option structs — arrays of S, K, T, r, σ (all float, aligned to 32 bytes). Output: array of call prices C.

The parameters cover a realistic options chain: S, K ∈ [50, 150], T ∈ [0.05, 2.0], r ∈ [0, 0.08], σ ∈ [0.1, 0.6]. Fixed seed 0xCAFEBABE for reproducibility.

Four variants

The variants are ordered to isolate where each speedup comes from:

scalar_libm — correctness oracle. std::exp, std::log, std::sqrt, std::erfc from <cmath>, compiled with -fno-tree-vectorize -mno-avx2.
scalar_poly — same loop, same ISA flags, but exp and N(x) replaced with polynomial approximations. std::log and std::sqrt stay as libm.
sse2_intrinsics — 4-wide __m128, manual intrinsics. Shares variant 2's polynomial exp and N(x), widened to four lanes, and additionally vectorises log (which variant 2 leaves to libm, since a libm log can't be issued into a SIMD register).
avx2_fma_intrinsics — 8-wide __m256, FMA throughout. _mm256_fmadd_ps replaces every mul+add in the Horner evaluations.

scalar_poly (variant 2)

avx2_fma (variant 4)

// One option at a time.
// Same math as the SIMD variant — just scalar.
float bs_call_poly(float S, float K, float T,
                 float r, float sigma) {
float sig2 = sigma * sigma;
float sqrtT = std::sqrt(T);
float sig_sqrtT = sigma * sqrtT;
float d1 = (std::log(S / K) +
            (r + 0.5f * sig2) * T) / sig_sqrtT;
float d2 = d1 - sig_sqrtT;
return S * ncdf_poly(d1)
     - K * fast_expf(-r * T) * ncdf_poly(d2);
}

// Eight options at a time, FMA Horner throughout.
static void price8(
  const float* S, const float* K,
  const float* T, const float* r, const float* sigma,
  float* C) {
__m256 vS   = _mm256_load_ps(S);
__m256 vK   = _mm256_load_ps(K);
__m256 vT   = _mm256_load_ps(T);
__m256 vSig = _mm256_load_ps(sigma);
__m256 sig2      = _mm256_mul_ps(vSig, vSig);
__m256 sqrtT     = _mm256_sqrt_ps(vT);
__m256 sig_sqrtT = _mm256_mul_ps(vSig, sqrtT);
__m256 d1 = _mm256_div_ps(
  _mm256_fmadd_ps(
    _mm256_fmadd_ps(half, sig2, vR), vT, log_SK),
  sig_sqrtT);
__m256 d2  = _mm256_sub_ps(d1, sig_sqrtT);
__m256 Nd1 = vec_ncdf_avx2(d1);
__m256 Nd2 = vec_ncdf_avx2(d2);
__m256 disc = vec_expf_avx2(
  _mm256_mul_ps(neg_one, _mm256_mul_ps(vR, vT)));
_mm256_store_ps(C,
  _mm256_fnmadd_ps(_mm256_mul_ps(vK, disc), Nd2,
                   _mm256_mul_ps(vS, Nd1)));
}

The CodeCompare shows the most informative diff — same polynomial core, different width. See the repo for all four variants and the polynomial helpers.

Headline numbers

Two views of the same data: at N=16,384 the entire working set fits comfortably in L2 (per-core), giving the cleanest per-cycle compute comparison. At N=1,048,576 the input is 20 MB — larger than the 16 MB L3 per CCX — but the workload is compute-bound anyway, so throughput per option is essentially identical to the L2-resident case.

Median ns/option at N=16k — L2-resident, compute-bound (lower is better)

Median ns/option at N=1M — fully amortised (lower is better)

Decomposing the speedup

Step 1 → 2: the algorithm gate. Replacing std::exp / std::erfc with polynomial approximations gives a modest 11% standalone speedup (47.7 ns/option → 43.1 ns/option at N=1M). On its own that's unremarkable. What it actually does is open the door to the SIMD wins below — there is no _mm256_erfc_ps. libm's erfc is a branch-and-lookup-table dispatch that doesn't translate to a SIMD register; you have to switch to a polynomial form before you can widen to four or eight lanes. The polynomial is the gate, not the prize.

libm's expf and erfcf are designed for full float precision (~1.2e-7 ULP error) across their entire domain. Black-Scholes only needs option prices accurate to ~1e-4 in absolute terms. A 6th-order Taylor polynomial for exp over [-ln2/2, ln2/2] (the range-reduced interval) achieves ~1.7e-7 absolute error — sufficient, and computable without a branch-heavy libm dispatch path or lookup table.

The A&S §26.2.17 rational approximation for N(x) is similarly: 5 polynomial coefficients, published in 1964, max error < 7.5e-8. It outperforms a branchy libm erfc call in the hot path.

Polynomial Horner evaluation pipelines well; libm calls break out of the inner loop.

Step 2 → 3: SSE width win. Widening from 1 to 4 lanes with __m128 gives a 4.1× throughput gain at N=1M — slightly above clean 4× width because the scalar variant keeps a libm log the SSE variant replaces with a vectorised polynomial, and carries loop-overhead and pipeline-fill cost the SIMD variant amortises. The main brake on a higher ratio is that div and sqrt aren't 4× cheaper in SSE — their throughput improves with width but the latency-bound operations don't disappear.

Step 3 → 4: AVX2 + FMA — width scaling at the lane-count ceiling.

Going in: Zen 2 widened the entire FP datapath to 256 bits, so a 256-bit AVX2 instruction issues as a single µop — unlike Zen and Zen+, which cracked each 256-bit op into two 128-bit µops. This is measurable on the reference machine with the retired-µop counter (ex_ret_cops): the AVX2 kernel retires 0.99 µops per instruction, indistinguishable from SSE's 1.00, not the ~2× a 128-bit split would force. (These instruction and µop figures are measured with perf over the same pinned binaries but are not fields in the committed JSON capture; per-variant counter fields are planned for the next recapture.) So the honest expectation is plain width scaling — 8 lanes over 4 is 2× — with FMA folding each multiply-add into a single instruction on top.

What actually landed: at N=16,384 — the cleanest comparison, with the working set in L2 and per-iteration overhead amortised — AVX2/SSE is 1.991×, right at the lane-width ratio. At N=1,048,576 it rises to 2.252×, just above 2× as FMA's instruction economy shows through and the fixed per-iteration harness cost dissolves.

The counters show where the win comes from and where it leaks. Per option, the AVX2 kernel retires 2.9× fewer instructions than SSE (≈17 vs ≈50) — roughly 2× from width, the rest from FMA fusing the multiply-adds in the Horner cascades. That instruction economy is partly handed back by lower IPC: SSE sustains 1.21 instructions per cycle against AVX2's 0.91 (a 0.75 ratio), because the remaining AVX2 work is dominated by div and sqrt — which don't get cheaper with width — and by the serial polynomial dependency chains, leaving fewer independent instructions to fill the issue ports. Net it out: 2.9× fewer instructions × 0.75 IPC ≈ 2.2×, the measured throughput gain. The result is honest width scaling plus FMA, not the reclaiming of a hardware tax that does not exist on this core.

Precision

All non-libm variants are validated against scalar_libm over the full 1M input set with the verify_03_simd_blackscholes binary. Threshold: max_abs_error < 1e-4.

Variant	max_abs_error vs scalar_libm
scalar_poly	6.49 × 10⁻⁵
sse2_intrinsics	6.49 × 10⁻⁵
avx2_fma	5.82 × 10⁻⁵

The precision bound is discussed honestly: for a production pricing engine, 1e-4 absolute error on vanilla European calls is generally acceptable (spreads are wider than that), but the disclaimer matters — implied volatility inversion, exotic products, and Greeks computed by finite-difference need tighter bounds. Reach for Sleef or a vendor SVML in those contexts.

Why hand-roll the polynomials?

This was a deliberate trade-off. The alternatives:

Sleef gives a faster, more accurate result with less code. It's the right choice for production.
Intel SVML is closed-source and not portable to GCC on Linux.
Agner Fog's Vectorclass is a clean C++ wrapper, similar story to Sleef.

Reasons to hand-roll here: the post's claim is about understanding the transformation, not shipping fast Black-Scholes. Pulling Sleef would reduce variants 3–4 to a 20-line wrapper — readers learn nothing about how vectorised transcendentals actually work. Hand-rolling Horner's method with _mm256_fmadd_ps is the part a quant dev wants to see you can write. The polynomial-vs-libm algorithm win (step 1→2) also requires hand-rolled scalar polynomials anyway; once those exist, vectorising them is a small step. Using Sleef only for variants 3–4 would mean the speedup decomposition stops being clean.

What this doesn't show

GFLOPS for scalar_libm is computed against the polynomial-variant flop count (98 flops/option). libm's erfc is internally far more than that — branch-heavy dispatch, lookup tables, and conditional polynomial branches. The reported 2.05 GFLOPS for scalar_libm is therefore "throughput-equivalent flops if libm did the same math as the polynomial variant," not a measurement of libm's actual flop rate. Treat it as a throughput number, not a computational density number.
Per-iteration harness overhead. The per-iteration cost (perf-counter start/stop, function-call entry) is fixed, so at N=1024 it inflates per-option time in proportion to how fast the kernel is: AVX2 runs ~5× slower per option than at N=1M, SSE ~3×, and the scalar variants ~25–40% slower. Small-N numbers are measurement-floor-bound, not compute-bound. The charts above use N≥16k for this reason; the full sweep including N=1024 is in the JSON for reproducibility.
% of theoretical peak. Zen 2's base-clock peak for 256-bit FMA is roughly 125 GFLOPS per core (2 FMA units × one 256-bit FMA each per cycle × 8 lanes × 2 flops × 3.9 GHz). The AVX2 variant lands at 27 GFLOPS = ~22% of peak. Division, square root, and the polynomial cascades (which can't all pipeline as pure FMA) are the gap.
AVX-512: deferred to a rented-bare-metal post. AVX-512 would allow 16-wide float computation in a single instruction and properly single-issue 512-bit ops on recent server CPUs. Numbers here won't extrapolate.
Wider ISA, not a newer core. Zen 3, Zen 4, and Sapphire Rapids also execute 256-bit AVX2 natively (as Zen 2 already does), so the AVX2/SSE ratio stays near the 2× lane-width ceiling on them too. Beating it needs AVX-512's 16 lanes — see the AVX-512 note above — not a newer 256-bit core.
Put options: trivially derived from put-call parity once you have call prices.
Implied volatility: the inverse problem — a separate post.
GPU comparison: deferred.

Takeaway

A 10× spread across four implementations of the same model decomposes cleanly: 11% from swapping libm's erfc for a polynomial approximation, then 9× from widening to 8 lanes of AVX2 + FMA. The 11% step is unremarkable in isolation — but it's the gate. There is no _mm256_erfc_ps, so the SIMD wins are unreachable until the algorithm step happens first.

On Zen 2 the AVX2/SSE ratio lands near 2× because that is the lane-width ratio — 8 lanes over 4 — not because of a split tax: Zen 2 executes 256-bit AVX2 as single µops (measured at ~1 µop per instruction, the same as SSE). FMA nudges the ratio just above 2× at large N by cutting instruction count. Zen 3, Zen 4, and Sapphire Rapids also execute 256-bit natively, so the lane-width ceiling is the same 2× there — going wider needs a wider ISA (AVX-512), not a newer 256-bit core.

The SoA layout that made this vectorisation possible is its own story — when the struct is wide and the inner loop touches only a few fields, the layout choice determines whether SIMD is reachable at all. Demo 6 quantifies the bandwidth-amplification model that underlies it.

The same kernel on a Cortex-A76 — 128-bit NEON, no wider unit to reach for — is in Demo 9, where the width ceiling has no second step.

AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 1–7 isolated, benchmarks pinned to 4–7. Headless Ubuntu 24.04. GCC 13.3, per-variant ISA flags as documented in CMakeLists.txt. 20 outer repetitions, median reported (throughput convention).

Methodology →

The workload#

Four variants#

Headline numbers#

Decomposing the speedup#

Precision#

Why hand-roll the polynomials?#

What this doesn't show#

Takeaway#