2026-05-17
Sorted vs unsorted: a branch-prediction deep dive
Same code. Same data. Three variants. Why does one run 7× faster — and why does the branchless version close most of the gap without sorting anything?
Same code. Same data values. Same compiler flags. Yet one version runs roughly seven times faster than the other on identical hardware, and a third version closes most of the gap without sorting anything.
The code
Three runs of the same loop body, with two axes of variation: input order (sorted vs shuffled) and control flow (branching vs branchless).
// The compiler will happily turn this loop into a masked SIMD add or a
// scalar cmov — both of which eliminate the branch we're trying to study.
// Function-level attributes disable the two transformations that defeat
// the experiment, keeping -O3 -march=native everywhere else.
__attribute__((noinline,
optimize("no-tree-vectorize",
"no-if-conversion",
"no-if-conversion2")))
static int64_t sum_threshold(const std::vector<int32_t>& data) {
int64_t sum = 0;
for (auto x : data) {
if (x >= 128) sum += x; // real conditional jump
}
return sum;
}// Same ternary that gives cmov on most compilers will also
// auto-vectorise here — we keep it scalar to stay apples-to-apples
// with the branching variants. Demo 3 covers SIMD properly.
__attribute__((noinline, optimize("no-tree-vectorize")))
static int64_t sum_threshold_branchless(const std::vector<int32_t>& data) {
int64_t sum = 0;
for (auto x : data) {
sum += (x >= 128) ? x : 0; // compiles to cmov
}
return sum;
}The loop body is byte-for-byte identical across sorted and unsorted — the
only difference is whether the 32 M integers were sorted before the run.
The branchless variant is the same arithmetic expressed without a conditional
jump; the compiler replaces the branch with a cmov (conditional move)
instruction.
The asm
Here is the inner loop that actually runs in sum_threshold. The attributes
above are what keep it a real branch at -O3:
; sum_threshold inner loop — seven instructions, one data-dependent branch
bbd0: movslq (%rax),%rdx ; load x (sign-extend i32 → i64)
bbd3: cmp $0x7f,%edx ; x − 127 (sets flags)
bbd6: jle bbdb ; ← the branch: skip add if x ≤ 127
bbd8: add %rdx,%rsi ; sum += x
bbdb: add $0x4,%rax ; advance pointer
bbdf: cmp %rcx,%rax ; ptr == end?
bbe2: jne bbd0 ; loopAnd here is the corresponding inner loop in sum_threshold_branchless.
The jle and the conditional add are replaced by a single cmovg:
; sum_threshold_branchless inner loop — eight instructions, no data-dependent branch
bc00: movslq (%rax),%rdx ; load x (sign-extend i32 → i64)
bc03: mov %rdx,%rsi ; save x in rsi for the compare
bc06: add %rcx,%rdx ; rdx = sum + x (speculatively compute the new sum)
bc09: cmp $0x7f,%esi ; x − 127
bc0c: cmovg %rdx,%rcx ; commit: sum = (sum + x) iff x > 127 ← no jump
bc10: add $0x4,%rax ; advance pointer
bc14: cmp %rax,%rdi ; ptr == end?
bc17: jne bc00 ; loopGCC compiles the ternary into a small surprise: rather than zeroing a temp and
conditionally moving x into it, it pre-computes the candidate new sum
(rdx = sum + x) unconditionally, then uses cmovg to commit that candidate
into the live accumulator only when x > 127. The wrong-path add happens every
iteration and is silently discarded — no branch, no speculation rollback, and
crucially no mispredict penalty even on adversarial input.
Without the
no-tree-vectorizeattribute, GCC 13.3 at-O3 -march=native(which resolves to-march=znver2on the reference machine) turns this same ternary intovpcmpgtd/vpand/vpaddd— an 8-wide masked add that beats the cmov version by a further ~8×. That's a different story (SIMD width, not branch behaviour), and gets its own post.
The numbers
The bar chart shows time for one pass through 32 M elements; the miss-rate chart shows the proximate cause. Sorted input: branch misses near zero — the predictor locks onto the monotone run and never guesses wrong. Unsorted: misses converge on ~50%, the maximum for a fair binary outcome. Branchless: misses also near zero, because there is no data-dependent branch to mispredict at all.
What is happening
Modern out-of-order CPUs predict branches before they are evaluated. The branch predictor watches the history of each conditional jump and guesses which way it will go. When the guess is right, the speculatively executed instructions are committed with no penalty. When the guess is wrong, the pipeline is flushed and the correct path must be re-fetched and re-executed — roughly 19 cycles on Zen 2 (the frontend-to-retire pipeline depth).
Sorted input gives the predictor an easy job. The first half of the
array produces false on every iteration; the second half produces true.
After a handful of iterations the predictor locks onto these runs and
misses almost never. Branch misses are measured at ≈ 0.0% of comparisons
at N = 32 M. IPC reaches 3.0 — the pipeline is full.
Unsorted input is adversarial. The condition resolves true or false
in pseudo-random order (seed 42, uniform over 0–255). Zen 2's TAGE-style
predictor can in principle pick up long-history correlations even on
superficially random data — what the measured ≈ 50% mispredict rate
tells us is that no such pattern exists in this shuffle, and the predictor
converges on the 50% ceiling for a fair binary outcome. IPC drops to 0.45
as the pipeline repeatedly flushes and restarts.
The gap across the cache hierarchy
The 6.7× headline is measured at N = 32 M (128 MiB), where both variants are DRAM-bound. Looking across all six sizes reveals a more interesting story. The same L3→DRAM cliff visible here — where memory bandwidth becomes the binding constraint — is also the headline picture in Demo 6, which asks the layout question that runs on top of this one: when the inner loop only touches a few fields of a wide struct, does the struct layout amplify or hide the bandwidth cliff? The same cache-tier structure shows up again when comparing map data structures — demo 7 asks whether sorted containers beat hash maps at small N, and the cache-tier bends in the lookup curves are the same story.
- N = 1024 (harness floor): both variants land at ~20 ns/element here because per-iteration overhead — timing calls, function entry, loop setup — is fixed per measurement and amortises across N. By N = 10 K it's already negligible. Treat the leftmost point as a measurement floor, not a branch- prediction signal; the real per-element cost emerges from N = 10 K onward.
- Small N (≤ 10 K): 1 K elements fit in L1 (4 KiB), 10 K fits comfortably in L2 (40 KiB vs Zen 2's 512 KiB L2). Cache latency is minimal for both variants, and at small N the predictor can memorise the sequence (see below), so the gap is compressed.
- Cache-resident N (~1 M, L3-bound): sorted is fast — the pipeline is predictable and memory bandwidth is not yet a bottleneck. Unsorted is at peak miss rate with no bandwidth relief. This is where the ratio peaks at roughly 7×.
- N = 32 M (DRAM-bound): both variants are bandwidth-limited; mispredict cost overlaps with memory-wait cycles. The absolute gap stays substantial but the ratio settles at 6.7× — the floor of the effect, not the peak.
"Random" is in the eye of the predictor
The small-N rows in the chart above conceal a finding worth pulling out. At N = 1024, the unsorted variant runs at 1.04× of sorted — essentially no gap — with a measured mispredict rate of just 1.27% and IPC of 1.61.
With only 1024 branches in the loop, TAGE has enough history capacity to effectively memorise the entire seed-42 shuffle. The same pseudo-random sequence that produces 50% mispredicts at N ≥ 100 K is, at N = 1024, almost perfectly learnable. The 50% rate emerges only once N exceeds the predictor's effective history depth — somewhere between 10 K and 100 K on this Zen 2 part:
| N | unsorted miss rate | unsorted IPC | sorted/unsorted ratio |
|---|---|---|---|
| 1,024 | 1.27% | 1.61 | 1.04× |
| 10,240 | 30.3% | 0.63 | 1.91× |
| 102,400 | 49.8% | 0.45 | 5.50× |
| 1,048,576 | 50.0% | 0.45 | 7.17× |
"Random" is not a property of the data alone — it is a property of the data relative to the predictor's memory. Cryptographically random data is random to TAGE; pseudo-random data short enough to fit in its history register is not. For a demo on a textbook-classic benchmark, this is the most surprising row in the table.
Branchless tracks sorted within a few percent at small N, then diverges to
~22% slower as N grows. The cmov path pre-computes sum + x on every
iteration and discards it half the time via the conditional move; well-predicted
sorted skips the add entirely when x < 128. At a 50/50 split with near-zero
mispredicts, that "skip half the work" advantage is exactly the ~22% gap we
measure. Branchless still beats unsorted by ~5.5× — but a perfectly-predicted
branch beats branchless on this workload.
What you'd actually ship
The branchless variant answers a different question than sorting does. Sorting is the right answer when input ordering is free or naturally maintained — time-ordered events, a sorted index column, an on-insertion sorted structure. It is not the right answer for a generic hot data-dependent branch in a tight loop.
When you control the code, not the data, write branchless:
// Replace this:
if (x >= 128) sum += x;
// With this — cmov, no mispredict possible:
sum += (x >= 128) ? x : 0;One caveat: branchless trades a possible-skip for a guaranteed-execute. The cmov path computes the candidate result on every iteration and conditionally commits it; a well-predicted branch skips the work entirely on the not-taken path. For the 50/50 case studied here, branchless soundly beats unpredictable branching — but a perfectly-predicted branch still edges ahead by about 22% on this workload because it does half the adds. Branchless wins when the predicate is unpredictable, not when it's merely binary.
The sort caveat
std::sort on 32 M integers takes ~1.0 s on the reference machine —
measured, not estimated. If the array is consumed once, sorting first is a
net loss. The technique pays when the same data is scanned many times, when
a sorted structure can be maintained incrementally, or when ordering falls
out of the problem naturally (events in time order, an index column).
Takeaway
Two O(n) algorithms with identical instruction counts can differ by 7× in wall time if one is predictor-friendly and the other is not. The classic branch-prediction demo is not really a lesson about sorting — it is a lesson about how much performance is hiding in the control-flow shape of your hot loop, and how readily the modern predictor will reward you for working with it.
Always check the disassembly, always measure, and when the branch is the bottleneck, consider removing it rather than feeding it predictable inputs.
AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 0–7 isolated, pinned to 4–7. Headless Ubuntu 24.04. GCC 13.3, -O3 -march=native. 20 outer repetitions, median reported (throughput convention).
sum_threshold compiled with no-tree-vectorize, no-if-conversion, no-if-conversion2 to preserve the conditional jump. sum_threshold_branchless compiled with no-tree-vectorize to prevent auto-vectorisation (cmov path); both variants do identical scalar arithmetic. Sort cost measured with PauseTiming() to exclude the per-iteration copy.