Sorted vs unsorted: a branch-prediction deep dive

Same code. Same data values. Same compiler flags. Yet one version runs roughly seven times faster than the other on identical hardware, and a third version closes most of the gap without sorting anything.

The code

Three runs of the same loop body, with two axes of variation: input order (sorted vs shuffled) and control flow (branching vs branchless).

Branching (sorted or unsorted input)

Branchless

// The compiler will happily turn this loop into a masked SIMD add or a
// scalar cmov — both of which eliminate the branch we're trying to study.
// Function-level attributes disable the two transformations that defeat
// the experiment, keeping -O3 -march=native everywhere else.
__attribute__((noinline,
             optimize("no-tree-vectorize",
                      "no-if-conversion",
                      "no-if-conversion2")))
static int64_t sum_threshold(const std::vector<int32_t>& data) {
  int64_t sum = 0;
  for (auto x : data) {
      if (x >= 128) sum += x;       // real conditional jump
  }
  return sum;
}

// Same ternary that gives cmov on most compilers will also
// auto-vectorise here — we keep it scalar to stay apples-to-apples
// with the branching variants. Demo 3 covers SIMD properly.
__attribute__((noinline, optimize("no-tree-vectorize")))
static int64_t sum_threshold_branchless(const std::vector<int32_t>& data) {
  int64_t sum = 0;
  for (auto x : data) {
      sum += (x >= 128) ? x : 0;    // compiles to cmov
  }
  return sum;
}

The loop body is byte-for-byte identical across sorted and unsorted — the only difference is whether the 32 M integers were sorted before the run. The branchless variant is the same arithmetic expressed without a conditional jump; the compiler replaces the branch with a cmov (conditional move) instruction.

The asm

Here is the inner loop that actually runs in sum_threshold. The attributes above are what keep it a real branch at -O3:

; sum_threshold inner loop — seven instructions, one data-dependent branch
ba50:  movslq (%rax),%rdx        ; load x  (sign-extend i32 → i64)
ba53:  cmp    $0x7f,%edx         ; x − 127  (sets flags)
ba56:  jle    ba5b               ; ← the branch: skip add if x ≤ 127
ba58:  add    %rdx,%rsi          ; sum += x
ba5b:  add    $0x4,%rax          ; advance pointer
ba5f:  cmp    %rcx,%rax          ; ptr == end?
ba62:  jne    ba50               ; loop

And here is the corresponding inner loop in sum_threshold_branchless. The jle and the conditional add are replaced by a single cmovg:

; sum_threshold_branchless inner loop — eight instructions, no data-dependent branch
ba90:  movslq (%rax),%rdx        ; load x  (sign-extend i32 → i64)
ba93:  mov    %rdx,%rsi          ; save x in rsi for the compare
ba96:  add    %rcx,%rdx          ; rdx = sum + x  (speculatively compute the new sum)
ba99:  cmp    $0x7f,%esi         ; x − 127
ba9c:  cmovg  %rdx,%rcx          ; commit: sum = (sum + x) iff x > 127  ← no jump
baa0:  add    $0x4,%rax          ; advance pointer
baa4:  cmp    %rax,%rdi          ; ptr == end?
baa7:  jne    ba90               ; loop

GCC compiles the ternary into a small surprise: rather than zeroing a temp and conditionally moving x into it, it pre-computes the candidate new sum (rdx = sum + x) unconditionally, then uses cmovg to commit that candidate into the live accumulator only when x > 127. The wrong-path add happens every iteration and is silently discarded — no branch, no speculation rollback, and crucially no mispredict penalty even on adversarial input.

Without the no-tree-vectorize attribute, GCC 13.3 at -O3 -march=native (which resolves to -march=znver2 on the reference machine) turns this same ternary into vpcmpgtd/vpand/vpaddd — an 8-wide masked add. That's a different story (SIMD width, not branch behaviour), and gets its own post.

The numbers

Sorted vs unsorted branch prediction

The bar chart shows time for one pass through 32 M elements; the miss-rate chart shows the proximate cause. Sorted input: branch misses near zero — the predictor locks onto the monotone run and never guesses wrong. Unsorted: misses converge on ~50%, the maximum for a fair binary outcome. Branchless: misses also near zero, because there is no data-dependent branch to mispredict at all.

What is happening

Modern out-of-order CPUs predict branches before they are evaluated. The branch predictor watches the history of each conditional jump and guesses which way it will go. When the guess is right, the speculatively executed instructions are committed with no penalty. When the guess is wrong, the pipeline is flushed and the correct path must be re-fetched and re-executed — roughly 19 cycles on Zen 2 (the frontend-to-retire pipeline depth).

Sorted input gives the predictor an easy job. The first half of the array produces false on every iteration; the second half produces true. After a handful of iterations the predictor locks onto these runs and misses almost never. Branch misses are measured at ≈ 0.0% of comparisons at N = 32 M. IPC reaches 3.0 — the pipeline is full.

Unsorted input is adversarial. The condition resolves true or false in pseudo-random order (seed 42, uniform over 0–255). Zen 2's TAGE-style predictor can in principle pick up long-history correlations even on superficially random data — what the measured ≈ 50% mispredict rate tells us is that no such pattern is within the predictor's reach at this scale, and it converges on the 50% ceiling for a fair binary outcome. IPC drops to 0.45 as the pipeline repeatedly flushes and restarts.

The gap across the cache hierarchy

The 6.8× headline is measured at N = 32 M (128 MiB), where both variants are DRAM-bound. Looking across all six sizes reveals a more interesting story. The same L3→DRAM cliff visible here — where memory bandwidth becomes the binding constraint — is also the headline picture in Demo 6, which asks the layout question that runs on top of this one: when the inner loop only touches a few fields of a wide struct, does the struct layout amplify or hide the bandwidth cliff? The same cache-tier structure shows up again when comparing map data structures — demo 7 asks whether sorted containers beat hash maps at small N, and the cache-tier bends in the lookup curves are the same story.

Sorted vs unsorted branch prediction

N = 1024 (harness floor): both variants land at ~20 ns/element here because per-iteration overhead — timing calls, function entry, loop setup — is fixed per measurement and amortises across N. By N = 10 K it's already negligible. Treat the leftmost point as a measurement floor, not a branch- prediction signal; the real per-element cost emerges from N = 10 K onward.
Small N (≤ 10 K): 1 K elements fit in L1 (4 KiB), 10 K fits comfortably in L2 (40 KiB vs Zen 2's 512 KiB L2). Cache latency is minimal for both variants, and at small N the predictor can learn the sequence (see below), so the gap is compressed.
Cache-resident N (~1 M, L3-bound): sorted is fast — the pipeline is predictable and memory bandwidth is not yet a bottleneck. Unsorted is at peak miss rate with no bandwidth relief. This is where the ratio peaks at roughly 7×.
N = 32 M (DRAM-bound): both variants are bandwidth-limited; mispredict cost overlaps with memory-wait cycles. The absolute gap stays substantial but the ratio settles at 6.8× — the floor of the effect, not the peak.

"Random" is in the eye of the predictor

The small-N rows in the chart above conceal a finding worth pulling out. At N = 1024, the unsorted variant runs at 1.04× of sorted — essentially no gap — with a measured mispredict rate of just 1.37% and IPC of 1.70.

The buffer is generated once and the benchmark replays it for thousands of timed passes, so at N = 1024 the predictor sees the same 1024-branch sequence over and over — and that is enough for Zen 2's TAGE-style tables to associate each branch position with a distinct history context. The same pseudo-random sequence that produces 50% mispredicts at N ≥ 100 K is, at N = 1024, almost perfectly learnable. The 50% rate emerges as N outgrows the predictor's capacity to give every position its own context — and the transition is gradual (30% at 10 K, not a cliff), the signature of finite predictor capacity saturating rather than a fixed history window being overrun:

N	unsorted miss rate	unsorted IPC	sorted/unsorted ratio
1,024	1.37%	1.70	1.04×
10,240	30.4%	0.63	1.91×
102,400	49.8%	0.45	5.51×
1,048,576	50.0%	0.45	7.19×

"Random" is not a property of the data alone — it is a property of the data relative to the predictor's memory. Cryptographically random data is random to TAGE; pseudo-random data short enough — and replayed often enough — for the predictor to learn is not. For a demo on a textbook-classic benchmark, this is the most surprising row in the table.

Branchless tracks sorted within a few percent at small N, then diverges to ~22% slower as N grows. The cmov path pre-computes sum + x on every iteration and discards it half the time via the conditional move; well-predicted sorted skips the add entirely when x < 128. At a 50/50 split with near-zero mispredicts, that "skip half the work" advantage is exactly the ~22% gap we measure. Branchless still beats unsorted by ~5.5× — but a perfectly-predicted branch beats branchless on this workload.

What you'd actually ship

The branchless variant answers a different question than sorting does. Sorting is the right answer when input ordering is free or naturally maintained — time-ordered events, a sorted index column, an on-insertion sorted structure. It is not the right answer for a generic hot data-dependent branch in a tight loop.

When you control the code, not the data, write branchless:

// Replace this:
if (x >= 128) sum += x;
 
// With this — cmov, no mispredict possible:
sum += (x >= 128) ? x : 0;

One caveat: branchless trades a possible-skip for a guaranteed-execute. The cmov path computes the candidate result on every iteration and conditionally commits it; a well-predicted branch skips the work entirely on the not-taken path. For the 50/50 case studied here, branchless soundly beats unpredictable branching — but a perfectly-predicted branch still edges ahead by about 22% on this workload because it does half the adds. Branchless wins when the predicate is unpredictable, not when it's merely binary.

The sort caveat

std::sort on 32 M integers takes ~1.1 s on the reference machine — measured, not estimated. If the array is consumed once, sorting first is a net loss. The technique pays when the same data is scanned many times, when a sorted structure can be maintained incrementally, or when ordering falls out of the problem naturally (events in time order, an index column).

Demo 8 measures what the sort itself costs across three variants — std::sort, pdqsort, and a hand-rolled LSD radix — and shows where radix's data-independent runtime makes it the p99 winner even when its median is not the best.

Takeaway

Two O(n) algorithms with identical instruction counts can differ by 7× in wall time if one is predictor-friendly and the other is not. The classic branch-prediction demo is not really a lesson about sorting — it is a lesson about how much performance is hiding in the control-flow shape of your hot loop, and how readily the modern predictor will reward you for working with it.

Always check the disassembly, always measure, and when the branch is the bottleneck, consider removing it rather than feeding it predictable inputs.

AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 1–7 isolated, pinned to 4–7. Headless Ubuntu 24.04. GCC 13.3, -O3 -march=native. 20 outer repetitions, median reported (throughput convention).

sum_threshold compiled with no-tree-vectorize, no-if-conversion, no-if-conversion2 to preserve the conditional jump. sum_threshold_branchless compiled with no-tree-vectorize to prevent auto-vectorisation (cmov path); both variants do identical scalar arithmetic. Sort cost measured with PauseTiming() to exclude the per-iteration copy.

Methodology →

The code#

The asm#

The numbers#

What is happening#

The gap across the cache hierarchy#

"Random" is in the eye of the predictor#

What you'd actually ship#

The sort caveat#

Takeaway#