The comparison lower bound is a wall. Integer keys let you walk around it.

std::sort is the first thing everyone reaches for, and on an array of plain integers it is almost never the fastest option. Not because the implementation is weak — libstdc++'s introsort is excellent — but because it is playing a game it cannot win. Sort 67 million random 32-bit keys on the reference machine and std::sort spends about 74 ns per element. A hand-rolled radix sort spends about 15. Same machine, same data, same -O3 -march=native. The five-times gap is not a tuning detail; it is the difference between two cost models.

The wall

Any sort that works by comparing pairs of elements is bounded from below. To distinguish all n! possible orderings you need at least log₂(n!) ≈ n log₂ n comparisons, and no comparison sort can do better in the worst case. This is the comparison lower bound, Ω(n log n), and it is a theorem, not an implementation gap. std::sort lives against this wall: its per-element cost grows with log₂ n because each doubling of the input adds, on average, one more level of comparison work per element.
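To put rough numbers on the bound, here is a minimal sketch that evaluates log₂(n!) directly. The function names are illustrative and not taken from the benchmark repository; it relies only on the identity ln(n!) = lgamma(n + 1).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Information-theoretic minimum number of comparisons needed to sort
// n distinct keys in the worst case: log2(n!).
// std::lgamma(n + 1) is ln(n!), so dividing by ln 2 converts it to bits.
double min_comparisons(std::uint64_t n) {
    assert(n >= 1);
    return std::lgamma(static_cast<double>(n) + 1.0) / std::log(2.0);
}

// The same bound per element, the unit this post plots.
double min_comparisons_per_element(std::uint64_t n) {
    return min_comparisons(n) / static_cast<double>(n);
}
```

Under this sketch the bound works out to roughly 8.6 comparisons per element at 1,024 keys and roughly 24.6 at 2²⁶ (about 67 million) keys. It counts comparisons, not nanoseconds, so it describes the floor a comparison sort cannot go below rather than any measured time.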

You can see the wall directly in the per-element cost. Across the sweep, std::sort climbs from ~9 ns/element at a thousand keys to ~74 at 67 million — and the shape of that climb is almost entirely the log₂ n term, not cache effects. The algorithm is the dominant cost.

Walking around it

The lower bound only applies if you have to compare. Radix sort doesn't. Given fixed-width integer keys, it sorts by distributing elements into buckets one digit at a time — here, one byte per pass, four passes for a 32-bit key. Its cost is O(n · w/r): linear in the number of elements, scaled by the key width w over the bits processed per pass r. There is no log n. Double the input and the per-element cost barely moves.

That "barely moves" is the whole story, and it has a price tag we'll come back to: radix needs an O(n) scratch buffer to scatter into, so it ping-pongs between two arrays.

std::sort (comparison)

std::sort(keys.begin(), keys.end());
// introsort: quicksort + heapsort fallback + insertion sort
// bounded by the comparison lower bound, O(n log n)

LSD radix (counting)

// 8 bits per pass, 4 passes for a 32-bit key
// a: the keys (e.g. std::vector<uint32_t>); tmp: scratch buffer of the same size
for (unsigned shift = 0; shift < 32; shift += 8) {
  size_t count[256] = {0};
  for (auto k : a) ++count[(k >> shift) & 0xFF];           // histogram of this pass's byte
  size_t sum = 0;
  for (int b = 0; b < 256; ++b) { auto c = count[b]; count[b] = sum; sum += c; }  // exclusive prefix sum: bucket start offsets
  for (auto k : a) tmp[count[(k >> shift) & 0xFF]++] = k;  // stable scatter into the scratch buffer
  a.swap(tmp);   // O(n) scratch buffer: the price of no comparisons
}
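The radix loop above leaves `a` and `tmp` undeclared. A minimal self-contained version might look like the following sketch; the function name `radix_sort_u32` and the choice of `std::vector` are assumptions for illustration, not necessarily what the benchmark harness uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort for 32-bit unsigned keys: 8 bits per pass, 4 passes.
// Allocates an O(n) scratch buffer and ping-pongs between it and `a`.
void radix_sort_u32(std::vector<std::uint32_t>& a) {
    std::vector<std::uint32_t> tmp(a.size());
    for (unsigned shift = 0; shift < 32; shift += 8) {
        // Histogram of the current byte.
        std::array<std::size_t, 256> count{};
        for (std::uint32_t k : a) ++count[(k >> shift) & 0xFF];

        // Exclusive prefix sum: count[b] becomes the first output slot of bucket b.
        std::size_t sum = 0;
        for (std::size_t& c : count) {
            const std::size_t n = c;
            c = sum;
            sum += n;
        }

        // Stable scatter: equal bytes keep their relative order.
        for (std::uint32_t k : a) tmp[count[(k >> shift) & 0xFF]++] = k;

        // The swap leaves the latest pass's output in `a`.
        a.swap(tmp);
    }
    assert(tmp.size() == a.size());
}
```

Calling `radix_sort_u32(keys)` on a `std::vector<std::uint32_t>` sorts it ascending. A benchmark would normally hoist the scratch allocation out of the timed region; this sketch keeps it inside for brevity.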

The race across N

Here is the per-element cost of all three sorts (std::sort, pdqsort, and the LSD radix sort above) on random 32-bit keys, from a thousand elements to 67 million.

[Figure: per-element sort cost vs N, random u32 keys, log-log axes]

Radix sits on the floor across almost the entire sweep. At 67 million keys it is roughly 5× faster than std::sort and 1.9× faster than pdqsort, and its line is nearly flat where the comparison sorts ramp upward — exactly what "no log n" looks like on a chart.

Two things are worth slowing down for.

The cache knee on the radix line. Setting aside the small-N bump discussed below, radix climbs gently from ~7–8 ns/element in the L1/L2 region to ~11–12 by a few hundred thousand keys, then bends upward to ~15 at the right edge of the sweep. The bend sits just past four million keys, and four million 32-bit keys is 16 MB, the size of this machine's L3 slice. Past that point the scatter writes spill to main memory and radix becomes bandwidth-bound rather than compute-bound. The comparison sorts don't show this knee nearly as clearly, because their log n growth swamps it.

The band markers (the same cache-tier bands demo 6 and demo 7 introduced) sit on the radix line for that reason: it's the one variant whose curve is shaped by the cache hierarchy rather than by its own asymptotics. The markers show the single-array footprint, and that needs a qualification in each direction. At small N, radix's ping-pong scratch buffer doubles its working set, so it hits the lower tiers earlier than the markers suggest; that's the placement-sensitive band just left of the L1 marker, discussed next. At the L3 boundary the doubling does not move the knee: the two arrays are streamed sequentially rather than held resident together, so the single-array 16 MB footprint is the line that bites, which is why the knee sits at four million keys, not at two.

The window where radix loses — sometimes. The May capture of this exact code showed radix bumping to ~19–23 ns/element between 4,000 and 8,000 elements while pdqsort dipped to ~16–18 and briefly won, and the capacity arithmetic looked clean: radix ping-pongs two n-element arrays, so at ~4 K keys the doubled footprint is 32 KB — exactly the L1 data cache — while the in-place comparison sorts cross that boundary at twice the element count. Then the recapture removed the window. A rebuild and a reboot later, radix at 4 K–8 K sits at 6.7–7.3 ns/element, on the floor, beating pdqsort by ~2.5×; meanwhile its N = 1 024 cell moved 35% in the other direction between the same two captures. Capacity arithmetic predicts a deterministic effect; two captures of identical code show a placement-dependent one. The doubled footprint puts radix at the L1 boundary in that band, but whether it pays there is decided by where the allocator and transparent huge pages put the two buffers — a conflict between source and destination lines is an address property, not a size property, and the address draw changes per process. The honest statement of the trade: O(n) auxiliary memory buys the linear cost, and in the capacity-critical band below ~16 K elements it exposes radix to placement luck the in-place sorts don't face — costing anywhere from nothing to a brief loss to pdqsort, depending on the draw. Above ~16 K both captures agree to within a few percent and radix is on the floor for good.
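The footprint arithmetic in the two paragraphs above can be checked at compile time. This is a small sketch: the cache sizes are the ones quoted in the text (32 KB L1 data cache, 16 MB L3 slice), and the helper names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Cache sizes as quoted in the text for the reference machine.
constexpr std::size_t kL1dBytes = 32 * 1024;              // L1 data cache
constexpr std::size_t kL3SliceBytes = 16 * 1024 * 1024;   // L3 slice

// Bytes occupied by one array of n 32-bit keys (what the band markers show).
constexpr std::size_t single_array_bytes(std::size_t n) {
    return n * sizeof(std::uint32_t);
}

// Bytes occupied by radix's source array plus its scratch buffer.
constexpr std::size_t ping_pong_bytes(std::size_t n) {
    return 2 * single_array_bytes(n);
}

// Radix's doubled footprint reaches L1 capacity at 4 K keys...
static_assert(ping_pong_bytes(4096) == kL1dBytes);
// ...while an in-place sort reaches it at twice the element count.
static_assert(single_array_bytes(8192) == kL1dBytes);
// The L3 knee lines up with the single-array footprint at four million keys.
static_assert(single_array_bytes(4'194'304) == kL3SliceBytes);
```

The asserts only confirm the capacity arithmetic. As the text argues, whether radix actually pays at the L1 boundary depends on buffer placement, which no size calculation captures.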

Why input shape decides the comparison-sort race

The chart above is random data. Real data usually isn't random — it's partially sorted, or has few distinct values, or arrives in runs. That structure matters enormously to a comparison sort and almost not at all to radix. Here are all three at four million keys across five input shapes:

[Figure: sort cost across input distributions, N = 4,194,304 (u32)]

The comparison sorts swing wildly. pdqsort — pattern-defeating quicksort — detects sorted runs and few-unique-key inputs and short-circuits them: it sorts already-sorted data at 0.75 ns/element, a ~35× spread from its random-data cost of ~26. std::sort is less aggressive about it but still ranges roughly 9× from its best input to its worst. Radix, by contrast, spans ~2× across all five distributions. It does the same four passes regardless of what the data looks like; the input's shape is invisible to it.

There's a tidy inversion buried in those numbers. The comparison sorts are fastest on ordered input — sorted or reverse-sorted — and slowest on random. Radix is the reverse: sorted and reverse-sorted are its worst inputs (~23–24 ns/element), random is its best (~12), and few-unique sits in between (~15). Sequential keys make the high-order bytes vary across the full range, fanning the scatter writes across all 256 buckets on every pass; few-unique collapses those high passes into a single bucket, so it beats the sorted worst case without matching random. Same operation, opposite preference. It's the cleanest demonstration I know that these two families don't just differ in speed — they're sensitive to completely different properties of the data.

One honest limit: the sawtooth input (eight repeating ramps) defeats pdqsort's pattern detection entirely — at ~27.5 ns/element it's no faster than on random data. "Pattern-defeating" defeats the adversarial patterns it was designed for; it does not make every structured input free.
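For readers who want to try the five shapes themselves, the sketch below generates them. The text does not spell out the harness's generators, so everything here is an assumption: the `Shape` enum and `make_input` are hypothetical names, "sorted" is taken to mean sorted full-range random keys, few-unique uses 16 distinct values, and the sawtooth is eight identical ascending ramps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <vector>

enum class Shape { Random, Sorted, Reverse, FewUnique, Sawtooth };

// Builds n 32-bit keys in the requested shape. All parameters of the shapes
// (seed, number of distinct values, ramp layout) are illustrative guesses.
std::vector<std::uint32_t> make_input(Shape shape, std::size_t n, std::uint32_t seed = 42) {
    assert(n > 0);
    std::mt19937 rng(seed);
    std::vector<std::uint32_t> v(n);
    for (auto& k : v) k = static_cast<std::uint32_t>(rng());  // full-range random keys

    switch (shape) {
    case Shape::Random:
        break;
    case Shape::Sorted:
        std::sort(v.begin(), v.end());
        break;
    case Shape::Reverse:
        std::sort(v.begin(), v.end(), std::greater<>{});
        break;
    case Shape::FewUnique:
        for (auto& k : v) k %= 16;  // 16 distinct values: a hypothetical choice
        break;
    case Shape::Sawtooth: {
        const std::size_t ramp = (n + 7) / 8;  // eight ramps; the last may be short
        for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<std::uint32_t>(i % ramp);
        break;
    }
    }
    return v;
}
```

A fixed seed keeps the inputs identical across the sorts being compared, which matters more than the particular seed value.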

The argument for a hot path

This is where the integer-key case stops being a microbenchmark curiosity and starts mattering for a latency budget. If you sort on a hot path — rebuilding a sorted book, ordering a batch of fills, anything where a p99 spike is a problem — the question isn't only "what's the median?" It's "what's the worst input going to do to me?"

A comparison sort's answer depends on the data. pdqsort's median is excellent, but its cost ranges 35× across input shapes, and an adversarial or simply unlucky distribution lands you at the bad end with no warning. Radix's answer barely changes: 12 to 24 ns/element across every shape measured here. There is no input that blows out its tail, because it runs the same four passes over the same number of elements whatever the input's order; only the scatter's memory-access pattern varies, and that accounts for the ~2× band. On a path where the tail is the thing you're paid to control, a nearly data-independent sort is the safe instrument even on the inputs where its median isn't the lowest number on the chart. That property, runtime you can reason about without reasoning about your data, is the same reason the allocator post and the SPSC queue cared about the distribution and not just the average.

Key width is the knob

Radix's advantage is bought with passes, and wider keys cost more passes. A 64-bit key is eight byte-passes instead of four, so radix's pass count doubles, and because the wider keys also move twice the bytes per pass, its measured per-element cost nearly triples. The comparison sorts' cost is set by log n rather than by key width, so theirs barely moves. At four million random keys the picture shifts sharply: against std::sort, radix's lead falls from 5.2× at 32-bit to 1.8× at 64-bit, and pdqsort overtakes radix entirely, landing at ~26 ns/element to radix's ~35. Radix still beats std::sort on 64-bit keys, but it is no longer the obvious winner. The balance tips on the key width: the narrower your integer keys, the more decisively the counting sort wins.
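The pass arithmetic is simple enough to write down. The following is a rough sketch of the O(n · w/r) cost model with illustrative helper names; it counts passes and scatter traffic only, and is not a timing model.

```cpp
// Number of counting passes for a key of key_bits bits processed
// bits_per_pass bits at a time: ceil(w / r).
constexpr unsigned radix_passes(unsigned key_bits, unsigned bits_per_pass) {
    return (key_bits + bits_per_pass - 1) / bits_per_pass;
}

// Bytes written by the scatter step per element over the whole sort:
// every pass rewrites the full key once.
constexpr unsigned scatter_bytes_per_element(unsigned key_bits, unsigned bits_per_pass) {
    return radix_passes(key_bits, bits_per_pass) * (key_bits / 8);
}

static_assert(radix_passes(32, 8) == 4);               // four byte-passes
static_assert(radix_passes(64, 8) == 8);               // eight byte-passes
static_assert(scatter_bytes_per_element(32, 8) == 16);
static_assert(scatter_bytes_per_element(64, 8) == 64);  // 4x the 32-bit traffic
```

By this crude count a 64-bit sort writes four times the bytes of a 32-bit one, whereas the measured cost in the text is closer to three times. The gap is a reminder that the count ignores the histogram passes and everything the memory system does to hide latency.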

What this doesn't show

This is a deliberately narrow result, and the boundaries matter more than the headline.

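Fixed-width unsigned integer keys only. Radix needs to decompose the key into fixed-width digits whose byte order matches the sort order. Signed integers and IEEE 754 floats can be remapped to such keys with a bit transform, but nothing here measures that. For strings, variable-length records, or anything you can only order by comparison, comparison sorts are the only general option, and std::sort is exactly the right default there.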
Key width is decisive. As the 64-bit numbers show, the advantage is not a constant. Wide keys erode it; at some width the comparison sorts win outright.
Bare keys, no payload. These are arrays of keys with nothing attached. Sorting key-plus-payload records changes the arithmetic — the scatter moves more bytes, and stability starts to matter (LSD radix is stable; std::sort and pdqsort are not). That's a separate measurement.
Single-threaded, one machine. No parallel sorts, no std::execution::par. All numbers from the single documented reference machine; see methodology.

The takeaway isn't "use radix." It's that std::sort's universality has a cost, and when your keys happen to be narrow fixed-width integers — which, on a hot path moving IDs, timestamps, or prices, they often are — you can trade that universality for a sort whose cost is linear and whose tail is something you can actually reason about. The default is a fine default. It just isn't free, and on this one shape of problem you can do better. It pairs with demo 1 from the other side: demo 1 measured the payoff of sorting — a branch-predictable downstream loop — and this post measures its price. (Demo 1's one-second sort is on its small-range keys — std::sort's easy case; the ~5× headline here is against random keys, the hard one. Same operation, opposite end of the input-shape sensitivity this post is about.)

Reproducing this

The benchmark harness, the radix implementation, and the capture script are in the repository under bench/demos/08-sorting-shootout/. pdqsort is Orson Peters' pattern-defeating quicksort, vendored into that directory as a single header under the zlib licence. The one subtlety worth flagging for anyone reproducing it: sorting mutates its input, so the harness restores a pristine copy from a master buffer outside the timed region on every iteration. Skip that and every iteration after the first measures the cost of sorting already-sorted data — which silently flatters the comparison sorts (radix is unaffected) and produces a plausible, wrong result. All numbers here are median per-element times over five repetitions; see methodology for the machine spec and statistical conventions.
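The restore-outside-the-timed-region rule can be sketched as follows. This is not the repository's harness: `median_ns_per_element` is a hypothetical name, and the timing uses plain `std::chrono::steady_clock` rather than whatever the real capture script does.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Runs sort_fn on a fresh copy of `master` `reps` times and returns the
// median per-element time in nanoseconds. With an odd `reps` this is the
// middle sample; with an even one it is the upper of the two middle samples.
template <typename SortFn>
double median_ns_per_element(const std::vector<std::uint32_t>& master,
                             SortFn sort_fn, int reps = 5) {
    assert(!master.empty() && reps > 0);
    std::vector<std::uint32_t> work(master.size());
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(reps));

    for (int r = 0; r < reps; ++r) {
        // Restore the pristine input OUTSIDE the timed region, so every
        // repetition sorts the same unsorted data.
        std::copy(master.begin(), master.end(), work.begin());

        const auto t0 = std::chrono::steady_clock::now();
        sort_fn(work);
        const auto t1 = std::chrono::steady_clock::now();

        const std::chrono::duration<double, std::nano> elapsed = t1 - t0;
        samples.push_back(elapsed.count() / static_cast<double>(master.size()));
    }

    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

It would be called as `median_ns_per_element(master, [](std::vector<std::uint32_t>& v) { std::sort(v.begin(), v.end()); })`. Moving the `std::copy` above the loop reproduces the bug described above: every repetition after the first would time an already-sorted array.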

AMD Ryzen 7 3800X, Zen 2 (SMT off), 3.9 GHz base, governor = performance, turbo disabled (BIOS Core Performance Boost off), cores 1–7 isolated (cpu0 cannot be kernel-isolated and carries housekeeping), single thread pinned to core 4 (CCX1), headless Ubuntu 24.04. GCC 13.3, -O3 -march=native -std=c++20. 5 outer repetitions per cell, median ns_per_op reported (working-set-sweep convention).

Source: bench/demos/08-sorting-shootout/ · JSON.

