(github) / BQN / implementation / primitive

Implementation of ordering functions

The ordering functions are Sort (∧∨), Grade (⍋⍒), and Bins (⍋⍒). Although these are well-studied—particularly sorting, and then binary search or "predecessor search"—there are many recent developments, as well as techniques that I have not found in the literature. The three functions are closely related but have important differences in what algorithms are viable. Sorting is a remarkably deep problem with different algorithms able to do a wide range of amazing things, and sophisticated ways to combine those. It is by no means solved. In comparison, Bins is pretty tame.

There's a large divide between ordering compound data and simple data. For compound data comparisons are expensive, and the best algorithm will generally be the one that uses the fewest comparisons. For simple data they fall somewhere between cheap and extremely cheap, and fancy branchless and vectorized algorithms are the best.

On quicksort versus merge sort

Merge sort is a nicer algorithm: it's deterministic and has optimal worst-case performance. But partitioning is a nicer operation: more parallel, has a stable out-of-place form but is much easier to do in-place. And partitioning can reduce the range of the data enough to use an extremely quick counting sort or other methods. Outside of sorting, partitioning is also a natural fit for binary search, where it's mandatory for sensible cache behavior with large enough arguments.

The fastest methods end up using quicksort for the generic case and merge sort for special cases. One such case being smaller arrays where choosing a pivot is too hard; for these merge competes with other options like insertion sort and sorting networks.

It's the patterns that get you. An adaptive merge sort is good at handling sequences of runs, while a quicksort can handle inputs with few unique values. These can be tested for, but they're also oversimplified. Adaptive merge sort is good if any of its merges are easy, which can happen if the array is scrambled at the small scale and ordered at a larger scale. Quicksort is harder to pin down, as various base cases might exploit many kinds of structure. Ultimately, I don't think you can know what the cost of one method or the other will be exactly, so there will always be cases where you choose the wrong one.

The same applies to the general categories of partitioning sorts (quicksort, MSD radix sort, samplesort) and merging sorts (mergesort, multi-way merge). Of course these aren't the only options. LSD radix sort is one weird outlier that goes fast but doesn't like patterns and doesn't play well with others.

Natural runs

Timsort popularized the idea of checking for "natural runs"—ascending or descending sequences—in the argument to speed up merge sorting, taking care to avoid the O(n²) pattern of repeatedly merging small runs into a large one. Powersort (see also Multiway Powersort) improves on its somewhat ad-hoc strategy for picking the merging order. And Glidesort introduces a mechanism to hybridize from the bottom levels, deferring sorting for non-run segments by virtually merging them instead, to later be sorted all at once with quicksort (Driftsort is a simplification, and its source code makes for easier reading). Vergesort is an earlier effort at quicksort-mergesort hybridization, but is less interesting now that Glidesort looks like a more effective method.

Hybridizing quicksort and merge sort will never go perfectly! First, merge sort is good for more than just natural runs, as discussed above. But second, some cases that do have long natural runs are better handled by quicksort, in particular low-cardinality cases. Glidesort goes a little overboard in claiming that it gets the full benefit of quicksort on low-cardinality data for this reason. It's true that it's O(n log k) average case for a length n array with k unique elements. But the use of average case rather than worst case is traditionally because quicksort relies on good pivot selection, which is expensive. If you swap pdqsort's pivot selection for a magic constant-time median, it'll be O(n log k) worst case. Glidesort will remain O(n log n), because low-cardinality cases with long runs go to merge sort. This shows up in Voultapher's testing for random_p ("different percentages of random values"), where Glidesort's performance falls off as more of the input is taken up by a single value even though quicksort has no problem with that situation. I'll note that a single run of low cardinality is easy to detect, because it's sorted so all the equal elements are next to each other. Harder to find would be if there are many short-ish runs that cover the same set of elements, and so collectively have low cardinality.

Like Timsort, Glidesort should be seen more as correctly identifying an opportunity for algorithms to improve rather than pinpointing the best way to do so. Besides the fundamental problem of choosing a strategy with imperfect information, it has trouble with smaller sorted segments in a field of randomness. If one such segment is detected, it'll be merged with a random segment of about the same size after sorting that segment with quicksort, and then one twice that size, and so on. The amount of work saved on the initial run is pretty small, and can be overwhelmed by the inefficiency of merging, and of sorting the random part in segments rather than all together. The current Glidesort mitigation is to set the minimum run length to the square root of the array length. My also-clumsy strategy, described in more detail here, has been to always search until finding a run (or running out of elements) in order to eventually split the input into alternating runs and non-runs. Then if a run would be merged with a much larger non-sorted segment it's treated as non-sorted instead of merging.

On binary search

Binary searches are very easy to get wrong. Do not write (hi+lo)/2: it's not safe from overflows. I always follow the pattern given in the first code block here. This code will never access the value *base, so it should be considered a search on the n-1 values beginning at base+1 (the perfect case is when the number of values is one less than a power of two, which is in fact how it has to go). It's branchless and always takes the same number of iterations. To get a version that stops when the answer is known, subtract n%2 from n in the case that *mid < x.

Compound data

Array comparisons are expensive. The goal here is almost entirely to minimize the number of comparisons. Which is a much less complex goal than to get the most out of modern hardware, so the algorithms here are simpler.

For Sort, I think Timsort with Powersort-based merging is a solid choice. Hardly different from optimal comparison numbers on random data, and it's good at natural runs. Orson Peters points out as part of his work on Glidesort that quicksort is better at low-cardinality inputs, and that these do show up in real datasets. So Driftsort would be my other out-of-the box recommendation, as it throws out some of Glidesort's complications that are irrelevant to compound data. It could still benefit from Timsort's adaptive merging, I believe. And when I looked at its quicksort pivot selection it seemed pretty weak; I'd replace it with a true median from recursively sorting a sample.

Grade should use the same method as Sort, and can be done either by selecting from the original array to order indices or by moving the data around in the same order as the indices. I think the second of these ends up being substantially better for small-ish elements.

For Bins, use a branching binary search: see On binary search above. But there are also interesting (although, I expect, rare) cases where only one argument is compound. Elements of this argument should be reduced to fit the type of the other argument, then compared to multiple elements. For the right argument, this just means reducing before doing whatever binary search is appropriate to the left argument. If the left argument is compound, its elements should be used as partitions. Then switch back to binary search only when the partitions get very small—probably one element.

Simple data

The name of the game here is "branchless".

For Sort:

For small sizes, use insertion sort, or pick a few fixed sizes to implement faster, and finish off with insertion sort for sizes not in that set.
Then counting sort for 1-byte types (and obviously for 1-bit regardless of length).
Branchless quicksorts are the solid choice for larger types, particularly since they can track ranges and call counting and other distribution sorts when appropriate.
Scanning for runs in order to use a mergesort is probably a good idea though.
But for 2- and 4-byte data, radix sort can be a lot faster? For 2-byte sort, I think it's a better bridge than fluxsort between insertion and counting sort (but scan for sortedness first); for 4-byte, hard to say.

Grade is basically the same (now that fluxsort gives us a good stable quicksort), except moves get more expensive relative to comparisons. Counting sort needs to be changed to the much slower bucket sort.

A branchless binary search is adequate for Bins but in many cases—very small or large 𝕨, and small range—there are better methods.

Sorting small arrays

For small sizes, there's a significant tradeoff between code size and speed. Code that intermittently sorts small arrays can be slowed down significantly if it has to bring a lot of code into the instruction cache to do this. But sorting a large array performs frequent small-array sorts, so it doesn't really have this tradeoff. It may make sense to use different methods for these two cases.

Insertion sort is the method with the smallest code size while maintaining decent performance. On the other end there are fast ways to sort particular sizes, but that's a lot of code if say, sizes under 32 need to be covered. A practical mix is to handle some fixed sizes with sorting networks (see ipnsort) or merging (quadsort), and finish with an insertion sort. Sorting networks compare and swap elements in a fixed pattern, which allows for a branchless implementation that uses instruction-level parallelism well, and also makes them suitable for SIMD implementation.

Distribution sorts

Both counting and bucket sort are small-range algorithms that begin by counting the number of each possible value. Bucket sort, as used here, means that the counts are then used to place values in the appropriate position in the result in another pass. Counting sort does not read from the initial values again and instead reconstructs them from the counts. It might be written (//⁼)⌾(-⟜min) in BQN, relying on the extension of /⁼ to unsorted arguments. Bucket sort can be used for Grade or sort-by (⍋⊸⊏), but counting sort only works for sorting itself. It's not-even-unstable: there's no connection between result values and the input values except that they are constructed to be equal. But with fast Indices, counting sort is vastly more powerful, and is effective with a range four to eight times the argument length.

I developed Robin Hood Sort as an algorithm with similar properties to bucket sort that relies on uniformly-distributed data rather than a small range. It uses a buffer a few times larger than the input array, and inserts values in a manner similar to a hash table with linear probing, shifting large clumps out if they appear—they're merge-sorted back in at the end. A random selection of √n samples is enough to make a good decision about whether to use it (see candidate selection), which makes it a good fit as a base case for quicksorts. Of course this can only be probabilistic, so it's still important that RHsort has decent worst-case performance. When quadsort is used for merging, the worst case appears to be about half as fast as fluxsort, very solid.

All these methods do writes with indices based on the input data—to the count array for counting and bucket sort, and to the result array for bucket sort and result buffer for RH—so if the argument is large enough and not favorable they'll miss caches and get pretty slow. Also some just use a lot of memory, which may not be desirable. For a general-purpose sort they should probably only be used when quicksort or radix partitioning has reduced the size enough to eliminate potential issues. Although perhaps some test checking how often nearby elements indicate the same cache line could also be used to green-light these methods (similar issues appear for Select).

Radix sort

LSD radix sort is really fast, like three times faster than fluxsort on random 4-byte data. The idea is: bucket sort according to the last byte, then the second-to-last, on up to the first byte. Array is now sorted, after most likely having been scrambled substantially (but stably!). It's tricky to implement right though. The sort_inline functions from ska_sort_copy here are good. They count buckets for every step in one pass, and move back and forth from the array to a buffer instead of adding more memory. Radix sort uses memory proportional to the input array length, plus a constant. But that constant is a liability on short arrays, so it's only really useful for sizes above a hundred or so (to get down to this limit, use 1-byte counts and sum with SIMD or at least SWAR).

LSD radix sort suffers from problems of cache associativity. Now, usually (for, say, non-blocked transpose) such problems strike only at power of 2 lengths. But by picking out input bytes, radix sort tends to create its own powers of 2. Consider an input consisting of ascending natural numbers ↕n. Lowest byte is fine: the lengths are around n÷256. Next byte up, problems: this byte only changes once every 256 inputs, so every bucket but one has a multiple of 256 length! And writes will cycle around these buckets, so they stay roughly in sync. This is enough to overwhelm any set-associative cache. I measured a degradation of about 5 times on that pass and 3 times overall. The case with bucket lengths near multiples of 256—they need to be separated by an entire cache line not to conflict—is detectable after the cheap counting pass, but it's not the only way this pattern can arise. For example, put a bunch of zeros at the beginning of the array. The first bucket now has some arbitrary length, but once the zeros are processed the gap between it and the next is back to being a multiple of 256. The good news is that it still requires a lot of space to start kicking out a bunch of cache lines: below 10,000 4-byte elements I could never measure significant degradation. So if the lack of adaptivity (and O(n) memory of course) doesn't bother you, radix sort is kind of the best thing going for 4-byte values in the 500 to 20,000 range.

Quicksort

Fluxsort attains high performance with a branchless stable partition that places one half on top of existing data and the other half somewhere else. One half ends up in the appropriate place in the sorted array. The other is in swap memory, and will be shifted back by subsequent partitions and base-case sorting. Aside from the partitioning strategy, fluxsort makes a number of other decisions differently from pdqsort, including a fairly complicated merge sort (Quadsort) as the base case.

Crumsort introduces the unstable fulcrum partition, which is built on the same branchless partitioning step but is faster for partitioning an array too large for cache where throughput is important (although… isn't this also the use case for a samplesort?). It works from the outside in: first move a few values—16 from each side works well—to temporary space to create a gap between source and destination. The total gap remains the same so one side will always have a gap of 16 values or more, which means it's safe to do 16 partition steps from the other side. The final result will be striped, interleaving sections taken from either side of the input array.

This paper gives a good description of pdqsort, and the Rust version has some improvements. Most aspects have made it into fluxsort, with others like pivot shuffling being superceded by just choosing more candidates. The optimistic insertion sort hasn't, and the in-place partition may still be faster if there are few swaps. Because that partitioning method is easily invertible if comparison results are saved (it's a self-inverse) it's also useful for the partitioning approach to binary search mentioned later.

It's a samplesort rather than a quicksort, but I'll throw in the rather complicated IPS⁴o here. Unstable, but there's also a stable not-in-place version PS⁴o. For very large arrays it probably has the best memory access patterns, so a few samplesort passes could be useful.

In-place partitioning

In-place quicksort relies on a partitioning algorithm that exchanges elements in order to split them into two contiguous groups. The Hoare partition scheme does this, and BlockQuicksort showed that it can be performed quickly with branchless index generation; this method was then adopted by pdqsort. But various bit booleans to indices methods are faster, and they fit well with vectorized comparisons.

It's simplest to define an operation P that partitions a list 𝕩 according to a boolean list 𝕨. Partitioning permutes 𝕩 so that all elements corresponding to 0 in 𝕨 come before those corresponding to 1. The quicksort partition step, with pivot t, is (t≤𝕩)P𝕩, and the comparison can be vectorized. Interleaving comparison and partitioning in chunks would save memory (a fraction of the size of 𝕩, which should have 32- or 64-bit elements because plain counting sort is best for smaller ones) but hardly speeds things up: only a few percent, and only for huge lists with hundreds of millions of elements. The single-step P is also good for Bins, where the boolean 𝕨 will have to be saved.

For binary search 𝕨⍋𝕩, partitioning allows one pivot element t from 𝕨 to be compared to all of 𝕩 at once, instead of the normal strategy of working with one element from 𝕩 at a time. 𝕩 is partitioned according to t≤𝕩, then result values are found by searching the first half of 𝕨 for the smaller elements and the second half for the larger ones, and then they are put back in the correct positions by reversing the partitioning. Because Hoare partitioning works by swapping independent pairs of elements, P is a self inverse, identical to P⁼. So the last step is simple, provided the partitioning information t≤𝕩 is saved.

SIMD sorting

Two high-profile quicksort-based algorithms capable of using AVX-512 are Google's VQSort and Intel's x86-simd-sort. The authors of both prioritize flashy "x times faster" numbers in their messaging over realistic expectations. If a good AVX-512 compress is available then using it for quicksort is really very fast; while these codebases do have implementations for other vector architectures, they're not nearly as impressive. And they use good methods for random data, but not much effort has been put into adaptivity. Voultapher's analysis highlights these issues. Unfortunately it does include some stuff like VQSort's slow random initialization that has since been fixed (but what do you expect when you go to the press before fixing basic issues with your code?). Best would be if the fast SIMD small-array sorting and partitioning could be combined with newer methods of hybrid sorting; for now I would feel good about using either SIMD quicksort on large arrays with a new-ish x86 processor supporting at least AVX2, but don't know enough about other use cases to say anything.

A few people have done some work on merge sorting with AVX2 or AVX-512: two examples. Pretty complicated, and still mostly in the proof of concept stage, but the benchmarks on uniform random arrays are good. Can these be made adaptive? ChipSort seems further along. It uses sorting networks, comb sort, and merging, which all fit nicely with SIMD and should work well together.

Scans and heuristics

Because quicksort does its work before recursing, it's well suited to statistical techniques that allow the algorithm to be changed.

Initial pass

An initial pass for pdqsort (or another in-place quicksort) provides a few advantages:

Recognize sorted and reverse-sorted arrays as fast as possible
Always use unguarded insertion sort
Find and maintain range information to switch to counting sort

The main purpose of the pass is to find the range of the array. For an insignificant additional cost, the smallest and largest elements can be swapped to the edges of the array and sorted arrays can be detected.

To find the smallest and largest elements, compute the range in blocks, on the order of 1KB each. Record the maximum and minimum, as well as the index of the block that contained these. After each block, update the range as well as the indices. Then, after traversing all blocks, search the appropriate blocks for these values to find exact indices. It may also be possible to skip the search and jump straight to counting sort.

Finding an initial run is fast as well. Compare the first two elements to determine direction, then search for the first pair that have the opposite direction (this can be vectorized because overreading is fine). This run can be used as the first range block, because the maximum and minimum are the two elements at the ends of the run.

At the start of sorting, swap the smallest element to the beginning and the largest to the end, and shrink the size of the array by one in each direction. Now the element before the array is a lower bound and the one after is an upper bound. This property can also be maintained as the array is partitioned, by placing a pivot element between the two halves (swap it to one side of the array before partitioning and to the middle afterwards). As a result, it's always safe to use unguarded insertion sort, and an upper bound for the range of the array can always be found using the difference between the elements before and after it. Now finding the range is fast enough to check for counting sort at every recursion.

This is a very simple initial pass; a more sophisticated one might be beneficial. If the array starts with a large run then there could be more of them.

Candidate selection

Traditionally, performance-oriented quicksort algorithms have chosen pivots using a small number of candidates, such as a median of 3 or pseudomedian of 9. Scandum's work with fluxsort showed that this is leaving performance on the table, because it results in less even partitions. My own analysis found an optimal pivot number approximately equal to √n÷1+4⋆⁼n, which also seems to do well empirically. Both costs and benefits here are proportional to √n before an O(n) partition, so there's significant room for variation in this number. The cost of increasing the number of pivots further can also be decreased by maintaining candidates as the array is partitioned.

Candidates should be chosen in a pseudo-random or at least not significantly patterned distribution to avoid getting a useless biased median. Randomness is also important for heuristics that select between different algorithms. These will be faster for sorting some patterns or distributions and slower for others. In order to avoid slow cases, the requirement should be that any particular slow case has a low probability of passing the heuristic. This approach doesn't require a guess at what properties the input is likely to have, while an attempt to minimize expected time or some other average metric would.

Robin Hood sorting has a sharp threshold at about √n where it's feasible to distinguish good and bad cases, that is, reject arrays with many collisions and accept uniformly random ones. This requires the candidate sampling method to be able to pick candidates close to each other, or it'll fail to notice things like an input that consists of many short blocks of similar values.

Inspecting the pivot candidates before or while sorting could also be used to choose a merge-based recursive step over a partition-based one. That is, rather than partitioning then recursing on each half, the algorithm would recurse on each half and then merge them together. If the halves don't overlap by much, the merge can skip a lot of elements with some binary searches and potentially be much than partitioning, which at a minimum needs to compare every element to the pivot. This allows the overall algorithm to exploit large-scale structure for merging even if there's a lot of noise at the small scale that would make merge sorting a bad choice overall. Unlike partitioning, it doesn't reduce the range of the halves.

Binary search

Reminder that we're talking about simple, not compound data. The most important thing is just to have a good branchless binary search (see above), but there are other possible optimizations.

If 𝕨 is extremely small, use a vector binary search as described in "Sub-nanosecond Searches" (video, slides). For 1-byte elements there's also a vectorized method that works whenever 𝕨 has no duplicates: create two lookup tables that go from multiples of 8 (5-bit values, after shifting) to bytes. One is a bitmask of 𝕨, so that a lookup gives 8 bits indicating which possible choices of the remaining 3 bits are in 𝕨. The other gives the number of values in 𝕨 less than the multiple of 8. To find the result of Bins, look up these two bytes. Mask off the bitmask to include only bits for values less than the target, and sum it (each of these steps can be done with another lookup, or other methods depending on instruction set). The result is the sum of these two counts.

It's cheap and sometimes worthwhile to trim 𝕨 down to the range of 𝕩. After finding the range of 𝕩, binary cut 𝕨 to a smaller list that contains the range. Stop when the middle element fits inside the range, and search each half of 𝕨 for the appropriate endpoint of the range.

If 𝕩 is small-range, then a lookup table method is possible. Check the length of 𝕨 because if it's too large then this method is slower—binary search doesn't have to hit every element! The approach is simply to create a table of the number of elements in 𝕨 with each value, then take a prefix sum. In BQN, 𝕩⊏+`(1+⌈´𝕩)↑/⁼𝕨, assuming a minimum of 0.

Partitioning allows one pivot t from 𝕨 to be compared with all of 𝕩 at once. Although the comparison t≤𝕩 can be vectorized, the overhead of partitioning still makes this method a little slower per-comparison than sequential binary search when 𝕨 fits in L1 cache. For larger 𝕨 (and randomly positioned 𝕩) cache churn is a huge cost and partitioning can be many times faster. It should be performed recursively, switching to sequential binary search when 𝕨 is small enough. Unlike quicksort there is no difficulty in pivot selection: always take it from the middle of 𝕨 as in a normal binary search. However, there is a potential issue with memory. If 𝕩 is unbalanced with respect to 𝕨, then the larger part can be nearly the whole length of 𝕩 (if it's all of 𝕩 partitioning isn't actually needed and it doesn't need to be saved). This can require close to 2⋆⁼≠𝕨 saved partitions of length ≠𝕩, while the expected use would be a total length ≠𝕩.

Interpolation search?

Binary search is the optimal approach for truly unknown data. However, if the searched array has an approximately uniform distribution, giving an approximately linear relationship between index and value, interpolation search uses asymptotically fewer comparisons. Because of the high overhead of index computation (a division!!), it could only ever be useful for large 𝕨, and 𝕩 small enough that partitioning isn't viable.

Linear-ish arrays are probably common in practice. As an example, consider the indices for the start of each line in a file, which might be used to get a line number from a character index. If lines don't vary hugely in length and aren't correlated with location in the file, then the indices are sums of random numbers: actually smoother than a uniform distribution. What it would take to harm an interpolation search is a global skew towards one side or the other, and I'd guess that you either get a pretty sizeable skew, or almost none at all. But using only the array endpoints for the first interpolation is a bit ugly: one outlier can really mess things up.

Efficiently Searching In-Memory Sorted Arrays includes code and a paper for interpolation search and related algorithms. Their SIP (Slope reuse InterPolation) avoids division by reusing a single slope determined by 𝕨 only: this means there's one division for any number of searches. It looks like their methods get really valuable starting at around a million entries in 𝕨. But it still seems worse than partitioning at these sizes, indicating it should only be used for small 𝕩.

A major problem with interpolation is how badly it fails for non-uniform data. But the continuous-domain equivalent, bracketed root-finding, is better studied, with hybrid approaches developed to fix this. The recently published ITP method (2020) gets the advantages of interpolation with the same maximum number of comparisons as binary search, plus a configurable constant (even if set to 0, it can take advantage of the gap between the length and the next power of 2, but 1 is a better setting). And once the search range is small enough that memory access stops being very expensive, it should switch to binary search and avoid the division.