Benchmark summary

Not all benchmarks are included on this page. Header links go to pages with full benchmarks and sometimes extra explanation.

The benchmarks found in bencharray are taken on a Skylake i5-6200U processor using CBQN compiled with make o3n-singeli. It's not a powerful CPU, but it does support the x86 extensions, such as AVX2, used by the Singeli build.

Arithmetic

Implementation notes

The arithmetic functions +-×⌈⌊, and comparisons, are standard SIMD functionality. For +-×, an overflow check is needed, and if overflow occurs the result must be recomputed in a larger type.
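The overflow handling can be pictured with a small sketch (Python for clarity; CBQN's real code is vectorized C, and add_int8 is a name invented here):

```python
# Sketch of overflow-checked addition on a small integer type.
# Hypothetical helper; CBQN checks overflow with SIMD instructions.
INT8_MIN, INT8_MAX = -128, 127

def add_int8(xs, ys):
    """Add elementwise; fall back to a wider type on overflow."""
    out = []
    for x, y in zip(xs, ys):
        r = x + y
        if r < INT8_MIN or r > INT8_MAX:
            # Overflow: redo the whole operation in a larger type
            # (Python ints stand in for int16/int32 here).
            return [a + b for a, b in zip(xs, ys)], "int16"
        out.append(r)
    return out, "int8"
```

In-range inputs stay in the small type; a single overflowing pair forces the entire result into the wider one, which is why the check is worth vectorizing.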

Most other primitives, including ÷√⋆ and ⋆⁼, require conversion to floats, and so ideally run at the same speed for all types. ÷ and monadic √ have native SIMD support. Libraries exist to compute the others with SIMD, but CBQN doesn't use anything like this yet. Dyadic | is an odd case: it has faster implementations on integers, but they tend to be complicated.
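One example of such an integer special case (a sketch only, not a description of CBQN's actual rule set): when the divisor is a power of two, floored modulus reduces to a bitwise mask.

```python
def mod_pow2(w, x):
    """Compute w|x (floored modulus, divisor w on the left as in BQN)
    when w is a power of two, using a mask instead of division.
    Works for negative x too, since two's-complement AND matches
    floored modulus for positive power-of-two divisors."""
    assert w > 0 and w & (w - 1) == 0
    return x & (w - 1)
```

Python's % is also floored, so the mask agrees with it for these divisors; the complication in real code comes from handling general divisors and signed ranges.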

Fold and Scan

Folds and scans are defined as a sequential evaluation, but for operand functions that are associative and commutative, evaluation can be reordered to give efficient SIMD implementations. These include +-×⌈⌊ on integers, ⌈⌊ on floats, and many boolean functions. Because floating-point + (along with - and ×) isn't exactly associative, we may add a vectorized •math.Sum function for when speed is important. CBQN still has a few gaps in coverage, especially in scan. Uncovered cases run at about 5ns/value because that's the overhead of calling a primitive function.
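The reordering can be illustrated with a blocked sum (an invented Python sketch; CBQN's fold keeps one accumulator per SIMD lane rather than a Python list):

```python
def blocked_sum(xs, lanes=4):
    """Sum with several independent accumulators, the way a SIMD
    fold keeps one partial sum per vector lane. This reordering is
    only exact when the operand function is associative."""
    accs = [0.0] * lanes
    for i, x in enumerate(xs):
        accs[i % lanes] += x
    return sum(accs)
```

For integers the result is identical to a sequential fold; for floats the different association can round differently, which is why a separate vectorized sum would be an opt-in function rather than a change to +´.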

Table

If it had no overhead, Table (⌜) would have the same per-output timing as arithmetic, or sometimes better, since overflow checking can be done up front using the ranges of both arguments. This is attainable for large right arguments, but CBQN currently does a lot of copying that keeps it from reaching that performance. For smaller right arguments the limiting operation is a constant-factor Replicate (/) applied to the right argument, or sections of it.
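The range-based overflow check can be sketched as follows (table_add is a hypothetical name; the point is that one check on the argument ranges replaces per-element checks):

```python
def table_add(xs, ys):
    """Addition table xs +⌜ ys. The extremes of x + y over all pairs
    are min(xs)+min(ys) and max(xs)+max(ys), so a single range check
    up front decides the result type with no per-element checking."""
    lo = min(xs) + min(ys)
    hi = max(xs) + max(ys)
    rtype = "int8" if -128 <= lo and hi <= 127 else "int16"
    return [[x + y for y in ys] for x in xs], rtype
```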

Select

CBQN's code for selection is mostly straightforward at this point. Selection from a boolean array is usually slower because it requires reading the right byte from memory and then picking out the right bit within it, but for a larger right argument the packed form takes up less cache space and can be faster. For a small right argument but a large left argument, the right argument is temporarily converted to 1-byte values so that selection runs at the same speed as 1-byte selection.
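The byte-then-bit step looks roughly like this (a sketch assuming least-significant-bit-first packing; the function name is invented):

```python
def select_from_bits(packed, indices):
    """Select bits from a bit-packed boolean array: index i lives in
    byte i >> 3, at bit position i & 7 (LSB-first order assumed)."""
    return [(packed[i >> 3] >> (i & 7)) & 1 for i in indices]
```

The extra shift and mask per element is the cost; the 8x smaller footprint is the benefit once the right argument no longer fits in cache as bytes.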


Reshape

The speed of Reshape with a large result should essentially be limited by the amount of memory written. The current method is to copy the argument to the result, then repeatedly double it until it reaches a suitable block size for filling the rest of the result. There are special cases for 1-byte and 8-byte blocks; strangely, these measure slower than the general case, producing the bumps at the left of the first graph. They've been left in on the grounds that the slowdown may be specific to the benchmarked CPU, and if not, later compiler and libc improvements could fix these cases (there's nothing obviously wrong with the emitted code).
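The doubling strategy can be shown in a few lines (a Python sketch; real code copies raw bytes in blocks rather than list elements):

```python
def reshape_fill(arg, n):
    """Build a length-n result by copying arg, then doubling the
    filled prefix until the remainder can be filled in large blocks."""
    out = list(arg)
    while len(out) * 2 <= n:
        out.extend(out)               # double the filled prefix
    out.extend(out[: n - len(out)])   # final partial block
    return out
```

Each pass writes as much as all previous passes combined, so the copy count is logarithmic in the result size and throughput approaches the memory-write limit.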

Replicate

Implementation notes

Replicate is a complicated family of functions, and CBQN does a lot of work to get consistent worst-case performance and take advantage of special cases: sparse input, and booleans with 1s and 0s mostly grouped together. Except for the grouped case, each algorithm takes some amount of time to process each input value and some to produce each output value, and CBQN chooses between them based on the density. Note that the graphs for a boolean argument use ns/input and those for the general case use ns/output! In the flat sections for the boolean case, the performance doesn't depend on density at all, usually because the same writes are performed regardless, just overlapping more when the density is lower.
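The density-based choice can be sketched with two toy algorithms (invented names; the threshold and the real branchless kernels are more involved in CBQN):

```python
def replicate_sparse(bits, xs):
    """Good at low density: time is proportional to outputs,
    but each 1 costs a branch."""
    return [x for b, x in zip(bits, xs) if b]

def replicate_branchless(bits, xs):
    """Good at high density: write every value unconditionally and
    advance the output position by the bit, so time is per input
    and there are no data-dependent branches."""
    out = [None] * len(xs)
    j = 0
    for b, x in zip(bits, xs):
        out[j] = x   # overwritten on the next step when b == 0
        j += b
    return out[:j]

def replicate(bits, xs, threshold=0.5):
    """Pick an algorithm from the density of 1s (threshold is a
    placeholder value)."""
    density = sum(bits) / len(bits) if bits else 0
    algo = replicate_sparse if density < threshold else replicate_branchless
    return algo(bits, xs)
```

The branchless version also shows why ns/input is the natural unit for the boolean graphs: its work depends only on the input length, with denser inputs simply overlapping fewer of the writes.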


Group

The benchmarks here are not very complete. The first shows how performance scales with the number of ¯1s in the argument; when there are many, Replicate is used to filter them out and avoid excessive branching. The second concerns a sorted 𝕨, so that the groups are slices of 𝕩.
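The sorted case can be sketched as follows (a simplified model of ⊔ with an invented name, assuming ¯1s have already been filtered out):

```python
def group_sorted(w, x):
    """Group x by sorted keys w: equal keys are adjacent, so every
    group is one contiguous slice of x rather than a gather."""
    groups = [[] for _ in range((max(w) + 1) if w else 0)]
    start = 0
    for i in range(1, len(w) + 1):
        if i == len(w) or w[i] != w[start]:
            groups[w[start]] = x[start:i]   # one contiguous slice
            start = i
    return groups
```

Slicing is just a bounded copy, which is why the sorted case can be so much faster than general grouping.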

Searching and sorting

Searching and sorting are the heaviest of BQN's commonly used functions, performance-wise. Element size is very important, as doubling it often slows things down by more than a factor of two.

For 1- and 2-byte elements, lookup tables usually fit entirely in L1 cache (2-byte tables with larger entries can be iffy). Lookup tables are so fast that it doesn't make sense to try for adaptivity on these arguments.
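For example, membership for 1-byte values needs only a 256-entry table, well under L1 size (a sketch; the real table is built over the searched-in argument with SIMD-friendly code):

```python
def member_of_u8(w, x):
    """w∊x for 1-byte values: mark each value of the searched-in
    argument x in a 256-entry table, then do one lookup per
    element of w."""
    table = bytearray(256)
    for v in x:
        table[v & 255] = 1
    return [table[v & 255] for v in w]
```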

For larger elements, comparison sorting and hash tables are usually needed, and adaptive sorting can be used. CBQN doesn't have particularly fast implementations of these yet. Some cases of 4-byte elements are better handled by radix partitioning plus lookup tables.

Sort and Grade

Implementation notes

Currently CBQN has solid counting and radix sort implementations used for 1- to 4-byte elements, and uses the generic Timsort for 8-byte elements. 4-byte sorting is competitive with the state of the art for random elements but isn't adaptive at all.
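Counting sort for the 1-byte case is short enough to show in full (a sketch for unsigned bytes; handling signed int8 would just offset the index):

```python
def counting_sort_u8(xs):
    """Counting sort for 1-byte values: one counting pass, then the
    result is emitted directly from the count array, so runtime is
    independent of the input's order (hence no adaptivity)."""
    counts = [0] * 256
    for v in xs:
        counts[v & 255] += 1
    out = []
    for v in range(256):
        out.extend([v] * counts[v])
    return out
```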

Search functions

Implementation notes

The x-axis format here is used to display a small searched-in array (𝕨 for ⊐ and ⊒, and 𝕩 for ∊) on the left and a small searched-for array (the other argument) on the right. "Half hits" means that half the values in the searched-for argument are found at some index; this is tough on branchy implementations, though depending on the application, all hits could be more relevant.

There is special code for 1- and 2-byte integers, and for small searched-in arguments. SIMD searches would still be faster in some of these cases. Larger types are handled with generic hash table code.
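The hash-table path for larger types amounts to the following (a sketch using Python's dict in place of a specialized open-addressing table):

```python
def index_of(w, x):
    """w⊐x: map each element of the searched-in argument w to its
    first index, then look up each element of x. Misses return
    len(w), matching ⊐'s not-found result."""
    table = {}
    for i, v in enumerate(w):
        table.setdefault(v, i)   # keep the first occurrence only
    return [table.get(v, len(w)) for v in x]
```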

Self-search functions

Implementation notes

CBQN has fairly good code for all CPU-native types (and boolean, not shown). The function Classify (⊐) is troublesome because of the way indices depend on the number of unique elements seen so far, a problem that dyadic ⊐ (or ⊐˜) doesn't have. ⍷ is just implemented as ∊⊸/ for lengths above 1, which is hardly slower than ∊.

Cases with a small argument (bound of 8 to 20 depending on type and function) are handled with all-pairs comparisons that use the comparison branchlessly, although a SIMD implementation could be faster. 1-byte and 2-byte cases use a lookup table otherwise, with sparse initialization for the 2-byte table at smaller sizes giving the corner at 1000 or a little higher. Both ⊐ and ⊒ are slowed down at large sizes because they need a 4-byte instead of 1-byte table, and there are some bumps in ⍷ because it has to branch on table lookups.
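The all-pairs approach for small arguments can be sketched for Mark Firsts (an invented helper; the real code does the comparisons with SIMD and combines the results without branching):

```python
def mark_firsts_small(xs):
    """∊𝕩 (Mark Firsts) by comparing each element against all
    earlier ones: element i is a first iff it differs from every
    xs[j] with j < i. Quadratic, so only sensible for tiny inputs."""
    return [int(all(xs[j] != xs[i] for j in range(i)))
            for i in range(len(xs))]
```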

4- and 8-byte cells have specialized hash tables. 4-byte cases other than ⊐ also have a fallback using two rounds of radix moves plus a table on the remaining two bytes, keeping the cost bounded at sizes above 1e5. It's slower than sorting because the radix moves need to be undone at the end, so a self-search function takes a total of 4 moves plus a table lookup, while sorting only needs 4 moves.


Implementation notes